I am a social scientist, and I frequently have the impression that people in my field think of data visualization as the fun part of research, as some sort of gimmick with no added value. Okay, for a very long time I did not put much effort into visualization either. Making a scatter plot before running some statistical analysis was the standard procedure during my whole undergrad life, well, as far as I can remember. My perspective changed dramatically once I realised that visualization is the key to communicating central ideas and research results. I hope that I can convince you that acquiring visualization skills is worth the trouble, even though you’re already reading this blog, so I am probably trying to convince the wrong person 😊.
In the social sciences, we do not take much time or effort to teach our students how to visualise, at least not in my opinion. The same applies to tools that we do not use much in academia. How many people do you think are not able to do some basic calculations in Excel? Please, don’t feel offended if you are one of them. The point is, we should teach the skills that matter in the field, and handling data in different environments is clearly one of them. We will work on Excel skills later, but the lack of visualization (and other programming) skills is quite a shame. We pretend that we are all familiar with different kinds of visualizations and that our students are aware of them, including the potential pitfalls when creating them, even though we are no experts in the field. Well, I certainly did not know all the ins and outs when I started to spend more time on data visualization, and I am still learning day by day.
So, let’s stop complaining and start learning a bit about visualization, shall we? I started to collect examples to convince my students that visualization is not just a gimmick. This post summarizes some of them. I hope that these examples convince your students, supervisor, or boss, or at least entertain you for a couple of minutes.
Let’s start with the most basic purpose of visualisation: Make your story coherent and clear!
As I said, visualization is a powerful way to communicate central insights from data analysis. This point has real-life implications: people will not understand your research findings if your graph is hard to grasp. I’m exaggerating, but this point can be illustrated with the hockey stick graph from Mann, Bradley & Hughes (1999). It depicts the temperature anomaly as a line diagram, and somewhere in the late 20th century we see how the temperature rises.
Any ideas why this graph is famous and well known as the hockey stick graph? At the end of the observation period, the temperature curve bends upwards in the shape of a hockey stick, and this graph is one of the first visualizations to show how climate change evolves. Speaking of climate change, can you believe that a first science note about the consequences of climate change appeared as early as 1912, in a New Zealand newspaper (Rodney and Otamatea Times)? It’s hard to believe, isn’t it?
So, let’s pretend we don’t know anything about climate change, no hard facts at all. Would you believe that humankind will face serious problems because the temperature rises like a hockey stick at the end of the time period? Depending on their ecological knowledge and personal experience, some people may believe it and others may not. There is another way to visualize the same time trend. This example comes from the Climate Lab Book and I love it:
This is an awesome visualization. It gives us an intuition for how long the climate remained unchanged and how dramatic the change has become within only a few decades. Please, don’t get me wrong. The hockey stick graph was one of the first graphs to show climate change, and I would never say that it is not well suited for its purpose. That’s not the point. The question is: can we do better and make it so easy that everybody can grasp the deeper meaning? Do we have the data and the technical and visualization skills to convince our peers and the broader public?
Certainly, the latter graph looks awesome and fancy, but that’s not the main point. It’s all about whether you can get the main message across with the visualization at hand. In my opinion, the Climate Lab Book example is a strong case, even though there is nothing wrong with the hockey stick graph. The main question is whether your visualization makes your message easily accessible and carries it, hopefully, to a broad audience. And of course, nobody expects you to make an animated graph (which we will learn in a later session ;) if you are not familiar with data visualization yet.
To illustrate things ….
It probably goes without saying that visualization is the easiest way to illustrate things. With a visualization, you can help your audience grasp the deeper meaning of a phenomenon, make a complex process easier to understand, or show a development you are talking about. You may even use it to highlight the current state of research. I guess this sounds a bit boring, but it is worth mentioning. Let me explain with another nice example: many reasonable people believe that climate change is real, but as we all know, some people doubt or even deny it. So, let’s assume we want to investigate what kind of stuff those people deny, say, or tweet. To do so, we would collect Twitter data from, well, let’s call him Donald, and classify the topics of Donald’s tweets. Luckily, Lazaro Gamio / Axios have already classified Donald’s tweets, and they have even provided us with a vivid graph to highlight their results. By the time this chart was made, the media (Fake news!) got the most attention from Donald, while climate change did not even appear on the list.
Regardless of the content, Lazaro Gamio / Axios presented the classified tweets in a very creative way by making an alluvial chart. This type of chart is often used to illustrate a flow or a change over time. For instance, you can depict the most recent election results and compare them with the previous ones. By doing so, you see the political flow: how many people stuck with the same party and how many switched to another one. Irrespective of alluvial charts and the size of each Twitter category, I hope this example makes clear that people are more likely to remember a graphical representation than any number. That’s the reason why visualization is so powerful, and why we need to think about how much information a graph should contain. So, I guess a few weeks from now you will still remember Donald, even if you have forgotten what those charts are called or how many times the Democrats were attacked on Twitter. At least that happens to me all the time.
Many people think of visualization as a nice gimmick, but it helps us to remember and understand the deeper meaning of a phenomenon. For example, the number of overweight and obese people has increased significantly in almost all industrialized countries over the last few decades. This trend is impressively documented by the Global Health Observatory data of the World Health Organization (WHO), which makes the problem much more tangible. The next figure shows the proportion of overweight adults (Body Mass Index, BMI ≥ 25) in 2017 on a world map. Apart from a few exceptions in Central Africa and Asia, most countries show a high percentage of overweight people. A high proportion of the adult population is overweight in all Western countries, especially in North America, Europe, and Oceania.
I could have provided you with the same information with any other kind of visualization, but displaying it on a world map shows the bigger picture. I made that figure for a textbook about impact evaluation (see Treischl & Wolbring 2020), and it took me quite some time to learn in detail how to make such a map (by the way, it’s ggmap). And it is okay to spend a little time and effort if you want to publish a graph or use it in your BA/MA thesis.
Nevertheless, there are plenty of useful online tools that help you create a visualization if you don’t have much time. For example, if you want to create a map, you can use Makechart. Those tools are a good start if you are not familiar with programming, because you can adjust a graph manually, color individual countries, and see the results instantly. Of course, you must dig a little deeper and learn how to program it yourself if you want to customize the graph further, but again, that’s not the point. Providing people with a clear picture of the research topic, the purpose of today’s class, or the prevalence of a phenomenon, as in the example above, broadens our perspective and motivates us to understand.
Make your analysis robust
Finally, let’s have a look at the animation from Tom Westlake. What do you think you see? Some random data noise? Some data coincidence that happens to appear in the shape of a star, a circle, or a dinosaur? The datasaurus is my favourite, and I hope I don’t look too nerdy. Never mind my personal taste: this animation does not depict random noise, it illustrates another important purpose of visualization. The data is “made up” and comes from Matejka & Fitzmaurice (2017). They use an algorithm (simulated annealing) to underline the importance of visualization techniques. They show us that an algorithm can generate data sets that look completely different but share the same mean, standard deviation, and correlation.
The data can be generated in any shape, and that’s why a datasaurus emerges. You can download the data, all visualizations, and the paper from their website. This includes the animation below, which shows the same data, but this time it displays the corresponding values of the mean, SD, and correlation to make clear that the summary statistics of each simulated data set are almost identical. So, check what your data looks like and use visualization techniques before analysing it. Please don’t make the same mistake I did and neglect data visualization and its core strength. It is clearly not all about the significance of some effects. Instead, we have to ask whether we can use the statistical methods we are about to apply, and we can use visualization techniques to get a clearer picture of what our data looks like.
Another prominent visualization example comes from Francis Anscombe (1973). He constructed four data sets to emphasize problems that may occur during a regression analysis. For instance, he shows how much impact an outlier may have on our estimates. Have a look at Anscombe’s quartet in the next graph. All data sets have almost identical descriptive statistics but very different distributions. Hopefully our data looks like the first scatter plot (top left) if we want to run a linear regression. Clearly, we must think about the estimation assumptions in the case of the second data set (top right). We would probably assume that there is a linear relationship between X and Y, but that’s clearly not what the data looks like. I guess you are familiar with outliers and other influential observations, so let’s skip the discussion about regression assumptions and the possible consequences of violating them.
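If you want to check the claim about identical descriptive statistics yourself, here is a minimal sketch in Python. It is not the code behind the graph; it only uses the standard library, and the helper names pearson_r and summarize are made up for this illustration. The numbers are the values published in Anscombe (1973).

```python
from statistics import mean

# Anscombe's quartet, values as published in Anscombe (1973).
# Sets I-III share the same x values; set IV has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient (hypothetical helper, computed by hand)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def summarize(x, y):
    """Return the descriptive statistics of one data set, rounded to two decimals."""
    return {
        "mean_x": round(mean(x), 2),
        "mean_y": round(mean(y), 2),
        "r": round(pearson_r(x, y), 2),
    }


# One summary per data set; the four rows should be practically identical.
SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, stats)
```

The four rows of numbers give no hint that the data sets differ. Only a scatter plot of each set, made with whatever plotting tool you prefer, reveals the curve in the second set and the outliers in the third and fourth.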
Anscombe, Francis J. (1973): Graphs in statistical analysis. American Statistician, 27, 17–21.
Mann, M. E., Bradley, R. S. & Hughes, M. K. (1999): Northern Hemisphere Temperatures During the Past Millennium: Inferences, Uncertainties, and Limitations. Geophysical Research Letters, 26(6), 759–762.
Matejka, J. & Fitzmaurice, G. (2017): Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Available online.
Treischl, E. & Wolbring, T. (2020): Wirkungsevaluation. Grundlagen, Standards, Beispiele. Book Series: Standards standardisierter und nicht-standardisierter Sozialforschung. Basel/Weinheim: Beltz Juventa. Expected for 2020. 150 pages.
Wagenmakers, E.-J. & Gronau, Q. F. (2019): A Compendium of Clean Graphs in R. Available online.