I am a social scientist, and I frequently have the impression that people in my field think of data visualization as the fun part of research: a gimmick with no added value. Admittedly, for a very long time I did not put much effort into visualization either. Making a scatter plot before running some statistical analysis was my standard procedure throughout my undergrad life, as far as I can remember. My perspective changed when I realised that visualization is the key to communicating central ideas and research results. I hope I can convince you that gaining visualization skills is worth the trouble, although you are already reading this blog, so I am probably trying to convince the wrong person 😊.
In the social sciences, we do not spend much time or effort teaching our students how to visualise, at least in my opinion. The same applies to tools that we rarely use in academia. How many people do you think are not able to do some basic calculations in Excel? Please don’t feel offended if you are one of them. The point is that we should teach the skills that matter in our field, and handling data in different environments is clearly one of them. We will work on Excel skills later, but the lack of visualization (and other programming) skills is quite a shame. We pretend that we are all familiar with different kinds of visualization techniques and that our students know them, even though we are not experts in the field. I certainly did not know all the ins and outs when I started to spend more time on data visualization, and I am still learning day by day.
So, let’s stop complaining and start learning a bit about visualization, shall we? I started to collect examples to convince my students that visualization is not just a gimmick. This post summarizes some of those examples. I hope they convince your students, supervisor, or boss, or at least entertain you for a few minutes. Let’s start with the most basic purpose of visualisation:
Make your story coherent and clear!
Visualization is a powerful way to communicate central insights from data analysis. This point has real-life implications: people will not understand your research findings if your graph is hard to grasp. I’m exaggerating, but this point can be illustrated with the hockey stick graph from Mann, Bradley & Hughes (1999). It depicts the temperature anomaly as a line diagram, and somewhere in the late 20th century we see how the temperature rises.
Do you have any idea why this graph is famous and well known as the hockey stick graph? At the end of the observation period, the temperature curve bends upwards like the blade of a hockey stick, and this graph is one of the first visualisations that shows how climate change evolves. Speaking of climate change, can you believe that an early science note about the consequences of climate change appeared in 1912 in a New Zealand newspaper (Rodney and Otamatea Times)? It’s hard to believe, isn’t it?
So, let’s pretend we know nothing about climate change, no hard facts at all. Would you believe that humankind will face serious problems because the temperature rises like a hockey stick at the end of the time series? Depending on their ecological knowledge and personal experience, some people may believe it and some may not. There is another way to visualize the same time trend. This example comes from the Climate Lab Book and I love it:
This is an awesome visualization. It gives us an intuition for how long the climate remained unchanged and how dramatic the change has become in only a few decades. Please don’t get me wrong. The hockey stick graph was one of the first graphs to show climate change, and I would never say that it is not well suited for its purpose. That’s not the point. The question is: can we do better and make it so easy that everybody can grasp the deeper meaning? Do we have the data and the technical and visualisation skills to convince our peers and the broader public?
The latter graph looks awesome and fancy, but that’s not the main point. It’s all about whether you can convey the main message with the visualisation at hand. From my point of view, the Climate Lab Book example is a strong case, even though there is nothing wrong with the hockey stick graph. The main question is whether your visualisation makes your message easily accessible and carries it, hopefully, to a broad audience. And nobody expects you to make an animated graph (which we will learn in a later session ;) if you are not familiar with data visualization yet.
To illustrate things …
It’s probably needless to mention that visualization is the easiest way to illustrate things. You can help your audience grasp the deeper meaning of a phenomenon with a visualization, make a complex process easier to understand, or show a development you are talking about. You may even use it to highlight the current state of research. I guess this sounds boring, but it is worth mentioning. Let me explain this point with another example I like: many reasonable people accept that climate change is happening, but as we all know, some people doubt or even deny it. So, let’s assume we want to investigate what other things these people deny, say, or tweet. To do so, we would collect Twitter data from, well, let’s call him Donald, to classify the topics of Donald’s tweets. Luckily, Lazaro Gamio / Axios have already classified Donald’s tweets and given us a vivid graph to highlight their results. By the time they made this chart, the media (Fake news!) got the most attention from Donald, while climate change did not even appear on the list.
Regardless of the content, Lazaro Gamio / Axios presented the classified tweets in a very creative way by making an alluvial chart. This chart type is often used to illustrate a flow or a change over time. For instance, you can depict the most recent election results and compare them with the previous ones. By doing so, you see the political flow: how many people stuck with the same party and how many switched to another one. Irrespective of alluvial charts and the size of each Twitter category, I hope this example makes clear that people are more likely to remember a graphical representation than any number. That’s why visualization is powerful and why we need to think about which information a graph should contain. So, I guess a few weeks from now you will still remember Donald, even if you have forgotten the name of the chart or the number of attacks. At least that happens to me all the time.
Many people think of visualisation as a nice gimmick, but it helps us remember and understand the deeper meaning of a phenomenon. For example, the number of overweight and obese people has increased significantly in almost all industrialized countries in recent decades. This trend is documented by the Global Health Observatory data of the World Health Organization (WHO), which makes the problem much more tangible. The next figure shows the proportion of overweight adults (BMI ≥ 25) on a world map for 2017. Apart from a few exceptions in Central Africa and Asia, most countries show high percentages of overweight people. A high proportion of the adult population is overweight in all Western countries, especially in North America, Europe, and Oceania.
I could have told you the same thing in words, but displaying the information on a world map shows the bigger picture. I made that figure for a textbook about impact evaluation (see Treischl & Wolbring 2020), and it took me quite some time to learn how to make such a map (it’s ggmap). It is okay to spend a little time and effort if you want to publish a graph or use it in your BA/MA thesis.
Nevertheless, there are plenty of useful online tools that help you create a visualisation even if you have little time. For example, if you want to create a map you can use Makechart. Such tools are a good start if you are not familiar with programming because you can adjust a graph manually, colour individual countries, and get the results instantly. Of course, you must dig a bit deeper and learn how to program if you want to customise the graph further, but again, that’s not the point. Providing people with a clear picture of the research topic, the purpose of today’s class, or the prevalence of a phenomenon, as in the example above, broadens their perspective and motivates them to understand.
Make your analysis robust
Finally, let’s have a look at the animation from Tom Westlake. What do you think you see? Some random data noise? Some coincidence in the data that happens to take the shape of a star, a circle, or a dinosaur? The datasaurus is my favourite, and I hope that doesn’t make me look too nerdy. Never mind my personal taste: this animation does not depict random noise, it illustrates another important purpose of visualisation. The data is made up and comes from Matejka & Fitzmaurice (2017). They use an algorithm (simulated annealing) to underline the importance of visualization techniques: it generates data sets that look completely different but share nearly the same mean, standard deviation, and correlation.
The data can take any shape, and that’s why a datasaurus emerges. You can download the data, all visualisations, and the paper from their website. This includes the animation below, which shows the same data, but this time it displays the corresponding mean, SD, and correlation to make clear that the summary statistics of all simulated data sets are almost identical. So check what your data looks like and use visualization techniques before you analyse it. Please don’t make the same mistakes I did. Don’t neglect data visualisation and its core strength. It is not all about the significance of some effects: we have to ask whether we can use the statistical methods we are about to apply, and visualization techniques give us a clear picture of what our data looks like.
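If you want to convince yourself that summary statistics can hide the shape of the data, here is a minimal sketch in Python using only the standard library. Please note the assumptions: the two point clouds (a circle and a cross) are toy data I construct by hand, not the data from Matejka & Fitzmaurice (2017), and the simple rescaling trick is not their simulated annealing approach. The helper names `summary` and `rescale` are made up for this illustration.

```python
import math
import statistics


def summary(xs, ys):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r), rounded to 2 decimals."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return tuple(round(v, 2) for v in (mx, my, sx, sy, cov / (sx * sy)))


def rescale(values, target_mean, target_sd):
    """Shift and stretch values so they have the requested mean and SD."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [target_mean + (v - m) / s * target_sd for v in values]


n = 40  # number of points per toy data set (arbitrary choice)

# Toy shape 1: points evenly spread on a circle.
circle_x = [math.cos(2 * math.pi * i / n) for i in range(n)]
circle_y = [math.sin(2 * math.pi * i / n) for i in range(n)]

# Toy shape 2: a cross made of the two diagonals y = x and y = -x.
steps = [-1 + 2 * i / (n // 2 - 1) for i in range(n // 2)]
cross_x = steps + steps
cross_y = steps + [-s for s in steps]

# Give both shapes the same mean (50) and SD (10) on each axis.
circle = (rescale(circle_x, 50, 10), rescale(circle_y, 50, 10))
cross = (rescale(cross_x, 50, 10), rescale(cross_y, 50, 10))

circle_stats = summary(*circle)
cross_stats = summary(*cross)
print("circle:", circle_stats)
print("cross: ", cross_stats)
```

Both shapes report the same rounded mean, SD, and correlation, although a scatter plot would show a circle in one case and a cross in the other. That is the whole point: the numbers alone cannot tell you which one you have.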
Another prominent visualization example comes from Francis Anscombe (1973). He constructed four data sets to highlight problems that may occur in a regression analysis. For instance, he shows how much impact a single outlier can have on our estimates. Have a look at Anscombe’s quartet in the next graph. All four data sets have almost identical descriptive statistics but very different distributions. Hopefully, our data looks like the first scatter plot (top left) if we want to run a linear regression. The second data set (top right) forces us to think about the assumptions behind the estimation: we would probably assume a linear relationship between X and Y, but that’s not what the data looks like. I guess you are familiar with outliers and other influential observations, so let’s skip the discussion about regression assumptions and the consequences of violating them.
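You don’t have to take my word for the “almost identical descriptive statistics”. The following sketch in plain Python recomputes them for the four data sets of Anscombe’s quartet, with the values as published in Anscombe (1973). The helper `describe` and the dictionary layout are my own choices for this illustration, and the regression line is estimated with the textbook least-squares formulas rather than a statistics package.

```python
import math
import statistics

# Anscombe's quartet: data sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def describe(xs, ys):
    """Means, Pearson correlation, and least-squares line for one data set."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


results = {name: describe(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in results.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows come out nearly the same, so a regression table alone would not tell the data sets apart. Only the scatter plots reveal the curve in the second data set and the outliers in the third and fourth.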
Anscombe, Francis J. (1973): Graphs in statistical analysis. American Statistician, 27, 17–21.
Mann, M. E., Bradley, R. S. & Hughes, M. K. (1999): Northern Hemisphere Temperatures During the Past Millennium: Inferences, Uncertainties, and Limitations. Geophysical Research Letters, 26(6), 759–762.
Matejka, J. & Fitzmaurice, G. (2017): Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Online available.
Treischl, E. & Wolbring, T. (2020): Wirkungsevaluation. Grundlagen, Standards, Beispiele. Book Series: Standards standardisierter und nicht-standardisierter Sozialforschung. Basel/Weinheim: Beltz Juventa. Expected for 2020. 150 pages.
Wagenmakers, E.-J. & Gronau, Q. F. (2019): A Compendium of Clean Graphs in R. Online available.