Why learn Data Visualization?

In the age of data, chances are that you have to design and present data whether you’re in business or academia. Data visualization is about communicating your message. The graph you choose and how you design it determines how your message is understood. It also reflects on you and the perception of your work. Good data visualization makes your message both impactful and memorable.

Simple principles make the difference

Unfortunately, poorly designed graphs are all too common. Cluttered and confusing visuals obfuscate the message and frustrate the audience. However, knowing just a few basic data visualization principles enables you to create clear and compelling graphs that elevate your work above the rest.

This pocket guide to data visualization will teach you to pick the right type of graph for your data and design it effectively.

You will learn interactively by iteratively improving examples of bad graph design. Learning by doing has been the most effective way for me to make concepts stick, and I hope it will do the same for you.

This is a work in progress, and I hope to improve it with your feedback.

The goal in mind

Data visualization is about communicating information. A well-crafted graph conveys information faster, more clearly, and more memorably than any spreadsheet or paragraph of text can.

Picking the Right Tools: Know Your Graphs

After recognizing the importance of data visualization, the next step is knowing how to do it well. The first rule is choosing the right type of graph for your data. Should it be a bar chart, a pie chart, or a scatter plot? Each has its strengths and weaknesses, and your choice affects the impact of your message.

Cutting Through the Clutter: Design for Impact

Lastly, the visual design elements—color, scale, and labels—are equally vital. We must choose and arrange them carefully to avoid confusion and highlight our key points.

Data visualization is a powerful tool for effective communication. Done correctly, it can elevate your data and make your message resonate. Let’s dive in and transform the way you present data.

Marks and Channels, Building Blocks of Data Visualization

I will refrain from jargon; marks and visual channels are the only two “insider” words you need in your vocabulary.

What are Marks?

Marks are the basic shapes you see in any visual representation. Imagine you’re doodling on a piece of paper. A dot is like a zero-dimensional mark—it’s just a point. If you draw a simple line, that’s a one-dimensional mark. You create a two-dimensional mark or area when you color in a square or draw a filled circle. You could even draw a 3D box. But please don’t. I will rant about 3D graphs later in this guide.

What are Visual Channels?

Now, channels are the fun part. Let’s say you have a dot on your chart. You can change its “spatial position” by moving the dot. You could also change its color, making it lighter, darker, or even a different hue entirely. That’s adjusting color channels like “hue”, “saturation”, and “luminance.” The different visual channels are important to know because humans can distinguish and estimate some channels better than others.

Play with the visual channels below. Can you accurately predict by how much the value changes when the position or the area changes? Is it easier to guess the value for position than for area, or the other way around?

Want to make that dot bigger or smaller? You’re using the “size” channel. If it’s a line, you’re dealing with its length; the rectangle above changes the “area,” and if you’re brave enough to mess with 3D, you’re changing its volume.
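One practical consequence of the size channel: if the area of a circle is meant to encode a value, its radius has to grow with the square root of that value, not with the value itself. The sketch below illustrates this under simple assumptions; the function name `radius_for_value` and the `area_per_unit` scale factor are made up for illustration and do not come from any particular library.

```python
import math

def radius_for_value(value: float, area_per_unit: float = 1.0) -> float:
    """Radius of a circle whose AREA is proportional to `value`."""
    if value < 0:
        raise ValueError("area cannot encode a negative value")
    area = value * area_per_unit
    return math.sqrt(area / math.pi)

r1 = radius_for_value(100)
r2 = radius_for_value(200)
# Doubling the value doubles the area, but the radius grows only by sqrt(2).
print(round(r2 / r1, 3))  # 1.414
```

Scaling the radius linearly instead would quadruple the area for a doubled value, which is one reason area is harder to read than position or length.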

You can also make your mark move—that’s the “motion” channel.

Tilt your mark to change its “angle.” There are more channels out there; this book is a very good resource.

The four basic types of graphs

Now that you know all about visual channels, let’s put them to use and design effective graphs for actual data. Naturally, we will work with some nice football data.

Here are the 20 Premier League teams and the number of goals they scored in the 2022/2023 season.

Team Name Goals Scored

How would you visualize this data?

Pie charts are everywhere for this type of data. Unfortunately, they are among the least effective and most misleading charts. Recall the visual channels above: pie charts rely on area and angle, both of which humans have trouble estimating. In addition, depending on the orientation, some slices can appear larger than others, potentially misleading your audience.

As you can see, Pie Charts also quickly clutter up the space.

Avoid Pie Charts if you can.

Let’s have a look at the data again:

Team Name Goals Scored

Tables and Numbers

Do you notice something? A simple table is already much more effective in communicating this data. A table could be the best choice, depending on what you want to do. You will learn how to design effective tables quickly later in this guide.

Utilizing Position

The most effective visual channels are position and length. We have one numeric variable, the number of goals, and 20 categories, one per team. A bar chart is the natural and almost always the most effective way to compare categories.

Notice how it’s easier to distinguish and compare the teams now?

Even though this is a massive improvement to our Pie Chart, we can still do better.

Remember the principle of simplicity: we want to reduce the cognitive load and enable the audience to focus on the information visualized. Look at the Chart above. Do you think everything is necessary? How could we improve it?

Redundancy

For one, we have a lot of redundant encodings. Redundancy can be good if you want to underscore and emphasize a relationship. However, here we have the Team Names, the crest, and different colors for each bar. These encodings do not add anything useful. They don’t help you emphasize or highlight information. They do, however, add to the cognitive load and distract. Try removing one, two, or all of these Labels:

Ticks, Markers and Annotations

In the same spirit, do you see all the ticks, the grid, and markers from the axes? They can be helpful, but you may not need as many. Reducing the ticks by increasing the intervals clears clutter. Often, it’s best to remove them altogether. Does it help if you make the axes blend more into the background? Change their color?

Sometimes, you don’t need any axes at all. Play around with the score labels, and see if it looks better with the labels in the bar, below, or on top. You may not even need any numbers, depending on what you want to show. Experiment!

Order of Bar Charts

Use the ordering of the bars, sorted or unsorted, to help show what you want to communicate. Is there a large drop? Are the values evenly distributed?
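To see what sorting does to a bar chart without any plotting library, here is a minimal text-only sketch. It uses the 2022-23 goals of the three teams tabulated later in this guide; the function name `text_bars` and its parameters are illustrative assumptions.

```python
# Goals in the 2022-23 season for the three teams tabulated later in this guide.
goals = {"ManU": 58, "Liverpool": 75, "ManCity": 94}

def text_bars(data: dict[str, int], sort: bool = True, width: int = 40) -> list[str]:
    """Render one text bar per category; bar length encodes the value."""
    items = list(data.items())
    if sort:
        items.sort(key=lambda kv: kv[1], reverse=True)
    longest = max(value for _, value in items)
    label_width = max(len(name) for name, _ in items)
    lines = []
    for name, value in items:
        bar = "#" * round(value / longest * width)
        lines.append(f"{name:<{label_width}}  {bar} {value}")
    return lines

print("\n".join(text_bars(goals)))              # sorted: the ranking is obvious
print("\n".join(text_bars(goals, sort=False)))  # original order kept
```

Sorting makes the ranking and any large drop visible at a glance; keeping the original order can be the better choice when the categories have a natural sequence of their own.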

Don’t ever use 3D

Before we move on to the next type of chart, let’s see how we could still mess up a perfectly good bar chart. Or any type of chart for that matter.

3D Charts
Image by pikisuperstar on Freepik

3D relies on the least accurate and most difficult-to-estimate visual channels and is heavily distorted by perspective. 3D charts are hard to read, misleading, and often just plain ugly. Don’t use 3D.

2D Data

Some data have several dimensions and features whose relationship you want to visualize. We will look at different types of 2D data and choose the best kind of chart for each together.

Multiple Categories with two features each

We start with visualizing the relationship between multiple categories where each data point has two features. Let’s expand our Premier League data to also include the number of goals each team conceded during the 2022/2023 season:

Team Name Goals Scored Goals Against

Now, with two features, we could try to use two bars for each team:

Bar charts are simple and effective, so grouped bars are often a good option when you compare two features across a few categories. However, a grouped bar chart is not the best choice when you want to show how two numeric features relate to each other across many categories. It’s hard to see how the different teams relate to one another, and you likely care less about the exact numeric differences than about the overall relationship.

The simplest (and often best) choice for this type of data is a so-called scatter chart.

Let’s repeat the same exercise and improve the scatter chart as we did with the bar chart. This time, however, you can do it all at once and by yourself.

Here is what I came up with; what does yours look like? There is no exact right or wrong. A lot of it is still subjective and a matter of preference. The important thing is understanding the principles: simplicity and effectiveness.

Another type of chart I’ve sadly often seen used to visualize this type of data is the line-chart. Line-charts are great, but they’re not made for this kind of data. They mislead the audience into thinking that there is some kind of continuous relationship between the teams:

Categories with two features across time or space

Line-charts, however, are great at comparing 2D data of multiple categories across time. Let’s expand our data to include the goals of the previous five seasons and focus on a single team.

Manchester United Goals per Season

Improve the line-chart!

Good Job. Here is my attempt.

Great! We can also use a line-chart to compare categories (teams in our case) across time or some other continuum. Be sure to improve the chart again!

However, it doesn’t work well if you have more than two or three categories; remember the cognitive load and clutter. You may want to use a grouped bar chart if you have many categories.

Which chart do you prefer? Which makes it easiest for you to compare the different teams and seasons and lets you learn and think about the data the most efficiently?

Goals Per Season

Complex Graphs

By now you know the principles and how to apply them. The basic chart types will solve the majority of your data visualization needs effectively. And if you stick to the basics of visualization and iteratively improve your charts as we’ve done so far, your graphs will be more visually appealing and effective than those of most of your peers.

However, sometimes the basic graphs just don’t cut it. Data can be more complex or numerous, and you may want to visualize and communicate relationships across multiple dimensions and categories.

In this section, we explore some more complex data, do some very simple EDA (Exploratory Data Analysis), decide what information we want to communicate, and work out how to deliver this message with effective data visualization.

Expanding and Grouping

In the last section, we looked at Manchester United’s goals scored in the past few seasons. This led to an interesting line chart. We can now expand this to compare their performance with Manchester City and Liverpool.

Season ManU Liverpool ManCity
2022-23 58 75 94
2021-22 57 94 99
2020-21 73 68 83

Goals Per Season

Personally, I find that a grouped bar chart fits slightly better for this type of data. However, this is a matter of preference. Judge for yourself. Improve both charts.

Goals Per Season

Have you noticed any distinct patterns or outliers? Does grouping by “Team” reveal insights that aren’t immediately apparent when grouping by “Season”? For instance, in the 2021-22 season, both Liverpool and Man City topped the scoring charts, while Manchester United lagged behind with the fewest goals. Exploratory Data Analysis (EDA) is a pivotal step in data science and machine learning. It’s essential to understand your data. Visualization not only deepens this understanding but also enhances your ability to interpret data visually.
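The two groupings can also be explored in a few lines of code before drawing anything. The sketch below uses the goals table above; the helper name `regroup_by_team` is an illustrative assumption.

```python
# Goals per season, copied from the table above.
by_season = {
    "2022-23": {"ManU": 58, "Liverpool": 75, "ManCity": 94},
    "2021-22": {"ManU": 57, "Liverpool": 94, "ManCity": 99},
    "2020-21": {"ManU": 73, "Liverpool": 68, "ManCity": 83},
}

def regroup_by_team(data):
    """Flip season -> team -> goals into team -> season -> goals."""
    by_team = {}
    for season, teams in data.items():
        for team, goals in teams.items():
            by_team.setdefault(team, {})[season] = goals
    return by_team

by_team = regroup_by_team(by_season)

# Grouped by season: who scored the fewest goals each year?
for season, teams in by_season.items():
    print(season, min(teams, key=teams.get))

# Grouped by team: how far apart are each team's best and worst seasons?
for team, seasons in by_team.items():
    print(team, max(seasons.values()) - min(seasons.values()))
```

The first loop answers a question about a season, the second a question about a team; which grouping you draw should follow from which of these questions you want the chart to answer.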

More Data

We stick to football; however, to keep things interesting, let’s get more data. We will now look at each team, the number of shots they took, and how many of those shots they converted into goals. We could even use this data to get a very crude xG baseline for each team.
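One way to read “very crude xG baseline” is a plain conversion rate, goals divided by shots, compared against the league-wide rate. That reading is an assumption, and the team names and numbers below are hypothetical placeholders rather than real statistics.

```python
# Hypothetical numbers for illustration only; they are not real team statistics.
teams = {
    "Team A": {"shots": 600, "goals": 90},
    "Team B": {"shots": 450, "goals": 45},
    "Team C": {"shots": 400, "goals": 30},
}

def conversion_rate(shots: int, goals: int) -> float:
    """Share of shots that ended in a goal; 0.0 when no shots were taken."""
    return goals / shots if shots else 0.0

rates = {name: conversion_rate(**stats) for name, stats in teams.items()}

# A league-wide rate is the crudest possible stand-in for xG per shot:
# it treats every shot as equally likely to go in.
total_goals = sum(t["goals"] for t in teams.values())
total_shots = sum(t["shots"] for t in teams.values())
league_rate = total_goals / total_shots

for name, stats in teams.items():
    expected = stats["shots"] * league_rate
    diff = stats["goals"] - expected
    print(f"{name}: {rates[name]:.1%} conversion, {diff:+.1f} goals vs. baseline")
```

A real xG model would weight each shot by its position and context; this baseline ignores all of that and only shows which teams finish above or below the average rate.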

Distributions

A distribution can often give you a great overview and understanding of the underlying data. There are many ways to visualize distributions, and the right one depends on what we want to show.
Do you want to highlight individual data points, details, and relationships? Or see whether the distribution follows some classic shape (Normal? Beta?)?

Density plots are a great choice for the latter:

Density Plot of the player distribution.
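What a density plot draws can be sketched with a small kernel density estimate. The version below assumes a Gaussian kernel and a hand-picked bandwidth; the sample values are hypothetical and the function name `gaussian_kde` is chosen here for illustration (it is a plain-Python sketch, not the SciPy function of the same name).

```python
import math

def gaussian_kde(samples: list[float], bandwidth: float):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x: float) -> float:
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical sample values; not taken from the guide's data set.
samples = [1, 2, 2, 3, 3, 3, 4, 4, 6, 9]
density = gaussian_kde(samples, bandwidth=1.0)

# Evaluate on a grid; these (x, y) pairs are what a density plot draws as a curve.
grid = [i * 0.5 for i in range(-10, 41)]  # -5.0 .. 20.0
curve = [(x, density(x)) for x in grid]
peak_x, peak_y = max(curve, key=lambda p: p[1])
print(peak_x, round(peak_y, 3))  # grid point where the estimate peaks
```

The bandwidth controls how smooth the curve is: a small value follows individual points closely, a large one blurs them into a single hump, so it is worth stating which value a plot uses.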

For now, we are more interested in the individual points and want to see details. A great choice for this is a jitter plot. The “jitter” comes from the fact that we create a “fake” second dimension by adding noise, which makes the individual points more discernible. This is common practice, but make sure it’s clear to the audience, and mention it if necessary so as not to mislead or confuse.
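The jitter idea can be sketched in a few lines. The function name `jitter`, the `spread` parameter, and the sample values are assumptions made for illustration.

```python
import random

def jitter(values: list[float], spread: float = 0.2, seed: int = 0) -> list[tuple[float, float]]:
    """Pair each value with a random offset on a second, meaningless axis."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [(v, rng.uniform(-spread, spread)) for v in values]

# Hypothetical values with many ties, which would overlap without jitter.
values = [3, 3, 3, 4, 4, 7]
points = jitter(values)
for x, y in points:
    print(x, round(y, 3))
```

Only the second coordinate is randomized, so the real values stay exactly where they belong; the noise axis should be left unlabeled or explicitly marked as carrying no meaning.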

Go through the same exercise again:

Great job, here is my attempt:

Combining multiple Graphs

Line Chart and Bar Charts

Jitterplot and KDE/Histogram

Tables are effective

Too many people want to avoid Tables because they are simple. But, as you know by now, simple is what we want. When we visualize, we want to effectively communicate a message with data, and in many cases, Tables do this best.

Conventions and best practices

Like all visualizations we’ve looked at, tables are simple. As long as you follow a few basic rules, you can create tables that are far better than most you find in the wild.

Proximity, Alignment and White spaces

Tables are visualizations. When well designed, they effectively use proximity and white space to help our brain group information.

White spaces

It often seems difficult, but you can always create a little space in your tables. Sometimes I see tight table designs where the rules and grid touch the letters in the cells or, even worse, cross them. This is lazy. Try to use smaller fonts or get rid of the grid and borders. In most cases, the table will look better for it. White space and good alignment should be enough for our brains to group and separate data.

White space does, however, have some limitations. As the rows grow wider, we get another case of cognitive overload. Then, you may consider adding borders, shading, or another type of visualization if this level of detail is unnecessary.

Alignment Conventions

In general, text columns are aligned to the left and numeric columns to the right. This creates more white space and helps our eyes follow a pattern they are used to.

Adjust the table below: apply the conventions, clean it up, and see how it feels. Note that in some browsers the borders will show no matter what.

Team Goals Shots

We only go over the most basic adjustments, which should already give you a good table design. In your own designs, you might want to play with border widths, keep only some of the borders, or change the font style of a specific row or column to highlight it.

Did you notice how following the conventions naturally creates more white space?
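The alignment conventions are easy to try out in plain text. The sketch below left-aligns the text column, right-aligns the numeric columns, and uses white space alone as the separator; it reuses the goals from the season table earlier in this guide, and the function name `format_table` is an illustrative assumption.

```python
# Goals per season for three teams, taken from the table earlier in this guide.
header = ("Team", "2022-23", "2021-22")
rows = [
    ("Manchester United", 58, 57),
    ("Liverpool", 75, 94),
    ("Manchester City", 94, 99),
]

def format_table(header, rows) -> str:
    """Left-align text, right-align numbers, separate columns with white space only."""
    columns = list(zip(header, *rows))
    widths = [max(len(str(cell)) for cell in col) for col in columns]
    lines = []
    for row in [header, *rows]:
        cells = []
        for i, (cell, width) in enumerate(zip(row, widths)):
            # The first column holds text; the remaining columns hold numbers.
            text = str(cell)
            cells.append(text.ljust(width) if i == 0 else text.rjust(width))
        lines.append("   ".join(cells))
    return "\n".join(lines)

print(format_table(header, rows))
```

Right-aligning the numbers lines up their digits, so differences in magnitude show up as differences in width without any grid lines.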

Grid, Borders and Alternating

From the basic building blocks, we remember that clutter is our enemy. This principle holds even more for tables. Unfortunately, when people make tables to visualize data, they often add as much clutter as possible: alternately filled rows or columns (or both!), borders in every direction, and thick delineation between headers, index columns, and footers.

Try to remove the clutter in the table below and see the difference!

Team Goals Shots

This is my attempt, again no right or wrong here:

Grouping Tables

If you have a lot of data and information you want to show and absolutely need to stick to tables and provide detail, consider using subgroups. Here are a few examples of how we can split and group data from one table into several to create a better overview:

Team Goals Shots SOTP SOT SH90 SOT90
Team Goals Shots SOTP
Team SOT SH90 SOT90

It’s also worth checking whether you can use some interactivity, depending on your medium, such as a tooltip on hover that explains abbreviations or offers more detail on demand. There is a popular mantra by Shneiderman which states “Overview first, details on demand”.

You can also create an overview by adding a heatmap:

Team Goals Shots

Colors effectively draw the reader’s attention to notable information, like outliers. Liverpool and Manchester City immediately stand out as clear overperformers, while Norwich is distinctly highlighted in bright blue. This prominence is due to color being a primary preattentive attribute. Preattentive attributes are visual cues our brain processes subconsciously. If you’re familiar with “Thinking, Fast and Slow” by Daniel Kahneman, they engage our System 1.
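A table heatmap boils down to mapping each cell value to a color. The sketch below assumes a simple white-to-blue sequential scale, which is not necessarily the color scale used in the table above; the function name `heat_color` is illustrative, and the goals are the 2022-23 values from the season table earlier in this guide.

```python
def heat_color(value: float, vmin: float, vmax: float) -> str:
    """Map a value to a white-to-blue hex color; higher values are more saturated."""
    if vmax == vmin:
        t = 0.0
    else:
        t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    # Fade red and green from 255 (white) down to 0; keep blue at full strength.
    fade = round(255 * (1 - t))
    return f"#{fade:02x}{fade:02x}ff"

# Goals in the 2022-23 season, from the table earlier in this guide.
goals = {"ManU": 58, "Liverpool": 75, "ManCity": 94}
lo, hi = min(goals.values()), max(goals.values())
for team, g in goals.items():
    print(team, g, heat_color(g, lo, hi))
```

Each returned string can be used as a cell background color. When the data has a meaningful midpoint, such as over- versus underperformance, a diverging scale with two hues is usually the better fit.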

As with all visualizations, this table can still be improved a lot. Please share your designs and suggestions in the Discord server.

In the next Chapter, we’ll dive deeper into Highlighting and how to use those Preattentive Attributes to our advantage and lighten the cognitive load on our readers’ working memory.

Best practices in Table Design

Stephen Few provides a great extended summary of best practices in table design in his book “Show Me the Numbers”. I highly recommend it: it’s well written and includes a lot of empirically tested examples.

Here is a short version of the most important basics to keep in mind when designing tables:

1. White Space:

  • Prioritize white space for clear separation of rows and columns. Avoid grids.

2. Data Arrangement:

  • Organize categorical data logically, either horizontally or vertically.
  • Maintain consistency in grouping and structure.
  • Place related data columns closely.

3. Text Formatting:

  • Use horizontal orientation and ensure legibility.
  • Align data consistently; for instance, right-align numbers.
  • Opt for precise and consistent representations, especially for dates and large numbers.

4. Summarization:

  • Differentiate summary columns and ensure they’re easily identifiable.

5. Page Information:

  • Ensure continuity by repeating headers on each page.

Remember, the overarching goal is clarity and ease of comprehension. The table should convey information in a straightforward manner, making it easy for the viewer to understand and interpret the data.

Highlighting

Great job on making it this far! You now understand the basics of data visualization and design, have a toolset of simple and complex graphs that should satisfy nearly all your visualization needs, and know how you can expand and combine them.

All of the visualizations we have looked at so far gave overviews. For the final part, let’s see how we can emphasize a particular data point or relationship effectively. We will do so using the visual channels of color, size, and motion.

Preattentive Attributes

We mentioned them in the previous chapter. Here is a basic example. Can you tell me how many fives there are?

987349790275647902894728624092406037070570279072 803208029007302501270237008374082078720272007508 324780260270379377570970737797066746209709470278 279797097230972309793927109272797987349726080272

It's not easy, is it? Try highlighting.

Color

Color is an extremely effective way of highlighting and emphasizing. If there is a particular data point you want to show, it’s often enough to simply reduce the clutter and distractions, much like the improvements we made earlier in the basic graphs section. And if you want to highlight a single data point, all other data points are distractions.

987349790275647902894728624092406037070570279072 803208029007302501270237008374082078720272007508 324780260270379377570970737797066746209709470278 279797097230972309793927109272797987349726080272

Coloring the fives makes them jump out even more.
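The same effect can be reproduced in a terminal. The sketch below wraps every five in an ANSI escape sequence for bold red text; it assumes a terminal that interprets ANSI codes, and the helper name `highlight` is illustrative.

```python
BOLD_RED = "\033[1;31m"  # ANSI escape: bold + red foreground
RESET = "\033[0m"        # ANSI escape: back to the default style

def highlight(text: str, target: str = "5") -> str:
    """Wrap every occurrence of `target` in an ANSI bold-red escape sequence."""
    return text.replace(target, f"{BOLD_RED}{target}{RESET}")

# First block of digits from the example above.
digits = "987349790275647902894728624092406037070570279072"
print(highlight(digits))
print("fives:", digits.count("5"))
```

Counting is trivial for the computer either way; the point of the highlight is that the reader can now do it at a glance too.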

But remember the principle of simplicity: too many colors are as bad as no colors. Can you tell how many fives there are? They are still bold, but the many colors lead to cognitive overload, and counting them is very hard.

987349790275647902894728624092406037070570279072 803208029007302501270237008374082078720272007508 324780260270379377570970737797066746209709470278 279797097230972309793927109272797987349726080272

Using Color is a simple and effective way of highlighting data points. Try with these familiar (now cleaned up) charts:
Goals scored by every Team 2023

I chose Manchester United for illustration purposes, since the team colors are red and it’s a popular color to use in highlighting. If you go on my mondaystats site you can highlight any other team you like.

Shots Taken by every Team 2023

Play around in your favorite data visualization library and highlight. Try to let the rest of your chart blend into the background to add extra emphasis. Data viz is fun and should be a creative task first and foremost!
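In most plotting libraries, highlighting comes down to passing one color per mark. The sketch below builds such a color list and, as one possible use, hands it to a Matplotlib `Axes` supplied by the caller; the color values and the names `bar_colors` and `draw` are assumptions chosen for illustration, and the goals are the 2022-23 values from the season table earlier in this guide.

```python
HIGHLIGHT = "#d62728"  # a red, chosen here only for illustration
MUTED = "#c8c8c8"      # light gray so the other bars recede

def bar_colors(labels: list[str], focus: str) -> list[str]:
    """One color per bar: the focus category in red, everything else in gray."""
    return [HIGHLIGHT if label == focus else MUTED for label in labels]

teams = ["ManCity", "Liverpool", "ManU"]
goals = [94, 75, 58]  # 2022-23 season, from the table earlier in this guide
colors = bar_colors(teams, focus="ManU")
print(colors)

def draw(ax):
    """Draw onto a Matplotlib Axes passed in by the caller (nothing is imported here)."""
    ax.barh(teams, goals, color=colors)
    ax.invert_yaxis()  # keep the first (largest) team on top
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)  # drop frame lines that add clutter
```

With Matplotlib installed, `fig, ax = plt.subplots()` followed by `draw(ax)` and `plt.show()` would render the chart; the color list itself works the same way with any library that accepts per-bar colors.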

Size

Another channel that works well as a preattentive attribute is “size”.

987349790275647902894728624092406037070570279072 803208029007302501270237008374082078720272007508 324780260270379377570970737797066746209709470278 279797097230972309793927109272797987349726080272

Coloring helps add even more emphasis to the fives. Can you tell?

It works on other charts too, and size can often encode data in an extra dimension; think of population charts like the ones on the popular Gapminder website. This, by the way, is an interactive chart too; you can try to improve it a little and highlight. Personally, I am not a big fan of this UI. What do you think?

The Gapminder website is great fun, and it’s one of the first data visualization tools that really amazed me. In part because it was so interactive, which brings us to the next section, Interactivity.

Summary of Highlighting techniques

Highlighting key information directs attention to what's important. Use highlighting with intent and sparingly. A rule of thumb is at most 10% of a visual. Highlight marks and text with these tools:

  • Bold, italics, and underlining:
    Best for titles, labels, captions, and short sequences. Bolding is most effective. Italics are less noticeable and legible while underlining can reduce legibility and should be used cautiously.
  • CASE and typeface:
    Uppercase is beneficial for short sequences, especially for titles, labels, and keywords. Avoid multiple font styles because they make your text less legible.
  • Color:
    Effective when used sparingly. Color is most effective when combined with other techniques like bolding.
  • Inversing elements:
    Although effective in capturing attention, it can clutter the design. Use with caution.
  • Size:
    Adjusting the size is a straightforward way to signify importance and attract attention. Effective when combined with bolding and color.

Interactivity

Overview First, Details on Demand.
- Ben Shneiderman

We revisit Shneiderman’s mantra - “Overview First, Details on Demand”. This guiding principle is at the heart of how interactivity enhances data visualization.

In data visualization, our aim isn’t merely to present data but to communicate its essence visually. While designing charts with interactive components, we must remember that our minds, remarkable as they are, come with a limited working memory. Information stacks up, and too much can make comprehension cumbersome. Here lies the importance of managing cognitive load: ensuring that understanding our visualizations requires as little mental processing as possible. Interactivity is our ally in this endeavor. With it, we can craft graphs and charts that are easy to understand and provide an overview, yet hold deeper layers that reveal more details upon interaction.

You can leverage interactive elements to reveal details and provide alternative views of your data. You saw a few above in the Gapminder Chart (and perhaps on its website if you wanted to explore). Some of the simplest interactive features include Tooltips and Selections.

  • Hover Display Information: Hovering over a specific data point or segment can reveal more about it without cluttering the view.
  • Hover Highlight: This emphasizes the data point under scrutiny, subtly dimming others, ensuring the viewer’s attention is directed appropriately.
  • Click Select: This feature allows users to click on specific data points or segments to lock in a view or see further associated data, perhaps in an additional window or sidebar.
  • Movement: Elements within the visualization can respond dynamically to user movement, adjusting, animating, or morphing to provide deeper insights or different perspectives.
Shots Taken by every Team 2023

By harnessing these tools, we can ensure that our data visualizations provide an immediate, clear overview yet still offer rich details, but only when the viewer demands. As Shneiderman reminds us, this balance is crucial in effective visual communication.
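Behind a hover tooltip sits a small piece of hit-testing logic: find the data point nearest to the cursor and show its details only if it is close enough. The sketch below shows that logic on its own, independent of any UI toolkit; the function names, the `radius` threshold, and the data are hypothetical placeholders.

```python
import math

def point_under_cursor(points, cursor, radius=5.0):
    """Return the closest (label, x, y) within `radius` of the cursor, or None."""
    cx, cy = cursor
    best, best_dist = None, radius
    for label, x, y in points:
        dist = math.hypot(x - cx, y - cy)
        if dist <= best_dist:
            best, best_dist = (label, x, y), dist
    return best

def tooltip(point) -> str:
    """Details on demand: text shown only while a point is hovered."""
    if point is None:
        return ""
    label, x, y = point
    return f"{label}: {x} shots, {y} goals"

# Hypothetical data; the numbers are placeholders, not real statistics.
points = [("Team A", 600, 90), ("Team B", 450, 45)]
print(tooltip(point_under_cursor(points, cursor=(598, 92))))         # near Team A
print(repr(tooltip(point_under_cursor(points, cursor=(100, 100)))))  # nothing nearby
```

In a real chart, the cursor position arrives from the toolkit's mouse-move event, and the same lookup can drive a hover highlight by dimming every point except the one returned.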

The Importance of Design

People perceive beautiful things as more useful, a phenomenon first described by Masaaki Kurosu and Kaori Kashimura and known as the aesthetic-usability effect. It means that how “beautiful” our data visualizations are matters as much as how effectively they communicate. Luckily, there isn’t much more to it than what we have already learned. The principles mostly complement one another.

Following the basic guidelines you have learned already produces more aesthetically pleasing graphs.

In this section, we’ll dive a little deeper into a few common pitfalls visualizers still fall into and how to fix or avoid them.

Solving design problems in graphs

Overplotting

Problem: Overplotting occurs when data points occupy the same space. Overplotting makes the visualization challenging to interpret and conceals patterns, trends, or anomalies that might be present.

Placeholder for Example: [Image showing a scatter plot with numerous overlapping data points.]

Possible Solutions:

  • Use transparency to make overlapping points more visible. By making the data points semi-transparent, areas with high data density will appear darker.
  • Consider using a different type of visualization like a histogram or a heatmap.
  • Add “jitter” - small random noise to data positions to prevent exact overlaps.
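Why transparency reveals density can be worked out directly. Assuming marks of the same color drawn with standard “over” alpha blending, n overlapping marks with opacity alpha combine to an opacity of 1 - (1 - alpha)^n. The function name below is illustrative.

```python
def stacked_opacity(alpha: float, n: int) -> float:
    """Combined opacity of n overlapping marks, each drawn with opacity `alpha`."""
    return 1 - (1 - alpha) ** n

for n in (1, 2, 5, 10):
    print(n, round(stacked_opacity(0.2, n), 3))
# 1 0.2
# 2 0.36
# 5 0.672
# 10 0.893
```

The curve flattens as it approaches 1, so very dense regions stop looking different from each other; with heavy overplotting, a lower alpha or a heatmap keeps more of the range visible.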

Flow

Problem: Flow is the path the viewer’s eyes take when scanning a page. Without a clear flow, the reader’s attention has to jump from one place to another; you want it to “flow” like a river instead.

Placeholder for Example: [Image of a graph with no clear sequence or hierarchy.]

Solutions:

  • Structure your visualization with a clear start and end point, guiding the viewer’s eyes in a logical sequence.
  • Use visual cues like arrows, lines, or highlighted pathways to guide the viewer. Be frugal! Remember to reduce clutter first and foremost!
  • Ensure related data points or groups are close to make comparisons easier.

Layout

Problem: Layout and flow go together. A cluttered or poorly organized layout distracts from the message.

Placeholder for Example: [Image of a graph with elements too close together, labels overlapping, and no clear focal point.]

Solution:

  • Provide ample white space between elements to reduce visual clutter.
  • Group related elements and keep unrelated elements separate.
  • Ensure all labels are clear and legible, and adjust their positioning to prevent overlaps.

Alignment

Problem: Misalignment makes visualizations messy and unprofessional.

Placeholder for Example: [Image of a bar chart where the bars and the corresponding labels are not aligned.]

Solution:

  • Use white space and conventions to align elements.
  • If that fails, use grids or guides.
  • For multi-part visualizations (like a dashboard), ensure consistent alignment across all parts for visual harmony.

Remember, the goal of a data visualization is not just to present data, but to make it comprehensible, insightful, and engaging. By paying attention to these design elements, you can elevate the quality of your visuals and make your data truly shine!

No Screenshots of images!

If you include images, don’t take screenshots of them. Save and use the original image file instead. If you absolutely have to take a screenshot of a web application or site, use dedicated software or your browser’s screenshot functionality.

Contrast!

Don’t put purple on gray or gray on purple, yellow on white, or dark blue on green; I’ve seen these combinations too often. Also be aware that roughly 8% of men and 0.5% of women are red-green colorblind.
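Contrast can be checked numerically rather than by eye. The sketch below uses the contrast-ratio formula from the WCAG 2.x accessibility guidelines, which this guide does not mention itself, so treat it as one possible yardstick; the function names are illustrative. WCAG asks for a ratio of at least 4.5:1 for normal body text.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#rrggbb'."""
    digits = hex_color.lstrip("#")
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)

def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical colors) to 21.0 (black on white)."""
    l1, l2 = sorted((luminance(foreground), luminance(background)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

pairs = [("#000000", "#ffffff"), ("#ffff00", "#ffffff"), ("#800080", "#808080")]
for fg, bg in pairs:
    print(fg, "on", bg, round(contrast_ratio(fg, bg), 2))
```

Pure yellow on white comes out barely above 1:1, which matches the advice above. A luminance-based ratio does not capture red-green confusion, so a colorblindness simulator is still worth using for hue choices.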

User friendliness

In addition to the above issues, we also often forget about user friendliness. One of my favorite books in this regard is Steve Krug’s “Don’t Make Me Think”. Make buttons look like buttons and follow conventions. It should require no thought at all (or as little as possible) for your users to find out how to interact with your data.