Choosing the right metrics
The needs of our audience must drive every choice we make in data storytelling. We must ask ourselves, what does our audience need to know? What do we want them to take away from our data story, and what action do we want them to take next?
Sometimes lady luck smiles upon us. Our raw data is in great shape; all we need to do is select and configure the right visualization and we’re ready to start crafting a story that will provide value to our audience. However, the data we happen to have on hand is often far from ready to be presented to an audience. Frequently, the metrics available in our raw data set will not be the best metrics to convey the insights that will be most valuable to our audience.
So, what do we mean by metrics, and how do we choose the best metrics to tell our story?
Let’s look at an example. A firm wants to understand if it is getting value from its marketing spend. The raw data available to the analyst is shown in the table below.
The data set contains three columns:
Month – the month of the year.
Total Spend – the total spend on marketing efforts each month.
Total Leads – the number of new leads the firm has acquired each month.
The “Total Spend” and “Total Leads” columns in the above table are both metrics. Metrics are quantitative values that can be expressed as numbers. They allow us to measure something of importance to us.
The “Month” column is not a metric because it doesn’t contain numeric values. Rather, it’s a dimension. A dimension is a column that contains categories that we can use to segment our data. In this case, it contains a time component. A dimension column could also contain data like geographical areas, market segments, types of products, or any other category that we can use to segment our data.
Our table of data already contains some metrics. Of course, it’s hard to make sense of a large table of data, so we’ll try visualizing the data using a chart. Charts will help us quickly identify patterns and trends in the data. When visualizing trends over time, a line chart is an excellent starting point. Here’s what we get if we select our raw data in Excel and insert a line chart using Excel’s default settings.
The original question our audience asked is: Are we getting value from our marketing spend? Now, does this chart help our audience see the answer? We can certainly see that we’re getting more leads, but we’re also spending a lot more on marketing. Without doing some arithmetic, it’s impossible to determine if we’re getting better value from our marketing spend. If our audience must perform arithmetic to understand the point of a chart, it’s time to rethink our chart design. Can you think of a better metric we could use to help our audience see if we’re getting better value from our marketing spend?
One option is to calculate the average cost per lead each month. This new metric called “Cost Per Lead” is just a new column in our data set that is derived by dividing the “Total Spend” by the “Total Leads” each month to get the average cost of each lead. Data transformations like this are so simple, yet they can drive a huge improvement in the quality of our data story. As we go through this article, we’ll explore some more advanced transformations, but nothing we do will require anything more than making the right choices around what simple arithmetic to perform.
Let’s visualize our newly created metric.
Now we see an obvious pattern: Our average cost per lead has dropped dramatically and stabilized at around half what we were paying at the start of the year. We’re now poised to create an impactful data story that directly answers our audience’s question about value for money from our marketing spend. Here’s the chart with some commentary and improved formatting:
We’ve produced a far more useful data story with some simple arithmetic and careful consideration of our audience’s needs.
The close cousin of metric selection is dimension selection. When performing metric selection, we create a new metric column derived from the existing metric columns in our raw data. Similarly, when performing dimension selection, we will create a new dimension column derived from the dimension columns available in our raw data.
Let’s examine another example. We have a theme park that wants to understand when peak volumes of guests occur to better plan their staffing, procurement, and maintenance activities. The theme park has captured the total number of attendees entering the park each day for one year. Here are the first few rows of this data set:
The “Date” column is a dimension, and the “Attendance” column is a metric. If we visualize the whole data set spanning one year using a line chart, we get this:
We can certainly see some interesting trends here. There are some large seasonal fluctuations with greater attendance in the northern hemisphere summer months. We can also see some regular spikes, which correspond to weekends. However, our audience can’t determine with certainty that these spikes are driven by weekends just by looking at the chart because the x-axis labels aren’t granular enough to show what days the spikes occur. We’re required to either make an educated guess that weekends are driving the spikes, or we must inspect the raw data.
To improve the interpretability of our data, we will create two new dimensions from our existing date column: A dimension for the month and a dimension for the day of the week. These new dimensions will allow us to provide two different views of our data set that isolate the respective impact of the month and the day of the week on attendance. We can then create two simpler charts like these:
By isolating the month and day of the week as separate dimensions, we can create a visualization that quickly shows the impact of each dimension on the total attendance. Can we further improve our choice of dimensions and metrics to serve the needs of our audience better? Remember: Our audience wants to understand when peak volumes of guests occur to better plan their staffing, procurement, and maintenance activities.
Metric selection can help us once again. We’ve just added up the total number of people who attended each month or day of the week in the charts above. These charts give us a high-level view of when the park is busiest. But our audience is most interested in peak levels of attendance that it needs to plan for.
Instead of adding the total number of attendees per month, which a software like Excel will do by default in a PivotTable, we could focus on the maximum number of attendees observed on any day throughout each month. This metric will better address our audience’s need to understand peak demand levels.
We could take this a step further and create a metric that shows the maximum daily attendance each month, the minimum daily attendance each month, and the average daily attendance each month.
This chart is called a floating bar chart. It can be used to show a range of values from minimum to maximum. We’ve also incorporated a dash to show the average daily attendance per month. This chart is rich in insights that are useful to our audience. We can quickly see the peak daily volumes in each month, enabling us to prepare to service those peak volumes. We can also see that planning will be easier in the northern hemisphere winter, with a much narrower range of attendee volumes. Whereas in the summer, there is a wide range of daily volumes. We’re still going to have some very light days, but we’re also going to have some extremely busy days.
We can add one further dimension to segment our data set and make it more useful. Let’s add a dimension to segment our data set by “Weekend” versus “Weekdays.” Now we’re slicing by two dimensions: The month of the year and whether the day of the week is a “Weekend” or a “Weekday.” For example, in July, there are eight weekend days. We will take the highest of those eight daily values to give us the peak weekend volume in July which is 46,641 attendees.
Here’s our final visualization, capturing peak weekend and weekday volumes by month.
This chart is presented without commentary as it would be used as a tool for management to understand the levels of peak volume they need to prepare for throughout the year. In this example, we’ve used a dumbbell dot plot. A dot plot, with its prominent markers, will emphasize individual values, compared to a line chart which will draw our eyes’ movements across time. Because the key information our audience cares about is the precise peak volumes they need to plan for, a dot plot is an effective way of focusing attention on those values.
As you can see, we’ve come a long way in improving the quality of our data storytelling by refining the metrics and dimensions used in our visualization. Selecting the best metrics and dimensions is an iterative process. We always start with the needs of our audience first. We can then experiment with different visualizations. Each time we try an experiment, we ask ourselves, does this visualization allow my audience to quickly and efficiently glean the information they need? We repeat this process, testing new metrics and dimensions, until our answer is a resounding yes!