Recently I got hold of some regional spending forecast data. I quickly plotted it using ggplot2, and here’s the first version of it.
The data is from 2014 and the values from 2015 to 2019 are the forecasted values. For now don’t worry about the validity of this data or the lack of margin of error in the forecasted values. Let’s just concentrate on the problems with the visual elements of this chart.
There are two problems here…
- Most of the data is hugging the x-axis, something you see very often when you have highly skewed data.
- Way too many categories (regions) which makes it hard to figure out which line is for which region, even with the legend.
A crude approach to fix problem #1 would be to use a log scale on the y-axis, which, believe me, is not the way to go, especially when you want to present this chart to business execs and not to a scientific community. For problem #2, I tried various color palettes, but none gave enough distinction in the hues to tell all ten regions apart. This is not a problem with any particular palette: when you have more than 4 or 5 hues it is hard to distinguish each one, especially in line or point plots where the ink to plot ratio is not high (as opposed to, say, a bar plot).
Let’s tackle each separately. First we tackle problem #1, that of most data hugging the x-axis. In this case it was a very easy fix, because the data splits very easily into two clusters: a) spending < 10,000, and b) spending > 10,000. I did just that by creating another variable which indicated which group the data belonged to, and used facet_grid to create two charts, but hid the facet strip and text. Here’s the result.
This is much better: the data that was hugging the x-axis now has better visibility. It is worth noting that the chart is in fact two plots, arranged vertically on top of each other, and the y-axis scale for the bottom chart is 1,000 while that of the top chart is 10,000. This just gives an illusion of a single y-axis. Also worth noting is that we could do this because the data was easily separable. I believe that when data is easily separable, as is the case here, this approach is a better alternative to using a log scale. Bob Rudis helped in adding a visual separator where the scale breaks.
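As a rough illustration of the split described above, here is a minimal sketch on made-up numbers. The `toy` data frame and its values are hypothetical and not from the real dataset; only the 10,000 cut-off and the facet_grid call mirror what I did.

```r
library(ggplot2)

# Hypothetical toy data: two small regions and one large one.
toy <- data.frame(
  Region   = rep(c("A", "B", "C"), each = 3),
  Year     = rep(c("2017", "2018", "2019"), times = 3),
  Spending = c(500, 650, 800, 1200, 1500, 1900, 40000, 52000, 61000)
)

# Flag every row of a region by that region's largest value,
# so a whole series always lands in a single facet.
toy$Budget <- ifelse(ave(toy$Spending, toy$Region, FUN = max) < 10000,
                     "Low", "High")

p <- ggplot(toy, aes(x = Year, y = Spending, group = Region, color = Region)) +
  geom_line() +
  # One panel per Budget group, each with its own y-axis range.
  facet_grid(Budget ~ ., scales = "free_y") +
  # Hide the facet strip text so the two panels read as one chart.
  theme(strip.text = element_blank())

print(p)
```

With scales = "free_y" each panel picks its own y range, which is what lifts the small series off the x-axis.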
WARNING: Not many people are fans of scale breaking, and I too would advise caution when using such an approach. Perhaps a better alternative is to simply plot the two charts separately.
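For completeness, plotting the two groups as separate charts could look roughly like the sketch below. The toy data and the `plot_group` helper are hypothetical names introduced only for this illustration.

```r
library(ggplot2)

# Hypothetical toy data, already flagged with a Budget group.
toy <- data.frame(
  Region   = rep(c("A", "B", "C"), each = 3),
  Year     = rep(c("2017", "2018", "2019"), times = 3),
  Spending = c(500, 650, 800, 1200, 1500, 1900, 40000, 52000, 61000),
  Budget   = rep(c("Low", "Low", "High"), each = 3)
)

high <- subset(toy, Budget == "High")
low  <- subset(toy, Budget == "Low")

# Hypothetical helper: one line chart per group, each with its own title.
plot_group <- function(df, title) {
  ggplot(df, aes(x = Year, y = Spending, group = Region, color = Region)) +
    geom_line() +
    ggtitle(title)
}

p_high <- plot_group(high, "Regions above 10,000")
p_low  <- plot_group(low, "Regions below 10,000")

print(p_high)
print(p_low)
```

Each chart then carries its own honest y-axis, at the cost of the reader having to compare two figures.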
Now on to problem #2. What we really want here is a better way to indicate which line belongs to which region. If we can directly label the lines instead of using a legend on the side, then we have our solution for problem #2. After some googling around I found this Twitter conversation and a subsequent blog post by Bob Rudis. Hadley suggested using the directlabels package, and Bob used ggplot2::annotation_custom. I first went with the directlabels package but quickly realized that none of the options were working out for me. The labels were either overlapping each other or overlapping the data, neither of which was acceptable. So I explored Bob’s option, and there too I gave up on ggplot2::annotation_custom for the same reason: overlapping labels.
So I came up with an alternative approach, which involved using the ggrepel package to make sure the labels didn’t overlap. Here’s the final result.
This is even better: not only do we get labels right next to each line, but we also get non-overlapping labels. The only thing I wish for is an option in the ggrepel package that would allow repelling in just one direction, i.e. in this case it would be even nicer if I could left align all the labels but still get vertical separation. Other than that I am happy with the result. The key to obtaining this chart was using ggrepel and the ggplot2::expand_limits() function to make sure there was enough room along the x-axis for the labels to not get chopped. By default ggplot2 will leave just a small area around each scale, so I had to use the expand_limits function to make room for the labels.
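To see those two ingredients in isolation, here is a minimal sketch on hypothetical toy data (the values and the `last_year` name are made up for illustration). Year is stored as text, so the x-axis is discrete and its three levels sit at positions 1 to 3.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical toy data: two series that end close together.
toy <- data.frame(
  Region   = rep(c("A", "B"), each = 3),
  Year     = rep(c("2017", "2018", "2019"), times = 2),
  Spending = c(500, 650, 800, 520, 640, 790)
)

# One label per series, placed at its last data point.
last_year <- subset(toy, Year == "2019")

p <- ggplot(toy, aes(x = Year, y = Spending, group = Region, color = Region)) +
  geom_line() +
  # Labels are pushed apart so they do not overlap each other.
  geom_label_repel(data = last_year, aes(label = Region), nudge_x = 0.5) +
  # Three years occupy positions 1..3; stretch the axis to 5 for label room.
  expand_limits(x = 5) +
  theme(legend.position = "none")

print(p)
```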
Here is the code for the final plot.
    library(ggplot2)
    library(readr)
    library(tidyr)
    library(dplyr)
    library(ggthemes)
    library(ggrepel)

    region.forecast <- read_csv('./region-spending-forecast.csv')

    # Add a column so we can split the plot into two plots.
    region.forecast$Budget <- ifelse(region.forecast$`2019` < 10000, 'Low', 'High')

    # Convert wide format data to long format as required by ggplot2
    df.tidy <- region.forecast %>%
      gather(Year, Spending, -Region, -Budget)

    g <- ggplot(df.tidy, aes(x = Year, y = Spending, group = Region, color = Region)) +
      # Solid line for actual data
      geom_line(data = df.tidy %>% filter(Year < 2015),
                linetype = 'solid', size = 0.75) +
      # Dotted line for forecast data
      geom_line(data = df.tidy %>% filter(Year >= 2014),
                linetype = 'dotted', size = 0.75) +
      # Mark each data point
      geom_point(shape = 8, size = 0.75) +
      # Add labels right after the last value of each series.
      geom_label_repel(data = df.tidy %>% filter(Year == 2019),
                       aes(label = Region, fill = Region),
                       nudge_x = 0.5, size = 3, color = 'white',
                       force = 1.5, segment.color = '#bbbbbb') +
      # Split the plot into two plots on top of each other
      facet_grid(Budget ~ ., scales = 'free_y') +
      scale_y_continuous(labels = scales::comma) +   # Add commas to y-axis labels
      scale_x_discrete() +
      theme_minimal() +
      scale_color_tableau() + scale_fill_tableau() + # Tableau colors
      theme(strip.text = element_blank(),            # Hide facet text
            legend.position = 'none',                # Hide legend
            panel.grid.minor = element_blank(),
            panel.grid.major.x = element_blank()) +
      expand_limits(x = 9) # So that we have enough room along x-axis for labels.

    # This is to insert a ↑ as a scale separation indicator between the two plots.
    library(grid)
    library(gtable)
    library(gridExtra)

    gb <- ggplot_build(g)
    gt <- ggplot_gtable(gb)
    gt <- gtable_add_grob(gt, textGrob(label = "↑", gp = gpar(fontsize = 30)),
                          5, 3, clip = "on", name = "separator")
    # Give the separator's row a fixed height. The index was lost from the
    # listing; 5 is assumed here because it is the row passed to gtable_add_grob.
    gt$heights[5] <- unit(30, "pt")
    grid.arrange(gt)
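The listing assumes the CSV is in wide format, with a Region column and one column per year. Here is a small sketch of that reshape step on made-up values. It uses tidyr's pivot_longer, the newer counterpart of gather, which is a swap on my part and not what the listing above calls; the `wide` and `long` names are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one row per region, one column per year.
wide <- data.frame(
  Region = c("A", "B"),
  `2018` = c(650, 52000),
  `2019` = c(800, 61000),
  check.names = FALSE  # keep the year column names as-is
)
wide$Budget <- ifelse(wide$`2019` < 10000, "Low", "High")

# Long format: one row per region and year, ready for ggplot2.
long <- pivot_longer(wide,
                     cols = -c(Region, Budget),
                     names_to = "Year",
                     values_to = "Spending")
print(long)
```

Note that Year comes out as text, which is why the x-axis in the plot is discrete.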
There are some noteworthy thoughts here.
- I could use facet_grid because the data was easily splittable; otherwise I would have had to come up with some other clever option for lifting the series that were hugging the x-axis.
- Not everyone is a fan of a split scale in an axis, so tread carefully.
- ggrepel for now does not allow repelling along a single axis. This makes the label positions less than ideal. The ideal solution is to have the labels vertically aligned with enough separation along the y-axis. But fear not, there is already a feature request for it.
- I tried various color schemes, and the Tableau colors from the ggthemes package gave the best hues.
- You have to specify the scaled data values for expand_limits. In my case I had seven data points, one for each year from 2013 to 2019, so I added room for two more using expand_limits(x=9) to accommodate the labels.
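On the single-axis repelling point: as far as I know, later ggrepel releases added a direction argument to geom_label_repel and geom_text_repel, so the sketch below assumes a ggrepel version that has it. The toy data is hypothetical.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical toy data: two series that end close together.
toy <- data.frame(
  Region   = rep(c("A", "B"), each = 3),
  Year     = rep(c("2017", "2018", "2019"), times = 2),
  Spending = c(500, 650, 800, 520, 640, 790)
)
last_year <- subset(toy, Year == "2019")

p <- ggplot(toy, aes(x = Year, y = Spending, group = Region, color = Region)) +
  geom_line() +
  # direction = "y" (assumes a ggrepel version with this argument) repels
  # labels vertically only, so they share the x position set by nudge_x.
  geom_label_repel(data = last_year, aes(label = Region),
                   direction = "y", nudge_x = 0.5) +
  expand_limits(x = 5) +
  theme(legend.position = "none")

print(p)
```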