In this post we’ll cover how to create box plots in Excel. Along with scatter plots, box plots are a standard in statistical analysis – especially during the exploratory phase, when you’re just familiarizing yourself with the data. That’s because they’re great at showing the spread of values – what’s your maximum, minimum, and important in-between values.
The following tutorial was written for Excel 2013 and later, but can also be done in earlier versions. Most steps should be similar enough for you to figure out, but if you have any trouble, feel free to let me know in the comments section below!
Setting Up the Data
Let’s say we have Test and Control groups for which we’ve taken some measurements. We now want a graphical representation of the two group so we can make some visual comparison. The data is stored in Boxplot_Example.xlsx, which you can download from my GitHub page, if you’d like to follow along.
Calculate the Min/Max and Quartiles
Calculating the quartiles for each group separately is a great way to summarize their spread. The easiest formula to do this in Excel is the
=QUARTILE.INC() function. This function will calculate whatever quartile you specify when you give it some data. In case you were wondering, the .INC part means that we are INCLUDING the minimum and maximum values.
Here’s a summary of the function:
|array||Range of cells for which to calculate a quartile||Any range of cells containing numerical values|
|quart||The quartile to calculate||0 – Minimum
1 – First quartile (25th percentile)
2 – Median (50th percentile)
3 – Third quartile (75th percentile)
4 – Maximum
The formulas should end up looking like the following:
Calculate Graph Values
We’re not quite ready for a chart yet. First, we need a version of the data that works well with a stacked bar chart, since that’s really all a boxplot is.
Right now, if we were to throw the true quartile values straight into a graph, the sum of the values would end up being more than the true maximum, making the bar chart much taller than it should be. So in order to make this work, we should calculate the cumulative versions of each quartile. That way, as you stack one value on top of the previous, their cumulative sum equals the true value you want. In other words:
|Cumulative Value||True Value|
|Min + Q1||=||Q1|
|Min + Q1 + Q2||=||Q2|
|Min + Q1 + Q2 + Q3||=||Q3|
|Min + Q1 + Q2 + Q3 + Max||=||Max|
Doing a little bit of algebra, we get the following calculations for each cumulative value.
|Cumulative Value||True Value|
|Q1||=||Q1 – Min|
|Q2||=||Q2 – Q1|
|Q3||=||Q3 – Q2|
|Max||=||Max – Q3|
We can put those calculations right below the true values, like in the picture below.
Create the Graph
This is where things start coming together. Highlight cells D12:F17, and insert a stacked bar chart. At first, the chart will look like this:
But we can switch the axes by right-clicking on the chart, selecting ‘Select Data…’, and clicking the ‘Switch Row/Column’ button.
So far things still aren’t looking much like a boxplot, but we’re almost there! Right-click the bottom blue section of the bars, and select ‘Format Data Series…’. Under ‘Fill’ select ‘Solid Fill’ and increase the transparency to 100%.
Working our way up, left-click the next (orange) section. Click on the plus sign near the top right of the chart. Click on the arrow next to ‘Error Bars’, and select ‘More Options…’. In the menu, select ‘Minus’, ‘No Cap’, and set the percent to 100%. This gives us an error bar that spans the entire length (100%) of this orange section of the graph.
Now that we have the error bar, select the orange section again, and set it’s transparency to 100% as well. And while we’re here, why don’t we change the color of the error bar to a nice shade of blue.
The next gray and yellow sections represent the interquartile range. The bottom of the gray is Q1, the top of the yellow is Q3, and where they meet is the median. We’ll format both of these regions the same way by making them light blue, with a darker blue border.
Now just add the the top whisker the same way we did the bottom, by adding another error bar.
And look what we have here – a boxplot! Not bad, but it’s still missing something: the average. So let’s calculate that for each group, and store it in E18:F18. Now, right-click on the chart, select “Select Data…”, and click “Add”. For the series name, enter “Avg”, and for the series values, highlight E18:F18. The chart will look screwed up at first.
To fix it, right-click on the green bars and select “Change Series Chart Type…” In the resulting window, scroll down to Avg, and select “Line”.
Format the line to include some nice blue data markers, and delete the line itself. Then just clean it up with a little extra formatting and that’s how you make data look good!
P.S. If all of this formatting was too tedious for you (I don’t blame you), I’ll cover a great time-saving tip in my next post.