Grouping in PivotTables

Grouping in PivotTables is a way of combining data to perform analyses without having to use functions. You can group numeric columns to turn them into categories, you can group date columns by date ranges to get even intervals, and you can group text columns to put together similar values. We’ll go through all three scenarios in this blog post, and also talk a bit at the end about making histograms from grouped PivotTables.

For the following examples, we’ll be using a human resources dataset from Kaggle, download it here if you want to follow along. As usual, first I’ll turn my dataset into a Table (see why in this blog post).

Grouping Categorical Columns

Let’s say you want to analyze the reasons for terminations at your company. Make a new PivotTable, and put the “Reason for Term” column into the Rows box. Put “Employee Name” into the Values box to do a simple count for eaach item in the “Reason for Term” column. In this scenario, you think that this is too many options, and want to group similar items together. When you want to group text, it’s a bit tedious. We have to select the items we want to group individually for each group we want to make. Let’s say we want to group “attendance”, “gross misconduct”, “no-call, no-show”, and “performance” together in a group called “Employee Misconduct”. Select these items (do this by holding down Command on a Mac or Control on a PC and clicking each item individually), then right-click on any item and select “Group…”:

Now, let’s make a group from “Another position”, “career change”, “hours”, “more money” and “unhappy” and we’ll call that group “Employee Unhappy”. We’ll follow the same process as before:

Finally, let’s group everything else except “N/A – still employed” together into a group called “Other”, and then minimize the groups using the small minus signs to the left of the group names:

Notice it aggregates the counts for us when we minimize the groups. So we can interpret this as 63 employees left the company because they were unhappy for some reason or another, 15 were terminated for misconduct, 35 left for other reasons, and 188 are still employed. Your final grouped PivotTable should look like this:

text grouped pivottable

Grouping Dates

Grouping date/time columns is much easier, we don’t have to manually highlight the groups. Make a new PivotTable (I’m going to copy and paste the one we just did and modify it). Remove everything from the rows box and put “Date of Hire” into the rows box. Depending on your version of Excel, Excel might automatically group this for you. That’s what mine did, you can see from the rows box that it automatically grouped the hire dates by year AND quarter (and actually months too as we’ll see in a second, but that’s not as obvious from the rows box):

We can change this grouping by right-clicking anywhere in the first column of the PivotTable. A window now appears! You can see that on my version, it automatically chose to group by Years, Quarters, AND Months because those three options are highlighted. We only want to group by Years, so select only Years and click OK. If yours didn’t automatically group like mine did, you should still be able to follow this same process.

If you want to select multiple items in a list like the one in the dates grouping window, hold down Command on a Mac and click what you want (it’s Control on a PC).

Grouping Numeric Columns

Grouping numeric columns is also much easier, we don’t have to manually highlight the groups as long as we are choosing groups with even intervals. Let’s say we’re interested in grouping our employees in the dataset by pay rate. We want to create a “High” group, a “Medium” group and a “Low” group depending on how they are paid. There are several ways to do this in Excel, not just using grouping, but grouping is usually one of the easiest ways to accomplish this. Make another PivotTable, and put the “Pay Rate” column into the Rows box, and the “Employee Name” column into the Values box to count how many employees have those pay rates:

non-grouped pay pivottable

Normally we wouldn’t put a numeric column into the Rows box of a PivotTable, but we’re going to be turning it into a categorical column with the groups. Right click anywhere in the first column of the PivotTable, and choose “Group…”. You should see this window:

Numeric pivottable group window

This is telling you that the minimum value is 14, the maximum value is 80, and Excel is suggesting we group by every $10. You can change the minimum and maximum if you want, but we’re going to leave them in this situation. We are going to change the “By” box – instead of grouping by $10, let’s make our groups by $22 so we have three groups that we can then re-name to Low, Medium and High. Put 22 into the “By” box and click OK.

Numeric pivottable group window

And the result should look like this:

Numeric pivottable group window

See how the first group is 14-36? 36-14 = 22, so the span is $22. Same for the other two groups. We can re-name these groups by simply clicking on the group title, and using the formula bar to type the name we want:

So what did this accomplish? We can now get a quick count of how many employees fall into each group. Most employees are in the low range (214 out of a total of 301), 74 receive a “medium” rate and only 13 employees receive a high pay rate. Grouping in PivotTables is a quick way to create groups or “bins” of numeric data in Excel without using formulas.

Different Sized Groups In The Same Workbook

Unfortunately, we can’t just change the grouping on the second one. If you do, it will also change the grouping on the first one:

Don’t actually do what’s in this next video, it’s just to show you that it doesn’t work.

As far as I understand, this is because Excel keeps a PivotTable cache. The easiest way I’ve encountered to clear the cache is to copy a grouped PivotTable to an entirely new workbook, ungroup it, re-group it with the new groups, then paste it back in the original workbook. The whole process looks like this:

Make a Histogram

Notice earlier in the post, I said grouping is a quick way to make bins in Excel – bins are what you need for a histogram! There are now many ways to make a histogram in Excel. One is not necessarily better than another, but if you want to customize your bins or make several comparable histograms, grouping will most likely be the fastest way to accomplish this. Let’s see it in action, using the PivotTable we just made with groups of $10. Add the “Sex” column to the Columns box:

Create Pivot Window Without table

We want to make two histograms that can be compared, one for males and one for females. On the old version of Excel, you could probably highlight the first two columns of the PivotTable to get just the females, but on the new version, Excel will pick up the entire PivotTable as part of your chart. So, we need to use references to copy the PivotTable so that it is no longer a PivotTable, but it’s still connected to the data source. For more information about copying using references, see my Tables and Connecting Data Structures post. But, for now, all you need to do is refer to the top-left cell of your PivotTable (in my case, A42, but yours might be different) in another cell that is a column or two away from your PivotTable. Then, drag that reference down and to the right:

Now, let’s highlight the bins column and the female column, and go to Insert >> Column Chart. Then do the same for the men by highlighting the bins column and the male column. (Use Command on a Mac to highlight columns that are not next to each other. It’s Control on a PC). Remove the gaps (characteristic of histograms) and make sure the axes match since we’re comparing them, and you’re done! The whole thing looks like this:

Takeaways

Congrats! You made it to the end and now can group data in PivotTables! It’s pretty handy, right? You can skip all the nested IF statements and approximate-match VLOOKUPs now, the grouping will do that for you.

Make a Bubble Plot in Excel

Most people don’t know that bubble plots even exist in Excel. In this blog post, we’ll walk through how to take advantage of these very effective charts! They are great for comparing three quantitative variables at once.

Getting the Right Data

A standard plot is used for comparing three quantitative variables in one chart. Think of a quantitative variable as a number column in Excel. Remember, a good way to choose which type of chart to use when you are performing data analysis is to consider the types of data you are working with. Check out Chart Cheat Sheet blog post for a comprehensive guide to choosing the right visualization.

Let’s take an example to understand this better. Say you work for a nutrition nonprofit and are interested in learning about nutritional information for breakfast cereals. We have this dataset from Kaggle that I modified a tiny bit (I changed values in about 4 or 5 cells) so that the bubble plot example would be easier. Download the .xlsx file here if you’d like to follow along. Here’s a preview of the dataset:

namemfrtypecaloriesproteinfatsodiumfibercarbosugarspotassvitaminsshelfweightcupsrating
100% BranNC7041130105628025310.3368.402973
100% Natural BranQC1203515288135031133.983679
All-BranKC704126097532025310.3359.425505
All-Bran with Extra FiberKC5040140148033025310.593.704912
Almond DelightRC110222001148-125310.7534.384843
Apple Cinnamon CheeriosGC110221801.510.5107025110.7529.509541
Apple JacksKC1102012511114302521133.174094
Basic 4GC1303221021881002531.330.7537.038562
Bran ChexRC9021200415612525110.6749.120253
Bran FlakesPC9030210513519025310.6753.313813
Cap’n’CrunchQC12012220012123525210.7518.042851
CheeriosGC11062290217110525111.2550.764999
Cinnamon Toast CrunchGC1201321001394525210.7519.823573
ClustersGC11032140213710525310.540.400208
Cocoa PuffsGC1101118001213552521122.736446

We want to compare the sugar, fat, and caloric content of the cereal brands all at once. How can we squeeze all of that information into one plot effectively? A bubble chart!

Choose your Data

You need to decide on a few things when making a bubble plot:

X Axis

Any quantitative variable will do, but you can get more out of your visualization by picking a variable that you think is influencing the y variable. In this case, I hypothesized that the higher the average number of calories, the higher the average fat content would be for each brand of cereal. So, I made average number of calories my x axis.

Y Axis

Same as the x axis, any quantitative variable will do, but you’ll get more out of your visualization by picking a variable that you think is being influenced by the variable you chose for the x axis. So, we’re going with average fat content for our y axis.

Size of Bubble

The size of the bubble can also be any quantitative variable, and it should be one that will be relevant when compared to the x and y variables. However, generally, it’s a good idea to pick a variable that has a lot of variation if you can so that your bubbles aren’t all the same size and are actually showing something interesting. We’re going to go with average sugar content for our bubble size.

Optional: Bubble Color

You can get a FOURTH variable into a bubble plot believe it or not! If you want to add a categorical variable (think of that as a text column in Excel) to the three quantitative columns, you can do it by making the color of the bubbles represent values from the categorical variable. For this first example, we’re going to have the bubble color represent a unique manufacturer of cereals.

Set Up the Data in Excel

Let’s start by making a Table, then a PivotTable. You don’t have to make a PivotTable in order to make a Bubble Plot (in fact, later in this post we’ll make a Bubble Plot without making a PivotTable). But since I want to aggregate the information by manufacturer, a PivotTable makes sense here. If you aren’t familiar with PivotTables, you should still be able to follow these instructions to produce one. If you want to understand PivotTables better, check out my blog post “PivotTable Basics”.

Step 1:

Click on any non-blank cell in the dataset. Go to Insert >> Table. Tables make things easier to work with (a blog post about tables is coming next!). Then either leave the Table highlighted, or just click on any non-blank cell again. Now go to Insert >> PivotTable. Check out the GIF if you are lost:

Step 2:

Add the variables to your PivotTable like shown in the next GIF below. Make sure your variables in the Values box are in the exact same order that mine are. For almost every chart, Excel wants the data in a particular order. You don’t have to follow this order, but it will make your life easier if you do. For bubble plots, it wants x axis first, then y axis, then size of bubble. Also make sure they are averages. You’ll notice that the sums didn’t make sense (that represents the total fat, sugar, and calories in all cereals produced by each manufacturer which doesn’t make logical sense for what we’re seeking).

Step 3:

Most versions of Excel won’t let you make a bubble plot directly from a PivotTable. Some will, but most won’t. If yours won’t let you, simply copy the PivotTable and paste it as values, or even better, copy it as references. That way if something changes in the PivotTable later, your references will also change, and your chart will automatically update too. Again, check out my blog post on Tables if you’re interested in learning more about efficient workbook layout in Excel.

Make a Bubble Plot!

We’re finally ready to make our Bubble Plot!! Highlight all the numbers in the copied data (NUMBERS ONLY, no text or titles) and go to Insert >> Scatterplot and choose Bubble. If you’re on an older version of Excel, your scatterplot button might be in a different place.

Let’s clean this up a bit. For bubble and scatter plots, you can modify the axes to zoom in on the points. So, let’s set the minimum value for the x-axis to 85. Let’s also label the axes:

This looks good, and we can already glean some information from this chart. The bubbles aren’t really in a line, but there is a very small positive trend. The bubble size doesn’t seem to get larger in any direction in this case, but that’s something you could look for too. However, we don’t know which one is which. We can add color and data labels to help distinguish the bubbles.

Since we only have one value for each cereal brand in this example (that is, one row for each cereal brand and therefore just one bubble per color) we can right-click on any bubble, choose “Format Data Series” and get to the spill-paint icon. From there, choose Fill and then check off “Vary Colors by Point”:

final bubble plot

Finally, to add data labels, right-click on any bubble again, and choose “Add Data Labels”. If you’re on a PC, you can right-click on any data label, choose “Format Data Labels” and then check off “Value from Cells” in the Format Data Labels panel that appears to the right. Highlight the names of the cereal manufacturers in the copied cells and hit OK. Then un-check Y value and you should just see the names of the brands remaining next to the bubbles.

Unfortunately, on a Mac, this is much more tedious because there is no “Value from Cells” option. On a Mac you need still right-click on the bubble and add the Data Labels, but from here you have to hand-type the names of the cereal manufacturers in each of the label text boxes. Your final product should look like this:

final bubble plot

What if I want multiple bubbles that are the same color?

You can do this, and in this example we’ll make a bubble plot directly from the raw data. Using our whole dataset would produce a pretty messy plot with too many bubbles. So, let’s say you’re interested in only this subset of cereals:

mfrtypecaloriesfatsugars
KC160213
KC7015
KC5000.1
QC12058
QC5000.1
RC150311

Now we have multiple rows that are the same manufacturer (3 K’s, 2 Q’s, and 1 R). This gets a bit more tedious, but labeling the colors is much easier.

Step 1: Get The First Series In

As usual, we’ll make the data into a table first. For the first series in any multi-color chart, you can highlight all of the numbers associated with that series, and the insert the chart. So, we’re going to highlight all of the numbers associated with K, and insert a bubble plot:

To get the other manufacturers in as separate colors, we need to right-click on the chart and choose Select Data. You should see a windows like this:

final bubble plot

It will look different if you are using a PC, but the four boxes that are important should be the same: Name, X values, Y values, and Sizes (or Size of Bubble). Since we highlighted the data for K, it autopopulated these fields for us. Let’s name this series K before we move on to the others. That way, when we add a Legend to our chart later, it will automatically show the names of the manufacturers. In the name box, either type K or click on one of the cells with K in it inside the dataset.

Then, we’ll add the Q series. On a Mac, hit the + sign. On a PC, you’ll want to choose the “Add” button. We’ll do Q next. Type Q or click on one of the cells that has Q in it for the Name. For the X values, we want to highlight all calorie cells associated with the Q manufacturer. For the Y values, we want to highlight all of the fat values associated with the Q manufacturer. And, for the size of the bubble, we want to highlight all values in the sugar column that are associated with the Q manufacturer. Putting it all together looks like this:

Finally, let’s add the R manufacturer’s series. There’s only one row for R, so it would look like this:

You can change the axes and add titles like we did in the first example. I chose my minimum x-axis value to be 45, and my max to be 170. Don’t forget to add a legend too, so we can tell which color is which. On a Mac, start by clicking on the chart. Then the “Chart Design” menu should appear in the ribbon at the top of the screen. Click the button all the way to the left, “Add Chart Element” (this is the same button used to add the titles above, see the GIF if you are lost). Then choose Legend >> Right. The final result looks like this:

final bubble plot

If you want the legend to show the name of the manufacturer instead of the letter abbreviation, you have to change this in the original dataset like so:

Technically, I didn’t have to change all of them, only the ones I referred to in the “Name” box when we were in the “Select Data” dialog. But, I couldn’t remember which I chose, and there are so few in this data subset that I changed all of them 🙂

Notice it looks like there are only 5 bubbles – there are actually 6, but two of them have the exact same x and y values, so they are overlapping. It’s the ones that have 0 fat and 50 calories. Be careful of situations like this – as you can see Excel seems to have randomly chosen to put the orange bubble on top, so someone who is not familiar with the data might not realize that there is a blue bubble underneath it. Also take note that this is no longer averages. We didn’t do a PivotTable to aggregate the data before making this chart, so it’s total calories, fat, and sugars per cereal.

Takeaways

Aren’t bubble plots cool? Admittedly, people who are new to analytics may have a bit of trouble fully understanding what is going on the first time they look at a bubble plot. However, if they are created properly, they are an effective chart that is not misleading. Sometimes it’s tempting to squeeze a lot of variables into one visualization in other ways that do become misleading. But here, we were able to compare four variables at once (manufacturer, calories, fat, and sugar) in an effective and accurate way.