## Pie Charts

``You toggled code blocks on! Code blocks will appear in gray below. Click the button again to turn them off.``

# Introduction

It seems as if people are split on pie charts: either you passionately hate them, or you are indifferent. In this article, we are going to explain why pie charts are problematic and, if you fall into the latter category, what you can do when creating pie charts to avoid upsetting those in the former.

# Why are pie charts problematic?

## They use size to convey information

A pie chart uses the size of a portion (slice) of a circle (pie) to display a numerical variable. This factor is not an issue in and of itself, as many chart types use size to convey information, including bubble charts and bar charts; however, while bubble charts and bar charts use diameter and height, respectively, to convey information, pie charts rely on the angle describing a slice---and the human eye is not very good at recognizing differences in angles.

Suppose we took a survey on people's favorite kinds of pie. In the chart below, it is difficult to see how the categories relate to each other; individually, Cherry and Rhubarb seem to comprise a smaller portion of the pie than either Apple or Pumpkin, but it may not be obvious (without looking at the data) which is the smaller slice.

``````#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.5, 0, 0.75, 0))

#Data for pie chart
x = c(18, 15, 13, 25, 29)
labels = c("Key Lime", "Cherry", "Rhubarb", "Pumpkin", "Apple")
cols = c("greenyellow", "red", "mediumvioletred", "darkorange", "cornsilk")

#Build the pie chart
pie(x, labels, radius = 1, col=cols)``````

## They cannot display many categories well

This issue of conveying size via angle is even more pronounced when many categories are shown in a single pie chart. Furthermore, unlike some charts that are used to display several categories at once, such as bar charts, pie charts depend on differences in color to denote category; therefore, a large palette of colors is necessary, and without proper selection of the palette, the results could be either garish or ambiguous.

``````#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.55, 0, 0.8, 0))

#Data for pie chart
x = c(2, 4, 5, 10, 13, 15, 15, 17, 19)
labels = c("Key Lime", "Pecan", "Cherry", "Blueberry", "Rhubarb", "Lemon Meringue", "Blackberry", "Pumpkin", "Apple")
cols = c("greenyellow", "tan4", "red", "darkblue", "mediumvioletred", "yellow", "black", "darkorange", "cornsilk2")

#Build the pie chart
pie(x, labels, radius = 1, col=cols)``````

## They show parts of a whole

Pie charts represent a whole as its components. Therefore, if your dataset is a subset of a larger dataset (and thus does not represent the whole) or if your dataset consists of independent categories (and thus represents multiple wholes), then a pie chart may not be appropriate.

# Pie charts in popular packages

We wouldn't want to assume anyone's opinion on as divisive a topic as the pie chart, but perhaps the disdain for this chart type is best exhibited by the lack of built-in functions for creating them in two very popular data visualization packages: ggplot2 (R) and seaborn (Python). With both packages, a pie chart can be created only through trickery.

## Trickery

It is convenient---perhaps a little too convenient---that a pie chart is no more than a single stacked bar displayed in polar coordinates. The code below builds the pie chart shown above, but using ggplot2.

``````#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.55, 0, 0.8, 0))

#Data for the pie chart
values = c(9, 2, 5, 10, 13, 15, 10, 17, 19)
labels = c("Key \nLime", "Pecan", "Cherry", "Blueberry", "Rhubarb",
"Lemon \nMeringue", "Blackberry", "Pumpkin", "Apple")
cols = c("Key \nLime"="greenyellow", "Pecan"="tan4", "Cherry"="red", "Blueberry"="darkblue",
"Rhubarb"="mediumvioletred", "Lemon \nMeringue"="yellow", "Blackberry"="black",
"Pumpkin"="darkorange", "Apple"="cornsilk2")

data = data.frame(labels, values)

#Build the pie chart
ggplot(data, aes(x="", y=values, fill=labels))+
geom_bar(width = 1, stat = "identity") +
scale_fill_manual(values=cols) +
coord_polar("y", start=0) +  #Use polar coordinates
theme(axis.title=element_blank(),
axis.text=element_blank(),
legend.title=element_blank())``````

# What chart types can be used to replace pie charts?

## Bar charts

Similar to pie charts, bar charts use size to convey information; however, for bar charts, the height of a rectangle varies, and differences between the heights of bars are easier to recognize than the differences between the angles of portions of a circle. Furthermore, bar charts can be configured to show absolute numbers, percentages, or both!

``````#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.5, 1, 0.2, 1))

#Data for bar chart
values = c(9, 2, 5, 10, 13, 15, 10, 17, 19)
labels = c("Key \nLime", "Pecan", "Cherry", "Blueberry", "Rhubarb",
"Lemon \nMeringue", "Blackberry", "Pumpkin", "Apple")

data = data.frame(labels, values)
data = data[order(-values),]

#Build the bar chart
barplot(height=data\$values,
names.arg=data\$labels,
ylim = c(0, 20),
cex.names=0.7)``````

## Waffle Charts

Waffle charts, which are growing in popularity, use number rather than size to visualize a numerical dimension. The resulting graph is similar to a stacked bar or tree map; however, because each square is a unit, compared to alternatives that rely solely on size, it is easier for a person to confirm if a perceived difference between categories is real without relying on text.

``````#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.5, 1, 0.2, 1))

# Create data
pies = c("Pecan"=2, "Cherry"=5, "Key Lime"=9, "Blueberry"=10, "Blackberry"=10,
"Rhubarb"=13, "Lemon Meringue"=15, "Pumpkin"=17, "Apple"=19)

waffle(pies, rows=5, size=1.5,
colors=c("tan4", "red", "greenyellow", "darkblue", "black",
"mediumvioletred", "yellow", "darkorange", "cornsilk2"),
xlab="1 square = 1 vote", legend_pos = "bottom")``````

# But what if I don't like the alternatives?

Even though there are many alternatives (e.g., bar charts, stacked bars, waffle charts, lollipop charts, tree maps), pie charts are a familiar chart type to most people, and depending on the audience, familiarity may be an important factor that affects interpretability. So if you want to stick with pie charts, consider taking the following advice.

## Limit the number of categories via grouping

To avoid visual clutter and to ensure your pie chart is readable, the number of categories should be small. Therefore, it may be useful to group categories that individually comprise a small proportion of the pie into a single category. Note that, when using this approach, it may be helpful to list the items contained in the derived category. Furthermore, it is best to ensure that the new category does not form the majority of the resulting pie.

## Show percentages or absolute numbers (or both) as text

Text can be used to prevent misunderstandings due to ambiguity. By including text information, a person can see if there are differences among the categories. However, if it is necessary to include text, then one can argue that the visualization itself is ineffective (so be prepared to defend your choice of chart type).

``````#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.55, 0, 0.8, 0))

#Data for pie chart
x = c(15, 20, 35, 30)
labels = c("Other (15%)", "Cherry (20%)", "Pumpkin (35%)", "Apple (30%)")
cols = c("black", "red", "darkorange", "cornsilk2")

#Build the pie chart
pie(x, labels, radius = 1, col=cols)``````

# Conclusions

We hope you found our discussion of pie charts informative. While pie charts can be avoided in most cases, they remain a pithy little chart on which many, many people have little to no opinion. However, to avoid a mass uptake of pitchforks and torches, please remember to employ pie charts responsibly and to use caution when including any controversial chart type in your next presentation.

# Required libraries

Click the Show/Hide Code button to view the libraries.

``````library(repr);
library(dplyr);
library(plotly);
library(waffle);
library(ggplot2);
library(RColorBrewer);``````

## Useful Python Snippets

The goal of this blog post is a compilation of little tidbits and code snippets that address common issues when programming for data analysis in Python.

# General Snippets

### Difference between JSON and XML

This page gives a great example of the difference between data in JSON format and XML format. It shows the exact same data in both formats: https://json.org/example.html

### Converting scientific notation into numbers

Converting from scientific notation in a Pandas Dataframe: https://re-thought.com/how-to-suppress-scientific-notation-in-pandas/

### Remove ellipses from pandas dataframe preview:

``````pd.options.display.max_columns = 2000

#If you don't want to make the change permanently for the notebook,
(e.g., to avoid excessive output in other cells), you can also use
pd.option_context:

with pd.option_context('display.max_columns', 2000):
print(df.describe())
#temporarily display all columns
with pd.option_context('display.max_seq_items', None):
print (df.columns)``````

### Isolate date columns:

``````datecols2 = []

for item in prod.columns:
if 'Date' in item:
datecols2.append(item)

datecols2``````

### Add grand total column to a pivot table:

``````test_df = pd.pivot_table(prod, index="Color", columns="Class", values="ListPrice", aggfunc=np.sum)
test_df['Grand Total'] = test_df.sum(axis=1)
test_df``````

### Comparing group by syntax to pivot table syntax:

``````prod.groupby(['Class', 'Style']).count()[['Name']]
pd.pivot_table(prod, index=['Class', 'Style'], values="Name", aggfunc="count")``````

### Use apply for multiple columns in a dataframe:

``avo.apply(lambda row: row.AveragePrice * row['Total Volume'], axis=1)``

### Install packages in Jupyter Notebook

``````# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pytime``````

### Understanding copying objects in python

These two links are excellent at explaining:
https://stackoverflow.com/questions/2612802/how-to-clone-or-copy-a-list
and
https://www.geeksforgeeks.org/copy-python-deep-copy-shallow-copy/

### Reverse Dictionary Function

``````def reverse_dict(lookup_value):
dictionary = {'george' : 16, 'amber' : 19}
for key, value in dictionary.items():
if value == lookup_value:
print(key)

reverse_dict(19)``````

### Removing duplicate rows in a dataframe

https://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html

https://jamesrledoux.com/code/drop_duplicates

### This is an extremely important pandas doc page! Indexing & slicing dataframes

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

### Connect to a mySQL database:

``````import mysql.connector
# Set up your connection to the database
myConnection = mysql.connector.connect( host=

# Read the results of a SQL query into a pandas data frame.
my_table = pd.read_sql('SELECT * FROM table_name, con=myConnection)
``````

### Connect to postgres database:

``````import psycopg2
connection = psycopg2.connect(user =
host = "your-host-here-keep-quotes",
port = "5432",
database = "your-database-here-keep-quotes")
cursor = connection.cursor()
cursor.execute("SELECT * FROM django_session;")
record = cursor.fetchone()
print(record)
``````

### Connect to the Twitter API

``````import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize

# You need to replace all the capital words in brackets with your
# ACTUAL keys. The quotation marks stay but the brackets and
# capital words must go.
api = twitter.Api(consumer_key='[CONSUMER KEY GOES HERE]',
consumer_secret='[CONSUMER SECRET GOES HERE]',

# Get the tweet data since that last tweet
# The user_id is for Boxplot's timeline, replace it with your
# own if you'd like!
user_timeline = api.GetUserTimeline(user_id='959273870023905280')

for i in range(len(user_timeline)):
rowasdf]).reset_index(drop=True)

### Loop through a Series and make sure that each subsequent value is greater than or equal to the one before it. If not, set the value equal to the one before it:

``````previous_value = 0

def previous(current):
global previous_value
if current < previous_value:
return_value = previous_value
#        previous_value = current
else:
return_value = current

previous_value = return_value
return return_value
``````

### Remove white space in a column

``````df = pd.DataFrame({'a':[' app le ']})
print(len(df.a[0]))
df.a = df.a.str.strip()
len(df.a[0])``````

# Customizing Matplotlib Visualizations

### How to customize the range of the x-axis and rotate the tick marks:

``````awesome_table1 = pd.pivot_table(data, index='DEGFIELD3',
columns='REGION2', values='CBSERIAL', aggfunc='count')
#len(list(awesome_table1.index))
awesome_table1.plot(figsize=(18,12));
plt.xticks(range(0,13),list(awesome_table1.index),rotation=-45)``````

https://stackoverflow.com/questions/27671748/how-to-print-y-axis-label-horizontally-in-a-matplotlib-pylab-chart

https://stackoverflow.com/questions/10998621/rotate-axis-text-in-python-matplotlib

### Create reusable settings for a chart:

``````def my_scatterplot(x_txt, y_txt, df, colorcol):
df.plot(kind='scatter', x=x_txt, y=y_txt, c=colorcol, colormap='winter', figsize=(10,4), s=10, alpha=.5)

my_scatterplot('Total Bags', 'AveragePrice', avo, 'type_as_num')``````

### Change the size of all charts in a notebook:

``````# put this at the top of the notebook:
plt.rcParams["figure.figsize"] = [15, 10]``````

### Example of changing colors and marker types in scatterplots:

``````colors = ['b', 'c', 'y', 'm', 'r']

en = plt.scatter(books_data[books_data['language_code']=='en']
['average_rating'], books_data[books_data['language_code']=='en']
['ratings_count'], marker='x', color=colors[0])

spa = plt.scatter(books_data[books_data['language_code']=='spa']
['average_rating'], books_data[books_data['language_code']=='spa']
['ratings_count'], color=colors[2])

fre  = plt.scatter(books_data[books_data['language_code']=='fre']
['average_rating'], books_data[books_data['language_code']=='fre']
['ratings_count'], marker='o', color=colors[1])

# a  = plt.scatter(random(10), random(10), marker='o',
color=colors[2])
# h  = plt.scatter(random(10), random(10), marker='o',
color=colors[3])
# hh = plt.scatter(random(10), random(10), marker='o',
color=colors[4])
# ho = plt.scatter(random(10), random(10), marker='x',
color=colors[4])

plt.legend((en,spa,fre),
('English', 'Spanish', 'French'),
scatterpoints=1,
loc='lower left',
ncol=3,
fontsize=8)

plt.show()``````

### Multiple y axes, and forced axis

``````# This is one data point we're trying to plot
chart1 = sets.groupby('year')['num_parts'].count()
# This is the other data point we're trying to plot
chart2 = sets.groupby('year')['num_parts'].mean()
fig, ax = plt.subplots(sharey='col')

# Create a MatPlotLib figure & subplot
ax2 = ax.twinx() # ax2 shares X axis with the ax Axes object
#  This is what forces scale on the second Y-axis!!
ax2.set_ylim(bottom=0, top=799)

# graph both of our data sets, one bar, one line
ax.bar(chart1.index, chart1, color='dodgerblue')
ax2.plot(chart2.index, chart2, color='red')

# Set the size of the resulting figure
fig.set_size_inches(12,8)
``````

# Other Visualizations

### A great tutorial for mapping:

https://towardsdatascience.com/mapping-geograph-data-in-python-610a963d2d7f

### Set x and y axes for seaborn plots:

``````ax = sns.barplot(x = 'val', y = 'cat',
data = fake,
color = 'black')
ax.set(xlabel='common xlabel', ylabel='common ylabel')
plt.show()``````

### Make a word cloud in the shape of a custom image:

``````# If using Jupyter Notebook, you need to install the
# wordcloud module like this

import sys
!{sys.executable} -m pip install wordcloud

# import the libraries needed

from PIL import Image
import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

# generate the word cloud
comment_words = ''

stopwords = set(STOPWORDS)

stopwords.update(["drink", "now", "wine", "flavor", "flavors"])

for val in prod.description.iloc[0:1000]:

val = str(val)

tokens = val.split(' ')

for i in range(len(tokens)):
tokens[i] = tokens[i].lower()
#print(tokens[i])

for word in tokens:
comment_words = comment_words + ' ' + word

wordcloud = WordCloud(background_color="floralwhite",

max_words=1000,

stopwords=stopwords,

contour_width=3,

contour_color='floralwhite').generate(comment_words)

plt.figure(figsize = (48,48), facecolor = None)

plt.imshow(wordcloud, interpolation="bilinear")

plt.axis('off')

plt.title("Frequent Words from Tasters - Wine Form",fontsize = 40,color='gray')

plt.show()``````

### Make a single box and whisker plot with Matplotlib:

``````fig, axs = plt.subplots(1, 1)
axs.boxplot(prod['points'])
axs.set_title('basic plot')
plt.show()``````

### Multiple box and whisker plots with Matplotlib:

``````# To make side by side box and whisker plots (in this example, get
# points for each country, and then make a list of those lists).
# That is what is passed in to the boxplot function:

u = list(prod[prod['country']=='Italy']['points'])
m = list(prod[prod['country']=='Portugal']['points'])
w = list(prod[prod['country']=='Germany']['points'])
final_list = []
final_list.append(u)
final_list.append(m)
final_list.append(w)

fig7, ax7 = plt.subplots()
ax7.set_title('Multiple Samples with Different sizes')
ax7.boxplot(final_list);``````

### Pie Chart:

``````prod.country_other.value_counts().plot(kind='pie',
autopct='%1.0f%%', colors=['skyblue', 'lavender', 'lightpink',
'lightcyan', 'lemonchiffon', 'mistyrose'])
plt.legend(title = 'Wine Country of Origin', loc='best', bbox_to_anchor=(1, 0, 0.5, 1))
plt.figure(figsize=(360, 250))``````

### Calculating correlation with a scatterplot

https://stackoverflow.com/questions/41635448/how-can-i-draw-scatter-trend-line-on-matplot-python-pandas/41635626

## Installing & Running Jupyter Notebook

### Step 1: Install Anaconda

Go to this download webpage on Anaconda’s site. Choose the correct link for your operating system, and then go through the installation process.

### Step 2: Prepare a folder for notebooks

Choose or create a folder on your computer where you will store all Jupyter notebook files. Make sure you choose a place where it will be easy to find them later.

### Step 3: Start up Jupyter Notebook

• Either on Mac or PC, you should be able to open up Anaconda the way you’d open any other program on your computer. So on a Mac this would be the Applications folder, and on a PC this would be the Start menu. Then, once Anaconda opens, click the “Launch” button underneath Jupyter Notebook.
• On a Mac, you can also open up the terminal and type `jupyter notebook`. This might also work on a PC but there may be a few extra steps, so we recommend going with the option above for PC.
• Jupyter notebook should now open in a browser.
• You should see the folder structure of your computer – navigate to the folder where you stored the files in step 2.