Setting up PostgreSQL & pgAdmin

What’s the difference between PostgreSQL and pgAdmin?

PostgreSQL is a free-to-use relational database management system (DBMS); pgAdmin is a graphical user interface (GUI) tool that lets users interact with their PostgreSQL database(s). In other words, PostgreSQL is the actual database system, and pgAdmin is a tool that lets users use that system.

How do I know if I already have PostgreSQL installed?

If you’re a Windows user, verify that PostgreSQL is installed by using SQL Shell. In the search bar in the bottom left-hand corner of your desktop:

type in SQL Shell and this will enable you to open SQL Shell. Hit enter four times, and it will automatically apply the credentials you set when you installed PostgreSQL. Finally, type in SELECT version(); and hit enter, and if it displays the PostgreSQL version you’re using, that means that you have PostgreSQL installed:

If you’re using a Mac, open the terminal. Once in the terminal, type the command “postgres –version”, which will show you the version of PostgreSQL you’ve installed on your machine, or it’ll give you an error if you haven’t installed PostgreSQL.

How do I install PostgreSQL?

Visit the PostgreSQL download page and select your operating system from the options it gives you. Then, click on Download the installer:

On the page it brings you to, select the most up-to-date version of PostgreSQL available for your OS. The rest of the installation should be fairly convenient. In fact, it will even install pgAdmin alongside PostgreSQL by default:

If you want to install pgAdmin as well, make sure to check it off when you get to this screen:

How do I install pgAdmin?

If you didn’t choose to install pgAdmin alongside PostgreSQL, you’ll have to install pgAdmin separately. Fortunately, installing pgAdmin is also very straightforward. First, navigate to Then, select your OS from the list of choices, and that’ll bring you to the download screen. Select the most recently updated version. Finally, click the file with the .exe extension:

From there, the installation is very straightforward.

How do I launch pgAdmin once I have it installed?

For Windows users, go to the search bar in the bottom left-hand side of your desktop, and click on Apps on the top ribbon of the window that appears. Then type in pgAdmin, and it should come right up. Hit enter to launch the pgAdmin application:

For Mac users, pgAdmin is accessible in your Applications folder:

PivotTables in Excel

What is a PivotTable?

A Pivot Table is an analytics tool that can quickly answer key business questions. They are excellent at extracting insights from a vast dataset quickly. PivotTables are one of the most efficient and effective ways to evaluate large quantities of data in Excel. By “pivoting” or aggregating a large data table into a condensed, visually appealing format, PivotTables reveal key information that would not be visible by looking at the raw data.


The Lemonade Stand is a timeless example of operating a successful business-illuminating the importance of intelligent marketing, pricing, and financial decisions. It doesn’t matter if you are the CEO or just started yesterday; we’re going to highlight what a Pivot Table can do for your business.

Fortunately, the lemonade data above is formatted correctly. In this case, select your preference for either Excel or Google Sheets, and follow the video to create a Pivot Table.

The 4 Areas of a Pivot Table:

You should see an empty Pivot Table. Your business question will determine which fields (Employee, Shift, Day of the Week, Sales, Weather, Temperature) you want to include. To help decide, we also need to understand the four areas of a Pivot Table.

The 4 areas are Columns, Rows, Values, and Filters:

Columns – This will create a list of all of the unique values in the field you choose, going horizontally across the page.

Rows – This will create a list of all of the unique values in the field you choose, going vertically down the page.

Values – The field that you want to measure in your Pivot Table, aggregated based on what is in the Rows and Columns boxes. For example, sum of Sales, count of Employees, average of Temperature. Typically a numeric field, but not always.

Filters – You can filter the entire PivotTable using whatever fields you put in here. For example, a filter on Weather allows you to display sales of only the sunny days.


  1. Let’s say our lemonade venture wants to see our sum of Sales across the Day of the Week for each Employee. Select your preference for either Excel or Google Sheets, and follow the video to create this Pivot Table.

  2. Furthermore, let’s say our lemonade venture wants to filter these Sales for when the Temperature is over 70 degrees. The filter area allows us to do this. For a bonus, we will add some conditional formatting.

A Pivot Table is a great quick visual tool to answer these types of questions. These types of questions fall in the category of observational such as head counts or total sales. From the example, we observe Paige had more sales than Sam. But does that mean Paige is a better employee than Sam? Now on questions like these, we have to be careful. Paige worked a Sunny Friday shift, which contributed to the majority of her sales. A Pivot Table is not always a great tool to prove causation and answer these types of questions. To answer this, we would employ a suite of statistical algorithms.

Data Formatting for a Pivot Table

If the tutorial above was a walk in the park, but you still haven’t created a pivot table, the problem is most likely the formatting of your data. This section will discuss data formatting and data cleaning issues that may prevent you from creating a Pivot Table.

Multiple Data Sources:

Sometimes it is the case that data is collected separately in different sheets or workbooks. Referring to the example above, it could be the case that the data was collected independently for each employee. Following the general rule that all values of the same type need to be in one column would solve this issue. A little copy and paste would accomplish this.

Data Structure:

Separately, there is the issue that the weather data (Sunny, Cloudy, Rainy) was collected such that each type of Weather is a column. This column structure is also known as unstacked data or wide data, which can be preferred when modeling. However, for visualizations such as Pivot Tables, it is best to move these (Sunny, Cloudy, Rainy) under one column, so it is stacked data or long data.

Dirty Data:

So far, we have assumed that each cell’s data has been correctly filled out. In other words, the focus has been on the overall structure of the data or the columns. However, when working on real-world problems, it is most often the case that you have some dirty data. For example, the dates are spelled differently. For this example, it is quite trivial to fix this issue, but when your spreadsheet contains thousands or millions of rows, data cleaning can be tremendously time-consuming. Additionally, we have not covered the importance of number of formats in Excel and Google Sheets. In the video examples above, the Days of Weeks are ordered correctly. Depending on the spreadsheet, sometimes this happens automatically, but sometimes the number format of your data can cause Pivot Table display issues.

Final Remarks

Hopefully, you found this post useful. If you have questions on pivot tables or drawing conclusions based on the observations from a pivot table you have already created, Boxplot Analytics is well versed in data visualizations and statistics. Furthermore, if you are spending valuable time trying to structure, clean, or format your data, please don’t hesitate to contact us. Data cleaning is one of the many services BoxPlot Analytics can provide for your business.

Are Decision-Makers Disempowered by the Modern Data Stack?

Future Data 2020

Many winter moons ago, I (virtually) attended Future Data 2020, a conference about the next generation of data systems. During the conference, I watched an interesting talk given by Tristan Handy, founder and CEO of Fishtown Analytics, called The Modern Data Stack: Past, Present, and Future. During the talk, Tristan discussed a so-called Cambrian explosion of data products built upon data warehouses, such as Amazon Redshift, between 2012 and 2016, as well as his opinion that we are on the precipice of a similar paradigm shift, which he referred to as “the second Cambrian explosion.”

Tristan’s perspective on the modern data stack provided much food for thought; however, the topic I want to explore here stems from a brief comment made about the future of self-service in data-driven decision-making: How those who are not necessarily on the cutting-edge of data science can best leverage their data to make informed decisions.

Office Space

To start this exploration, I will first give a simplified version of a past during which I was not of working age and with which I therefore have no direct experience: Prior to the aforementioned (first) Cambrian explosion, data analysis was primarily carried out using spreadsheets, such as (of course) Microsoft Excel. In many theoretical offices in the 90s and 00s, countless nameless and faceless theoretical analyst/decision-makers spent their Mondays through Fridays bouncing among tens of tens of Excel spreadsheets, adding calculated fields in two-lettered columns and introducing errors for which there would be no record; it was a laugh riot, the analyst/decision-makers earned decent theoretical wages for their time spent, and everyone watched Friends in the evenings without feeling obligated to discuss how problematic it was.

In more recent times, with the advent of modern data warehouses, data storage was able to be better separated from data analysis, and many, many SaaS companies profited off this division on scales not easily understood by humans. So rather than the happy-go-lucky Friends’ era paradigm, with data tabulated in one program with nice little cells and able to be analyzed in that same program by analyst/decision-makers, a number of new business intelligence platforms began to make their way into offices, raining on everyone’s parade, and just because the new guy attended a “conference” about the “future” in “Des Moines.”

The Modern Data Racket

Let me take a step back: In or around his talk (source), Tristan made the following comments:

“How do you democratize self-service? Controversial, but I believe the Modern Data Stack disempowered many decision-makers. Those comfortable w/ Excel feel cut off from the source of truth. What if the spreadsheet interface is actually the correct way?”

Tristan prefaces his claim that the modern data stack has disempowered decision-makers with the warning that his opinion may be controversial, but I would argue that his statement is not controversial at all, mostly because it is unarguably true. With the shift of data storage and analysis away from the flexible and easy-to-use spreadsheet and toward ecosystems such as data lakes, data rivers, data abysses, and data Charybdises, end-users (i.e., the analyst/decision-makers of yore), many of whom primarily use data tools as a means to an end, have likely lost their way.

Put simply, it is not as simple to navigate the modern data stack as it was to navigate acres of spreadsheets. Spreadsheets, with all of their flaws, have almost no learning curve: if you can turn on a computer and open a file, you can navigate a spreadsheet. Furthermore, from the start, you are only a few clicks, keystrokes, and neural connections away from mastering formulas, pivot tables, and visualizations. I hate spreadsheets! — but they are a near-perfect balance of usability and flexibility.

In contrast, while modern business intelligence tools may be built with end-users with varying levels of technical expertise in mind, they tend to have a steeper learning curve. For example, while today’s decision-maker, now stripped of his or her ‘analyst’ status, can likely navigate a dashboard and make decisions based on the information presented, he or she has lost the almost-tactile experience of sifting through the data with his or her own hands.

The Second Law

I know what you are thinking—literal metric tons of decisions were made based on little more than a pie chart from an hours-long presentation that was not put together by the person who had final say in the decision-making process—but please allow me to employ the above generalization to support my next point: Every new data technology moves the decision-maker further downstream from the data source.

Today, the data required by a decision-maker may be located in a neatly designed dashboard, on physical servers, somewhere in the cloud, and/or on the backs of napkins, they may have underwent various transformations and exist in several slightly different forms of varying accuracy and transparency, and most likely, that decision-maker does not know which of these sources contains the data he or she needs to make an optimal decision, nor the processing those data underwent; it is complete chaos, and not the fun and festive Bacchian kind (and I haven’t even spoken of the inherent fuzziness of data).

The more technologies a company implements in its data stack, the more points there are for potential misunderstandings, and the more training individual decision-makers have to undergo to become fluent in the data stack on which they rely. In other words, decision-makers are being disempowered by the increasing complexity of the modern data stack.

So… What now?

I see three possible solutions to the problem of disempowerment:

  • Stop trying to reinvent the wheel and keep spreadsheets around for the long haul;
  • Hire decision-makers who are prepared to use and keep up with the modern data stack; or
  • Promote close collaboration between data experts and decision-makers to support decision-making.

All three options have pros and cons, but I am personally a fan of the third. In the last decade or so, there has been a rapid increase in our ability to store and manipulate data, and spreadsheets alone cannot be expected to fulfill all modern data needs. Similarly, in many industries, decision-makers alone cannot be expected to stay on the cutting-edge of data science. Therefore, it follows that close collaboration between data experts and decision-makers is becoming increasingly necessary in the modern office.

Alternatively, perhaps in time an easy-to-use tool will come along that can be used to both store and analyze data… Oh… wait… that’s the spreadsheet.

Pie Charts

Show/Hide Code  

You toggled code blocks on! Code blocks will appear in gray below. Click the button again to turn them off.


It seems as if people are split on pie charts: either you passionately hate them, or you are indifferent. In this article, we are going to explain why pie charts are problematic and, if you fall into the latter category, what you can do when creating pie charts to avoid upsetting those in the former.

Why are pie charts problematic?

They use size to convey information

A pie chart uses the size of a portion (slice) of a circle (pie) to display a numerical variable. This factor is not an issue in and of itself, as many chart types use size to convey information, including bubble charts and bar charts; however, while bubble charts and bar charts use diameter and height, respectively, to convey information, pie charts rely on the angle describing a slice---and the human eye is not very good at recognizing differences in angles.

Suppose we took a survey on people's favorite kinds of pie. In the chart below, it is difficult to see how the categories relate to each other; individually, Cherry and Rhubarb seem to comprise a smaller portion of the pie than either Apple or Pumpkin, but it may not be obvious (without looking at the data) which is the smaller slice.

#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.5, 0, 0.75, 0))

#Data for pie chart
x = c(18, 15, 13, 25, 29)
labels = c("Key Lime", "Cherry", "Rhubarb", "Pumpkin", "Apple")
cols = c("greenyellow", "red", "mediumvioletred", "darkorange", "cornsilk")

#Build the pie chart
pie(x, labels, radius = 1, col=cols)

They cannot display many categories well

This issue of conveying size via angle is even more pronounced when many categories are shown in a single pie chart. Furthermore, unlike some charts that are used to display several categories at once, such as bar charts, pie charts depend on differences in color to denote category; therefore, a large palette of colors is necessary, and without proper selection of the palette, the results could be either garish or ambiguous.

#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.55, 0, 0.8, 0))

#Data for pie chart
x = c(2, 4, 5, 10, 13, 15, 15, 17, 19)
labels = c("Key Lime", "Pecan", "Cherry", "Blueberry", "Rhubarb", "Lemon Meringue", "Blackberry", "Pumpkin", "Apple")
cols = c("greenyellow", "tan4", "red", "darkblue", "mediumvioletred", "yellow", "black", "darkorange", "cornsilk2")

#Build the pie chart
pie(x, labels, radius = 1, col=cols)

They show parts of a whole

Pie charts represent a whole as its components. Therefore, if your dataset is a subset of a larger dataset (and thus does not represent the whole) or if your dataset consists of independent categories (and thus represents multiple wholes), then a pie chart may not be appropriate.

Pie charts in popular packages

We wouldn't want to assume anyone's opinion on as divisive a topic as the pie chart, but perhaps the disdain for this chart type is best exhibited by the lack of built-in functions for creating them in two very popular data visualization packages: ggplot2 (R) and seaborn (Python). With both packages, a pie chart can be created only through trickery.


It is convenient---perhaps a little too convenient---that a pie chart is no more than a single stacked bar displayed in polar coordinates. The code below builds the pie chart shown above, but using ggplot2.

#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.55, 0, 0.8, 0))

#Data for the pie chart
values = c(9, 2, 5, 10, 13, 15, 10, 17, 19)
labels = c("Key \nLime", "Pecan", "Cherry", "Blueberry", "Rhubarb", 
           "Lemon \nMeringue", "Blackberry", "Pumpkin", "Apple")
cols = c("Key \nLime"="greenyellow", "Pecan"="tan4", "Cherry"="red", "Blueberry"="darkblue", 
         "Rhubarb"="mediumvioletred", "Lemon \nMeringue"="yellow", "Blackberry"="black", 
         "Pumpkin"="darkorange", "Apple"="cornsilk2")

data = data.frame(labels, values)

#Build the pie chart
ggplot(data, aes(x="", y=values, fill=labels))+
    geom_bar(width = 1, stat = "identity") +
    scale_fill_manual(values=cols) +
    coord_polar("y", start=0) +  #Use polar coordinates

What chart types can be used to replace pie charts?

Bar charts

Similar to pie charts, bar charts use size to convey information; however, for bar charts, the height of a rectangle varies, and differences between the heights of bars are easier to recognize than the differences between the angles of portions of a circle. Furthermore, bar charts can be configured to show absolute numbers, percentages, or both!

#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.5, 1, 0.2, 1))

#Data for bar chart
values = c(9, 2, 5, 10, 13, 15, 10, 17, 19)
labels = c("Key \nLime", "Pecan", "Cherry", "Blueberry", "Rhubarb", 
           "Lemon \nMeringue", "Blackberry", "Pumpkin", "Apple")

data = data.frame(labels, values)
data = data[order(-values),]

#Build the bar chart
        ylim = c(0, 20),

Waffle Charts

Waffle charts, which are growing in popularity, use number rather than size to visualize a numerical dimension. The resulting graph is similar to a stacked bar or tree map; however, because each square is a unit, compared to alternatives that rely solely on size, it is easier for a person to confirm if a perceived difference between categories is real without relying on text.

#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.5, 1, 0.2, 1))

# Create data
pies = c("Pecan"=2, "Cherry"=5, "Key Lime"=9, "Blueberry"=10, "Blackberry"=10, 
         "Rhubarb"=13, "Lemon Meringue"=15, "Pumpkin"=17, "Apple"=19)

waffle(pies, rows=5, size=1.5, 
       colors=c("tan4", "red", "greenyellow", "darkblue", "black", 
                "mediumvioletred", "yellow", "darkorange", "cornsilk2"),
       xlab="1 square = 1 vote", legend_pos = "bottom")

But what if I don't like the alternatives?

Even though there are many alternatives (e.g., bar charts, stacked bars, waffle charts, lollipop charts, tree maps), pie charts are a familiar chart type to most people, and depending on the audience, familiarity may be an important factor that affects interpretability. So if you want to stick with pie charts, consider taking the following advice.

Limit the number of categories via grouping

To avoid visual clutter and to ensure your pie chart is readable, the number of categories should be small. Therefore, it may be useful to group categories that individually comprise a small proportion of the pie into a single category. Note that, when using this approach, it may be helpful to list the items contained in the derived category. Furthermore, it is best to ensure that the new category does not form the majority of the resulting pie.

Show percentages or absolute numbers (or both) as text

Text can be used to prevent misunderstandings due to ambiguity. By including text information, a person can see if there are differences among the categories. However, if it is necessary to include text, then one can argue that the visualization itself is ineffective (so be prepared to defend your choice of chart type).

#Adjusting plot size and margins
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,1), mai = c(0.55, 0, 0.8, 0))

#Data for pie chart
x = c(15, 20, 35, 30)
labels = c("Other (15%)", "Cherry (20%)", "Pumpkin (35%)", "Apple (30%)")
cols = c("black", "red", "darkorange", "cornsilk2")

#Build the pie chart
pie(x, labels, radius = 1, col=cols)


We hope you found our discussion of pie charts informative. While pie charts can be avoided in most cases, they remain a pithy little chart on which many, many people have little to no opinion. However, to avoid a mass uptake of pitchforks and torches, please remember to employ pie charts responsibly and to use caution when including any controversial chart type in your next presentation.

Required libraries

Click the Show/Hide Code button to view the libraries.


Scatter Plots: What they are and how to make the most of them

Useful Python Snippets

The goal of this blog post is a compilation of little tidbits and code snippets that address common issues when programming for data analysis in Python.

General Snippets

Difference between JSON and XML

This page gives a great example of the difference between data in JSON format and XML format. It shows the exact same data in both formats:

Converting scientific notation into numbers

Converting from scientific notation in a Pandas Dataframe:

Remove ellipses from pandas dataframe preview:

pd.options.display.max_columns = 2000

#If you don't want to make the change permanently for the notebook, 
(e.g., to avoid excessive output in other cells), you can also use 

with pd.option_context('display.max_columns', 2000):
#temporarily display all columns
with pd.option_context('display.max_seq_items', None):
    print (df.columns)

Isolate date columns:

datecols2 = []

for item in prod.columns:
    if 'Date' in item:

Add grand total column to a pivot table:

test_df = pd.pivot_table(prod, index="Color", columns="Class", values="ListPrice", aggfunc=np.sum)
test_df['Grand Total'] = test_df.sum(axis=1)

Comparing group by syntax to pivot table syntax:

prod.groupby(['Class', 'Style']).count()[['Name']]
pd.pivot_table(prod, index=['Class', 'Style'], values="Name", aggfunc="count")

Use apply for multiple columns in a dataframe:

avo.apply(lambda row: row.AveragePrice * row['Total Volume'], axis=1)

Choose an argument for the open function (file i/o)

Install packages in Jupyter Notebook

# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pytime

Understanding copying objects in python

These two links are excellent at explaining:

Reverse Dictionary Function

def reverse_dict(lookup_value):
    dictionary = {'george' : 16, 'amber' : 19}
    for key, value in dictionary.items():  
        if value == lookup_value:

What are args and kwargs?

Removing duplicate rows in a dataframe

This is an extremely important pandas doc page! Indexing & slicing dataframes

Selecting & looping through parts of a dataframe

Connect to a mySQL database:

import mysql.connector
# Set up your connection to the database
myConnection = mysql.connector.connect( host=

# Read the results of a SQL query into a pandas data frame.
my_table = pd.read_sql('SELECT * FROM table_name, con=myConnection)

Connect to postgres database:

import psycopg2
connection = psycopg2.connect(user = 
password = "your-password-here-keep-quotes",
host = "your-host-here-keep-quotes",
port = "5432",
database = "your-database-here-keep-quotes")
cursor = connection.cursor()
cursor.execute("SELECT * FROM django_session;")
record = cursor.fetchone()

Connect to the Twitter API

import numpy as np
import pandas as pd
import json
from import json_normalize
import twitter

# You need to replace all the capital words in brackets with your
# ACTUAL keys. The quotation marks stay but the brackets and 
# capital words must go. 
api = twitter.Api(consumer_key='[CONSUMER KEY GOES HERE]',
                  consumer_secret='[CONSUMER SECRET GOES HERE]',
                  access_token_key='[ACCESS TOKEN KEY GOES HERE]',
                  access_token_secret='[ACCESS TOKEN SECRET]')

# Get the tweet data since that last tweet
# The user_id is for Boxplot's timeline, replace it with your 
# own if you'd like!
user_timeline = api.GetUserTimeline(user_id='959273870023905280')
latest_twitter_data_final = pd.DataFrame()

for i in range(len(user_timeline)):
    rowasdf = \ json_normalize(json.loads(json.dumps(user_timeline[i]._json))) \
    latest_twitter_data_final = pd.concat([latest_twitter_data_final, \ 


Loop through a Series and make sure that each subsequent value is greater than or equal to the one before it. If not, set the value equal to the one before it:

previous_value = 0

def previous(current):
   global previous_value
   if current < previous_value:    
       return_value = previous_value
#        previous_value = current
       return_value = current
   previous_value = return_value
   return return_value

Remove white space in a column

df = pd.DataFrame({'a':[' app le ']})
df.a = df.a.str.strip()

Customizing Matplotlib Visualizations

How to customize the range of the x-axis and rotate the tick marks:

awesome_table1 = pd.pivot_table(data, index='DEGFIELD3', 
columns='REGION2', values='CBSERIAL', aggfunc='count')

Also see:

Create reusable settings for a chart:

def my_scatterplot(x_txt, y_txt, df, colorcol):
    df.plot(kind='scatter', x=x_txt, y=y_txt, c=colorcol, colormap='winter', figsize=(10,4), s=10, alpha=.5)
my_scatterplot('Total Bags', 'AveragePrice', avo, 'type_as_num')

Change the size of all charts in a notebook:

# put this at the top of the notebook:
plt.rcParams["figure.figsize"] = [15, 10]

Example of changing colors and marker types in scatterplots:

colors = ['b', 'c', 'y', 'm', 'r']

en = plt.scatter(books_data[books_data['language_code']=='en']
['average_rating'], books_data[books_data['language_code']=='en']
['ratings_count'], marker='x', color=colors[0])

spa = plt.scatter(books_data[books_data['language_code']=='spa']
['average_rating'], books_data[books_data['language_code']=='spa']
['ratings_count'], color=colors[2])

fre  = plt.scatter(books_data[books_data['language_code']=='fre']
['average_rating'], books_data[books_data['language_code']=='fre']
['ratings_count'], marker='o', color=colors[1])

# a  = plt.scatter(random(10), random(10), marker='o', 
# h  = plt.scatter(random(10), random(10), marker='o', 
# hh = plt.scatter(random(10), random(10), marker='o', 
# ho = plt.scatter(random(10), random(10), marker='x', 

          ('English', 'Spanish', 'French'),
          loc='lower left',

Multiple y axes, and forced axis

# This is one data point we're trying to plot
chart1 = sets.groupby('year')['num_parts'].count() 
# This is the other data point we're trying to plot
chart2 = sets.groupby('year')['num_parts'].mean() 
fig, ax = plt.subplots(sharey='col')  

# Create a MatPlotLib figure & subplot
ax2 = ax.twinx() # ax2 shares X axis with the ax Axes object
#  This is what forces scale on the second Y-axis!!
ax2.set_ylim(bottom=0, top=799)  

# graph both of our data sets, one bar, one line, chart1, color='dodgerblue')
ax2.plot(chart2.index, chart2, color='red')

# Set the size of the resulting figure

Other Visualizations

A great tutorial for mapping:

Side by side boxplots with seaborn:

Set x and y axes for seaborn plots:

ax = sns.barplot(x = 'val', y = 'cat',
              data = fake,
              color = 'black')
ax.set(xlabel='common xlabel', ylabel='common ylabel')

Make a word cloud in the shape of a custom image:

# If using Jupyter Notebook, you need to install the 
# wordcloud module like this

import sys
 !{sys.executable} -m pip install wordcloud

# import the libraries needed

from PIL import Image
 import numpy as np
 import pandas as pd
 from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
 import matplotlib.pyplot as plt

#  import your dataset
prod = pd.read_csv('winemag-data-130k-v2.csv')

# import the image mask
wine_mask = np.array("wine_mask.png"))

# generate the word cloud
comment_words = ''

stopwords = set(STOPWORDS)

stopwords.update(["drink", "now", "wine", "flavor", "flavors"])

for val in prod.description.iloc[0:1000]:

    val = str(val)

    tokens = val.split(' ')

for i in range(len(tokens)):
    tokens[i] = tokens[i].lower()

for word in tokens:
    comment_words = comment_words + ' ' + word

wordcloud = WordCloud(background_color="floralwhite", 






plt.figure(figsize = (48,48), facecolor = None)

plt.imshow(wordcloud, interpolation="bilinear")



plt.title("Frequent Words from Tasters - Wine Form",fontsize = 40,color='gray')

Make a single box and whisker plot with Matplotlib:

fig, axs = plt.subplots(1, 1)
 axs.set_title('basic plot')

Multiple box and whisker plots with Matplotlib:

# To make side by side box and whisker plots (in this example, get 
# points for each country, and then make a list of those lists). 
# That is what is passed in to the boxplot function:

u = list(prod[prod['country']=='Italy']['points'])
m = list(prod[prod['country']=='Portugal']['points'])
w = list(prod[prod['country']=='Germany']['points'])
final_list = []

fig7, ax7 = plt.subplots()
ax7.set_title('Multiple Samples with Different sizes')

Pie Chart:

   autopct='%1.0f%%', colors=['skyblue', 'lavender', 'lightpink',
   'lightcyan', 'lemonchiffon', 'mistyrose'])
plt.legend(title = 'Wine Country of Origin', loc='best', bbox_to_anchor=(1, 0, 0.5, 1))
plt.figure(figsize=(360, 250))

Bubble plot tutorial

Calculating correlation with a scatterplot

A/B Testing Example (Two Proportion Hypothesis Test)

A/B testing (sometimes called split testing) is comparing two versions of a web page, email newsletter, or some other digital content to see which one performs better. A company will compare two web pages by showing the two variants (let’s call them A and B) to similar visitors at the same time. Typically, the company is trying to see which page leads to more sales, so the one that gives a better conversion rate wins.

You work for a nonprofit and your organization has two different webpages designed to solicit donations, we’ll call them page A and page B. The two pages are trying to accomplish the same result – to get the viewer of the page to donate, but they both have different looks and feels from one another. The success of these web pages is measured by the percentage of people who wind up making a donation (in any amount). For example, if 100 people view page A and 10 purchase, A’s conversion rate is 10%.

The nonprofit wants to test if one page is doing statistically significantly better than the other (that is, leading to more donations) and has tasked you with coming up with an answer.

Step 1: Collect Data

The first thing you need to do is choose a period of time. Let’s say one month. Then, collect your data. In one month, page A had 100,000 views and 20,000 people made a donation. In that same month, Page B had 80,000 views and 15,000 people purchased a product.

  • The proportion for success for Page A is pA = 20,000/100,000 = .2 = 20%
  • The proportion for success for Page B is pB = 15,000/80,000 = .1875 = 18.75%
  • The difference between these two proportions is .2 – .1875 = .0125 = 1.25 percentage points

Step 2: Choose a Test

To determine if the conversion rate for page A is significantly higher than page B, we do a difference of proportions test. Choosing a test sometimes can be the most difficult part of a statistical analysis! Different test statistics (T, Z, F, etc.) are used for different types of data. Use the Statistics Cheat Sheet for Dummies chart or other related sites like StatTrek to help you choose the right test based on your sample.

Step 3: Pick a Confidence Level

Almost everyone chooses 95%. If you choose less than that, people may look at you funny or like you have something to hide! Of course there may be appropriate uses for confidence levels less than 95% but it’s not common. If you’re testing something super important, like the safety of airplane parts, you want a confidence level much higher than 95%! Probably like 99.99999% or more!

In this case, we’ll stick with 95%.

Step 4: Null and Alternative Hypotheses

As always with hypothesis testing, we need to specify null and alternative hypotheses. In statistics, we’re never talking about an exact match – it will almost never be that way. See Barbara’s Kakes+ example for more on this. In this case, our hypotheses would be:

Null Hypothesis

pA – pB = 0

That there is no difference between the two pages – that is, statistically, one does not result in more donations than the other. If you subtract two numbers that are equal, you would get 0, which is why the hypothesis is written this way. These are all appropriate ways of stating the null hypothesis in words:

Alternative Hypothesis

You have three options for the alternative hypothesis: pA – pB > 0, pA – pB < 0, or pA – pB ≠ 0. In this case, we’ll choose pA – pB > 0 because we think that page A is performing better than page B, and subtracting a smaller number from a larger number results in a positive answer.

Step 5: Meeting Assumptions

The assumptions associated with a difference of proportions test are discussed in the last column of this table (taken from the Statistics for Dummies Cheat Sheet):

Let’s break down the variables and assumptions into a table:

  Page A Page B
N 100,000 80,000
P 20,000/100,000 = .2 15,000/80,000 = .1875
Np 100,000 * .2 = 20,000 80,000 * .1875 = 15,000
N(1-p) 100,000 * (1-.2) = 80,000 80,000 * (1-.1875) = 65,000

Looks like we met the requirements! Np and n(1-p) are well above 10 for both of these. The reason that they must be above 10 is because statements cannot be made with enough confidence about small samples. You need a large enough sample size to do the test.

Step 6: Find the Pooled p

This is a special calculation we have to do for a difference of proportions hypothesis test. It is essentially finding a weighted average of the two proportions:


Step 7: Calculate a Z score

We see from the table above that we are using the Z test statistic, and the table also provides the formula. So, we just plug in the numbers!

Step 8: Discuss Results

A Z-Score of 6 is huge! Take a look at the normal distribution:

The area underneath the curve (that is, from the curve to the x axis) is the probability of getting a result as large as you did if the null hypothesis is true. To put that in the context of our problem, it means “the probability of getting a difference of .0125 (1.25%) if the difference of the two proportions is actually 0%.” The area between the curve and the x axis at Z = 6 (which is so far to the right it isn’t even on the chart!) is extremely small – less than 1%. We can get the exact value using a p-value calculator.

At 95% confidence, we need the Z score to be above 2 (or equivalently, the p-value to be less than 5%) to reject the null hypothesis. So, since they are, we can at 95% confidence reject the null hypothesis that there is no difference between the two pages in favor of the alternative hypothesis, that page A performs better than page B. These are some other appropriate ways of stating the result:

  • At a 95% confidence level, the proportion of viewers who donate after viewing page A is statistically significantly greater than the proportion of viewers who donate after viewing page B.
  • With 95% confidence, we can state that page A performs better than page B in number of donations solicited
  • At 95% confidence, page A has a statistically significant higher proportion of donors than page B.

Notice that all of these contain “at 95% confidence” or “with 95% confidence,” etc. Language like this is important in statistics! If we had chosen another confidence level, our result may be completely different.

Population Mean Hypothesis Test Example

The Situation: Kakes+, a Pennsylvania company that makes terribly unhealthy small pies/cakes, believes that their machines are overfilling their blueberry pies. Kakes+ wants to test this statistically, and has recruited you to come up with a data-backed answer. The pies should weigh 8 ounces each.

Step 1: Collect Data

You need to weigh the pies to determine the answer. However, of course it would be absurd to weigh every single Kakes+ pie that leaves the factory! Rarely in life can we test an entire population, which is why hypothesis testing is so important – the whole point of hypothesis testing is using a sample to make a statement at some level of confidence about the entire population.

You need to choose a sample size. It’s important to choose a sample size that’s large enough to represent the whole population. If you only weigh 5 pies, it’s not going to be enough data for us to make a confident statement about the average weight of entire population.

There are two options for choosing a sample size – you could attempt to calculate it using the formula from the Statistics for Dummies cheat sheet.

But that requires a lot of guessing (we need to guess a standard deviation and determine a margin of error (MOE). We know that sample sizes greater than 30 are acceptable because of the Central Limit Theorem (given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.), so let’s sample 100 pies.

You sample 100 pies, and get the following numbers:

8.42 4.20 9.60 4.66
10.04 6.63 7.84 8.61
8.04 6.09 7.09 8.70
5.58 5.76 9.78 6.42
9.57 8.23 11.68 6.63
6.60 7.28 9.31 10.83
7.07 8.86 11.48 7.83
8.43 7.44 8.62 8.15
7.75 11.26 6.49 10.14
10.93 7.26 10.99 11.51
8.84 7.40 8.20 7.51
8.03 8.70 5.98 9.28
6.59 7.71 9.97 8.74
8.04 7.84 8.36 8.48
8.39 6.22 9.02 9.99
9.77 7.60 10.47 5.03
9.09 10.18 8.41 8.39
8.91 4.48 9.52 5.34
7.11 5.67 9.57 9.44
8.74 7.81 6.78 7.25
11.16 7.87 6.13 7.97
5.81 11.15 12.92 8.85
6.04 7.48 7.69 7.36
7.09 5.17 7.25 7.36
8.43 9.87 7.26 9.54

The mean of this sample is 8.17. The standard deviation is 1.75. These can both be calculated using formulas in Excel.

Step 2: Choose a Test

We want to estimate the average weight of the pies for the population, so we would choose the population mean hypothesis test. Use the Statistics Cheat Sheet for Dummies chart or other related sites like StatTrek to help you choose the right test based on your sample.

Step 3: Pick a Confidence Level

Almost everyone chooses 95%. If you choose less than that, people may look at you funny or like you have something to hide! Of course there may be appropriate uses for confidence levels less than 95% but it’s not common. If you’re testing something super important, like the safety of airplane parts, you want a confidence level much higher than 95%! Probably like 99.99999% or more!

In this case, we’ll stick with 95%.

Step 4: Null and Alternative Hypotheses

The null hypothesis is that the average weight of the population of Kakes+ blueberry pies is 8 ounces. We choose this because we know that’s what it should be.

u = 8

The alternative hypothesis is that the average weight of the population of Kakes+ blueberry pies is greater than 8 ounces. We chose this because that’s what we think is actually happening.

u > 8

Note: our options for the alternative hypothesis were greater than 8 ounces, less than 8 ounces, or “not equal to” 8 ounces. We’re never saying how much great or less, just that it’s statistically significantly greater or less than 8 ounces.

Also note – this isn’t exact and is not meant to be taken literally. For example, if your sample mean turns out to be 8.00001 ounces, you will fail to reject the null hypothesis because if your sample mean is that close to 8, there’s obviously a good chance that it could actually be exactly 8 if you weighed all the pies. In statistics, you can’t be 100% sure of anything, so you’re always considering an interval with some level of confidence where the true average weight of the pies may lie. (Or, if you’re testing proportions, where the true proportion would lie, or difference of proportions, etc.). See optional step 9 to understand this better.

Step 5: Meeting Assumptions

Take a look at the row for population mean hypothesis testing from the Statistics for Dummies Cheat sheet:

The last column, “Use when” states the assumptions that need to be in place for the test to work. We meet the normal distribution condition for both of these. However, we don’t have a known population standard deviation – we can only use the calculated standard deviation from our sample. So, we need to choose the t-test instead of the z test.

Step 6: Calculate the Z Score

This test uses the T distribution, and the Cheat Sheet tells us that and also gives us the formula for the t-statistic (test statistic). From here, we just plug in the numbers:

Our t-statistic is .9714.

Step 7: State Results

Check out this graph of the normal distribution, where the x axis is standard deviations. So, the 1 on the x axis means 1 standard deviation from the mean. The 0 means 0 standard deviations from the mean (which is the mean itself!)

When you look at a chart like this for a hypothesis test, we’re always looking at it with the mean as the null hypothesis. So in this case, that 0 represents the 8. The 1 is one standard deviation away from the 8 ounces. Our Z score of .9714 falls about at the 1, which means the value 8.17 is about 1 standard deviation away from the mean. At 95% confidence, we can see from the chart that we need something 2 or more standard deviations away from the mean to reject the null hypothesis. Since we didn’t reach 2, we fail to reject the null hypothesis meaning that there is not sufficient evidence at the 95% confidence level that the average weight of the population is greater than 8 ounces. In other words, we don’t have enough evidence to conclude that the machines are overfilling the pies.

Optional Step 8: Calculate a P Value

You’ll hear p value thrown around a lot in statistics! The formal definition is the probability of getting an answer as extreme as the observed result if the null hypothesis is true. In other words, what’s the probability of getting 8.17 if the true average weight of the entire population of pies is 8 ounces? It’s also represented as the area between the curve and the x axis on that normal distribution graph above. As you can see, the further away from the mean (0) we get, the smaller the area between the curve and the axis, and thus the lower the probability of getting a result way out there.

We need a p-value calculator for to get the exact value. We recommend this one because it checks to see if you are doing a one or two-tailed test, and your confidence level. In our case, we’d type in .971 for the t-statistic and 99 for the DF value. DF stands for degrees of freedom, which is equal to the sample size (100 pies in our case) minus one. The significance level is 0.5 because we specified 95% confidence earlier in the test. And it’s a one-tailed test because our alternative hypothesis is greater than 8. If we chose “not equal to” 8 then it would be two-tailed.

The p-value is .166858 or 16.69%. This isn’t giving us any new information, it’s just another way of considering the t-statistic we got earlier. We need a p-value of less than 5% to reject the null hypothesis, and this is way higher than 5%, so we once again conclude that we fail to reject the null hypothesis at 95% confidence.

Optional Step 9: Confidence Intervals

As mentioned in my note earlier, these tests aren’t supposed to be exact, they’re giving a probability of getting a result assuming the null is true. Another way of thinking about this is that they are providing a range of values which, at 95% confidence, the true mean could lie within. If 8 is in that range, we fail to reject the null. If 8 is outside the range, we reject the null.

For our test, the range would be:

The number 8 is in this range, so we fail to reject the null at 95% confidence.

Common Python Errors

This post is updated as appropriate, so keep checking back!
Table of Contents

Errors when installing python with homebrew Errors when writing/running python code

Installing Python

Error: Permission denied @ dir_s_mkdir – /usr/local/Frameworks

Check out this article for help:  

Error: Could not symlink bin/2to3

Check out this article for help:

Writing Python Code

TypeError: Can’t convert …

This means one of a few things:
  1. That you are trying to combine data types that aren’t compatible. You can’t concatenate a string and an integer for example.
  2. You’re trying to perform some operation on the wrong datatype for that function.

NameError: global name ‘—‘ is not defined

Did you forget to put something in quotes? Remember if you didn’t define something as a variable, list, dictionary, etc. previously, and it’s not a number, it needs to be in quotes!  

IndentationError: expected an indented block

There are several types of indentation errors. These are pretty self-explanatory. You either forgot an indent or have too many. Remember, python considers indents to be four spaces or a tab, exactly.  

SyntaxError: invalid syntax

This could mean a lot of things, but basically you aren’t following one of the basic syntax rules of Python. Here are some common examples:
  • Forgetting the parens around the arguments to print
  • Forgetting the colon at the end of the condition in an if statement, or in a for loop
  • Trying to use a reserved word as a variable name
  • Code like: if my_variable = 8: (should be == 8 when in an IF statement!)

IndexError: list index out of range

Typically this means you are trying to access an item in a list that doesn’t exist. For example, : flowers = ["rose", "tulip", "daisy"] print("Flowers in my garden are:", flowers[1], flowers[2], flowers[3]) There is no flowers[3]! Remember, lists start at 0, so it should have been flowers[0], flowers[1], flowers[2].  


These seem scary, but they are similar to the NameError, only specific to dictionaries. They are raised when a key is not found in the set of existing keys. Check for spelling and case sensitivity!  


This is most commonly caused by trying to convert a bad string into a number. For example: my_num = int("Word").

Tables & Linking Data Structures in Excel

Tables are one of the most important features of Excel, but are often overlooked. Tables and keeping analyses in Excel connected, will drastically increase your efficiency in Excel. Let’s start by understanding how they work with PivotTables.

We’re going to use an R Dataset called DoctorContacts. Download the .csv file using this link (and save it immediately as .xlsx!). If the file opens in a new window instead of downloading, try right-clicking the link and choosing “Save Link As…”. You can find the data dictionary for this dataset here. Each row represents a unique visit to the doctor.

If you click on any non-blank cell, and then go to Insert >> PivotTable, you should see this:

Create Pivot Window Without table

Notice this says DoctorContacts!$A$1:$O$20187. Even if you take away the dollar signs, this range of cells is set in stone. For example, let’s say you actually make this PivotTable, and maybe 10 more PivotTables, and a bunch of charts, and calculations, and perhaps even a whole dashboard. And then after you did all that, your boss comes to you and says “oops, I forgot to give you 100 rows of data.” Or maybe you forgot a calculated column, or data is constantly updating from a server. Depending on what is changing in your dataset, and how you set up your PivotTables, calculations and charts, you may have to re-do everything. But that doesn’t happen if you make a Table first. Click cancel to get out of this PivotTable window, we’re going to make a Table first!

Make a Table

Click on any non-blank cell inside your dataset, and then go to Insert >> Table. Excel will find the dataset for you:

Quick tip: Do not highlight the data before making a Table or PivotTable- if you highlight the data first, you could either 1) get blank cells in your table or 2) miss data. The best and easiest way to select your dataset is to simply click on a non-blank cell inside the dataset, and go to Insert >> Table or Insert >> PivotTable.

Insert Table Window

When this window appears, verify that it did find your correct dataset, and that “My table has headers” is checked off. This just means that the titles of the columns are in the first row of your dataset, which they are. Then click OK. Your dataset should have turned blue. This is what the whole thing looks like in case you missed a step:

This is a video, hit the play button!

Besides being easier to read, more importantly, this is now a modifiable data structure. Excel now recognizes this as your dataset. Now, go back to Insert >> PivotTabe. Notice anything different?

Create Pivot Window With table

That’s right, now it says “Table1” instaed of “DoctorContacts!$A$1:$O$20187”. You can add or remove rows to Table1, and it’s still Table1. You can add or remove columns from Table1 and it’s still Table1. Now you no longer have to worry about your PivotTable reference if your original dataset changes. Let’s try it out to see it in action. Click OK to make the PivotTable from Table1. Let’s drag health into the rows box, and linc into the Values box. We want the average income by health level, so change linc from a sum to an average. This is what the process looks like:

and the outcome:

Pivot table with log of income

Those numbers look odd, right? People don’t have an average income of $8. This is because this variable represents the log of income. (It’s actually unclear if this is the log, natural logarithm, or some other formula. We’re going to assume it’s the natural logarithm for this analysis). We can get this back into regular income by creating a new calculated column. So, we’ll be modifying the original dataset so you can see how tables work to keep your data updated!

Back in the table, go to the next blank column to the right, and call it “income”. Type “=EXP(” and then click on cell I2. Your final formula should look like this: “=EXP([@linc])”. Notice formulas are easier to read in tables too! This is more like a sentence, and you don’t have to go back and figure out what I2 is later if your columns are named well. Hit enter, and you should see the formula automatically fill down the whole column.

What if I want to freeze part of a cell reference? (Use the $?) There’s a way to do this with the table syntax, involving the @ symbol and brackets []. However, you can also always just revert back to the normal way of referring to a cell (A2 for example) by hand-typing the reference, and then you can use the dollar signs.

Now, go back to your PivotTable. Notice that your new income column isn’t there yet! Don’t worry, all we have to do is go to the PivotTable Analyze menu, and hit the refresh button. Now it’s there! As usual, here’s the whole thing:

Now, we can swap out average of linc for average income and the numbers look a lot better:

Pivot table with average income

I know these values still don’t look like realistic incomes, but remember we assumed that it was a natural logarithm. The point of this example is that we were able to update the PivotTable using Tables.

An Example with Rows

You can also do the same thing with rows – the PivotTable will recalculate the values when you push the refresh button, and any visualizations that are connected to the PivotTable will also update. Start by making a bar chart from our average income by health level PivotTable. To do this, you shouldn’t have to highlight anything (unless you are on an older version of Excel on a Mac). Just click anywhere inside the PivotTable and go to Insert >> Column Chart:

chart of average income by health level

Now, go back to the original dataset and scroll down to the first blank row below the table (should be row 20188). By the way, the keyboard shortcut Command + down arrow key on a Mac will get you to the bottom of the dataset faster. It’s Control + down arrow key on a PC. In that blank row, we’re going to fill in some fake data to see how the update works. Put “poor” in the H (health) column and 1000000 in the P (income) column. You can leave everything else blank. Go back to your PivotTable, and click inside it. Then, in the PivotTable Analyze menu, choose the Refresh button. Notice how the average income for the poor category recalculates, and the bar increases:

No need to re-do the PivotTable or chart. And imagine if you had spent a ton of time formatting the chart (picking out colors, removing borders, etc.) – it would be an absolute nightmare to have to redo that!

Tables are also essential when you have an Excel file that is connected to a server. I’ll cover that in a future post, but if you have data constantly coming in from a server (or being overwritten) then you’ll want a table so that the table expands and contracts with your dataset.

Referencing to Keep Things Connected

What if you can’t make a chart or calculation directly from a PivotTable? For example, most versions of Excel do not allow you to make a bubble plot directly from a PivotTable. No worries, you can still keep things connected with references! Let’s do it together.

In the sheet with your PivotTable, move the chart out of the way, and then click on cell D3. In cell D3, type =A3. Then drag that formula down to D7 and let go. When I say “drag” I mean hover your mouse over the bottom-right corner of the cell until you see the black cross hair (that little square in the bottom-right is called the “fill handle”) and then pull that. Once you’ve got D3 through D7 filled in, keep that range highlighted, and drag it to the right one column. The whole thing looks like this:

This is a way of copying and pasting the PivotTable without breaking the connection, and now the copied cells aren’t a PivotTable anymore. If you build your chart from these references, and something changes in the dataset, the chart will still update. When you hit the refresh button, the PivotTable will change, which will change the references since they just copy whatever is in the PivotTable, and your chart will change when the references change.

We can see it in action by adding a filter to our PivotTable and watching the references and chart update when we change the filter. Add the “physlim” column to the Filters box of the PivotTable. Change the filter from True to False to both and watch how the references change to match whatever is in the PivotTable, and the chart changes too:


Tables are pretty amazing – I make every dataset I get into a table immediately, I’m yet to find a downside. Tables will make your Excel workbook and work flow more efficient, and having to re-do visualizations and calculations when something changes. Make sure all of your future data structures (PivotTables, references, formulas, visualizations etc.) are all ultimately connected to the original table so that your entire workbook will update with the refresh button.