Useful Python Snippets

The goal of this blog post is a compilation of little tidbits and code snippets that address common issues when programming for data analysis in Python.

General Snippets

Difference between JSON and XML

This page gives a great example of the difference between data in JSON format and XML format. It shows the exact same data in both formats:

Converting scientific notation into numbers

Converting from scientific notation in a Pandas Dataframe:

Remove ellipses from pandas dataframe preview:

pd.options.display.max_columns = 2000

#If you don't want to make the change permanently for the notebook, 
(e.g., to avoid excessive output in other cells), you can also use 

with pd.option_context('display.max_columns', 2000):
#temporarily display all columns
with pd.option_context('display.max_seq_items', None):
    print (df.columns)

Isolate date columns:

datecols2 = []

for item in prod.columns:
    if 'Date' in item:

Add grand total column to a pivot table:

test_df = pd.pivot_table(prod, index="Color", columns="Class", values="ListPrice", aggfunc=np.sum)
test_df['Grand Total'] = test_df.sum(axis=1)

Comparing group by syntax to pivot table syntax:

prod.groupby(['Class', 'Style']).count()[['Name']]
pd.pivot_table(prod, index=['Class', 'Style'], values="Name", aggfunc="count")

Use apply for multiple columns in a dataframe:

avo.apply(lambda row: row.AveragePrice * row['Total Volume'], axis=1)

Choose an argument for the open function (file i/o)

Install packages in Jupyter Notebook

# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pytime

Understanding copying objects in python

These two links are excellent at explaining:

Reverse Dictionary Function

def reverse_dict(lookup_value):
    dictionary = {'george' : 16, 'amber' : 19}
    for key, value in dictionary.items():  
        if value == lookup_value:

What are args and kwargs?

Removing duplicate rows in a dataframe

This is an extremely important pandas doc page! Indexing & slicing dataframes

Selecting & looping through parts of a dataframe

Connect to a mySQL database:

import mysql.connector
# Set up your connection to the database
myConnection = mysql.connector.connect( host=

# Read the results of a SQL query into a pandas data frame.
my_table = pd.read_sql('SELECT * FROM table_name, con=myConnection)

Connect to postgres database:

import psycopg2
connection = psycopg2.connect(user = 
password = "your-password-here-keep-quotes",
host = "your-host-here-keep-quotes",
port = "5432",
database = "your-database-here-keep-quotes")
cursor = connection.cursor()
cursor.execute("SELECT * FROM django_session;")
record = cursor.fetchone()

Connect to the Twitter API

import numpy as np
import pandas as pd
import json
from import json_normalize
import twitter

# You need to replace all the capital words in brackets with your
# ACTUAL keys. The quotation marks stay but the brackets and 
# capital words must go. 
api = twitter.Api(consumer_key='[CONSUMER KEY GOES HERE]',
                  consumer_secret='[CONSUMER SECRET GOES HERE]',
                  access_token_key='[ACCESS TOKEN KEY GOES HERE]',
                  access_token_secret='[ACCESS TOKEN SECRET]')

# Get the tweet data since that last tweet
# The user_id is for Boxplot's timeline, replace it with your 
# own if you'd like!
user_timeline = api.GetUserTimeline(user_id='959273870023905280')
latest_twitter_data_final = pd.DataFrame()

for i in range(len(user_timeline)):
    rowasdf = \ json_normalize(json.loads(json.dumps(user_timeline[i]._json))) \
    latest_twitter_data_final = pd.concat([latest_twitter_data_final, \ 


Loop through a Series and make sure that each subsequent value is greater than or equal to the one before it. If not, set the value equal to the one before it:

previous_value = 0

def previous(current):
   global previous_value
   if current < previous_value:    
       return_value = previous_value
#        previous_value = current
       return_value = current
   previous_value = return_value
   return return_value

Remove white space in a column

df = pd.DataFrame({'a':[' app le ']})
df.a = df.a.str.strip()

Customizing Matplotlib Visualizations

How to customize the range of the x-axis and rotate the tick marks:

awesome_table1 = pd.pivot_table(data, index='DEGFIELD3', 
columns='REGION2', values='CBSERIAL', aggfunc='count')

Also see:

Create reusable settings for a chart:

def my_scatterplot(x_txt, y_txt, df, colorcol):
    df.plot(kind='scatter', x=x_txt, y=y_txt, c=colorcol, colormap='winter', figsize=(10,4), s=10, alpha=.5)
my_scatterplot('Total Bags', 'AveragePrice', avo, 'type_as_num')

Change the size of all charts in a notebook:

# put this at the top of the notebook:
plt.rcParams["figure.figsize"] = [15, 10]

Example of changing colors and marker types in scatterplots:

colors = ['b', 'c', 'y', 'm', 'r']

en = plt.scatter(books_data[books_data['language_code']=='en']
['average_rating'], books_data[books_data['language_code']=='en']
['ratings_count'], marker='x', color=colors[0])

spa = plt.scatter(books_data[books_data['language_code']=='spa']
['average_rating'], books_data[books_data['language_code']=='spa']
['ratings_count'], color=colors[2])

fre  = plt.scatter(books_data[books_data['language_code']=='fre']
['average_rating'], books_data[books_data['language_code']=='fre']
['ratings_count'], marker='o', color=colors[1])

# a  = plt.scatter(random(10), random(10), marker='o', 
# h  = plt.scatter(random(10), random(10), marker='o', 
# hh = plt.scatter(random(10), random(10), marker='o', 
# ho = plt.scatter(random(10), random(10), marker='x', 

          ('English', 'Spanish', 'French'),
          loc='lower left',

Multiple y axes, and forced axis

# This is one data point we're trying to plot
chart1 = sets.groupby('year')['num_parts'].count() 
# This is the other data point we're trying to plot
chart2 = sets.groupby('year')['num_parts'].mean() 
fig, ax = plt.subplots(sharey='col')  

# Create a MatPlotLib figure & subplot
ax2 = ax.twinx() # ax2 shares X axis with the ax Axes object
#  This is what forces scale on the second Y-axis!!
ax2.set_ylim(bottom=0, top=799)  

# graph both of our data sets, one bar, one line, chart1, color='dodgerblue')
ax2.plot(chart2.index, chart2, color='red')

# Set the size of the resulting figure

Other Visualizations

A great tutorial for mapping:

Side by side boxplots with seaborn:

Set x and y axes for seaborn plots:

ax = sns.barplot(x = 'val', y = 'cat',
              data = fake,
              color = 'black')
ax.set(xlabel='common xlabel', ylabel='common ylabel')

Make a word cloud in the shape of a custom image:

# If using Jupyter Notebook, you need to install the 
# wordcloud module like this

import sys
 !{sys.executable} -m pip install wordcloud

# import the libraries needed

from PIL import Image
 import numpy as np
 import pandas as pd
 from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
 import matplotlib.pyplot as plt

#  import your dataset
prod = pd.read_csv('winemag-data-130k-v2.csv')

# import the image mask
wine_mask = np.array("wine_mask.png"))

# generate the word cloud
comment_words = ''

stopwords = set(STOPWORDS)

stopwords.update(["drink", "now", "wine", "flavor", "flavors"])

for val in prod.description.iloc[0:1000]:

    val = str(val)

    tokens = val.split(' ')

for i in range(len(tokens)):
    tokens[i] = tokens[i].lower()

for word in tokens:
    comment_words = comment_words + ' ' + word

wordcloud = WordCloud(background_color="floralwhite", 






plt.figure(figsize = (48,48), facecolor = None)

plt.imshow(wordcloud, interpolation="bilinear")



plt.title("Frequent Words from Tasters - Wine Form",fontsize = 40,color='gray')

Make a single box and whisker plot with Matplotlib:

fig, axs = plt.subplots(1, 1)
 axs.set_title('basic plot')

Multiple box and whisker plots with Matplotlib:

# To make side by side box and whisker plots (in this example, get 
# points for each country, and then make a list of those lists). 
# That is what is passed in to the boxplot function:

u = list(prod[prod['country']=='Italy']['points'])
m = list(prod[prod['country']=='Portugal']['points'])
w = list(prod[prod['country']=='Germany']['points'])
final_list = []

fig7, ax7 = plt.subplots()
ax7.set_title('Multiple Samples with Different sizes')

Pie Chart:

   autopct='%1.0f%%', colors=['skyblue', 'lavender', 'lightpink',
   'lightcyan', 'lemonchiffon', 'mistyrose'])
plt.legend(title = 'Wine Country of Origin', loc='best', bbox_to_anchor=(1, 0, 0.5, 1))
plt.figure(figsize=(360, 250))

Bubble plot tutorial

Calculating correlation with a scatterplot

Installing & Running Jupyter Notebook

Step 1: Install Anaconda

Go to this download webpage on Anaconda’s site. Choose the correct link for your operating system, and then go through the installation process.

Step 2: Prepare a folder for notebooks

Choose or create a folder on your computer where you will store all Jupyter notebook files. Make sure you choose a place where it will be easy to find them later.

Step 3: Start up Jupyter Notebook

  • Either on Mac or PC, you should be able to open up Anaconda the way you’d open any other program on your computer. So on a Mac this would be the Applications folder, and on a PC this would be the Start menu. Then, once Anaconda opens, click the “Launch” button underneath Jupyter Notebook.
  • On a Mac, you can also open up the terminal and type jupyter notebook. This might also work on a PC but there may be a few extra steps, so we recommend going with the option above for PC.
  • Jupyter notebook should now open in a browser.
  • You should see the folder structure of your computer – navigate to the folder where you stored the files in step 2.