Pandas Cheat Sheet

Getting up and running with pandas in Python

13 September, 2022

Contributors

Jason

@jasonmchlee

10,000 hour rule for mastering a skill

In his famous book Outliers, Malcolm Gladwell argued that one needs to dedicate 10,000 hours of practice to truly master a skill. The idea echoes “practice makes perfect,” another popular saying that has been engraved in many people’s minds.

While I certainly agree that continual practice and repetition of a new skill builds consistent habits and performance over time, it is hard to imagine anyone remembering everything at all times. If you have ever done any data analysis in pandas, you know there are essential functions that you simply can’t live without.

Here is a cheat sheet for essential pandas functions.

Getting Started

For this cheat sheet, we will create a small dataframe of grades in various subjects at a school. Let’s start by importing pandas.
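A minimal sketch of the setup might look like the following. The grade values are made up for illustration; only the four subject names come from this article.

```python
import pandas as pd

# Made-up grades for illustration: one row per student, one column per subject.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})
print(grades)
```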

Head

To display the top portion of the data frame, we can use the head function. By default, we will get the top 5 rows. We can pass in a specific number as n if we want another number.
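As a small illustration of head, assuming the same made-up grades as elsewhere in this cheat sheet:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

top_default = grades.head()    # first 5 rows by default
top_three = grades.head(n=3)   # first 3 rows
print(top_three)
```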

Tail

Similarly, tail does the opposite of head: calling it on the dataframe returns the last five rows, and passing n changes the number of rows displayed.
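A short sketch of tail on the same made-up grades:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

last_default = grades.tail()   # last 5 rows by default
last_two = grades.tail(n=2)    # last 2 rows
print(last_two)
```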

Column

If we want only the column names of the dataframe, we can use the .columns attribute.
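For instance, with the made-up grades used in this cheat sheet:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# .columns is an attribute, so there are no parentheses.
column_names = grades.columns
print(column_names)
```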

The output would look like this → Index(['Math', 'Science', 'English', 'History'], dtype='object')

Shape

If you’re working with large data frames and it is difficult to count the number of rows or columns manually, we can use the .shape attribute to find the dimensions as a (rows, columns) tuple.
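A quick sketch, again assuming the made-up grades:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# .shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = grades.shape
print(n_rows, n_cols)  # 8 4
```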

Info

Before starting any analysis, it is important to understand what data types you are working with. Using .info(), you can get an overview of each column’s data type and its number of non-null values.

In our example, all the data will be type integer since we are working with whole numbers for each grade.
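The call itself is a one-liner. The sketch below assumes the same made-up grades and also prints the dtypes directly:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# Prints the index range, column names, non-null counts, and dtypes.
grades.info()

# The dtypes alone are also available as a Series.
print(grades.dtypes)
```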

Describe

As well as .info(), you’ll most likely want to know the descriptive statistics of your dataframe. Using .describe(), you can observe the count, mean, standard deviation, min, quartiles, and max. In this example, I chained on .round(2) to clean up the output.
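A sketch of that chained call on the made-up grades:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# One row per statistic, one column per subject, rounded to 2 decimals.
summary = grades.describe().round(2)
print(summary)
```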

Quantiles

Using the .describe() function, we automatically got the quantiles at 25%, 50%, and 75%. We can also state our own quantiles. Below I have selected 10%, 40%, and 70%. Note that we can pass in as many quantiles as we like.
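One way to request those three quantiles, assuming the made-up grades:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# Quantiles are given as fractions between 0 and 1.
custom_quantiles = grades.quantile([0.1, 0.4, 0.7])
print(custom_quantiles)
```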

Mean, Standard Deviation, Variance, Count, Median, Min, and Max

We can use a variety of functions on the dataframe to get an aggregated result for every column. You can also select a single column first to retrieve the value for that column only.
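A few of these aggregations, sketched on the made-up grades:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# Called on the dataframe, each function returns one value per column.
print(grades.mean())
print(grades.std())
print(grades.var())
print(grades.count())

# Called on a single column, each function returns a single value.
math_mean = grades["Math"].mean()      # 79.375
math_median = grades["Math"].median()  # 82.0
math_min = grades["Math"].min()        # 64
math_max = grades["Math"].max()        # 93
```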

Renaming Columns

We can rename columns with rename, for example to change the column names to upper case.

Both pieces of code above will return the same result. Using the argument inplace=True, we tell pandas to modify the dataframe itself and keep the result. If we don’t state inplace=True, rename returns a new dataframe, so we have to use = to assign the result back to the grades dataframe.
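The two variants might look like this. It is a sketch on the made-up grades, using two copies so that both approaches can be compared side by side; passing str.upper as the mapper is one of several ways to do it.

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# Variant 1: rename returns a new dataframe, so assign it back with =.
by_assignment = grades.copy()
by_assignment = by_assignment.rename(columns=str.upper)

# Variant 2: inplace=True modifies the dataframe itself and returns None.
in_place = grades.copy()
in_place.rename(columns=str.upper, inplace=True)

print(by_assignment.columns)
```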

Index Subsetting - iloc

Using iloc, short for integer location, we can select a subset of the dataframe based on index position. There are many ways we can index, so it is important to understand the different variations of using iloc below. The first part of the iloc represents the rows, and the second part represents the columns.

If you didn’t know this already, indexing in Python starts at 0, meaning the first element is at index 0, NOT index 1.

Notice we included a : in some situations. The : is the slice symbol: the number on its left is the starting position, and the number on its right is the end position, which is not included. For example, grades.iloc[:, 1:3] is broken into two parts.
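A few common iloc variations, sketched on the made-up grades:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

first_row = grades.iloc[0]           # first row, returned as a Series
first_three_rows = grades.iloc[0:3]  # rows 0, 1, and 2
first_column = grades.iloc[:, 0]     # all rows, first column
middle_columns = grades.iloc[:, 1:3] # all rows, columns 1 and 2
single_value = grades.iloc[2, 1]     # row 2, column 1 -> 88

print(middle_columns)
```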

The first : → all the rows, since there is no start index preceding it and no end index following it.

The second 1:3 → from column 1 up to, but NOT including, column 3. Meaning the result will only return columns 1 and 2 for this example.

Location Subsetting - loc

On the opposite side of subsetting, we have label-based subsetting with loc. Here we reference the names (labels) of the rows and columns we want rather than their index positions.

For our data, only the columns have meaningful labels; the rows just carry the default integer index. If we had row labels, we could reference them as well. We will see this example further down the post.
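A sketch of loc on the made-up grades. Unlike iloc, label slices in loc include the end label.

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# All rows, a list of column labels.
math_and_history = grades.loc[:, ["Math", "History"]]

# All rows, a slice of column labels (the end label is included).
science_to_history = grades.loc[:, "Science":"History"]

# Row labels 0 through 2 (inclusive) of the Math column.
first_math_grades = grades.loc[0:2, "Math"]

print(science_to_history)
```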

Sort Values

If we want to sort our dataframe by a specific column, we can use sort_values(). By default, values are sorted in ascending order; we can pass ascending=False to change it to descending.
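For example, sorting the made-up grades by the Math column in both directions:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

lowest_first = grades.sort_values(by="Math")                    # ascending (default)
highest_first = grades.sort_values(by="Math", ascending=False)  # descending

print(highest_first)
```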

Concat

Using concat, we can merge two data frames together based on an axis. For example, we can add new values to our dataframe in two scenarios.

Add more rows — axis = 0

Add more columns — axis = 1

First, we created a second data frame with additional grade values called grades2. Then we concat the new dataframe onto the old one. We state the axis as 0 because we are adding more rows. If we were adding an entirely new subject such as Geography, then we would want to use the axis = 1.
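The steps above can be sketched as follows. All values, including the Geography column, are made up for illustration.

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# A second dataframe with two more rows of made-up grades.
grades2 = pd.DataFrame({
    "Math": [77, 75],
    "Science": [81, 79],
    "English": [90, 68],
    "History": [72, 91],
})

# axis=0 stacks the rows; reset_index(drop=True) renumbers them 0..9.
more_rows = pd.concat([grades, grades2], axis=0).reset_index(drop=True)

# axis=1 places a new column next to the existing ones, aligned on the index.
geography = pd.DataFrame({"Geography": [70, 82, 91, 65, 88, 79, 73, 95]})
more_columns = pd.concat([grades, geography], axis=1)

print(more_rows)
```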

If we didn’t have the reset_index added, the index labels would restart at 0 for the appended rows instead of continuing. Below is what that would look like.

Using reset_index(drop=True), we are doing two things. First, we reset the index so that it runs in a continuous sequence again. Second, we pass drop=True because reset_index would otherwise keep the old index as an additional column, which we don’t need, so we drop it.
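A small sketch of the difference, using shortened made-up data with only two subjects to keep the output compact:

```python
import pandas as pd

# Shortened, made-up grades for illustration.
grades = pd.DataFrame({"Math": [80, 89, 93], "Science": [94, 76, 88]})
grades2 = pd.DataFrame({"Math": [77, 75], "Science": [81, 79]})

# Without reset_index, each dataframe keeps its own index labels.
stacked = pd.concat([grades, grades2], axis=0)
print(stacked.index.tolist())  # [0, 1, 2, 0, 1]

# reset_index() without drop=True keeps the old index as a column named "index".
with_old_index = stacked.reset_index()
print(with_old_index.columns.tolist())  # ['index', 'Math', 'Science']

# reset_index(drop=True) renumbers the rows and discards the old index.
clean = stacked.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1, 2, 3, 4]
```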

Boolean Indexing

A useful technique in pandas is called boolean indexing. It is essentially a filtering technique to find values based on a boolean condition (i.e., True or False). The general syntax for a boolean index is as follows.
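In its simplest form, a boolean index might look like this. The sketch uses the made-up grades, and the threshold of 80 is arbitrary.

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# The inner expression is a Series of True/False values, one per row.
is_high_math = grades["Math"] > 80

# Indexing with that Series keeps only the rows where it is True.
high_math = grades[grades["Math"] > 80]
print(high_math)
```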

With boolean indexing, we need to restate the dataframe inside the square brackets.

Looking above, we can combine multiple boolean conditions to narrow the filter further, using & and |. The & (AND) will return a row only if BOTH conditions are met. The | (OR) will return a row if either condition is satisfied. Each condition must be wrapped in its own parentheses.

Observe the code above. We can see that adding iloc or a column name returns only the selected columns for the rows that meet the conditions.
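Combined conditions could be written as in this sketch on the made-up grades, with arbitrary thresholds:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# AND: rows where Math and Science are both above 80.
both = grades[(grades["Math"] > 80) & (grades["Science"] > 80)]

# OR: rows where Math or History is above 90.
either = grades[(grades["Math"] > 90) | (grades["History"] > 90)]

print(both)
print(either)
```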
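A filter can also be followed by a column selection. This sketch on the made-up grades shows a column name, iloc, and loc used for that purpose:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

high_math = grades["Math"] > 80

# Filter the rows, then pick one column by name.
english_only = grades[high_math]["English"]

# Filter the rows, then pick the first two columns by position.
by_position = grades[high_math].iloc[:, 0:2]

# loc accepts the boolean condition and the column labels in one step.
by_label = grades.loc[high_math, ["Math", "English"]]

print(by_label)
```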

Subsetting Columns

You can subset columns in pandas as a series or a dataframe.

A Series is a one-dimensional labeled array in pandas that can hold integers, strings, floats, and more. Each value has an index label, which by default runs from 0 to n - 1, where n is the number of values. A Series holds a single column of values with its index, whereas a dataframe is made up of one or more Series that share an index, so we can think of a dataframe as a collection of Series that can be used to analyze the data.
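The difference comes down to single versus double square brackets, as this sketch on the made-up grades shows:

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

math_series = grades["Math"]             # single brackets -> Series
math_frame = grades[["Math"]]            # double brackets -> one-column dataframe
two_columns = grades[["Math", "English"]]  # list of names -> dataframe

print(type(math_series).__name__)  # Series
print(type(math_frame).__name__)   # DataFrame
```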

Adding Columns

Adding more columns to a dataframe is as simple as creating a new column name and setting the values equal to it.
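For example, a Student column could be added as below. The names are hypothetical placeholders, and the list must have one entry per row.

```python
import pandas as pd

# Made-up grades for illustration.
grades = pd.DataFrame({
    "Math": [80, 89, 93, 66, 84, 85, 74, 64],
    "Science": [94, 76, 88, 78, 88, 92, 60, 85],
    "English": [83, 76, 93, 96, 77, 85, 92, 60],
    "History": [96, 66, 76, 85, 78, 88, 69, 99],
})

# Hypothetical student names, one per row; the new column is appended last.
grades["Student"] = ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay", "Gus", "Hui"]

print(grades.columns.tolist())
```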

Reorder Columns

In the example above, it would make more sense to have Student as the first column. To reorder the columns, we can create a list of column names in the order we want and index the dataframe with it.
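A sketch of the reordering, using shortened made-up data with hypothetical student names:

```python
import pandas as pd

# Shortened, made-up grades with a hypothetical Student column at the end.
grades = pd.DataFrame({
    "Math": [80, 89, 93],
    "Science": [94, 76, 88],
    "Student": ["Ana", "Ben", "Cho"],
})

# List the columns in the desired order, then index the dataframe with it.
column_order = ["Student", "Math", "Science"]
grades = grades[column_order]

print(grades)
```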

Pivot Table

A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.

Pivot tables are a technique in data processing. They arrange and rearrange (or “pivot”) statistics to draw attention to useful information.
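A minimal pivot table sketch follows. The Class and Gender columns and all values are made up for illustration, and the mean is just one possible aggregation.

```python
import pandas as pd

# Made-up data with hypothetical Class and Gender columns.
grades = pd.DataFrame({
    "Class": ["A", "A", "B", "B", "A", "B"],
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Math": [80, 90, 70, 60, 90, 90],
    "Science": [85, 75, 95, 65, 75, 85],
})

# Group by Class first, then Gender, and average each subject.
pivot = pd.pivot_table(
    grades,
    index=["Class", "Gender"],
    values=["Math", "Science"],
    aggfunc="mean",
)
print(pivot)
```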

We can see we grouped the values first by their Class, then by Gender. This provides an aggregated result of the dataframe.

Group By

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Grouped by Gender with average values
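Grouping by Gender and averaging could be sketched like this, again with a made-up Gender column and made-up values:

```python
import pandas as pd

# Made-up data with a hypothetical Gender column.
grades = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Math": [80, 90, 70, 60, 90, 90],
    "Science": [85, 75, 95, 65, 75, 85],
})

# Split by Gender, take the mean of each subject, combine into one table.
by_gender = grades.groupby("Gender")[["Math", "Science"]].mean()
print(by_gender)
```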

The great thing about the groupby function in pandas is that we can apply several aggregate functions at once using the agg function.
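For instance, several statistics for one subject can be requested in a single call. This is a sketch on made-up data with a hypothetical Gender column.

```python
import pandas as pd

# Made-up data with a hypothetical Gender column.
grades = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Math": [80, 90, 70, 60, 90, 90],
    "Science": [85, 75, 95, 65, 75, 85],
})

# agg takes a list of function names and returns one column per function.
math_summary = grades.groupby("Gender")["Math"].agg(["mean", "min", "max"])
print(math_summary)
```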

Overview

I hope this provides you with either new tools to use in pandas or refreshes your memory on what you already know. Remember to keep practicing, and if you reach the so-called “Master” level that Malcolm Gladwell described, perhaps one day you won’t need a cheat sheet to reference.

Connect with me on LinkedIn or GitHub
