Pandas makes it easy to quickly load, manipulate, align, merge, and even visualize data tables directly in Python.
When it comes to working with data in a tabular form, most people reach for a spreadsheet. That’s not a bad choice: Microsoft Excel and similar programs are familiar and loaded with functionality for massaging tables of data. But what if you want more control, precision, and power than Excel alone delivers?
In that case, the open source Pandas library for Python might be what you are looking for. It outfits Python with new data types for loading data fast from tabular sources, and for manipulating, aligning, merging, and doing other processing at scale.
Your first Pandas data set
Pandas is not part of the Python standard library. It’s a third-party project, so you’ll need to install it in your Python runtime with pip install pandas
. Once installed, you can import it into Python with import pandas
.
Pandas gives you two new data types: Series
and DataFrame
. The DataFrame
represents your entire spreadsheet or rectangular data, whereas the Series
is a single column of the DataFrame
. You can also think of the Pandas DataFrame
as a dictionary or collection of Series
objects. You’ll find later that you can use dictionary- and list-like methods for finding elements in a DataFrame
.
You typically work with Pandas by importing data in some other format. A common external tabular data format is CSV, a text file with values separated by commas. If you have a CSV handy, you can use it. For this article, we’ll be using an excerpt from the Gapminder data set prepared by Jennifer Bryan from the University of British Columbia.
To begin using Pandas, we first import the library. Note that it’s a common practice to alias the Pandas library as pd
to save some typing:
import pandas as pd
To start working with the sample data in CSV format, we can load it in as a dataframe using the pd.read_csv
function:
df = pd.read_csv("./gapminder/inst/extdata/gapminder.tsv", sep='t')
The sep
parameter lets us specify that this particular file is tab-delimited rather than comma-delimited.
Once the data’s been loaded, you can peek at its formatting to make sure it’s loaded correctly by using the .head()
method on the dataframe:
print(df.head())
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
Dataframe objects have a shape
attribute that reports the number of rows and columns in the dataframe:
print(df.shape)
(1704, 6) # rows, cols
To list the names of the columns themselves, use .columns
:
print(df.columns)
Index(['country', 'continent', 'year', 'lifeExp',
'pop', 'gdpPercap'], dtype='object')
Dataframes in Pandas work much the same way as dataframes in other languages, such as Julia and R. Each column, or Series
, must be the same type, whereas each row can contain mixed types. For instance, in the current example, the country
column will always be a string, and the year
column is always an integer. We can verify this by using .dtypes
to list the data type of each column:
print(df.dtypes)
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
For an even more explicit breakdown of your dataframe’s types, you can use .info()
:
df.info() # information is written to console, so no print required
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 continent 1704 non-null object
2 year 1704 non-null int64
3 lifeExp 1704 non-null float64
4 pop 1704 non-null int64
5 gdpPercap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
Each Pandas data type maps to a native Python data type:
object
is handled as a Pythonstr
type.int64
is handled as a Pythonint
. Note that not all Pythonint
s can be converted toint64
types; anything larger than (2 ** 63)-1 will not convert toint64
.float64
is handled as a Pythonfloat
(which is a 64-bitfloat
natively).datetime64
is handled as a Pythondatetime.datetime
object. Note that Pandas does not automatically try to convert things that look like dates into date values; you must tell Pandas you want to do this for a specific column.
Pandas columns, rows, and cells
Now that you’re able to load a simple data file, you want to be able to inspect its contents. You could print
the contents of the dataframe, but most dataframes will be too big to inspect by printing.
A better approach is to look at subsets of the data, as we did with df.head()
, but with more control. Pandas lets you make excerpts from dataframes, using Python’s existing syntax for indexing and creating slices.
Extracting Pandas columns
To examine columns in a Pandas dataframe, you can extract them by their names, positions, or by ranges. For instance, if you want a specific column from your data, you can request it by name using square brackets:
# extract the column "country" into its own dataframe
country_df = df["country"]
# show the first five rows
print(country_df.head())
| 0 Afghanistan
| 1 Afghanistan
| 2 Afghanistan
| 3 Afghanistan
| 4 Afghanistan
Name: country, dtype: object
# show the last five rows
print(country_df.tail())
| 1699 Zimbabwe
| 1700 Zimbabwe
| 1701 Zimbabwe
| 1702 Zimbabwe
| 1703 Zimbabwe
| Name: country, dtype: object
If you want to extract multiple columns, pass a list of the column names:
# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]
print(subset.head())
country continent year
| 0 Afghanistan Asia 1952
| 1 Afghanistan Asia 1957
| 2 Afghanistan Asia 1962
| 3 Afghanistan Asia 1967
| 4 Afghanistan Asia 1972
print(subset.tail())
country continent year
| 1699 Zimbabwe Africa 1987
| 1700 Zimbabwe Africa 1992
| 1701 Zimbabwe Africa 1997
| 1702 Zimbabwe Africa 2002
| 1703 Zimbabwe Africa 2007
Subsetting rows
If you want to extract rows from a dataframe, you can use one of two methods.
.iloc[]
is the simplest method. It extracts rows based on their position, starting at 0. For fetching the first row in the above dataframe example, you’d use df.iloc[0]
.
If you want to fetch a range of rows, you can use .iloc[]
with Python’s slicing syntax. For instance, for the first 10 rows, you’d use df.iloc[0:10]
. And if you wanted to obtain the last 10 rows in reverse order, you’d use df.iloc[::-1]
.
If you want to extract specific rows, you can use a list of the row IDs; for example, df.iloc[[0,1,2,5,7,10,12]]
. (Note the double brackets—that means you’re providing a list as the first argument.)
Another way to extract rows is with .loc[]
. This extracts a subset based on labels for rows. By default, rows are labeled with an incrementing integer value starting with 0. But data can also be labeled manually by setting the dataframe’s .index property.
For instance, if we wanted to re-index the above dataframe so that each row had an index using multiples of 100, we could use df.index = range(0, len(df)*100, 100)
. Then, if we used, df.loc[100]
, we’d get the second row.
Subsetting columns
If you want to retrieve only a certain subset of columns along with your row slices, you do this by passing a list of columns as a second argument:
df.loc[[rows], [columns]]
For instance, with the above dataset, if we want to get only the country and year columns for all rows, we’d do this:
df.loc[:, ["country","year"]]
The :
in the first position means “all rows” (it’s Python’s slicing syntax). The list of columns follows after the comma.
You can also specify columns by position when using .iloc
:
df.iloc[:, [0,2]]
Or, to get just the first three columns:
df.iloc[:, 0:3]
All of these approaches can be combined, as long as you remember loc
is used for labels and column names, and iloc
is used for numeric indexes. The following tells Pandas to extract the first 100 rows by their numeric labels, and then from that to extract the first three columns by their indexes:
df.loc[0:100].iloc[:, 0:3]
It’s generally least confusing to use actual column names when subsetting data. It makes the code easier to read, and you don’t have to refer back to the dataset to figure out which column corresponds to what index. It also protects you from mistakes if columns are re-ordered.
Grouped and aggregated calculations
Spreadsheets and number-crunching libraries all come with methods for generating statistics about data. Consider the Gapminder data again:
print(df.head(n=10))
| country continent year lifeExp pop gdpPercap
| 0 Afghanistan Asia 1952 28.801 8425333 779.445314
| 1 Afghanistan Asia 1957 30.332 9240934 820.853030
| 2 Afghanistan Asia 1962 31.997 10267083 853.100710
| 3 Afghanistan Asia 1967 34.020 11537966 836.197138
| 4 Afghanistan Asia 1972 36.088 13079460 739.981106
| 5 Afghanistan Asia 1977 38.438 14880372 786.113360
| 6 Afghanistan Asia 1982 39.854 12881816 978.011439
| 7 Afghanistan Asia 1987 40.822 13867957 852.395945
| 8 Afghanistan Asia 1992 41.674 16317921 649.341395
| 9 Afghanistan Asia 1997 41.763 22227415 635.341351
Here are some examples of questions we could ask about this data:
- What’s the average life expectancy for each year in this data?
- What if I want averages across the years and the continents?
- How do I count how many countries in this data are in each continent?
The way to answer these questions with Pandas is to perform a grouped or aggregated calculation. We can split the data along certain lines, apply some calculation to each split segment, and then re-combine the results into a new dataframe.
Grouped means counts
The first method we’d use for this is Pandas’s df.groupby()
operation. We provide a column we want to split the data by:
df.groupby("year")
This allows us to treat all rows with the same year value together, as a distinct object from the dataframe itself.
From there, we can use the “life expectancy” column and calculate its per-year mean:
print(df.groupby('year')['lifeExp'].mean())
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
This gives us the mean life expectancy for all populations, by year. We could perform the same kinds of calculations for population and GDP by year:
print(df.groupby('year')['pop'].mean())
print(df.groupby('year')['gdpPercap'].mean())
So far, so good. But what if we want to group our data by more than one column? We can do this by passing columns in lists:
print(df.groupby(['year', 'continent'])
[['lifeExp', 'gdpPercap']].mean())
lifeExp gdpPercap
year continent
1952 Africa 39.135500 1252.572466
Americas 53.279840 4079.062552
Asia 46.314394 5195.484004
Europe 64.408500 5661.057435
Oceania 69.255000 10298.085650
1957 Africa 41.266346 1385.236062
Americas 55.960280 4616.043733
Asia 49.318544 5787.732940
Europe 66.703067 6963.012816
Oceania 70.295000 11598.522455
1962 Africa 43.319442 1598.078825
Americas 58.398760 4901.541870
Asia 51.563223 5729.369625
Europe 68.539233 8365.486814
Oceania 71.085000 12696.452430
This .groupby()
operation takes our data and groups it first by year, and then by continent. Then it generates mean values from the life-expectancy and GDP columns. This way, you can create groups in your data and rank how they are to be presented and calculated.
If you want to “flatten” the results into a single, incrementally indexed frame, you can use the .reset_index()
method on the results:
gb = df.groupby(['year', 'continent'])
[['lifeExp', 'gdpPercap']].mean()
flat = gb.reset_index()
print(flat.head())
| year continent lifeExp gdpPercap
| 0 1952 Africa 39.135500 1252.572466
| 1 1952 Americas 53.279840 4079.062552
| 2 1952 Asia 46.314394 5195.484004
| 3 1952 Europe 64.408500 5661.057435
| 4 1952 Oceana 69.255000 10298.085650
Grouped frequency counts
Another thing we often do with data is compute frequencies. The nunique
and value_counts
methods can be used to get unique values in a series, and their frequencies. For instance, here’s how to find out how many countries we have in each continent:
print(df.groupby('continent')['country'].nunique())
continent
Africa 52
Americas 25
Asia 33
Europe 30
Oceana 2
Basic plotting with Pandas and Matplotlib
Most of the time, when you want to visualize data, you’ll use another library such as Matplotlib to generate those graphics. However, you can use Matplotlib directly (along with some other plotting libraries) to generate visualizations from within Pandas.
To use the simple Matplotlib extension for Pandas, first make sure you’ve installed Matplotlib with pip install matplotlib
.
Now let’s look at the yearly life expectancies for the world population again:
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)
| year
| 1952 49.057620
| 1957 51.507401
| 1962 53.609249
| 1967 55.678290
| 1972 57.647386
| 1977 59.570157
| 1982 61.533197
| 1987 63.212613
| 1992 64.160338
| 1997 65.014676
| 2002 65.694923
| 2007 67.007423
| Name: lifeExp, dtype: float64
To create a basic plot from this, use:
import matplotlib.pyplot as plt
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
c = global_yearly_life_expectancy.plot().get_figure()
plt.savefig("output.png")
The plot will be saved to a file in the current working directory as output.png
. The axes and other labeling on the plot can all be set manually, but for quick exports this method works fine.
Conclusion
Python and Pandas offer many features you can’t get from spreadsheets alone. For one, they let you automate your work with data and make the results reproducible. Rather than write spreadsheet macros, which are clunky and limited, you can use Pandas to analyze, segment, and transform data—and use Python’s expressive power and package ecosystem (for instance, for graphing or rendering data to other formats) to do even more than you could with Pandas alone.