More efficient data exploration in Pandas (I)

If you’re like me and come to working with data with more of a hacking mentality, you may be tempted to stick with whatever tools get the most immediate thing “just done”.
But in my experience you need to build and refine good procedures that you can adapt to any data problem. The needs and pitfalls here are more repeatable (at the low level, at least) than when attacking complex software tasks.
What we’ll cover today:
This is part of a practical map of how I might use Pandas to explore data on an example dataset. At this point we want to see what data is present, what the relationships between variables might be, and whether everything looks legitimate and realistic. This is very important in practice, where you don’t get clean classroom datasets. There can be errors and misunderstandings that need to be cleared up before your actual analysis.
The post is intended for people who have maybe interacted with Pandas once or twice, but don’t feel completely at ease with it. Or people who find themselves working around the library instead of making use of what it offers natively.
I encourage you to:
- Go through the official 10 minutes to Pandas guide if you haven’t yet – and keep it handy. It’s very good at showing what the library can do and gently guiding you towards the intended ways of doing things.
- As you go through this post, try the snippets with different columns and arguments – it builds memory and mechanical efficiency. When doing real work, it’s good to be able to focus on the problem and not on wrangling with Pandas.
Our example dataset
For the purposes of demonstration, we will use the dataset of “Walrus Haulout and In-water Activity Levels Relative to Sea Ice Availability in the Chukchi Sea: 2008-2014”. You can download the “CSV file” there. It’s really a ZIP archive that you need to extract.
(Here are the metadata provided by the scientists; scroll down to Entity_and_Attribute_Information near the bottom to see what columns there are and what they’re supposed to represent. It’s good to have this description!)
Go to the directory where you extracted the archive and load the dataset as a Pandas DataFrame:
import pandas as pd
df = pd.read_csv('WalrusCompareBehaviorData_ChukchiSea_2008-2014.csv')
Columns in the dataframe
Let’s confirm which columns are present in the dataset:
df.columns
Index(['Wet', 'Forage', 'LocalHour', 'LagFor', 'LagSwm', 'LagDry', 'IndexWal',
'IndexTime', 'SeaIce', 'WSkmh', 'TempC', 'biomass', 'Land'],
dtype='object')
How Pandas works (Series -> DataFrames representation)
To get some grip on how Pandas works:
- Each column in the DataFrame is a Series (you can verify this by looking at the output of type(df['Wet']), for example; see the short sketch below). A Series contains a sequence of values. It is saved in the dataframe under some column name, but can also exist separately.
- There is also something called an Index. An index takes care of the order in which the values occur in a Series or a DataFrame. You don’t often have to worry about it: just know that it’s needed and you can change it if you want.
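To make this concrete, here’s a quick sketch you can run in the REPL (the comments describe what I’d expect to see on this dataset; nothing here modifies df):
type(df['Wet'])              # pandas.core.series.Series
df['Wet'].index              # the same RangeIndex that the whole dataframe uses
pd.Series([3.5, 2.1, 7.0])   # a Series can also exist on its own, with a fresh default index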
Extracting views of data
To make sense of data we often need to split it into meaningful subsets.
We need to establish which variables influence which, and whether some subsets need to be treated differently. For example, in a dataset of lions, we have to make sure we don’t (at least unknowingly) draw the same conclusions for the sea lions.
Indices and index ranges
For pure index lookups, use iloc. For anything fancier, use loc.
Detailed view of the row at position 10 in the dataframe:
df.iloc[10]
Rows 10 to 99:
df.iloc[10:100]
Indices + column selection
Rows 10 to 100, but only some columns. Note that, unlike iloc, loc slicing includes the endpoint. (You can ask for a list of columns, as below, or a single column.)
df.loc[10:100, ['LocalHour', 'Forage', 'IndexWal']]
All rows, some columns:
df.loc[:, ['LocalHour', 'Forage', 'IndexWal']]
# When filtering only by columns, you can also just do this:
df[['LocalHour', 'Forage', 'IndexWal']]
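If you want to convince yourself of the endpoint difference mentioned above, compare the lengths of the two slices (a small check; the exact counts assume the default integer index hasn’t been changed):
len(df.iloc[10:100])   # 90 rows: iloc slices by position and excludes the endpoint
len(df.loc[10:100])    # 91 rows: loc slices by label and includes the endpoint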
Indices… but as vectors and conditions
You can replace explicit numeric ranges in loc with any specification of a sequence of indices.
Not interesting, but works (getting rows 0, 2, 3):
df.loc[[0, 2, 3]]
Another method is using a sequence of true/false values, for example, checking whether TempC is negative:
df.loc[df['TempC'] < 0]
If you look at what df['TempC'] < 0 looks like, it’s indeed just a pandas Series of boolean values, telling us whether the condition is true for the given value:
>>> df['TempC'] < 0
0 True
1 True
2 True
3 True
4 True
...
46996 False
46997 False
46998 False
46999 False
47000 False
Name: TempC, Length: 47001, dtype: bool
See also:
- .isin(…) for getting such a Series, but filtering for values that belong to a list,
- and .isnull() and .isna() (two aliases of the same method) for filtering for missing values; see the sketches after this list.
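A couple of quick sketches of these; the columns and values I’m filtering on are arbitrary, chosen only for illustration:
df.loc[df['LocalHour'].isin([1, 4, 7])]   # keep only the night and early-morning observations
df.loc[df['TempC'].isnull()]              # rows where the temperature is missing, if there are any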
Avoiding the gotcha in combining conditions
When building a conjunction of conditions, you may be surprised that this doesn’t work (you get “invalid syntax”):
df.loc[df['TempC'] < 0 && df['Forage'] == 0]
You need to use & instead of && and enclose your conditions in parentheses to avoid weird behavior:
df.loc[(df['TempC'] < 0) & (df['Forage'] == 0)]
Look into the query method if you want to be more concise with these constructs.
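For example, the temperature-and-foraging filter above could be rewritten like this (a sketch; query parses the condition from a string, so you don’t need the df[...] repetition or the parentheses):
df.query('TempC < 0 and Forage == 0')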
Frequencies and averages
Hopefully now you can extract any partial or conditional view that you want from the data.
So you’ve got the rows. What should you do with them? Examining them manually is one thing, but there are ways to get useful statistics that summarize the data.
Getting basic descriptive statistics
The .describe() method of DataFrames is good for this:
df.describe()
Wet Forage LocalHour LagFor ... WSkmh TempC biomass Land
count 47001.000000 47001.000000 47001.000000 47001.000000 ... 47001.000000 47001.000000 47001.000000 47001.000000
mean 0.795430 0.584115 11.529010 0.584519 ... 17.705407 1.657657 2.378306 0.208506
std 0.403391 0.492879 6.866782 0.492810 ... 10.056416 2.016787 0.421618 0.406245
min 0.000000 0.000000 1.000000 0.000000 ... 0.062650 -10.879832 1.073674 0.000000
25% 1.000000 0.000000 7.000000 0.000000 ... 10.170809 0.605898 2.082655 0.000000
50% 1.000000 1.000000 13.000000 1.000000 ... 16.011138 1.312181 2.426983 0.000000
75% 1.000000 1.000000 19.000000 1.000000 ... 23.182862 2.325118 2.638850 0.000000
max 1.000000 1.000000 22.000000 1.000000 ... 71.231288 15.679251 3.679031 1.000000
[8 rows x 13 columns]
This gives you the mean, the standard deviation and the quartiles of the data. In case you forgot, a quartile is the value that is higher than 25%, 50% or 75% of the samples that we have.
But as you can see, the output is too wide, some data gets truncated and it’s hard to focus on.
You can use the selection methods (as we’ve discussed above) to narrow down the subset on which
the statistics are computed. To describe only the TempC column:
df['TempC'].describe()
# or equivalently:
df.loc[:, 'TempC'].describe()
To describe the LocalHour and biomass columns for rows with negative temperature:
df.loc[df['TempC'] < 0, ['LocalHour', 'biomass']].describe()
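If the default quartiles are not enough, .describe() also accepts a percentiles argument; a sketch (the 10th and 90th percentiles here are just an example choice):
df['TempC'].describe(percentiles=[0.1, 0.9])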
What values are present? (the frequency list)
When you have variables that are categorical (that is, they can have only a finite number of values that are often predetermined), you want to know:
- which values actually occur,
- and with what frequency.
In the natural language processing space, this is known as the frequency list: the list of words occurring in a text, together with how often each one occurs. And you can see how this relates to data frames by imagining that we fill the first word position, second word position (and so on) variables with some words taken from a vocabulary to create a sentence:
FirstWord SecondWord ThirdWord FourthWord
0 We like ice cream
The .value_counts() method will show you which values are present for each column and how often they occur. Here we can look at the hours when the measurements for walruses were taken:
df['LocalHour'].value_counts()
16 5909
13 5904
22 5892
19 5891
10 5881
7 5862
4 5840
1 5822
Name: LocalHour, dtype: int64
So the observations were made every three hours, and we have a very similar number of observations from each of these periods.
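If you’d rather see shares of the total than raw counts, .value_counts() takes a normalize argument (a quick sketch):
df['LocalHour'].value_counts(normalize=True)   # fractions summing to 1 instead of counts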
And now let’s take the stats for the Forage, Wet and LocalHour columns; as you can see, you are really shown the frequencies of combinations of values:
df[['Forage', 'Wet', 'LocalHour']].value_counts()
Forage Wet LocalHour
1 1 1 3554
4 3550
7 3533
22 3475
19 3458
16 3341
10 3302
13 3241
0 0 16 1446
13 1431
1 10 1332
4 1323
0 19 1301
1 7 1293
22 1278
0 10 1247
1 13 1232
1 1220
0 22 1139
1 19 1132
16 1122
0 1 1048
7 1036
4 967
dtype: int64
Now, to understand what just happened here, let’s verify how many rows we have with the Forage = 0, Wet = 1, LocalHour = 10 combination (using the query method that I briefly mentioned):
len(df.query('Forage == 0 & Wet == 1 & LocalHour == 10'))
1332
This is the same number you see in the .value_counts() output when finding (going from the top) the row for 0 on Forage, then, going down from that place, for 1 on Wet and for 10 on LocalHour.
One important takeaway here is that walruses were always in the water when foraging. This is logical and probably something we may know from biology. Aside from that, the animals are out of the water (Wet == 0) most often around 16:00 (4:00 PM), and least often around 4:00 AM.
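As a sanity check of the “walruses only forage in the water” observation, I’d count the rows that would contradict it (a sketch; on this dataset I’d expect zero):
len(df.query('Forage == 1 & Wet == 0'))   # should be 0 if foraging never happens out of the water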
Taking a look at column relationships: pivot tables
But if you want to see relationships between variables, a handier tool may be pivot tables.
For example, to see the average probability (loosely speaking) of seeing a walrus in the water in any given hour, you could construct a pivot table on the Wet variable, indexing on LocalHour:
df.pivot_table('Wet', index='LocalHour')
Wet
LocalHour
1 0.819993
4 0.834418
7 0.823269
10 0.787961
13 0.757622
16 0.755289
19 0.779155
22 0.806687
The conclusions are similar: the probability of being out of the water (i.e. not Wet) is lowest at 4:00 AM, and highest at 16:00. Note that the rows are ordered by the LocalHour index, not by the values.
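If you do want the table ordered by the values rather than by the index, you can sort the result yourself; a sketch:
df.pivot_table('Wet', index='LocalHour').sort_values('Wet')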
But pivot tables are used more often on three variables at a time. Let’s see how the probability of being in the water depends on both the LocalHour and SeaIce variables:
df.pivot_table('Wet', index='LocalHour', columns='SeaIce')
SeaIce            0         1
LocalHour
1 0.932233 0.773818
4 0.946439 0.788457
7 0.957697 0.768269
10 0.959436 0.718182
13 0.955477 0.677150
16 0.952103 0.675006
19 0.950117 0.709577
22 0.942488 0.751433
We discover that, in fact, whether there is ice in the sea influences walruses more than the hour of the day.
Note that you are not forced to use averages in your pivot tables. There is the aggfunc argument to the .pivot_table(...) method, where you can put any function operating on the data series in question. (See the docs to read about even wilder stuff that’s possible.)
In the example below, we compute the minimum temperature at which each walrus (IndexWal) was observed either foraging or not foraging:
df.pivot_table('TempC', index='IndexWal', columns='Forage', aggfunc=min)
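aggfunc also accepts a list of functions if you want several aggregations side by side; a sketch showing the minimum and maximum temperature per walrus and foraging state:
df.pivot_table('TempC', index='IndexWal', columns='Forage', aggfunc=['min', 'max'])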
Making continuous variables categorical: .cut()
It is possible to look at the frequencies in continuous variables in a similar way. We do this by introducing bins, or ranges of values. We can then analyze them as if they were categorical, or in other words discrete.
The standard way to do this in Pandas is with the .cut() function. It returns a Series with the original values replaced with categories. You can tell it how many bins you want created:
wind_bins = pd.cut(df['WSkmh'], bins=10)
And here’s what wind_bins looks like in the Python REPL command line interface:
0 (14.296, 21.413]
1 (14.296, 21.413]
2 (14.296, 21.413]
3 (21.413, 28.53]
4 (21.413, 28.53]
...
46996 (14.296, 21.413]
46997 (7.18, 14.296]
46998 (14.296, 21.413]
46999 (14.296, 21.413]
47000 (14.296, 21.413]
Name: WSkmh, Length: 47001, dtype: category
Categories (10, interval[float64]): [(-0.00852, 7.18] < (7.18, 14.296] < (14.296, 21.413] <
(21.413, 28.53] ... (42.764, 49.881] < (49.881, 56.998] <
(56.998, 64.114] < (64.114, 71.231]]
To look up which bins were created:
wind_bins.cat.categories
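Instead of asking for a number of equal-width bins, you can also pass explicit bin edges and, optionally, your own labels. The edges and names below are made up for illustration; they just need to cover the range of WSkmh values:
pd.cut(df['WSkmh'], bins=[0, 10, 20, 40, 80], labels=['calm', 'breezy', 'windy', 'stormy'])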
You can also use the now-familiar .value_counts() method to get the frequency list:
wind_bins.value_counts()
(7.18, 14.296] 14180
(14.296, 21.413] 12764
(21.413, 28.53] 7486
(-0.00852, 7.18] 5901
(28.53, 35.647] 3765
(35.647, 42.764] 1903
(42.764, 49.881] 687
(49.881, 56.998] 249
(56.998, 64.114] 58
(64.114, 71.231] 8
Name: WSkmh, dtype: int64
If you want, you can add this categorical Series as a new column to the DataFrame, like so:
df['WindBins'] = wind_bins
Now the data frame also contains the WindBins column.
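Since WindBins is now just another column, you can use it anywhere a categorical variable would go, for example in a pivot table (a sketch; depending on your Pandas version you may see a warning about how empty categories are handled):
df.pivot_table('Wet', index='WindBins')   # average probability of being in the water per wind-speed bin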
Plain histogram
There’s also a simpler, graphical way to represent the frequencies of bins (ranges) of values: the histogram. Here’s how to generate one quickly in Pandas; we choose 10 bins for this example.
import matplotlib.pyplot as plt
df['WSkmh'].hist(bins=10)
plt.show()
In reality, the example here uses the default Matplotlib backend for drawing the chart. That’s why you have to import the additional module and call the show() function.
You can use many plotting libraries with Pandas, such as Seaborn, Plotly and Altair.
Creating and analyzing a lot of plots is another crucial step in data exploration. I will try to condense efficient ways of doing this in another post: stay tuned.
Cover image: https://commons.wikimedia.org/wiki/File:Walruses_on_ice_floes.jpg by Kirill.uyutnov; licensed CC BY-SA