If you’re like me and came into working with data with more of a hacking mentality, you may be tempted to stick with whatever tools will “just get” the most immediate thing done.
But my experience says that you need to build and refine good procedures that you can adapt to all data problems. The needs and pitfalls here are more repeatable (at the low level, at least) than in attacking complex software tasks.
What we’ll cover today:
This is part of a practical map of how I might use Pandas to explore an example dataset. At this point we want to see what data is present, what the relationships between variables may be, and whether everything looks legitimate and realistic. This is very important in practice, where you don’t get clean classroom datasets: there can be errors and misunderstandings that will need to be cleared up before your actual analysis.
The post is intended for people who have maybe interacted with Pandas once or twice, but don’t feel completely at ease with it. Or for people who find themselves working around the library instead of making use of what it offers natively.
I encourage you to:
- Go through the official 10 minutes to Pandas guide if you haven’t yet – and keep it handy. It’s very good at showing what the library can do and gently guiding you towards the intended ways of doing things.
- If you are going through this post, try the snippets with different columns and arguments – this builds memory and mechanical efficiency. When doing real work, it’s good to be able to focus on the problem and not on wrangling with Pandas.
Our example dataset
For the purposes of demonstration, we will use the dataset “Walrus Haulout and In-water Activity Levels Relative to Sea Ice Availability in the Chukchi Sea: 2008-2014”. You can download the “CSV file” – which is really a ZIP archive that you need to extract.
(Here are the metadata provided by the scientists; scroll down to Entity_and_Attribute_Information near the bottom to see what columns there are and what they’re supposed to represent. It’s good to have this description!)
Go to the directory where you extracted the archive and load the dataset:

import pandas as pd

df = pd.read_csv('WalrusCompareBehaviorData_ChukchiSea_2008-2014.csv')
Columns in the dataframe
Let’s confirm which columns are present in the dataset:
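One way to do this is to inspect the columns attribute of the dataframe (demonstrated here on a tiny stand-in frame so the snippet runs on its own; with the walrus data you would call it on the real df):

```python
import pandas as pd

# A tiny stand-in frame; `df.columns` works the same way on any dataframe.
df = pd.DataFrame({'Wet': [1, 0], 'TempC': [-1.2, 0.5]})

print(df.columns)  # an Index object listing the column names
```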
Index(['Wet', 'Forage', 'LocalHour', 'LagFor', 'LagSwm', 'LagDry', 'IndexWal', 'IndexTime', 'SeaIce', 'WSkmh', 'TempC', 'biomass', 'Land'], dtype='object')
How Pandas works
To get some grip on how Pandas works:
- Each column in the dataframe is a Series. (You can verify this by looking at the output of type() called on a column.) A Series contains a sequence of values. It is saved in the dataframe under some column name, but can also exist separately.
- There is also something called an Index. An index takes care of the order in which the values occur in a Series or a DataFrame. You don’t often have to worry about it: just know that it’s needed and you can change it if you want.
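A quick illustration of these two concepts, on a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'TempC': [-1.2, 0.5], 'Wet': [1, 0]})

# Each column is a Series...
col = df['TempC']
print(type(col))    # <class 'pandas.core.series.Series'>

# ...which can also exist on its own, outside any dataframe.
standalone = pd.Series([3.0, 4.0], name='TempC')

# Both the dataframe and each Series carry an Index that fixes the row order.
print(col.index)    # RangeIndex(start=0, stop=2, step=1)
```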
Extracting views of data
To make sense of data we often need to split it into meaningful subsets.
We need to establish which variables have influence on others, and whether we need to treat some subsets differently. For example, in a dataset of lions, we have to make sure we don’t (at least unknowingly) draw conclusions the same way for the sea lions.
Indices and index ranges
For pure positional index lookups, use iloc. For anything fancier, there is loc.
Detailed view of the 10th row in the dataframe:
Rows 10 to 99:
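The code for these two views was likely along these lines (a sketch on a stand-in frame with the default integer index; note that loc slices include the end label):

```python
import pandas as pd

# Stand-in frame with 200 rows and a default RangeIndex.
df = pd.DataFrame({'TempC': range(200)})

row = df.iloc[10]      # detailed view of a single row, by position
chunk = df.loc[10:99]  # rows labeled 10 through 99 -- 90 rows, end inclusive
```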
Indices + column selection
Rows 10 to 100 (note that loc ranges include both endpoints), but only some columns (you can ask for a list of columns, as below, or one column):
df.loc[10:100, ['LocalHour', 'Forage', 'IndexWal']]
All rows, some columns:
df.loc[:, ['LocalHour', 'Forage', 'IndexWal']]

# When filtering only by columns, you can also just do this:
df[['LocalHour', 'Forage', 'IndexWal']]
Indices… but as vectors and conditions
You can replace explicit numeric ranges in loc with any specification of a sequence of indices.
Not interesting, but works (getting rows 0, 2, 3):
df.loc[[0, 2, 3]]
Another method is using a sequence of true/false values – for example, checking if the temperature is below zero:
df.loc[df['TempC'] < 0]
If you look at what df['TempC'] < 0 evaluates to, it’s indeed just a pandas Series of boolean values, telling us whether the condition is true for the given value:
>>> df['TempC'] < 0
0         True
1         True
2         True
3         True
4         True
         ...
46996    False
46997    False
46998    False
46999    False
47000    False
Name: TempC, Length: 47001, dtype: bool
There are also other helpful methods for getting such a Series:
- .isin(), for filtering for values that belong to a list,
- and .isnull() and .isna(), for filtering for missing and invalid values.
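For instance (on a small stand-in frame, where None stands for a missing measurement):

```python
import pandas as pd

df = pd.DataFrame({'LocalHour': [1, 4, 7, 10],
                   'TempC': [-1.0, None, 2.0, 3.5]})

night = df.loc[df['LocalHour'].isin([1, 4])]  # rows whose hour is in the list
missing = df.loc[df['TempC'].isna()]          # rows with a missing temperature
```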
Avoiding the gotcha in combining conditions
When building a conjunction of conditions, you may be surprised that this doesn’t work (you get “invalid syntax”):
df.loc[df['TempC'] < 0 && df['Forage'] == 0]
You need to use & instead of &&, and enclose your conditions in parentheses (since & binds more tightly than comparisons) to avoid weird behavior:
df.loc[(df['TempC'] < 0) & (df['Forage'] == 0)]
Look into the .query() method if you want to be more concise with these constructs.
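For example, the conjunction above can be expressed as a single query string (sketched on a stand-in frame):

```python
import pandas as pd

df = pd.DataFrame({'TempC': [-1.0, -2.0, 3.0], 'Forage': [0, 1, 0]})

# Same filter as df.loc[(df['TempC'] < 0) & (df['Forage'] == 0)]
subset = df.query('TempC < 0 and Forage == 0')
```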
Frequencies and averages
Hopefully now you can extract any partial or conditional view that you want from the data.
So you got the rows. What should you do with them? Examining them manually is one thing, but there are ways to get useful statistics summarizing the data.
Getting basic descriptive statistics
The .describe() method of DataFrames is good for this:
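That is, the summary below comes from a plain .describe() call (demonstrated here on a tiny stand-in frame so the snippet is self-contained):

```python
import pandas as pd

# Stand-in for the walrus dataframe; with the real data, just call df.describe().
df = pd.DataFrame({'TempC': [-1.0, 0.5, 2.0, 3.5]})

summary = df.describe()  # count, mean, std, min, quartiles and max per numeric column
```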
                Wet        Forage     LocalHour        LagFor  ...         WSkmh         TempC       biomass          Land
count  47001.000000  47001.000000  47001.000000  47001.000000  ...  47001.000000  47001.000000  47001.000000  47001.000000
mean       0.795430      0.584115     11.529010      0.584519  ...     17.705407      1.657657      2.378306      0.208506
std        0.403391      0.492879      6.866782      0.492810  ...     10.056416      2.016787      0.421618      0.406245
min        0.000000      0.000000      1.000000      0.000000  ...      0.062650    -10.879832      1.073674      0.000000
25%        1.000000      0.000000      7.000000      0.000000  ...     10.170809      0.605898      2.082655      0.000000
50%        1.000000      1.000000     13.000000      1.000000  ...     16.011138      1.312181      2.426983      0.000000
75%        1.000000      1.000000     19.000000      1.000000  ...     23.182862      2.325118      2.638850      0.000000
max        1.000000      1.000000     22.000000      1.000000  ...     71.231288     15.679251      3.679031      1.000000

[8 rows x 13 columns]
This gives you the mean, the standard deviation and the quartiles of the data. In case you forgot: a quartile is the value that is higher than 25%, 50% or 75% of the samples that we have.
But as you can see, the output is too wide: some data gets truncated and it’s hard to focus on anything.
You can use the selection methods (as we’ve discussed above) to narrow down the subset on which the statistics are computed. To describe only the TempC column:
df['TempC'].describe()

# or equivalently:
df.loc[:, 'TempC'].describe()
To describe the LocalHour and biomass columns for rows with negative temperature:
df.loc[df['TempC'] < 0, ['LocalHour', 'biomass']].describe()
What values are present? (the frequency list)
When you have variables that are categorical (that is, they can have only a finite number of values that are often predetermined), you want to know:
- what values do actually occur,
- and with what frequency.
In the natural language processing space, this is known as the frequency list: the list of words occurring in a text, along with their counts. And you can see how this relates to data frames by imagining that we fill the “first word position”, “second word position” (and so on) variables with some words taken from a vocabulary to create a sentence:
  FirstWord SecondWord ThirdWord FourthWord
0        We       like       ice      cream
The .value_counts() method will show you which values are present for each column and how often they occur. Here we can look at the hours when the measurements for walruses were taken:
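The frequency list below comes from something like df['LocalHour'].value_counts() (sketched here on stand-in data):

```python
import pandas as pd

df = pd.DataFrame({'LocalHour': [1, 4, 4, 7, 7, 7]})

# Values sorted by descending frequency.
counts = df['LocalHour'].value_counts()
```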
16    5909
13    5904
22    5892
19    5891
10    5881
7     5862
4     5840
1     5822
Name: LocalHour, dtype: int64
So the observations were made every three hours, and we have a very similar number of observations from each of these periods.
And now let’s take the stats for the Forage, Wet and LocalHour columns together; as you can see, you are really shown the frequencies of combinations of values:
df[['Forage', 'Wet', 'LocalHour']].value_counts()
Forage  Wet  LocalHour
1       1    1            3554
             4            3550
             7            3533
             22           3475
             19           3458
             16           3341
             10           3302
             13           3241
0       0    16           1446
             13           1431
        1    10           1332
             4            1323
        0    19           1301
        1    7            1293
             22           1278
        0    10           1247
        1    13           1232
             1            1220
        0    22           1139
        1    19           1132
             16           1122
        0    1            1048
             7            1036
             4             967
dtype: int64
Now, to understand what just happened here, let’s verify how many rows we have with the Forage = 0, Wet = 1, LocalHour = 10 combination (using the query method that I briefly mentioned):
len(df.query('Forage == 0 & Wet == 1 & LocalHour == 10'))
This is the same number you see in the .value_counts() output when finding (going from the top) the row for 0 on Forage, then, going down from that place, for 1 on Wet and for 10 on LocalHour.
One important takeaway here is that walruses were always in the water when foraging. This is logical and probably something we may know from biology. Aside from that, the animals are out of the water (Wet == 0) most often around 16:00 (4:00 PM), and least often around 4:00 AM.
Taking a look at column relationships: pivot tables
But if you want to see relationships between variables, a handier tool may be pivot tables.
For example, to see the average probability (loosely speaking) of seeing a walrus in the water in any given hour, you could construct a pivot table of the Wet variable, indexing on LocalHour:
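Such a table can be built with a call like the one below (a sketch on stand-in data; the default aggregation is the mean, and since Wet is 0/1, the mean per hour is exactly the share of “in water” rows):

```python
import pandas as pd

df = pd.DataFrame({'Wet': [1, 1, 0, 1], 'LocalHour': [1, 1, 4, 4]})

table = df.pivot_table('Wet', index='LocalHour')
```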
                Wet
LocalHour
1          0.819993
4          0.834418
7          0.823269
10         0.787961
13         0.757622
16         0.755289
19         0.779155
22         0.806687
The conclusions are similar: the probability of being out of the water (i.e. not Wet) is lowest at 4:00 AM, and highest at 16:00. Note that the results here are sorted by hour, not by frequency.
But pivot tables are used more often on three variables at a time. Let’s see how the probability of being in the water depends on both LocalHour and SeaIce:
df.pivot_table('Wet', index='LocalHour', columns='SeaIce')
SeaIce            0         1
LocalHour
1          0.932233  0.773818
4          0.946439  0.788457
7          0.957697  0.768269
10         0.959436  0.718182
13         0.955477  0.677150
16         0.952103  0.675006
19         0.950117  0.709577
22         0.942488  0.751433
We discover that, in fact, whether there is ice in the sea influences walruses more than the hour of the day.
Note that you are not forced to use averages in your pivot tables. There is an aggfunc argument to the .pivot_table(...) method, where you can put any function operating on the data series in question.
(See the docs to read about even wilder stuff that’s possible.)
In the example below, we produce the minimum temperature at which each walrus (IndexWal) was either foraging or not foraging:
df.pivot_table('TempC', index='IndexWal', columns='Forage', aggfunc=min)
Making continuous variables categorical: binning
It is possible to look at the frequencies in continuous variables in a similar way. We do this by introducing bins, or ranges of values. We can then analyze them as if they were categorical – in other words, discrete.
The standard way to do this in Pandas is with the pd.cut() function. It returns a Series with the original values replaced with categories.
You can tell it how many bins you want created:
wind_bins = pd.cut(df['WSkmh'], bins=10)
And here’s what wind_bins looks like in the Python REPL:
0        (14.296, 21.413]
1        (14.296, 21.413]
2        (14.296, 21.413]
3         (21.413, 28.53]
4         (21.413, 28.53]
               ...
46996    (14.296, 21.413]
46997      (7.18, 14.296]
46998    (14.296, 21.413]
46999    (14.296, 21.413]
47000    (14.296, 21.413]
Name: WSkmh, Length: 47001, dtype: category
Categories (10, interval[float64]): [(-0.00852, 7.18] < (7.18, 14.296] < (14.296, 21.413] < (21.413, 28.53] ... (42.764, 49.881] < (49.881, 56.998] < (56.998, 64.114] < (64.114, 71.231]]
To look up which bins were created:
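Since pd.cut() returns a categorical Series, one way to list the bins is through its .cat accessor (a sketch; with the walrus data you would call this on wind_bins):

```python
import pandas as pd

values = pd.Series([1.0, 5.0, 9.0])
bins = pd.cut(values, bins=2)

print(bins.cat.categories)  # an IntervalIndex of the created bins
```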
You can also use the now-familiar .value_counts() method to get the frequency list:
(7.18, 14.296]      14180
(14.296, 21.413]    12764
(21.413, 28.53]      7486
(-0.00852, 7.18]     5901
(28.53, 35.647]      3765
(35.647, 42.764]     1903
(42.764, 49.881]      687
(49.881, 56.998]      249
(56.998, 64.114]       58
(64.114, 71.231]        8
Name: WSkmh, dtype: int64
If you want, you can add this categorical Series as a new column to the dataframe:
df['WindBins'] = wind_bins
Now the data frame also contains the WindBins column.
There’s a simpler graphical way to represent the frequencies of bins (ranges) of values, and this is the histogram. Here’s how to generate it quickly in Pandas. We choose to create a chart for 10 bins in this example.
import matplotlib.pyplot as plt

df['WSkmh'].hist(bins=10)
plt.show()
In reality, the example here uses the default Matplotlib backend for drawing the chart. That’s why you have to import the additional module and call the plt.show() function.
Creating and analyzing lots of plots is another crucial step in data exploration. I will try to condense efficient ways of doing this in another post: stay tuned.
Cover image: https://commons.wikimedia.org/wiki/File:Walruses_on_ice_floes.jpg by Kirill.uyutnov; licensed CC BY-SA