If you’re like me and came to data work with more of a hacking mentality, you may be tempted to stick with whatever tools get the most immediate thing done.

But in my experience, you need to build and refine good procedures that you can adapt to any data problem. The needs and pitfalls here are more repeatable (at the low level, of course) than in attacking complex software tasks.

What we’ll cover today:

This is part of a practical map of how I might use Pandas to explore an example dataset. At this point we want to see what data is present, what the relationships between variables may be, and whether everything looks legitimate and realistic. This matters a lot in practice, where you don’t get clean classroom datasets: there can be errors and misunderstandings that need to be cleared up before your actual analysis.

The post is intended for people who may have interacted with Pandas once or twice, but don’t feel completely at ease with it; or for people who find themselves working around the library instead of making use of what it offers natively.

I encourage you to:

  1. Go through the official 10 minutes to Pandas guide if you haven’t yet – and keep it handy. It’s very good at showing what the library can do and at gently guiding you towards the intended ways of doing things.
  2. As you go through this post, try the snippets with different columns and arguments – this builds memory and mechanical efficiency. When doing real work, it’s good to be able to focus on the problem and not on wrangling with Pandas.

Our example dataset

For the purposes of demonstration, we will use the dataset “Walrus Haulout and In-water Activity Levels Relative to Sea Ice Availability in the Chukchi Sea: 2008-2014”. You can download the “CSV file” – it’s really a ZIP archive that you need to extract.

(Here is the metadata provided by the scientists; scroll down to Entity_and_Attribute_Information near the bottom to see what columns there are and what they’re supposed to represent. It’s good to have this description!)

Go to the directory where you extracted the archive and load the dataset as a Pandas DataFrame:

import pandas as pd
df = pd.read_csv('WalrusCompareBehaviorData_ChukchiSea_2008-2014.csv')

Columns in the dataframe

Let’s confirm which columns are present in the dataset:

>>> df.columns
Index(['Wet', 'Forage', 'LocalHour', 'LagFor', 'LagSwm', 'LagDry', 'IndexWal',
       'IndexTime', 'SeaIce', 'WSkmh', 'TempC', 'biomass', 'Land'],
      dtype='object')

How Pandas works (Series -> DataFrames representation)

To get some grip on how Pandas works:

  • Each column in the DataFrame is a Series (you can verify this by looking at the output of type(df['Wet']), for example). A Series contains a sequence of values. It is stored in the DataFrame under some column name, but can also exist on its own.
  • There is also something called an Index. An index keeps track of the order in which the values occur in a Series or a DataFrame. You don’t often have to worry about it: just know that it’s needed and that you can change it if you want.
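To make this concrete, here is a minimal sketch (with made-up values) of the Series/Index relationship:

```python
import pandas as pd

# A Series can exist on its own, with its own index...
temps = pd.Series([-1.2, 0.5, 2.3], name='TempC')
print(type(temps))           # a pandas Series
print(temps.index.tolist())  # [0, 1, 2] -- the default RangeIndex

# ...and a DataFrame is essentially a collection of Series
# sharing one index.
df = pd.DataFrame({'TempC': temps, 'Wet': [1, 1, 0]})
print(type(df['TempC']))     # extracting a column gives a Series again
```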

Extracting views of data

To make sense of data we often need to split it into meaningful subsets.

We need to establish which variables influence which, and whether we need to treat some subsets differently. For example, in a dataset of lions, we have to make sure we don’t (at least not unknowingly) draw conclusions the same way for the sea lions.

Indices and index ranges

For pure index lookups, use iloc. For anything fancier, loc.

Detailed view of the row at position 10 in the dataframe:

df.iloc[10]

Rows 10 to 99 (iloc, like a Python slice, excludes the endpoint):

df.iloc[10:100]

Indices + column selection

Rows 10 to 100 (note that loc slices are inclusive on both ends), but only some columns (you can ask for a list of columns, as below, or a single column):

df.loc[10:100, ['LocalHour', 'Forage', 'IndexWal']]

All rows, some columns:

df.loc[:, ['LocalHour', 'Forage', 'IndexWal']]
# When filtering only by columns, you can also just do this:
df[['LocalHour', 'Forage', 'IndexWal']]

Indices… but as vectors and conditions

You can replace explicit numeric ranges in loc with any specification of a sequence of indices. Not very interesting, but it works (getting rows 0, 2 and 3):

df.loc[[0, 2, 3]]

Another method is using a sequence of true/false values, for example checking whether TempC is negative:

df.loc[df['TempC'] < 0]

If you look at what df['TempC'] < 0 looks like, it’s indeed just a pandas Series of boolean values, telling us whether the condition is true for the given value:

>>> df['TempC'] < 0
0         True
1         True
2         True
3         True
4         True
          ...
46996    False
46997    False
46998    False
46999    False
47000    False
Name: TempC, Length: 47001, dtype: bool

See also:

  • .isin(…) for getting such a Series, but filtering for values that belong to a list,
  • and .isnull() and .isna() for filtering for missing and invalid values.
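A small sketch of these helpers on a made-up frame (the column names mirror the walrus dataset, but the values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'LocalHour': [1, 4, 7, 10],
                   'TempC': [-1.0, np.nan, 2.5, 0.3]})

# Rows whose LocalHour belongs to a list of values:
print(df.loc[df['LocalHour'].isin([4, 10])])   # 2 rows

# Rows where TempC is missing (NaN):
print(df.loc[df['TempC'].isna()])              # 1 row
```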

Avoiding the gotcha in combining conditions

When building a conjunction of conditions, you may be surprised that this doesn’t work (you get “invalid syntax”):

df.loc[df['TempC'] < 0 && df['Forage'] == 0]

You need to use & instead of && and enclose your conditions in parentheses, because & binds more tightly than the comparison operators:

df.loc[(df['TempC'] < 0) & (df['Forage'] == 0)]

Look into the query method if you want to be more concise with these constructs.
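For example, the filter above can be written with query like this (a sketch on toy data, since we can’t bundle the real CSV here):

```python
import pandas as pd

# Toy stand-in for the walrus dataframe
df = pd.DataFrame({'TempC': [-2.0, -1.0, 3.0],
                   'Forage': [0, 1, 0]})

# Same result as df.loc[(df['TempC'] < 0) & (df['Forage'] == 0)],
# but query lets you write a plain boolean expression:
print(df.query('TempC < 0 and Forage == 0'))
```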

Frequencies and averages

Hopefully now you can extract any partial or conditional view that you want from the data.

So you got the rows. What should you do with them? Examining them manually is one thing, but there are ways to get useful statistics summarizing the data.

Getting basic descriptive statistics

The .describe() method of DataFrames is good for this:

Wet        Forage     LocalHour        LagFor  ...         WSkmh         TempC       biomass          Land
count  47001.000000  47001.000000  47001.000000  47001.000000  ...  47001.000000  47001.000000  47001.000000  47001.000000
mean       0.795430      0.584115     11.529010      0.584519  ...     17.705407      1.657657      2.378306      0.208506
std        0.403391      0.492879      6.866782      0.492810  ...     10.056416      2.016787      0.421618      0.406245
min        0.000000      0.000000      1.000000      0.000000  ...      0.062650    -10.879832      1.073674      0.000000
25%        1.000000      0.000000      7.000000      0.000000  ...     10.170809      0.605898      2.082655      0.000000
50%        1.000000      1.000000     13.000000      1.000000  ...     16.011138      1.312181      2.426983      0.000000
75%        1.000000      1.000000     19.000000      1.000000  ...     23.182862      2.325118      2.638850      0.000000
max        1.000000      1.000000     22.000000      1.000000  ...     71.231288     15.679251      3.679031      1.000000

[8 rows x 13 columns]

This gives you the count, the mean, the standard deviation and the quartiles of the data. In case you forgot, a quartile is the value that is higher than 25%, 50% or 75% of the samples that we have.
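If you only need some of these numbers, each has its own method on a Series; for instance, .quantile() computes the quartiles directly (shown here on made-up values):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# The 0.25, 0.5 and 0.75 quantiles are exactly the three quartiles:
print(s.quantile([0.25, 0.5, 0.75]))
print(s.mean(), s.std())
```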

But as you can see, the output is too wide: some data gets truncated and it’s hard to focus on. You can use the selection methods discussed above to narrow down the subset on which the statistics are computed. To describe only the TempC column:

df['TempC'].describe()
# or equivalently:
df.loc[:, 'TempC'].describe()

To describe the LocalHour and biomass columns for rows with negative temperature:

df.loc[df['TempC'] < 0, ['LocalHour', 'biomass']].describe()

What values are present? (the frequency list)

When you have variables that are categorical (that is, they can take only a finite number of values, often predetermined ones), you want to know:

  1. what values do actually occur,
  2. and with what frequency.

In the natural language processing space, this is known as a frequency list: a list of the words occurring in a text, together with their counts. You can see how this relates to data frames by imagining that we fill the first-word position, second-word position (and so on) variables with words taken from a vocabulary to create a sentence:

  FirstWord SecondWord ThirdWord FourthWord
0        We       like       ice      cream

The .value_counts() method will show you which values are present in a column and how often they occur. Here we can look at the hours when the walrus measurements were taken:

df['LocalHour'].value_counts()
16    5909
13    5904
22    5892
19    5891
10    5881
7     5862
4     5840
1     5822
Name: LocalHour, dtype: int64

So the observations were made every three hours, and we have a very similar number of observations from each of these periods.
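If proportions are more telling than raw counts, value_counts also takes a normalize argument; a quick sketch on invented hours:

```python
import pandas as pd

hours = pd.Series([1, 1, 4, 4, 4, 7, 7, 7], name='LocalHour')

print(hours.value_counts())                # raw counts
print(hours.value_counts(normalize=True))  # fractions that sum to 1.0
```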

And now let’s take the stats for the Forage, Wet and LocalHour columns together; as you can see, you are really shown the frequencies of combinations of values:

df[['Forage', 'Wet', 'LocalHour']].value_counts()
Forage  Wet  LocalHour
1       1    1            3554
             4            3550
             7            3533
             22           3475
             19           3458
             16           3341
             10           3302
             13           3241
0       0    16           1446
             13           1431
        1    10           1332
             4            1323
        0    19           1301
        1    7            1293
             22           1278
        0    10           1247
        1    13           1232
             1            1220
        0    22           1139
        1    19           1132
             16           1122
        0    1            1048
             7            1036
             4             967
dtype: int64

Now, to understand what just happened here, let’s verify how many rows we have with the Forage = 0, Wet = 1, LocalHour = 10 combination (using the query method that I briefly mentioned):

len(df.query('Forage == 0 & Wet == 1 & LocalHour == 10'))

This is the same number you see in the .value_counts() output when finding (going from the top) the row for 0 on Forage, then, going down from that place, for 1 on Wet and for 10 on LocalHour.

One important takeaway here is that the walruses were always in the water when foraging. This is logical and probably something we know from biology. Aside from that, the animals are out of the water (Wet == 0) most often around 16:00 (4:00 PM) and least often around 4:00 AM.
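A regularity like “foraging implies wet” is easy to sanity-check directly, without reading the whole frequency table; here is a sketch of such a check on a toy frame:

```python
import pandas as pd

# Invented rows; on the real dataset this query should come back empty
df = pd.DataFrame({'Forage': [1, 1, 0, 0],
                   'Wet':    [1, 1, 1, 0]})

# If walruses always forage in water, no row can have
# Forage == 1 together with Wet == 0:
violations = df.query('Forage == 1 and Wet == 0')
print(len(violations))  # 0 means the rule holds in the data
```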

Taking a look at column relationships: pivot tables

But if you want to see relationships between variables, pivot tables may be a handier way to do it.

For example, to see the average probability (loosely speaking) of seeing a walrus in the water in any given hour, you could construct a pivot table on the Wet variable, indexing on LocalHour:

df.pivot_table('Wet', index='LocalHour')
                Wet
LocalHour
1          0.819993
4          0.834418
7          0.823269
10         0.787961
13         0.757622
16         0.755289
19         0.779155
22         0.806687

The conclusions are similar: the probability of being out of the water (i.e. not Wet) is lowest at 4:00 AM and highest at 16:00. Note that the rows are ordered by the index, not by the computed values.
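Since the pivot table is itself a DataFrame, you can sort it by the computed values like any other frame; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'LocalHour': [1, 1, 4, 4, 16, 16],
                   'Wet':       [1, 1, 1, 0, 0, 0]})

table = df.pivot_table('Wet', index='LocalHour')
# Order the hours from most-often-dry to most-often-wet:
print(table.sort_values('Wet'))
```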

But pivot tables are used more often on three variables at a time. Let’s see how the probability of being in the water depends on both the LocalHour and SeaIce variables:

df.pivot_table('Wet', index='LocalHour', columns='SeaIce')
SeaIce            0         1
LocalHour
1          0.932233  0.773818
4          0.946439  0.788457
7          0.957697  0.768269
10         0.959436  0.718182
13         0.955477  0.677150
16         0.952103  0.675006
19         0.950117  0.709577
22         0.942488  0.751433

We discover that, in fact, whether there is ice in the sea influences walruses more than the hour of the day.

Note that you are not forced to use averages in your pivot tables. There is the aggfunc argument to the .pivot_table(...) method, where you can put any function operating on the data series in question. (See the docs to read about even wilder stuff that’s possible.)

In the example below, we produce the minimum temperature at which each walrus (IndexWal) was observed either foraging or not foraging:

df.pivot_table('TempC', index='IndexWal', columns='Forage', aggfunc=min)
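aggfunc also accepts a list of functions, which puts several statistics side by side; a sketch with invented temperatures:

```python
import pandas as pd

df = pd.DataFrame({'IndexWal': [1, 1, 2, 2],
                   'Forage':   [0, 1, 0, 1],
                   'TempC':    [-3.0, 1.0, -1.0, 2.0]})

# One column per (aggregation, Forage value) combination:
table = df.pivot_table('TempC', index='IndexWal', columns='Forage',
                       aggfunc=['min', 'max'])
print(table)
```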

Making continuous variables categorical: .cut()

It is possible to look at frequencies of continuous variables in a similar way. We do this by introducing bins, or ranges of values. We can then analyze them as if they were categorical or, in other words, discrete.

The standard way to do this in Pandas is with the pd.cut() function. It returns a Series with the original values replaced by categories. You can tell it how many bins you want created:

wind_bins = pd.cut(df['WSkmh'], bins=10)

And here’s what wind_bins looks like in the Python REPL:

0        (14.296, 21.413]
1        (14.296, 21.413]
2        (14.296, 21.413]
3         (21.413, 28.53]
4         (21.413, 28.53]
               ...
46996    (14.296, 21.413]
46997      (7.18, 14.296]
46998    (14.296, 21.413]
46999    (14.296, 21.413]
47000    (14.296, 21.413]
Name: WSkmh, Length: 47001, dtype: category
Categories (10, interval[float64]): [(-0.00852, 7.18] < (7.18, 14.296] < (14.296, 21.413] <
                                     (21.413, 28.53] ... (42.764, 49.881] < (49.881, 56.998] <
                                     (56.998, 64.114] < (64.114, 71.231]]

To look up which bins were created:

wind_bins.cat.categories

You can also use the now-familiar .value_counts() method to get the frequency list:

wind_bins.value_counts()

(7.18, 14.296]      14180
(14.296, 21.413]    12764
(21.413, 28.53]      7486
(-0.00852, 7.18]     5901
(28.53, 35.647]      3765
(35.647, 42.764]     1903
(42.764, 49.881]      687
(49.881, 56.998]      249
(56.998, 64.114]       58
(64.114, 71.231]        8
Name: WSkmh, dtype: int64

If you want, you can add this categorical Series as a new column to the DataFrame, like so:

df['WindBins'] = wind_bins

Now the data frame also contains the WindBins column.
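One payoff of storing the bins as a column is that it works as a grouping key. A sketch with made-up wind speeds (note that pd.cut also accepts explicit bin edges and labels instead of a bin count):

```python
import pandas as pd

df = pd.DataFrame({'WSkmh': [3.0, 12.0, 25.0, 40.0],
                   'Wet':   [1, 1, 0, 0]})

# Explicit edges and human-readable labels instead of a bin count:
df['WindBins'] = pd.cut(df['WSkmh'], bins=[0, 10, 30, 80],
                        labels=['calm', 'breezy', 'stormy'])

# Average of Wet within each wind bin:
print(df.groupby('WindBins', observed=True)['Wet'].mean())
```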

Plain histogram

There’s a simpler graphical way to represent the frequencies of bins (ranges) of values: the histogram. Here’s how to generate it quickly in Pandas. We choose to create a chart with 10 bins in this example.

import matplotlib.pyplot as plt

df['WSkmh'].plot.hist(bins=10)
plt.show()

[Figure: histogram of a continuous variable with Matplotlib in Pandas]

In reality, the example here uses the default Matplotlib backend to draw the chart. That’s why you have to import the additional module and call the show() function.

You can use many plotting libraries with Pandas, such as Seaborn, Plotly and Altair.

Creating and analyzing many plots is another crucial step in data exploration. I will try to condense efficient ways of doing this into another post: stay tuned.

Cover image: by Kirill.uyutnov; licensed CC BY-SA