Appendix D — Pandas tutorial#

The tutorial is being developed as an interactive notebook. See the following link for a preview: https://nobsstats.com/tutorials/pandas_tutorial.html

Pandas overview#

Pandas is a Python module for the management and analysis of tabular data.

Pandas is sometimes called the Swiss army knife of data manipulation, since it bundles tools for loading, transforming, summarizing, and cleaning tabular data into a single library.

This tutorial introduces the essential parts of the Pandas library, a powerful and versatile toolbox for manipulating tabular data. We'll cover all the data manipulation tasks you need to know for this book, but nothing more, so you won't be overwhelmed by details.

Pandas is a Python library, so any prior experience with Python will come in handy. Remember that Appendix C - python_tutorial.ipynb contains a Python tutorial you can use to get up to speed quickly on the syntax, so if you haven't checked it out yet, now would be a good time, before we really start getting into the code examples.

That intro-to-Python appendix shows you Python data types (int, float, bool, list, etc.), Python operators (+, -, *, /, **, etc.), and functions (len, sum, print, etc.). There is nothing complicated, but it's good to learn the basic syntax so you'll know the meaning of symbols like =, :, [, (, etc.

Installing Pandas#

First let’s make sure pandas is installed using the %pip Jupyter command.

# %pip install pandas

Next, we import the pandas module into the current notebook, which is usually done at the beginning of the notebook. There is a widespread convention to import the pandas module under the alias pd, which is shorter and therefore faster to type.

import pandas as pd

All the Pandas functionality is now available behind the alias pd.

Pandas is a high-level toolbox for manipulating data. In particular, if you learn how to work with list-like pd.Series objects, and table-like pd.DataFrame objects, then you’ll know most of what you need to know about data management.

Series#

Pandas Series objects are list-like containers of values. The columns of a data frame are series objects, so you'll work with series whenever you perform calculations on individual columns (variables) of a data frame. Let's start by creating a standalone pd.Series object, and defer the discussion about data frames until the next subsection.

The code line below shows how to create a series from the list of four numbers [3, 5, 7, 9]. Pandas pd.Series objects are similar to Python lists: they are containers for sequences of values.

s = pd.Series([3, 5, 7, 9])

The series object we created is named s, which is short for series.

We can print the series s by simply typing its name in a new code cell. Recall that the notebook interface automatically prints the last expression in a code cell.

s
0    3
1    5
2    7
3    9
dtype: int64

The numbers printed on the left are called the index of the series, while the numbers on the right are the values of the series. The last line in the output shows some additional information about the series. The series s contains integers, so its dtype (data type) is int64.

Python lists use integers for identifying the elements of the list (first = index 0, second = index 1, last = index len(self)-1). Pandas series support the same functionality. Here are some examples of accessing individual values of the series using the default 0-based indexing.

print("First:  index =", 0, " value =", s[0])
print("Second: index =", 1, " value =", s[1])
print("Last:   index =", len(s)-1, " value =", s[len(s)-1])
First:  index = 0  value = 3
Second: index = 1  value = 5
Last:   index = 3  value = 9

The series index attribute tells you all the possible indices for the series.

s.index
RangeIndex(start=0, stop=4, step=1)
list(s.index)
[0, 1, 2, 3]

The series s uses the default index [0, 1, 2, 3], which consists of a range of integers, starting at 0, just like the index of a Python list with four elements.

The values attribute of the series tells you the underlying values without the index.

s.values
array([3, 5, 7, 9])
type(s.values)
numpy.ndarray

Under the hood, the values of the series are stored in a NumPy array.

You can access the individual elements of the series using the square brackets syntax based on the index labels for the series. The first element in the series is at index 0, so we access it as follows:

s[0]
3

We can select a range of elements from the series using the square brackets and slice notation for the indices:

s[0:3]
0    3
1    5
2    7
dtype: int64

The slice notation 0:3 refers to the list of indices [0,1,2]. The result of s[0:3] is a new series that contains the first three elements of the original series.

Calculations#

Pandas series have methods for performing common calculations. For example, the method .count() tells us the number of values in the series:

s.count()  # == len(s)
4

The method .sum() computes the sum of the values in the series.

s.sum()
24

You can perform arithmetic operations like +, -, *, / with series. For example, we can convert the counts in the series s to proportions by dividing the series s by the sum of its values.

s / s.sum()
0    0.125000
1    0.208333
2    0.291667
3    0.375000
dtype: float64

Series have methods for computing descriptive statistics like .min(), .max(), .mean(), .median(), .var(), .std(), .quantile(), etc. For example, the mean (average value) and the standard deviation (dispersion from the mean) are two common statistics we want to compute from data. We can calculate the arithmetic mean of the values in the series s by calling its .mean() method.

s.mean()
6.0

To find the sample standard deviation of the values in the series s, we use the .std() method.

s.std()
2.581988897471611

Pandas makes it really easy to compute all descriptive statistics!

TODO: TABLE showing all methods on Series objects
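In the meantime, here is a quick sketch (using the series s defined above) of the .describe() method, which computes several of these descriptive statistics in one call.

s.describe()  # count, mean, std, min, quartiles, and max of the series s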

We can also use arbitrary functions from numpy on a series, and Pandas will apply the function to the values in the series.

import numpy as np
np.log(s)
0    1.098612
1    1.609438
2    1.945910
3    2.197225
dtype: float64

Bonus material 1#

# a series of float values
s2 = pd.Series([0.3, 1.5, 2.2])
s2
0    0.3
1    1.5
2    2.2
dtype: float64
# a series of categorical values
s3 = pd.Series(["a", "b", "b", "c"])
s3
0    a
1    b
2    b
3    c
dtype: object

Bonus material 2#

Pandas series allow arbitrary labels to be used as the index, not just integers. For example, we can use a series index that consists of string labels like "x", "y", etc.

s4 = pd.Series(index=["x", "y", "z", "t"],
               data =[ 3,   5,   7,   9 ])
s4
x    3
y    5
z    7
t    9
dtype: int64
s4.index
Index(['x', 'y', 'z', 't'], dtype='object')
s4.values
array([3, 5, 7, 9])

We can now use the string labels to access individual elements of the series.

s4["y"]
5

In other words, Pandas series also act like Python dictionary objects with arbitrary keys. Indeed, any quantity that can be used as a key in a dictionary (a Python hashable object) can also be used as a label in a Pandas series. The keys of a Python dictionary play the same role as the index of a Pandas series.
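For example, here is a short sketch (the series s5 is a new example, not used elsewhere in the tutorial) showing that a series can be created directly from a Python dictionary, with the dictionary keys becoming the index labels.

s5 = pd.Series({"x": 3, "y": 5, "z": 7, "t": 9})  # dict keys become the index labels
s5["y"]  # returns 5, just like a dictionary lookup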

Data frames#

A Pandas data frame (pd.DataFrame) is a container for tabular data similar to a spreadsheet. You can also think of Pandas data frames as collections of Pandas series.

The most common way to create a data frame is to read data from a CSV (Comma-Separated-Values) file.

!cat "../datasets/minimal.csv"
x,y,team,level
1.0,2.0,a,3
1.5,1.0,a,2
2.0,1.5,a,1
2.5,2.0,b,3
3.0,1.5,b,3

The code sample below shows how to load the data file datasets/minimal.csv into a data frame called df, which is a common name we use for data frames.

df = pd.read_csv("../datasets/minimal.csv")
df
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

The Pandas function pd.read_csv() read the contents of the data file datasets/minimal.csv and automatically determined the names of the columns based on the first line in the CSV file, which is called the header row.

We’ll use the data frame df for the examples in the remainder of this section. The data corresponds to five players in a computer game. The columns x and y describe the position of the player, the variable team indicates which team the player is part of, and the column level specifies the character’s strength. The meaning of the variables will not be important, since we’ll be focussing on the technical aspects of the data manipulation procedures.

Pandas provides methods for loading data from many data file formats. The function pd.read_excel() can be used to load data from spreadsheet files. There are also functions pd.read_html(), pd.read_json(), and pd.read_xml() for reading data from other file formats. We’ll talk more about various data formats later in this section.

Data frame properties#

Let’s now explore the attributes and methods of the data frame df. First let’s use the Python function type to confirm that df is indeed a data frame object.

type(df)
pandas.core.frame.DataFrame

The above message tells us that df object is an instance of the DataFrame class defined in the Python module pandas.core.frame.

Every data frame has two important attributes: index and columns. The index is used to refer to the rows of the data frame.

df.index
RangeIndex(start=0, stop=5, step=1)
list(df.index)
[0, 1, 2, 3, 4]

The data frame df uses the “default index” consisting of a range of integers, [0,1,2,3,4], similar to the 0-based indexing used for ordinary Python lists.

The columns-index attribute .columns tells us the names of the columns (variables) in the data frame.

df.columns
Index(['x', 'y', 'team', 'level'], dtype='object')

Column names usually consist of short textual identifiers for the variable (Python strings). Note that spaces and special characters can appear in column names. Column names like "x position" and "level (1 to 3)" are allowed, but generally discouraged since complicated column names make data manipulation code more difficult to read.

The code block below shows an important property of the data frame: its shape.

df.shape
(5, 4)
df.dtypes
x        float64
y        float64
team      object
level      int64
dtype: object

The shape of the data frame df is \(5 \times 4\), which means it has five rows and four columns. The dtypes (data types) attribute prints the information about the types of each of the columns. The columns x and y contain floating point numbers, the column team can contain arbitrary Python objects (in this case Python strings), and the column level contains integers.

The method .info() provides additional facts about the data frame object df, including information about missing values (null values) and the total memory usage.

df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       5 non-null      float64
 1   y       5 non-null      float64
 2   team    5 non-null      object 
 3   level   5 non-null      int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 538.0 bytes

The data frame df takes up 538 bytes of memory, which is not a lot. You don’t have to worry about memory usage for any of the datasets we’ll analyze in this book, since they are all small- and medium-sized. You might have to think about memory usage if you work on larger data sets like logs or databases.

When exploring a data frame, we often want to print the first few rows of the data frame to see what they look like. The data frame method .head(k) prints the first k rows.

df.head(2)
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2

You can also call df.tail(k) to print the last \(k\) rows of the data frame, or call df.sample(k) to select a random sample of size k from the data frame.
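For example, here is a quick sketch of these two methods applied to the data frame df loaded above.

df.tail(2)    # the last two rows of the data frame
df.sample(3)  # a random sample of three rows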

# BONUS
# df.axes
# df.memory_usage()
# df.values

Accessing and selecting values#

A data frame is a table and the .loc[] attribute is used to access individual values, rows, and columns from the table. To get the value of the variable y for the third row, we use:

df.loc[2, "y"]
1.5

The general syntax is df.loc[row,col], where row is the row label and col is the column label of the value we want to obtain.

Selecting entire rows#

To select rows from a data frame, we use df.loc[row,:] where the : syntax is shorthand for “all columns.”

row2 = df.loc[2,:]
row2
x        2.0
y        1.5
team       a
level      1
Name: 2, dtype: object

The rows of a data frame are series objects and their index is the same as the columns index of the data frame.

# Rows of the dataframe are Series objects
type(row2), row2.index
(pandas.core.series.Series, Index(['x', 'y', 'team', 'level'], dtype='object'))
row2.values
array([2.0, 1.5, 'a', 1], dtype=object)
row2["y"]
1.5

Selecting entire columns#

We use the syntax df[col] to select the column col from a data frame.

ys = df["y"]
ys
0    2.0
1    1.0
2    1.5
3    2.0
4    1.5
Name: y, dtype: float64

The column ys is a series and its index is the same as df.index.

type(ys), ys.index
(pandas.core.series.Series, RangeIndex(start=0, stop=5, step=1))

The syntax df[col] is shorthand for the expression df.loc[:,col], which means “select all the rows for the column col.” We can verify that df["y"] equals df.loc[:,"y"] using the .equals() method.

df["y"].equals( df.loc[:,"y"] )
True

The column y can also be obtained by accessing the attribute df.y.

df["y"].equals( df.y )
True

Note that accessing columns as attributes only works for column names that do not contain spaces or special characters.

# MAYBE
ys.values
array([2. , 1. , 1.5, 2. , 1.5])
# MAYBE
ys[2]
1.5

Selecting multiple columns#

We can extract multiple columns from a data frame by passing a list of column names inside the square brackets.

df[["x", "y"]]
     x    y
0  1.0  2.0
1  1.5  1.0
2  2.0  1.5
3  2.5  2.0
4  3.0  1.5

The result is a new data frame object that contains only the x and y columns from the original df.

Selecting subsets of the data frame#

We can use a combined selection expression to choose an arbitrary subset of the rows and columns of the data frame.

We rarely need to do this, but for the purpose of illustrating the .loc syntax, here is the code for selecting the y and team columns from the last two rows of the data frame.

df.loc[3:5, ["y","team"]]
     y team
3  2.0    b
4  1.5    b

Selecting only certain rows#

A common task when working with Pandas data frames is to select the rows that fit one or more criteria. We usually carry out this selection using a two-step process:

  • Build a “selection mask” series that consists of boolean values (True or False). The mask series contains the value True for the rows we want to keep, and the value False for the rows we want to filter out.

  • Select the subset of rows from the data frame using the mask. The result is a new data frame that contains only the rows that correspond to the True values in the selection mask.

For example, to select the rows from the data frame that are part of team b, we first build the selection mask.

mask = df["team"] == "b"
mask
0    False
1    False
2    False
3     True
4     True
Name: team, dtype: bool

The rows that match the criterion “team column equal to b” correspond to the True values in the mask, while the remaining values are False.

The actual selection is done by using the mask inside the square brackets.

df[mask]
     x    y team  level
3  2.5  2.0    b      3
4  3.0  1.5    b      3

The result is a new data frame that contains only the rows that correspond to the True values in the mask series.

We often combine the two steps we described above into a single expression df[df["team"]=="b"]. This combined expression is a little hard to read at first, since it contains two pairs of square brackets and two occurrences of the data frame name df, but you’ll quickly get used to it, because you’ll see this type of selection expressions very often.

df[df["team"]=="b"]
     x    y team  level
3  2.5  2.0    b      3
4  3.0  1.5    b      3

We can use the Python bitwise boolean operators & (AND), | (OR), and ~ (NOT) to build selection masks with multiple criteria. For example, to select the rows where team is b and the x value is greater than or equal to 3, we would use the following expression.

df[(df["team"] == "b") & (df["x"] >= 3)]
     x    y team  level
4  3.0  1.5    b      3

The selection mask consists of two terms (df["team"]=="b") and (df["x"]>=3) that are combined with the bitwise AND operator &. Note the use of extra parentheses to ensure the masks for the two conditions are computed first before the & operation.

If we want to select multiple values of a variable, we can use the .isin() method and specify a list of values to compare with. For example, to build a mask that selects all the observations that have level equal to 2 or 3, we can use the following code.

df["level"].isin([2,3])
0     True
1     True
2    False
3     True
4     True
Name: level, dtype: bool

We see the above expression has correctly selected all observations except the one at index 2, which has level equal to 1.
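To complete the selection, we can place this mask inside the square brackets, just as we did with the team mask above (a short sketch).

df[df["level"].isin([2, 3])]  # keep only the rows with level equal to 2 or 3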

Creating data frames from scratch#

There are several other ways to create data frames, which are sometimes used for simple datasets. You can create a data frame from a Python dictionary object whose keys are the column names, and the dictionary values are lists of the values of each column.

dict_of_columns = {
    "x": [1.0, 1.5, 2.0, 2.5, 3.0],
    "y": [2.0, 1.0, 1.5, 2.0, 1.5],
    "team": ["a", "a", "a", "b", "b"],
    "level": [3, 2, 1, 3, 3],
}

df2 = pd.DataFrame(dict_of_columns)
df2
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

The data frame df2 that we created above is identical to the data frame df that we loaded from the CSV file. We can verify this using the .equals() method of the data frame.

df2.equals(df)
True

Indeed, you can think of data frame objects as dictionary-like containers whose keys are the column names, and whose values are Pandas series objects (columns of values). More on this later.

We can also create a data frame from a list of observation records. Each record (row) corresponds to the data of one observation.

list_records = [
    [1.0, 2.0, "a", 3],
    [1.5, 1.0, "a", 2],
    [2.0, 1.5, "a", 1],
    [2.5, 2.0, "b", 3],
    [3.0, 1.5, "b", 3],
]
columns = ["x", "y", "team", "level"]

df3 = pd.DataFrame(list_records, columns=columns)
df3
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
# df3 is identical to df loaded from minimal.csv
df3.equals(df)
True

When using the list-of-records approach, Pandas can’t determine the names of the columns automatically, so we must pass in the columns argument with a list of the column names we want for the data frame. Note the data frame df3 is identical to the data frames df and df2, which means we’ve now seen three equivalent ways to create the same data frame.

Creating a data frame from a list of dicts:

dict_records = [
    dict(x=1.0, y=2.0, team="a", level=3),
    dict(x=1.5, y=1.0, team="a", level=2),
    dict(x=2.0, y=1.5, team="a", level=1),
    dict(x=2.5, y=2.0, team="b", level=3),
    dict(x=3.0, y=1.5, team="b", level=3),
]
df4 = pd.DataFrame(dict_records)
df4
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
# df4 is identical to df loaded from minimal.csv
df4.equals(df)
True
# Note dict(key="val") is just an alternative syntax for {"key":"val"}
dict(x=1.0, y=2.0, group="a", level=3) == {"x":1.0, "y":2.0, "group":"a", "level":3}
True

In this section, we illustrated the most common data manipulation techniques you might need, but the Pandas library provides a lot more functionality. If you were able to follow the code examples above and can complete the exercises below, then you're ready to continue with the rest of the book. Deep knowledge of Pandas is not required for this book, but it's a super useful skill to develop for working with data in real-world projects.

Exercises 1#

Sorting, grouping and aggregation#

Sorting#

TODO
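Until this section is written, here is a minimal sketch of the .sort_values() method, which returns a new data frame with the rows ordered by the values of one or more columns (using the data frame df loaded above).

df.sort_values(by="level")                   # sort rows by level, ascending
df.sort_values(by="level", ascending=False)  # sort rows by level, descending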

Group by and aggregation#

A common calculation we need to perform in statistics is to compare different groups of observations, where the grouping is determined by one of the variables. The Pandas method .groupby() is used for this purpose. For example, we can group the observations in the data frame df by the value of the team variable using the following code.

df.groupby("team")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f78caf4f610>

The result of calling the .groupby() method is a DataFrameGroupBy object that contains the subsets of the data frame that correspond to the different values of the team variable, df[df["team"]=="a"] and df[df["team"]=="b"].

We can use the DataFrameGroupBy object to do further selection of variables and perform computations. For example, to compute the mean value of the x variable within the two groups, we run the following code.

df.groupby("team")["x"].mean()
team
a    1.50
b    2.75
Name: x, dtype: float64

The result is a series containing the mean for the two groups. The first row contains the value df[df["team"]=="a"]["x"].mean(). The second row’s value is df[df["team"]=="b"]["x"].mean(). Note the result is a series whose index is the team variable (the labels a and b).

We often want to compute several quantities for each group. We can chain the .agg() method (short for .aggregate()) after the .groupby() method to compute multiple quantities. For example, to compute the sum, the count, and the mean value in each group we use the following code.

df.groupby("team")["x"].aggregate(["sum", "count", "mean"])
      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75

The above code sample is an example of the "method chaining" pattern, which is used often in Pandas calculations. We start with the data frame df, call its .groupby() method, select the x column using the square brackets ["x"], then call the method .aggregate() (or its shorter alias .agg()) on the result.

We can "chain" together any number of Pandas methods to perform complicated data selection and aggregation operations. The above examples show chaining just two methods, but it is common to chain together three or more methods as well. This ability to carry out advanced data manipulations using a sequence of simple method applications is one of the main benefits of using Pandas for data processing. Method chaining works because Pandas Series, DataFrame, and GroupBy objects offer many of the same methods, so the output of one calculation can be fed into the next.

When using method chaining for data manipulations, the command chains tend to become very long and don’t fit on a single line of input. You’ll therefore see Pandas expressions split on multiple lines using the Python line-continuation character \ to separate methods, as shown in the code example below.

df.groupby("team")["x"] \
  .agg(["sum", "count", "mean"])
      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75

The result of the above expression is identical to the result in the previous code cell. Note the operations are easier to read since we're using the line-continuation character \ to split the command onto two lines. It is customary (but not required) to indent the second line by a few spaces so the dots line up. This way you can read a "bullet list" of the operations we have applied to the data frame.

Another way to get the benefits of multi-line commands is to wrap the entire expression in parentheses.

(df
  .groupby("team")["x"]
  .agg(["sum", "count", "mean"])
)
      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75

The result is identical to the result in the previous two code cells. This works because we're allowed to wrap any Python expression in parentheses without changing its value, so wrapping the command in parentheses doesn't change the result. The benefit of the parentheses is that new lines are ignored inside them, so we're allowed to break the expression onto multiple lines without the need to add the character \ at the end of each line.

Don't worry too much about the line-continuation and parentheses tricks for multi-line expressions. Most Pandas expressions fit on a single line, but we wanted you to know about the multi-line syntax in case you see it in certain places.

Using crosstab#

TODO
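As a placeholder sketch, the function pd.crosstab() builds a contingency table that counts how many observations fall into each combination of two categorical variables (here the team and level columns of df).

pd.crosstab(df["team"], df["level"])  # counts of observations for each (team, level) pair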

Using pivot and pivot_table#

TODO
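As a placeholder sketch, the method .pivot_table() is similar to .pivot() (shown later in this tutorial), but it can aggregate when several rows map to the same cell; the aggfunc argument chooses the aggregation.

df.pivot_table(index="team", columns="level", values="x", aggfunc="mean")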

Data transformations#

So far we talked about selecting subsets of a data frame, but what if we want to modify the data frame? Pandas provides dozens of methods for modifying the shape, the index, the columns, and the data types of data frames. All the methods we’ll show in this section return a new data frame, so technically speaking we’re not modifying the data frames but returning new, transformed versions of the data frame, which we usually save under a new variable name.

Below is a list of common data transformations you’re likely to encounter in getting “raw” source data into the shape needed for statistical analysis.

  • Renaming: change the column names of a data frame.

  • Reshaping and restructuring the way the data is organized.

  • Imputation: filling in missing values based on the surrounding data or a fixed constant.

  • Merging data from multiple sources to form a combined dataset. For example, we can merge a dataset about downloads per country with another dataset about countries’ populations, so we can compute downloads-per-capita statistics.

  • Filtering: selecting only a subset of the data we’re interested in.

  • Splitting columns.

  • Data cleaning: various procedures for identifying and correcting bad data. Data cleaning procedures include dealing with missing values like None, NaN (not a number), and <NA> (not available), detecting coding errors, duplicate observations, or other inconsistencies in the data.

  • Outlier detection and removal: used to reject certain observations that fall outside the range of expected values.

We’ll now show some examples of common data transformations, which are worth knowing about. We’ll talk about data cleaning and outlier detection in the next section.

Transpose#

The transpose transformation flips a data frame through the diagonal, turning the rows into columns, and columns into rows.

dfT = df.transpose()
dfT
         0    1    2    3    4
x      1.0  1.5  2.0  2.5  3.0
y      2.0  1.0  1.5  2.0  1.5
team     a    a    a    b    b
level    3    2    1    3    3

Note dfT.index is the same as df.columns and dfT.columns is the same as df.index.

Other data shape transformations methods include .melt(), .stack(), .unstack(), .merge(), etc. We’ll discuss shape-transformation methods later in the tutorial as needed.

# add the columns to the index; result is series with a multi-index
# df.stack()
# ALT. way to do transpose
# df.stack().reorder_levels([1,0]).unstack()

Adding new columns#

The most common modification to a data frame is to add a new column. We can add a new column to the data frame df by assigning data to a new column name as shown below.

df["xy"] = df["x"] * df["y"]
print(df)
     x    y team  level   xy
0  1.0  2.0    a      3  2.0
1  1.5  1.0    a      2  1.5
2  2.0  1.5    a      1  3.0
3  2.5  2.0    b      3  5.0
4  3.0  1.5    b      3  4.5
# undo the modification
df = df.drop(columns=["xy"])

On the right side of the assignment operator = we compute the product of the x and y columns, which is a series. We assign this series to a new column called xy. The result of this assignment is a modified data frame df with an additional column.

Modifying a data frame without changing its name is considered bad coding practice, so the above operation is not recommended. In all the code cells up to here, the data frame df has columns ["x", "y", "team", "level"]. In all code cells below, the data frame df will have columns ["x", "y", "team", "level", "xy"]. This change can lead to confusion and problems, since the same variable df refers to different objects in different parts of the notebook, and we have to keep track of this.

When modifying a data frame, I recommend that you create a copy under a new name, and perform the modifications on the copy as shown below.

df2 = df.copy()
df2["xy"] = df["x"] * df["y"]

The resulting df2 is the modified df with the extra xy column, while the original data frame remains unchanged. Using a different name for the modified data frame makes it clear that we did a modification.

The .assign() method is a convenient way to add new columns to a data frame.

df2 = df.assign(xy = df["x"] * df["y"])

The .assign() method returns a new data frame without modifying the original df. We save the result to a new variable df2, so the end result is the same as the data frame df2 we obtained above using the copy-then-modify approach.

The .assign() method is very useful when we want to perform multiple transformations to the data frame using method chaining.

import numpy as np
df3 = df.assign(xy = df["x"] * df["y"]) \
        .assign(z = 1) \
        .assign(r = np.sqrt(df["x"]**2 + df["y"]**2)) \
        .assign(team = df["team"].str.upper())
print(df3)
     x    y team  level   xy  z         r
0  1.0  2.0    A      3  2.0  1  2.236068
1  1.5  1.0    A      2  1.5  1  1.802776
2  2.0  1.5    A      1  3.0  1  2.500000
3  2.5  2.0    B      3  5.0  1  3.201562
4  3.0  1.5    B      3  4.5  1  3.354102

The above code example shows a method chain with four .assign() operations. The effect of the first operation is to add the xy column containing the product of x and y values. The second operation adds a new constant column z equal to 1. The third operation adds the column r obtained from the math formula \(r = \sqrt{x^2 + y^2}\), which corresponds to the distance from the origin of a point with coordinates \((x,y)\). Note we used the function np.sqrt from the NumPy module to perform the square root operation. The last transformation changes the values of the team column to be in uppercase letters. The result of these four assign statements is finally saved as the new variable df3.

Dropping rows and columns#

Pandas provides several "drop" methods for removing unwanted rows or columns from a data frame. For example, to drop the first (index 0), third (index 2), and fifth (index 4) rows of the data frame df, we can use the .drop() method and pass in the list of row labels we want to remove.

df.drop([0,2,4])
     x    y team  level
1  1.5  1.0    a      2
3  2.5  2.0    b      3

The result is a new data frame that contains only the second (index 1) and fourth (index 3) rows.

To remove columns from a data frame, use the .drop() method with the columns option. Here is the code to delete the column level.

df.drop(columns=["level"])
     x    y team
0  1.0  2.0    a
1  1.5  1.0    a
2  2.0  1.5    a
3  2.5  2.0    b
4  3.0  1.5    b

The result is a new data frame that no longer has the level column. Note we could have obtained the same result by selecting only the columns we want to keep using the code df[["x", "y", "team"]].

Pandas also provides the methods .dropna() for removing rows with missing values and .drop_duplicates() for removing rows that contain duplicate data. We’ll learn more about these methods in the next section when we talk about data cleaning.

Renaming columns and values#

To change the column names of a data frame, we can use the .rename() method and pass in to the columns argument a Python dictionary of the replacements we want to make. For example, the code below renames the columns team and level to use uppercase letters.

df.rename(columns={"team":"TEAM", "level":"LEVEL"})
     x    y TEAM  LEVEL
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

The dictionary {"team":"TEAM", "level":"LEVEL"} contains the old:new replacement pairs.

To rename the values in the data frame, we can use the .replace() method, passing in a Python dictionary of replacements we want to apply to the values in each column. For example, to replace the values in the team column with UPPERCASE letters, we can run the following code.

team_mapping = {"a":"A", "b":"B"}
df.replace({"team":team_mapping})
     x    y team  level
0  1.0  2.0    A      3
1  1.5  1.0    A      2
2  2.0  1.5    A      1
3  2.5  2.0    B      3
4  3.0  1.5    B      3

The dictionary {"a":"A", "b":"B"} contains the old:new replacement pairs, similar to the dictionary we used above for renaming columns.

# # ALT. use str-methods to get uppercase letter
# df.assign(team = df["team"].str.upper())

Reshaping data frames#

One of the most common transformations we need to do is to convert data frames from "wide" format to "long" format. Data tables in "wide" format contain multiple observations in each row, with the column header indicating some property of the observations in each column. The code example below shows a sample data frame with the viewership numbers for a television series. The viewership numbers are organized by season (rows) and by episode (columns).

views_data = {
    "season": ["Season 1", "Season 2"],
    "Episode 1": [1000, 10000],
    "Episode 2": [2000, 20000],
    "Episode 3": [3000, 30000],
}
tvwide = pd.DataFrame(views_data)
tvwide
     season  Episode 1  Episode 2  Episode 3
0  Season 1       1000       2000       3000
1  Season 2      10000      20000      30000

This organization is very common for data collected in spreadsheets, since it’s easy for humans to interpret specific values based on the column they appear in.

Data structured in “wide” format is useful for data entry and display purposes, but it makes data selection, grouping, and filtering operations more complicated to perform, so it’s not a good shape for statistical analysis. Instead, we prefer all data to be in “long” format where each row corresponds to a single observation, and each column corresponds to a single variable, like the minimal dataset we discussed above.

The Pandas operation for converting “wide” data to “long” data is called melt, perhaps in reference to melting a wide block of ice into a long stream of water. The method .melt() requires several arguments to specify how to treat each of the columns in the input data frame, and the names we want to assign to the columns in the output data frame. Let’s look at the code first, and explain the meaning of the arguments after.

tvlong = tvwide.melt(id_vars=["season"],
                     var_name="episode",
                     value_name="views")
tvlong
     season    episode  views
0  Season 1  Episode 1   1000
1  Season 2  Episode 1  10000
2  Season 1  Episode 2   2000
3  Season 2  Episode 2  20000
4  Season 1  Episode 3   3000
5  Season 2  Episode 3  30000

The argument id_vars specifies a list of identifier variables, which are the columns whose values identify each row and are carried over unchanged to the output.

All other columns are treated as value variables: their names are collected into the variable column and their contents go into the value column of the melted data frame. The arguments var_name and value_name determine the names of these two output columns.

The result tvlong has six rows, one for each observation in the original data frame tvwide. Each row corresponds to one episode of the TV show, and each column corresponds to a different variable for that episode (the season, the episode, and the number of views). This means the data frame tvlong is in tidy data format.

BONUS EXPLAINER 1: SORT AFTER MELT#

The rows in the data frame tvlong appear out of order after the melt operation. We can fix this using the .sort_values() method and specifying the column names by which we want the data to be sorted.

tvlong.sort_values( by=["season", "episode"]) \
      .reset_index(drop=True)
     season    episode  views
0  Season 1  Episode 1   1000
1  Season 1  Episode 2   2000
2  Season 1  Episode 3   3000
3  Season 2  Episode 1  10000
4  Season 2  Episode 2  20000
5  Season 2  Episode 3  30000

The data is now sorted by season first then by episode, which is the natural order for this data. The extra method .reset_index() is used to re-index the rows using the sequential numbering according to the new sort order.

BONUS EXPLAINER 2: UNDO MELT WITH PIVOT (cut for brevity)#

The method for the opposite transformation (converting long data to wide data) is called pivot and works like this:

tvlong.pivot(index="season",
             columns="episode",
             values="views")
episode   Episode 1  Episode 2  Episode 3
season
Season 1       1000       2000       3000
Season 2      10000      20000      30000
# # ALT. to get *exactly* the same data frame as `tvwide`
# tvlong.pivot(index="season",
#              columns="episode",
#              values="views") \
#        .reset_index() \
#        .rename_axis(columns=None)

Tidy data (where to cover?)#

The concept of tidy data is a convention for structuring data sets that makes them easy to work with [@wickham2014tidy]. A tidy dataset has the following characteristics:

  • Each variable corresponds to a separate column.

  • Each observation corresponds to a separate row.

  • Each data cell contains a single value.

This structure makes it easy to select arbitrary subsets of the data based on values of the variables, and perform arbitrary transformations and aggregations. Tidy data is also very useful for data visualizations.

TODO: examples of non-tidy data + explain what is the problem + how to fix it

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” —Hadley Wickham

The Seaborn library we’ll use for plotting works well with data in tidy format, allowing you to create data visualizations of the variables by simply specifying the column names.

String methods#

TODO .str namespace

The .str prefix can be used to access other string manipulation methods like .str.join(), .str.split(), .str.startswith(), .str.strip(), etc. Any operation we can perform on a Python string can also be performed on an entire Pandas series by calling the appropriate str-method.
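Here is a quick sketch of a few str-methods applied to a small example series of strings (the series names is defined only for this illustration).

names = pd.Series([" Alice ", "BOB", "carol"])
names.str.strip().str.lower()   # remove surrounding spaces, then lowercase
names.str.startswith("c")       # boolean mask of values that start with "c"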

Another transformation we might have to do is to split values that combine multiple variables. For example, medical records often contain the age and sex information as a single string 32F, which we would split into the variable age with value 32 and the variable sex with value F.

# TODO
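Here is a minimal sketch of one way to do this split, assuming a hypothetical series recs of age-and-sex strings; the .str accessor supports slicing each string in the series.

recs = pd.Series(["32F", "45M", "28F"])  # hypothetical raw records
ages = recs.str[:-1].astype(int)         # everything except the last character, as integers
sexes = recs.str[-1]                     # the last character of each string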

Merging data frames#

TODO
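Until this section is written, here is a minimal sketch of the pd.merge() function, using two small made-up data frames that echo the downloads-per-capita example mentioned earlier (the names and numbers are illustrative only).

downloads = pd.DataFrame({"country": ["CA", "US"], "downloads": [100, 300]})
populations = pd.DataFrame({"country": ["CA", "US"], "population": [38, 331]})
merged = pd.merge(downloads, populations, on="country")  # join rows on the shared country column
merged["per_capita"] = merged["downloads"] / merged["population"]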

Type conversions#

TODO .astype
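As a placeholder sketch, the .astype() method converts a column (or a whole data frame) to a different data type; the examples below use the columns of the df loaded above.

df["level"].astype(float)      # convert the integer column to floats
df["team"].astype("category")  # convert the strings to the memory-efficient category dtype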

Dummy coding of categorical variables#

TODO
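As a placeholder sketch, the function pd.get_dummies() converts a categorical column into indicator (0/1) columns, one per category.

pd.get_dummies(df["team"], prefix="team")  # indicator columns team_a and team_b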

I hope you got a general idea of the transformations that exist.

Data cleaning#

Before a dataset can be used for statistical analysis, we must often perform various pre-processing steps on the data to ensure it is consistently formatted. The term data cleaning describes the procedures for correcting data problems so that the statistical analysis will not "choke" on them or lead to erroneous results. Data cleaning procedures include detecting coding errors, removing duplicate observations, and fixing other inconsistencies in the data.

Standardize categorical values#

A very common problem that occurs with categorical variables is the use of multiple codes to represent the same concept. Consider the following series containing what-type-of-pet-is-it data, in which "dog" is encoded using several different values.

pets = pd.Series(["D", "dog", "Dog", "doggo"])
pets
0        D
1      dog
2      Dog
3    doggo
dtype: object

This type of inconsistent encoding will cause lots of trouble down the line. For example, performing a .groupby() operation on this variable will result in four different groups, even though all these pets are dogs.
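For example, counting the raw values shows four separate categories, even though there is really only one (a quick check using .value_counts(), which is discussed further below).

pets.value_counts()  # D, dog, Dog, and doggo are each counted separately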

We can fix this encoding problem by standardizing on a single code for representing dogs and replacing all other values with the standardized code, as shown below.

dogsubs = {"D":"dog", "Dog":"dog", "doggo":"dog"}
pets.replace(dogsubs)
0    dog
1    dog
2    dog
3    dog
dtype: object

The method .value_counts() is helpful for detecting coding errors and inconsistencies. Looking at the counts of how many times each value occurs can help us notice exceptional values, near-duplicates, and other problems with categorical variables.

pets.replace(dogsubs).value_counts()
dog    4
Name: count, dtype: int64

Number formatting errors#

The text string "1.2" corresponds to the number 1.2 (a float). We can do the conversion from text string to floating point number as follows.

float("1.2")
1.2

When loading data, Pandas will automatically recognize numerical expressions like this and load them into columns of type float or int.

There are some common number formatting problems you need to watch out for. Many languages use the comma as the decimal separator instead of a decimal point, so the number 1.2 might be written as the text string "1,2", which cannot be recognized as a float.

float("1,2")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[88], line 1
----> 1 float("1,2")

ValueError: could not convert string to float: '1,2'

To fix this issue, we replace the comma with a period as follows.

"1,2".replace(",", ".")
'1.2'

Another example of a problematic numeric value is "1. 2", which is not a valid float because of the extra spaces. We can fix this by getting rid of the space "1. 2".replace(" ", "").

Let's now learn how to perform this kind of string manipulation when working with Pandas series and data frames.

rawns = pd.Series(["1.2", "1,2", "1. 2"])
rawns
0     1.2
1     1,2
2    1. 2
dtype: object

The series rawns contains strings with correct and incorrect formatting. Note the series rawns has dtype (data type) of object, which is what Pandas uses for strings.

Let's try to convert this series to numeric values (floats). We can do this by calling the method .astype() as shown below.

# ERROR
# rawns.astype(float)

Calling the method .astype(float) is essentially the same as calling float on each of the values in the series, so it shouldn’t be surprising that we see an exception since the string "1,2" is not a valid float.

We can perform the string replacements on the data series rawns using the “str-methods” as shown below.

rawns.str.replace(",", ".") \
     .str.replace(" ", "") \
     .astype(float)
0    1.2
1    1.2
2    1.2
dtype: float64

After performing the replacements of commas to periods and removing unwanted spaces, the method .astype(float) succeeds.

Note the method we used in the above example is .str.replace() and not .replace().

Other types of data that might have format inconsistencies include dates, times, addresses, and postal codes. We need to watch out when processing these types of data, and make sure all the data is in a consistent format before starting the statistical analysis.

Use the .drop_duplicates() method to remove duplicated rows.
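Here is a minimal sketch of .drop_duplicates(), using a small made-up data frame that contains a repeated row.

dupdf = pd.DataFrame({"x": [1, 1, 2], "y": [3, 3, 4]})  # rows 0 and 1 are identical
dupdf.drop_duplicates()  # keeps only the first copy of each duplicated row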

Dealing with missing values#

Real-world datasets often contain missing values that can occur as part of the data collection process. Missing values in Pandas are indicated with the special symbol NaN (Not a Number), or sometimes using the symbol <NA> (Not Available). In Python, the absence of a value corresponds to the value None. The term null value is a synonym for missing value.

In order to show some examples of dealing with missing values, we'll work with a "raw" dataset located at datasets/raw/minimal.csv.

rawdf = pd.read_csv("../datasets/raw/minimal.csv")
rawdf
      x    y team  level
0   1.0  2.0    a    3.0
1   1.5  1.0    a    2.0
2   2.0  1.5    a    1.0
3   1.5  1.5    a    NaN
4   2.5  2.0    b    3.0
5   3.0  1.5    b    3.0
6  11.0  NaN  NaN    2.0

The data frame rawdf contains the same data we worked on previously, plus some additional rows with "problematic" values that we need to deal with. Specifically, the row with index 3 is missing the level variable, and row 6 is missing the values for the y and team variables. The missing values are indicated by the NaN symbol, which is a special float value that stands for Not a Number.

Note the variable level has been converted to a float data type in order to use the NaN for representing missing values. This is because integers do not have a natural way to represent missing values. Some datasets might use special codes like 999 or -1 to indicate missing integer values. You should use the .replace() method to change these values to float("NaN") or pd.NA in order to avoid the special code being used in numeric calculations by mistake.
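For example, here is a hedged sketch of what such a replacement might look like; the codes 999 and -1 are hypothetical and do not appear in rawdf.

# Hypothetical: replace the special codes 999 and -1 with proper missing values
# rawdf["level"].replace({999: float("NaN"), -1: float("NaN")})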

We can use the .info() method to get an idea of the number of missing values in a data frame.

rawdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       7 non-null      float64
 1   y       6 non-null      float64
 2   team    6 non-null      object 
 3   level   6 non-null      float64
dtypes: float64(3), object(1)
memory usage: 352.0+ bytes

The data frame has a total of 7 rows, but the columns level, y, and team have only 6 non-null values, which tells us these columns contain one missing value each.

We can use the method .isna() to get a complete picture of all the values that are missing (not available).

rawdf.isna()
       x      y   team  level
0  False  False  False  False
1  False  False  False  False
2  False  False  False  False
3  False  False  False   True
4  False  False  False  False
5  False  False  False  False
6  False   True   True  False

The locations in the above data frame that contain the value True correspond to the missing values in the data frame rawdf.

We can summarize the information about missing values by computing column sums of this boolean data frame, which gives a count of the missing values for each variable.

rawdf.isna().sum(axis="rows")
x        0
y        1
team     1
level    1
dtype: int64
rawdf.isna().sum(axis="columns")
0    0
1    0
2    0
3    1
4    0
5    0
6    2
dtype: int64
# # ALT. convert to data types that have support <NA> values
# tmpdf = pd.read_csv("../datasets/raw/minimal.csv")
# rawdf2 = tmpdf.convert_dtypes()
# rawdf2.dtypes
float("NaN")
nan

We can also compute rawdf.isna().sum(axis="columns") to get the count of missing values in each row.

The most common way of dealing with missing values is to exclude them from the dataset, which is sometimes called "dropping" the data. The method .dropna() filters out any rows that contain null values and returns a new "clean" data frame with no missing values.

cleandf = rawdf.dropna()
cleandf
     x    y team  level
0  1.0  2.0    a    3.0
1  1.5  1.0    a    2.0
2  2.0  1.5    a    1.0
4  2.5  2.0    b    3.0
5  3.0  1.5    b    3.0

We can provide different arguments to the .dropna() method to give more specific instructions about which rows to drop. For example, missing values in one of the columns you don’t plan to use for a statistical analysis are not a problem.

Another approach for dealing with missing values is to use imputation, which is the process of “filling in” values based on our “best guess” of the missing values. Imputation is a tricky process since it involves changing the data, which can affect any subsequent statistical analysis, so we try to avoid it as much as possible. The common approaches for filling in missing values include using the mean or median of the variable, or guessing a missing value based on the neighbouring values.
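For example, here is a sketch (not applied to our data) of the .fillna() method, which would fill the missing level values with the median of that column.

# rawdf["level"].fillna(rawdf["level"].median())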

# Recode level as integer
cleandf.loc[:,"level"] = cleandf["level"].astype(int)
cleandf
     x    y team  level
0  1.0  2.0    a    3.0
1  1.5  1.0    a    2.0
2  2.0  1.5    a    1.0
4  2.5  2.0    b    3.0
5  3.0  1.5    b    3.0
# ALT. Convert the level column to `int`
# cleandf = cleandf.assign(level = cleandf["level"].astype(int))
# # Confirm `cleandf` is the same as `df` we worked with earlier
# cleandf.reset_index(drop=True).equals(df)

Dealing with outliers#

Outliers are observations that are much larger or much smaller than other data points. Sometimes extreme observations result from mistakes in the data collection process like measurement instrument malfunctions, data entry typos, or measurements of the wrong subject. For example, we might intend to study the weights of different dog breeds but somehow end up including a grizzly bear in the measurements. These mistakes are usually easy to identify when a measurement is obviously impossible, like a 600kg dog. In certain cases, we can fix the mistake (for example, if the data was transcribed from paper to computer, you might be able to find the paper record and correct the typo). Otherwise, it's fair to remove these faulty observations from the data to avoid problems with the statistical analysis.

Let’s look at some example data that contains an outlier and the procedure for identifying the outlier and filtering it out. Consider a series xs that contains five values.

xs = pd.Series([1.0, 2.0, 3.0, 4.0, 50.0])
xs
0     1.0
1     2.0
2     3.0
3     4.0
4    50.0
dtype: float64

We can easily see the value 50 is the outlier here, but the task may not be so easy for larger datasets with hundreds or thousands of observations.

There are several criteria we can use to systematically identify outliers:

  • Tukey's rule: values more than 1.5*IQR away from the outer quartiles (Q1 and Q3), which is discussed in the descriptive statistics section of the book.

  • \(z\)-scores. TODO

  • OTHER?

  • Visual

# Visual observation of the values in the series `xs` to "see" the outlier.
# import seaborn as sns
# sns.stripplot(xs)
# Tukey outliers limits for values in the `xs` series
Q1, Q3 = xs.quantile([0.25, 0.75])
IQR = Q3 - Q1
xlim_low = Q1 - 1.5*IQR
xlim_high = Q3 + 1.5*IQR
(xlim_low, xlim_high)
(-1.0, 7.0)
# # ALT. 2-standard deviations Z-value outlier limits
# xmean = xs.mean()
# xstd = xs.std()
# xlow = xmean - 2*xstd
# xhigh = xmean + 2*xstd
# (xlow, xhigh)

Applying the heuristics based on the quartiles of the dataset, we obtain the criteria that any xs values outside the interval \([-1,7]\) are to be counted as outliers. In words, values smaller than \(-1\) or greater than \(7\) are to be considered as “unusual” for this dataset. Let’s build a mask that identifies all values that fit these criteria.

outliers = (xs < -1.0) | (xs > 7.0)
outliers
0    False
1    False
2    False
3    False
4     True
dtype: bool

The outliers series has the same length as the series xs and tells us which values in the list are outliers. Note the outlier definition consists of the conditions (xs < -1.0) and (xs > 7.0), combined with bitwise OR operator |, sometimes called the “pipe” character.

To remove the outliers from the series, we want to drop all the values where the outliers mask is True. Alternatively, we can select only the subset of xs that are not outliers. Using the bitwise NOT operator ~, we can invert the outlier mask and obtain a mask for all the values which are not outliers.

~outliers
0     True
1     True
2     True
3     True
4    False
dtype: bool

Using the ~outliers mask allows us to select all the non-outliers.

xs[~outliers]
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

We can assign this to a new variable and continue the statistical analysis based on the subset of xs that excludes the problematic outlier.

Why outliers are problematic#

Outliers can have undue leverage on the value of some statistics and may lead to misleading analysis results if they are not removed. Statistics like the mean, variance, and standard deviation can be heavily affected by one or two exceptional values. For example, let's compute the mean (average value) and the standard deviation (dispersion from the mean) from the complete dataset \([1,2,3,4,50]\).

xs.mean(), xs.std()
(12.0, 21.27204738618265)

These values are very misleading: they give us the wrong impression about the centre of the data distribution and how spread out it is. In reality, the first four values are small numbers, but the presence of the large outlier \(50\) makes the mean and the standard deviation appear much larger.

If we remove the outlier \(50\), the mean and the standard deviation of the remaining values \([1,2,3,4]\) are much smaller.

xs[~outliers].mean(), xs[~outliers].std()
(2.5, 1.2909944487358056)

These summary statistics are much more representative of the data. The mean value \(2.5\) is a good single-number description for the “centre” of the data points, and the standard deviation of \(1.29\) also gives us a good idea of the dispersion of the data points.

Any subsequent statistical analyses we might perform with this data will benefit from these more accurate summary statistics we obtain after removing the outlier.

When to remove outliers (ROUGH DRAFT)#

Rejecting observations that are far outside the range of expected values is the correct thing to do when these observations arise from data entry errors, but that doesn’t mean we should always remove outliers to make our data look “nicer.” We need to have a legitimate reason for removing outliers.

Outliers may indicate an unexpected phenomenon, or a previously unsuspected variable that influences the measurement. It would be a shame to just discard a potentially valuable finding. Instead of rejecting these observations, you could instead investigate the unusual cases more closely and look for an explanation. A consultation with a subject-matter expert (SME) would be a good idea before making the decision to exclude certain observations.

Alternatively, you could use a statistical procedure that gives less weight to outliers, or you could repeat the experiment to obtain a new dataset and compare your two sets of results. Finally, you could choose to report results both with and without outliers and let your audience decide.

If you eliminate data points from your analyses (for any reason), always report what you removed and why you removed it.

EXTRA STUFF: Whatever strategy you prefer, it’s best to decide how you will deal with outliers a priori—before you collect your data.

The goal of certain statistical procedures like anomaly detection is precisely to identify outliers, so in that context you should definitely not remove them.

Outliers in n dimensions: e.g., a univariate outlier might not be a bivariate or multivariate outlier, and vice versa.

The data pre-processing steps we introduced in this section are an essential prerequisite for statistical analysis. It’s important that you learn to perform the basic data cleaning operations so that you’ll be able to work with messy, real-world datasets. We’ve only scratched the surface of what is possible, but I hope you got a general idea of the data cleaning steps you might need to do.

TODO: plug Case Studies section in the end of tutorial.

NOT OUTSOURCABLE (personal message)#

Data pre-processing is not something you can "outsource" or hand off to a colleague, because it's very informative to "touch" the data to get to know it. If you only see data after it has been cleaned up by someone else, then you're missing a lot of the context, and you were not included in some important decisions (dealing with missing values, removing outliers, etc.).

Data cleaning exercises#

To give you a taste of data cleaning tasks, here are some exercises that ask you to clean up the raw data files in datasets/raw/ to obtain the same data as in the datasets/ folder.

TODO

Data sources#

Data extraction: get our hands on data#

Real-world data is stored in all kinds of places, and you often need to do an “extraction” phase to get your hands on the data. You’ll have to deal with many different kinds of data sources at one point or another in your data career. Here are the most common data sources you should know about:

  • Local files. The simplest kind of data file is the one you can “Save as,” receive as an email attachment, and open from your local file system. Every file is described by a file path that specifies the location of the file in the local file system, and a file extension like .csv, .xlsx, .json, .xml, which tells you what kind of format the data is stored in. More on data file formats below.

  • Online files. Files can be made available for download on the internet by placing them on a web server. Web servers also work with paths. The job of a web server is to respond to requests for different paths. For example, when your web browser makes a GET request for the path /datasets/minimal.csv on the server noBSstats.com, the web server software running on noBSstats.com will respond by sending back the contents of the file. The act of GETting a file from a remote host and saving it to the local file system is usually accomplished using the "Save as" operation, a terminal command like wget https://noBSstats.com/datasets/minimal.csv, or a Python script: import requests; response = requests.get("..."); ....

  • Logs. Many software systems generate log files as part of their normal operation. For example, every web server keeps a log of all the requests it has received, which can be a very useful data source.

  • Surveys. Send out a survey of some sort and obtain the responses. Every startup should be investing time to talk to their customers, and user surveys play a big role in this.

  • Databases. Company operational data is stored in one or more databases. Access to this data is essential for understanding any business. The Structured Query Language (SQL) is the standard interface for accessing data in databases. Steps: (1) obtain access to the DB (server, username, and password), (2) connect to the DB, (3) run a DB query SELECT … FROM …, (4) save the output to a CSV file.

  • Websites. Data can sometimes be found as tables in a webpage’s HTML markup code. This data is usually displayed with the intention of being read by humans, but it’s often useful to extract the tabular data.

  • APIs. An application programming interface (API) is a standard way for computers to exchange data. Data obtained from APIs is intended for machine use. You can think of APIs as websites that allow fine-grained control of the data you’re requesting, and a well-defined format for the response data.

  • Data repositories. Existing datasets on specialized web servers for hosting data. A data repository will allow you to download datasets in the usual formats (CSV, spreadsheet, etc.), but also provides additional descriptions of the data (metadata). Examples of data repositories include government data repositories and open science repositories like OSF and Zenodo.

  • Published plots and graphs. Using tools like graphreader.com, you can sometimes read data points from graphs in published research papers.

  • Research datasets. CUT Ask researchers to provide you with raw data files from any research paper. You can expect mixed responses, but it doesn’t hurt to ask.

  • SQL files. CUT Structured Query Language is the standard format used to represent data in databases. A database dump file contains the complete instructions for recreating a database and all of its contents. TODO: mention these files are not meant to be read directly – instead, load the dump into a database, then run a query to extract a CSV.
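
To make the “online files” item above more concrete, here is a minimal sketch of two ways to get a remote CSV file into Pandas. It assumes you have an internet connection; the second option also assumes the requests package is installed, and the local filename minimal_copy.csv is just an illustrative choice.

# Option 1: let Pandas fetch the file directly from the URL
urldf = pd.read_csv("https://noBSstats.com/datasets/minimal.csv")

# Option 2: download the file first, then load the local copy
import requests
response = requests.get("https://noBSstats.com/datasets/minimal.csv")
with open("minimal_copy.csv", "wb") as f:
    f.write(response.content)
urldf2 = pd.read_csv("minimal_copy.csv")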

Data formats#

Data files can be encoded in one of several data formats: CSV files, TSV files, spreadsheets, JSON files, HTML files, SQLite database files, etc.

We’ll now show examples of the Pandas code for loading data from CSV files and spreadsheets, since these are the most common data formats seen “in the wild.”

Joke: Trigger warning: there will be acronyms like SQL and CSV thrown at you. I know you’re thinking “wow, that escalated quickly,” but we have to get a bit technical to make the knowledge actually useful.

CSV files#

The initialism CSV stands for Comma-Separated-Values and is a widespread file format for tabular data. CSV files consist of plain text values separated by commas. The first line of a CSV file usually contains the variable names (it is called the header row).

Comma-Separated-Values files are the most common file format for tabular data that you are likely to encounter. You can load CSV data using the function pd.read_csv(<path>), where <path> is the location of the CSV file.

Let’s look at the “raw contents” of the data file ../datasets/minimal.csv, as you would see them if you opened the file with a basic text editor (e.g. Notepad on Windows or TextEdit on a Mac).

    x,y,team,level
    1.0,2.0,a,3
    1.5,1.0,a,2
    2.0,1.5,a,1
    2.5,2.0,b,3
    3.0,1.5,b,3

Note the first line of this file is a “header row” that contains the column names, while the remaining rows contain the actual data for the observations.

# ALT. (not cross platform)
# !head ../datasets/minimal.csv

Here is the code for loading a CSV file ../datasets/minimal.csv and showing the resulting data frame.

df = pd.read_csv("../datasets/minimal.csv")
print(df)
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

Spreadsheet files#

Spreadsheet software like LibreOffice Calc, Microsoft Excel, and Google Sheets can be used to edit spreadsheet files with extensions like .ods, .xlsx, and .xls. We can load data from spreadsheet files using the function pd.read_excel().

TODO: give heads up need to pip-install packages
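
Note that pd.read_excel relies on optional third-party packages to read spreadsheet files: odfpy for .ods files and openpyxl for .xlsx files. If they are not installed in your environment, you can install them with the %pip command, as we did for Pandas itself.

# %pip install odfpy openpyxl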

odsdf = pd.read_excel("../datasets/formats/minimal.ods",
                      sheet_name="Sheet1")
odsdf.equals(df)
True
xlsxdf = pd.read_excel("../datasets/formats/minimal.xlsx",
                       sheet_name="Sheet1")
xlsxdf.equals(df)
True

Other data formats#

TSV#

The Tab-Separated-Values format is similar to CSV, but uses TAB characters as separators. TAB is a special character used for aligning text into columns, and it is represented as \t in Python strings.

!head ../datasets/formats/minimal.tsv
x	y	team	level
1.0	2.0	a	3
1.5	1.0	a	2
2.0	1.5	a	1
2.5	2.0	b	3
3.0	1.5	b	3

We load TSV files using the function pd.read_csv by passing the value "\t" to the sep (separator) argument: pd.read_csv(<path>, sep="\t").

tsvdf = pd.read_csv("../datasets/formats/minimal.tsv", sep="\t")
tsvdf
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
tsvdf.equals(df)
True

JSON#

The acronym JSON stands for JavaScript Object Notation, and it is one of the most common data formats used by web applications and APIs. The JSON data format is similar to Python’s notation for lists (square brackets [...]) and dicts (curly braces {...} containing key:value pairs).

!head ../datasets/formats/minimal.json
[
  {"x":1.0, "y":2.0, "team":"a", "level":3},
  {"x":1.5, "y":1.0, "team":"a", "level":2},
  {"x":2.0, "y":1.5, "team":"a", "level":1},
  {"x":2.5, "y":2.0, "team":"b", "level":3},
  {"x":3.0, "y":1.5, "team":"b", "level":3}
]
Pandas provides the function pd.read_json() for loading JSON data. If the contents of the JSON file is a list of observations, we can load it directly into a data frame.
jsondf = pd.read_json("../datasets/formats/minimal.json")
jsondf.equals(df)
True

If the JSON source data file has a structure that is more complicated than a list of observations, we can try calling the function pd.json_normalize, which will try to auto-guess the structure of the JSON data. In the general case, importing JSON data might require using the Python module json to load the data into Python data structures, then carrying out processing steps to extract the subset of the data you’re interested in and organize it into the list-of-observations format that Pandas expects.
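
Here is a minimal sketch of this approach on a small nested structure defined inline (a hypothetical example, not one of the book’s data files). The nested info field gets flattened into the columns info.team and info.level.

import json

nested_json = '{"players": [{"x":1.0, "y":2.0, "info":{"team":"a", "level":3}}]}'
data = json.loads(nested_json)                  # parse JSON text into Python dicts and lists
playersdf = pd.json_normalize(data["players"])  # flatten the list of observations
playersdf.columns                               # x, y, info.team, info.level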

HTML tables#

The HyperText Markup Language format is used by all the web pages you access on the web. The HTML source code of a webpage can include tabular data (the <table> tag in the HTML source code).

!head -16 ../datasets/formats/minimal.html
<table>
  <thead>
    <tr>
      <th>x</th>
      <th>y</th>
      <th>team</th>
      <th>level</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1.0</td>
      <td>2.0</td>
      <td>a</td>
      <td>3</td>
    </tr>

We can load the data from HTML files using the function pd.read_html(), which returns a list of data frames extracted from the tables found in an HTML document.

tables = pd.read_html("../datasets/formats/minimal.html")
htmldf = tables[0]
htmldf.equals(df)
True
htmldf
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

XML#

The eXtensible Markup Language format is used by many structured data formats and APIs.

!head -8 ../datasets/formats/minimal.xml
<?xml version='1.0' encoding='utf-8'?>
<players>
  <player>
    <x>1.0</x>
    <y>2.0</y>
    <team>a</team>
    <level>3</level>
  </player>

We use the function pd.read_xml() to load XML data.

xmldf = pd.read_xml("../datasets/formats/minimal.xml")
xmldf.equals(df)
True

The Python libraries lxml or BeautifulSoup can be used for more advanced XML parsing and data processing steps (out of scope).

SQLite databases#

SQLite is a commonly used data format for storing a complete database as a single file. Pandas provides the generic functions pd.read_sql_table() and pd.read_sql_query() for extracting data from databases, and we can use these functions to “connect” to an SQLite database file and read the data stored in one or more of its tables.

from sqlalchemy import create_engine
dbpath = "../datasets/formats/minimal.sqlite"
engine = create_engine("sqlite:///" + dbpath)
with engine.connect() as conn:
    sqldf = pd.read_sql_table("players", con=conn)
    print(sqldf.equals(df))
True
query = "SELECT x, y, team, level FROM players;"
sqldf2 = pd.read_sql_query(query, con=engine)
sqldf2.equals(df)
True

CUTTABLE: The key thing to remember during the data extraction phase is that your goal is to get the data in whatever format it comes in, not to restructure or modify it in any way. Leave the data transformation and cleaning steps for later.

Case studies#

Recall the various datasets we used in the book: Vanessa’s website visitors, Bob’s electricity prices, and Charlotte’s student scores, as well as the apples, kombucha, doctors, and players. We were not involved in the data collection process for these datasets, but it’s still worth asking the clients (Vanessa, Bob, Charlotte, etc.) about the steps they followed to obtain the data. There is an unspoken rule in statistics that the more you know about the data, the better you’ll be able to do the statistical analysis on it.

Note the technical details in this section are not essential to understand. The goal is to just show some examples of real-world data collection and cleanup steps, so you’ll know what to expect when working on your own datasets. We’ll present the backstory for each dataset as an informal conversation. If you’re interested in the technical details, you can look at the notebook for the exact commands used (optional reading for technical people).

Background stories for the example datasets

Collecting the apples dataset#

Alice

TODO: import from .tex

Collecting the electricity prices dataset#

A few days later, you call up Bob to ask him about how he collected the electricity prices dataset eprices.csv. He tells you he obtained the electricity pricing data by scraping the pages from a local price-comparison website. This website contained the electricity prices for various charging stations in the form of an HTML table. He used the function pd.read_html() to extract the prices for each station and categorized the stations based on their location in the East and West parts of the city.

Listening to the explanations, you wonder about the possible bias that might exist in the electricity prices that Bob collected, so you decide to ask him about it.

“Bob, tell me more about this website. Which charging stations are listed there?” you ask.

“It’s a volunteer-run effort. People post the price they paid every day, so it includes a wide selection of the stations. There were several pages of results, and each page has 20 prices, so I think I had the data from most of the stations,” he says.

“Ah OK, cool. It seems it’s a pretty good sample,” you say. “I was worried it might be a commercial website that only shows the prices from ‘preferred’ stations.”

You’re reassured you have a representative sample of the prices, so you’ll be able to study the question “Which part of the city has cheaper electricity prices?” since you have the data from a reasonable number of charging stations.

You load up Bob’s dataset datasets/epriceswide.csv and print the first few rows to see how the dataset is structured.

epriceswide = pd.read_csv("../datasets/epriceswide.csv")
epriceswide.head()
   East  West
0   7.7  11.8
1   5.9  10.0
2   7.0  11.0
3   4.8   8.6
4   6.3   8.3

It seems Bob stored the data in “wide format,” with the prices from the charging stations in the East in one column, and the prices from the charging stations in the West in another column. In other words, whether a given price is from the East or the West is encoded in its column position. You can use the .melt() method to convert data from wide format to long format.
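
Here is a sketch of what that conversion might look like; the column names loc and price are chosen to match the long-format file eprices.csv, which we load next.

epriceslong = epriceswide.melt(var_name="loc", value_name="price")
epriceslong.head()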

eprices = pd.read_csv("../datasets/eprices.csv")
eprices.head()
    loc  price
0  East    7.7
1  East    5.9
2  East    7.0
3  East    4.8
4  East    6.3

You’re confident you know what you’ll need to do from here.

“Okay Bob. Thanks for the info.” you say to him. “I think I have everything I need to do the statistical analysis, and I’ll get back to you soon with the results.”

“Cool! Thanks for agreeing to provide ‘stats support’ on this. I’m looking forward to seeing the results. Do you know what type of statistical analysis you’ll use for this data set?” he asks.

Bob is asking a really good question. Depending on the data we have, there are several types of statistical analysis recipes we can use. In this case, we’re trying to compare the means of a numerical variable measured in two groups. In order to describe the variability in the data, we’ll also measure the standard deviation for the two groups. Based on these statistics computed from the observed data, we’ll be able to make conclusions about the general population (all electricity prices). Our final goal is to tell Bob conclusively whether there is a difference between East and West prices, and if so, which side of the city has the cheaper prices.

You recently did an analysis of a similar dataset where you compared the means of two groups, so you have a clear plan for what you’re going to do.

“My plan is to compute the mean and the standard deviation for the East and West prices, then compare the means to see if there is a difference. If there is a difference between the means, I’ll run a statistical test to see if the observed difference is statistically significant,” you explain.

“Wow, sounds complicated!” exclaims Bob, mystified by the technical jargon.

You reassure him there is nothing too complicated: statistical testing is just a fancy way to compare the means obtained from two groups and determine whether the observed difference in the means could have occurred by chance. You can clearly see he’s interested in learning more about how statistical tests work, but you know this is not the moment to explain everything. There is a whole book ahead of us for that. In Chapter 2 we’ll learn to use probability theory to model data distributions, then in Chapter 3 we’ll develop tools for estimating parameters from data samples and using statistical hypothesis testing procedures. Let’s not rush into things. Easy does it.
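
As a preview of that plan, here is a rough sketch in Pandas using the long-format eprices data frame loaded above; the statistical test itself is covered later in the book.

# Mean and standard deviation of the prices in each group
summary = eprices.groupby("loc")["price"].agg(["mean", "std"])

# Difference between the East and West group means
summary.loc["East", "mean"] - summary.loc["West", "mean"]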

Collecting the students dataset#

Charlotte is a bit of a techie, so she set up a learning management system (LMS) server for the students in her class. Charlotte learned how to do this (run web applications on her own server) after months of difficulties trying to upload her teaching materials to the platform provided by her school. She reasoned that it can’t be that hard to run her own server: it’s just documents, video lectures, and exercises, and there is plenty of software that can do this.

Charlotte’s main reason for running her own server is because she didn’t feel comfortable with the idea of her students’ learning process being monitored by a proprietary learning platform. You’re excited to talk to her, because it’s nice to meet a teacher who cares deeply about student data privacy.

The LMS Charlotte used for this class stores student records in a database. She made some Structured Query Language (SQL) queries using the function pd.read_sql_query() to obtain detailed logs of each student’s actions on the platform. She then aggregated the total effort (total time spent on the platform) and the combined score for each student. She exported the data as a CSV file to her laptop, then sent it to you.

Using the function pd.read_csv, you can load the students.csv dataset and print the first few rows.

students = pd.read_csv("../datasets/students.csv", index_col="student_ID")
students.head()
           background curriculum  effort  score
student_ID
1                arts     debate   10.96   75.0
2             science    lecture    8.69   75.0
3                arts     debate    8.60   67.0
4                arts    lecture    7.92   70.3
5             science     debate    9.90   76.1
The SQL queries Charlotte ran were of the general form

    SELECT <which variables> FROM <which table>;
    SELECT <which variables> FROM <which table> WHERE <conditions>;

e.g.

    SELECT student_id, time_on_task FROM learner_analytics;
    SELECT student_id, score FROM student_final_grades;

She then aggregated the total effort for each student and joined the effort and score tables.

You go through the column names to confirm that you understand the meaning of the categorical variables (see page ) and you know how all the numerical variables were calculated.

You wonder how she calculated the effort and score variables. She explains she computed the effort variable from the total time spent learning, which includes watching videos and doing exercises. She computed the score variable as the average success rate on all exercises that the student completed.

“Okay, thanks Charlotte,” you say, “the dataset looks great! Let me just confirm the two questions you’re interested in.”

“Right. So the first question is whether the curriculum choice (lecture vs. debate) makes a difference in students’ scores. And the second question is whether students who made more effort in the course got a higher score.”

The first question is similar to the electricity prices comparison (a comparison between the means in two groups), so you know exactly how to proceed for this one. The second question is about a relation between two numerical variables, which you can study by using linear regression to model how the score variable increases (or decreases) for different values of the effort variable.

“Thanks for going over all this with me,” you say. “I know exactly how to proceed from here.”

“You’re welcome, and looking forward to discussing the results with you when they are ready,” says Charlotte as you say goodbye.

import numpy as np
xD = students[students["curriculum"]=="debate"]["score"].values
xL = students[students["curriculum"]=="lecture"]["score"].values
np.mean(xD), np.mean(xL), np.mean(xD) - np.mean(xL)
(76.4625, 68.14285714285714, 8.319642857142867)
from scipy.stats import ttest_ind

ttest_ind(xD, xL)
TtestResult(statistic=1.7197867420465698, pvalue=0.10917234443214315, df=13.0)
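
For Charlotte’s second question (the relationship between effort and score), linear regression is one way to proceed. Here is a minimal sketch using scipy.stats.linregress; interpreting the slope and the p-value is covered later in the book.

from scipy.stats import linregress

lm = linregress(students["effort"], students["score"])
lm.slope, lm.intercept, lm.pvalue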

Collecting the kombucha dataset#

Khalid

Collecting the doctors dataset#

Dan

Collecting the website visitors dataset#

Recall the description of the website visitors dataset we introduced on page . You can see the first few rows of the dataset by loading the CSV file datasets/visitors.csv.

visitors = pd.read_csv("../datasets/visitors.csv")
visitors.head()
        IP address version  bought
0     135.185.92.4       A       0
1      14.75.235.1       A       1
2   50.132.244.139       B       0
3  144.181.130.234       A       0
4      90.92.5.100       B       0

Looking at the dataset helps you understand the general structure, but you have some questions about how the data was collected, so you decide to call Vanessa and ask her.

“Hi Vanessa. I’m looking at the data you sent me and I had some questions,” you start.

“Yes, ask away,” she responds.

“How did you assign the visitors to version A or B?” you ask.

“Every time the website received a new visitor (new IP address), it randomly sent them to either version A or version B,” she says. “It’s basically equivalent to flipping a coin.” She goes on to explain that the two versions of the website can be identified from the server logs, since the two designs use different background images. A visitor who sees version A of the website will load the background image images/bgA.jpg, while visitors who see version B will load the image images/bgB.jpg.

“And how did you calculate the bought column?” you ask.

“When a visitor completes the purchase steps, they are sent to a special /thankyou page, so I used that to identify visitors who bought something.”

“Can you tell me more about the steps you took to extract the data from the server logs?”

“Sure,” replies Vanessa and starts on a long explanation, which is summarized below.

Vanessa started by extracting the web server access logs for the date range when the experiment was running. She then loaded the log files into Pandas data frames and concatenated the data from the different days.

She then did some data cleaning by excluding rows generated from bots based on the user agent value.

She then applied the main logic for determining the bought variable by collecting all the log entries for individual IP addresses. For each visitor (unique IP address), she looked for a request to images/bgA.jpg or images/bgB.jpg followed by a request to the /thankyou page, which indicates the visitor bought something (bought=1). If the visitor never reached the /thankyou page, then she knew they didn’t make a purchase, so she recorded bought=0 for them.

Finally she stored the data as visitors.csv and sent it to you.

Remote shell:

    zcat /var/log/nginx/access.log.*.gz > /tmp/access_logs.txt

Local shell:

    scp minireference.com:/tmp/access_logs.txt data/access_logs.txt

Links

# access_logs = open("data/access_logs.txt")
# df = pd.read_csv(
#     access_logs,
#     sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
#     engine='python',
#     usecols=[0, 3, 4, 5, 6, 7, 8],
#     names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
#     na_values='-',
#     header=None
# )
  • A VIEWS = CONTAINS /static/images/homepage/bgA.jpg

  • B VIEWS = CONTAINS /static/images/homepage/bgB.jpg

  • A CONVERSIONS = CONTAINS /static/images/homepage/bgA.jpg and /thankyou

  • B CONVERSIONS = CONTAINS /static/images/homepage/bgB.jpg and /thankyou

  • p_A = A CONVERSIONS / A VIEWS

  • p_B = B CONVERSIONS / B VIEWS

We can compute the conversion rate for each version of the website using the .groupby() method.

visitors.groupby("version") \
         ["bought"].value_counts(normalize=True)
version  bought
A        0         0.935175
         1         0.064825
B        0         0.962229
         1         0.037771
Name: proportion, dtype: float64
visitors.groupby("version") \
         ["bought"].agg(["sum", "count"]) \
         .eval("sum/count")
version
A    0.064825
B    0.037771
dtype: float64

For version A (the new design), the conversion rate is \(p_A = 0.0648\). For version B (the old design), the conversion rate is \(p_B = 0.0377\).

Collecting the players dataset#

# TODO?

Bonus topics#

NumPy arrays#

Under the hood, Pandas Series and DataFrame objects are based on efficient numerical NumPy arrays. You generally won’t need to interact with NumPy commands when working in Pandas, but it is sometimes useful to use NumPy syntax to perform efficient selections of data.

import numpy as np
values = np.array([1, 3, 5, 7])
values
array([1, 3, 5, 7])
values - 2
array([-1,  1,  3,  5])
np.exp(values)
array([   2.71828183,   20.08553692,  148.4131591 , 1096.63315843])

Selecting a subset of the values#

values < 4
array([ True,  True, False, False])
values[values < 4]
array([1, 3])

Create a list of evenly spaced numbers#

(often used when creating graphs)

np.linspace(2, 4, 5)
array([2. , 2.5, 3. , 3.5, 4. ])
# # ALT. use arange to create the same list
# np.arange(2, 4+0.5, 0.5)

Index and sorting#

Index operations (set, change, reset, etc.), sorting with the .sort_values() and .sort_index() methods, and ranking with the .rank() method.
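
Here is a quick sketch of some of these operations applied to the df data frame we loaded from minimal.csv earlier; which calls you need will depend on your data.

df.sort_values("y")                     # sort the rows by the values in column y
df.sort_values("y", ascending=False)    # same, in decreasing order
df.set_index("team")                    # use the team column as the row index
df.set_index("team").reset_index()      # ... and go back to the default integer index
df["y"].rank()                          # rank of each value in column y (1 = smallest)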

Pandas plot methods#

Use the .plot() method to obtain basic plots; see the Seaborn tutorial in Appendix E for more advanced plots (specific to statistics).
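
For example, assuming the Matplotlib plotting library is installed, the following calls produce basic plots from the df data frame loaded earlier.

df["y"].plot(kind="hist")               # histogram of the values in column y
df.plot(kind="scatter", x="x", y="y")   # scatter plot of y versus x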

Conclusion#

The Stack Overflow discussion forums are a good place to ask questions once you learn the jargon terms for data transformations (.melt(), .groupby(), .agg(), etc.). You can also ask ChatGPT or other AI assistants for help with data management tasks.

Exercises#

  • 1/ TODO: data transformations: melt, dropna, filter by value
    TODO: come up with data, answer, solution

  • 2/ Rename/normalize categorical values, OH – Ohio, WA – Washington, etc.

  • 3/ Use groupby to compute the average \(y\)-position for the players in the two teams.
    TODO: come up with exercise, answer, solution

  • 4/ str methods: split City, ST – into city and state (or Last Name, First name into separate names)

  • 5/ Filter out outliers using pre-specified cutoff (Q3 + 1.5IQR) but don’t tell how limits were computed (FWD refs to Descr. Stats and Z-score).

CUT MATERIAL#

Datasets for the book#

students = pd.read_csv("../datasets/students.csv")
students.head()
   student_ID background curriculum  effort  score
0           1       arts     debate   10.96   75.0
1           2    science    lecture    8.69   75.0
2           3       arts     debate    8.60   67.0
3           4       arts    lecture    7.92   70.3
4           5    science     debate    9.90   76.1

Data pre-processing tasks#

  • Extract the “raw” data from various data source formats (spreadsheet, databases, files, web servers).

  • Transform the data by reshaping and cleaning it.

  • Load the data into the system used for statistical analysis.

Extract#

Extract data from different source formats#

On UNIX systems (Linux and macOS) the command line program head can be used to show the first few lines of any file. The command head is very useful for exploring files—by printing the first few lines, you can get an idea of the format it is in.

Unfortunately, the command head is not available on Windows, so instead of relying on command line tools, we’ll write a simple Python function called head that does the same thing as the command line tool: it prints the first few lines of a file. By default this function prints the first seven lines of the file, but users can override the count argument to request a different number of lines.

import os

def head(path, count=7):
    """
    Print the first `count` lines of the file at `path`.
    """
    # Convert UNIX-style path separators to the local OS separator (for Windows users)
    path = path.replace("/", os.sep)
    with open(path, "r") as datafile:
        lines = datafile.readlines()
    for line in lines[0:count]:
        print(line, end="")

The function head contains some special handling for Windows users. If the path is specified using the UNIX path separator /, it will be auto-corrected to use the Windows path separator \.

Load#

ALT. Storing statistical datasets ALT. Data loading

When working with data, it’s important to follow good practices for storing the data and the metadata associated with your analysis. Make sure you always have the dataset in a format that is easy to load into the software system you’ll be using for the data analysis.

It’s still a good idea to save the dataset to a new CSV file in order to separate the extraction, transformation, and cleaning steps from the subsequent statistical analysis. Saving the dataset in a general-purpose format like CSV will also make it easy to share the data with collaborators, or experiment with other statistical software like RStudio, JASP, and Jamovi.

To save the data frame cleandf as the CSV file mydata.csv, we can use its .to_csv() method.

Let’s save the cleaned data to the file mydata.csv in a directory mydataset.

cleandf.to_csv("mydataset/mydata.csv", index=False)
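
Note that the call above assumes the directory mydataset already exists; the .to_csv() method does not create missing directories. If needed, you can create the directory first, for example with os.makedirs.

import os
os.makedirs("mydataset", exist_ok=True)   # safe to call even if the directory already exists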

We can verify the data was successfully saved to disk using the head function defined earlier, which prints the first few lines of the file.

head("mydataset/mydata.csv")
x,y,team,level
1.0,2.0,a,3
1.5,1.0,a,2
2.0,1.5,a,1
2.5,2.0,b,3
3.0,1.5,b,3

We should also create a short text file README.txt that describes the data file, and provides the codebook information for the variables.

Include in the README any information about the dataset that might be relevant for the research project.

Remember to document all steps we followed to obtain the dataset. This includes information about how we obtained the data and the transformation and cleaning steps we performed. It’s important to record all these details in order to make the ETL pipeline reproducible, meaning someone else could run the same procedure as you to obtain the same dataset. Ideally, you should include a Jupyter notebook or Python script with the data transformations.
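
Here is one way you might create such a README file from Python (just a sketch; writing the file by hand in any text editor works equally well).

readme_text = """Players dataset
===============
Description: Synthetic data used to show Pandas operations.
Filename: mydata.csv
Format: A CSV file with 5 rows and 4 columns.
- x (numeric): horizontal position of the player.
- y (numeric): vertical position of the player.
- team (categorical): which team the player is in (a or b).
- level (categorical): player strength (1, 2, or 3).
Source: Synthetic data created by the author.
License: CC0 (public domain)
"""

with open("mydataset/README.txt", "w") as f:
    f.write(readme_text)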

head("mydataset/README.txt", count=11)
Players dataset
===============
Description: Synthetic data used to show Pandas operations.
Filename: mydata.csv
Format: A CSV file with 5 rows and 4 columns.
- x (numeric): horizontal position of the player.
- y (numeric): vertical position of the player.
- team (categorical): which team the player is in (a or b).
- level (categorical): player strength (1, 2, or 3).
Source: Synthetic data created by the author.
License: CC0 (public domain)