Appendix D — Pandas tutorial#

This appendix is a quick hands-on tutorial on Pandas, the Python library for working with tabular data. We'll start with the two core data structures, series and data frames, then cover selecting, transforming, and cleaning data, and finish with loading data from various file formats.

Pandas overview#

Installing Pandas#

First let’s make sure pandas is installed using the %pip Jupyter command.

# %pip install pandas

We then import the pandas library as the alias pd.

import pandas as pd

Series#

Pandas pd.Series objects are similar to Python lists like [3, 5, 7, 9]: they are containers for a sequence of values.

s = pd.Series([3, 5, 7, 9])
s
0    3
1    5
2    7
3    9
dtype: int64

Python lists use integers for identifying the elements of the list (first = index 0, second = index 1, last = index len(mylist)-1 for a list named mylist).

Pandas series support the same functionality. Here are some examples of accessing individual values of the series using the default 0-based indexing.

print("First:  index =", 0, " value =", s[0])
print("Second: index =", 1, " value =", s[1])
print("Last:   index =", len(s)-1, " value =", s[len(s)-1])
First:  index = 0  value = 3
Second: index = 1  value = 5
Last:   index = 3  value = 9

The series index attribute tells you all the possible indices for the series.

s.index
RangeIndex(start=0, stop=4, step=1)
list(s.index)
[0, 1, 2, 3]

The series s uses the default index [0, 1, 2, 3], which consists of a range of integers, starting at 0, just like the index of a Python list with four elements.

The series' values attribute gives you the underlying NumPy array that stores the values.

s.values
array([3, 5, 7, 9])
type(s.values)
numpy.ndarray

In addition to accessing individual elements like this,

s[0]
3

we can also “slice” a series to obtain a new series that contains indices and values of the slice:

s[0:3]
0    3
1    5
2    7
dtype: int64

We can perform arithmetic operations on a series and compute summary statistics like the sum, mean, and standard deviation.

s.sum()
24
s / s.sum()
0    0.125000
1    0.208333
2    0.291667
3    0.375000
dtype: float64
s.mean()
6.0
s.std()
2.581988897471611

We can also use arbitrary functions from NumPy on a series, and Pandas will apply the function element-wise to the values in the series.

import numpy as np
np.log(s)
0    1.098612
1    1.609438
2    1.945910
3    2.197225
dtype: float64

Bonus material 1#

# a series of float values
s2 = pd.Series([0.3, 1.5, 2.2])
s2
0    0.3
1    1.5
2    2.2
dtype: float64
# a series of categorical values
s3 = pd.Series(["a", "b", "b", "c"])
s3
0    a
1    b
2    b
3    c
dtype: object

Bonus material 2#

Pandas series allow arbitrary labels to be used as the index, not just integers. For example, we can use a series index that consists of string labels like "x", "y", etc.

s4 = pd.Series(index=["x", "y", "z", "t"],
               data =[ 3,   5,   7,   9 ])
s4
x    3
y    5
z    7
t    9
dtype: int64
s4.index
Index(['x', 'y', 'z', 't'], dtype='object')
s4.values
array([3, 5, 7, 9])

We can now use the string labels to access individual elements of the series.

s4["y"]
5

In other words, Pandas series also act like Python dictionary objects with arbitrary keys. Indeed, any quantity that can be used as a key in a dictionary (any Python hashable object) can also be used as a label in a Pandas series, and the list of keys of a Python dictionary corresponds to the index of a Pandas series.
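For example, a series can be constructed directly from a Python dictionary, in which case the dictionary keys become the index labels. A minimal sketch, using illustrative data:

# the keys of the dict become the index labels of the series
prices = {"apple": 1.2, "banana": 0.5, "cherry": 3.0}
sprices = pd.Series(prices)
list(prices.keys()) == list(sprices.index)
True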

Data frames#

Let's load the dataset minimal.csv into a data frame.

# !cat "../datasets/minimal.csv"
df = pd.read_csv("../datasets/minimal.csv")
df
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
df.dtypes
x        float64
y        float64
team      object
level      int64
dtype: object

Data frame properties#

type(df)
pandas.core.frame.DataFrame
df.index
RangeIndex(start=0, stop=5, step=1)
list(df.index)
[0, 1, 2, 3, 4]
df.columns
Index(['x', 'y', 'team', 'level'], dtype='object')
df.shape
(5, 4)
df.dtypes
x        float64
y        float64
team      object
level      int64
dtype: object
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       5 non-null      float64
 1   y       5 non-null      float64
 2   team    5 non-null      object 
 3   level   5 non-null      int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 538.0 bytes
# df.axes
# df.memory_usage()
# df.values

Accessing and selecting values#

We can access an individual value using the .loc accessor, specifying the row label and the column label:

df.loc[2, "y"]
1.5

Selecting entire rows#

row2 = df.loc[2,:]
row2
x        2.0
y        1.5
team       a
level      1
Name: 2, dtype: object
# Rows of the dataframe are Series objects
type(row2)
pandas.core.series.Series
row2.index
Index(['x', 'y', 'team', 'level'], dtype='object')
row2.values
array([2.0, 1.5, 'a', 1], dtype=object)
row2["y"]
1.5

Selecting entire columns#

ys = df["y"]
ys
0    2.0
1    1.0
2    1.5
3    2.0
4    1.5
Name: y, dtype: float64
df["y"].equals( df.loc[:,"y"] )
True
df["y"].equals( df.y )
True
type(ys)
pandas.core.series.Series
ys.index
RangeIndex(start=0, stop=5, step=1)
ys.values
array([2. , 1. , 1.5, 2. , 1.5])
ys[2]
1.5

Selecting multiple columns#

df[["x", "y"]]
     x    y
0  1.0  2.0
1  1.5  1.0
2  2.0  1.5
3  2.5  2.0
4  3.0  1.5

Selecting subsets of the data frame#

We rarely need to do this, but to illustrate the .loc selection syntax, here is the code for selecting the y and team columns from the last two rows of the data frame. Note that, unlike Python slices, .loc slices include the end label.

df.loc[3:5, ["y","team"]]
     y team
3  2.0    b
4  1.5    b

Selecting subsets of rows#

df.head(2)
# df.tail(2)
# df.sample(3)
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2

To select only rows where team is b, we first build the boolean selection mask…

mask = df["team"] == "b"
mask
0    False
1    False
2    False
3     True
4     True
Name: team, dtype: bool

… then select the rows using the mask.

df[mask]
     x    y team  level
3  2.5  2.0    b      3
4  3.0  1.5    b      3

The above two-step process can be combined into a more compact expression:

df[df["team"]=="b"]
     x    y team  level
3  2.5  2.0    b      3
4  3.0  1.5    b      3
df[(df["team"] == "b") & (df["x"] >= 3)]
     x    y team  level
4  3.0  1.5    b      3
The .isin() method builds a boolean mask that selects the rows whose value appears in a given list of options:

df["level"].isin([2,3])
0     True
1     True
2    False
3     True
4     True
Name: level, dtype: bool

Creating data frames from scratch#

There are other ways to create a pd.DataFrame from Python data containers like dicts and lists.

Creating a data frame from a dictionary of columns:

dict_of_columns = {
    "x": [1.0, 1.5, 2.0, 2.5, 3.0],
    "y": [2.0, 1.0, 1.5, 2.0, 1.5],
    "team": ["a", "a", "a", "b", "b"],
    "level": [3, 2, 1, 3, 3],
}

df2 = pd.DataFrame(dict_of_columns)
df2
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
# df2 is identical to df loaded from minimal.csv
df2.equals(df)
True

Creating a data frame from a list of records (lists or tuples):

list_records = [
    [1.0, 2.0, "a", 3],
    [1.5, 1.0, "a", 2],
    [2.0, 1.5, "a", 1],
    [2.5, 2.0, "b", 3],
    [3.0, 1.5, "b", 3],
]
columns = ["x", "y", "team", "level"]

df3 = pd.DataFrame(list_records, columns=columns)
df3
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
# df3 is identical to df loaded from minimal.csv
df3.equals(df)
True

Creating a data frame from a list of dicts:

dict_records = [
    dict(x=1.0, y=2.0, team="a", level=3),
    dict(x=1.5, y=1.0, team="a", level=2),
    dict(x=2.0, y=1.5, team="a", level=1),
    dict(x=2.5, y=2.0, team="b", level=3),
    dict(x=3.0, y=1.5, team="b", level=3),
]
df4 = pd.DataFrame(dict_records)
df4
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
# df4 is identical to df loaded from minimal.csv
df4.equals(df)
True
# Note dict(key="val") is just an alternative syntax for {"key":"val"}
dict(x=1.0, y=2.0, group="a", level=3) == {"x":1.0, "y":2.0, "group":"a", "level":3}
True

Exercises 1#

Sorting, grouping and aggregation#

The .groupby() method splits the data frame into groups according to the values in a given column. We can then select a column of interest and compute an aggregate value for each group.

df.groupby("team")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f27a7eab7c0>
df.groupby("team")["x"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f27a7eab340>
df.groupby("team")["x"].mean()
team
a    1.50
b    2.75
Name: x, dtype: float64
df.groupby("team")["x"].aggregate(["sum", "count", "mean"])
      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75
df.groupby("team")["x"] \
  .agg(["sum", "count", "mean"])
      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75
(df
  .groupby("team")["x"]
  .agg(["sum", "count", "mean"])
)
      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75

Data transformations#

Transpose#

The transpose of a matrix corresponds to flipping the rows and the columns.

dfT = df.transpose()
dfT
          0    1    2    3    4
x       1.0  1.5  2.0  2.5  3.0
y       2.0  1.0  1.5  2.0  1.5
team      a    a    a    b    b
level     3    2    1    3    3
# add the columns to the index; result is series with a multi-index
# df.stack()
# ALT. way to do transpose
# df.stack().reorder_levels([1,0]).unstack()

Adding new columns#

df["xy"] = df["x"] * df["y"]
print(df)
     x    y team  level   xy
0  1.0  2.0    a      3  2.0
1  1.5  1.0    a      2  1.5
2  2.0  1.5    a      1  3.0
3  2.5  2.0    b      3  5.0
4  3.0  1.5    b      3  4.5
# undo the modification
df = df.drop(columns=["xy"])
df2 = df.copy()
df2["xy"] = df["x"] * df["y"]
# ALT. create the new column using the .assign method
df2 = df.assign(xy = df["x"] * df["y"])
import numpy as np
df3 = df.assign(xy = df["x"] * df["y"]) \
        .assign(z = 1) \
        .assign(r = np.sqrt(df["x"]**2 + df["y"]**2)) \
        .assign(team = df["team"].str.upper())
print(df3)
     x    y team  level   xy  z         r
0  1.0  2.0    A      3  2.0  1  2.236068
1  1.5  1.0    A      2  1.5  1  1.802776
2  2.0  1.5    A      1  3.0  1  2.500000
3  2.5  2.0    B      3  5.0  1  3.201562
4  3.0  1.5    B      3  4.5  1  3.354102

Dropping rows and columns#

df.drop([0,2,4])
     x    y team  level
1  1.5  1.0    a      2
3  2.5  2.0    b      3
df.drop(columns=["level"])
     x    y team
0  1.0  2.0    a
1  1.5  1.0    a
2  2.0  1.5    a
3  2.5  2.0    b
4  3.0  1.5    b

Other related methods are .dropna(), for removing rows with missing values, and .drop_duplicates(), for removing rows that contain duplicate data; see the sketch below.
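Here is a minimal sketch of these two methods, using a small illustrative data frame (not one of the tutorial's datasets):

# a toy data frame with one missing value and one duplicated row
tmpdf = pd.DataFrame({"a": [1, 1, None], "b": [2, 2, 3]})
tmpdf.dropna()           # drops the last row, which contains a missing value
tmpdf.drop_duplicates()  # drops the second row, a duplicate of the first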

Renaming columns and values#

To rename the columns of a data frame, we can use the .rename() method with the columns argument set to a Python dictionary of the replacements we want to make.

df.rename(columns={"team":"TEAM", "level":"LEVEL"})
     x    y TEAM  LEVEL
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

To rename values, we can use the .replace() method, passing in a dictionary of the replacements we want to make for each column.

team_mapping = {"a":"A", "b":"B"}
df.replace({"team":team_mapping})
     x    y team  level
0  1.0  2.0    A      3
1  1.5  1.0    A      2
2  2.0  1.5    A      1
3  2.5  2.0    B      3
4  3.0  1.5    B      3
# # ALT. use str-methods to get uppercase letter
# df.assign(team = df["team"].str.upper())

Reshaping data frames#

views_data = {
    "season": ["Season 1", "Season 2"],
    "Episode 1": [1000, 10000],
    "Episode 2": [2000, 20000],
    "Episode 3": [3000, 30000],
}
tvwide = pd.DataFrame(views_data)
tvwide
     season  Episode 1  Episode 2  Episode 3
0  Season 1       1000       2000       3000
1  Season 2      10000      20000      30000
tvlong = tvwide.melt(id_vars=["season"],
                     var_name="episode",
                     value_name="views")
tvlong
     season    episode  views
0  Season 1  Episode 1   1000
1  Season 2  Episode 1  10000
2  Season 1  Episode 2   2000
3  Season 2  Episode 2  20000
4  Season 1  Episode 3   3000
5  Season 2  Episode 3  30000

Let’s see the same thing now but sorted by season then by episode:

tvlong.sort_values( by=["season", "episode"]) \
      .reset_index(drop=True)
     season    episode  views
0  Season 1  Episode 1   1000
1  Season 1  Episode 2   2000
2  Season 1  Episode 3   3000
3  Season 2  Episode 1  10000
4  Season 2  Episode 2  20000
5  Season 2  Episode 3  30000

The method for the opposite transformation (converting long data to wide data) is called pivot and works like this:

tvlong.pivot(index="season",
             columns="episode",
             values="views")
episode   Episode 1  Episode 2  Episode 3
season
Season 1       1000       2000       3000
Season 2      10000      20000      30000
# # ALT. to get *exactly* the same data frame as `tvwide`
# tvlong.pivot(index="season",
#              columns="episode",
#              values="views") \
#        .reset_index() \
#        .rename_axis(columns=None)

String methods#
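We have already seen the string methods .str.upper() and .str.replace() in action above. Here is a minimal sketch of a few other common .str methods, applied to an illustrative series:

names = pd.Series(["  Alice ", "BOB", "carol"])
names.str.strip()        # remove surrounding whitespace
names.str.lower()        # convert to lowercase
names.str.contains("o")  # boolean mask: which values contain "o"
names.str.len()          # length of each string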

Merging data frames#
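A minimal sketch of the pd.merge function, which joins two data frames on a shared key column (the data frames below are illustrative, not part of the tutorial's datasets):

left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})
# an inner join keeps only the keys present in both data frames (2 and 3)
pd.merge(left, right, on="key", how="inner")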

Type conversions#
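A minimal sketch of common type conversions, assuming an illustrative series of numbers stored as strings:

rawvals = pd.Series(["1", "2", "3"])
rawvals.astype(int)     # convert the strings to integers
rawvals.astype(float)   # convert the strings to floats
pd.to_numeric(rawvals)  # general-purpose numeric conversion
# convert an object series to the memory-efficient category dtype
pd.Series(["a", "b", "b"]).astype("category")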

Creating dummy variables from categorical variables#
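A minimal sketch using the pd.get_dummies function, applied to the team column of the data frame df we worked with earlier:

# replace the `team` column with indicator columns team_a and team_b
pd.get_dummies(df, columns=["team"])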

Data cleaning#

Standardize categorical values#

pets = pd.Series(["D", "dog", "Dog", "doggo"])
pets
0        D
1      dog
2      Dog
3    doggo
dtype: object
dogsubs = {"D":"dog", "Dog":"dog", "doggo":"dog"}
pets.replace(dogsubs)
0    dog
1    dog
2    dog
3    dog
dtype: object
pets.replace(dogsubs).value_counts()
dog    4
Name: count, dtype: int64

Numerical values#

float("1.2")
1.2
float("1,2")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[89], line 1
----> 1 float("1,2")

ValueError: could not convert string to float: '1,2'
"1,2".replace(",", ".")
'1.2'
float("1. 2")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[91], line 1
----> 1 float("1. 2")

ValueError: could not convert string to float: '1. 2'
"1. 2".replace(" ", "")
'1.2'
rawns = pd.Series(["1.2", "1,2", "1. 2"])
rawns
0     1.2
1     1,2
2    1. 2
dtype: object
# rawns.astype(float)  # raises ValueError because of the nonstandard formats
rawns.str.replace(",", ".") \
     .str.replace(" ", "") \
     .astype(float)
0    1.2
1    1.2
2    1.2
dtype: float64

Use the .drop_duplicates() method to remove duplicated rows.

Dealing with missing values#

rawdf = pd.read_csv("../datasets/raw/minimal.csv")
rawdf
      x    y team  level
0   1.0  2.0    a    3.0
1   1.5  1.0    a    2.0
2   2.0  1.5    a    1.0
3   1.5  1.5    a    NaN
4   2.5  2.0    b    3.0
5   3.0  1.5    b    3.0
6  11.0  NaN  NaN    2.0
rawdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       7 non-null      float64
 1   y       6 non-null      float64
 2   team    6 non-null      object 
 3   level   6 non-null      float64
dtypes: float64(3), object(1)
memory usage: 352.0+ bytes
print(rawdf.isna())
       x      y   team  level
0  False  False  False  False
1  False  False  False  False
2  False  False  False  False
3  False  False  False   True
4  False  False  False  False
5  False  False  False  False
6  False   True   True  False
rawdf.isna().sum(axis="rows")
x        0
y        1
team     1
level    1
dtype: int64
rawdf.isna().sum(axis="columns")
0    0
1    0
2    0
3    1
4    0
5    0
6    2
dtype: int64
# # ALT. convert to data types that support <NA> values
# tmpdf = pd.read_csv("../datasets/raw/minimal.csv")
# rawdf2 = tmpdf.convert_dtypes()
# rawdf2.dtypes
float("NaN")
nan

Remove rows that contain any missing values:

cleandf = rawdf.dropna(how="any")
cleandf
     x    y team  level
0  1.0  2.0    a    3.0
1  1.5  1.0    a    2.0
2  2.0  1.5    a    1.0
4  2.5  2.0    b    3.0
5  3.0  1.5    b    3.0
# Raises warning
# cleandf["level"] = cleandf["level"].astype(int)
# Convert the level column to `int`
cleandf = cleandf.assign(level = cleandf["level"].astype(int))
# # Confirm `cleandf` is the same as `df` we worked with earlier
# cleandf.reset_index(drop=True).equals(df)
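An alternative to dropping rows is to fill in the missing values using the .fillna() method. Here is a sketch, where the fill values are illustrative choices rather than part of the original tutorial:

# fill each column's missing values with a sensible default
rawdf.fillna({"y": rawdf["y"].mean(),              # the column mean
              "team": "unknown",                   # a placeholder label
              "level": rawdf["level"].mode()[0]})  # the most common value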

Dealing with outliers#

xs = pd.Series([1.0, 2.0, 3.0, 4.0, 50.0])
xs
0     1.0
1     2.0
2     3.0
3     4.0
4    50.0
dtype: float64

Let's visually inspect the values in the series xs to "see" the outlier.

import seaborn as sns
sns.stripplot(xs)
[strip plot of the values in the series xs; the point at 50.0 stands out from the rest]
# Tukey outliers limits for values in the `xs` series
Q1, Q3 = xs.quantile([0.25, 0.75])
IQR = Q3 - Q1
xlim_low = Q1 - 1.5*IQR
xlim_high = Q3 + 1.5*IQR
(xlim_low, xlim_high)
(-1.0, 7.0)
# # ALT. 2-standard deviations Z-value outlier limits
# xmean = xs.mean()
# xstd = xs.std()
# xlow = xmean - 2*xstd
# xhigh = xmean + 2*xstd
# (xlow, xhigh)

Let's build a mask that identifies outliers using the Tukey criterion computed above: values below -1.0 or above 7.0.

outliers = (xs < -1.0) | (xs > 7.0)
outliers
0    False
1    False
2    False
3    False
4     True
dtype: bool

The "not outliers" are the values we want to keep. The Python operator ~ (logical NOT) negates the mask:

~outliers
0     True
1     True
2     True
3     True
4    False
dtype: bool
xs[~outliers]
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

The mean and standard deviation computed from all the values:

xs.mean()
12.0
xs.std()
21.27204738618265

The mean and standard deviation computed without the outliers:

xs[~outliers].mean()
2.5
xs[~outliers].std()
1.2909944487358056

Data sources#

Data file formats#

CSV files#

Comma-Separated Values (CSV) is the most common format for tabular data. CSV files are regular text files that contain individual values (numeric or text) separated by commas. In many CSV files, the first line is a special header row that contains the names of the variables.

!head ../datasets/minimal.csv
x,y,team,level
1.0,2.0,a,3
1.5,1.0,a,2
2.0,1.5,a,1
2.5,2.0,b,3
3.0,1.5,b,3
df = pd.read_csv("../datasets/minimal.csv")
print(df)
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

Spreadsheet files#

odsdf = pd.read_excel("../datasets/formats/minimal.ods",
                      sheet_name="Sheet1")
odsdf.equals(df)
True
xlsxdf = pd.read_excel("../datasets/formats/minimal.xlsx",
                       sheet_name="Sheet1")
xlsxdf.equals(df)
True

Other data formats#

TSV#

The Tab-Separated-Values format is similar to CSV, but uses tabs as separators.

! head ../datasets/formats/minimal.tsv
x	y	team	level
1.0	2.0	a	3
1.5	1.0	a	2
2.0	1.5	a	1
2.5	2.0	b	3
3.0	1.5	b	3

You can read a TSV file using pd.read_csv by setting the appropriate value for the sep (separator) argument. Note that in Python strings, the TAB character is written as \t.

tsvdf = pd.read_csv("../datasets/formats/minimal.tsv", sep="\t")
tsvdf
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3
tsvdf.equals(df)
True

JSON#

JavaScript Object Notation looks like this:

!head ../datasets/formats/minimal.json
[
  {"x":1.0, "y":2.0, "team":"a", "level":3},
  {"x":1.5, "y":1.0, "team":"a", "level":2},
  {"x":2.0, "y":1.5, "team":"a", "level":1},
  {"x":2.5, "y":2.0, "team":"b", "level":3},
  {"x":3.0, "y":1.5, "team":"b", "level":3}
]
jsondf = pd.read_json("../datasets/formats/minimal.json")
jsondf.equals(df)
True

HTML tables#

!head -16 ../datasets/formats/minimal.html
<table>
  <thead>
    <tr>
      <th>x</th>
      <th>y</th>
      <th>team</th>
      <th>level</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1.0</td>
      <td>2.0</td>
      <td>a</td>
      <td>3</td>
    </tr>
tables = pd.read_html("../datasets/formats/minimal.html")
htmldf = tables[0]
htmldf.equals(df)
True
htmldf
     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3

XML#

The eXtensible Markup Language is another common data format.

!head -8 ../datasets/formats/minimal.xml
<?xml version='1.0' encoding='utf-8'?>
<players>
  <player>
    <x>1.0</x>
    <y>2.0</y>
    <team>a</team>
    <level>3</level>
  </player>
xmldf = pd.read_xml("../datasets/formats/minimal.xml")
xmldf.equals(df)
True

SQLite databases#

To read from a SQLite database, we first create a SQLAlchemy engine, and then use pd.read_sql_table or pd.read_sql_query with the engine as the con argument.

from sqlalchemy import create_engine
dbpath = "../datasets/formats/minimal.sqlite"
engine = create_engine("sqlite:///" + dbpath)
sqldf = pd.read_sql_table("players", con=engine)
sqldf.equals(df)
True
query = "SELECT x, y, team, level FROM players;"
sqldf2 = pd.read_sql_query(query, con=engine)
sqldf2.equals(df)
True

Case studies#

In this section we tell the background stories of how the example datasets used in the book were collected.

Collecting the website visitors dataset#

Remote shell:
zcat /var/log/nginx/access.log.*.gz > /tmp/access_logs.txt

Local shell:
scp minireference.com:/tmp/access_logs.txt data/access_logs.txt

# access_logs = open("data/access_logs.txt")
# df = pd.read_csv(
#     access_logs,
#     sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
#     engine='python',
#     usecols=[0, 3, 4, 5, 6, 7, 8],
#     names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
#     na_values='-',
#     header=None
# )
visitors = pd.read_csv("../datasets/visitors.csv")
visitors.head()
        IP address version  bought
0     135.185.92.4       A       0
1      14.75.235.1       A       1
2   50.132.244.139       B       0
3  144.181.130.234       A       0
4      90.92.5.100       B       0
  • A views = visitors whose logs contain /static/images/homepage/bgA.jpg

  • B views = visitors whose logs contain /static/images/homepage/bgB.jpg

  • A conversions = visitors whose logs contain /static/images/homepage/bgA.jpg and /thankyou

  • B conversions = visitors whose logs contain /static/images/homepage/bgB.jpg and /thankyou

  • p_A = A conversions / A views

  • p_B = B conversions / B views

visitors.groupby("version") \
         ["bought"].value_counts(normalize=True)
version  bought
A        0         0.935175
         1         0.064825
B        0         0.962229
         1         0.037771
Name: proportion, dtype: float64
visitors.groupby("version") \
         ["bought"].agg(["sum", "count"]) \
         .eval("sum/count")
version
A    0.064825
B    0.037771
dtype: float64

Collecting the electricity prices dataset#

# xE = [7.7, 5.9,  7, 4.8, 6.3, 6.3, 5.5, 5.4, 6.5]
# xW = [11.8, 10, 11, 8.6, 8.3, 9.4,   8, 6.8, 8.5]
eprices = pd.read_csv("../datasets/eprices.csv")
eprices.head()
    loc  price
0  East    7.7
1  East    5.9
2  East    7.0
3  East    4.8
4  East    6.3

Collecting the students dataset#

students = pd.read_csv("../datasets/students.csv", index_col="student_ID")
print(students.head())
           background curriculum  effort  score
student_ID                                     
1                arts     debate   10.96   75.0
2             science    lecture    8.69   75.0
3                arts     debate    8.60   67.0
4                arts    lecture    7.92   70.3
5             science     debate    9.90   76.1
The students dataset was extracted from a database using SQL queries of the form:

SELECT <which variables> FROM <which table>;
SELECT <which variables> FROM <which table> WHERE <conditions>;

For example:

SELECT student_id, time_on_task FROM learner_analytics;
SELECT student_id, score FROM student_final_grades;

We then aggregate the total effort for each student and join the effort and score tables.
students = pd.read_csv("../datasets/students.csv", index_col="student_ID")
students
           background curriculum  effort  score
student_ID
1                arts     debate   10.96   75.0
2             science    lecture    8.69   75.0
3                arts     debate    8.60   67.0
4                arts    lecture    7.92   70.3
5             science     debate    9.90   76.1
6            business     debate   10.80   79.8
7             science    lecture    7.81   72.7
8            business    lecture    9.13   75.4
9            business    lecture    5.21   57.0
10            science    lecture    7.71   69.0
11           business     debate    9.82   70.4
12               arts     debate   11.53   96.2
13            science     debate    7.10   62.9
14            science    lecture    6.39   57.6
15               arts     debate   12.00   84.3
xD = students[students["curriculum"]=="debate"]["score"].values
xL = students[students["curriculum"]=="lecture"]["score"].values
import numpy as np
np.mean(xD), np.mean(xL), np.mean(xD) - np.mean(xL)
(76.4625, 68.14285714285714, 8.319642857142867)
from scipy.stats import ttest_ind

ttest_ind(xD, xL)
Ttest_indResult(statistic=1.7197867420465698, pvalue=0.10917234443214315)

Collecting the apples dataset#

Collecting the kombucha dataset#

Collecting the doctors dataset#

Collecting the players dataset#

Bonus topics#

NumPy arrays#

Under the hood, Pandas Series and DataFrame objects are based on efficient numerical NumPy arrays. You generally won't need to interact with NumPy commands when working in Pandas, but it is sometimes useful to use the NumPy syntax to perform efficient selection of data.

import numpy as np
values = np.array([1, 3, 5, 7])
values
array([1, 3, 5, 7])
values - 2
array([-1,  1,  3,  5])
np.exp(values)
array([   2.71828183,   20.08553692,  148.4131591 , 1096.63315843])

Selecting a subset of the values#

values < 4
array([ True,  True, False, False])
values[values < 4]
array([1, 3])

Create a list of evenly spaced numbers#

(often used when creating graphs)

np.linspace(2, 4, 5)
array([2. , 2.5, 3. , 3.5, 4. ])
# # ALT. use arange to create the same list
# np.arange(2, 4+0.5, 0.5)

Index and sorting#
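A minimal sketch of sorting and index-manipulation methods, applied to the data frame df from earlier:

df.sort_values(by="x", ascending=False)  # sort the rows by the x column
df.set_index("team")                     # use the team column as the row index
df.sort_index()                          # sort the rows by their index labels
df.reset_index(drop=True)                # go back to the default integer index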

Pandas plot methods#
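A minimal sketch of the .plot methods, which draw figures using Matplotlib under the hood (assuming Matplotlib is installed):

df["x"].plot(kind="hist")              # histogram of the x values
df.plot(kind="scatter", x="x", y="y")  # scatter plot of y versus x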

CUT MATERIAL#

Datasets for the book#

students = pd.read_csv("../datasets/students.csv")
students.head()
   student_ID background curriculum  effort  score
0           1       arts     debate   10.96   75.0
1           2    science    lecture    8.69   75.0
2           3       arts     debate    8.60   67.0
3           4       arts    lecture    7.92   70.3
4           5    science     debate    9.90   76.1

Data pre-processing tasks#

  • Extract the "raw" data from various data sources (spreadsheets, databases, files, web servers).

  • Transform the data by reshaping and cleaning it.

  • Load the data into the system used for statistical analysis.

Extract#

Extract data from different source formats#

On UNIX systems (Linux and macOS), the command line program head can be used to show the first few lines of any file. The head command is very useful for exploring files: by printing the first few lines, you can get an idea of the format a file is in.

Unfortunately, the head command is not available on Windows, so instead of relying on command line tools, we'll write a simple Python function called head that does the same thing as the command line tool: it prints the first few lines of a file. By default this function prints the first seven lines of the file, but users can override the count argument to request a different number of lines.

import os

def head(path, count=7):
    """
    Print the first `count` lines of the file at `path`.
    """
    # auto-correct UNIX path separators for Windows users
    path = path.replace("/", os.sep)
    with open(path, "r") as datafile:
        lines = datafile.readlines()
    for line in lines[0:count]:
        print(line, end="")

The function head contains some special handling for Windows users: if the path is specified using the UNIX path separator /, it is auto-corrected to use the Windows path separator \.
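For example, calling the function on the tutorial's minimal.csv (assuming the same relative path as in the earlier sections) prints its first few lines:

head("../datasets/minimal.csv", count=3)
x,y,team,level
1.0,2.0,a,3
1.5,1.0,a,2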

Load#

Let’s save the cleaned data to the file mydata.csv in a directory mydataset.

cleandf.to_csv("mydataset/mydata.csv", index=False)
head("mydataset/mydata.csv")
x,y,team,level
1.0,2.0,a,3
1.5,1.0,a,2
2.0,1.5,a,1
2.5,2.0,b,3
3.0,1.5,b,3

Information about the dataset is provided in a text file README.txt.

head("mydataset/README.txt", count=11)
Players dataset
===============
Description: Synthetic data used to show Pandas operations.
Filename: mydata.csv
Format: A CSV file with 5 rows and 4 columns.
- x (numeric): horizontal position of the player.
- y (numeric): vertical position of the player.
- team (categorical): which team the player is in (a or b).
- level (categorical): player strength (1, 2, or 3).
Source: Synthetic data created by the author.
License: CC0 (public domain)