Section 1.2 — Data in practice#
This notebook contains all the code from Section 1.2 Data in practice of the No Bullshit Guide to Statistics.
Test a simple Python command#
2 + 3
5
Getting started with JupyterLab#
Download and install JupyterLab Desktop#
Follow the instructions in the Python tutorial to install JupyterLab Desktop on your computer.
Download the noBSstats notebooks and datasets#
TODO: include image from attachments
Datasets for the book#
import os
os.listdir("../datasets")
['students.csv',
'epriceswide.csv',
'minimal.csv',
'exercises',
'formats',
'README.md',
'visitors.csv',
'index.md',
'eprices.csv',
'cut_material',
'apples.csv',
'doctors.csv',
'players_full.csv',
'players.csv',
'kombuchapop.csv',
'kombucha.csv',
'raw']
Interactive notebooks for each section#
sorted(os.listdir("../notebooks"))
['10_DATA.md',
'11_intro_to_data.ipynb',
'12_data_in_practice.ipynb',
'13_descriptive_statistics.ipynb',
'20_PROB.md',
'21_discrete_random_vars.ipynb',
'22_multiple_random_vars.ipynb',
'23_inventory_discrete_dists.ipynb',
'24_calculus_prerequisites.ipynb',
'25_continuous_random_vars.ipynb',
'26_inventory_continuous_dists.ipynb',
'27_random_var_generation.ipynb',
'28_random_samples.ipynb',
'30_STATS.md',
'31_estimators.ipynb',
'32_confidence_intervals.ipynb',
'33_intro_to_NHST.ipynb',
'34_analytical_approx.ipynb',
'35_two_sample_tests.ipynb',
'36_design.ipynb',
'37_inventory_stats_tests.ipynb',
'40_LINEAR_MODELS.md',
'41_introduction_to_LMs.ipynb',
'50_BAYESIAN_STATS.md',
'99_mean_estimation_details.ipynb',
'99_proportions_estimators.ipynb',
'OLD34_analytical_approximation.ipynb',
'README.md',
'attachments',
'cut_material.ipynb',
'drafts',
'explorations',
'index.md',
'one_sample_known_mean_unknown_var.ipynb',
'plot_helpers.py',
'simdata',
'stats_helpers.py',
'test_helpers.py']
Exercises notebooks#
sorted(os.listdir("../exercises"))
['__pycache__',
'datasets',
'exercises_12_practical_data.ipynb',
'exercises_13_descr_stats.ipynb',
'exercises_21_discrete_RVs.ipynb',
'exercises_31_estimtors.ipynb',
'exercises_32_confidence_intervals.ipynb',
'exercises_33_intro_to_NHST.ipynb',
'exercises_35_two_sample_tests.ipynb',
'plot_helpers.py',
'problems_1_data.ipynb',
'solutions',
'stats_helpers.py']
Data management with Pandas#
The first step is to import the Pandas library.
We’ll follow the standard convention of importing the pandas module under the alias pd.
import pandas as pd
Data frames#
Players dataset#
%pycat ../datasets/players.csv
We can create the data frame object players by calling the function pd.read_csv on the players dataset located at ../datasets/players.csv.
players = pd.read_csv("../datasets/players.csv")
players
| | username | country | age | ezlvl | time | points | finished |
---|---|---|---|---|---|---|---|
0 | mary | us | 38 | 0 | 124.94 | 418 | 0 |
1 | jane | ca | 21 | 0 | 331.64 | 1149 | 1 |
2 | emil | fr | 52 | 1 | 324.61 | 1321 | 1 |
3 | ivan | ca | 50 | 1 | 39.51 | 226 | 0 |
4 | hasan | tr | 26 | 1 | 253.19 | 815 | 0 |
5 | jordan | us | 45 | 0 | 28.49 | 206 | 0 |
6 | sanjay | ca | 27 | 1 | 350.00 | 1401 | 1 |
7 | lena | uk | 23 | 0 | 408.76 | 1745 | 1 |
8 | shuo | cn | 24 | 1 | 194.77 | 1043 | 0 |
9 | r0byn | us | 59 | 0 | 255.55 | 1102 | 0 |
10 | anna | pl | 18 | 0 | 303.66 | 1209 | 1 |
11 | joro | bg | 22 | 1 | 381.97 | 1491 | 1 |
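For readers who do not have the datasets folder at hand, here is a minimal sketch of the same loading step that reads from an in-memory string instead of a file. The CSV text is an assumption reconstructed from the first three rows of the table above; the real file may differ in details such as quoting. The name players_head is made up for this illustration.

```python
import io
import pandas as pd

# Assumed CSV text, reconstructed from the first three rows shown above.
csv_text = """username,country,age,ezlvl,time,points,finished
mary,us,38,0,124.94,418,0
jane,ca,21,0,331.64,1149,1
emil,fr,52,1,324.61,1321,1
"""

# pd.read_csv accepts file-like objects as well as file paths.
players_head = pd.read_csv(io.StringIO(csv_text))
print(players_head.shape)
print(players_head.dtypes)
```

The first line of the text is used as the column names, and Pandas infers a data type for each column from the values it finds.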
Data frame properties#
What type of object is players?
type(players)
pandas.core.frame.DataFrame
The players data frame object has a bunch of useful properties (attributes)
and functions (methods) “attached” to it,
which we can access using the dot syntax.
The shape of the players data frame#
players.shape
(12, 7)
The rows index#
len(players.index)
12
players.index
RangeIndex(start=0, stop=12, step=1)
list(players.index)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
The columns index#
len(players.columns)
7
players.columns
Index(['username', 'country', 'age', 'ezlvl', 'time', 'points', 'finished'], dtype='object')
list(players.columns)
['username', 'country', 'age', 'ezlvl', 'time', 'points', 'finished']
Exploring data frame objects#
players.head(3)
# players.tail(3)
# players.sample(3)
| | username | country | age | ezlvl | time | points | finished |
---|---|---|---|---|---|---|---|
0 | mary | us | 38 | 0 | 124.94 | 418 | 0 |
1 | jane | ca | 21 | 0 | 331.64 | 1149 | 1 |
2 | emil | fr | 52 | 1 | 324.61 | 1321 | 1 |
Data types#
players.dtypes
username object
country object
age int64
ezlvl int64
time float64
points int64
finished int64
dtype: object
players.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 username 12 non-null object
1 country 12 non-null object
2 age 12 non-null int64
3 ezlvl 12 non-null int64
4 time 12 non-null float64
5 points 12 non-null int64
6 finished 12 non-null int64
dtypes: float64(1), int64(4), object(2)
memory usage: 800.0+ bytes
Accessing values in a DataFrame#
Selecting individual values#
# Emil's points
players.loc[2,"points"]
1321
Selecting entire rows#
# Sanjay's data
row6 = players.loc[6,:] # == players.loc[6]
row6
username sanjay
country ca
age 27
ezlvl 1
time 350.0
points 1401
finished 1
Name: 6, dtype: object
# Rows of the dataframe are Series objects
type(row6)
pandas.core.series.Series
The index of the series row6 is the same as the columns index of the data frame players.
row6.index
Index(['username', 'country', 'age', 'ezlvl', 'time', 'points', 'finished'], dtype='object')
To access individual values, use the square bracket notation.
row6["age"]
27
Selecting entire columns#
ages = players["age"]
ages
0 38
1 21
2 52
3 50
4 26
5 45
6 27
7 23
8 24
9 59
10 18
11 22
Name: age, dtype: int64
type(ages)
pandas.core.series.Series
ages.index
RangeIndex(start=0, stop=12, step=1)
ages.values
array([38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22])
# ALT1.
# players["age"].equals( players.loc[:,"age"] )
# ALT2.
# players["age"].equals( players.age )
ages[6]
27
Selecting multiple columns#
players[ ["username", "country"] ]
| | username | country |
---|---|---|
0 | mary | us |
1 | jane | ca |
2 | emil | fr |
3 | ivan | ca |
4 | hasan | tr |
5 | jordan | us |
6 | sanjay | ca |
7 | lena | uk |
8 | shuo | cn |
9 | r0byn | us |
10 | anna | pl |
11 | joro | bg |
Statistical calculations using Pandas#
ages = players["age"] # == players.loc[:,"age"]
ages
0 38
1 21
2 52
3 50
4 26
5 45
6 27
7 23
8 24
9 59
10 18
11 22
Name: age, dtype: int64
type(ages)
pandas.core.series.Series
Series attributes#
ages.index
RangeIndex(start=0, stop=12, step=1)
ages.values
array([38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22])
ages.name
'age'
players.loc[6]
username sanjay
country ca
age 27
ezlvl 1
time 350.0
points 1401
finished 1
Name: 6, dtype: object
Series methods#
ages.count()
12
# # ALT
# len(ages)
ages.sum()
405
ages.sum() / ages.count()
33.75
ages.mean()
33.75
ages.std()
14.28365244861157
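As a cross-check on the last result, the sketch below recomputes the standard deviation by hand. It relies on the fact that the Pandas std method divides by n-1 by default (the sample standard deviation); passing ddof=0 divides by n instead. The ages are copied from the table above, and the variable names ss, std_sample, and std_pop are made up for this illustration.

```python
import math
import pandas as pd

# Ages copied from the players table above.
ages = pd.Series([38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22], name="age")

n = ages.count()
mean = ages.sum() / n

# Sum of squared deviations from the mean.
ss = ((ages - mean) ** 2).sum()

std_sample = math.sqrt(ss / (n - 1))  # divides by n-1, like ages.std()
std_pop = math.sqrt(ss / n)           # divides by n, like ages.std(ddof=0)

print(std_sample, ages.std())
print(std_pop, ages.std(ddof=0))
```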
Selecting only certain rows (filtering)#
To select only rows where ezlvl is 1, we first build the boolean selection mask…
mask = players["ezlvl"] == 1
mask
0 False
1 False
2 True
3 True
4 True
5 False
6 True
7 False
8 True
9 False
10 False
11 True
Name: ezlvl, dtype: bool
… then select the rows using the mask.
players[mask]
| | username | country | age | ezlvl | time | points | finished |
---|---|---|---|---|---|---|---|
2 | emil | fr | 52 | 1 | 324.61 | 1321 | 1 |
3 | ivan | ca | 50 | 1 | 39.51 | 226 | 0 |
4 | hasan | tr | 26 | 1 | 253.19 | 815 | 0 |
6 | sanjay | ca | 27 | 1 | 350.00 | 1401 | 1 |
8 | shuo | cn | 24 | 1 | 194.77 | 1043 | 0 |
11 | joro | bg | 22 | 1 | 381.97 | 1491 | 1 |
The two-step process above can be combined into a more compact expression:
players[players["ezlvl"]==1]
| | username | country | age | ezlvl | time | points | finished |
---|---|---|---|---|---|---|---|
2 | emil | fr | 52 | 1 | 324.61 | 1321 | 1 |
3 | ivan | ca | 50 | 1 | 39.51 | 226 | 0 |
4 | hasan | tr | 26 | 1 | 253.19 | 815 | 0 |
6 | sanjay | ca | 27 | 1 | 350.00 | 1401 | 1 |
8 | shuo | cn | 24 | 1 | 194.77 | 1043 | 0 |
11 | joro | bg | 22 | 1 | 381.97 | 1491 | 1 |
Bonus topic: multiple selection criteria#
# mask for selecting players with ezlvl=1 and time of at least 200 mins
# players[(players["ezlvl"] == 1) & (players["time"] >= 200)]
# mask for selecting US and Canada players
# players["country"].isin(["us","ca"])
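The commented-out expressions above can be tried on a self-contained copy of the data. The sketch below rebuilds four columns of the players table by hand (players_small is a made-up name for this illustration) and applies both selection criteria. The parentheses around each comparison are required, because the & operator binds more tightly than == and >=.

```python
import pandas as pd

# Four columns copied from the players table above.
players_small = pd.DataFrame({
    "username": ["mary", "jane", "emil", "ivan", "hasan", "jordan",
                 "sanjay", "lena", "shuo", "r0byn", "anna", "joro"],
    "country": ["us", "ca", "fr", "ca", "tr", "us",
                "ca", "uk", "cn", "us", "pl", "bg"],
    "ezlvl": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "time": [124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
             350.00, 408.76, 194.77, 255.55, 303.66, 381.97],
})

# Combine two boolean masks with & (element-wise "and").
both = (players_small["ezlvl"] == 1) & (players_small["time"] >= 200)
print(players_small[both])

# Select rows whose country is one of several values.
us_or_ca = players_small["country"].isin(["us", "ca"])
print(players_small[us_or_ca])
```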
Sorting data frames and ranking#
players.sort_values("time", ascending=False)
| | username | country | age | ezlvl | time | points | finished |
---|---|---|---|---|---|---|---|
7 | lena | uk | 23 | 0 | 408.76 | 1745 | 1 |
11 | joro | bg | 22 | 1 | 381.97 | 1491 | 1 |
6 | sanjay | ca | 27 | 1 | 350.00 | 1401 | 1 |
1 | jane | ca | 21 | 0 | 331.64 | 1149 | 1 |
2 | emil | fr | 52 | 1 | 324.61 | 1321 | 1 |
10 | anna | pl | 18 | 0 | 303.66 | 1209 | 1 |
9 | r0byn | us | 59 | 0 | 255.55 | 1102 | 0 |
4 | hasan | tr | 26 | 1 | 253.19 | 815 | 0 |
8 | shuo | cn | 24 | 1 | 194.77 | 1043 | 0 |
0 | mary | us | 38 | 0 | 124.94 | 418 | 0 |
3 | ivan | ca | 50 | 1 | 39.51 | 226 | 0 |
5 | jordan | us | 45 | 0 | 28.49 | 206 | 0 |
players["time"].rank(ascending=False)
0 10.0
1 4.0
2 5.0
3 11.0
4 8.0
5 12.0
6 3.0
7 1.0
8 9.0
9 7.0
10 6.0
11 2.0
Name: time, dtype: float64
Grouping and aggregation#
players.groupby("ezlvl")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fac849dd610>
players.groupby("ezlvl")["time"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fac849dd250>
players.groupby("ezlvl")["time"].mean()
ezlvl
0 242.173333
1 257.341667
Name: time, dtype: float64
print(players.groupby("ezlvl")["time"].aggregate(["sum", "mean"]))
sum mean
ezlvl
0 1453.04 242.173333
1 1544.05 257.341667
# # ALT1. newline continuation character
# players.groupby("ezlvl")["time"] \
# .agg(["sum", "mean"])
# # ALT2. expression inside parentheses
# (players
# .groupby("ezlvl")["time"]
# .agg(["sum", "mean", "max"])
# )
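To see what groupby is doing, it can help to compare it with the filtering approach from the previous section. The sketch below is only an illustration on a hand-built copy of the ezlvl and time columns (df, by_group, and manual are made-up names); it checks that the grouped mean equals the mean obtained by filtering each group separately.

```python
import pandas as pd

# The ezlvl and time columns copied from the players table above.
df = pd.DataFrame({
    "ezlvl": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "time": [124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
             350.00, 408.76, 194.77, 255.55, 303.66, 381.97],
})

# One call: split the rows by ezlvl, then average time within each group.
by_group = df.groupby("ezlvl")["time"].mean()

# Same result, one group at a time, using boolean masks.
manual = {level: df[df["ezlvl"] == level]["time"].mean() for level in [0, 1]}

print(by_group)
print(manual)
```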
Data visualization with Seaborn#
The first step is to import the seaborn module under the alias sns.
import seaborn as sns
If you get an error when running this code cell,
run %pip install seaborn to install the Seaborn library.
times = players["time"]
sns.stripplot(x=times)
<Axes: xlabel='time'>

sns.stripplot(data=players, x="time")
<Axes: xlabel='time'>

sns.stripplot(data=players, x="time", hue="ezlvl")
<Axes: xlabel='time'>

Studying the effect of ezlvl on time#
The players dataset was collected as part of an experiment
designed to answer the question “Does the easy first level lead to an improvement in user retention?”
We want to compare the time variable (total time players spent in the game)
of players who were shown the “easy level” version of the game (ezlvl==1)
to that of the control group of players who played the regular version of the game (ezlvl==0).
mean0 = players[players["ezlvl"]==0]["time"].mean()
mean0
242.17333333333332
mean1 = players[players["ezlvl"]==1]["time"].mean()
mean1
257.34166666666664
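One simple way to summarize the comparison is the difference between the two group means. The sketch below is an illustration on a hand-built copy of the data; the names times0, times1, and diff are made up here and do not appear in the original notebook.

```python
import pandas as pd

# Times copied from the players table above, split by ezlvl.
times0 = pd.Series([124.94, 331.64, 28.49, 408.76, 255.55, 303.66])  # ezlvl == 0
times1 = pd.Series([324.61, 39.51, 253.19, 350.00, 194.77, 381.97])  # ezlvl == 1

mean0 = times0.mean()
mean1 = times1.mean()

# Positive values mean the easy-level group spent more time on average.
diff = mean1 - mean0
print(diff)
```

The difference comes out to roughly 15.17 in the units of the time column. With only six players per group, this number alone does not tell us whether the easy level caused the difference.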
sns.stripplot(data=players, x="time", y="ezlvl",
hue="ezlvl", orient="h", legend=None)
<Axes: xlabel='time', ylabel='ezlvl'>

# ALT. same stripplot with markers for the group means
# ax = sns.stripplot(data=players, x="time", y="ezlvl", hue="ezlvl", orient="h", legend=None)
# sns.stripplot(x=[mean0], y=[0], marker="D", orient="h", color="b", ax=ax)
# sns.stripplot(x=[mean1], y=[1], marker="D", orient="h", color="r", ax=ax)
Studying the relationship between age and time#
The secondary research question is whether there is a correlation between the age variable and the time variable.
sns.scatterplot(data=players, x="age", y="time")
<Axes: xlabel='age', ylabel='time'>

sns.regplot(data=players, x="age", y="time", ci=None)
<Axes: xlabel='age', ylabel='time'>
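The plots show the relationship visually. As a possible numerical companion, the sketch below computes the Pearson correlation coefficient with the corr method of a Pandas series, and checks it against the definition (covariance divided by the product of the standard deviations). The data is copied from the players table; the names r and r_manual are made up for this illustration.

```python
import pandas as pd

# The age and time columns copied from the players table above.
ages = pd.Series([38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22], name="age")
times = pd.Series([124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
                   350.00, 408.76, 194.77, 255.55, 303.66, 381.97], name="time")

# Series.corr computes the Pearson correlation by default.
r = ages.corr(times)

# Same quantity from the definition: cov(x, y) / (std(x) * std(y)).
r_manual = ages.cov(times) / (ages.std() * times.std())

print(r, r_manual)
```

The coefficient always lies between -1 and 1; its sign matches the slope of the line drawn by sns.regplot.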

Real-world datasets#
TODO Add table as .md
Apple weights#
apples = pd.read_csv("../datasets/apples.csv")
apples.shape
(30, 1)
apples.head(3)
| | weight |
---|---|
0 | 205.0 |
1 | 182.0 |
2 | 192.0 |
apples['weight'].mean()
202.6
sns.stripplot(data=apples, x="weight", jitter=0, alpha=0.5)
<Axes: xlabel='weight'>

Electricity prices#
eprices = pd.read_csv("../datasets/eprices.csv")
eprices.shape
(18, 2)
eprices.head(3)
| | loc | price |
---|---|---|
0 | East | 7.7 |
1 | East | 5.9 |
2 | East | 7.0 |
eprices[eprices["loc"]=="West"]["price"].mean()
9.155555555555557
eprices[eprices["loc"]=="East"]["price"].mean()
6.155555555555556
sns.stripplot(data=eprices, x="price", y="loc", hue="loc")
<Axes: xlabel='price', ylabel='loc'>

Students’ scores#
students = pd.read_csv("../datasets/students.csv")
students.shape
(15, 5)
students.head()
| | student_ID | background | curriculum | effort | score |
---|---|---|---|---|---|
0 | 1 | arts | debate | 10.96 | 75.0 |
1 | 2 | science | lecture | 8.69 | 75.0 |
2 | 3 | arts | debate | 8.60 | 67.0 |
3 | 4 | arts | lecture | 7.92 | 70.3 |
4 | 5 | science | debate | 9.90 | 76.1 |
sns.stripplot(data=students, x="score", y="curriculum", hue="curriculum")
<Axes: xlabel='score', ylabel='curriculum'>

lscores = students[students["curriculum"]=="lecture"]
lscores["score"].mean()
68.14285714285714
dscores = students[students["curriculum"]=="debate"]
dscores["score"].mean()
76.4625
Kombucha volumes#
kombucha = pd.read_csv("../datasets/kombucha.csv")
kombucha.shape
(347, 2)
kombucha.columns
Index(['batch', 'volume'], dtype='object')
kombucha.head(3)
| | batch | volume |
---|---|---|
0 | 1 | 1016.24 |
1 | 1 | 993.88 |
2 | 1 | 994.72 |
kombucha["batch"].unique()
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
sns.stripplot(data=kombucha, x="batch", y="volume", alpha=0.3)
<Axes: xlabel='batch', ylabel='volume'>

Average volume of the sample from Batch 01#
batch01 = kombucha[kombucha["batch"]==1]
ksample01 = batch01["volume"]
ksample01.mean()
999.10375
Doctors’ sleep study#
doctors = pd.read_csv("../datasets/doctors.csv")
doctors.shape
(224, 4)
doctors.head(3)
| | permit | name | location | score |
---|---|---|---|---|
0 | 93636 | Yesenia Smith | urban | 82.0 |
1 | 79288 | Andrew Stanley | rural | 85.0 |
2 | 94980 | Jessica Castro | rural | 97.0 |
sns.stripplot(data=doctors, x="score", y="location", hue="location")
<Axes: xlabel='score', ylabel='location'>

Average sleep scores for doctors in different locations#
udoctors = doctors[doctors["location"]=="urban"]
udoctors["score"].mean()
79.57051282051282
rdoctors = doctors[doctors["location"]=="rural"]
rdoctors["score"].mean()
81.79411764705883
Website visitors#
visitors = pd.read_csv("../datasets/visitors.csv")
visitors.shape
(2000, 3)
visitors.head(5)
| | IP address | version | bought |
---|---|---|---|
0 | 135.185.92.4 | A | 0 |
1 | 14.75.235.1 | A | 1 |
2 | 50.132.244.139 | B | 0 |
3 | 144.181.130.234 | A | 0 |
4 | 90.92.5.100 | B | 0 |
visitors[visitors["version"]=="A"]["bought"].mean()
0.06482465462274177
visitors[visitors["version"]=="B"]["bought"].mean()
0.03777148253068933
sns.barplot(data=visitors, x="bought", y="version")
<Axes: xlabel='bought', ylabel='version'>
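Because bought takes only the values 0 and 1, its mean within each version is the proportion of visitors who bought. The sketch below illustrates this on a tiny data frame containing only the five rows shown by visitors.head(5) above, so the resulting proportions are not the ones for the full dataset. The names visitors_head and rates are made up for this illustration.

```python
import pandas as pd

# Only the first five rows of the visitors dataset, copied from the table above.
visitors_head = pd.DataFrame({
    "version": ["A", "A", "B", "A", "B"],
    "bought": [0, 1, 0, 0, 0],
})

# Mean of a 0/1 column = fraction of ones, computed separately per version.
rates = visitors_head.groupby("version")["bought"].mean()
print(rates)
```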

Discussion#
Data extraction#
Data transformations#
Tidy data#
epriceswide = pd.read_csv("../datasets/epriceswide.csv")
epriceswide.shape
(9, 2)
epriceswide
| | East | West |
---|---|---|
0 | 7.7 | 11.8 |
1 | 5.9 | 10.0 |
2 | 7.0 | 11.0 |
3 | 4.8 | 8.6 |
4 | 6.3 | 8.3 |
5 | 6.3 | 9.4 |
6 | 5.5 | 8.0 |
7 | 5.4 | 6.8 |
8 | 6.5 | 8.5 |
epriceswide.melt(var_name="loc", value_name="price")
| | loc | price |
---|---|---|
0 | East | 7.7 |
1 | East | 5.9 |
2 | East | 7.0 |
3 | East | 4.8 |
4 | East | 6.3 |
5 | East | 6.3 |
6 | East | 5.5 |
7 | East | 5.4 |
8 | East | 6.5 |
9 | West | 11.8 |
10 | West | 10.0 |
11 | West | 11.0 |
12 | West | 8.6 |
13 | West | 8.3 |
14 | West | 9.4 |
15 | West | 8.0 |
16 | West | 6.8 |
17 | West | 8.5 |
The melted-epriceswide is the same as eprices#
epriceslong = epriceswide.melt(var_name="loc", value_name="price")
eprices.equals(epriceslong)
True
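The wide-to-long conversion can also be tried without the CSV files. The sketch below rebuilds the wide table by hand from the values printed above (wide and long are made-up names) and melts it. Once the data is in long format, the per-location means can be computed with a single groupby, which is one practical reason to prefer the tidy layout.

```python
import pandas as pd

# Wide-format prices copied from the epriceswide table above.
wide = pd.DataFrame({
    "East": [7.7, 5.9, 7.0, 4.8, 6.3, 6.3, 5.5, 5.4, 6.5],
    "West": [11.8, 10.0, 11.0, 8.6, 8.3, 9.4, 8.0, 6.8, 8.5],
})

# Each column name becomes a value in "loc"; each cell becomes a row in "price".
long = wide.melt(var_name="loc", value_name="price")
print(long.shape)

# In long format, group means need only one expression.
print(long.groupby("loc")["price"].mean())
```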
Data cleaning#
Links#
TODO: import from Appendix D