Section 1.2 — Data in practice#

This notebook contains all the code from Section 1.2 Data in practice of the No Bullshit Guide to Statistics.

Test a simple Python command#

2 + 3
5

Getting started with JupyterLab#

Download and install JupyterLab Desktop#

Follow the instructions in the Python tutorial to install JupyterLab Desktop on your computer.

Download the noBSstats notebooks and datasets#

TODO: include image from attachments

Datasets for the book#

import os
os.listdir("../datasets")
['students.csv',
 'epriceswide.csv',
 'minimal.csv',
 'exercises',
 'formats',
 'README.md',
 'visitors.csv',
 'index.md',
 'eprices.csv',
 'cut_material',
 'apples.csv',
 'doctors.csv',
 'players_full.csv',
 'players.csv',
 'kombuchapop.csv',
 'kombucha.csv',
 'raw']

Interactive notebooks for each section#

sorted(os.listdir("../notebooks"))
['10_DATA.md',
 '11_intro_to_data.ipynb',
 '12_data_in_practice.ipynb',
 '13_descriptive_statistics.ipynb',
 '20_PROB.md',
 '21_discrete_random_vars.ipynb',
 '22_multiple_random_vars.ipynb',
 '23_inventory_discrete_dists.ipynb',
 '24_calculus_prerequisites.ipynb',
 '25_continuous_random_vars.ipynb',
 '26_inventory_continuous_dists.ipynb',
 '27_random_var_generation.ipynb',
 '28_random_samples.ipynb',
 '30_STATS.md',
 '31_estimators.ipynb',
 '32_confidence_intervals.ipynb',
 '33_intro_to_NHST.ipynb',
 '34_analytical_approx.ipynb',
 '35_two_sample_tests.ipynb',
 '36_design.ipynb',
 '37_inventory_stats_tests.ipynb',
 '40_LINEAR_MODELS.md',
 '41_introduction_to_LMs.ipynb',
 '50_BAYESIAN_STATS.md',
 '99_mean_estimation_details.ipynb',
 '99_proportions_estimators.ipynb',
 'OLD34_analytical_approximation.ipynb',
 'README.md',
 'attachments',
 'cut_material.ipynb',
 'drafts',
 'explorations',
 'index.md',
 'one_sample_known_mean_unknown_var.ipynb',
 'plot_helpers.py',
 'simdata',
 'stats_helpers.py',
 'test_helpers.py']

Exercises notebooks#

sorted(os.listdir("../exercises"))
['__pycache__',
 'datasets',
 'exercises_12_practical_data.ipynb',
 'exercises_13_descr_stats.ipynb',
 'exercises_21_discrete_RVs.ipynb',
 'exercises_31_estimtors.ipynb',
 'exercises_32_confidence_intervals.ipynb',
 'exercises_33_intro_to_NHST.ipynb',
 'exercises_35_two_sample_tests.ipynb',
 'plot_helpers.py',
 'problems_1_data.ipynb',
 'solutions',
 'stats_helpers.py']

Data management with Pandas#

The first step is to import the Pandas library. We’ll follow the standard convention of importing the pandas module under the alias pd.

import pandas as pd

Data frames#

Players dataset#

%pycat ../datasets/players.csv

We can create the data frame object players by calling the function pd.read_csv to load the players dataset located at ../datasets/players.csv.

players = pd.read_csv("../datasets/players.csv")
players
username country age ezlvl time points finished
0 mary us 38 0 124.94 418 0
1 jane ca 21 0 331.64 1149 1
2 emil fr 52 1 324.61 1321 1
3 ivan ca 50 1 39.51 226 0
4 hasan tr 26 1 253.19 815 0
5 jordan us 45 0 28.49 206 0
6 sanjay ca 27 1 350.00 1401 1
7 lena uk 23 0 408.76 1745 1
8 shuo cn 24 1 194.77 1043 0
9 r0byn us 59 0 255.55 1102 0
10 anna pl 18 0 303.66 1209 1
11 joro bg 22 1 381.97 1491 1
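If the datasets folder is not available, the same loading step can be tried on in-memory text. The following is a minimal sketch that assumes players.csv is a comma-separated file with a header row (the raw file contents are not shown in this notebook). The names csv_text and players_head are illustrative only, and the three rows are copied from the table above.

```python
import io
import pandas as pd

# First three rows of the players dataset, retyped from the table above.
# ASSUMPTION: the real file uses commas as separators and has a header row.
csv_text = """username,country,age,ezlvl,time,points,finished
mary,us,38,0,124.94,418,0
jane,ca,21,0,331.64,1149,1
emil,fr,52,1,324.61,1321,1
"""

# pd.read_csv accepts a file path or any file-like object,
# so we wrap the text in io.StringIO to imitate a file.
players_head = pd.read_csv(io.StringIO(csv_text))

print(players_head.shape)   # (3, 7)
print(players_head.dtypes)  # column types inferred from the text
```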

Data frame properties#

What type of object is players?

type(players)
pandas.core.frame.DataFrame

The players data frame object has a bunch of useful properties (attributes) and functions (methods) “attached” to it, which we can access using the dot syntax.
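As a rough illustration of the difference, the sketch below uses a small stand-in data frame (the name df and its three rows are chosen here for illustration, with values taken from the first rows of players). Attributes are read without parentheses, while methods must be called.

```python
import pandas as pd

# Small stand-in data frame (illustrative; values from the first rows of players)
df = pd.DataFrame({"age": [38, 21, 52], "time": [124.94, 331.64, 324.61]})

# Attribute: a stored property, accessed without parentheses
print(df.shape)         # (3, 2)

# Method: a function attached to the object, called with parentheses
print(df["age"].max())  # 52

# Methods are callable objects, attributes like shape are plain values
print(callable(df.head), callable(df.shape))  # True False
```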

The shape of the players data frame#

players.shape
(12, 7)

The rows index#

len(players.index)
12
players.index
RangeIndex(start=0, stop=12, step=1)
list(players.index)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

The columns index#

len(players.columns)
7
players.columns
Index(['username', 'country', 'age', 'ezlvl', 'time', 'points', 'finished'], dtype='object')
list(players.columns)
['username', 'country', 'age', 'ezlvl', 'time', 'points', 'finished']

Exploring data frame objects#

players.head(3)
# players.tail(3)
# players.sample(3)
username country age ezlvl time points finished
0 mary us 38 0 124.94 418 0
1 jane ca 21 0 331.64 1149 1
2 emil fr 52 1 324.61 1321 1

Data types#

players.dtypes
username     object
country      object
age           int64
ezlvl         int64
time        float64
points        int64
finished      int64
dtype: object
players.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   username  12 non-null     object 
 1   country   12 non-null     object 
 2   age       12 non-null     int64  
 3   ezlvl     12 non-null     int64  
 4   time      12 non-null     float64
 5   points    12 non-null     int64  
 6   finished  12 non-null     int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 800.0+ bytes

Accessing values in a DataFrame#

Selecting individual values#

# Emil's points
players.loc[2,"points"]
1321

Selecting entire rows#

# Sanjay's data
row6 = players.loc[6,:]  # == players.loc[6]
row6
username    sanjay
country         ca
age             27
ezlvl            1
time         350.0
points        1401
finished         1
Name: 6, dtype: object
# Rows of the dataframe are Series objects
type(row6)
pandas.core.series.Series

The index of the series row6 is the same as the columns index of the data frame players.

row6.index
Index(['username', 'country', 'age', 'ezlvl', 'time', 'points', 'finished'], dtype='object')

To access individual values, use the square bracket notation.

row6["age"]
27

Selecting entire columns#

ages = players["age"]
ages
0     38
1     21
2     52
3     50
4     26
5     45
6     27
7     23
8     24
9     59
10    18
11    22
Name: age, dtype: int64
type(ages)
pandas.core.series.Series
ages.index
RangeIndex(start=0, stop=12, step=1)
ages.values
array([38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22])
# ALT1.
# players["age"].equals( players.loc[:,"age"] )
# ALT2.
# players["age"].equals( players.age )
ages[6]
27

Selecting multiple columns#

players[ ["username", "country"] ]
username country
0 mary us
1 jane ca
2 emil fr
3 ivan ca
4 hasan tr
5 jordan us
6 sanjay ca
7 lena uk
8 shuo cn
9 r0byn us
10 anna pl
11 joro bg

Statistical calculations using Pandas#

ages = players["age"]  # == players.loc[:,"age"]
ages
0     38
1     21
2     52
3     50
4     26
5     45
6     27
7     23
8     24
9     59
10    18
11    22
Name: age, dtype: int64
type(ages)
pandas.core.series.Series

Series attributes#

ages.index
RangeIndex(start=0, stop=12, step=1)
ages.values
array([38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22])
ages.name
'age'
players.loc[6]
username    sanjay
country         ca
age             27
ezlvl            1
time         350.0
points        1401
finished         1
Name: 6, dtype: object

Series methods#

ages.count()
12
# # ALT
# len(ages)
ages.sum()
405
ages.sum() / ages.count()
33.75
ages.mean()
33.75
ages.std()
14.28365244861157
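To see what these methods compute, here is a sketch that reproduces the mean and the standard deviation of the ages using only the Python standard library. By default the Pandas method std uses the sample formula with n-1 in the denominator (ddof=1), which is why the result differs from the population formula that divides by n. The variable names below are illustrative.

```python
import math
import statistics

# The twelve ages from the players dataset
ages_list = [38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22]

n = len(ages_list)
mean = sum(ages_list) / n

# Sum of squared deviations from the mean
ss = sum((a - mean) ** 2 for a in ages_list)

sample_std = math.sqrt(ss / (n - 1))   # same formula as ages.std()
population_std = math.sqrt(ss / n)     # same formula as ages.std(ddof=0)

print(mean)            # 33.75
print(sample_std)      # approximately 14.2837
print(population_std)  # smaller than sample_std

# The standard library agrees with the manual calculation
print(statistics.stdev(ages_list))
```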

Selecting only certain rows (filtering)#

To select only rows where ezlvl is 1, we first build the boolean selection mask…

mask = players["ezlvl"] == 1
mask
0     False
1     False
2      True
3      True
4      True
5     False
6      True
7     False
8      True
9     False
10    False
11     True
Name: ezlvl, dtype: bool

… then select the rows using the mask.

players[mask]
username country age ezlvl time points finished
2 emil fr 52 1 324.61 1321 1
3 ivan ca 50 1 39.51 226 0
4 hasan tr 26 1 253.19 815 0
6 sanjay ca 27 1 350.00 1401 1
8 shuo cn 24 1 194.77 1043 0
11 joro bg 22 1 381.97 1491 1

The above two-step process can be combined into a more compact expression:

players[players["ezlvl"]==1]
username country age ezlvl time points finished
2 emil fr 52 1 324.61 1321 1
3 ivan ca 50 1 39.51 226 0
4 hasan tr 26 1 253.19 815 0
6 sanjay ca 27 1 350.00 1401 1
8 shuo cn 24 1 194.77 1043 0
11 joro bg 22 1 381.97 1491 1

Bonus topic: multiple selection criteria#

# select players with ezlvl=1 and time of at least 200 mins
# players[(players["ezlvl"] == 1) & (players["time"] >= 200)]
# mask for selecting US and Canada players
# players["country"].isin(["us","ca"])
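The commented-out expressions above can be tried on a self-contained copy of the relevant columns. This is a sketch: the data frame is rebuilt by hand from the players table, and the names easy_and_long and us_or_ca are illustrative. Note that each condition needs its own parentheses, because the & operator binds more tightly than == and >=.

```python
import pandas as pd

# Columns retyped from the players table shown earlier
players = pd.DataFrame({
    "username": ["mary", "jane", "emil", "ivan", "hasan", "jordan",
                 "sanjay", "lena", "shuo", "r0byn", "anna", "joro"],
    "country": ["us", "ca", "fr", "ca", "tr", "us",
                "ca", "uk", "cn", "us", "pl", "bg"],
    "ezlvl": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "time": [124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
             350.00, 408.76, 194.77, 255.55, 303.66, 381.97],
})

# Combine two boolean masks with & (element-wise "and")
easy_and_long = players[(players["ezlvl"] == 1) & (players["time"] >= 200)]
print(list(easy_and_long["username"]))  # ['emil', 'hasan', 'sanjay', 'joro']

# Select rows whose country is one of several values
us_or_ca = players[players["country"].isin(["us", "ca"])]
print(list(us_or_ca["username"]))
# ['mary', 'jane', 'ivan', 'jordan', 'sanjay', 'r0byn']
```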

Sorting data frames and ranking#

players.sort_values("time", ascending=False)
username country age ezlvl time points finished
7 lena uk 23 0 408.76 1745 1
11 joro bg 22 1 381.97 1491 1
6 sanjay ca 27 1 350.00 1401 1
1 jane ca 21 0 331.64 1149 1
2 emil fr 52 1 324.61 1321 1
10 anna pl 18 0 303.66 1209 1
9 r0byn us 59 0 255.55 1102 0
4 hasan tr 26 1 253.19 815 0
8 shuo cn 24 1 194.77 1043 0
0 mary us 38 0 124.94 418 0
3 ivan ca 50 1 39.51 226 0
5 jordan us 45 0 28.49 206 0
players["time"].rank(ascending=False)
0     10.0
1      4.0
2      5.0
3     11.0
4      8.0
5     12.0
6      3.0
7      1.0
8      9.0
9      7.0
10     6.0
11     2.0
Name: time, dtype: float64
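Sorting and ranking describe the same ordering in two ways: sort_values rearranges the rows, while rank keeps the rows in place and reports each row's position. The sketch below checks this on the time values (retyped from the table above; the names ranks and order are illustrative). Ranks are floats because tied values receive the average of their positions by default, although this dataset has no ties.

```python
import pandas as pd

# The time column of the players dataset
times = pd.Series([124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
                   350.00, 408.76, 194.77, 255.55, 303.66, 381.97],
                  name="time")

# Rank 1 goes to the largest value when ascending=False
ranks = times.rank(ascending=False)

# Row labels in order of decreasing time
order = times.sort_values(ascending=False).index

# Reading the ranks in sorted order gives 1.0, 2.0, ..., 12.0
print(list(ranks.loc[order]))
```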

Grouping and aggregation#

players.groupby("ezlvl")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fac849dd610>
players.groupby("ezlvl")["time"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fac849dd250>
players.groupby("ezlvl")["time"].mean()
ezlvl
0    242.173333
1    257.341667
Name: time, dtype: float64
print(players.groupby("ezlvl")["time"].aggregate(["sum", "mean"]))
           sum        mean
ezlvl                     
0      1453.04  242.173333
1      1544.05  257.341667
# # ALT1. newline continuation character
# players.groupby("ezlvl")["time"] \
#   .agg(["sum", "mean"])
# # ALT2. expression inside parentheses
# (players
#   .groupby("ezlvl")["time"]
#   .agg(["sum", "mean", "max"])
# )
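One way to read the groupby results is as a shortcut for filtering each group separately and then aggregating. The following sketch compares the two approaches on the ezlvl and time columns (retyped from the players table; the variable names are illustrative).

```python
import pandas as pd

# Two columns of the players dataset
players = pd.DataFrame({
    "ezlvl": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "time": [124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
             350.00, 408.76, 194.77, 255.55, 303.66, 381.97],
})

# Split by ezlvl, select the time column, aggregate with the mean
by_group = players.groupby("ezlvl")["time"].mean()

# The same numbers obtained by filtering one group at a time
for level in [0, 1]:
    manual = players[players["ezlvl"] == level]["time"].mean()
    print(level, round(manual, 6), round(by_group.loc[level], 6))
```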

Data visualization with Seaborn#

The first step is to import the seaborn module under the alias sns.

import seaborn as sns

If you get an error when running this code cell, run %pip install seaborn to install the Seaborn library.

times = players["time"]
sns.stripplot(x=times)
<Axes: xlabel='time'>
../_images/42391dbb307fa5150866f9402977ae9c5c246d18ff2b132213fd5960b1c8a8ef.png
sns.stripplot(data=players, x="time")
<Axes: xlabel='time'>
../_images/8369bee985dcbc83c5c2d2714594e53e48b60c5e4b4f420349d1ff2417640daa.png
sns.stripplot(data=players, x="time", hue="ezlvl")
<Axes: xlabel='time'>
../_images/d26d8e3111b5106c3c5c37ada6a062f95e86e4140bada3ba3220f25d711d5f52.png

Studying the effect of ezlvl on time#

The players dataset was collected as part of an experiment designed to answer the question “Does the easy first level lead to an improvement in user retention?” We want to compare the time variable (total time players spent in the game) of players who were shown the “easy level” version of the game (ezlvl==1) to the control group of players who played the regular version of the game (ezlvl==0).

mean0 = players[players["ezlvl"]==0]["time"].mean()
mean0
242.17333333333332
mean1 = players[players["ezlvl"]==1]["time"].mean()
mean1
257.34166666666664
sns.stripplot(data=players, x="time", y="ezlvl",
              hue="ezlvl", orient="h", legend=None)
<Axes: xlabel='time', ylabel='ezlvl'>
../_images/b06b0b7ed11c879834f705bdd47ef15227b6b56c471b61d97342c6b4f4b8b06a.png
# ALT. same stripplot with markers for the group means 
# ax = sns.stripplot(data=players, x="time", y="ezlvl", hue="ezlvl", orient="h", legend=None)
# sns.stripplot(x=[mean0], y=[0], marker="D", orient="h", color="b", ax=ax)
# sns.stripplot(x=[mean1], y=[1], marker="D", orient="h", color="r", ax=ax)
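A simple numerical summary of the comparison is the difference between the two group means. The sketch below recomputes mean0 and mean1 from a hand-typed copy of the two columns and subtracts them. The name dmeans is an illustrative choice, and the result describes this sample of 12 players only.

```python
import pandas as pd

# Two columns of the players dataset
players = pd.DataFrame({
    "ezlvl": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    "time": [124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
             350.00, 408.76, 194.77, 255.55, 303.66, 381.97],
})

mean0 = players[players["ezlvl"] == 0]["time"].mean()  # control group
mean1 = players[players["ezlvl"] == 1]["time"].mean()  # easy-level group

# Observed difference between the group means (easy level minus control)
dmeans = mean1 - mean0
print(round(dmeans, 2))  # 15.17
```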

Studying the relationship between age and time#

The secondary research question is whether there is a correlation between the age variable and the time variable.

sns.scatterplot(data=players, x="age", y="time")
<Axes: xlabel='age', ylabel='time'>
../_images/567c6e190314c38d044d422e8a1c579cebe9a8fdd2e0e29847886b74b9320e71.png
sns.regplot(data=players, x="age", y="time", ci=None)
<Axes: xlabel='age', ylabel='time'>
../_images/eee882f79c4c26427d8dbd469fd02c93841c94427fe8fbd0ab0727e6253ce3f9.png
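The plots can be complemented with a number. One common choice, used here as an assumption since the text above does not name a specific measure, is the Pearson correlation coefficient, available through the Series method corr. The sketch below also recomputes it from the covariance and the standard deviations as a check. The data are retyped from the players table.

```python
import pandas as pd

# Two columns of the players dataset
players = pd.DataFrame({
    "age": [38, 21, 52, 50, 26, 45, 27, 23, 24, 59, 18, 22],
    "time": [124.94, 331.64, 324.61, 39.51, 253.19, 28.49,
             350.00, 408.76, 194.77, 255.55, 303.66, 381.97],
})

# Pearson correlation coefficient (the default method of Series.corr)
r = players["age"].corr(players["time"])

# Same quantity from its definition: cov(x, y) / (std(x) * std(y))
cov = players["age"].cov(players["time"])
r_manual = cov / (players["age"].std() * players["time"].std())

print(round(r, 3), round(r_manual, 3))
```

A value close to 0 indicates a weak linear relationship, and the sign gives the direction of the trend line drawn by regplot.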

Real-world datasets#

TODO Add table as .md

Apple weights#

apples = pd.read_csv("../datasets/apples.csv")
apples.shape
(30, 1)
apples.head(3)
weight
0 205.0
1 182.0
2 192.0
apples['weight'].mean()
202.6
sns.stripplot(data=apples, x="weight", jitter=0, alpha=0.5)
<Axes: xlabel='weight'>
../_images/3c7a05fe3279e9ebe5720f9064f64337c7c3fa8042767bd10cc35040a54e7b65.png

Electricity prices#

eprices = pd.read_csv("../datasets/eprices.csv")
eprices.shape
(18, 2)
eprices.head(3)
loc price
0 East 7.7
1 East 5.9
2 East 7.0
eprices[eprices["loc"]=="West"]["price"].mean()
9.155555555555557
eprices[eprices["loc"]=="East"]["price"].mean()
6.155555555555556
sns.stripplot(data=eprices, x="price", y="loc", hue="loc")
<Axes: xlabel='price', ylabel='loc'>
../_images/a8bf53fea725bda7e3ea1a5175dbcc6c83ef1c7535f37694bd71b1f4ee81db1d.png

Students’ scores#

students = pd.read_csv("../datasets/students.csv")
students.shape
(15, 5)
students.head()
student_ID background curriculum effort score
0 1 arts debate 10.96 75.0
1 2 science lecture 8.69 75.0
2 3 arts debate 8.60 67.0
3 4 arts lecture 7.92 70.3
4 5 science debate 9.90 76.1
sns.stripplot(data=students, x="score", y="curriculum", hue="curriculum")
<Axes: xlabel='score', ylabel='curriculum'>
../_images/0ac50062fd28fc52a94528181144d9f4327dbb8e73347084a69e35796b20f0c9.png
lscores = students[students["curriculum"]=="lecture"]
lscores["score"].mean()
68.14285714285714
dscores = students[students["curriculum"]=="debate"]
dscores["score"].mean()
76.4625

Kombucha volumes#

kombucha = pd.read_csv("../datasets/kombucha.csv")
kombucha.shape
(347, 2)
kombucha.columns
Index(['batch', 'volume'], dtype='object')
kombucha.head(3)
batch volume
0 1 1016.24
1 1 993.88
2 1 994.72
kombucha["batch"].unique()
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
sns.stripplot(data=kombucha, x="batch", y="volume", alpha=0.3)
<Axes: xlabel='batch', ylabel='volume'>
../_images/420cc8b7906f873be8040cabe4762a3be7163b3c6d8ba7ecfe0df61fb520a5f4.png

Average volume of the sample from Batch 01#

batch01 = kombucha[kombucha["batch"]==1]
ksample01 = batch01["volume"]
ksample01.mean()
999.10375

Doctors’ sleep study#

doctors = pd.read_csv("../datasets/doctors.csv")
doctors.shape
(224, 4)
doctors.head(3)
permit name location score
0 93636 Yesenia Smith urban 82.0
1 79288 Andrew Stanley rural 85.0
2 94980 Jessica Castro rural 97.0
sns.stripplot(data=doctors, x="score", y="location", hue="location")
<Axes: xlabel='score', ylabel='location'>
../_images/0d0f4adbce5afec6e8e0847f669ba9049ad19192db251a2a556fce37ae97679d.png

Average sleep scores for doctors in different locations#

udoctors = doctors[doctors["location"]=="urban"]
udoctors["score"].mean()
79.57051282051282
rdoctors = doctors[doctors["location"]=="rural"]
rdoctors["score"].mean()
81.79411764705883

Website visitors#

visitors = pd.read_csv("../datasets/visitors.csv")
visitors.shape
(2000, 3)
visitors.head(5)
IP address version bought
0 135.185.92.4 A 0
1 14.75.235.1 A 1
2 50.132.244.139 B 0
3 144.181.130.234 A 0
4 90.92.5.100 B 0
visitors[visitors["version"]=="A"]["bought"].mean()
0.06482465462274177
visitors[visitors["version"]=="B"]["bought"].mean()
0.03777148253068933
sns.barplot(data=visitors, x="bought", y="version")
<Axes: xlabel='bought', ylabel='version'>
../_images/a93188e83b3eb5bb20d3203eb2593ee22e2ade24d66c4f3ff1ef12b04cf425f9.png
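The mean of a column of 0s and 1s is the proportion of 1s, which is why taking the mean of bought gives the conversion rate for each version. The sketch below shows this on a tiny made-up data frame (the eight rows in toy are hypothetical and are not taken from visitors.csv).

```python
import pandas as pd

# HYPOTHETICAL toy data, for illustration only (not from visitors.csv)
toy = pd.DataFrame({
    "version": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "bought":  [0, 1, 0, 0, 1, 1, 0, 0],
})

# Proportion of purchases per version = mean of the 0/1 column
rates = toy.groupby("version")["bought"].mean()
print(rates.loc["A"], rates.loc["B"])  # 0.25 0.5

# The same proportions as (number of purchases) / (number of visitors)
counts = toy.groupby("version")["bought"].agg(["sum", "count"])
print(counts["sum"] / counts["count"])
```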

Discussion#

Data extraction#

Data transformations#

Tidy data#

epriceswide = pd.read_csv("../datasets/epriceswide.csv")
epriceswide.shape
(9, 2)
epriceswide
East West
0 7.7 11.8
1 5.9 10.0
2 7.0 11.0
3 4.8 8.6
4 6.3 8.3
5 6.3 9.4
6 5.5 8.0
7 5.4 6.8
8 6.5 8.5
epriceswide.melt(var_name="loc", value_name="price")
loc price
0 East 7.7
1 East 5.9
2 East 7.0
3 East 4.8
4 East 6.3
5 East 6.3
6 East 5.5
7 East 5.4
8 East 6.5
9 West 11.8
10 West 10.0
11 West 11.0
12 West 8.6
13 West 8.3
14 West 9.4
15 West 8.0
16 West 6.8
17 West 8.5

The melted-epriceswide is the same as eprices#

epriceslong = epriceswide.melt(var_name="loc", value_name="price")
eprices.equals(epriceslong)
True
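One practical reason to prefer the long (tidy) format is that the same groupby and Seaborn commands used earlier in this notebook apply to it directly. As a sketch, the code below rebuilds the wide table by hand from the values shown above and checks that the column means of the wide format match the group means of the long format. The names wide_means and long_means are illustrative.

```python
import pandas as pd

# Wide format: one column per location (values from the table above)
epriceswide = pd.DataFrame({
    "East": [7.7, 5.9, 7.0, 4.8, 6.3, 6.3, 5.5, 5.4, 6.5],
    "West": [11.8, 10.0, 11.0, 8.6, 8.3, 9.4, 8.0, 6.8, 8.5],
})

# Long format: one row per observation, with a loc column as the label
epriceslong = epriceswide.melt(var_name="loc", value_name="price")
print(epriceslong.shape)  # (18, 2)

# Column means in the wide format ...
wide_means = epriceswide.mean()

# ... equal the group means in the long format
long_means = epriceslong.groupby("loc")["price"].mean()

print(round(wide_means["East"], 4), round(long_means["East"], 4))
print(round(wide_means["West"], 4), round(long_means["West"], 4))
```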

Data cleaning#