Chapter 1: DATA#
Concept map:
1.1 Introduction to data#
Data is the fuel for stats (see section outline for more details).
1.2 Data in practice#
Before it can be used in statistical analyses, data often requires pre-processing steps which are can like extract, transform, and load (ETL) as shows below:
We’ll illustrate these steps based on the data for Amy’s research:
Extract data from spreadsheet file using
read_excel
Transform to tidy format using
melt
, cleanup, then save data as a CSV fileLoad data again using
read_csv
to make sure all good
Notebook setup#
We begin by importing some Python modules for data analysis and plotting.
import pandas as pd # data manipulations
import seaborn as sns # plotting and visualizations
# notebook and figures setup
%matplotlib inline
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(8,5)})
# silence annoying warnings
import warnings; warnings.filterwarnings('ignore')
Extract#
We can use the function read_excel
in the pandas
module, which have imported as the shorter alias pd
to extract the data the sheet named “Data” in the spreadhseet file.
rawdf = pd.read_excel("data/employee_lifetime_values_records.ods", sheet_name="Data")
# if you get an error on the above line, run `!pip install odfpy` then this:
# rawdf = pd.read_excel("https://github.com/minireference/noBSstats/raw/main/stats_overview/data/employee_lifetime_values_records.ods", sheet_name="Data")
# the data extracted from the spreadsheet is now stored in rawdf, a pd.DataFrame object
rawdf
Group NS | Group S | |
---|---|---|
0 | 923.87 | 1297.44 |
1 | 751.38 | 776.41 |
2 | 432.83 | 1207.48 |
3 | 1417.36 | 1203.30 |
4 | 973.24 | 1127.58 |
5 | 1000.52 | 1464.32 |
6 | 1040.38 | 926.94 |
7 | 1324.41 | 1550.08 |
8 | 1171.52 | 1111.03 |
9 | 1195.03 | 1057.69 |
10 | 664.63 | 1361.05 |
11 | 769.31 | 1108.01 |
12 | 690.74 | 1054.53 |
13 | 1284.31 | 1232.11 |
14 | 958.96 | 623.06 |
15 | 856.65 | 862.36 |
16 | 1124.43 | 1011.34 |
17 | 1521.95 | 1088.06 |
18 | 987.67 | 1188.16 |
19 | 1070.90 | 1434.01 |
20 | 1110.94 | 843.81 |
21 | 1620.93 | 1716.61 |
22 | 1251.99 | 987.64 |
23 | 860.90 | 1094.02 |
24 | 963.01 | 1225.66 |
25 | 677.76 | 931.61 |
26 | 949.26 | 1329.68 |
27 | 990.13 | 1293.03 |
28 | 1255.27 | 1240.44 |
29 | 716.64 | 1105.59 |
30 | 1013.83 | NaN |
Note there is a missing value (denoted NaN
) in the last row of Group S. We’ll have to fix that in the data-cleaning step later on.
# the "shape" of the dataframe is
rawdf.shape
(31, 2)
# we can quickly see the statistics for the two groups
rawdf.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Group NS | 31.0 | 1018.41129 | 265.815869 | 432.83 | 858.7750 | 990.130 | 1183.2750 | 1620.93 |
Group S | 30.0 | 1148.43500 | 233.037704 | 623.06 | 1022.1375 | 1119.305 | 1279.8825 | 1716.61 |
The above table contains some useful summary for the two groups in the data. We can confirm we have 31 data points for Group NS, 30 data points for Group S and see all the descriptive statistics like mean, standard deviation, min/max, and the quartiles (more on that later).
Transform and clean the raw data#
Currently data is in a “wide” format (block where each row contains multiple observations)
Instead, we prefer all data to be in “tall” format where each row is a single observation
The operation to convert is called melt
and there is a Python method for the same name:
# before melt...
# rawdf
tmpdf1 = rawdf.melt(var_name="group", value_name="ELV")
# ...after melt
tmpdf1
group | ELV | |
---|---|---|
0 | Group NS | 923.87 |
1 | Group NS | 751.38 |
2 | Group NS | 432.83 |
3 | Group NS | 1417.36 |
4 | Group NS | 973.24 |
... | ... | ... |
57 | Group S | 1329.68 |
58 | Group S | 1293.03 |
59 | Group S | 1240.44 |
60 | Group S | 1105.59 |
61 | Group S | NaN |
62 rows × 2 columns
# lets use short labels...
tmpdf2 = tmpdf1.replace({"Group NS": "NS", "Group S": "S"})
tmpdf2
group | ELV | |
---|---|---|
0 | NS | 923.87 |
1 | NS | 751.38 |
2 | NS | 432.83 |
3 | NS | 1417.36 |
4 | NS | 973.24 |
... | ... | ... |
57 | S | 1329.68 |
58 | S | 1293.03 |
59 | S | 1240.44 |
60 | S | 1105.59 |
61 | S | NaN |
62 rows × 2 columns
# clean data (remove NaN values)
df = tmpdf2.dropna()
df
group | ELV | |
---|---|---|
0 | NS | 923.87 |
1 | NS | 751.38 |
2 | NS | 432.83 |
3 | NS | 1417.36 |
4 | NS | 973.24 |
... | ... | ... |
56 | S | 931.61 |
57 | S | 1329.68 |
58 | S | 1293.03 |
59 | S | 1240.44 |
60 | S | 1105.59 |
61 rows × 2 columns
Click here to see an online visualization of these three operations on a small sample from the data using pandastutor.com.
Save the data as a CSV file#
df.to_csv('data/employee_lifetime_values.csv', index=False)
# !du data/employee_lifetime_values.csv
# !head data/employee_lifetime_values.csv
# Try loading the data to make sure it is the same
df = pd.read_csv('data/employee_lifetime_values.csv')
df
group | ELV | |
---|---|---|
0 | NS | 923.87 |
1 | NS | 751.38 |
2 | NS | 432.83 |
3 | NS | 1417.36 |
4 | NS | 973.24 |
... | ... | ... |
56 | S | 931.61 |
57 | S | 1329.68 |
58 | S | 1293.03 |
59 | S | 1240.44 |
60 | S | 1105.59 |
61 rows × 2 columns
The dataframe df
is now in long format and this will make doing stats and plotting much easier from now on.
Data cleaning and data munging are important prerequisites for any statistical analysis. We’ve only touched the surface here, but it’s important for you to know these steps exist, and are often the most time consuming part of any “data job.”
pandas sidenote: selecting subsets#
# first create a boolean "mask" of rows that match
# df["group"] == "S"
# selects subset that corresponds to the mask
# df[df["group"]=="S"]
1.3 Descriptive statistics#
Use
.describe()
to get summary statisticsDraw a histogram
Draw a boxplot
df[df["group"]=="S"].describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
ELV | 30.0 | 1148.435 | 233.037704 | 623.06 | 1022.1375 | 1119.305 | 1279.8825 | 1716.61 |
df[df["group"]=="NS"].describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
ELV | 31.0 | 1018.41129 | 265.815869 | 432.83 | 858.775 | 990.13 | 1183.275 | 1620.93 |
Visualizations#
sns.boxplot(data=df, x="ELV", y="group")
<Axes: xlabel='ELV', ylabel='group'>
sns.stripplot(data=df, x="ELV", y="group")
<Axes: xlabel='ELV', ylabel='group'>
# sns.histplot(data=df, x="ELV", hue="group")
# better histplot
fig, axes = plt.subplots(2, 1, sharex=True)
blue, orange = sns.color_palette()[0], sns.color_palette()[1]
sns.histplot(data=df[df["group"]=="NS"], x="ELV", color=blue, ax=axes[0])
axes[0].axvline(df[df["group"]=="NS"]["ELV"].mean(), 0, 12)
sns.histplot(data=df[df["group"]=="S"], x="ELV", color=orange, ax=axes[1])
axes[1].axvline(df[df["group"]=="S"]["ELV"].mean(), 0, 12, color=orange)
<matplotlib.lines.Line2D at 0x7f694da1ec40>
Observed difference in means in the data sample#
meanS = df[df["group"]=="S"]['ELV'].mean()
meanNS = df[df["group"]=="NS"]['ELV'].mean()
# compute d, the difference in ELV means between two groups
d = meanS - meanNS
print("The difference between groups means is", round(d,2))
The difference between groups means is 130.02
Observed variability in the data#
stdS = df[df["group"]=="S"]['ELV'].std()
print("Standard deviation for Group S is", round(stdS,2))
stdNS = df[df["group"]=="NS"]['ELV'].std()
print("Standard deviation for Group NS is", round(stdNS,2))
Standard deviation for Group S is 233.04
Standard deviation for Group NS is 265.82
Summary#
Reminder: goals is to determine if stats training leads to meaningful ELV increase
Numerically and visually, we see that employees who received stats training (Group B) have higher ELV, but could the observed difference be just a fluke?
Facts: the observed difference in means is
d
= \(\overline{x}_S - \overline{x}_{NS} = 130.03\) and the standard deviation of the sample is 233 and 266 respectively. Is a difference of 130 (signal) significant for data with variability ~250 (noise)?
Discussion#
the data samples we have collected contain A LOT of variability
we need a language for describing variability (probability theory)
we can use probability theory for modelling data and considering hypothetical scenarios
Learning the language of probability theory will help us rephrase the research questions as precise mathematical problems:
Does statistical training make a difference in ELV?
We’ll define a probability model for a hypothetical situation where statistics training makes no difference and calculate how likely it is to observe the difference \(d= \overline{x}_S - \overline{x}_{NS} = 130\) between two groups due to chance.How big is the increase in ELV that occurs thanks to stats training?
We’ll produce an estimate of the difference in ELV between groups that takes into account the variability of the data.
Next step#
Ready for an intro to probability theory? If so, move on to the next notebook 02_PROB.ipynb.