Chapter 1: DATA#

Link to outline

Concept map: concepts_DATA.png

1.1 Introduction to data#

Data is the fuel for stats (see section outline for more details).

1.2 Data in practice#

Before it can be used in statistical analyses, data often requires pre-processing steps which are can like extract, transform, and load (ETL) as shows below:

ETL_block_diagram.png

We’ll illustrate these steps based on the data for Amy’s research:

  • Extract data from spreadsheet file using read_excel

  • Transform to tidy format using melt, cleanup, then save data as a CSV file

  • Load data again using read_csv to make sure all good

Notebook setup#

We begin by importing some Python modules for data analysis and plotting.

import pandas as pd    # data manipulations
import seaborn as sns  # plotting and visualizations
# notebook and figures setup
%matplotlib inline
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(8,5)})
# silence annoying warnings
import warnings; warnings.filterwarnings('ignore')

Extract#

We can use the function read_excel in the pandas module, which have imported as the shorter alias pd to extract the data the sheet named “Data” in the spreadhseet file.

rawdf = pd.read_excel("data/employee_lifetime_values_records.ods", sheet_name="Data")

# if you get an error on the above line, run `!pip install odfpy` then this:
# rawdf = pd.read_excel("https://github.com/minireference/noBSstats/raw/main/stats_overview/data/employee_lifetime_values_records.ods", sheet_name="Data")

# the data extracted from the spreadsheet is now stored in rawdf, a pd.DataFrame object
rawdf
Group NS Group S
0 923.87 1297.44
1 751.38 776.41
2 432.83 1207.48
3 1417.36 1203.30
4 973.24 1127.58
5 1000.52 1464.32
6 1040.38 926.94
7 1324.41 1550.08
8 1171.52 1111.03
9 1195.03 1057.69
10 664.63 1361.05
11 769.31 1108.01
12 690.74 1054.53
13 1284.31 1232.11
14 958.96 623.06
15 856.65 862.36
16 1124.43 1011.34
17 1521.95 1088.06
18 987.67 1188.16
19 1070.90 1434.01
20 1110.94 843.81
21 1620.93 1716.61
22 1251.99 987.64
23 860.90 1094.02
24 963.01 1225.66
25 677.76 931.61
26 949.26 1329.68
27 990.13 1293.03
28 1255.27 1240.44
29 716.64 1105.59
30 1013.83 NaN

Note there is a missing value (denoted NaN) in the last row of Group S. We’ll have to fix that in the data-cleaning step later on.

# the "shape" of the dataframe is
rawdf.shape
(31, 2)
# we can quickly see the statistics for the two groups
rawdf.describe().T
count mean std min 25% 50% 75% max
Group NS 31.0 1018.41129 265.815869 432.83 858.7750 990.130 1183.2750 1620.93
Group S 30.0 1148.43500 233.037704 623.06 1022.1375 1119.305 1279.8825 1716.61

The above table contains some useful summary for the two groups in the data. We can confirm we have 31 data points for Group NS, 30 data points for Group S and see all the descriptive statistics like mean, standard deviation, min/max, and the quartiles (more on that later).


Transform and clean the raw data#

  • Currently data is in a “wide” format (block where each row contains multiple observations)

  • Instead, we prefer all data to be in “tall” format where each row is a single observation

data-melt-operation.png

The operation to convert is called melt and there is a Python method for the same name:

# before melt...
# rawdf
tmpdf1 = rawdf.melt(var_name="group", value_name="ELV")

# ...after melt
tmpdf1
group ELV
0 Group NS 923.87
1 Group NS 751.38
2 Group NS 432.83
3 Group NS 1417.36
4 Group NS 973.24
... ... ...
57 Group S 1329.68
58 Group S 1293.03
59 Group S 1240.44
60 Group S 1105.59
61 Group S NaN

62 rows × 2 columns

# lets use short labels...
tmpdf2 = tmpdf1.replace({"Group NS": "NS", "Group S": "S"})
tmpdf2
group ELV
0 NS 923.87
1 NS 751.38
2 NS 432.83
3 NS 1417.36
4 NS 973.24
... ... ...
57 S 1329.68
58 S 1293.03
59 S 1240.44
60 S 1105.59
61 S NaN

62 rows × 2 columns

# clean data (remove NaN values)
df = tmpdf2.dropna()
df
group ELV
0 NS 923.87
1 NS 751.38
2 NS 432.83
3 NS 1417.36
4 NS 973.24
... ... ...
56 S 931.61
57 S 1329.68
58 S 1293.03
59 S 1240.44
60 S 1105.59

61 rows × 2 columns

Click here to see an online visualization of these three operations on a small sample from the data using pandastutor.com.

Save the data as a CSV file#

df.to_csv('data/employee_lifetime_values.csv', index=False)
# !du data/employee_lifetime_values.csv
# !head data/employee_lifetime_values.csv
# Try loading the data to make sure it is the same
df = pd.read_csv('data/employee_lifetime_values.csv')
df
group ELV
0 NS 923.87
1 NS 751.38
2 NS 432.83
3 NS 1417.36
4 NS 973.24
... ... ...
56 S 931.61
57 S 1329.68
58 S 1293.03
59 S 1240.44
60 S 1105.59

61 rows × 2 columns

The dataframe df is now in long format and this will make doing stats and plotting much easier from now on.

Data cleaning and data munging are important prerequisites for any statistical analysis. We’ve only touched the surface here, but it’s important for you to know these steps exist, and are often the most time consuming part of any “data job.”

pandas sidenote: selecting subsets#

# first create a boolean "mask" of rows that match
# df["group"] == "S"
# selects subset that corresponds to the mask
# df[df["group"]=="S"]

1.3 Descriptive statistics#

  • Use .describe() to get summary statistics

  • Draw a histogram

  • Draw a boxplot

df[df["group"]=="S"].describe().T
count mean std min 25% 50% 75% max
ELV 30.0 1148.435 233.037704 623.06 1022.1375 1119.305 1279.8825 1716.61
df[df["group"]=="NS"].describe().T
count mean std min 25% 50% 75% max
ELV 31.0 1018.41129 265.815869 432.83 858.775 990.13 1183.275 1620.93

Visualizations#

sns.boxplot(data=df, x="ELV", y="group")
<Axes: xlabel='ELV', ylabel='group'>
../_images/cb487fc36eaccf8bb54b8cbe6857ca3d6fd2de53a4cad35736e25dac49f5d4c1.png
sns.stripplot(data=df, x="ELV", y="group")
<Axes: xlabel='ELV', ylabel='group'>
../_images/54be7b59311d95169b3fd8c52aaed6b79830425f7cc300c6ab514435dffd43bf.png
# sns.histplot(data=df, x="ELV", hue="group")
# better histplot
fig, axes = plt.subplots(2, 1, sharex=True)
blue, orange  = sns.color_palette()[0], sns.color_palette()[1]
sns.histplot(data=df[df["group"]=="NS"], x="ELV", color=blue, ax=axes[0])
axes[0].axvline(df[df["group"]=="NS"]["ELV"].mean(), 0, 12)

sns.histplot(data=df[df["group"]=="S"], x="ELV", color=orange, ax=axes[1])
axes[1].axvline(df[df["group"]=="S"]["ELV"].mean(), 0, 12, color=orange)
<matplotlib.lines.Line2D at 0x7f7600141390>
../_images/d918c966ba464b78e0abb2b2a2424d34d90f3b3c14c6050181ebdc963839f84c.png

Observed difference in means in the data sample#

meanS = df[df["group"]=="S"]['ELV'].mean()
meanNS = df[df["group"]=="NS"]['ELV'].mean()

# compute d, the difference in ELV means between two groups
d = meanS - meanNS
print("The difference between groups means is", round(d,2))
The difference between groups means is 130.02

Observed variability in the data#

stdS = df[df["group"]=="S"]['ELV'].std()
print("Standard deviation for Group S is", round(stdS,2))

stdNS = df[df["group"]=="NS"]['ELV'].std()
print("Standard deviation for Group NS is", round(stdNS,2))
Standard deviation for Group S is 233.04
Standard deviation for Group NS is 265.82

Summary#

  • Reminder: goals is to determine if stats training leads to meaningful ELV increase

  • Numerically and visually, we see that employees who received stats training (Group B) have higher ELV, but could the observed difference be just a fluke?

  • Facts: the observed difference in means is d = \(\overline{x}_S - \overline{x}_{NS} = 130.03\) and the standard deviation of the sample is 233 and 266 respectively. Is a difference of 130 (signal) significant for data with variability ~250 (noise)?

Discussion#

  • the data samples we have collected contain A LOT of variability

  • we need a language for describing variability (probability theory)

  • we can use probability theory for modelling data and considering hypothetical scenarios

Learning the language of probability theory will help us rephrase the research questions as precise mathematical problems:

  1. Does statistical training make a difference in ELV?
    We’ll define a probability model for a hypothetical situation where statistics training makes no difference and calculate how likely it is to observe the difference \(d= \overline{x}_S - \overline{x}_{NS} = 130\) between two groups due to chance.

  2. How big is the increase in ELV that occurs thanks to stats training?
    We’ll produce an estimate of the difference in ELV between groups that takes into account the variability of the data.

Next step#

Ready for an intro to probability theory? If so, move on to the next notebook 02_PROB.ipynb.