Section 1.3 — Descriptive statistics#
This notebook contains all the code from Section 1.3 Descriptive Statistics of the No Bullshit Guide to Statistics.
All the data manipulations are done using the `pandas` library, and the data visualizations are based on the `seaborn` library.
Code cells that contain an `ALT` comment and are commented out show alternative ways of computing the same quantities, or additional details that were not included in the book. It's up to you if you want to learn about these alternative options: just uncomment the code and execute it.
Notebook setup#
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Pandas setup
pd.set_option("display.precision", 2)
# Figures setup
sns.set_theme(
context="paper",
style="whitegrid",
palette="colorblind",
rc={'figure.figsize': (7,4)},
)
%config InlineBackend.figure_format = 'retina'
Numerical variables#
Definitions#
Descriptive statistics for the students dataset#
Load the `students` dataset from CSV#
import pandas as pd
students = pd.read_csv("../datasets/students.csv")
students
|    | student_ID | background | curriculum | effort | score |
|----|------------|------------|------------|--------|-------|
| 0  | 1          | arts       | debate     | 10.96  | 75.0  |
| 1  | 2          | science    | lecture    | 8.69   | 75.0  |
| 2  | 3          | arts       | debate     | 8.60   | 67.0  |
| 3  | 4          | arts       | lecture    | 7.92   | 70.3  |
| 4  | 5          | science    | debate     | 9.90   | 76.1  |
| 5  | 6          | business   | debate     | 10.80  | 79.8  |
| 6  | 7          | science    | lecture    | 7.81   | 72.7  |
| 7  | 8          | business   | lecture    | 9.13   | 75.4  |
| 8  | 9          | business   | lecture    | 5.21   | 57.0  |
| 9  | 10         | science    | lecture    | 7.71   | 69.0  |
| 10 | 11         | business   | debate     | 9.82   | 70.4  |
| 11 | 12         | arts       | debate     | 11.53  | 96.2  |
| 12 | 13         | science    | debate     | 7.10   | 62.9  |
| 13 | 14         | science    | lecture    | 6.39   | 57.6  |
| 14 | 15         | arts       | debate     | 12.00  | 84.3  |
# what type of object is `students`?
type(students)
pandas.core.frame.DataFrame
# rows
students.index
RangeIndex(start=0, stop=15, step=1)
# columns
students.columns
Index(['student_ID', 'background', 'curriculum', 'effort', 'score'], dtype='object')
# info about memory and data types of each column
students.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   student_ID  15 non-null     int64
 1   background  15 non-null     object
 2   curriculum  15 non-null     object
 3   effort      15 non-null     float64
 4   score       15 non-null     float64
dtypes: float64(2), int64(1), object(2)
memory usage: 728.0+ bytes
Focus on the `score` variable#
Let’s look at the `score` variable.
scores = students["score"]
scores.count()
15
scores.sort_values()
8 57.0
13 57.6
12 62.9
2 67.0
9 69.0
3 70.3
10 70.4
6 72.7
0 75.0
1 75.0
7 75.4
4 76.1
5 79.8
14 84.3
11 96.2
Name: score, dtype: float64
# # ALT
# sorted(scores)
Min, max, and median#
scores.min()
57.0
scores.max()
96.2
scores.max() - scores.min() # Range
39.2
scores.median()
72.7
Strip plot#
We used the option `jitter=0` to disable the Seaborn default behaviour of adding a small amount of random vertical displacement to each data point, which we don’t need for this plot.
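A minimal sketch of such a strip plot (the exact styling of the figure in the book may differ):
with plt.rc_context({"figure.figsize": (7, 2)}):
    sns.stripplot(data=students, x="score", jitter=0, alpha=0.6)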
# # ALT also show mean via https://stackoverflow.com/a/67579487/127114
# with plt.rc_context({"figure.figsize":(7,2)}):
# sns.stripplot(x="score", data=students, jitter=0.03, alpha=0.6)
# color0 = sns.color_palette()[0]
# sns.boxplot(data=students,
# showmeans=True,
# meanline=True,
# meanprops={'color': color0, 'ls': '-', 'lw': 2},
# medianprops={'visible': False},
# whiskerprops={'visible': False},
# zorder=10,
# x="score",
# showfliers=False,
# showbox=False,
# showcaps=False)
Mean, variance, and standard deviation#
scores.mean()
72.58
scores.var() # sample variance, 1/(n-1) * Σ(x-mean)^2
99.58600000000001
scores.std()
9.979278531036199
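As a quick sanity check (not in the book), we can recompute the sample variance directly from its definition:
# # ALT recompute the sample variance from the definition
# n = scores.count()
# sum((scores - scores.mean())**2) / (n - 1)  # same as scores.var()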
Histograms#
bins = [50, 60, 70, 80, 90, 100]
scores.value_counts(bins=bins, sort=False)
(49.999, 60.0] 2
(60.0, 70.0] 3
(70.0, 80.0] 8
(80.0, 90.0] 1
(90.0, 100.0] 1
Name: count, dtype: int64
# note: the mode is the bin 70--80, which contains 8 values
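A histogram of the scores with the same bins can be drawn using Seaborn's `histplot` function. This is a minimal sketch; the figure in the book may use extra styling.
sns.histplot(data=students, x="score", bins=bins)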
Quartiles#
Q1 = scores.quantile(q=0.25)
Q1
68.0
Q2 = scores.quantile(q=0.5)
Q2
72.7
Q3 = scores.quantile(q=0.75)
Q3
75.75
IQR = Q3 - Q1
IQR
7.75
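All three quartiles can also be computed in a single call by passing a list of quantile levels (an alternative not shown in the book):
# # ALT compute all three quartiles at once
# scores.quantile(q=[0.25, 0.5, 0.75])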
Box plots#
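A minimal sketch of a box plot of the scores (the book's figure may differ in styling):
with plt.rc_context({"figure.figsize": (7, 2)}):
    sns.boxplot(data=students, x="score")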
All summary statistics#
scores.describe()
count 15.00
mean 72.58
std 9.98
min 57.00
25% 68.00
50% 72.70
75% 75.75
max 96.20
Name: score, dtype: float64
students[["score"]].describe().T
|       | count | mean  | std  | min  | 25%  | 50%  | 75%   | max  |
|-------|-------|-------|------|------|------|------|-------|------|
| score | 15.0  | 72.58 | 9.98 | 57.0 | 68.0 | 72.7 | 75.75 | 96.2 |
Comparing two numeric variables#
students[ ["effort","score"] ].describe()
|       | effort | score |
|-------|--------|-------|
| count | 15.00  | 15.00 |
| mean  | 8.90   | 72.58 |
| std   | 1.95   | 9.98  |
| min   | 5.21   | 57.00 |
| 25%   | 7.76   | 68.00 |
| 50%   | 8.69   | 72.70 |
| 75%   | 10.35  | 75.75 |
| max   | 12.00  | 96.20 |
students.describe()
|       | student_ID | effort | score |
|-------|------------|--------|-------|
| count | 15.00      | 15.00  | 15.00 |
| mean  | 8.00       | 8.90   | 72.58 |
| std   | 4.47       | 1.95   | 9.98  |
| min   | 1.00       | 5.21   | 57.00 |
| 25%   | 4.50       | 7.76   | 68.00 |
| 50%   | 8.00       | 8.69   | 72.70 |
| 75%   | 11.50      | 10.35  | 75.75 |
| max   | 15.00      | 12.00  | 96.20 |
Covariance and correlation#
print(students[["effort", "score"]].cov())
        effort  score
effort     3.8  17.10
score     17.1  99.59
students[["effort", "score"]].corr()
|        | effort | score |
|--------|--------|-------|
| effort | 1.00   | 0.88  |
| score  | 0.88   | 1.00  |
students[["effort","score"]].corr().loc["score","effort"]
0.8794375135614694
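As a sanity check (not in the book), the covariance can be recomputed from its definition, \(\tfrac{1}{n-1}\sum (x-\overline{x})(y-\overline{y})\):
# # ALT recompute the covariance from its definition
# efforts = students["effort"]
# devs = (efforts - efforts.mean()) * (scores - scores.mean())
# devs.sum() / (len(students) - 1)  # should match the 17.10 printed above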
Multiple numerical variables#
Extract data for the two groups:
dstudents = students[students["curriculum"]=="debate"]
dstudents
|    | student_ID | background | curriculum | effort | score |
|----|------------|------------|------------|--------|-------|
| 0  | 1          | arts       | debate     | 10.96  | 75.0  |
| 2  | 3          | arts       | debate     | 8.60   | 67.0  |
| 4  | 5          | science    | debate     | 9.90   | 76.1  |
| 5  | 6          | business   | debate     | 10.80  | 79.8  |
| 10 | 11         | business   | debate     | 9.82   | 70.4  |
| 11 | 12         | arts       | debate     | 11.53  | 96.2  |
| 12 | 13         | science    | debate     | 7.10   | 62.9  |
| 14 | 15         | arts       | debate     | 12.00  | 84.3  |
dstudents["score"].describe()
count 8.00
mean 76.46
std 10.52
min 62.90
25% 69.55
50% 75.55
75% 80.92
max 96.20
Name: score, dtype: float64
lstudents = students[students["curriculum"]=="lecture"]
lstudents["score"].describe()
count 7.00
mean 68.14
std 7.76
min 57.00
25% 63.30
50% 70.30
75% 73.85
max 75.40
Name: score, dtype: float64
Strip plots#
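A minimal sketch of strip plots that compare the two curriculum groups (the book's figure may use extra styling):
sns.stripplot(data=students, x="score", y="curriculum")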
Box plots#
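And a minimal sketch of the corresponding box plots, one per curriculum group:
sns.boxplot(data=students, x="score", y="curriculum")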
Histograms#
# histograms
fig, axs = plt.subplots(2, 1, sharex=True)
# prepare data
dstudents = students[students["curriculum"]=="debate"]
lstudents = students[students["curriculum"]=="lecture"]
# select colors to use for the two groups
blue, yellow = sns.color_palette()[0], sns.color_palette()[1]
# plot histograms
bins = [50, 60, 70, 80, 90, 100]
sns.histplot(data=dstudents, x="score", color=blue, ax=axs[0], bins=bins)
sns.histplot(data=lstudents, x="score", color=yellow, ax=axs[1], bins=bins)
# add labels
axs[0].legend(labels=["debate"])
axs[1].legend(labels=["lecture"])
# # ALT grouped bar charts
# sns.histplot(data=students, x="score", hue="curriculum",
# bins=bins, multiple="dodge")
# # ALT2 use `displot` with curriculum as row-variable
# bins = [50, 60, 70, 80, 90, 100]
# sns.displot(data=students, x='score', row='curriculum',
# hue='curriculum', alpha=0.8, bins=bins, aspect=3.3, height=2)
Categorical data#
Analysis of the `background` variable#
backgrounds = students["background"]
The Pandas method `.value_counts()` can be used to compute the frequencies of the values in any series or data frame.
backgrounds.value_counts()
background
science 6
arts 5
business 4
Name: count, dtype: int64
We often chain `.sort_index()` after `.value_counts()` to sort the results in alphabetical order.
backgrounds.value_counts().sort_index()
background
arts 5
business 4
science 6
Name: count, dtype: int64
Use the option `normalize=True` of the method `.value_counts()` to compute relative frequencies.
# relative frequencies
backgrounds.value_counts(normalize=True)
background
science 0.40
arts 0.33
business 0.27
Name: proportion, dtype: float64
The method `.describe()` shows some useful facts about the variable.
backgrounds.describe()
count 15
unique 3
top science
freq 6
Name: background, dtype: object
# # ALT
# backgrounds.value_counts() / backgrounds.count()
# # ALT combined table with both frequency and relative frequency
# df2 = pd.DataFrame({
# "frequency": backgrounds.value_counts(),
# "relative frequency": backgrounds.value_counts(normalize=True)
# })
# df2
# bar chart of counts
sns.countplot(data=students, x="background")
# # ALT bar chart with relative frequencies
# df3 = (backgrounds
# .value_counts(normalize=True)
# .sort_index()
# .rename("relative frequency")
# .rename_axis("background")
# .reset_index()
# )
# sns.barplot(data=df3, x="background", y="relative frequency")
Comparing two categorical variables#
students[ ["background","curriculum"] ]
|    | background | curriculum |
|----|------------|------------|
| 0  | arts       | debate     |
| 1  | science    | lecture    |
| 2  | arts       | debate     |
| 3  | arts       | lecture    |
| 4  | science    | debate     |
| 5  | business   | debate     |
| 6  | science    | lecture    |
| 7  | business   | lecture    |
| 8  | business   | lecture    |
| 9  | science    | lecture    |
| 10 | business   | debate     |
| 11 | arts       | debate     |
| 12 | science    | debate     |
| 13 | science    | lecture    |
| 14 | arts       | debate     |
# joint frequencies
pd.crosstab(index=students["curriculum"],
columns=students["background"],
margins=True, margins_name="TOTAL")
| curriculum \ background | arts | business | science | TOTAL |
|-------------------------|------|----------|---------|-------|
| debate                  | 4    | 2        | 2       | 8     |
| lecture                 | 1    | 2        | 4       | 7     |
| TOTAL                   | 5    | 4        | 6       | 15    |
# joint relative frequencies
pd.crosstab(index=students["curriculum"],
columns=students["background"],
margins=True, margins_name="TOTAL",
normalize=True)
| curriculum \ background | arts | business | science | TOTAL |
|-------------------------|------|----------|---------|-------|
| debate                  | 0.27 | 0.13     | 0.13    | 0.53  |
| lecture                 | 0.07 | 0.13     | 0.27    | 0.47  |
| TOTAL                   | 0.33 | 0.27     | 0.40    | 1.00  |
sns.countplot(data=students, x="background",
hue="curriculum", alpha=0.8)
sns.histplot(data=students, x="background",
hue="curriculum", multiple="stack", shrink=.7)
# # ALT using displot
# sns.displot(data=students, x="background", hue="curriculum", multiple="stack", shrink=0.8)
# # ALT2 using Pandas plot function for crosstab
# pd.crosstab(
# index=students["curriculum"],
# columns=students["background"],
# ).T.plot(kind="bar", stacked=True, rot=0)
# # stacked joint relative frequencies
# sns.histplot(data=students, x="background",
# hue="curriculum", shrink=.7,
# multiple="stack", stat="proportion")
# curriculum relative frequencies conditional on background
pd.crosstab(index=students["curriculum"],
columns=students["background"],
margins=True, margins_name="TOTAL",
normalize="columns")
| curriculum \ background | arts | business | science | TOTAL |
|-------------------------|------|----------|---------|-------|
| debate                  | 0.8  | 0.5      | 0.33    | 0.53  |
| lecture                 | 0.2  | 0.5      | 0.67    | 0.47  |
ct1 = pd.crosstab(
index=students["curriculum"],
columns=students["background"],
normalize="columns",
)
axct1 = ct1.T.plot(kind="bar", stacked=True, rot=0, alpha=0.8)
plt.legend(loc = "upper left", bbox_to_anchor=(1.02,1))
plt.ylabel("conditional relative frequency")
# background relative frequencies conditional on curriculum
pd.crosstab(index=students["curriculum"],
columns=students["background"],
margins=True, margins_name="TOTAL",
normalize="index")
| curriculum \ background | arts | business | science |
|-------------------------|------|----------|---------|
| debate                  | 0.50 | 0.25     | 0.25    |
| lecture                 | 0.14 | 0.29     | 0.57    |
| TOTAL                   | 0.33 | 0.27     | 0.40    |
ct2 = pd.crosstab(
index=students["curriculum"],
columns=students["background"],
normalize="index",
)
axct2 = ct2.plot(kind="barh", stacked=True, rot=0, alpha=0.8, )
plt.legend(loc = "upper left", bbox_to_anchor=(1.02,1))
plt.xlabel("conditional relative frequency")
Explanations#
Kernel density plots#
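A minimal sketch of kernel density plots of the scores in the two groups (assuming the book's figure compares the curricula; the styling may differ):
sns.kdeplot(data=students, x="score", hue="curriculum")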
Discussion#
Code implementations of the `median` and `quantile` functions (bonus material)#
Linear interpolation for the median#
When computing the median of a list of values, it’s not always possible to find a value from the list that satisfies the definition “splits the list into equal parts.” For example, if the list has an even number of elements, then there is no single “middle” number. The convention in this situation is to create a new number that consists of a 50-50 mix of the two middle numbers, a process known as linear interpolation.
The code below shows the procedure for calculating the median:
def median(values):
n = len(values)
svalues = sorted(values)
if n % 2 == 1: # Case A: n is odd
mid = n // 2
return svalues[mid]
else: # Case B: n is even
j = n // 2
return 0.5*svalues[j-1] + 0.5*svalues[j]
The logic in the above function splits into two cases, depending on \(n\), the length of the list of values. If \(n\) is odd, we return the middle value from the list of sorted values, `svalues[mid]`. If \(n\) is even, we return a mixture of the two values straddling the midpoint.
It’s not important that you understand the details of the above code snippet, since Pandas will perform median calculations for you. I just want you to be aware this interpolation is happening behind the scenes, so you won’t be surprised if the \(\mathbf{Med}\) value you obtain is not one of the observations in your dataset. For example, the median of a list of integers can be a decimal: `median([1,2]) = 1.5`.
import numpy as np
assert median([1,300]) == np.median([1,300])
Linear interpolation for quantiles#
Linear interpolation is also used when computing quartiles, percentiles, and quantiles. In this section, we’ll explain the calculations and code for computing the \(q\)th quantile for the dataset for anyone interested in learning the technical details. Feel free to skip if you’re not passionate about code.
Suppose we want to compute the \(q\)th quantile for a list of \(n\) values \(\mathbf{x}=[x_0, x_1, x_2, \ldots, x_{n-1}]\). We’ll assume the values are already in sorted order. The \(q\)th quantile corresponds to the position \(p = q(n-1)\) within the list, where \(p\) is a decimal number between \(0\) (index of the first element) and \(n-1\) (index of the last element). We can split the position into an integer part \(i\) and a fractional part \(g\), where \(i = \lfloor p \rfloor\) is the greatest integer that’s less than or equal to \(p\), and \(g = p - \lfloor p \rfloor\) is what remains of \(p\) once we remove the integer part.
The \(q\)th quantile lies somewhere between the values \(x_i\) and \(x_{i+1}\) in the list. Specifically, the quantile computed by linear interpolation is a mixture with proportion \(g\) of \(x_{i+1}\) and \((1-g)\) of \(x_i\):
\[
x_q = (1-g)\,x_i + g\,x_{i+1}.
\]
The code below shows the practical implementation of the math calculations described above.
def quantile(values, q):
svalues = sorted(values)
p = q * (len(values)-1)
i = int(p)
g = p - int(p)
return (1-g)*svalues[i] + g*svalues[i+1]
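To make the procedure concrete, here is a small worked example (not from the book) that traces the intermediate values \(p\), \(i\), and \(g\) when computing the first quartile:
# worked example: first quartile of six sorted values
values = [21, 22, 24, 24, 26, 97]
p = 0.25 * (len(values) - 1)       # p = 1.25
i, g = int(p), p - int(p)          # i = 1, g = 0.25
(1 - g)*values[i] + g*values[i+1]  # 0.75*22 + 0.25*24 = 22.5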
The function we defined above is equivalent to the Series method `.quantile(q)` in Pandas and to the function `np.quantile(values, q)` in NumPy, both of which use linear interpolation by default. Note, however, that there are other valid interpolation methods for computing quantiles that may be used in other statistics software, so don’t freak out if you get different values when computing quantiles.
Everything we discussed about the quantile function also applies to percentiles and quartiles. For example, the \(k\)th quartile is given by `quantile(q=k/4)`, and the \(k\)th percentile is `quantile(q=k/100)`.
arr = [21, 22, 24, 24, 26, 97]
assert quantile(arr, 0.33) == np.quantile(arr, 0.33)
assert quantile(arr, 0.25) == np.quantile(arr, 0.25)
assert quantile(arr, 0.75) == np.quantile(arr, 0.75)
The function `quantile` we defined above is:
- equivalent to `quantile(values, q, method="linear")` in `numpy`;
- equivalent to `quantile(values, q, type=7)` in R.