{ "cells": [ { "cell_type": "markdown", "id": "c53eb8f5-80a7-4330-9f13-63220177fcc0", "metadata": { "tags": [] }, "source": [ "# Section 5.2 — Bayesian inference computations\n", "\n", "This notebook contains the code examples from [Section 5.2 Bayesian inference computations]() from the **No Bullshit Guide to Statistics**.\n", "\n", "See also:\n", "\n", "- [Chp_04.ipynb](./explorations/SR2024/week02/Chp_04.ipynb)\n", "- [Chp_05.ipynb](./explorations/SR2024/week02/Chp_05.ipynb)\n", "- [homework_week02_ivan_savov.ipynb](./explorations/SR2024/week02/homework_week02_ivan_savov.ipynb)" ] }, { "cell_type": "markdown", "id": "a2d8dda2-58a9-424e-9fb3-32ad6e8777d8", "metadata": { "tags": [] }, "source": [ "#### Notebook setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "aea0223e-aee9-4875-a714-897b6646baaa", "metadata": {}, "outputs": [], "source": [ "# load Python modules\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 2, "id": "efd86c5a-c9d2-4eab-b67d-a65e39b23ef2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Figures setup\n", "plt.clf() # needed otherwise `sns.set_theme` doesn\"t work\n", "from plot_helpers import RCPARAMS\n", "RCPARAMS.update({\"figure.figsize\": (5, 3)}) # good for screen\n", "# RCPARAMS.update({\"figure.figsize\": (5, 1.6)}) # good for print\n", "sns.set_theme(\n", " context=\"paper\",\n", " style=\"whitegrid\",\n", " palette=\"colorblind\",\n", " rc=RCPARAMS,\n", ")\n", "\n", "# High-resolution please\n", "%config InlineBackend.figure_format = \"retina\"\n", "\n", "# Where to store figures\n", "DESTDIR = \"figures/bayes/computations\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "df811a10-417d-4389-8bff-30e59b5f6aef", "metadata": {}, "outputs": [], "source": [ "# set random seed for repeatability\n", "np.random.seed(42)\n", "#######################################################" ] }, { "cell_type": "code", "execution_count": null, "id": "b473f755-7ed8-414e-819b-609cd9220b7d", "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "id": "aa35e2c1-e86a-4a41-b84c-35c5e76ad45e", "metadata": {}, "source": [ "## Example 1: estimating the probability of a biased coin" ] }, { "cell_type": "code", "execution_count": 4, "id": "0b336492-4ff2-4b54-8511-982a90105238", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(34, 50, 0.68)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sample of coin tosses observations (1=heads 0=tails)\n", "# We assume they come from a Bernoulli distribution\n", "ctosses = [1,1,0,0,1,0,1,1,1,1,1,0,1,1,0,0,0,1,1,1,\n", " 1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,0,0,1,0,\n", " 0,1,0,1,0,1,0,1,1,0]\n", "sum(ctosses), len(ctosses), sum(ctosses)/len(ctosses)" ] }, { "cell_type": "code", "execution_count": 5, "id": "e11905de-c3de-47dd-a6f9-5618a39a1ae3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
heads
01
11
20
30
41
\n", "
" ], "text/plain": [ " heads\n", "0 1\n", "1 1\n", "2 0\n", "3 0\n", "4 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\"heads\":ctosses})\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "a70db815-4aea-48d2-afc8-875503c68c8d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
heads
01
11
20
30
41
\n", "
" ], "text/plain": [ " heads\n", "0 1\n", "1 1\n", "2 0\n", "3 0\n", "4 1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ALT. load the `ctosses` data frame from the exercises/ folder\n", "df1 = pd.read_csv(\"../datasets/exercises/ctosses.csv\")\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "18811c8b-9a76-478b-9a4e-04e348011527", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.68" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"heads\"].mean()" ] }, { "cell_type": "markdown", "id": "5681d1be-88d5-42d2-b4f3-d8b98ecd0ad2", "metadata": {}, "source": [ "The model is\n", "\n", "$$\n", " C \\sim \\textrm{Bernoulli}(p)\n", " \\qquad\n", " p \\sim \\mathcal{U}(0,1).\n", "$$" ] }, { "cell_type": "code", "execution_count": 8, "id": "953b1bb0-ff29-49d0-a597-c3ce7dbffed0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,\n", " 1, 1, 1, 1, 1, 1]),\n", " 0.78)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# BONUS DEMO: simulate a random sample from the data model (for a fixed `true_p`)\n", "# np.random.seed(46)\n", "\n", "# params\n", "n = len(ctosses)\n", "true_p = 0.7\n", "\n", "# gen a random dataframe DF1 like df1\n", "from scipy.stats import bernoulli\n", "ctosses = bernoulli(p=true_p).rvs(n)\n", "DF1 = pd.DataFrame({\"heads\": ctosses})\n", "DF1[\"heads\"].values, DF1[\"heads\"].mean()" ] }, { "cell_type": "code", "execution_count": 9, "id": "f28e392d-c44a-4ad1-85f8-d2a0c3dc49da", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,\n", " 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,\n", " 1, 0, 0, 0, 0, 0]),\n", " 0.46)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# BONUS DEMO: simulate a random sample from the full Bayesian model\n", "# np.random.seed(47)\n", "n = len(ctosses)\n", "\n", "# gen parameters p independently for each observation\n", "from scipy.stats import uniform\n", "true_ps = uniform(0,1).rvs(n)\n", "\n", "# gen a random data sample of size n\n", "from scipy.stats import bernoulli\n", "ctosses = bernoulli(p=true_ps).rvs(n)\n", "DF1 = pd.DataFrame({\"ctoss\": ctosses})\n", "DF1[\"ctoss\"].values, DF1[\"ctoss\"].mean()" ] }, { "cell_type": "code", "execution_count": 10, "id": "624ddc91-1ae9-4096-8680-ae3c539b1f74", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING (pytensor.configdefaults): g++ not detected! PyTensor will be unable to compile C-implementations and will default to Python. Performance may be severely degraded. To remove this warning, set PyTensor flags cxx to an empty string.\n", "WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n" ] }, { "data": { "text/plain": [ " Formula: heads ~ 1\n", " Family: bernoulli\n", " Link: p = identity\n", " Observations: 50\n", " Priors: \n", " target = p\n", " Common-level effects\n", " Intercept ~ Uniform(lower: 0.0, upper: 1.0)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import bambi as bmb\n", "\n", "formula = \"heads ~ 1\"\n", "\n", "priors1 = {\n", " \"Intercept\": bmb.Prior(\"Uniform\", lower=0, upper=1),\n", " # \"Intercept\": bmb.Prior(\"Beta\", alpha=1, beta=1),\n", "}\n", "\n", "#######################################################\n", "mod1 = bmb.Model(formula,\n", " priors=priors1,\n", " family=\"bernoulli\",\n", " link=\"identity\",\n", " data=df1)\n", "mod1.set_alias({\"Intercept\":\"p prior\"})\n", "mod1" ] }, { "cell_type": "code", "execution_count": 11, "id": "70928fdc-a255-4cab-8624-3a34a24957e3", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "cluster__obs__ (50)\n", "\n", "__obs__ (50)\n", "\n", "\n", "\n", "p prior\n", "\n", "p prior\n", "~\n", "Uniform\n", "\n", "\n", "\n", "p\n", "\n", "p\n", "~\n", "Deterministic\n", "\n", "\n", "\n", "p prior->p\n", "\n", "\n", "\n", "\n", "\n", "heads\n", "\n", "heads\n", "~\n", "Bernoulli\n", "\n", "\n", "\n", "p->heads\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mod1.build()\n", "mod1.graph()" ] }, { "cell_type": "code", "execution_count": 12, "id": "eda4bf9e-c1ea-4636-951d-89689d41ff3a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Modeling the probability that heads==1\n", "Auto-assigning NUTS sampler...\n", "Initializing NUTS using jitter+adapt_diag...\n", "Multiprocess sampling (4 chains in 4 jobs)\n", "NUTS: [p prior]\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/scalar/basic.py:3056: RuntimeWarning: divide by zero encountered in log1p\n", " return np.log1p(x)\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/scalar/basic.py:3056: RuntimeWarning: divide by zero encountered in log1p\n", " return np.log1p(x)\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/tensor/elemwise.py:754: RuntimeWarning: divide by zero encountered in impl (vectorized)\n", " variables = ufunc(*ufunc_args, **ufunc_kwargs)\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/tensor/elemwise.py:754: RuntimeWarning: divide by zero encountered in impl (vectorized)\n", " variables = ufunc(*ufunc_args, **ufunc_kwargs)\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/scalar/basic.py:2000: RuntimeWarning: divide by zero encountered in divide\n", " return x / y\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/scalar/basic.py:2000: RuntimeWarning: divide by zero encountered in divide\n", " return x / y\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/numpy/core/fromnumeric.py:88: RuntimeWarning: invalid value encountered in reduce\n", " return ufunc.reduce(obj, axis, dtype, out, **passkwargs)\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/numpy/core/fromnumeric.py:88: RuntimeWarning: invalid value encountered in reduce\n", " return ufunc.reduce(obj, axis, dtype, out, **passkwargs)\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/tensor/elemwise.py:754: RuntimeWarning: invalid value encountered in impl (vectorized)\n", " variables = ufunc(*ufunc_args, **ufunc_kwargs)\n", "/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/pytensor/tensor/elemwise.py:754: RuntimeWarning: invalid value encountered in impl (vectorized)\n", " variables = ufunc(*ufunc_args, **ufunc_kwargs)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "910d86f52fce4115b239026df57b5897", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 3 seconds.\n" ] } ], "source": [ "idata1 = mod1.fit(draws=2000)" ] }, { "cell_type": "code", "execution_count": 13, "id": "75e16c56-f159-4513-9b60-0f242c022119", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
arviz.InferenceData
\n", "
\n", "
    \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset> Size: 80kB\n",
             "Dimensions:  (chain: 4, draw: 2000)\n",
             "Coordinates:\n",
             "  * chain    (chain) int64 32B 0 1 2 3\n",
             "  * draw     (draw) int64 16kB 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999\n",
             "Data variables:\n",
             "    p prior  (chain, draw) float64 64kB 0.6102 0.6364 0.6419 ... 0.6463 0.5637\n",
             "Attributes:\n",
             "    created_at:                  2024-10-14T14:08:12.695529+00:00\n",
             "    arviz_version:               0.19.0\n",
             "    inference_library:           pymc\n",
             "    inference_library_version:   5.17.0\n",
             "    sampling_time:               2.8274266719818115\n",
             "    tuning_steps:                1000\n",
             "    modeling_interface:          bambi\n",
             "    modeling_interface_version:  0.14.0

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset> Size: 992kB\n",
             "Dimensions:                (chain: 4, draw: 2000)\n",
             "Coordinates:\n",
             "  * chain                  (chain) int64 32B 0 1 2 3\n",
             "  * draw                   (draw) int64 16kB 0 1 2 3 4 ... 1996 1997 1998 1999\n",
             "Data variables: (12/17)\n",
             "    acceptance_rate        (chain, draw) float64 64kB 1.0 1.0 1.0 ... 1.0 0.6039\n",
             "    diverging              (chain, draw) bool 8kB False False ... False False\n",
             "    energy                 (chain, draw) float64 64kB 34.57 33.13 ... 34.17\n",
             "    energy_error           (chain, draw) float64 64kB -1.533 -0.181 ... 0.5044\n",
             "    index_in_trajectory    (chain, draw) int64 64kB -1 3 -1 1 3 ... -2 -2 1 1 1\n",
             "    largest_eigval         (chain, draw) float64 64kB nan nan nan ... nan nan\n",
             "    ...                     ...\n",
             "    process_time_diff      (chain, draw) float64 64kB 0.000578 ... 0.0003\n",
             "    reached_max_treedepth  (chain, draw) bool 8kB False False ... False False\n",
             "    smallest_eigval        (chain, draw) float64 64kB nan nan nan ... nan nan\n",
             "    step_size              (chain, draw) float64 64kB 1.012 1.012 ... 1.438\n",
             "    step_size_bar          (chain, draw) float64 64kB 1.441 1.441 ... 1.293\n",
             "    tree_depth             (chain, draw) int64 64kB 2 2 1 1 2 2 ... 2 2 2 1 2 1\n",
             "Attributes:\n",
             "    created_at:                  2024-10-14T14:08:12.702456+00:00\n",
             "    arviz_version:               0.19.0\n",
             "    inference_library:           pymc\n",
             "    inference_library_version:   5.17.0\n",
             "    sampling_time:               2.8274266719818115\n",
             "    tuning_steps:                1000\n",
             "    modeling_interface:          bambi\n",
             "    modeling_interface_version:  0.14.0

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset> Size: 800B\n",
             "Dimensions:  (__obs__: 50)\n",
             "Coordinates:\n",
             "  * __obs__  (__obs__) int64 400B 0 1 2 3 4 5 6 7 8 ... 42 43 44 45 46 47 48 49\n",
             "Data variables:\n",
             "    heads    (__obs__) int64 400B 1 1 0 0 1 0 1 1 1 1 1 ... 0 1 0 1 0 1 0 1 1 0\n",
             "Attributes:\n",
             "    created_at:                  2024-10-14T14:08:12.704210+00:00\n",
             "    arviz_version:               0.19.0\n",
             "    inference_library:           pymc\n",
             "    inference_library_version:   5.17.0\n",
             "    modeling_interface:          bambi\n",
             "    modeling_interface_version:  0.14.0

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
\n", "
\n", " " ], "text/plain": [ "Inference data with groups:\n", "\t> posterior\n", "\t> sample_stats\n", "\t> observed_data" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idata1" ] }, { "cell_type": "code", "execution_count": 14, "id": "ff859442-5a5f-4b60-b6f3-06e3c8e9c0a4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meansdhdi_3%hdi_97%
p prior0.6740.0640.5540.792
\n", "
" ], "text/plain": [ " mean sd hdi_3% hdi_97%\n", "p prior 0.674 0.064 0.554 0.792" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import arviz as az\n", "az.summary(idata1, kind=\"stats\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "649d6ffe-2919-46d9-bbbd-aa10a6bcdb98", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meansdhdi_3%hdi_97%mcse_meanmcse_sdess_bulkess_tailr_hat
p prior0.6740.0640.5540.7920.0010.0013491.05335.01.0
\n", "
" ], "text/plain": [ " mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk \\\n", "p prior 0.674 0.064 0.554 0.792 0.001 0.001 3491.0 \n", "\n", " ess_tail r_hat \n", "p prior 5335.0 1.0 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "az.summary(idata1)" ] }, { "cell_type": "code", "execution_count": 16, "id": "044e2d0b-9e80-4c3b-8b60-25522c13716b", "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "'var names: \"[\\'Intercept\\'] are not present\" in dataset'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "File \u001b[0;32m~/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/arviz/utils.py:78\u001b[0m, in \u001b[0;36m_var_names\u001b[0;34m(var_names, data, filter_vars, errors)\u001b[0m\n\u001b[1;32m 77\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m---> 78\u001b[0m var_names \u001b[38;5;241m=\u001b[39m \u001b[43m_subset_list\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 79\u001b[0m \u001b[43m \u001b[49m\u001b[43mvar_names\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mall_vars\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfilter_items\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mfilter_vars\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mwarn\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43merrors\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43merrors\u001b[49m\n\u001b[1;32m 80\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 81\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n", "File \u001b[0;32m~/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/arviz/utils.py:158\u001b[0m, in \u001b[0;36m_subset_list\u001b[0;34m(subset, whole_list, filter_items, warn, errors)\u001b[0m\n\u001b[1;32m 157\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m np\u001b[38;5;241m.\u001b[39mall(existing_items) \u001b[38;5;129;01mand\u001b[39;00m (errors \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mraise\u001b[39m\u001b[38;5;124m\"\u001b[39m):\n\u001b[0;32m--> 158\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mnp\u001b[38;5;241m.\u001b[39marray(subset)[\u001b[38;5;241m~\u001b[39mexisting_items]\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m are not present\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 160\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m subset\n", "\u001b[0;31mKeyError\u001b[0m: \"['Intercept'] are not present\"", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[16], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43maz\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mplot_posterior\u001b[49m\u001b[43m(\u001b[49m\u001b[43midata1\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mvar_names\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mIntercept\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mpoint_estimate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mmode\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mhdi_prob\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m0.9\u001b[39;49m\u001b[43m)\u001b[49m;\n", "File \u001b[0;32m~/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/arviz/plots/posteriorplot.py:231\u001b[0m, in \u001b[0;36mplot_posterior\u001b[0;34m(data, var_names, filter_vars, combine_dims, transform, coords, grid, figsize, textsize, hdi_prob, multimodal, skipna, round_to, point_estimate, group, rope, ref_val, rope_color, ref_val_color, kind, bw, circular, bins, labeller, ax, backend, backend_kwargs, show, **kwargs)\u001b[0m\n\u001b[1;32m 229\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m transform \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 230\u001b[0m data \u001b[38;5;241m=\u001b[39m transform(data)\n\u001b[0;32m--> 231\u001b[0m var_names \u001b[38;5;241m=\u001b[39m \u001b[43m_var_names\u001b[49m\u001b[43m(\u001b[49m\u001b[43mvar_names\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfilter_vars\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 233\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m coords \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 234\u001b[0m coords \u001b[38;5;241m=\u001b[39m {}\n", "File \u001b[0;32m~/Projects/Minireference/STATSbook/noBSstatsnotebooks/venv/lib/python3.12/site-packages/arviz/utils.py:83\u001b[0m, in \u001b[0;36m_var_names\u001b[0;34m(var_names, data, filter_vars, errors)\u001b[0m\n\u001b[1;32m 81\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[1;32m 82\u001b[0m msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m \u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mjoin((\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mvar names:\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00merr\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124min dataset\u001b[39m\u001b[38;5;124m\"\u001b[39m))\n\u001b[0;32m---> 83\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(msg) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[1;32m 84\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m var_names\n", "\u001b[0;31mKeyError\u001b[0m: 'var names: \"[\\'Intercept\\'] are not present\" in dataset'" ] } ], "source": [ "az.plot_posterior(idata1, var_names=\"Intercept\", point_estimate=\"mode\", hdi_prob=0.9);" ] }, { "cell_type": "code", "execution_count": null, "id": "4c97ca76-da06-4e3a-86ac-bc823fad04d0", "metadata": {}, "outputs": [], "source": [ "# ALT. manual plot of posterior density\n", "post = az.extract(idata1, group=\"posterior\", var_names=\"Intercept\").values\n", "sns.kdeplot(x=post.flatten(), bw_adjust=1);" ] }, { "cell_type": "code", "execution_count": null, "id": "62240476-5dfd-465f-8d5d-b561de0cf061", "metadata": {}, "outputs": [], "source": [ "# BONUS: verify the prior was flat\n", "# mod1.plot_priors();" ] }, { "cell_type": "code", "execution_count": null, "id": "8c25710e-2c83-4792-8296-98366d5f9236", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "64abe9f5-4ba7-4d1c-ab55-2915e218cf39", "metadata": {}, "outputs": [], "source": [ "# mod1.predict(idata1, kind=\"response\")\n", "# preds = az.extract(idata1, group=\"posterior_predictive\", var_names=\"win\").values\n", "# sns.histplot(x=preds.sum(axis=0), stat=\"density\")\n", "# sns.kdeplot(x=post.mean(axis=0)*30)" ] }, { "cell_type": "markdown", "id": "f3a3df96-6458-4f60-9fc5-8f4eff69bd61", "metadata": {}, "source": [ "### ALT. Declare a model in PyMC" ] }, { "cell_type": "code", "execution_count": null, "id": "bcbb2198-3b4b-4ede-9cde-2348a9cf00f3", "metadata": {}, "outputs": [], "source": [ "import pymc as pm\n", "\n", "with pm.Model() as model:\n", " # Specify the prior distribution of unknown parameter\n", " p = pm.Beta(\"p\", alpha=1, beta=1)\n", " # Specify the likelihood distribution and condition on the observed data\n", " y_obs = pm.Binomial(\"y_obs\", n=1, p=p, observed=df1)\n", " # Sample from the posterior distribution\n", " idata1_alt = pm.sample(draws=2000)" ] }, { "cell_type": "code", "execution_count": null, "id": "2bafd751-38f0-45f1-898f-6ae6190600ea", "metadata": {}, "outputs": [], "source": [ "az.summary(idata1_alt, kind=\"stats\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1745b4ca-a337-4a19-bf71-176bcc4e9165", "metadata": {}, "outputs": [], "source": [ "pred_dists1 = (pm.sample_prior_predictive(1000, model).prior_predictive[\"y_obs\"].values,\n", " pm.sample_posterior_predictive(idata1, model).posterior_predictive[\"y_obs\"].values)" ] }, { "cell_type": "code", "execution_count": null, "id": "7f719710-de85-4a83-b9b8-c1c8181887e8", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(4, 1, figsize=(9, 9))\n", "\n", "for idx, n_d, dist in zip((1, 3), (\"Prior\", \"Posterior\"), pred_dists1):\n", " az.plot_dist(dist.sum(-1), \n", " hist_kwargs={\"color\":\"0.5\", \"bins\":range(0, 22)},\n", " ax=axes[idx])\n", " axes[idx].set_title(f\"{n_d} predictive distribution\", fontweight='bold')\n", " axes[idx].set_xlim(-1, 21)\n", " axes[idx].set_xlabel(\"number of success\")\n", "\n", "az.plot_dist(pm.draw(p, 1000),\n", " plot_kwargs={\"color\":\"0.5\"},\n", " fill_kwargs={'alpha':1}, ax=axes[0])\n", "axes[0].set_title(\"Prior distribution\", fontweight='bold')\n", "axes[0].set_xlim(0, 1)\n", "axes[0].set_ylim(0, 4)\n", "axes[0].tick_params(axis='both', pad=7)\n", "axes[0].set_xlabel(\"p\")\n", "\n", "az.plot_dist(idata1.posterior[\"Intercept\"],\n", " plot_kwargs={\"color\":\"0.5\"},\n", " fill_kwargs={'alpha':1},\n", " ax=axes[2])\n", "axes[2].set_title(\"Posterior distribution\", fontweight='bold')\n", "axes[2].set_xlim(0, 1)\n", "axes[2].tick_params(axis='both', pad=7)\n", "axes[2].set_xlabel(\"p\");" ] }, { "cell_type": "code", "execution_count": null, "id": "9bad4afd-a779-4ec2-a642-0459a750ffc5", "metadata": {}, "outputs": [], "source": [ "# ???\n", "# from scipy.stats import binom\n", "# predictions1 = (binom(n=1, p=idata1.posterior[\"Intercept\"].mean()).rvs((4000, len(ctosses))),\n", "# pred_dists1[1])\n", "# for d, c, l in zip(predictions, (\"C0\", \"C4\"), (\"posterior mean\", \"posterior predictive\")):\n", "# ax = az.plot_dist(d.sum(-1),\n", "# label=l,\n", "# figsize=(10, 5),\n", "# hist_kwargs={\"alpha\": 0.5, \"color\":c, \"bins\":range(0, 22)})\n", "# ax.set_yticks([])\n", "# ax.set_xlabel(\"number of success\")" ] }, { "cell_type": "code", "execution_count": null, "id": "8327a8e1-56b7-4bbf-a964-b96bcd7d8b69", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f9413fbb-9260-4dfe-892c-e34209b19a19", "metadata": {}, "source": [ "## Example 2: estimating the IQ score" ] }, { "cell_type": "code", "execution_count": null, "id": "130576b1-e64e-4bde-9b4f-dfa2dffdf350", "metadata": {}, "outputs": [], "source": [ "# Sample of IQ observations\n", "iqs = [ 82.6, 105.5, 96.7, 84.0, 127.2, 98.8, 94.3,\n", " 122.1, 86.8, 86.1, 107.9, 118.9, 116.5, 101.0,\n", " 91.0, 130.0, 155.7, 120.8, 107.9, 117.1, 100.1,\n", " 108.2, 99.8, 103.6, 108.1, 110.3, 101.8, 131.7,\n", " 103.8, 116.4]" ] }, { "cell_type": "markdown", "id": "77982532-0bbb-4124-b372-843f95fec9fa", "metadata": {}, "source": [ "We assume the IQ scores come from a population with standard deviation $\\sigma = 15$." ] }, { "cell_type": "code", "execution_count": null, "id": "08134f2d-75c1-4d1e-846a-607a8901fd82", "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame({\"iq\":iqs})\n", "# df2.head()\n", "np.mean(iqs)" ] }, { "cell_type": "code", "execution_count": null, "id": "052f4c58-0df9-4de4-9c8f-c1f634547cb9", "metadata": {}, "outputs": [], "source": [ "# ALT. load the `iqs.csv` data file from the exercises/ folder\n", "df2 = pd.read_csv(\"../datasets/exercises/iqs.csv\")\n", "# df2.head()\n", "df2[\"iq\"].mean()" ] }, { "cell_type": "markdown", "id": "d23761d8-d6bd-422f-90b2-aa691b2e8b03", "metadata": {}, "source": [ "The model is\n", "\n", "$$\n", " X \\sim \\mathcal{N}(M, \\sigma=15)\n", " \\qquad\n", " M \\sim \\mathcal{N}(\\mu_M=100,\\sigma_M=40).\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "id": "16768470-1bca-4169-bbe0-e08843ceda7c", "metadata": {}, "outputs": [], "source": [ "#######################################################" ] }, { "cell_type": "code", "execution_count": null, "id": "81781d51-eb70-4956-a578-78149c8c0939", "metadata": {}, "outputs": [], "source": [ "import bambi as bmb\n", "import pymc as pm\n", "\n", "formula2 = \"iq ~ 1\"\n", "\n", "priors2 = {\n", " \"Intercept\": bmb.Prior(\"Normal\", mu=100, sigma=40),\n", " # \"sigma\": bmb.Prior(\"Normal\", mu=15, sigma=0.01), # hack to get constant value\n", " # \"sigma\": bmb.Prior(\"Deterministic\", var=pm.ConstantData(\"sigma\", value=15)),\n", " # pm.ConstantData\n", " \"sigma\": bmb.Prior(\"ConstantData\", value=15),\n", "}\n", "\n", "mod2 = bmb.Model(formula2,\n", " priors=priors2,\n", " family=\"gaussian\",\n", " link=\"identity\",\n", " data=df2)\n", "\n", "mod2.set_alias({\"Intercept\":\"mu prior\"})\n", "mod2" ] }, { "cell_type": "code", "execution_count": null, "id": "71c9ef16-bbc2-4b38-929f-dabcf6ae0b65", "metadata": {}, "outputs": [], "source": [ "mod2.build()\n", "mod2.graph()" ] }, { "cell_type": "code", "execution_count": null, "id": "2fa0866e-a323-478f-93cf-bbeffda62261", "metadata": {}, "outputs": [], "source": [ "idata2 = mod2.fit(draws=2000)\n", "# idata2" ] }, { "cell_type": "code", "execution_count": null, "id": "bbfb991b-b435-4920-8729-059425cdc8c6", "metadata": {}, "outputs": [], "source": [ "az.plot_posterior(idata2, var_names=\"Intercept\", point_estimate=\"mode\", hdi_prob=0.9);" ] }, { "cell_type": "code", "execution_count": null, "id": "38e3af1f-e177-4503-a839-f7de1686d873", "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "az.summary(idata2, kind=\"stats\")" ] }, { "cell_type": "code", "execution_count": null, "id": "44ec68ed-2399-45df-b922-8b730c984a38", "metadata": {}, "outputs": [], "source": [ "az.summary(idata2, kind=\"stats\", stat_focus=\"median\")" ] }, { "cell_type": "code", "execution_count": null, "id": "44c47a0d-9300-4052-a8ac-ca6153b4219d", "metadata": {}, "outputs": [], "source": [ "mod2.backend.model.data_vars" ] }, { "cell_type": "code", "execution_count": null, "id": "140acafd-2d10-4a37-bf25-1c9c39db79da", "metadata": {}, "outputs": [], "source": [ "mod2.backend.model.observed_RVs" ] }, { "cell_type": "code", "execution_count": null, "id": "cb6ba0f0-ab72-415c-a2b9-79f3fe634952", "metadata": {}, "outputs": [], "source": [ "mod2.backend.model.unobserved_RVs" ] }, { "cell_type": "code", "execution_count": null, "id": "7db473f8-1f0d-474c-a3c2-e8224770e09a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "393d7af3-3e51-458b-ad18-1dd7e2961064", "metadata": {}, "source": [ "## Discussion" ] }, { "cell_type": "markdown", "id": "32efdf18-c42c-472c-997b-d5d39c35cef6", "metadata": {}, "source": [ "### MCMC diagnostic plots\n", "\n", "There are several Arviz plots we can use to check if the Markov Chain Monte Carlo chains were sampling from the posterior as expected, or ..." ] }, { "cell_type": "code", "execution_count": null, "id": "d523e69b-9248-4c09-a5d6-d880e14a913b", "metadata": {}, "outputs": [], "source": [ "# az.plot_trace(idata);\n", "# az.plot_trace(idata, kind=\"rank_bars\");\n", "# az.plot_trace(idata, kind=\"rank_vlines\");\n", "\n", "# plot_cap ~= partial regression plots?\n", "# e.g. plot_cap(model_2, fitted_2, \"weight\", ax=ax);" ] }, { "cell_type": "markdown", "id": "42e59770-6df4-4d98-bc6b-013e95ed488b", "metadata": {}, "source": [ "### Choice of priors\n", "\n", "Different priors lead to different posteriors.\n", "\n", "See `Code 1.8 and Figure 1.7` in\n", "[chp_01.ipynb](./explorations/PyBayesianBookCode/notebooks_updated/chp_01.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "id": "11152a46-4fb9-409c-8714-694b3e0e22a1", "metadata": {}, "outputs": [], "source": [ "from scipy.stats import beta\n", "\n", "x = np.linspace(0, 1, 500)\n", "params = [(0.5, 0.5), (1, 1), (3,3), (100, 25)]\n", "\n", "labels = [\"Jeffreys\",\n", " \"MaxEnt\",\n", " \"Weakly Informative\",\n", " \"Informative\"]\n", "\n", "_, ax = plt.subplots()\n", "for (α, β), label, c in zip(params, labels, (0, 1, 4, 2)):\n", " pdf = beta.pdf(x, α, β)\n", " ax.plot(x, pdf, label=f\"{label}\", c=f\"C{c}\", lw=3)\n", " ax.set(yticks=[], xlabel=\"θ\", title=\"Priors\")\n", " ax.legend()\n", "# plt.savefig(\"img/chp01/prior_informativeness_spectrum.png\")" ] }, { "cell_type": "markdown", "id": "0ad1e1f1-001a-4f1b-88a8-0be4e2bd277b", "metadata": {}, "source": [ "### Bayesian workflow\n", "\n", "See also [chp_09.ipynb](./explorations/PyBayesianBookCode/notebooks_updated/chp_09.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "id": "4019a510-e3ec-4823-84ea-74f43772895d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "221fb158-499c-4bf6-8bc9-54d81d9bbab6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "cf70d07a-37fd-4fe4-9a3f-15063661e3d6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }