From 557dfb22f46205b3b6d73faa318c9c8faae7371d Mon Sep 17 00:00:00 2001
From: Adityajaiswal03
Date: Mon, 5 Aug 2024 23:49:33 +0530
Subject: [PATCH] ReadMe doc updated

---
 .../learn/core_notebooks/pymc_overview.ipynb | 197 +++++++++++++++++-
 1 file changed, 195 insertions(+), 2 deletions(-)

diff --git a/docs/source/learn/core_notebooks/pymc_overview.ipynb b/docs/source/learn/core_notebooks/pymc_overview.ipynb
index 3f2bebffc5..2a40c3c24b 100644
--- a/docs/source/learn/core_notebooks/pymc_overview.ipynb
+++ b/docs/source/learn/core_notebooks/pymc_overview.ipynb
@@ -2713,7 +2713,200 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Case study 1: Educational Outcomes for Hearing-impaired Children\n",
+    "Linear Regression Example\n",
+    "==========================\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Case study 1: Predicting Plant Growth Using Environmental Variables\n",
+    "\n",
+    "Plant growth can be influenced by multiple factors, and understanding these relationships is crucial for optimizing agricultural practices.\n",
+    "\n",
+    "Imagine we conduct an experiment to predict the growth of a plant based on different environmental variables."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The Experiment\n",
+    "\n",
+    "We generate synthetic data for the experiment by drawing from a normal distribution. The independent variables in our study include:\n",
+    "\n",
+    "- Sunlight Hours: Number of hours the plant is exposed to sunlight daily.\n",
+    "- Water Amount: Daily water amount given to the plant (in milliliters).\n",
+    "- Soil Nitrogen Content: Percentage of nitrogen content in the soil.\n",
+    "\n",
+    "The dependent variable is:\n",
+    "\n",
+    "- Plant Growth (y): Measured as the increase in plant height (in centimeters) over a certain period."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pymc as pm\n",
+    "\n",
+    "# Taking draws from a normal distribution\n",
+    "seed = 42\n",
+    "x_dist = pm.Normal.dist(shape=(100, 3))\n",
+    "x_data = pm.draw(x_dist, random_seed=seed)\n",
+    "\n",
+    "# Independent Variables:\n",
+    "# Sunlight Hours: Number of hours the plant is exposed to sunlight daily.\n",
+    "# Water Amount: Daily water amount given to the plant (in milliliters).\n",
+    "# Soil Nitrogen Content: Percentage of nitrogen content in the soil.\n",
+    "\n",
+    "\n",
+    "# Dependent Variable:\n",
+    "# Plant Growth (y): Measured as the increase in plant height (in centimeters) over a certain period.\n",
+    "\n",
+    "\n",
+    "# Define coordinate values for all dimensions of the data\n",
+    "coords = {\n",
+    "    \"trial\": range(100),\n",
+    "    \"features\": [\"sunlight hours\", \"water amount\", \"soil nitrogen\"],\n",
+    "}\n",
+    "\n",
+    "# Define generative model\n",
+    "with pm.Model(coords=coords) as generative_model:\n",
+    "    x = pm.Data(\"x\", x_data, dims=[\"trial\", \"features\"])\n",
+    "\n",
+    "    # Model parameters\n",
+    "    betas = pm.Normal(\"betas\", dims=\"features\")\n",
+    "    sigma = pm.HalfNormal(\"sigma\")\n",
+    "\n",
+    "    # Linear model\n",
+    "    mu = x @ betas\n",
+    "\n",
+    "    # Likelihood\n",
+    "    # Assuming we measure deviation of each plant from baseline\n",
+    "    plant_growth = pm.Normal(\"plant growth\", mu, sigma, dims=\"trial\")\n",
+    "\n",
+    "\n",
+    "# Generating data from model by fixing parameters\n",
+    "fixed_parameters = {\n",
+    "    \"betas\": [5, 20, 2],\n",
+    "    \"sigma\": 0.5,\n",
+    "}\n",
+    "with pm.do(generative_model, fixed_parameters) as synthetic_model:\n",
+    "    idata = pm.sample_prior_predictive(random_seed=seed)  # Sample from prior predictive distribution.\n",
+    "    synthetic_y = idata.prior[\"plant growth\"].sel(draw=0, chain=0)\n",
+    "\n",
+    "\n",
+    "# Infer parameters conditioned on observed data\n",
+    "with pm.observe(generative_model, {\"plant growth\": synthetic_y}) as inference_model:\n",
+    "    idata = pm.sample(random_seed=seed)\n",
+    "\n",
+    "    summary = pm.stats.summary(idata, var_names=[\"betas\", \"sigma\"])\n",
+    "    print(summary)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the summary, we can see that the posterior means of the inferred parameters are very close to the fixed parameter values:\n",
+    "\n",
+    "| Params | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |\n",
+    "|------------------------|--------|-------|--------|---------|-----------|---------|----------|----------|-------|\n",
+    "| betas[sunlight hours] | 4.972 | 0.054 | 4.866 | 5.066 | 0.001 | 0.001 | 3003 | 1257 | 1 |\n",
+    "| betas[water amount] | 19.963 | 0.051 | 19.872 | 20.062 | 0.001 | 0.001 | 3112 | 1658 | 1 |\n",
+    "| betas[soil nitrogen] | 1.994 | 0.055 | 1.899 | 2.107 | 0.001 | 0.001 | 3221 | 1559 | 1 |\n",
+    "| sigma | 0.511 | 0.037 | 0.438 | 0.575 | 0.001 | 0 | 2945 | 1522 | 1 |\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Simulate new data conditioned on inferred parameters\n",
+    "new_x_data = pm.draw(\n",
+    "    pm.Normal.dist(shape=(3, 3)),\n",
+    "    random_seed=seed,\n",
+    ")\n",
+    "new_coords = coords | {\"trial\": [0, 1, 2]}\n",
+    "\n",
+    "with inference_model:\n",
+    "    pm.set_data({\"x\": new_x_data}, coords=new_coords)\n",
+    "    pm.sample_posterior_predictive(\n",
+    "        idata,\n",
+    "        predictions=True,\n",
+    "        extend_inferencedata=True,\n",
+    "        random_seed=seed,\n",
+    "    )\n",
+    "\n",
+    "pm.stats.summary(idata.predictions, kind=\"stats\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The new data conditioned on inferred parameters would look like:\n",
+    "\n",
+    "| Output | mean | sd | hdi_3% | hdi_97% |\n",
+    "|------------------|--------|-------|--------|---------|\n",
+    "| plant growth[0] | 14.229 | 0.515 | 13.325 | 15.272 |\n",
+    "| plant growth[1] | 24.418 | 0.511 | 23.428 | 25.326 |\n",
+    "| plant growth[2] | -6.747 | 0.511 | -7.740 | -5.797 |\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Simulate new data, under a scenario where the first beta is zero\n",
+    "with pm.do(\n",
+    "    inference_model,\n",
+    "    {inference_model[\"betas\"]: inference_model[\"betas\"] * [0, 1, 1]},\n",
+    ") as plant_growth_model:\n",
+    "    new_predictions = pm.sample_posterior_predictive(\n",
+    "        idata,\n",
+    "        predictions=True,\n",
+    "        random_seed=seed,\n",
+    "    )\n",
+    "\n",
+    "pm.stats.summary(new_predictions, kind=\"stats\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The new data conditioned on inferred parameters would look like:\n",
+    "\n",
+    "| Output | mean | sd | hdi_3% | hdi_97% |\n",
+    "|------------------|--------|-------|--------|---------|\n",
+    "| plant growth[0] | 14.229 | 0.515 | 13.325 | 15.272 |\n",
+    "| plant growth[1] | 24.418 | 0.511 | 23.428 | 25.326 |\n",
+    "| plant growth[2] | -6.747 | 0.511 | -7.740 | -5.797 |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Case study 2: Educational Outcomes for Hearing-impaired Children\n",
     "\n",
     "As a motivating example, we will use a dataset of educational outcomes for children with hearing impairment. Here, we are interested in determining factors that are associated with better or poorer learning outcomes. "
    ]
@@ -3527,7 +3720,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Case study 2: Coal mining disasters\n",
+    "## Case study 3: Coal mining disasters\n",
     "\n",
     "Consider the following time series of recorded coal mining disasters in the UK from 1851 to 1962 (Jarrett, 1979). The number of disasters is thought to have been affected by changes in safety regulations during this period. Unfortunately, we also have a pair of years with missing data, identified as missing by a `nan` in the pandas `Series`. These missing values will be automatically imputed by PyMC. \n",
     "\n",