diff --git a/_freeze/content/find/parsnip/index/execute-results/html.json b/_freeze/content/find/parsnip/index/execute-results/html.json
index 8718c532..cd872254 100644
--- a/_freeze/content/find/parsnip/index/execute-results/html.json
+++ b/_freeze/content/find/parsnip/index/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "90dba81c1c48d850e12a2f4bae0f8d64",
"result": {
- "markdown": "---\ntitle: Search parsnip models\nweight: 2\ndescription: | \n Find model types, engines, and arguments to fit and predict in the tidymodels framework.\n---\n\n\nTo learn about the parsnip package, see [*Get Started: Build a Model*](/start/models/). Use the tables below to find [model types and engines](#models) and to explore [model arguments](#model-args).\n\n\n\n\n\n\n\n\n\n## Explore models {#models}\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n\n\n
\n\nModels can be added by the user too. The article [How to build a parsnip model](/learn/develop/models/) walks you through the steps.\n\n## Explore model arguments {#model-args}\n\nThe parsnip package provides a consistent interface for working with similar models across different engines. This means that parsnip adopts standardized parameter names as arguments, and those names may be different from those used by the individual engines. The searchable table below provides a mapping between the parsnip and the engine arguments: \n\n
\n\n\n::: {.cell-output-display}\n```{=html}\n\n\n```\n:::\n",
+ "markdown": "---\ntitle: Search parsnip models\nweight: 2\ndescription: | \n Find model types, engines, and arguments to fit and predict in the tidymodels framework.\n---\n\n\nTo learn about the parsnip package, see [*Get Started: Build a Model*](/start/models/). Use the tables below to find [model types and engines](#models) and to explore [model arguments](#model-args).\n\n\n\n\n\n\n\n\n\n## Explore models {#models}\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n\n\n```\n:::\n\n\n
\n\nModels can be added by the user too. The article [How to build a parsnip model](/learn/develop/models/) walks you through the steps.\n\n## Explore model arguments {#model-args}\n\nThe parsnip package provides a consistent interface for working with similar models across different engines. This means that parsnip adopts standardized parameter names as arguments, and those names may be different from those used by the individual engines. The searchable table below provides a mapping between the parsnip and the engine arguments: \n\n
\n\n\n::: {.cell-output-display}\n```{=html}\n\n\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
diff --git a/_freeze/content/find/recipes/index/execute-results/html.json b/_freeze/content/find/recipes/index/execute-results/html.json
index c19dd1e2..cf7a1844 100644
--- a/_freeze/content/find/recipes/index/execute-results/html.json
+++ b/_freeze/content/find/recipes/index/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "14716ad5ffee97d2204641313d0efd02",
"result": {
- "markdown": "---\nsubtitle: Recipes\ntitle: Search recipe steps\nweight: 3\ndescription: | \n Find recipe steps in the tidymodels framework to help you prep your data for modeling.\n---\n\n\n\n\n\n\nTo learn about the recipes package, see [*Get Started: Preprocess your data with recipes*](/start/recipes/). The table below allows you to search for recipe steps across tidymodels packages.\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n\n\n```\n:::\n",
+ "markdown": "---\nsubtitle: Recipes\ntitle: Search recipe steps\nweight: 3\ndescription: | \n Find recipe steps in the tidymodels framework to help you prep your data for modeling.\n---\n\n\n\n\n\n\nTo learn about the recipes package, see [*Get Started: Preprocess your data with recipes*](/start/recipes/). The table below allows you to search for recipe steps across tidymodels packages.\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n\n\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
diff --git a/_freeze/content/learn/develop/parameters/index/execute-results/html.json b/_freeze/content/learn/develop/parameters/index/execute-results/html.json
index 83e91770..61fe1eb8 100644
--- a/_freeze/content/learn/develop/parameters/index/execute-results/html.json
+++ b/_freeze/content/learn/develop/parameters/index/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "b25235885fee717b9b516645517237ff",
"result": {
- "markdown": "---\ntitle: \"How to create a tuning parameter function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 4\ndescription: | \n Build functions to use in tuning both quantitative and qualitative parameters.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: dials and scales.\n\nSome models and recipe steps contain parameters that dials does not know about. You can construct new quantitative and qualitative parameters using `new_quant_param()` or `new_qual_param()`, respectively. This article is a guide to creating new parameters.\n\n## Quantitative parameters\n\nAs an example, let's consider the multivariate adaptive regression spline ([MARS](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline)) model, which creates nonlinear features from predictors and adds them to a linear regression models. The earth package is an excellent implementation of this method.\n\nMARS creates an initial set of features and then prunes them back to an appropriate size. This can be done automatically by `earth::earth()` or the number of final terms can be set by the user. The parsnip function `mars()` has a parameter called `num_terms` that defines this.\n\nWhat if we want to create a parameter for the number of *initial terms* included in the model. There is no argument in `parsnip::mars()` for this but we will make one now. The argument name in `earth::earth()` is `nk`, which is not very descriptive. Our parameter will be called `num_initial_terms`.\n\nWe use the `new_quant_param()` function since this is a numeric parameter. The main two arguments to a numeric parameter function are `range` and `trans`.\n\nThe `range` specifies the possible values of the parameter. For our example, a minimal value might be one or two. What is the upper limit? The default in the earth package is\n\n\n::: {.cell layout-align=\"center\" hash='cache/eart_0aaa451856e86c8fdc7e0c3f099c8de4'}\n\n```{.r .cell-code}\nmin(200, max(20, 2 * ncol(x))) + 1\n```\n:::\n\n\nwhere `x` is the predictor matrix. We often put in values that are either sensible defaults or are minimal enough to work for the majority of data sets. For now, let's specify an upper limit of 10 but this will be discussed more in the next section.\n\nThe other argument is `trans`, which represents a transformation that should be applied to the parameter values when working with them. For example, many regularization methods have a `penalty` parameter that tends to range between zero and some upper bound (let's say 1). The effect of going from a penalty value of 0.01 to 0.1 is much more impactful than going from 0.9 to 1.0. In such a case, it might make sense to work with this parameter in transformed units (such as the log, in this example). If new parameter values are generated at random, it helps if they are uniformly simulated in the transformed units and then converted back to the original units.\n\nThe `trans` parameter accepts a transformation object from the scales package. For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/scales_b25ae3bb346dde06d2a7e463ba1f4c4d'}\n\n```{.r .cell-code}\nlibrary(scales)\nlsf.str(\"package:scales\", pattern = \"_trans$\")\n#> asn_trans : function () \n#> atanh_trans : function () \n#> boxcox_trans : function (p, offset = 0) \n#> compose_trans : function (...) 
\n#> date_trans : function () \n#> exp_trans : function (base = exp(1)) \n#> hms_trans : function () \n#> identity_trans : function () \n#> log_trans : function (base = exp(1)) \n#> log10_trans : function () \n#> log1p_trans : function () \n#> log2_trans : function () \n#> logit_trans : function () \n#> modulus_trans : function (p, offset = 1) \n#> probability_trans : function (distribution, ...) \n#> probit_trans : function () \n#> pseudo_log_trans : function (sigma = 1, base = exp(1)) \n#> reciprocal_trans : function () \n#> reverse_trans : function () \n#> sqrt_trans : function () \n#> time_trans : function (tz = NULL) \n#> yj_trans : function (p)\nscales::log10_trans()\n#> Transformer: log-10 [1e-100, Inf]\n```\n:::\n\n\nA value of `NULL` means that no transformation should be used.\n\nA quantitative parameter function should have these two arguments and, in the function body, a call `new_quant_param()`. There are a few arguments to this function:\n\n\n::: {.cell layout-align=\"center\" hash='cache/new_quant_param_dc898a8030c6fa2847b68be9c2db5701'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nargs(new_quant_param)\n#> function (type = c(\"double\", \"integer\"), range = NULL, inclusive = NULL, \n#> default = deprecated(), trans = NULL, values = NULL, label = NULL, \n#> finalize = NULL) \n#> NULL\n```\n:::\n\n\n- Possible types are double precision and integers. The value of `type` should agree with the values of `range` in the function definition.\n\n- It's OK for our tuning to include the minimum or maximum, so we'll use `c(TRUE, TRUE)` for `inclusive`. If the value cannot include one end of the range, set one or both of these values to `FALSE`.\n\n- The `label` should be a named character string where the name is the parameter name and the value represents what will be printed automatically.\n\n- `finalize` is an argument that can set parts of the range. This is discussed more below.\n\nHere's an example of a basic quantitative parameter object:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms_d602318800beb1ef90a7bde3e6959438'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, 10L), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\n\nnum_initial_terms()\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 10]\n\n# Sample from the parameter:\nset.seed(4832856)\nnum_initial_terms() %>% value_sample(5)\n#> [1] 6 4 9 10 4\n```\n:::\n\n\n### Finalizing parameters\n\nIt might be the case that the range of the parameter is unknown. For example, parameters that are related to the number of columns in a data set cannot be exactly specified in the absence of data. In those cases, a placeholder of `unknown()` can be added. This will force the user to \"finalize\" the parameter object for their particular data set. Let's redefine our function with an `unknown()` value:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms-unk_9de7d72b673760c5098403e4f395b8d8'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\nnum_initial_terms()\n\n# Can we sample? 
\nnum_initial_terms() %>% value_sample(5)\n```\n:::\n\n\nThe `finalize` argument of `num_initial_terms()` can take a function that uses data to set the range. For example, the package already includes a few functions for finalization:\n\n\n::: {.cell layout-align=\"center\" hash='cache/dials-final-funcs_13c8f3f4f0f277ecb3c76d762cb7a32c'}\n\n```{.r .cell-code}\nlsf.str(\"package:dials\", pattern = \"^get_\")\n#> get_batch_sizes : function (object, x, frac = c(1/10, 1/3), ...) \n#> get_log_p : function (object, x, ...) \n#> get_n : function (object, x, log_vals = FALSE, ...) \n#> get_n_frac : function (object, x, log_vals = FALSE, frac = 1/3, ...) \n#> get_n_frac_range : function (object, x, log_vals = FALSE, frac = c(1/10, 5/10), ...) \n#> get_p : function (object, x, log_vals = FALSE, ...) \n#> get_rbf_range : function (object, x, seed = sample.int(10^5, 1), ...)\n```\n:::\n\n\nThese functions generally take a data frame of predictors (in an argument called `x`) and add the range of the parameter object. Using the formula in the earth package, we might use:\n\n\n::: {.cell layout-align=\"center\" hash='cache/earth-range_e1c9bf6b8f535f761d22d3b738ea8bb2'}\n\n```{.r .cell-code}\nget_initial_mars_terms <- function(object, x) {\n upper_bound <- min(200, max(20, 2 * ncol(x))) + 1\n upper_bound <- as.integer(upper_bound)\n bounds <- range_get(object)\n bounds$upper <- upper_bound\n range_set(object, bounds)\n}\n\n# Use the mtcars are the finalize the upper bound: \nnum_initial_terms() %>% get_initial_mars_terms(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\nOnce we add this function to the object, the general `finalize()` method can be used:\n\n\n::: {.cell layout-align=\"center\" hash='cache/final-obj_9b8361428190d870490441c3eecf012e'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = get_initial_mars_terms\n )\n}\n\nnum_initial_terms() %>% finalize(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\n## Qualitative parameters\n\nNow let's look at an example of a qualitative parameter. If a model includes a data aggregation step, we want to allow users to tune how our parameters are aggregated. For example, in embedding methods, possible values might be `min`, `max`, `mean`, `sum`, or to not aggregate at all (\"none\"). Since these cannot be put on a numeric scale, they are possible values of a qualitative parameter. We'll take \"character\" input (not \"logical\"), and we must specify the allowed values. By default we won't aggregate.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation_39f71033809bc19015c698dbeca6311d'}\n\n```{.r .cell-code}\naggregation <- function(values = c(\"none\", \"min\", \"max\", \"mean\", \"sum\")) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nWithin the dials package, the convention is to have the values contained in a separate vector whose name starts with `values_`. 
For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-vec_7dcc9f145f63cccd19b24d9fa23f4d10'}\n\n```{.r .cell-code}\nvalues_aggregation <- c(\"none\", \"min\", \"max\", \"mean\", \"sum\")\naggregation <- function(values = values_aggregation) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nThis step may not make sense if you are using the function in a script and not keeping it within a package.\n\nWe can use our `aggregation` parameters with dials functions.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-use_9ac92377d564310ae312bf63617fedad'}\n\n```{.r .cell-code}\naggregation()\n#> Warning: The `default` argument of `new_qual_param()` is deprecated as of\n#> dials 1.0.1.\n#> Aggregation Method (qualitative)\n#> 5 possible values include:\n#> 'none', 'min', 'max', 'mean' and 'sum'\naggregation() %>% value_sample(3)\n#> [1] \"min\" \"sum\" \"mean\"\n```\n:::\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Monterey 12.6.1\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-05\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3 2022-11-09 [1] CRAN (R 4.2.0)\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.2.0)\n#> scales * 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-09 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-12-27 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
+ "markdown": "---\ntitle: \"How to create a tuning parameter function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 4\ndescription: | \n Build functions to use in tuning both quantitative and qualitative parameters.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: dials and scales.\n\nSome models and recipe steps contain parameters that dials does not know about. You can construct new quantitative and qualitative parameters using `new_quant_param()` or `new_qual_param()`, respectively. This article is a guide to creating new parameters.\n\n## Quantitative parameters\n\nAs an example, let's consider the multivariate adaptive regression spline ([MARS](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline)) model, which creates nonlinear features from predictors and adds them to a linear regression models. The earth package is an excellent implementation of this method.\n\nMARS creates an initial set of features and then prunes them back to an appropriate size. This can be done automatically by `earth::earth()` or the number of final terms can be set by the user. The parsnip function `mars()` has a parameter called `num_terms` that defines this.\n\nWhat if we want to create a parameter for the number of *initial terms* included in the model. There is no argument in `parsnip::mars()` for this but we will make one now. The argument name in `earth::earth()` is `nk`, which is not very descriptive. Our parameter will be called `num_initial_terms`.\n\nWe use the `new_quant_param()` function since this is a numeric parameter. The main two arguments to a numeric parameter function are `range` and `trans`.\n\nThe `range` specifies the possible values of the parameter. For our example, a minimal value might be one or two. What is the upper limit? The default in the earth package is\n\n\n::: {.cell layout-align=\"center\" hash='cache/eart_0aaa451856e86c8fdc7e0c3f099c8de4'}\n\n```{.r .cell-code}\nmin(200, max(20, 2 * ncol(x))) + 1\n```\n:::\n\n\nwhere `x` is the predictor matrix. We often put in values that are either sensible defaults or are minimal enough to work for the majority of data sets. For now, let's specify an upper limit of 10 but this will be discussed more in the next section.\n\nThe other argument is `trans`, which represents a transformation that should be applied to the parameter values when working with them. For example, many regularization methods have a `penalty` parameter that tends to range between zero and some upper bound (let's say 1). The effect of going from a penalty value of 0.01 to 0.1 is much more impactful than going from 0.9 to 1.0. In such a case, it might make sense to work with this parameter in transformed units (such as the log, in this example). If new parameter values are generated at random, it helps if they are uniformly simulated in the transformed units and then converted back to the original units.\n\nThe `trans` parameter accepts a transformation object from the scales package. For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/scales_b25ae3bb346dde06d2a7e463ba1f4c4d'}\n\n```{.r .cell-code}\nlibrary(scales)\nlsf.str(\"package:scales\", pattern = \"_trans$\")\n#> asn_trans : function () \n#> atanh_trans : function () \n#> boxcox_trans : function (p, offset = 0) \n#> compose_trans : function (...) 
\n#> date_trans : function () \n#> exp_trans : function (base = exp(1)) \n#> hms_trans : function () \n#> identity_trans : function () \n#> log_trans : function (base = exp(1)) \n#> log10_trans : function () \n#> log1p_trans : function () \n#> log2_trans : function () \n#> logit_trans : function () \n#> modulus_trans : function (p, offset = 1) \n#> probability_trans : function (distribution, ...) \n#> probit_trans : function () \n#> pseudo_log_trans : function (sigma = 1, base = exp(1)) \n#> reciprocal_trans : function () \n#> reverse_trans : function () \n#> sqrt_trans : function () \n#> time_trans : function (tz = NULL) \n#> yj_trans : function (p)\nscales::log10_trans()\n#> Transformer: log-10 [1e-100, Inf]\n```\n:::\n\n\nA value of `NULL` means that no transformation should be used.\n\nA quantitative parameter function should have these two arguments and, in the function body, a call `new_quant_param()`. There are a few arguments to this function:\n\n\n::: {.cell layout-align=\"center\" hash='cache/new_quant_param_dc898a8030c6fa2847b68be9c2db5701'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nargs(new_quant_param)\n#> function (type = c(\"double\", \"integer\"), range = NULL, inclusive = NULL, \n#> default = deprecated(), trans = NULL, values = NULL, label = NULL, \n#> finalize = NULL) \n#> NULL\n```\n:::\n\n\n- Possible types are double precision and integers. The value of `type` should agree with the values of `range` in the function definition.\n\n- It's OK for our tuning to include the minimum or maximum, so we'll use `c(TRUE, TRUE)` for `inclusive`. If the value cannot include one end of the range, set one or both of these values to `FALSE`.\n\n- The `label` should be a named character string where the name is the parameter name and the value represents what will be printed automatically.\n\n- `finalize` is an argument that can set parts of the range. This is discussed more below.\n\nHere's an example of a basic quantitative parameter object:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms_d602318800beb1ef90a7bde3e6959438'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, 10L), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\n\nnum_initial_terms()\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 10]\n\n# Sample from the parameter:\nset.seed(4832856)\nnum_initial_terms() %>% value_sample(5)\n#> [1] 6 4 9 10 4\n```\n:::\n\n\n### Finalizing parameters\n\nIt might be the case that the range of the parameter is unknown. For example, parameters that are related to the number of columns in a data set cannot be exactly specified in the absence of data. In those cases, a placeholder of `unknown()` can be added. This will force the user to \"finalize\" the parameter object for their particular data set. Let's redefine our function with an `unknown()` value:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms-unk_9de7d72b673760c5098403e4f395b8d8'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\nnum_initial_terms()\n\n# Can we sample? 
\nnum_initial_terms() %>% value_sample(5)\n```\n:::\n\n\nThe `finalize` argument of `num_initial_terms()` can take a function that uses data to set the range. For example, the package already includes a few functions for finalization:\n\n\n::: {.cell layout-align=\"center\" hash='cache/dials-final-funcs_13c8f3f4f0f277ecb3c76d762cb7a32c'}\n\n```{.r .cell-code}\nlsf.str(\"package:dials\", pattern = \"^get_\")\n#> get_batch_sizes : function (object, x, frac = c(1/10, 1/3), ...) \n#> get_log_p : function (object, x, ...) \n#> get_n : function (object, x, log_vals = FALSE, ...) \n#> get_n_frac : function (object, x, log_vals = FALSE, frac = 1/3, ...) \n#> get_n_frac_range : function (object, x, log_vals = FALSE, frac = c(1/10, 5/10), ...) \n#> get_p : function (object, x, log_vals = FALSE, ...) \n#> get_rbf_range : function (object, x, seed = sample.int(10^5, 1), ...)\n```\n:::\n\n\nThese functions generally take a data frame of predictors (in an argument called `x`) and add the range of the parameter object. Using the formula in the earth package, we might use:\n\n\n::: {.cell layout-align=\"center\" hash='cache/earth-range_e1c9bf6b8f535f761d22d3b738ea8bb2'}\n\n```{.r .cell-code}\nget_initial_mars_terms <- function(object, x) {\n upper_bound <- min(200, max(20, 2 * ncol(x))) + 1\n upper_bound <- as.integer(upper_bound)\n bounds <- range_get(object)\n bounds$upper <- upper_bound\n range_set(object, bounds)\n}\n\n# Use the mtcars are the finalize the upper bound: \nnum_initial_terms() %>% get_initial_mars_terms(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\nOnce we add this function to the object, the general `finalize()` method can be used:\n\n\n::: {.cell layout-align=\"center\" hash='cache/final-obj_9b8361428190d870490441c3eecf012e'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = get_initial_mars_terms\n )\n}\n\nnum_initial_terms() %>% finalize(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\n## Qualitative parameters\n\nNow let's look at an example of a qualitative parameter. If a model includes a data aggregation step, we want to allow users to tune how our parameters are aggregated. For example, in embedding methods, possible values might be `min`, `max`, `mean`, `sum`, or to not aggregate at all (\"none\"). Since these cannot be put on a numeric scale, they are possible values of a qualitative parameter. We'll take \"character\" input (not \"logical\"), and we must specify the allowed values. By default we won't aggregate.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation_39f71033809bc19015c698dbeca6311d'}\n\n```{.r .cell-code}\naggregation <- function(values = c(\"none\", \"min\", \"max\", \"mean\", \"sum\")) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nWithin the dials package, the convention is to have the values contained in a separate vector whose name starts with `values_`. 
For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-vec_7dcc9f145f63cccd19b24d9fa23f4d10'}\n\n```{.r .cell-code}\nvalues_aggregation <- c(\"none\", \"min\", \"max\", \"mean\", \"sum\")\naggregation <- function(values = values_aggregation) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nThis step may not make sense if you are using the function in a script and not keeping it within a package.\n\nWe can use our `aggregation` parameters with dials functions.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-use_9ac92377d564310ae312bf63617fedad'}\n\n```{.r .cell-code}\naggregation()\n#> Warning: The `default` argument of `new_qual_param()` is deprecated as of\n#> dials 1.0.1.\n#> Aggregation Method (qualitative)\n#> 5 possible values include:\n#> 'none', 'min', 'max', 'mean' and 'sum'\naggregation() %>% value_sample(3)\n#> [1] \"min\" \"sum\" \"mean\"\n```\n:::\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Big Sur/Monterey 10.16\n#> system x86_64, darwin17.0\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-04\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.3 2022-08-22 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3.9000 2022-12-12 [1] local\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1.9000 2022-12-13 [1] local\n#> scales * 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-05 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-11-30 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
diff --git a/_freeze/content/learn/develop/recipes/index/execute-results/html.json b/_freeze/content/learn/develop/recipes/index/execute-results/html.json
index 8d8b8212..9b4ebc44 100644
--- a/_freeze/content/learn/develop/recipes/index/execute-results/html.json
+++ b/_freeze/content/learn/develop/recipes/index/execute-results/html.json
@@ -1,7 +1,7 @@
{
- "hash": "e4f84da00cddb4f9fbc6e267db2e8700",
+ "hash": "8db58da323045aa3496d83885e9bf155",
"result": {
- "markdown": "---\ntitle: \"Create your own recipe step function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 1\ndescription: | \n Write a new recipe step for data preprocessing.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: modeldata and tidymodels.\n\nThere are many existing recipe steps in packages like recipes, themis, textrecipes, and others. A full list of steps in CRAN packages [can be found here](/find/recipes/). However, you might need to define your own preprocessing operations; this article describes how to do that. If you are looking for good examples of existing steps, we suggest looking at the code for [centering](https://github.com/tidymodels/recipes/blob/master/R/center.R) or [PCA](https://github.com/tidymodels/recipes/blob/master/R/pca.R) to start. \n\nFor check operations (e.g. `check_class()`), the process is very similar. Notes on this are available at the end of this article. \n\nThe general process to follow is to:\n\n1. Define a step constructor function.\n\n2. Create the minimal S3 methods for `prep()`, `bake()`, and `print()`. \n\n3. Optionally add some extra methods to work with other tidymodels packages, such as `tunable()` and `tidy()`. \n\nAs an example, we will create a step for converting data into percentiles. \n\n## A new step definition\n\nLet's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package:\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_e9b15d7263a13cf19c62529abfd10bdb'}\n\n```{.r .cell-code}\nlibrary(modeldata)\ndata(biomass)\nstr(biomass)\n#> 'data.frame':\t536 obs. of 8 variables:\n#> $ sample : chr \"Akhrot Shell\" \"Alabama Oak Wood Waste\" \"Alder\" \"Alfalfa\" ...\n#> $ dataset : chr \"Training\" \"Training\" \"Training\" \"Training\" ...\n#> $ carbon : num 49.8 49.5 47.8 45.1 46.8 ...\n#> $ hydrogen: num 5.64 5.7 5.8 4.97 5.4 5.75 5.99 5.7 5.5 5.9 ...\n#> $ oxygen : num 42.9 41.3 46.2 35.6 40.7 ...\n#> $ nitrogen: num 0.41 0.2 0.11 3.3 1 2.04 2.68 1.7 0.8 1.2 ...\n#> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ...\n#> $ HHV : num 20 19.2 18.3 18.2 18.4 ...\n\nbiomass_tr <- biomass[biomass$dataset == \"Training\",]\nbiomass_te <- biomass[biomass$dataset == \"Testing\",]\n```\n:::\n\n\nTo illustrate the transformation with the `carbon` variable, note the training set distribution of this variable with a vertical line below for the first value of the test set. \n\n\n::: {.cell layout-align=\"center\" hash='cache/carbon_dist_1ea4da99e7a470221927e2615897ebcd'}\n\n```{.r .cell-code}\nlibrary(ggplot2)\ntheme_set(theme_bw())\nggplot(biomass_tr, aes(x = carbon)) + \n geom_histogram(binwidth = 5, col = \"blue\", fill = \"blue\", alpha = .5) + \n geom_vline(xintercept = biomass_te$carbon[1], lty = 2)\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=100%}\n:::\n:::\n\n\nBased on the training set, 42.1% of the data are less than a value of 46.35. There are some applications where it might be advantageous to represent the predictor values as percentiles rather than their original values. \n\nOur new step will do this computation for any numeric variables of interest. We will call this new recipe step `step_percentile()`. The code below is designed for illustration and not speed or best practices. We've left out a lot of error trapping that we would want in a real implementation. 
\n\n## Create the function\n\nTo start, there is a _user-facing_ function. Let's call that `step_percentile()`. This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. We'll call this constructor `step_percentile_new()`. \n\nThe function `step_percentile()` takes the same arguments as your function and simply adds it to a new recipe. The `...` signifies the variable selectors that can be used.\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_def_9e13ec18326669f1593d727bab539fa6'}\n\n```{.r .cell-code}\nstep_percentile <- function(\n recipe, \n ..., \n role = NA, \n trained = FALSE, \n ref_dist = NULL,\n options = list(probs = (0:100)/100, names = TRUE),\n skip = FALSE,\n id = rand_id(\"percentile\")\n ) {\n\n ## The variable selectors are not immediately evaluated by using\n ## the `quos()` function in `rlang`. `ellipse_check()` captures \n ## the values and also checks to make sure that they are not empty. \n terms <- ellipse_check(...) \n\n add_step(\n recipe, \n step_percentile_new(\n terms = terms, \n trained = trained,\n role = role, \n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n )\n}\n```\n:::\n\n\nYou should always keep the first four arguments (`recipe` though `trained`) the same as listed above. Some notes:\n\n * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. \n * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. \n * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of \"outcomes\", these data would not be available for new samples. \n * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. \n\nWe can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentile` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`.\n\nWe will use `stats::quantile()` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentile()` that are not one of its arguments will then be passed to `stats::quantile()`. However, we recommend making a separate list object with the options and use these inside the function because `...` is already used to define the variable selection. \n\nIt is also important to consider if there are any _main arguments_ to the step. For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: \n\n* Allow the important arguments to be main arguments to the step function. 
\n\n* Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). Whenever possible, avoid jargon and keep common argument names. \n\nThere are benefits to following these principles (as shown below). \n\n## Initialize a new object\n\nNow, the constructor function can be created.\n\nThe function cascade is: \n\n```\nstep_percentile() calls recipes::add_step()\n└──> recipes::add_step() calls step_percentile_new()\n └──> step_percentile_new() calls recipes::step()\n```\n\n`step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = \"percentile\"` will set the class of new objects to `\"step_percentile\"`. \n\n\n::: {.cell layout-align=\"center\" hash='cache/initialize_565d7c1989cea598cb438eb330637305'}\n\n```{.r .cell-code}\nstep_percentile_new <- \n function(terms, role, trained, ref_dist, options, skip, id) {\n step(\n subclass = \"percentile\", \n terms = terms,\n role = role,\n trained = trained,\n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n }\n```\n:::\n\n\nThis constructor function should have no default argument values. Defaults should be set in the user-facing step object. \n\n## Create the `prep` method\n\nYou will need to create a new `prep()` method for your step's class. To do this, three arguments that the method should have are:\n\n```r\nfunction(x, training, info = NULL)\n```\n\nwhere\n\n * `x` will be the `step_percentile` object,\n * `training` will be a _tibble_ that has the training set data, and\n * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either \"numeric\" or \"nominal\"), `role` (defining the variable's role), and `source` (either \"original\" or \"derived\" depending on where it originated).\n\nYou can define other arguments as well. \n\nThe first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. There is a function called `recipes_eval_select()` that can be used to obtain this. \n\n{{% warning %}} The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. {{%/ warning %}}\n\n\n::: {.cell layout-align=\"center\" hash='cache/prep_1_e9cc5b4aa66319e4eca05dbda8b4cc2c'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info) \n # TODO finish the rest of the function\n}\n```\n:::\n\n\nAfter this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. \n\nOnce we have this, we can save the approximation grid. 
For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: \n\n\n::: {.cell layout-align=\"center\" hash='cache/splice_d7f87c1fe0ee11958f763f1b2ab1ee72'}\n\n```{.r .cell-code}\nget_train_pctl <- function(x, args = NULL) {\n res <- rlang::exec(\"quantile\", x = x, !!!args)\n # Remove duplicate percentile values\n res[!duplicated(res)]\n}\n\n# For example:\nget_train_pctl(biomass_tr$carbon, list(probs = 0:1))\n#> 0% 100% \n#> 14.61 97.18\nget_train_pctl(biomass_tr$carbon)\n#> 0% 25% 50% 75% 100% \n#> 14.610 44.715 47.100 49.725 97.180\n```\n:::\n\n\nNow, the `prep()` method can be created: \n\n\n::: {.cell layout-align=\"center\" hash='cache/prep-2_7c79c53e94b392f35b346414e8b3f731'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n ## You can add error trapping for non-numeric data here and so on. \n \n ## We'll use the names later so make sure they are available\n if (x$options$names == FALSE) {\n rlang::abort(\"`names` should be set to TRUE\")\n }\n \n if (!any(names(x$options) == \"probs\")) {\n x$options$probs <- (0:100)/100\n } else {\n x$options$probs <- sort(unique(x$options$probs))\n }\n \n # Compute percentile grid\n ref_dist <- purrr::map(training[, col_names], get_train_pctl, args = x$options)\n\n ## Use the constructor function to return the updated object. \n ## Note that `trained` is now set to TRUE\n \n step_percentile_new(\n terms = x$terms, \n trained = TRUE,\n role = x$role, \n ref_dist = ref_dist,\n options = x$options,\n skip = x$skip,\n id = x$id\n )\n}\n```\n:::\n\n\nWe suggest favoring `rlang::abort()` and `rlang::warn()` over `stop()` and `warning()`. The former can be used for better traceback results.\n\n\n## Create the `bake` method\n\nRemember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentile()` class. The minimum arguments for this are\n\n```r\nfunction(object, new_data, ...)\n```\n\nwhere `object` is the updated step function that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. \n\nHere is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-helpers_26b81d4ec4ac1dca48a0bd67178379d4'}\n\n```{.r .cell-code}\npctl_by_approx <- function(x, ref) {\n # In case duplicates were removed, get the percentiles from\n # the names of the reference object\n grid <- as.numeric(gsub(\"%$\", \"\", names(ref))) \n approx(x = ref, y = grid, xout = x)$y/100\n}\n```\n:::\n\n\nThese computations are done column-wise using `purrr::map2_dfc()` to modify the new data in-place:\n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-method_c5dab3d658e8739f1bff0630466624e5'}\n\n```{.r .cell-code}\nbake.step_percentile <- function(object, new_data, ...) 
{\n ## For illustration (and not speed), we will loop through the affected variables\n ## and do the computations\n vars <- names(object$ref_dist)\n \n new_data[, vars] <-\n purrr::map2_dfc(new_data[, vars], object$ref_dist, pctl_by_approx)\n \n ## Always convert to tibbles on the way out\n tibble::as_tibble(new_data)\n}\n```\n:::\n\n\n{{% note %}} You need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. {{%/ note %}}\n\n## Run the example\n\nLet's use the example data to make sure that it works: \n\n\n::: {.cell layout-align=\"center\" hash='cache/example_772256fc0ca7141fc3d35b1fd93e88a9'}\n\n```{.r .cell-code}\nrec_obj <- \n recipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\")) %>%\n prep(training = biomass_tr)\n\nbiomass_te %>% select(ends_with(\"gen\")) %>% slice(1:2)\n#> hydrogen oxygen nitrogen\n#> 1 5.67 47.20 0.30\n#> 2 5.50 48.06 2.85\nbake(rec_obj, biomass_te %>% slice(1:2), ends_with(\"gen\"))\n#> # A tibble: 2 × 3\n#> hydrogen oxygen nitrogen\n#> \n#> 1 0.45 0.903 0.21 \n#> 2 0.38 0.922 0.928\n\n# Checking to get approximate result: \nmean(biomass_tr$hydrogen <= biomass_te$hydrogen[1])\n#> [1] 0.4517544\nmean(biomass_tr$oxygen <= biomass_te$oxygen[1])\n#> [1] 0.9013158\n```\n:::\n\n\nThe plot below shows how the original hydrogen percentiles line up with the estimated values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/cdf_plot_159799b7fd72828fadd4d606fa204f3e'}\n\n```{.r .cell-code}\nhydrogen_values <- \n bake(rec_obj, biomass_te, hydrogen) %>% \n bind_cols(biomass_te %>% select(original = hydrogen))\n\nggplot(biomass_tr, aes(x = hydrogen)) + \n # Plot the empirical distribution function of the \n # hydrogen training set values as a black line\n stat_ecdf() + \n # Overlay the estimated percentiles for the new data: \n geom_point(data = hydrogen_values, \n aes(x = original, y = hydrogen), \n col = \"red\", alpha = .5, cex = 2) + \n labs(x = \"New Hydrogen Values\", y = \"Percentile Based on Training Set\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nThese line up very nicely! \n\n## Custom check operations \n\nThe process here is exactly the same as steps; the internal functions have a similar naming convention: \n\n * `add_check()` instead of `add_step()`\n * `check()` instead of `step()`, and so on. \n \nIt is strongly recommended that:\n \n 1. The operations start with `check_` (i.e. `check_range()` and `check_range_new()`)\n 1. The check uses `rlang::abort(paste0(...))` when the conditions are not met\n 1. The original data are returned (unaltered) by the check when the conditions are satisfied. \n\n## Other step methods\n\nThere are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. \n\n### A print method\n\nIf you don't add a print method for `step_percentile`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `printer()` that should be useful in most cases. We are using it here for the custom print method for `step_percentile`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. 
For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: \n\n\n::: {.cell layout-align=\"center\" hash='cache/print-method_3703e6bb929a5780983729502421e4b4'}\n\n```{.r .cell-code}\nprint.step_percentile <-\n function(x, width = max(20, options()$width - 35), ...) {\n cat(\"Percentile transformation on \", sep = \"\")\n printer(\n # Names before prep (could be selectors)\n untr_obj = x$terms,\n # Names after prep:\n tr_obj = names(x$ref_dist),\n # Has it been prepped? \n trained = x$trained,\n # An estimate of how many characters to print on a line: \n width = width\n )\n invisible(x)\n }\n\n# Results before `prep()`:\nrecipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\"))\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Operations:\n#> \n#> Percentile transformation on ends_with(\"gen\")\n\n# Results after `prep()`: \nrec_obj\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Training data contained 456 data points and no missing data.\n#> \n#> Operations:\n#> \n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\n```\n:::\n\n \n### Methods for declaring required packages\n\nSome recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: \n\n```\n> recipes::recipes_pkg_check(\"some_package\")\n1 package is needed for this step and is not installed. (some_package). Start \na clean R session then run: install.packages(\"some_package\")\n```\n\nThere is an S3 method that can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: \n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-15_ef766f12d6be3f28d8352aa2616aa143'}\n\n```{.r .cell-code}\nrequired_pkgs.step_hypothetical <- function(x, ...) {\n c(\"hypothetical\", \"myrecipespkg\")\n}\n```\n:::\n\n\nIn this example, `myrecipespkg` is the package where the step resides (if it is in a package).\n\nThe reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters have no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. \n\nIf this S3 method is used for your step, you can rely on this for checking the installation: \n \n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-16_bb0b7690de51c7e735663ee46cceddfd'}\n\n```{.r .cell-code}\nrecipes::recipes_pkg_check(required_pkgs.step_hypothetical())\n```\n:::\n\n\nIf you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package.\n\n### A tidy method\n\nThe `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. \n\nWhen the recipe has been prepped, those data are in the list `ref_dist`. 
A small function can be used to reformat that data into a tibble. It is customary to return the main values as `value`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy-calcs_4ea442a4112fcdb07bcd881ce7e823bf'}\n\n```{.r .cell-code}\nformat_pctl <- function(x) {\n tibble::tibble(\n value = unname(x),\n percentile = as.numeric(gsub(\"%$\", \"\", names(x))) \n )\n}\n\n# For example: \npctl_step_object <- rec_obj$steps[[1]]\npctl_step_object\n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\nformat_pctl(pctl_step_object$ref_dist[[\"hydrogen\"]])\n#> # A tibble: 87 × 2\n#> value percentile\n#> \n#> 1 0.03 0\n#> 2 0.934 1\n#> 3 1.60 2\n#> 4 2.07 3\n#> 5 2.45 4\n#> 6 2.74 5\n#> 7 3.15 6\n#> 8 3.49 7\n#> 9 3.71 8\n#> 10 3.99 9\n#> # … with 77 more rows\n```\n:::\n\n\nThe tidy method could return these values for each selected column. Before `prep()`, missing values can be used as placeholders. \n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy_034d8d1109dc6a42f7f14bec91cfc669'}\n\n```{.r .cell-code}\ntidy.step_percentile <- function(x, ...) {\n if (is_trained(x)) {\n res <- map_dfr(x$ref_dist, format_pctl, .id = \"term\")\n }\n else {\n term_names <- sel2char(x$terms)\n res <-\n tibble(\n terms = term_names,\n value = rlang::na_dbl,\n percentile = rlang::na_dbl\n )\n }\n # Always return the step id: \n res$id <- x$id\n res\n}\n\ntidy(rec_obj, number = 1)\n#> # A tibble: 274 × 4\n#> term value percentile id \n#> \n#> 1 hydrogen 0.03 0 percentile_HcwF5\n#> 2 hydrogen 0.934 1 percentile_HcwF5\n#> 3 hydrogen 1.60 2 percentile_HcwF5\n#> 4 hydrogen 2.07 3 percentile_HcwF5\n#> 5 hydrogen 2.45 4 percentile_HcwF5\n#> 6 hydrogen 2.74 5 percentile_HcwF5\n#> 7 hydrogen 3.15 6 percentile_HcwF5\n#> 8 hydrogen 3.49 7 percentile_HcwF5\n#> 9 hydrogen 3.71 8 percentile_HcwF5\n#> 10 hydrogen 3.99 9 percentile_HcwF5\n#> # … with 264 more rows\n```\n:::\n\n\n### Methods for tuning parameters\n\nThe tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the arguments: \n\n\n::: {.cell layout-align=\"center\" hash='cache/poly-args_aeebdb3dac81ee70eba24e6ae9ec6448'}\n\n```{.r .cell-code}\nargs(step_poly)\n#> function (recipe, ..., role = \"predictor\", trained = FALSE, objects = NULL, \n#> degree = 2, options = list(), skip = FALSE, id = rand_id(\"poly\")) \n#> NULL\n```\n:::\n\n\nThe argument `degree` is tunable.\n\nTo work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. \n\n`tunable()` takes the step object as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe etc.). Here, it is just `\"recipe\"`. \n\n* `component`: A character string with more information about the source. For recipes, this is just the name of the step (e.g. `\"step_poly\"`). \n\n* `component_id`: A character string to indicate where a unique identifier is for the object. For recipes, this is just the `id` value of the step object. 
\n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n::: {.cell layout-align=\"center\" hash='cache/mtry_3c03e505855845f8e5bb7a077fd5b825'}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood: \nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\nFor `step_poly()`, a dials object is needed that returns an integer that is the number of new columns to create. It turns out that there are a few different types of tuning parameters related to degree: \n\n```r\n> lsf.str(\"package:dials\", pattern = \"degree\")\ndegree : function (range = c(1, 3), trans = NULL) \ndegree_int : function (range = c(1L, 3L), trans = NULL) \nprod_degree : function (range = c(1L, 2L), trans = NULL) \nspline_degree : function (range = c(3L, 10L), trans = NULL) \n```\n\nLooking at the `range` values, some return doubles and others return integers. For our problem, `degree_int()` would be a good choice. \n\nFor `step_poly()` the `tunable()` S3 method could be: \n\n\n::: {.cell layout-align=\"center\" hash='cache/tunable_16f4fa39f3ddb8145e6a67412664dadb'}\n\n```{.r .cell-code}\ntunable.step_poly <- function (x, ...) {\n tibble::tibble(\n name = c(\"degree\"),\n call_info = list(list(pkg = \"dials\", fun = \"degree_int\")),\n source = \"recipe\",\n component = \"step_poly\",\n component_id = x$id\n )\n}\n```\n:::\n\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Monterey 12.6.1\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-05\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.2.0)\n#> modeldata * 1.0.1 2022-09-06 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3 2022-11-09 [1] CRAN (R 4.2.0)\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-09 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-12-27 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
+ "markdown": "---\ntitle: \"Create your own recipe step function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 1\ndescription: | \n Write a new recipe step for data preprocessing.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: modeldata and tidymodels.\n\nThere are many existing recipe steps in packages like recipes, themis, textrecipes, and others. A full list of steps in CRAN packages [can be found here](/find/recipes/). However, you might need to define your own preprocessing operations; this article describes how to do that. If you are looking for good examples of existing steps, we suggest looking at the code for [centering](https://github.com/tidymodels/recipes/blob/master/R/center.R) or [PCA](https://github.com/tidymodels/recipes/blob/master/R/pca.R) to start. \n\nFor check operations (e.g. `check_class()`), the process is very similar. Notes on this are available at the end of this article. \n\nThe general process to follow is to:\n\n1. Define a step constructor function.\n\n2. Create the minimal S3 methods for `prep()`, `bake()`, and `print()`. \n\n3. Optionally add some extra methods to work with other tidymodels packages, such as `tunable()` and `tidy()`. \n\nAs an example, we will create a step for converting data into percentiles. \n\n## A new step definition\n\nLet's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package:\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_e9b15d7263a13cf19c62529abfd10bdb'}\n\n```{.r .cell-code}\nlibrary(modeldata)\ndata(biomass)\nstr(biomass)\n#> 'data.frame':\t536 obs. of 8 variables:\n#> $ sample : chr \"Akhrot Shell\" \"Alabama Oak Wood Waste\" \"Alder\" \"Alfalfa\" ...\n#> $ dataset : chr \"Training\" \"Training\" \"Training\" \"Training\" ...\n#> $ carbon : num 49.8 49.5 47.8 45.1 46.8 ...\n#> $ hydrogen: num 5.64 5.7 5.8 4.97 5.4 5.75 5.99 5.7 5.5 5.9 ...\n#> $ oxygen : num 42.9 41.3 46.2 35.6 40.7 ...\n#> $ nitrogen: num 0.41 0.2 0.11 3.3 1 2.04 2.68 1.7 0.8 1.2 ...\n#> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ...\n#> $ HHV : num 20 19.2 18.3 18.2 18.4 ...\n\nbiomass_tr <- biomass[biomass$dataset == \"Training\",]\nbiomass_te <- biomass[biomass$dataset == \"Testing\",]\n```\n:::\n\n\nTo illustrate the transformation with the `carbon` variable, note the training set distribution of this variable with a vertical line below for the first value of the test set. \n\n\n::: {.cell layout-align=\"center\" hash='cache/carbon_dist_1ea4da99e7a470221927e2615897ebcd'}\n\n```{.r .cell-code}\nlibrary(ggplot2)\ntheme_set(theme_bw())\nggplot(biomass_tr, aes(x = carbon)) + \n geom_histogram(binwidth = 5, col = \"blue\", fill = \"blue\", alpha = .5) + \n geom_vline(xintercept = biomass_te$carbon[1], lty = 2)\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=100%}\n:::\n:::\n\n\nBased on the training set, 42.1% of the data are less than a value of 46.35. There are some applications where it might be advantageous to represent the predictor values as percentiles rather than their original values. \n\nOur new step will do this computation for any numeric variables of interest. We will call this new recipe step `step_percentile()`. The code below is designed for illustration and not speed or best practices. We've left out a lot of error trapping that we would want in a real implementation. 
\n\n## Create the function\n\nTo start, there is a _user-facing_ function. Let's call that `step_percentile()`. This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. We'll call this constructor `step_percentile_new()`. \n\nThe function `step_percentile()` takes the same arguments as your function and simply adds it to a new recipe. The `...` signifies the variable selectors that can be used.\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_def_9e13ec18326669f1593d727bab539fa6'}\n\n```{.r .cell-code}\nstep_percentile <- function(\n recipe, \n ..., \n role = NA, \n trained = FALSE, \n ref_dist = NULL,\n options = list(probs = (0:100)/100, names = TRUE),\n skip = FALSE,\n id = rand_id(\"percentile\")\n ) {\n\n ## The variable selectors are not immediately evaluated by using\n ## the `quos()` function in `rlang`. `ellipse_check()` captures \n ## the values and also checks to make sure that they are not empty. \n terms <- ellipse_check(...) \n\n add_step(\n recipe, \n step_percentile_new(\n terms = terms, \n trained = trained,\n role = role, \n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n )\n}\n```\n:::\n\n\nYou should always keep the first four arguments (`recipe` though `trained`) the same as listed above. Some notes:\n\n * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. \n * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. \n * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of \"outcomes\", these data would not be available for new samples. \n * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. \n\nWe can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentile` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`.\n\nWe will use `stats::quantile()` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentile()` that are not one of its arguments will then be passed to `stats::quantile()`. However, we recommend making a separate list object with the options and use these inside the function because `...` is already used to define the variable selection. \n\nIt is also important to consider if there are any _main arguments_ to the step. For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: \n\n* Allow the important arguments to be main arguments to the step function. 
\n\n* Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). Whenever possible, avoid jargon and keep common argument names. \n\nThere are benefits to following these principles (as shown below). \n\n## Initialize a new object\n\nNow, the constructor function can be created.\n\nThe function cascade is: \n\n```\nstep_percentile() calls recipes::add_step()\n└──> recipes::add_step() calls step_percentile_new()\n └──> step_percentile_new() calls recipes::step()\n```\n\n`step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = \"percentile\"` will set the class of new objects to `\"step_percentile\"`. \n\n\n::: {.cell layout-align=\"center\" hash='cache/initialize_565d7c1989cea598cb438eb330637305'}\n\n```{.r .cell-code}\nstep_percentile_new <- \n function(terms, role, trained, ref_dist, options, skip, id) {\n step(\n subclass = \"percentile\", \n terms = terms,\n role = role,\n trained = trained,\n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n }\n```\n:::\n\n\nThis constructor function should have no default argument values. Defaults should be set in the user-facing step object. \n\n## Create the `prep` method\n\nYou will need to create a new `prep()` method for your step's class. To do this, three arguments that the method should have are:\n\n```r\nfunction(x, training, info = NULL)\n```\n\nwhere\n\n * `x` will be the `step_percentile` object,\n * `training` will be a _tibble_ that has the training set data, and\n * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either \"numeric\" or \"nominal\"), `role` (defining the variable's role), and `source` (either \"original\" or \"derived\" depending on where it originated).\n\nYou can define other arguments as well. \n\nThe first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. There is a function called `recipes_eval_select()` that can be used to obtain this. \n\n::: {.callout-warning}\n The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. \n:::\n\n\n::: {.cell layout-align=\"center\" hash='cache/prep_1_e9cc5b4aa66319e4eca05dbda8b4cc2c'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info) \n # TODO finish the rest of the function\n}\n```\n:::\n\n\nAfter this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. \n\nOnce we have this, we can save the approximation grid. 
For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: \n\n\n::: {.cell layout-align=\"center\" hash='cache/splice_d7f87c1fe0ee11958f763f1b2ab1ee72'}\n\n```{.r .cell-code}\nget_train_pctl <- function(x, args = NULL) {\n res <- rlang::exec(\"quantile\", x = x, !!!args)\n # Remove duplicate percentile values\n res[!duplicated(res)]\n}\n\n# For example:\nget_train_pctl(biomass_tr$carbon, list(probs = 0:1))\n#> 0% 100% \n#> 14.61 97.18\nget_train_pctl(biomass_tr$carbon)\n#> 0% 25% 50% 75% 100% \n#> 14.610 44.715 47.100 49.725 97.180\n```\n:::\n\n\nNow, the `prep()` method can be created: \n\n\n::: {.cell layout-align=\"center\" hash='cache/prep-2_7c79c53e94b392f35b346414e8b3f731'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n ## You can add error trapping for non-numeric data here and so on. \n \n ## We'll use the names later so make sure they are available\n if (x$options$names == FALSE) {\n rlang::abort(\"`names` should be set to TRUE\")\n }\n \n if (!any(names(x$options) == \"probs\")) {\n x$options$probs <- (0:100)/100\n } else {\n x$options$probs <- sort(unique(x$options$probs))\n }\n \n # Compute percentile grid\n ref_dist <- purrr::map(training[, col_names], get_train_pctl, args = x$options)\n\n ## Use the constructor function to return the updated object. \n ## Note that `trained` is now set to TRUE\n \n step_percentile_new(\n terms = x$terms, \n trained = TRUE,\n role = x$role, \n ref_dist = ref_dist,\n options = x$options,\n skip = x$skip,\n id = x$id\n )\n}\n```\n:::\n\n\nWe suggest favoring `rlang::abort()` and `rlang::warn()` over `stop()` and `warning()`. The former can be used for better traceback results.\n\n\n## Create the `bake` method\n\nRemember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentile()` class. The minimum arguments for this are\n\n```r\nfunction(object, new_data, ...)\n```\n\nwhere `object` is the updated step function that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. \n\nHere is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-helpers_26b81d4ec4ac1dca48a0bd67178379d4'}\n\n```{.r .cell-code}\npctl_by_approx <- function(x, ref) {\n # In case duplicates were removed, get the percentiles from\n # the names of the reference object\n grid <- as.numeric(gsub(\"%$\", \"\", names(ref))) \n approx(x = ref, y = grid, xout = x)$y/100\n}\n```\n:::\n\n\nThese computations are done column-wise using `purrr::map2_dfc()` to modify the new data in-place:\n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-method_c5dab3d658e8739f1bff0630466624e5'}\n\n```{.r .cell-code}\nbake.step_percentile <- function(object, new_data, ...) 
{\n ## For illustration (and not speed), we will loop through the affected variables\n ## and do the computations\n vars <- names(object$ref_dist)\n \n new_data[, vars] <-\n purrr::map2_dfc(new_data[, vars], object$ref_dist, pctl_by_approx)\n \n ## Always convert to tibbles on the way out\n tibble::as_tibble(new_data)\n}\n```\n:::\n\n\n::: {.callout-note}\nYou need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. \n:::\n\n## Run the example\n\nLet's use the example data to make sure that it works: \n\n\n::: {.cell layout-align=\"center\" hash='cache/example_772256fc0ca7141fc3d35b1fd93e88a9'}\n\n```{.r .cell-code}\nrec_obj <- \n recipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\")) %>%\n prep(training = biomass_tr)\n\nbiomass_te %>% select(ends_with(\"gen\")) %>% slice(1:2)\n#> hydrogen oxygen nitrogen\n#> 1 5.67 47.20 0.30\n#> 2 5.50 48.06 2.85\nbake(rec_obj, biomass_te %>% slice(1:2), ends_with(\"gen\"))\n#> # A tibble: 2 × 3\n#> hydrogen oxygen nitrogen\n#> \n#> 1 0.45 0.903 0.21 \n#> 2 0.38 0.922 0.928\n\n# Checking to get approximate result: \nmean(biomass_tr$hydrogen <= biomass_te$hydrogen[1])\n#> [1] 0.4517544\nmean(biomass_tr$oxygen <= biomass_te$oxygen[1])\n#> [1] 0.9013158\n```\n:::\n\n\nThe plot below shows how the original hydrogen percentiles line up with the estimated values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/cdf_plot_159799b7fd72828fadd4d606fa204f3e'}\n\n```{.r .cell-code}\nhydrogen_values <- \n bake(rec_obj, biomass_te, hydrogen) %>% \n bind_cols(biomass_te %>% select(original = hydrogen))\n\nggplot(biomass_tr, aes(x = hydrogen)) + \n # Plot the empirical distribution function of the \n # hydrogen training set values as a black line\n stat_ecdf() + \n # Overlay the estimated percentiles for the new data: \n geom_point(data = hydrogen_values, \n aes(x = original, y = hydrogen), \n col = \"red\", alpha = .5, cex = 2) + \n labs(x = \"New Hydrogen Values\", y = \"Percentile Based on Training Set\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nThese line up very nicely! \n\n## Custom check operations \n\nThe process here is exactly the same as steps; the internal functions have a similar naming convention: \n\n * `add_check()` instead of `add_step()`\n * `check()` instead of `step()`, and so on. \n \nIt is strongly recommended that:\n \n 1. The operations start with `check_` (i.e. `check_range()` and `check_range_new()`)\n 1. The check uses `rlang::abort(paste0(...))` when the conditions are not met\n 1. The original data are returned (unaltered) by the check when the conditions are satisfied. \n\n## Other step methods\n\nThere are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. \n\n### A print method\n\nIf you don't add a print method for `step_percentile`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `printer()` that should be useful in most cases. We are using it here for the custom print method for `step_percentile`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. 
For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: \n\n\n::: {.cell layout-align=\"center\" hash='cache/print-method_3703e6bb929a5780983729502421e4b4'}\n\n```{.r .cell-code}\nprint.step_percentile <-\n function(x, width = max(20, options()$width - 35), ...) {\n cat(\"Percentile transformation on \", sep = \"\")\n printer(\n # Names before prep (could be selectors)\n untr_obj = x$terms,\n # Names after prep:\n tr_obj = names(x$ref_dist),\n # Has it been prepped? \n trained = x$trained,\n # An estimate of how many characters to print on a line: \n width = width\n )\n invisible(x)\n }\n\n# Results before `prep()`:\nrecipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\"))\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Operations:\n#> \n#> Percentile transformation on ends_with(\"gen\")\n\n# Results after `prep()`: \nrec_obj\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Training data contained 456 data points and no missing data.\n#> \n#> Operations:\n#> \n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\n```\n:::\n\n \n### Methods for declaring required packages\n\nSome recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: \n\n```\n> recipes::recipes_pkg_check(\"some_package\")\n1 package is needed for this step and is not installed. (some_package). Start \na clean R session then run: install.packages(\"some_package\")\n```\n\nThere is an S3 method that can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: \n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-15_ef766f12d6be3f28d8352aa2616aa143'}\n\n```{.r .cell-code}\nrequired_pkgs.step_hypothetical <- function(x, ...) {\n c(\"hypothetical\", \"myrecipespkg\")\n}\n```\n:::\n\n\nIn this example, `myrecipespkg` is the package where the step resides (if it is in a package).\n\nThe reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters have no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. \n\nIf this S3 method is used for your step, you can rely on this for checking the installation: \n \n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-16_bb0b7690de51c7e735663ee46cceddfd'}\n\n```{.r .cell-code}\nrecipes::recipes_pkg_check(required_pkgs.step_hypothetical())\n```\n:::\n\n\nIf you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package.\n\n### A tidy method\n\nThe `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. \n\nWhen the recipe has been prepped, those data are in the list `ref_dist`. 
A small function can be used to reformat that data into a tibble. It is customary to return the main values as `value`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy-calcs_4ea442a4112fcdb07bcd881ce7e823bf'}\n\n```{.r .cell-code}\nformat_pctl <- function(x) {\n tibble::tibble(\n value = unname(x),\n percentile = as.numeric(gsub(\"%$\", \"\", names(x))) \n )\n}\n\n# For example: \npctl_step_object <- rec_obj$steps[[1]]\npctl_step_object\n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\nformat_pctl(pctl_step_object$ref_dist[[\"hydrogen\"]])\n#> # A tibble: 87 × 2\n#> value percentile\n#> \n#> 1 0.03 0\n#> 2 0.934 1\n#> 3 1.60 2\n#> 4 2.07 3\n#> 5 2.45 4\n#> 6 2.74 5\n#> 7 3.15 6\n#> 8 3.49 7\n#> 9 3.71 8\n#> 10 3.99 9\n#> # … with 77 more rows\n```\n:::\n\n\nThe tidy method could return these values for each selected column. Before `prep()`, missing values can be used as placeholders. \n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy_034d8d1109dc6a42f7f14bec91cfc669'}\n\n```{.r .cell-code}\ntidy.step_percentile <- function(x, ...) {\n if (is_trained(x)) {\n res <- map_dfr(x$ref_dist, format_pctl, .id = \"term\")\n }\n else {\n term_names <- sel2char(x$terms)\n res <-\n tibble(\n terms = term_names,\n value = rlang::na_dbl,\n percentile = rlang::na_dbl\n )\n }\n # Always return the step id: \n res$id <- x$id\n res\n}\n\ntidy(rec_obj, number = 1)\n#> # A tibble: 274 × 4\n#> term value percentile id \n#> \n#> 1 hydrogen 0.03 0 percentile_Sim0q\n#> 2 hydrogen 0.934 1 percentile_Sim0q\n#> 3 hydrogen 1.60 2 percentile_Sim0q\n#> 4 hydrogen 2.07 3 percentile_Sim0q\n#> 5 hydrogen 2.45 4 percentile_Sim0q\n#> 6 hydrogen 2.74 5 percentile_Sim0q\n#> 7 hydrogen 3.15 6 percentile_Sim0q\n#> 8 hydrogen 3.49 7 percentile_Sim0q\n#> 9 hydrogen 3.71 8 percentile_Sim0q\n#> 10 hydrogen 3.99 9 percentile_Sim0q\n#> # … with 264 more rows\n```\n:::\n\n\n### Methods for tuning parameters\n\nThe tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the arguments: \n\n\n::: {.cell layout-align=\"center\" hash='cache/poly-args_aeebdb3dac81ee70eba24e6ae9ec6448'}\n\n```{.r .cell-code}\nargs(step_poly)\n#> function (recipe, ..., role = \"predictor\", trained = FALSE, objects = NULL, \n#> degree = 2, options = list(), skip = FALSE, id = rand_id(\"poly\")) \n#> NULL\n```\n:::\n\n\nThe argument `degree` is tunable.\n\nTo work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. \n\n`tunable()` takes the step object as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe etc.). Here, it is just `\"recipe\"`. \n\n* `component`: A character string with more information about the source. For recipes, this is just the name of the step (e.g. `\"step_poly\"`). \n\n* `component_id`: A character string to indicate where a unique identifier is for the object. For recipes, this is just the `id` value of the step object. 
\n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n::: {.cell layout-align=\"center\" hash='cache/mtry_3c03e505855845f8e5bb7a077fd5b825'}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood: \nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\nFor `step_poly()`, a dials object is needed that returns an integer that is the number of new columns to create. It turns out that there are a few different types of tuning parameters related to degree: \n\n```r\n> lsf.str(\"package:dials\", pattern = \"degree\")\ndegree : function (range = c(1, 3), trans = NULL) \ndegree_int : function (range = c(1L, 3L), trans = NULL) \nprod_degree : function (range = c(1L, 2L), trans = NULL) \nspline_degree : function (range = c(3L, 10L), trans = NULL) \n```\n\nLooking at the `range` values, some return doubles and others return integers. For our problem, `degree_int()` would be a good choice. \n\nFor `step_poly()` the `tunable()` S3 method could be: \n\n\n::: {.cell layout-align=\"center\" hash='cache/tunable_16f4fa39f3ddb8145e6a67412664dadb'}\n\n```{.r .cell-code}\ntunable.step_poly <- function (x, ...) {\n tibble::tibble(\n name = c(\"degree\"),\n call_info = list(list(pkg = \"dials\", fun = \"degree_int\")),\n source = \"recipe\",\n component = \"step_poly\",\n component_id = x$id\n )\n}\n```\n:::\n\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Big Sur/Monterey 10.16\n#> system x86_64, darwin17.0\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-04\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.3 2022-08-22 [1] CRAN (R 4.2.0)\n#> modeldata * 1.0.1.9000 2023-01-04 [1] local\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3.9000 2022-12-12 [1] local\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1.9000 2022-12-13 [1] local\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-05 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-11-30 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
diff --git a/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json b/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json
index b1a7b7e8..3f609d25 100644
--- a/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json
+++ b/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "fa8587a14b33bc8502378013a546ffc7",
"result": {
- "markdown": "---\ntitle: \"Regression models two ways\"\ncategories:\n - model fitting\n - random forests\n - linear regression\ntype: learn-subsection\nweight: 1\ndescription: | \n Create and train different kinds of regression models with different computational engines.\n---\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: glmnet, randomForest, ranger, and tidymodels.\n\nWe can create regression models with the tidymodels package [parsnip](https://parsnip.tidymodels.org/) to predict continuous or numeric quantities. Here, let's first fit a random forest model, which does _not_ require all numeric input (see discussion [here](https://bookdown.org/max/FES/categorical-trees.html)) and discuss how to use `fit()` and `fit_xy()`, as well as _data descriptors_. \n\nSecond, let's fit a regularized linear regression model to demonstrate how to move between different types of models using parsnip. \n\n## The Ames housing data\n\nWe'll use the Ames housing data set to demonstrate how to create regression models using parsnip. First, set up the data set and create a simple training/test set split:\n\n\n::: {.cell layout-align=\"center\" hash='cache/ames-split_18b5bf0134171b332b56ced5fc3b1911'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\ndata(ames)\n\nset.seed(4595)\ndata_split <- initial_split(ames, strata = \"Sale_Price\", prop = 0.75)\n\names_train <- training(data_split)\names_test <- testing(data_split)\n```\n:::\n\n\nThe use of the test set here is _only for illustration_; normally in a data analysis these data would be saved to the very end after many models have been evaluated. \n\n## Random forest\n\nWe'll start by fitting a random forest model to a small set of parameters. Let's create a model with the predictors `Longitude`, `Latitude`, `Lot_Area`, `Neighborhood`, and `Year_Sold`. A simple random forest model can be specified via:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic_02332292464935db3e80ae3984b88bfe'}\n\n```{.r .cell-code}\nrf_defaults <- rand_forest(mode = \"regression\")\nrf_defaults\n#> Random Forest Model Specification (regression)\n#> \n#> Computational engine: ranger\n```\n:::\n\n\nThe model will be fit with the ranger package by default. Since we didn't add any extra arguments to `fit`, _many_ of the arguments will be set to their defaults from the function `ranger::ranger()`. The help pages for the model function describe the default parameters and you can also use the `translate()` function to check out such details. 
\n\nThe parsnip package provides two different interfaces to fit a model: \n\n- the formula interface (`fit()`), and\n- the non-formula interface (`fit_xy()`).\n\nLet's start with the non-formula interface:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy_b14276d12ae7cf9ce85ecf44ee6a9bb4'}\n\n```{.r .cell-code}\npreds <- c(\"Longitude\", \"Latitude\", \"Lot_Area\", \"Neighborhood\", \"Year_Sold\")\n\nrf_xy_fit <- \n rf_defaults %>%\n set_engine(\"ranger\") %>%\n fit_xy(\n x = ames_train[, preds],\n y = log10(ames_train$Sale_Price)\n )\n\nrf_xy_fit\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 500 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 2 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008500188 \n#> R squared (OOB): 0.7239116\n```\n:::\n\n\nThe non-formula interface doesn't do anything to the predictors before passing them to the underlying model function. This particular model does _not_ require indicator variables (sometimes called \"dummy variables\") to be created prior to fitting the model. Note that the output shows \"Number of independent variables: 5\".\n\nFor regression models, we can use the basic `predict()` method, which returns a tibble with a column named `.pred`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy-pred_6e4354a6451e46bc8faa718908d9a912'}\n\n```{.r .cell-code}\ntest_results <- \n ames_test %>%\n select(Sale_Price) %>%\n mutate(Sale_Price = log10(Sale_Price)) %>%\n bind_cols(\n predict(rf_xy_fit, new_data = ames_test[, preds])\n )\ntest_results %>% slice(1:5)\n#> # A tibble: 5 × 2\n#> Sale_Price .pred\n#> \n#> 1 5.39 5.25\n#> 2 5.28 5.29\n#> 3 5.23 5.26\n#> 4 5.21 5.30\n#> 5 5.60 5.51\n\n# summarize performance\ntest_results %>% metrics(truth = Sale_Price, estimate = .pred) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.0945\n#> 2 rsq standard 0.733 \n#> 3 mae standard 0.0629\n```\n:::\n\n\nNote that: \n\n * If the model required indicator variables, we would have to create them manually prior to using `fit()` (perhaps using the recipes package).\n * We had to manually log the outcome prior to modeling. \n\nNow, for illustration, let's use the formula method using some new parameter values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-form_1e1aadae8c7bde9360f11e8062781218'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 3 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008402569 \n#> R squared (OOB): 0.7270823\n```\n:::\n\n \nSuppose that we would like to use the randomForest package instead of ranger. 
To do so, the only part of the syntax that needs to change is the `set_engine()` argument:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-rf_7f5cc80f129451dce218438d3e2b5856'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"randomForest\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> \n#> Call:\n#> randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~3, x)) \n#> Type of random forest: regression\n#> Number of trees: 1000\n#> No. of variables tried at each split: 3\n#> \n#> Mean of squared residuals: 0.008472074\n#> % Var explained: 72.47\n```\n:::\n\n\nLook at the formula code that was printed out; one function uses the argument name `ntree` and the other uses `num.trees`. The parsnip models don't require you to know the specific names of the main arguments. \n\nNow suppose that we want to modify the value of `mtry` based on the number of predictors in the data. Usually, a good default value is `floor(sqrt(num_predictors))` but a pure bagging model requires an `mtry` value equal to the total number of parameters. There may be cases where you may not know how many predictors are going to be present when the model will be fit (perhaps due to the generation of indicator variables or a variable filter) so this might be difficult to know exactly ahead of time when you write your code. \n\nWhen the model it being fit by parsnip, [_data descriptors_](https://parsnip.tidymodels.org/reference/descriptors.html) are made available. These attempt to let you know what you will have available when the model is fit. When a model object is created (say using `rand_forest()`), the values of the arguments that you give it are _immediately evaluated_ unless you delay them. To delay the evaluation of any argument, you can used `rlang::expr()` to make an expression. \n\nTwo relevant data descriptors for our example model are:\n\n * `.preds()`: the number of predictor _variables_ in the data set that are associated with the predictors **prior to dummy variable creation**.\n * `.cols()`: the number of predictor _columns_ after dummy variables (or other encodings) are created.\n\nSince ranger won't create indicator values, `.preds()` would be appropriate for `mtry` for a bagging model. \n\nFor example, let's use an expression with the `.preds()` descriptor to fit a bagging model: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bagged_2b76f70b641acbdb2616b84443585217'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = .preds(), trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~.preds(), x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 5 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.00867085 \n#> R squared (OOB): 0.7183685\n```\n:::\n\n\n\n## Regularized regression\n\nA linear model might work for this data set as well. We can use the `linear_reg()` parsnip model. 
There are two engines that can perform regularization/penalization, the glmnet and sparklyr packages. Let's use the former here. The glmnet package only implements a non-formula method, but parsnip will allow either one to be used. \n\nWhen regularization is used, the predictors should first be centered and scaled before being passed to the model. The formula method won't do that automatically so we will need to do this ourselves. We'll use the [recipes](https://recipes.tidymodels.org/) package for these steps. \n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-form_a0ca81e5cfdf6601081373c7b271e499'}\n\n```{.r .cell-code}\nnorm_recipe <- \n recipe(\n Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold, \n data = ames_train\n ) %>%\n step_other(Neighborhood) %>% \n step_dummy(all_nominal()) %>%\n step_center(all_predictors()) %>%\n step_scale(all_predictors()) %>%\n step_log(Sale_Price, base = 10) %>% \n # estimate the means and standard deviations\n prep(training = ames_train, retain = TRUE)\n\n# Now let's fit the model using the processed version of the data\n\nglmn_fit <- \n linear_reg(penalty = 0.001, mixture = 0.5) %>% \n set_engine(\"glmnet\") %>%\n fit(Sale_Price ~ ., data = bake(norm_recipe, new_data = NULL))\nglmn_fit\n#> parsnip model object\n#> \n#> \n#> Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = \"gaussian\", alpha = ~0.5) \n#> \n#> Df %Dev Lambda\n#> 1 0 0.00 0.138300\n#> 2 1 1.96 0.126000\n#> 3 1 3.72 0.114800\n#> 4 1 5.28 0.104600\n#> 5 2 7.07 0.095320\n#> 6 3 9.64 0.086850\n#> 7 4 12.58 0.079140\n#> 8 5 15.45 0.072110\n#> 9 5 17.93 0.065700\n#> 10 7 20.81 0.059860\n#> 11 7 23.51 0.054550\n#> 12 7 25.82 0.049700\n#> 13 8 28.20 0.045290\n#> 14 8 30.31 0.041260\n#> 15 8 32.12 0.037600\n#> 16 8 33.66 0.034260\n#> 17 8 34.97 0.031210\n#> 18 8 36.08 0.028440\n#> 19 8 37.02 0.025910\n#> 20 9 37.90 0.023610\n#> 21 9 38.65 0.021510\n#> 22 9 39.29 0.019600\n#> 23 9 39.83 0.017860\n#> 24 9 40.28 0.016270\n#> 25 10 40.68 0.014830\n#> 26 11 41.06 0.013510\n#> 27 11 41.38 0.012310\n#> 28 11 41.65 0.011220\n#> 29 11 41.88 0.010220\n#> 30 12 42.09 0.009313\n#> 31 12 42.27 0.008486\n#> 32 12 42.43 0.007732\n#> 33 12 42.56 0.007045\n#> 34 12 42.66 0.006419\n#> 35 12 42.75 0.005849\n#> 36 12 42.83 0.005329\n#> 37 12 42.90 0.004856\n#> 38 12 42.95 0.004424\n#> 39 12 42.99 0.004031\n#> 40 12 43.03 0.003673\n#> 41 12 43.06 0.003347\n#> 42 12 43.09 0.003050\n#> 43 12 43.11 0.002779\n#> 44 12 43.13 0.002532\n#> 45 12 43.15 0.002307\n#> 46 12 43.16 0.002102\n#> 47 12 43.17 0.001915\n#> 48 12 43.18 0.001745\n#> 49 12 43.19 0.001590\n#> 50 12 43.19 0.001449\n#> 51 12 43.20 0.001320\n#> 52 12 43.20 0.001203\n#> 53 12 43.21 0.001096\n#> 54 12 43.21 0.000999\n#> 55 12 43.21 0.000910\n#> 56 12 43.21 0.000829\n#> 57 12 43.22 0.000755\n#> 58 12 43.22 0.000688\n#> 59 12 43.22 0.000627\n#> 60 12 43.22 0.000571\n#> 61 12 43.22 0.000521\n#> 62 12 43.22 0.000474\n#> 63 12 43.22 0.000432\n#> 64 12 43.22 0.000394\n#> 65 12 43.22 0.000359\n```\n:::\n\n\nIf `penalty` were not specified, all of the `lambda` values would be computed. 
\n\nTo get the predictions for this specific value of `lambda` (aka `penalty`):\n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-pred_673611c19e448251aeb977fec5788162'}\n\n```{.r .cell-code}\n# First, get the processed version of the test set predictors:\ntest_normalized <- bake(norm_recipe, new_data = ames_test, all_predictors())\n\ntest_results <- \n test_results %>%\n rename(`random forest` = .pred) %>%\n bind_cols(\n predict(glmn_fit, new_data = test_normalized) %>%\n rename(glmnet = .pred)\n )\ntest_results\n#> # A tibble: 733 × 3\n#> Sale_Price `random forest` glmnet\n#> \n#> 1 5.39 5.25 5.16\n#> 2 5.28 5.29 5.27\n#> 3 5.23 5.26 5.24\n#> 4 5.21 5.30 5.24\n#> 5 5.60 5.51 5.24\n#> 6 5.32 5.29 5.26\n#> 7 5.17 5.14 5.18\n#> 8 5.06 5.13 5.17\n#> 9 4.98 5.01 5.18\n#> 10 5.11 5.14 5.19\n#> # … with 723 more rows\n\ntest_results %>% metrics(truth = Sale_Price, estimate = glmnet) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.142 \n#> 2 rsq standard 0.391 \n#> 3 mae standard 0.0979\n\ntest_results %>% \n gather(model, prediction, -Sale_Price) %>% \n ggplot(aes(x = prediction, y = Sale_Price)) + \n geom_abline(col = \"green\", lty = 2) + \n geom_point(alpha = .4) + \n facet_wrap(~model) + \n coord_fixed()\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nThis final plot compares the performance of the random forest and regularized regression models.\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Monterey 12.6.1\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-05\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> glmnet * 4.1-6 2022-11-27 [1] CRAN (R 4.2.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.0)\n#> randomForest * 4.7-1.1 2022-05-23 [1] CRAN (R 4.2.0)\n#> ranger * 0.14.1 2022-06-18 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3 2022-11-09 [1] CRAN (R 4.2.0)\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-09 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-12-27 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
+ "markdown": "---\ntitle: \"Regression models two ways\"\ncategories:\n - model fitting\n - random forests\n - linear regression\ntype: learn-subsection\nweight: 1\ndescription: | \n Create and train different kinds of regression models with different computational engines.\n---\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: glmnet, randomForest, ranger, and tidymodels.\n\nWe can create regression models with the tidymodels package [parsnip](https://parsnip.tidymodels.org/) to predict continuous or numeric quantities. Here, let's first fit a random forest model, which does _not_ require all numeric input (see discussion [here](https://bookdown.org/max/FES/categorical-trees.html)) and discuss how to use `fit()` and `fit_xy()`, as well as _data descriptors_. \n\nSecond, let's fit a regularized linear regression model to demonstrate how to move between different types of models using parsnip. \n\n## The Ames housing data\n\nWe'll use the Ames housing data set to demonstrate how to create regression models using parsnip. First, set up the data set and create a simple training/test set split:\n\n\n::: {.cell layout-align=\"center\" hash='cache/ames-split_18b5bf0134171b332b56ced5fc3b1911'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\ndata(ames)\n\nset.seed(4595)\ndata_split <- initial_split(ames, strata = \"Sale_Price\", prop = 0.75)\n\names_train <- training(data_split)\names_test <- testing(data_split)\n```\n:::\n\n\nThe use of the test set here is _only for illustration_; normally in a data analysis these data would be saved to the very end after many models have been evaluated. \n\n## Random forest\n\nWe'll start by fitting a random forest model to a small set of parameters. Let's create a model with the predictors `Longitude`, `Latitude`, `Lot_Area`, `Neighborhood`, and `Year_Sold`. A simple random forest model can be specified via:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic_02332292464935db3e80ae3984b88bfe'}\n\n```{.r .cell-code}\nrf_defaults <- rand_forest(mode = \"regression\")\nrf_defaults\n#> Random Forest Model Specification (regression)\n#> \n#> Computational engine: ranger\n```\n:::\n\n\nThe model will be fit with the ranger package by default. Since we didn't add any extra arguments to `fit`, _many_ of the arguments will be set to their defaults from the function `ranger::ranger()`. The help pages for the model function describe the default parameters and you can also use the `translate()` function to check out such details. 
\n\nThe parsnip package provides two different interfaces to fit a model: \n\n- the formula interface (`fit()`), and\n- the non-formula interface (`fit_xy()`).\n\nLet's start with the non-formula interface:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy_b14276d12ae7cf9ce85ecf44ee6a9bb4'}\n\n```{.r .cell-code}\npreds <- c(\"Longitude\", \"Latitude\", \"Lot_Area\", \"Neighborhood\", \"Year_Sold\")\n\nrf_xy_fit <- \n rf_defaults %>%\n set_engine(\"ranger\") %>%\n fit_xy(\n x = ames_train[, preds],\n y = log10(ames_train$Sale_Price)\n )\n\nrf_xy_fit\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 500 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 2 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008500188 \n#> R squared (OOB): 0.7239116\n```\n:::\n\n\nThe non-formula interface doesn't do anything to the predictors before passing them to the underlying model function. This particular model does _not_ require indicator variables (sometimes called \"dummy variables\") to be created prior to fitting the model. Note that the output shows \"Number of independent variables: 5\".\n\nFor regression models, we can use the basic `predict()` method, which returns a tibble with a column named `.pred`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy-pred_6e4354a6451e46bc8faa718908d9a912'}\n\n```{.r .cell-code}\ntest_results <- \n ames_test %>%\n select(Sale_Price) %>%\n mutate(Sale_Price = log10(Sale_Price)) %>%\n bind_cols(\n predict(rf_xy_fit, new_data = ames_test[, preds])\n )\ntest_results %>% slice(1:5)\n#> # A tibble: 5 × 2\n#> Sale_Price .pred\n#> \n#> 1 5.39 5.25\n#> 2 5.28 5.29\n#> 3 5.23 5.26\n#> 4 5.21 5.30\n#> 5 5.60 5.51\n\n# summarize performance\ntest_results %>% metrics(truth = Sale_Price, estimate = .pred) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.0945\n#> 2 rsq standard 0.733 \n#> 3 mae standard 0.0629\n```\n:::\n\n\nNote that: \n\n * If the model required indicator variables, we would have to create them manually prior to using `fit()` (perhaps using the recipes package).\n * We had to manually log the outcome prior to modeling. \n\nNow, for illustration, let's use the formula method using some new parameter values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-form_1e1aadae8c7bde9360f11e8062781218'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 3 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008402569 \n#> R squared (OOB): 0.7270823\n```\n:::\n\n \nSuppose that we would like to use the randomForest package instead of ranger. 
To do so, the only part of the syntax that needs to change is the `set_engine()` argument:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-rf_7f5cc80f129451dce218438d3e2b5856'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"randomForest\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> \n#> Call:\n#> randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~3, x)) \n#> Type of random forest: regression\n#> Number of trees: 1000\n#> No. of variables tried at each split: 3\n#> \n#> Mean of squared residuals: 0.008472074\n#> % Var explained: 72.47\n```\n:::\n\n\nLook at the formula code that was printed out; one function uses the argument name `ntree` and the other uses `num.trees`. The parsnip models don't require you to know the specific names of the main arguments. \n\nNow suppose that we want to modify the value of `mtry` based on the number of predictors in the data. Usually, a good default value is `floor(sqrt(num_predictors))` but a pure bagging model requires an `mtry` value equal to the total number of parameters. There may be cases where you may not know how many predictors are going to be present when the model will be fit (perhaps due to the generation of indicator variables or a variable filter) so this might be difficult to know exactly ahead of time when you write your code. \n\nWhen the model it being fit by parsnip, [_data descriptors_](https://parsnip.tidymodels.org/reference/descriptors.html) are made available. These attempt to let you know what you will have available when the model is fit. When a model object is created (say using `rand_forest()`), the values of the arguments that you give it are _immediately evaluated_ unless you delay them. To delay the evaluation of any argument, you can used `rlang::expr()` to make an expression. \n\nTwo relevant data descriptors for our example model are:\n\n * `.preds()`: the number of predictor _variables_ in the data set that are associated with the predictors **prior to dummy variable creation**.\n * `.cols()`: the number of predictor _columns_ after dummy variables (or other encodings) are created.\n\nSince ranger won't create indicator values, `.preds()` would be appropriate for `mtry` for a bagging model. \n\nFor example, let's use an expression with the `.preds()` descriptor to fit a bagging model: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bagged_2b76f70b641acbdb2616b84443585217'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = .preds(), trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~.preds(), x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 5 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.00867085 \n#> R squared (OOB): 0.7183685\n```\n:::\n\n\n\n## Regularized regression\n\nA linear model might work for this data set as well. We can use the `linear_reg()` parsnip model. 
There are two engines that can perform regularization/penalization, the glmnet and sparklyr packages. Let's use the former here. The glmnet package only implements a non-formula method, but parsnip will allow either one to be used. \n\nWhen regularization is used, the predictors should first be centered and scaled before being passed to the model. The formula method won't do that automatically so we will need to do this ourselves. We'll use the [recipes](https://recipes.tidymodels.org/) package for these steps. \n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-form_a0ca81e5cfdf6601081373c7b271e499'}\n\n```{.r .cell-code}\nnorm_recipe <- \n recipe(\n Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold, \n data = ames_train\n ) %>%\n step_other(Neighborhood) %>% \n step_dummy(all_nominal()) %>%\n step_center(all_predictors()) %>%\n step_scale(all_predictors()) %>%\n step_log(Sale_Price, base = 10) %>% \n # estimate the means and standard deviations\n prep(training = ames_train, retain = TRUE)\n\n# Now let's fit the model using the processed version of the data\n\nglmn_fit <- \n linear_reg(penalty = 0.001, mixture = 0.5) %>% \n set_engine(\"glmnet\") %>%\n fit(Sale_Price ~ ., data = bake(norm_recipe, new_data = NULL))\nglmn_fit\n#> parsnip model object\n#> \n#> \n#> Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = \"gaussian\", alpha = ~0.5) \n#> \n#> Df %Dev Lambda\n#> 1 0 0.00 0.138300\n#> 2 1 1.96 0.126000\n#> 3 1 3.72 0.114800\n#> 4 1 5.28 0.104600\n#> 5 2 7.07 0.095320\n#> 6 3 9.64 0.086850\n#> 7 4 12.58 0.079140\n#> 8 5 15.45 0.072110\n#> 9 5 17.93 0.065700\n#> 10 7 20.81 0.059860\n#> 11 7 23.51 0.054550\n#> 12 7 25.82 0.049700\n#> 13 8 28.20 0.045290\n#> 14 8 30.31 0.041260\n#> 15 8 32.12 0.037600\n#> 16 8 33.66 0.034260\n#> 17 8 34.97 0.031210\n#> 18 8 36.08 0.028440\n#> 19 8 37.02 0.025910\n#> 20 9 37.90 0.023610\n#> 21 9 38.65 0.021510\n#> 22 9 39.29 0.019600\n#> 23 9 39.83 0.017860\n#> 24 9 40.28 0.016270\n#> 25 10 40.68 0.014830\n#> 26 11 41.06 0.013510\n#> 27 11 41.38 0.012310\n#> 28 11 41.65 0.011220\n#> 29 11 41.88 0.010220\n#> 30 12 42.09 0.009313\n#> 31 12 42.27 0.008486\n#> 32 12 42.43 0.007732\n#> 33 12 42.56 0.007045\n#> 34 12 42.66 0.006419\n#> 35 12 42.75 0.005849\n#> 36 12 42.83 0.005329\n#> 37 12 42.90 0.004856\n#> 38 12 42.95 0.004424\n#> 39 12 42.99 0.004031\n#> 40 12 43.03 0.003673\n#> 41 12 43.06 0.003347\n#> 42 12 43.09 0.003050\n#> 43 12 43.11 0.002779\n#> 44 12 43.13 0.002532\n#> 45 12 43.15 0.002307\n#> 46 12 43.16 0.002102\n#> 47 12 43.17 0.001915\n#> 48 12 43.18 0.001745\n#> 49 12 43.19 0.001590\n#> 50 12 43.19 0.001449\n#> 51 12 43.20 0.001320\n#> 52 12 43.20 0.001203\n#> 53 12 43.21 0.001096\n#> 54 12 43.21 0.000999\n#> 55 12 43.21 0.000910\n#> 56 12 43.21 0.000829\n#> 57 12 43.22 0.000755\n#> 58 12 43.22 0.000688\n#> 59 12 43.22 0.000627\n#> 60 12 43.22 0.000571\n#> 61 12 43.22 0.000521\n#> 62 12 43.22 0.000474\n#> 63 12 43.22 0.000432\n#> 64 12 43.22 0.000394\n#> 65 12 43.22 0.000359\n```\n:::\n\n\nIf `penalty` were not specified, all of the `lambda` values would be computed. 
\n\nTo get the predictions for this specific value of `lambda` (aka `penalty`):\n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-pred_673611c19e448251aeb977fec5788162'}\n\n```{.r .cell-code}\n# First, get the processed version of the test set predictors:\ntest_normalized <- bake(norm_recipe, new_data = ames_test, all_predictors())\n\ntest_results <- \n test_results %>%\n rename(`random forest` = .pred) %>%\n bind_cols(\n predict(glmn_fit, new_data = test_normalized) %>%\n rename(glmnet = .pred)\n )\ntest_results\n#> # A tibble: 733 × 3\n#> Sale_Price `random forest` glmnet\n#> \n#> 1 5.39 5.25 5.16\n#> 2 5.28 5.29 5.27\n#> 3 5.23 5.26 5.24\n#> 4 5.21 5.30 5.24\n#> 5 5.60 5.51 5.24\n#> 6 5.32 5.29 5.26\n#> 7 5.17 5.14 5.18\n#> 8 5.06 5.13 5.17\n#> 9 4.98 5.01 5.18\n#> 10 5.11 5.14 5.19\n#> # … with 723 more rows\n\ntest_results %>% metrics(truth = Sale_Price, estimate = glmnet) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.142 \n#> 2 rsq standard 0.391 \n#> 3 mae standard 0.0979\n\ntest_results %>% \n gather(model, prediction, -Sale_Price) %>% \n ggplot(aes(x = prediction, y = Sale_Price)) + \n geom_abline(col = \"green\", lty = 2) + \n geom_point(alpha = .4) + \n facet_wrap(~model) + \n coord_fixed()\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nThis final plot compares the performance of the random forest and regularized regression models.\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Big Sur/Monterey 10.16\n#> system x86_64, darwin17.0\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-04\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> glmnet * 4.1-4 2022-04-15 [1] CRAN (R 4.2.0)\n#> infer * 1.0.3 2022-08-22 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)\n#> randomForest * 4.7-1.1 2022-05-23 [1] CRAN (R 4.2.0)\n#> ranger * 0.14.1 2022-06-18 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3.9000 2022-12-12 [1] local\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1.9000 2022-12-13 [1] local\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-05 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-11-30 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
diff --git a/_freeze/content/learn/statistics/many-models/index/execute-results/html.json b/_freeze/content/learn/statistics/many-models/index/execute-results/html.json
new file mode 100644
index 00000000..47703bfd
--- /dev/null
+++ b/_freeze/content/learn/statistics/many-models/index/execute-results/html.json
@@ -0,0 +1,14 @@
+{
+ "hash": "8b507d2b8c5252519f803be34ba5a47b",
+ "result": {
+ "markdown": "---\ntitle: \"Many Models\"\ncategories:\n - statistical analysis\n - tidying results\ntype: learn-subsection\nweight: 1\ndescription: | \n Analyze the results of many linear regressions and learn about nested data frames along the way.\n---\n\n\n\n\n\n\n\n_These materials were in the first edition of [R for Data Science](https://r4ds.had.co.nz). They are not in later editions and are shown here -- with some modifications-- with permission_. To use code in this article, you will need to install the following packages: gapminder, stringr, and tidymodels.\n\nIn this article, you're going to learn three powerful ideas that help you to work with large numbers of models with ease:\n\n1. Using many simple models to better understand complex datasets.\n\n1. Using list-columns to store arbitrary data structures in a data frame.\n For example, this will allow you to have a column that contains linear \n models.\n \n1. Using the __broom__ package, originally by David Robinson, to turn models into tidy \n data. This is a powerful technique for working with large numbers of models.\n\nWe'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signals so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.\n\nThe following sections will dive into more detail about the individual techniques:\n\n1. In [list-columns], you'll learn more about the list-column data structure,\n and why it's valid to put lists in data frames.\n \n1. In [creating list-columns], you'll learn the three main ways in which you'll\n create list-columns.\n \n1. In [simplifying list-columns] you'll learn how to convert list-columns back\n to regular atomic vectors (or sets of atomic vectors) so you can work\n with them more easily.\n \n1. In [making tidy data with broom], you'll learn about the full set of tools\n provided by broom, and see how they can be applied to other types of \n data structure.\n\nFirst, let's load some packages:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-3_68b436f9fd1b8a93a8377ef202bb7c9c'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\ntidymodels_prefer()\n```\n:::\n\n\n## gapminder\n\nTo motivate the power of many simple models, we're going to look into the \"gapminder\" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, stop reading this chapter right now and go watch one of his videos! He is a fantastic data presenter and illustrates how you can use data to present a compelling story. A good place to start is this short video filmed in conjunction with the BBC: .\n\nThe gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP. 
The data is easy to access in R, thanks to Jenny Bryan who created the gapminder package:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-4_61bfa05dd2f83dae069375d16a77211a'}\n\n```{.r .cell-code}\nlibrary(gapminder)\ngapminder\n#> # A tibble: 1,704 × 6\n#> country continent year lifeExp pop gdpPercap\n#> \n#> 1 Afghanistan Asia 1952 28.8 8425333 779.\n#> 2 Afghanistan Asia 1957 30.3 9240934 821.\n#> 3 Afghanistan Asia 1962 32.0 10267083 853.\n#> 4 Afghanistan Asia 1967 34.0 11537966 836.\n#> 5 Afghanistan Asia 1972 36.1 13079460 740.\n#> 6 Afghanistan Asia 1977 38.4 14880372 786.\n#> 7 Afghanistan Asia 1982 39.9 12881816 978.\n#> 8 Afghanistan Asia 1987 40.8 13867957 852.\n#> 9 Afghanistan Asia 1992 41.7 16317921 649.\n#> 10 Afghanistan Asia 1997 41.8 22227415 635.\n#> # … with 1,694 more rows\n```\n:::\n\n\nIn this case study, we're going to focus on just three variables to answer the question \"How does life expectancy (`lifeExp`) change over time (`year`) for each country (`country`)?\". A good place to start is with a plot:\n\n\n::: {.cell layout-align=\"center\" hash='cache/year-life-exp_2ae61e7d6272489aa9ea110790fd6fc6'}\n\n```{.r .cell-code}\ngapminder %>% \n ggplot(aes(year, lifeExp, group = country)) +\n geom_line(alpha = 1/3)\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nThis is a small dataset: it only has ~1,700 observations and 3 variables. But it's still hard to see what's going on! Overall, it looks like life expectancy has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?\n\nOne way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see subtler trends. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.\n\nYou already know how to do that if we had a single country:\n\n\n::: {.cell layout-align=\"center\" fig.asp='1' hash='cache/year-life-exp-nz_06fcd418a5a9e9d7c356ed3c88d34748'}\n\n```{.r .cell-code}\nnz <- filter(gapminder, country == \"New Zealand\")\nnz %>% \n ggplot(aes(year, lifeExp)) + \n geom_line() + \n ggtitle(\"Full data = \")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=33%}\n:::\n\n```{.r .cell-code}\n\nnz_mod <- lm(lifeExp ~ year, data = nz)\n```\n:::\n\n\nThe broom package provides a general set of functions to turn models into tidy data. One function, `augment()`, is used to supplement a data set with columns such as the predicted values, residuals, and so on. We'll use this to plot the predicted versus observed data: \n\n\n::: {.cell layout-align=\"center\" hash='cache/year-life-exp-nz-resid_6f887ff3d02adfedd7ea443d09af9830'}\n\n```{.r .cell-code}\nnz_mod_res <- augment(nz_mod)\n\nnz_mod_res %>%\n ggplot(aes(year, .fitted)) + \n geom_line() + \n ggtitle(\"Linear trend + \")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n\n```{.r .cell-code}\n\nnz_mod_res %>% \n ggplot(aes(year, .resid)) + \n geom_hline(yintercept = 0, colour = \"white\", linewidth = 3) + \n geom_line() + \n ggtitle(\"Remaining pattern\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nThere is a `data` argument to `augment()`. The default uses the same data points that the model was trained on. 
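As a small illustrative sketch (not an executed cell from the original article), the data can also be supplied explicitly, and `newdata` can be used to get fitted values for observations the model never saw; the years below are arbitrary examples:\n\n```r\n# Same rows as the default behaviour, just passed explicitly\naugment(nz_mod, data = nz)\n\n# Fitted values for new, hypothetical years (illustrative only)\naugment(nz_mod, newdata = tibble(year = c(2012, 2017)))\n```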
\n\nHow can we easily fit that model to every country?\n\n### Nested data\n\nYou could imagine copy and pasting that code multiple times; but you've already learned a better way! Extract out the common code with a function and repeat using a map function from purrr. This problem is structured a little differently to what you've seen before. Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the __nested data frame__. To create a nested data frame we start with a grouped data frame, and \"nest\" it:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-8_cc1ce25bb88024098422348cd2fe7ee9'}\n\n```{.r .cell-code}\nby_country <- gapminder %>% \n group_by(country, continent) %>% \n nest()\n\nby_country\n#> # A tibble: 142 × 3\n#> # Groups: country, continent [142]\n#> country continent data \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\n(I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)\n\nThis creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.\n\nThe `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects. Unfortunately using `str()` is not recommended as it will often produce very long output. But if you pluck out a single element from the `data` column you'll see that it contains all the data for that country (in this case, Afghanistan).\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-9_8d4fea0867420a2fefe97b6580e4e5be'}\n\n```{.r .cell-code}\nby_country$data[[1]]\n#> # A tibble: 12 × 4\n#> year lifeExp pop gdpPercap\n#> \n#> 1 1952 28.8 8425333 779.\n#> 2 1957 30.3 9240934 821.\n#> 3 1962 32.0 10267083 853.\n#> 4 1967 34.0 11537966 836.\n#> 5 1972 36.1 13079460 740.\n#> 6 1977 38.4 14880372 786.\n#> 7 1982 39.9 12881816 978.\n#> 8 1987 40.8 13867957 852.\n#> 9 1992 41.7 16317921 649.\n#> 10 1997 41.8 22227415 635.\n#> 11 2002 42.1 25268405 727.\n#> 12 2007 43.8 31889923 975.\n```\n:::\n\n\nNote the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group. Another way to think about a nested dataset is we now have a meta-observation: a row that represents the complete time course for a country, rather than a single point in time.\n\n### List-columns\n\nNow that we have our nested data frame, we're in a good position to fit some models. We have a model-fitting function:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-10_64a76b96fa6c7b356edac37e78cd41e0'}\n\n```{.r .cell-code}\ncountry_model <- function(df) {\n lm(lifeExp ~ year, data = df)\n}\n```\n:::\n\n\nAnd we want to apply it to every data frame. 
The data frames are in a list, so we can use `purrr::map()` to apply `country_model` to each element:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-11_bc3c5e98ba55e9a5ca853844c95293af'}\n\n```{.r .cell-code}\nmodels <- map(by_country$data, country_model)\n```\n:::\n\n\nHowever, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?\n\nIn other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-12_290ff8735e10dce4997cd3751707c272'}\n\n```{.r .cell-code}\nby_country <- by_country %>% \n mutate(model = map(data, country_model))\nby_country\n#> # A tibble: 142 × 4\n#> # Groups: country, continent [142]\n#> country continent data model \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\nThis has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. The semantics of the data frame takes care of that for you:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-13_eb34a6390d7d26b87d93852a737b52df'}\n\n```{.r .cell-code}\nby_country %>% \n filter(continent == \"Europe\")\n#> # A tibble: 30 × 4\n#> # Groups: country, continent [30]\n#> country continent data model \n#> \n#> 1 Albania Europe \n#> 2 Austria Europe \n#> 3 Belgium Europe \n#> 4 Bosnia and Herzegovina Europe \n#> 5 Bulgaria Europe \n#> 6 Croatia Europe \n#> 7 Czech Republic Europe \n#> 8 Denmark Europe \n#> 9 Finland Europe \n#> 10 France Europe \n#> # … with 20 more rows\nby_country %>% \n arrange(continent, country)\n#> # A tibble: 142 × 4\n#> # Groups: country, continent [142]\n#> country continent data model \n#> \n#> 1 Algeria Africa \n#> 2 Angola Africa \n#> 3 Benin Africa \n#> 4 Botswana Africa \n#> 5 Burkina Faso Africa \n#> 6 Burundi Africa \n#> 7 Cameroon Africa \n#> 8 Central African Republic Africa \n#> 9 Chad Africa \n#> 10 Comoros Africa \n#> # … with 132 more rows\n```\n:::\n\n\nIf your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!\n\n### Unnesting\n\nPreviously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. 
To compute the residuals, we need to call `augment()` with each model pair:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-14_9ac5aa978a8091b63a9a551404452704'}\n\n```{.r .cell-code}\nby_country <- by_country %>% \n mutate(\n results = map(model, augment)\n )\nby_country\n#> # A tibble: 142 × 5\n#> # Groups: country, continent [142]\n#> country continent data model results \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\nBut how can you plot a list of data frames? Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame. Previously we used `nest()` to turn a regular data frame into an nested data frame, and now we do the opposite with `unnest()`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-15_dac9af647026b33256698116cb78dd11'}\n\n```{.r .cell-code}\nresids <- unnest(by_country, results)\nresids\n#> # A tibble: 1,704 × 12\n#> # Groups: country, continent [142]\n#> country conti…¹ data model lifeExp year .fitted .resid .hat .sigma\n#> \n#> 1 Afghanist… Asia 28.8 1952 29.9 -1.11 0.295 1.21\n#> 2 Afghanist… Asia 30.3 1957 31.3 -0.952 0.225 1.24\n#> 3 Afghanist… Asia 32.0 1962 32.7 -0.664 0.169 1.27\n#> 4 Afghanist… Asia 34.0 1967 34.0 -0.0172 0.127 1.29\n#> 5 Afghanist… Asia 36.1 1972 35.4 0.674 0.0991 1.27\n#> 6 Afghanist… Asia 38.4 1977 36.8 1.65 0.0851 1.15\n#> 7 Afghanist… Asia 39.9 1982 38.2 1.69 0.0851 1.15\n#> 8 Afghanist… Asia 40.8 1987 39.5 1.28 0.0991 1.21\n#> 9 Afghanist… Asia 41.7 1992 40.9 0.754 0.127 1.26\n#> 10 Afghanist… Asia 41.8 1997 42.3 -0.534 0.169 1.27\n#> # … with 1,694 more rows, 2 more variables: .cooksd , .std.resid ,\n#> # and abbreviated variable name ¹continent\n```\n:::\n\n\nNote that each regular column is repeated once for each row of the nested tibble.\n\nNow we have regular data frame, we can plot the residuals:\n\n\n::: {.cell layout-align=\"center\" hash='cache/resid-col-country_2c4f5aac18cc015b75b9190ec1728554'}\n\n```{.r .cell-code}\nresids %>% \n ggplot(aes(year, .resid)) +\n geom_line(aes(group = country), alpha = 1 / 3) + \n geom_smooth(se = FALSE)\n#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs =\n#> \"cs\")'\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nFacetting by continent is particularly revealing:\n \n\n::: {.cell layout-align=\"center\" hash='cache/resid-col-country-facetting_14d0637611062dbe6c16304bcf2c5b9a'}\n\n```{.r .cell-code}\nresids %>% \n ggplot(aes(year, .resid, group = country)) +\n geom_line(alpha = 1 / 3) + \n facet_wrap(~continent)\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=672}\n:::\n:::\n\n\nIt looks like we've missed some mild patterns. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.\n\n### Model quality\n\nInstead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. Here we'll show a different approach using the broom package and we'll use `broom::glance()` to extract some model quality metrics. 
If we apply it to a model, we get a data frame with a single row:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-18_29fbf5cd328907283445f10a58d7920e'}\n\n```{.r .cell-code}\nbroom::glance(nz_mod)\n#> # A tibble: 1 × 12\n#> r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵\n#> \n#> 1 0.954 0.949 0.804 205. 5.41e-8 1 -13.3 32.6 34.1 6.47 10\n#> # … with 1 more variable: nobs , and abbreviated variable names\n#> # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual\n```\n:::\n\n\nWe can use `mutate()` and `unnest()` to create a data frame with a row for each country:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-19_f0d129ad2719cc5c9d7ebe4ebe270d50'}\n\n```{.r .cell-code}\nby_country %>% \n mutate(glance = map(model, broom::glance)) %>% \n unnest(glance)\n#> # A tibble: 142 × 17\n#> # Groups: country, continent [142]\n#> country continent data model results r.squa…¹ adj.r…² sigma stati…³\n#> \n#> 1 Afghanistan Asia 0.948 0.942 1.22 181. \n#> 2 Albania Europe 0.911 0.902 1.98 102. \n#> 3 Algeria Africa 0.985 0.984 1.32 662. \n#> 4 Angola Africa 0.888 0.877 1.41 79.1\n#> 5 Argentina Americas 0.996 0.995 0.292 2246. \n#> 6 Australia Oceania 0.980 0.978 0.621 481. \n#> 7 Austria Europe 0.992 0.991 0.407 1261. \n#> 8 Bahrain Asia 0.967 0.963 1.64 291. \n#> 9 Bangladesh Asia 0.989 0.988 0.977 930. \n#> 10 Belgium Europe