Sydney Informatics Hub – tidymodels_tips

skip = TRUE

The sale price data are already log-transformed in the ames data frame. Why not use step_log() in our recipe:

step_log(Sale_Price, base = 10)

This will cause a failure when the recipe is applied to new properties with an unknown sale price. Since price is what we are trying to predict, there probably won’t be a column in the data for this variable. In fact, to avoid information leakage, many tidymodels packages isolate the data being used when making any predictions. This means that the training set and any outcome columns are not available for use at prediction time.

For simple transformations of the outcome column(s), we strongly suggest that those operations be conducted outside of the recipe.

However, there are other circumstances where this is not an adequate solution. When using a recipe, we thus need a mechanism to ensure that some operations are applied only to the data that are given to the model. Each step function has an option called skip that, when set to TRUE, will be ignored by the predict() function. In this way, you can isolate the steps that affect the modeling data without causing errors when applied to new samples. However, all steps are applied when using fit().

step_mutate() - step_select()

step_mutate() creates a specification of a recipe step that will add variables using dplyr::mutate(). We did this step during EDA in Session 1 but we could have easily introduced it in our recipe ames_rec object in this way:

ames_rec <- recipe(sale_price ~ ., data = ames_train) %>% 
            step_mutate(time_since_remodel = year_sold - year_remod_add, 
                        house_age = year_sold - year_built) %>%
            step_select(-year_remod_add, -year_built)

The advantage of performing these preprocessing steps with the recipe package is that all the feature engineering you want to perform to your data can be put into a single object, you can save that object, you can carry it around. It’s not in a bunch of scripts. It’s been unit tested and it has a lot of features in it.

update()

This step method for update() updates steps within a recipe object.

In the example below, the [[2]] is used to access the 2 step in the recipe object’s steps list.

The update() function is then used to modify this step by changing the threshold used in the step_nzv() function to 0.05.

For a step to be updated, it must not already have been trained.

ames_rec$steps[[2]] <- update(ames_rec$steps[[2]], threshold = 0.05)

use_*()

The usemodels package is a helpful way of quickly creating code snippets to fit models using the tidymodels framework. Given a simple formula and a data set, the use_* functions can create code that appropriate for the data (given the model). The package includes these templates:

library(usemodels)
ls("package:usemodels", pattern = "use_")

[1] "use_C5.0"             "use_cubist"           "use_earth"           
[4] "use_glmnet"           "use_kernlab_svm_poly" "use_kernlab_svm_rbf" 
[7] "use_kknn"             "use_ranger"           "use_xgboost"

library(mlbench)
data(PimaIndiansDiabetes)
use_ranger(diabetes ~ ., data = PimaIndiansDiabetes)

ranger_recipe <- 
  recipe(formula = diabetes ~ ., data = PimaIndiansDiabetes) 

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("classification") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

set.seed(58662)
ranger_tune <-
  tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

tune_race_anova()

The problem with grid search is that you don’t know if some of those choices you made about the candidate parameters are any good until you’re done with all the computations. The tune_race_anova() function from the finetune package is a dynamic way of doing grid search. What racing does is as you start to do the model tuning, it looks at the results as they happen and eliminates tuning parameter combinations that are unlikely to be the best results using a repeated measure ANOVA model:

install.packages("finetune")
library(finetune)

rf_tune_wf <- rf_workflow %>%
              tune_race_anova(resamples = diabetes_folds,
                              grid = rf_grid)