class: bottom, left, title-slide, clear background-image: url('img/beamer_title_KUL.png') background-size: contain name: titlepage # Hands-on Machine Learning with R - Module 1 <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=100%></html> <span style="color:#FAFAFA;font-size:18.0pt"> Katrien Antonio & Roel Henckaerts & Jonas Crevecoeur <br> <br> </span> <span style="color:#FAFAFA;font-size:14.0pt"> Webinar | October - November, 2023<br> </span> --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: introduction # Introduction ### Course
https://github.com/katrienantonio/hands-on-machine-learning-R-module-1 The course repo on GitHub, where you can find the data sets, lecture sheets, R scripts and R markdown files. -- ### Us
[https://katrienantonio.github.io/](https://katrienantonio.github.io/) & [LinkedIn Jonas](https://be.linkedin.com/in/jonascrevecoeur) & [LinkedIn Roel](https://be.linkedin.com/in/roelhenckaerts)
[katrien.antonio@kuleuven.be](mailto:katrien.antonio@kuleuven.be) & [roel.henckaerts@kuleuven.be](mailto:roel.henckaerts@kuleuven.be)
(Katrien) Professor in insurance data science
(Jonas) PhD in insurance data science, now consultant in statistics, data science and data engineering with Data Minded
(Roel) PhD in insurance data science, now consultant in data science with AI start up [Prophecy Labs](https://www.prophecylabs.com/) --- name: checklist # Checklist ☑ Do you have a fairly recent version of R? ☑ Do you have a fairly recent version of RStudio? ☑ Have you installed the R packages listed in the software requirements? or ☑ Have you created an account on posit cloud (to avoid any local installation issues)? --- name: why-this-course # inspired by Grant McDermott intro lecture # Why this course? ### The goals of this course .font140[
] -- * develop practical .KULbginline[machine learning (ML) foundations] -- * .KULbginline[fill in the gaps] left by traditional training in actuarial science or econometrics -- * focus on the use of ML methods for the .KULbginline[analysis of frequency + severity data], but also .KULbginline[non-standard data] such as images -- * .KULbginline[explore] a substantial range of .KULbginline[methods (and data types)] (from GLMs to deep learning), but - most importantly - .KULbginline[build foundation] so that you can explore other methods (and data types) yourself. -- <br> > *"In short, we will cover things that we wish someone had taught us in our undergraduate programs."* > <br> > .font80[This quote is from the [Data science for economists course](http://github.com/uo-ec607/lectures) by Grant McDermott.] --- # Module 1's Outline .pull-left[ * [Prologue](#prologue) * [Knowing me, knowing you: <br> statistical and machine learning](#knowing) - Supervised and unsupervised learning - Regression and classification - Statistical modeling: the two cultures * [Model accuracy and loss functions](#accuracy) * [Overfitting and bias-variance tradeoff](#overfit) * [Data splitting, Resampling methods](#resampling) ] .pull-right[ * [Parameter tuning](#tuning) - with {caret}, {rsample} and {purrr} * [Target and feature engineering](#engineering) - Data leakage - Pre-processing steps - Specifying blue-prints with {recipes} - Putting it all together: {recipes} and {caret}/{rsample} * [Regression models](#regression) - Creating models in R and tidy model output with {broom} - GLMs with {glm} - GAMs with {mgcv} - Regularized (G)LMs with {glmnet}. ] --- name: map-ML-world class: right, middle, clear background-image: url("img/map_ML_world.jpg") background-size: 45% background-position: left .KULbginline[Some roadmaps to explore the ML landscape...] <img src = "img/AI_ML_DL.jpg" height = "350px" /> .font60[Source: [Machine Learning for Everyone In simple words. With real-world examples. Yes, again.](https://vas3k.com/blog/machine_learning/)] --- name: map-ML-world class: right, middle, clear background-image: url("img/main_types_ML.jpg") background-size: 85% background-position: middle --- class: inverse, center, middle name: knowing # Knowing me, knowing you: <br> <br> # statistical and machine learning <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: supervised-learning # Supervised learning .pull-left-alt[ Supervised learning builds ("learns") a .blue[model] `\(\color{#3b3b9a}{f}\)` (*the Signal*) such that the .orange[outcome or target] `\(\color{#fb6107}{Y}\)` can be written as `$$\color{#FFA500}{Y} = \color{#3b3b9a}{f}(\color{#e64173}{x_1, \ldots, x_p}) + \epsilon$$` with .pink[features] `\(\color{#e64173}{x_1, \ldots, x_p}\)` and error term `\(\epsilon\)` (*the Noise*). Supervised learners construct .KULbginline[predictive models]. <br> <br> .footnote[Picture taken from [Machine Learning for Everyone. In simple words. With real-world examples. Yes, again](https://vas3k.com/blog/machine_learning/)] ] .pull-right-alt[ .center[ <img src="img/supervised_unsupervised_drawing.jpg" width="100%" style="display: block; margin: auto;" /> ] ] --- name: unsupervised-learning # Unsupervised learning .pull-left-alt[ With unsupervised learning there is .KULbginline[NO] .orange[outcome or target] `\(\color{#FFA500}{Y}\)`, only the feature vector `\(\color{#e64173}{x = (x_1, \ldots, x_p)}\)`. Let `\(n\)` denote the sample size and `\(p\)` the number of features. 
Then, `\(\color{#e64173}{X}\)` is the `\(n \times p\)` matrix of features, with `\(\color{#e64173}{x}_{i,j}\)` the value of variable or feature `\(j\)` for observation `\(i\)`. Unsupervised learners construct .KULbginline[descriptive models], without any *supervising* output, letting the data "speak for itself". ] .pull-right-alt[ .center[ <img src="img/K-means_drawing.jpg" width="80%" height="55%" style="display: block; margin: auto;" /> ] .footnote[Picture taken from [Machine Learning for Everyone. In simple words. With real-world examples. Yes, again](https://vas3k.com/blog/machine_learning/)] ] --- class: clear <br> <br> .center[ <img src="img/supervised_unsupervised_robot.jpg" width="90%" style="display: block; margin: auto;" /> ] .footnote[Picture taken from [this source](https://twitter.com/athena_schools/status/1063013435779223553).] --- name: what's-in-a-name # What's in a name? .KULbginline[Machine learning] constructs algorithms that learn from data. -- .KULbginline[Statistical learning] emphasizes statistical models and the assessment of uncertainty. -- .KULbginline[Data science] applies mathematics, statistics, machine learning, engineering, etc. to extract knowledge from data. -- > *"Data Science is statistics on a Mac."* .center[ <img src="img/ElementsStatLearning.png" alt="Drawing" style="height: 250px;"/> <img src="img/ISL.png" alt="Drawing" style="height: 250px;"/> <img src="img/AppliedPredMod.png" alt="Drawing" style="height: 250px;"/> <img src="img/boehmke_greenwell.jpg" alt="Drawing" style="height: 250px;"/> <img src="img/molnar.png" alt="Drawing" style="height: 250px;"/> ] Source: Brandon M. Greenwell on [Introduction to Machine Learning in R
](https://github.com/bgreenwell/intro-ml-r). --- name: two-cultures # Statistical modeling: the two cultures Consider a vector of input variables `\(\color{#e64173}{x}\)`, being transformed into some vector of response variables `\(\color{#FFA500}{y}\)` via a black box algorithm. .center[ <img src="img/Breiman_nature.png" alt="Drawing" style="width: 300px;"/> ] -- .pull-left[ .KULbginline[Statistical learning or data modeling culture] * assume statistical model, estimate parameter values * validate with goodness-of-fit tests and residual inspection .center[ <img src="img/Breiman_data_modeling.png" alt="Drawing" style="width: 300px;"/> ] ] -- .pull-right[ .KULbginline[Machine learning or algo modeling culture] * inside of the box is complex and unknown * find algorithm `\(\color{#3b3b9a}{f}(\color{#e64173}{x})\)` to predict `\(\color{#FFA500}{y}\)` * measure performance by predictive accuracy .center[ <img src="img/Breiman_algo_modeling.png" alt="Drawing" style="width: 300px;"/> ] ] Source: Breiman (2001, Statistical Science) on *Statistical modeling: the two cultures.* --- name: statistical-machine-learning # Newspeak from the two cultures <br> <br> | Statistical learning | Machine learning :------:|:-------------------------:|:-------------------------: .KULbginline[origin] | statistics | computer science .KULbginline[*f(x)*] | model | algorithm .KULbginline[emphasis] | interpretability, precision and uncertainty | large scale applicability, prediction accuracy .KULbginline[jargon] | parameters, estimation | weights, learning .KULbginline[CI] | uncertainty of parameters | no notion of uncertainty .KULbginline[assumptions] | explicit a priori assumption | no prior assumption, learn from the data <br> Source: read the blog [Why a mathematician, statistician and machine learner solve the same problem differently](https://blog.galvanize.com/why-a-mathematician-statistician-machine-learner-solve-the-same-problem-differently-2/) --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ As discussed in the lecture, many problems in ML can be approached as a .KULbginline[regression], .KULbginline[classification] or .KULbginline[clustering] problem. <br> .hi-pink[Q]: consider the following .hi-pink[three problem settings] and .hi-pink[label them] as regression, classification or clustering. <br> 1. In disability insurance: how do disability rates depend on the state of the economy (e.g. GDP)? 2. In MTPL insurance: predict whether a claim is attritional or large, *in casu* a claim that exceeds the threshold of 100 000 EUR? 3. How can we group customers based on the insurance products they bought from the company? ] --- class: inverse, center, middle name: accuracy # Model accuracy and loss functions <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Predictive modeling How to use the observed data to learn or to estimate the unknown `\(\color{#3b3b9a}{f}(.)\)`? `$$\begin{eqnarray*} \color{#FFA500}{y} &=& \color{#3b3b9a}{f}(\color{#e64173}{x_1,x_2,\ldots,x_p})+\epsilon. \end{eqnarray*}$$` -- How do I .KULbginline[estimate] `\(\color{#3b3b9a}{f}(.)\)` - one way to phrase *all questions* that underly statistical & machine learning. -- .font140[.KULbginline[Take-aways]] .font160[
] - main reasons we want to .KULbginline[learn about] `\(\color{#3b3b9a}{f}(.)\)` -- .pull-left[ .font120[.KULbginline[prediction]] <br> predict the target `\(\color{#FFA500}{y}\)` as `\(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x})\)` <br> .font140[
] - as black box setting? <br> <br> .font120[.KULbginline[inference]] <br> how does target `\(\color{#FFA500}{y}\)` depend on features `\(\color{#e64173}{x}\)`? <br> .font140[
] - as white box setting? ] -- .pull-right[ <img src="img/prediction_inference.png" width="2200" style="display: block; margin: auto;" /> ] --- name: prediction-error # Prediction errors Why we're stuck with .KULbginline[irreducible error] assume `\(\hat{\color{#3b3b9a}{f}}\)` and `\(\color{#e64173}{x}\)` given, then $$ `\begin{aligned} \mathop{E}\left[ \left\{ \color{#FFA500}{y} - \hat{\color{#FFA500}{y}} \right\}^2 \right] &= \mathop{E}\left[ \left\{ \color{#3b3b9a}{f}(\color{#e64173}{x}) + \epsilon - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}) \right\}^2 \right] \\ &= \underbrace{\left[ \color{#3b3b9a}{f}(\color{#e64173}{x}) - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}) \right]^2}_{\text{Reducible}} + \underbrace{\mathop{\text{Var}} \left( \color{#e64173}{\epsilon} \right)}_{\text{Irreducible}} \end{aligned}` $$ In .KULbginline[less math]: - if `\(\epsilon\)` exists, then `\(\color{#e64173}{x}\)` cannot perfectly explain `\(\color{#FFA500}{y}\)` - so even if `\(\hat{\color{#3b3b9a}{f}} = \color{#3b3b9a}{f}\)`, we still have irreducible error. -- Thus, to form our .KULbginline[best predictors], we will .KULbginline[minimize reducible error]. --- name: model-accuracy # Model accuracy We assess .KULbginline[model] or .KULbginline[predictive accuracy] by evaluating how well predictions actually match observed data. -- Use .KULbginline[loss functions], i.e. metrics that compare predicted values to actual values. -- .pull-left[.KULbginline[Regression], use e.g. the .hi-pink[Mean Squared Error (MSE)] `$$\begin{eqnarray*} \frac{1}{n} \sum_{i=1}^n (\color{#FFA500}{y}_i - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_i))^2, \end{eqnarray*}$$` Recall: `\(\color{#FFA500}{y}_i - \hat{\color{#FFA500}{y}}_i = \color{#FFA500}{y}_i - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_i)\)` is the prediction error. Objective
: minimize!] -- .pull-right[.KULbginline[Classification], use e.g. the .hi-pink[cross-entropy] or .hi-pink[log loss] `$$\begin{eqnarray*} -\frac{1}{n} \sum_{i=1}^n \left(\color{#FFA500}{y}_i \cdot \log{(p_i)} + (1-\color{#FFA500}{y}_i) \cdot \log{(1-p_i)}\right). \end{eqnarray*}$$` with `\(p_i\)` the predicted probability that observation `\(i\)` belongs to class 1. Objective
: minimize! ] -- <br> .KULbginline[Many other useful loss functions] (e.g. deviance in regression, Gini index in classification). .font140[.KULbginline[Take-away]] .font160[
] - a loss function emphasizes certain types of errors over others, thus pick a meaningful one! --- class: inverse, center, middle name: overfit # Overfitting and bias-variance trade off <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Overfitting The .KULbginline[Signal and the Noise] discussion! -- Which of the following three models (in green-blue-ish) will best generalize to new data? <img src="ML_part1_files/figure-html/overfitting-linear-regression-1.svg" width="100%" style="display: block; margin: auto;" /> .footnote[Inspired by Brandon Greenwell's [Introduction to Machine Learning in
](https://github.com/bgreenwell/intro-ml-r).] --- # Overfitting (cont.) With a small training error, but large test error, the model is .KULbginline[overfitting] or working too hard! -- The expected value of the .hi-pink[test MSE]: `$$\begin{eqnarray*} E\left(\color{#FFA500}{y}_0-\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0)\right)^2 &=& \text{Var}(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0))+[\text{Bias}(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0))]^2+\text{Var}(\epsilon). \end{eqnarray*}$$` -- .font140[.KULbginline[In general]] - with more flexible methods * variance .font140[
↑] and bias .font140[↓
] * their relative rate of change determines whether the test error increases or decreases -- .font140[.KULbginline[Take-aways]] .font160[
] * U-shape curves of .hi-pink[test MSE] w.r.t model flexibility * the .hi-pink[bias-variance tradeoff] is central to quality prediction. --- # Bias-variance trade off <br> .center[ <img src="img/bias_variance_trade_off.png" alt="Drawing" style="width: 600px;"/> ] Source: James et al. (2021, 2nd edition) on [https://www.statlearning.com/](An introduction to statistical learning). --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Data are generated from: `\(\color{#FFA500}{y} = \color{#3b3b9a}{f}(\color{#e64173}{x})+\epsilon\)`, with the black curve as the true `\(\color{#3b3b9a}{f}\)`. The orange (linear regression), blue (smoothing splines) and green (smoothing splines) curves are three estimates for `\(\color{#3b3b9a}{f}\)`, with increasing level of complexity. .hi-pink[Q]: which model do you prefer (orange, blue, green) for each of the following examples? Why? .center[ <img src="img/2.9 ISL.png" alt="Drawing" style="width: 500px;"/> ] .footnote[Example from James et al. (2021) on [https://www.statlearning.com/](An introduction to statistical learning).] ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Data are generated from: `\(\color{#FFA500}{y} = \color{#3b3b9a}{f}(\color{#e64173}{x})+\epsilon\)`, with the black curve as the true `\(\color{#3b3b9a}{f}\)`. The orange (linear regression), blue (smoothing splines) and green (smoothing splines) curves are three estimates for `\(\color{#3b3b9a}{f}\)`, with increasing level of complexity. .hi-pink[Q]: which model do you prefer (orange, blue, green) for each of the following examples? Why? .center[ <img src="img/2.10 ISL.png" alt="Drawing" style="width: 500px;"/> ] .footnote[Example from James et al. (2021) on [https://www.statlearning.com/](An introduction to statistical learning).] ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Data are generated from: `\(\color{#FFA500}{y} = \color{#3b3b9a}{f}(\color{#e64173}{x})+\epsilon\)`, with the black curve as the true `\(\color{#3b3b9a}{f}\)`. The orange (linear regression), blue (smoothing splines) and green (smoothing splines) curves are three estimates for `\(\color{#3b3b9a}{f}\)`, with increasing level of complexity. .hi-pink[Q]: which model do you prefer (orange, blue, green) for each of the following examples? Why? .center[ <img src="img/2.11 ISL.png" alt="Drawing" style="width: 500px;"/> ] .footnote[Example from James et al. (2021) on [https://www.statlearning.com/](An introduction to statistical learning).] 
] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .font120[The *K*-nearest neighbors (KNN) classifier] - take the *K* observations in the training data set that are 'closest' to test observation `\(\color{#e64173}{x}_0\)`, calculate `$$\begin{eqnarray*} \text{Pr}(\color{#FFA500}{Y}=j|\color{#e64173}{X} = \color{#e64173}{x}_0) &=& \frac{1}{K} \sum_{i \in \mathcal{N}_0} \mathbb{I}(\color{#FFA500}{y}_i=j). \end{eqnarray*}$$` - KNN then assigns the test observation `\(\color{#e64173}{x}_0\)` to the class `\(j\)` with the highest probability, e.g. with *K=3* (from James et al., 2021) .center[ <img src="img/2.14 ISL.png" alt="Drawing" style="width: 300px;"/> ] .hi-pink[Q]: is KNN a supervised learning or unsupervised learning method? Discuss. ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .font120[The *K*-nearest neighbors (KNN) classifier (cont.)] Now compare KNN with *K* equals 1, 10 and 100. .center[ <img src="img/KNN_K_1.png" alt="Drawing" style="height: 250px;"/> <img src="img/KNN_K_10.png" alt="Drawing" style="height: 250px;"/> <img src="img/KNN_K_100.png" alt="Drawing" style="height: 250px;"/> ] .hi-pink[Q]: which classifier do you prefer? Which of these classifiers is under-fitting, which one is over-fitting? ] --- class: inverse, center, middle name: resampling # Data splitting and resampling methods <br> with {caret} and {rsample} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: ames # Ames Iowa housing data We will use the Ames Iowa housing data. There are 2,930 properties in the data set. The `Sale_Price` (target or response) was recorded along with 80 predictors, including: * location (e.g. neighborhood) and lot information * house components (garage, fireplace, pool, porch, etc.) * general assessments such as overall quality and condition * number of bedrooms, baths, and so on. More details in [De Cock (2011, Journal of Statistics Education)](http://ww2.amstat.org/publications/jse/v19n3/decock.pdf). The raw data are at [`http://bit.ly/2whgsQM`](http://bit.ly/2whgsQM) but we will use a processed version found in the [`AmesHousing`](https://github.com/topepo/AmesHousing) package. You will load the data with the `make_ames()` function from the `AmesHousing` library, and store the data in the object `ames`: ```r ames <- AmesHousing::make_ames() ``` --- name: data-splitting # Data splitting We fit our model on past data `\(\{(\color{#e64173}{x}_1,\color{#FFA500}{y}_1),(\color{#e64173}{x}_2,\color{#FFA500}{y}_2),\ldots, (\color{#e64173}{x}_n,\color{#FFA500}{y}_n)\}\)` and get `\(\hat{\color{#3b3b9a}{f}}\)`. *What we want*: how does our model .KULbginline[generalize] to new, unseen data `\((\color{#e64173}{x}_0,\color{#FFA500}{y}_0)\)`, or: is `\(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0)\)` close to `\(\color{#FFA500}{y}_0\)`? .left-column[ .KULbginline[Training set] * to develop, to train, to tune, to compare different settings, ... .KULbginline[Test set] * to obtain unbiased estimate of final model's performance. 
] .right-column[ <img src="img/data_splitting.png" width="70%" style="display: block; margin: auto;" /> .footnote[Picture taken from [Introduction to Machine Learning in
](https://github.com/bgreenwell/intro-ml-r).] ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r *set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Use `set.seed()` for reproducibility. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) *index_1 <- sample(1 : nrow(ames), * size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Sample indices from `1 : nrow(ames)` such that in total 70% of the records is selected. Vector `index_1` now stores the row numbers of the selected records. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) *train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Put the selected records in training set `train_1` by subsetting the original data frame `ames` with the row numbers stored in `index_1`. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] *test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Put the not selected records in test set `test_1`. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] *nrow(train_1)/nrow(ames) ## [1] 0.7 ``` ] .pull-right[ What is the ratio of the number of records in `train_1` versus original data set `ames`? ] --- # Data splitting in {caret} The {caret} package - short for Classification And REgression Training - contains functions to streamline the model training process for complex regression and classification problems. With the {caret} package, the function `createDataPartition` will do the job. .pull-left[ ```r *library(caret) *set.seed(123) index_2 <- caret::createDataPartition( y = ames$Sale_Price, p = 0.7, list = FALSE) train_2 <- ames[index_2, ] test_2 <- ames[-index_2, ] nrow(train_2)/nrow(ames) ``` ] .pull-right[ Load the library {caret}. Use `set.seed()` for reproducibility. ] --- # Data splitting in {caret} The {caret} package - short for Classification And REgression Training - contains functions to streamline the model training process for complex regression and classification problems. With the {caret} package, the function `createDataPartition` will do the job. .pull-left[ ```r library(caret) set.seed(123) *index_2 <- caret::createDataPartition( * y = ames$Sale_Price, * p = 0.7, * list = FALSE) train_2 <- ames[index_2, ] test_2 <- ames[-index_2, ] nrow(train_2)/nrow(ames) ``` ] .pull-right[ `createDataPartition` takes in `y` the vector of outcomes of the data set we wish to split. 
`createDataPartition` will do stratified sampling based on levels of `y` (for factor) or groups determined by the percentiles of `y` (for numeric). The percentage of data that goes to training is `p`. `list = FALSE` tells the function not to store the results in a list, but in a matrix (here: with 1 column) ] --- # Data splitting in {rsample} <img src="img/rsample.png" class="title-hex"> The {rsample} package, part of the {tidymodels} initiative of RStudio, is home to a wide variety of resampling functions. The documentation is at [rsample: the basics](https://tidymodels.github.io/rsample/articles/Basics.html). .pull-left[ ```r *library(rsample) *set.seed(123) split_1 <- rsample::initial_split(ames, prop = 0.7) train_3 <- training(split_1) test_3 <- testing(split_1) nrow(train_3)/nrow(ames) ``` ] .pull-right[ Load the `rsample` package. Use `set.seed()` for reproducibility. ] --- # Data splitting in {rsample} <img src="img/rsample.png" class="title-hex"> The {rsample} package, part of the {tidymodels} initiative of RStudio, is home to a wide very variety of resampling functions. The documentation is at [rsample: the basics](https://tidymodels.github.io/rsample/articles/Basics.html). .pull-left[ ```r library(rsample) set.seed(123) *split_1 <- rsample::initial_split(ames, prop = 0.7) train_3 <- training(split_1) test_3 <- testing(split_1) nrow(train_3)/nrow(ames) ``` ] .pull-right[ `initial_split` from the {rsample} package. Split the data `ames` into a training set and testing set. `prop` is the proportion of data to be retained as training ] --- # Data splitting in {rsample} <img src="img/rsample.png" class="title-hex"> The {rsample} package, part of the {tidymodels} initiative of RStudio, is home to a wide very variety of resampling functions. The documentation is at [rsample: the basics](https://tidymodels.github.io/rsample/articles/Basics.html). .pull-left[ ```r library(rsample) set.seed(123) split_1 <- rsample::initial_split(ames, prop = 0.7) *train_3 <- training(split_1) *test_3 <- testing(split_1) nrow(train_3)/nrow(ames) ``` ] .pull-right[ The result of `rsample::initial_split` is an `rset` object. It is stored in `split_1` and ready for inspection. Apply the functions `training` and `test` to this object to extract the data in each split. ] --- # Data splitting comparison As a check, we plot the `Sale_Price` as available in the train (in black) vs test (in red) data sets, created by each of the three demonstrated methods. <img src="ML_part1_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[Validation set] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - we hold out a subset of the training data (e.g. 30%) and then evaluate the model on this held out validation set - calculate the loss function on this validation set, as approximation of the true test error - .font140[
] high variability + inefficient use of data - picture .hi-KUL[validation set (30%)] and .hi-pink[training set (70%)] <img src="ML_part1_files/figure-html/plot-validation-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[*k* fold cross validation] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - divide training data into *k* equally sized groups (e.g. .hi-KUL[group 1] on the picture) - iterate over the *k* groups, treating each as validation set once (and train model on the other *k-1* groups) (e.g. get .hi-KUL[MSE<sub>1</sub>] corresponding to fold 1) - average the folds' loss to estimate the true test error - .font140[
] greater accuracy (compared to validation set). <img src="ML_part1_files/figure-html/plot-CV-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[*k* fold cross validation] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - divide training data into *k* equally sized groups (e.g. .hi-KUL[group 1] on the picture) - iterate over the *k* groups, treating each as validation set once (and train model on the other *k-1* groups) (e.g. get .hi-KUL[MSE<sub>1</sub>] corresponding to fold 1) - average the folds' loss to estimate the true test error - .font140[
] greater accuracy (compared to validation set). <img src="ML_part1_files/figure-html/plot-CV3-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[*k* fold cross validation] (picture from [Boehmke & Greenwell](https://koalaverse.github.io/homlr/)) .center[ <img src="img/k_fold_CV.png" width="95%" style="display: block; margin: auto;" /> ] --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[Leave-one-out cross validation] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - each observation takes a turn as the validation set (e.g. get .hi-KUL[MSE<sub>3</sub>]) - other *n-1* observations are the training set - average the folds' loss to estimate the true test error - .font140[
] very computationally demanding. <img src="ML_part1_files/figure-html/plot-LOOCV-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods in {caret} We set up 5-fold cross validation using the {caret} package. .pull-left[ ```r set.seed(123) *cv_folds <- caret::createFolds(y = ames$Sale_Price, * k = 5, list = TRUE, * returnTrain = TRUE) ``` ```r str(cv_folds) ``` ``` ## List of 5 ## $ Fold1: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold2: int [1:2343] 2 3 4 6 7 8 9 11 13 14 ... ## $ Fold3: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold4: int [1:2344] 1 3 5 6 10 11 12 13 14 15 ... ## $ Fold5: int [1:2345] 1 2 4 5 7 8 9 10 11 12 ... ``` ] .pull-right[ The `createFolds` function from {caret} splits the data into `k` groups. `list = TRUE` indicates that the results should be stored in a list `returnTrain = TRUE` indicates that the values returned (and stored) in the elements of the list are - per fold - the row numbers of the observations selected for training. ] --- # Resampling methods in {caret} We set up 5-fold cross validation using the {caret} package. .pull-left[ ```r set.seed(123) cv_folds <- caret::createFolds(y = ames$Sale_Price, k = 5, list = TRUE, returnTrain = TRUE) ``` ```r *str(cv_folds) ``` ``` ## List of 5 ## $ Fold1: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold2: int [1:2343] 2 3 4 6 7 8 9 11 13 14 ... ## $ Fold3: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold4: int [1:2344] 1 3 5 6 10 11 12 13 14 15 ... ## $ Fold5: int [1:2345] 1 2 4 5 7 8 9 10 11 12 ... ``` ] .pull-right[ Inspect the list `cv_folds` that was returned by `createFolds(.)`. This list has `k` elements, each storing the row numbers of the observations in the training set of the fold under consideration. ] --- # Resampling methods in {caret} <img src="img/purrr.png" class="title-hex"> .pull-left[ ```r *mean(ames[cv_folds$Fold1, ]$Sale_Price) ``` ``` ## [1] 180954.3 ``` ```r map_dbl(cv_folds, function(x) { mean(ames[x, ]$Sale_Price) }) ``` ``` ## Fold1 Fold2 Fold3 Fold4 Fold5 ## 180954.3 180781.8 180646.4 180563.0 181034.7 ``` ] .pull-right[ We calculate the average `Sale_Price` per fold, that is: we average the `Sale_Price` over all observations selected in the training set of a particular fold. That would go as follows, for `Fold1` in the list `cv_folds` ```r mean(ames[cv_folds$Fold1, ]$Sale_Price) ``` and similarly for `Fold2`, ..., `Fold5`. ] --- # Resampling methods in {caret} <img src="img/purrr.png" class="title-hex"> .pull-left[ ```r mean(ames[cv_folds$Fold1, ]$Sale_Price) ``` ``` ## [1] 180954.3 ``` ```r *map_dbl(cv_folds, * function(x) { * mean(ames[x, ]$Sale_Price) * }) ``` ``` ## Fold1 Fold2 Fold3 Fold4 Fold5 ## 180954.3 180781.8 180646.4 180563.0 181034.7 ``` ] .pull-right[ We apply the function `mean(ames[___, ]$Sale_Price)` over all `k` elements of the list `cv_folds`. `map_dbl(.x, .f)` is one of the `map` functions from the {purrr} package (part of {tidyverse}), used for functional programming in R. `map_dbl(.x, .f)` applies function `.f` to each element of list `.x`. The result is a double-precision vector, hence `map_dbl` and not just `map`. Btw, it is a historical anomaly that R has two names for its floating-point vectors, `double` and `numeric`. 
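A quick, hypothetical aside (toy values, not from the Ames data) to see what `map_dbl()` does:

```r
# map_dbl() applies a function to every element of a list
# and returns a numeric (double) vector
purrr::map_dbl(list(a = 1:3, b = 4:6), mean)  # returns c(a = 2, b = 5)
```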
] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) *cv_rsample <- rsample::vfold_cv(ames, v = 5) cv_rsample$splits ``` ``` ## [[1]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[2]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[3]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[4]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[5]] ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ] .pull-right[ The function `vfold_cv` splits the data into `v` groups (called folds) of equal size. ] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) cv_rsample <- rsample::vfold_cv(ames, v = 5) *cv_rsample$splits ``` ``` ## [[1]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[2]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[3]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[4]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[5]] ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ] .pull-right[ The function `vfold_cv` splits the data into `v` groups (called folds) of equal size. We store the result of `vfold_cv` in the object `cv_rsample`. The resulting object stores `v` resamples of the original data set. ] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) cv_rsample <- rsample::vfold_cv(ames, v = 5) ``` ```r *cv_rsample$splits[[1]] ``` ``` ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ```r cv_rsample$splits[[1]] %>% analysis() %>% dim() ``` ``` ## [1] 2344 81 ``` ```r cv_rsample$splits[[1]] %>% assessment() %>% dim() ``` ``` ## [1] 586 81 ``` ] .pull-right[ Inspect the composition of the first resample: 2,344 (out of 2,930) observations go to the analysis data (for training, i.e. `v-1` folds), 586 (out of 2,930) observations go to the assessment data (for testing, the final fold). ] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) cv_rsample <- rsample::vfold_cv(ames, v = 5) ``` ```r cv_rsample$splits[[1]] ``` ``` ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ```r *cv_rsample$splits[[1]] %>% analysis() %>% dim() ``` ``` ## [1] 2344 81 ``` ```r *cv_rsample$splits[[1]] %>% assessment() %>% dim() ``` ``` ## [1] 586 81 ``` ] .pull-right[ Inspect the composition of the first resample: get the dimensions (`dim()`) of the analysis data (`analysis()`) of the first resample get the dimensions (`dim()`) of the assessment data (`assessment()`) of the first resample. ] --- # Resampling methods in {rsample} <img src="img/purrr.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r map_dbl(cv_rsample$splits, function(x) { mean(rsample::analysis(x)$Sale_Price) }) ``` ``` ## [1] 181310.8 180991.0 180840.0 181268.6 179569.9 ``` ```r map_dbl(cv_rsample$splits, function(x) { nrow(rsample::analysis(x)) }) ``` ``` ## [1] 2344 2344 2344 2344 2344 ``` ] .pull-right[ As before, use `map_dbl(.x, .f)` to apply a function `.f` over all elements of a list `.x`. Here the list is stored in `cv_rsample$splits`, with `v = 5` elements. 
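The same pattern also works with {purrr}'s formula shorthand; a hypothetical one-liner for the assessment sets (output not shown) would be:

```r
# average Sale_Price in each assessment (holdout) fold, using the ~ .x shorthand
map_dbl(cv_rsample$splits, ~ mean(rsample::assessment(.x)$Sale_Price))
```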
] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .hi-pink[Q]: Now you're going to combine data splitting and resampling to create training, validation and test folds in the Ames data. Use `caret` or `rsample` and make the validation folds of the same size as the test fold. ] --- class: clear .pull-left[ with `caret` ```r set.seed(5678) ind_caret <- caret::createDataPartition( y = ames$Sale_Price, p = 5/6, list = FALSE) train_caret <- ames[ind_caret, ] test_caret <- ames[-ind_caret, ] cv_caret <- caret::createFolds( y = train_caret$Sale_Price, k = 5, list = TRUE, returnTrain = FALSE) purrr::map_dbl(cv_caret, ~ nrow(ames[., ])) ## Fold1 Fold2 Fold3 Fold4 Fold5 ## 488 488 489 489 489 nrow(test_caret) ## [1] 487 ``` ] .pull-right[ with `rsample` ```r set.seed(5678) ind_rsample <- rsample::initial_split(ames, prop = 5/6) train_rsample <- rsample::training(ind_rsample) test_rsample <- rsample::testing(ind_rsample) cv_rsample <- rsample::vfold_cv(train_rsample, v = 5) map_dbl(cv_rsample$splits, ~ nrow(rsample::assessment(.))) ## [1] 489 488 488 488 488 nrow(test_rsample) ## [1] 489 ``` ] --- class: inverse, center, middle name: tuning # Parameter tuning with {caret}, {rsample} and {purrr} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Tuning parameters .pull-left[Finding the optimal level of flexibility highlights the bias-variance tradeoff. .hi-pink[Bias] : the error that comes from inaccurately estimating `\(\color{#3b3b9a}{f}\)`. .hi-pink[Variance] : the amount `\(\hat{\color{#3b3b9a}{f}}\)` would change with a different training sample. .font140[.KULbginline[Take-aways]] .font160[
] : high variance models more prone to overfitting * use .hi-pink[resampling methods] to reduce this risk * hyperparameters (or *tuning parameters*) control complexity, and thus the bias-variance trade-off * identify their optimal setting, e.g. with a *grid search* * no analytic expression for these hyperparameters. ] .pull-right[ <img src="ML_part1_files/figure-html/bias-variance-knn-1.svg" width="100%" style="display: block; margin: auto;" /><img src="ML_part1_files/figure-html/bias-variance-knn-2.svg" width="100%" style="display: block; margin: auto;" /> .footnote[Code from Boehmke & Greenwell (2019, Chapter 2) on [Hands-on machine learning with R](https://koalaverse.github.io/homlr/).] ] --- # Tuning parameters via grid search .pull-left-alt[ .center[ <img src="img/flow_chart_applied_predictive_modeling.jpg" width="85%" style="display: block; margin: auto;" /> ] ] .pull-right-alt[ .hi-pink[Model training & validation phase] - define a set of candidate values (a *grid*) - assess model utility across the candidates (use clever *resampling*) - choose the optimal settings (optimize *loss*) - refit the model on entire training data with final tuning parameters - evaluate performance of the model on the test data (under
). .hi-pink[Model selection] - repeat the above steps for different models - compare how well these models generalize to new data (via test data, under
). .footnote[Flow chart from Kuhn & Johnson (2013) on [Applied predictive modeling](http://appliedpredictivemodeling.com/).] ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) *cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ Use `trainControl` from {caret} to set some control parameters that will be used in the actual `train` function. Here, we use `method = cv` and `number = 5` for 5-fold cross validation. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, * returnResamp = "all", * selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ In `trainControl` we put `returnResamp = "all"` to store all resampled summary metrics. `selectionFunction = "best"` specifies how we select the optimal tuning parameter. With `"best"` the value that minimizes the performance (here: RMSE) is selected. Alternative: `selectionFunction = "oneSE"` applies the one standard error rule. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") *hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ Set the grid of *K*-values that will be searched. `expand.grid` creates a data frame with one row for each value of *K* to consider. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) *knn_fit <- train(y ~ x, data = df, method = "knn", * trControl = cv, * tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ {caret} will `train` the method `knn` using the settings in `trControl = cv`, across the values of *K* stored in `tuneGrid = hyper_grid`. The data `df` and formula `y ~ x` are used. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) *knn_fit$bestTune ``` ``` ## k ## 18 36 ``` ] .pull-right[ We retrieve the optimal value of the tuning parameter, according to the `selectionFunction`. For the folds created here and with `selectionFunction = "best"` the optimal *K* value is 36. What happens when you change to `selectionFunction = "oneSE"`? ] --- # Training a model with {caret} .pull-left[ ``` ## k ## 18 36 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-67-1.svg" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## k ## 27 54 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-68-1.svg" style="display: block; margin: auto;" /> ] --- # Training a model with {rsample} <img src="img/rsample.png" class="title-hex"> <img src="img/purrr.png" class="title-hex"> .pull-left[ Our starting point is the simulated data stored in `df`, resampled with 5-fold cross-validation. 
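(`df` itself is not created on these slides; purely as an assumption, a minimal sketch of a comparable simulated data set with columns `x` and `y` could be:)

```r
# hypothetical stand-in for the simulated data frame df (358 rows, columns x and y)
set.seed(42)
df <- data.frame(x = runif(358, 0, 2 * pi))
df$y <- sin(df$x) + rnorm(358, sd = 0.3)
```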
```r set.seed(123) # for reproducibility cv_rsample <- vfold_cv(df, 5) cv_rsample$splits[1:3] ``` ``` ## [[1]] ## <Analysis/Assess/Total> ## <286/72/358> ## ## [[2]] ## <Analysis/Assess/Total> ## <286/72/358> ## ## [[3]] ## <Analysis/Assess/Total> ## <286/72/358> ``` ] .pull-right[ We fit the *KNN* on the holdout data in split *s*, using a given *K* value. ```r holdout_results <- function(s, k_val) { # Fit the model to the analysis data in split s df_train <- analysis(s) mod <- knnreg(y ~ x, k = k_val, data = df_train) # Get the remaining group holdout <- assessment(s) # Get predictions with the holdout data set res <- predict(mod, newdata = holdout) # Return observed and predicted values # on holdout set res <- tibble(obs = holdout$y, pred = res) res } ``` ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Now you're going to combine the .hi-pink[resampling and model fitting instructions] and set up a first example of .hi-pink[tuning a parameter] over a grid of possible values: the *K* in a .hi-pink[KNN regression model]. .hi-pink[Q]: use the function `holdout_results(.s, .k)` as defined on the previous sheet. You will use this function to calculate the RMSE<sub>k</sub> of fold *k*. 1. Specify a grid of values of *K*, store it in `hyper_grid`. Use `expand.grid(.)` 2. Pick one of the resamples stored in `cv_rsample$splits` and pick a value from the grid. Calculate the RMSE on the holdout data of this split. 3. For all values in the tuning grid, calculate the RMSE averaged over all folds, and the corresponding standard error. 4. Use the results from .hi-pink[Q.3] to pick the value of *K* via minimal RMSE. 5. Pick the largest value of *K* such that the corresponding RMSE is below the minimal RMSE from .hi-pink[Q.4] plus its corresponding SE. ] --- class: clear .pull-left[ .hi-pink[Q.1] We set up the grid ```r hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) hyper_grid %>% slice(1:3) ``` | k| |--:| | 2| | 4| | 6| .hi-pink[Q.2] We apply the function `holdout_results(.s, .k)` on the third resample, with the first value for *K* in the grid. ```r res <- holdout_results(cv_rsample$splits[[3]], hyper_grid[1, ]) sqrt(sum((res$obs - res$pred)^2)/nrow(res)) ``` ``` ## [1] 0.3608923 ``` ] .pull-right[ .hi-pink[Q.3] Mean RMSE over the 5 folds and corresponding SE. 
```r RMSE <- numeric(nrow(hyper_grid)) SE <- numeric(nrow(hyper_grid)) for(i in 1:nrow(hyper_grid)){ cv_rsample$results <- map(cv_rsample$splits, holdout_results, hyper_grid[i, ]) res <- map_dbl(cv_rsample$results, function(x) mean((x$obs - x$pred)^2)) RMSE[i] <- mean(sqrt(res)) ; SE[i] <- sd(sqrt(res)) } ``` .hi-pink[Q.4] Choose *K* via minimal RMSE <table> <thead> <tr> <th style="text-align:right;"> RMSE </th> <th style="text-align:right;"> SE </th> <th style="text-align:right;"> k </th> <th style="text-align:right;"> lower </th> <th style="text-align:right;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.2917121 </td> <td style="text-align:right;"> 0.0247127 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 0.2669995 </td> <td style="text-align:right;"> 0.3164248 </td> </tr> </tbody> </table> .hi-pink[Q.5] Choose *K* via the one-standard-error rule <table> <thead> <tr> <th style="text-align:right;"> RMSE </th> <th style="text-align:right;"> SE </th> <th style="text-align:right;"> k </th> <th style="text-align:right;"> lower </th> <th style="text-align:right;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.3157639 </td> <td style="text-align:right;"> 0.0284855 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 0.2872784 </td> <td style="text-align:right;"> 0.3442495 </td> </tr> </tbody> </table> ] --- class: clear <img src="ML_part1_files/figure-html/one-SE-rule-1.svg" style="display: block; margin: auto;" /> --- # Putting it all together .pull-left[ During the tuning process we inspect plots like the one on the right. .font140[.KULbginline[Take-aways]] .font160[
] *Less is more*: - we prefer simple over more complex - choose tuning parameters based on the numerically optimal value .KULbginline[OR] - choose a simpler model that is within a certain tolerance of the numerically best value - use the .hi-pink['one-standard-error' rule]. With the selected tuning parameters, we refit the model on the complete training set and use it to predict the test set (under
). ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-79-1.svg" width="85%" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle name: engineering # Target and feature engineering: <br> data pre-processing steps <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # What is feature engineering? .pull-left-alt[ .center[ <img src="img/feature_engineering_Kuhn.jpg" alt="Drawing" style="width: 165px;"/> <br> <br> <img src="img/boehmke_greenwell.jpg" alt="Drawing" style="width: 165px;"/> ] ] .pull-right-alt[ Feature engineering: - applies .hi-pink[pre-processing steps] to predictor (features) variables - .hi-pink[creates new input features] from your existing ones (e.g. network features derived from a social network in a fraud detection model). Target engineering: * transforms the response variable (or target) to improve the performance of a predictive model. The goal is to .KULbginline[make models more effective]. See Kuhn & Johnson (2019) on [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/) for a detailed discussion. ] --- name: black-white-box class: clear .font140[.KULbginline[Take-aways]] .font160[
] : *different models* have *different sensitivities* to the type of target and feature values in the model. .center[ <img src="img/Applied_Pred_overview_models.jpg" width="65%" style="display: block; margin: auto;" /> ] Source: Kuhn & Johnson (2013) on [Applied predictive modeling](http://appliedpredictivemodeling.com/). --- # Target engineering <img src="img/rsample.png" class="title-hex"> We load the `ames` data set from the {AmesHousing} package and apply a .hi-pink[stratified split] of the data into a training (70%) and test (30%) set. We stratify on the distribution of the target variable `Sale_Price` using the `strata` argument in `rsample::initial_split`. ```r ames <- AmesHousing::make_ames() set.seed(123) split <- rsample::initial_split(ames, prop = 0.7, `strata = "Sale_Price"`) ames_train <- rsample::training(split) ames_test <- rsample::testing(split) ``` We check the distribution of `Sale_Price` in both `ames_train` and `ames_test`. ```r summary(ames_train$Sale_Price) summary(ames_test$Sale_Price) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 12789 129500 160000 180923 213500 755000 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 35000 129500 160000 180502 213500 745000 ``` --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Inference with linear models often assumes that the target is generated from a normal distribution. .hi-pink[Q]: let's examine whether the `Sale_Price` target satisfies this assumption. 1. Plot a histogram of `Sale_Price`. Is normality a meaningful assumption? 2. Try some transformation functions such that the transformed target approaches a normal distribution. ] --- class: clear .pull-left[ .hi-pink[Q.1] original target <img src="ML_part1_files/figure-html/unnamed-chunk-85-1.svg" width="80%" style="display: block; margin: auto;" /> ```r summary(ames_train$Sale_Price) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 12789 129500 160000 180923 213500 755000 ``` ] .pull-right[ .hi-pink[Q.2] log-transformed target <img src="ML_part1_files/figure-html/unnamed-chunk-87-1.svg" width="80%" style="display: block; margin: auto;" /> ```r summary(log(ames_train$Sale_Price)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 9.456 11.771 11.983 12.020 12.271 13.534 ``` ] --- # Feature engineering steps Examples of common pre-processing steps: * Some models (e.g. KNN, Lasso, neural networks) require that the predictor variables are on the same scale. <br> .hi-pink[Centering (C)] and .hi-pink[scaling (S)] the predictors can be used for this purpose. * Other models are very sensitive to correlations between the predictors and filters or PCA signal extraction can improve these models. * Some models find .hi-pink[(near) zero-variance (NZV)] predictors problematic, and these should be removed before fitting the model. * In other cases, the data should be .hi-pink[encoded] in a specific way to make sure all predictors are numeric (e.g. one-hot encoding of factor variables in neural networks). * Many models cannot cope with .hi-pink[missing data] so .hi-pink[imputation strategies] might be necessary. * Development of new features that represent something important to the outcome. * (add your own example here!) This list is inspired by Max Kuhn (2019) on [Applied Machine Learning](https://github.com/topepo/aml-london-2019). 
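---

# Feature engineering steps (cont.)

To make a few of these steps concrete, here is a minimal, illustrative sketch with base R and {caret}; the object names are ours and the blueprint we actually build in the next slides relies on {recipes}.

```r
library(caret)

# numeric predictors (leave the target Sale_Price untouched)
num_vars <- setdiff(names(ames_train)[sapply(ames_train, is.numeric)], "Sale_Price")

# centering (C) and scaling (S): estimate the constants on the training data only
pre_proc <- caret::preProcess(ames_train[, num_vars], method = c("center", "scale"))
train_num_scaled <- predict(pre_proc, newdata = ames_train[, num_vars])
test_num_scaled  <- predict(pre_proc, newdata = ames_test[, num_vars]) # reuse the training constants

# one-hot encoding of a categorical feature, e.g. House_Style
house_style_dummies <- model.matrix(~ House_Style - 1, data = ames_train)
```

Estimating the centering and scaling constants on the training data only, and re-using them on the test data, is exactly what avoids the data leakage discussed on the next slide.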
--- # A blueprint for feature engineering .font140[.KULbginline[Take-aways]] .font160[
] : *a proper implementation* .pull-left[ * draft a .hi-pink[blueprint] of the necessary pre-processing steps, and their order * [Boehme & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/engineering.html#proper-implementation) suggest 1. Filter out zero or near-zero variance features. <br> 2. Perform imputation if required. <br> 3. Normalize to resolve numeric feature skewness. <br> 4. Standardize (center and scale) numeric features. <br> 5. Perform dimension reduction (e.g., PCA) on <br> numeric features. <br> 6. One-hot or dummy encode categorical features. ] .pull-right[ * avoid .hi-pink[data leakage] in the pre-processing steps when applied to resampled data sets! .center[ <img src="img/data_leakage.png" width="100%" style="display: block; margin: auto;" /> ] ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ We already detected the necessity of log-transforming `Sale_Price` when building linear models. We add another pre-processing step, inspired by the .hi-pink[high cardinality] feature `Neighborhood`. ```r ames_train %>% group_by(Neighborhood) %>% summarize(n_obs = n()) %>% arrange(n_obs) %>% slice(1:4) ``` <table> <thead> <tr> <th style="text-align:left;"> Neighborhood </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Landmark </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Green_Hills </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Greens </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Blueste </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-92-1.svg" width="95%" style="display: block; margin: auto;" /> ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ We'll use `recipe()` from the {recipes} package. The main idea is to .hi-pink[preprocess multiple datasets] using a single `recipe()`. Before we start, keep the following .KULbginline[fundamentals] of {recipes} in mind! ] -- .pull-right[ Creating a `recipe` takes the following steps: * get the *ingredients* (`recipe()`): specify the response and predictor variables * write the recipe (`step_zzz()`): define the *pre-processing steps*, such as imputation, creating dummy variables, scaling, and more * *prepare* the recipe (`prep()`): provide a dataset to base each step on (e.g. *calculate* constants to do centering and scaling) * *bake* the recipe (`bake()`): *apply* the pre-processing steps to your datasets. .footnote[Source: [Rebecca Barter's blog](http://www.rebeccabarter.com/blog/2019-06-06_pre_processing/)] ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ Use `recipe()` to create the preprocessing blueprint (to be applied later) ```r library(recipes) mod_rec <- recipe(Sale_Price ~ ., data = ames_train) mod_rec ``` Now, `mod_rec` knows the role of each variable (`predictor` or `outcome`). We can use selectors such as `all_predictors()`, `all_outcomes()` or `all_nominal()`. ] .pull-right[ Extend `mod_rec` with two pre-processing steps: `step_log(all_outcomes())` `step_other(Neighborhood, threshold = 0.05)` to lump the levels that occur in less than 5% of data as "other". 
```r mod_rec <- mod_rec %>% step_log(all_outcomes()) %>% step_other(Neighborhood, threshold = 0.05) mod_rec ``` ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .center[ <img src="img/recipes_verbs.png" width="100%" style="display: block; margin: auto;" /> ] .footnote[Source Max Kuhn (2019) on [Applied Machine Learning](https://github.com/topepo/aml-london-2019).] Now that we have a preprocessing *specification*, we run on it on the `ames_train` to *prepare* (or `prep()`) the recipe. ```r mod_rec_trained <- prep(mod_rec, training = ames_train, verbose = TRUE, retain = TRUE) ``` ```r mod_rec_trained <- prep(mod_rec, training = ames_train, verbose = TRUE, retain = TRUE) ## oper 1 step log [training] ## oper 2 step other [training] ## The retained training set is ~ 0.82 Mb in memory. ``` The `retain = TRUE` indicates that the preprocessed training set should be saved. --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> ```r mod_rec_trained ``` Once the recipe is prepared, it can be applied to any data set using `bake()`. There is no need to `bake()` the data used in the `prep()` step; you get the processed training set with `juice()`. ```r ames_test_prep <- bake(mod_rec_trained, new_data = ames_test) ``` --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ ```r ames_test_prep %>% group_by(Neighborhood) %>% summarize(n_obs = n()) %>% arrange(n_obs) ``` <table> <thead> <tr> <th style="text-align:left;"> Neighborhood </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Sawyer </td> <td style="text-align:right;"> 43 </td> </tr> <tr> <td style="text-align:left;"> Northridge_Heights </td> <td style="text-align:right;"> 50 </td> </tr> <tr> <td style="text-align:left;"> Gilbert </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> Somerset </td> <td style="text-align:right;"> 60 </td> </tr> <tr> <td style="text-align:left;"> Edwards </td> <td style="text-align:right;"> 63 </td> </tr> <tr> <td style="text-align:left;"> College_Creek </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> Old_Town </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> North_Ames </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> other </td> <td style="text-align:right;"> 331 </td> </tr> </tbody> </table> ] .pull-right[ ```r juice(mod_rec_trained) %>% group_by(Neighborhood) %>% summarize(n_obs = n()) %>% arrange(n_obs) ``` <table> <thead> <tr> <th style="text-align:left;"> Neighborhood </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Sawyer </td> <td style="text-align:right;"> 108 </td> </tr> <tr> <td style="text-align:left;"> Gilbert </td> <td style="text-align:right;"> 111 </td> </tr> <tr> <td style="text-align:left;"> Northridge_Heights </td> <td style="text-align:right;"> 116 </td> </tr> <tr> <td style="text-align:left;"> Somerset </td> <td style="text-align:right;"> 122 </td> </tr> <tr> <td style="text-align:left;"> Edwards </td> <td style="text-align:right;"> 131 </td> </tr> <tr> <td style="text-align:left;"> Old_Town </td> <td style="text-align:right;"> 164 </td> </tr> <tr> <td style="text-align:left;"> College_Creek </td> <td style="text-align:right;"> 199 </td> </tr> <tr> <td style="text-align:left;"> North_Ames </td> <td style="text-align:right;"> 306 
</td> </tr> <tr> <td style="text-align:left;"> other </td> <td style="text-align:right;"> 792 </td> </tr> </tbody> </table> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Now you will extend the existing recipe in `mod_rec`, prepare and bake it again! .hi-pink[Q]: consult the [{recipes} manual](https://tidymodels.github.io/recipes/reference/index.html) and specify a recipe for the housing data that includes the following pre-processing steps (in this order) * log-transform the outcome variable * remove any zero-variance predictors * lump factor levels that occur in <= 5% of data as "other" for both `Neighborhood` as well as `House_Style` * center and scale all numeric features. 1. Specify the above recipe on the training set and store it in the object `mod_rec`. 2. Inspect the object `mod_rec` using `summary(mod_rec)`. What can you learn from this summary? 3. Prepare the recipe on the training data and then apply it to the test set. ] --- class: clear .pull-left[ First, let's try to get a grasp of the `House_Style` feature as well as the presence of zero-variance predictors. ```r ames_train %>% group_by(House_Style) %>% summarize(n_obs = n()) %>% arrange(n_obs) ``` <table> <thead> <tr> <th style="text-align:left;"> House_Style </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Two_and_Half_Fin </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> One_and_Half_Unf </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> Two_and_Half_Unf </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> SFoyer </td> <td style="text-align:right;"> 61 </td> </tr> <tr> <td style="text-align:left;"> SLvl </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> One_and_Half_Fin </td> <td style="text-align:right;"> 214 </td> </tr> <tr> <td style="text-align:left;"> Two_Story </td> <td style="text-align:right;"> 609 </td> </tr> <tr> <td style="text-align:left;"> One_Story </td> <td style="text-align:right;"> 1036 </td> </tr> </tbody> </table> ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-107-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear To detect the presence of zero-variance and near-zero-variance features the `caret` library has the function `nearZeroVar` ```r library(caret) nzv <- caret::nearZeroVar(ames_train, saveMetrics = TRUE) ``` ```r names(ames_train)[nzv$zeroVar] ``` ``` ## character(0) ``` ```r names(ames_train)[nzv$nzv] ``` ``` ## [1] "Street" "Alley" "Land_Contour" ## [4] "Utilities" "Land_Slope" "Condition_2" ## [7] "Roof_Matl" "Bsmt_Cond" "BsmtFin_Type_2" ## [10] "BsmtFin_SF_2" "Heating" "Low_Qual_Fin_SF" ## [13] "Kitchen_AbvGr" "Functional" "Enclosed_Porch" ## [16] "Three_season_porch" "Screen_Porch" "Pool_Area" ## [19] "Pool_QC" "Misc_Feature" "Misc_Val" ``` So, no features have zero- variance, but 20 features have near-zero-variance. 
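To see why these features are flagged, you can dig a bit deeper into the `saveMetrics = TRUE` output: a feature is marked `nzv` when its frequency ratio (most common value over second most common) exceeds `freqCut` (default 95/5) and its percentage of unique values is below `uniqueCut` (default 10). A minimal sketch, reusing the `nzv` object from above:

```r
library(dplyr)

# inspect the metrics behind the near-zero-variance flags
nzv %>%
  tibble::rownames_to_column("feature") %>%
  filter(nzv) %>%                  # keep the flagged features only
  arrange(desc(freqRatio)) %>%     # most extreme features first
  slice(1:5)
```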
--- class: clear .pull-left[ We put the recipe together with the following steps ```r mod_rec <- recipe(Sale_Price ~ ., data = ames_train) %>% step_log(all_outcomes()) %>% step_other(Neighborhood, threshold = 0.05) %>% step_other(House_Style, threshold = 0.05) %>% step_zv(all_predictors()) %>% step_nzv(all_predictors()) %>% step_center(all_numeric(), -all_outcomes()) %>% step_scale(all_numeric(), -all_outcomes()) summary(mod_rec) %>% slice(1:6) ``` <table> <thead> <tr> <th style="text-align:left;"> variable </th> <th style="text-align:left;"> type </th> <th style="text-align:left;"> role </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> MS_SubClass </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> MS_Zoning </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Lot_Frontage </td> <td style="text-align:left;"> double , numeric </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Lot_Area </td> <td style="text-align:left;"> integer, numeric </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Street </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Alley </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> </tbody> </table> ] .pull-right[ ```r mod_rec ``` ] --- class: clear .pull-left[ We prep the recipe on `ames_train` ```r mod_rec_trained <- prep(mod_rec, training = ames_train, verbose = TRUE, retain = TRUE) ## oper 1 step log [training] ## oper 2 step other [training] ## oper 3 step other [training] ## oper 4 step zv [training] ## oper 5 step nzv [training] ## oper 6 step center [training] ## oper 7 step scale [training] ## The retained training set is ~ 0.75 Mb in memory. 
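# note: one 'oper' line appears per step in the recipe
# (log, other, other, zv, nzv, center, scale), applied in this order to ames_train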
``` and bake it on the `ames_test` data ```r ames_test_prep <- bake(mod_rec_trained, new_data = ames_test) ``` We inspect the processed training and test set ```r dim(juice(mod_rec_trained)) ``` ``` ## [1] 2049 60 ``` ] .pull-right[ Verify that `Sale_Price` is log-transformed (but not centred and scaled) ```r head(juice(mod_rec_trained)$Sale_Price) head(ames_train$Sale_Price) head(ames_test_prep$Sale_Price) head(ames_test$Sale_Price) ``` ``` ## [1] 11.57 11.39 11.70 11.74 11.12 11.63 ``` ``` ## [1] 105500 88000 120000 125000 67500 112000 ``` ``` ## [1] 11.56 12.15 12.18 12.16 12.37 12.15 ``` ``` ## [1] 105000 189900 195500 191500 236500 189000 ``` ```r levels(juice(mod_rec_trained)$House_Style) ``` ```r levels(ames_test_prep$House_Style) ``` ``` ## [1] "One_and_Half_Fin" "One_Story" ## [1] "Two_Story" "other" ``` ``` ## [1] "One_and_Half_Fin" "One_Story" ## [1] "Two_Story" "other" ``` ] --- # Putting it all together {rsample} and {recipes} <img src="img/recipes.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> Let's redo the KNN example, with centering and scaling of the x-feature, by combining {rsample}/{caret} with a recipe. .pull-left[ ```r # get the simulated data set.seed(123) # for reproducibility x <- seq(from = 0, to = 2 * pi, length = 500) y <- sin(x) + rnorm(length(x), sd = 0.3) df <- data.frame(x, y) %>% filter(x < 4.5) ``` ```r # specify the recipe library(recipes) rec <- recipe(y ~ x, data = df) rec <- rec %>% step_center(all_predictors()) %>% step_scale(all_predictors()) ``` ] .pull-right[ ```r # doing this on complete data set df rec_df <- prep(rec, training = df) mean(juice(rec_df)$x) # centered! ## [1] 1.473e-16 sd(juice(rec_df)$x) # scaled! ## [1] 1 ``` ```r # now we combine the recipe with rsample steps library(rsample) set.seed(123) # for reproducibility cv_rsample <- vfold_cv(df, 5) ``` ```r # we apply the steps in the recipe to each fold library(purrr) cv_rsample$recipes <- map(cv_rsample$splits, prepper, recipe = rec) # check `?prepper` ``` ] --- # Putting it all together {rsample} and {recipes} <img src="img/recipes.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> Let's redo the KNN example, with centering and scaling of the x-feature, by combining {rsample}/{caret} with a recipe. .pull-left[ Now you can inspect `cv_rsample` as follows ```r cv_rsample$recipes[[1]] juice(cv_rsample$recipes[[1]]) bake(cv_rsample$recipes[[1]], new_data = assessment(cv_rsample$splits[[1]])) ``` ] .pull-right[ ```r holdout_results <- function(s, rec, k_val) { # Fit the model to the analysis data in split s df_train <- juice(rec) mod <- knnreg(y ~ x, k = k_val, data = df_train) # Get the remaining group holdout <- bake(rec, new_data = assessment(s)) # Get predictions with the holdout data set res <- predict(mod, newdata = holdout) # Return observed and predicted values # on holdout set res <- tibble(obs = holdout$y, pred = res) res } ``` ```r res <- holdout_results(cv_rsample$splits[[2]], cv_rsample$recipes[[2]], k_val = 58) sqrt(sum((res$obs - res$pred)^2)/nrow(res)) ## [1] 0.3505 ``` ] --- # Putting it all together {rsample} and {recipes} <img src="img/recipes.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> Let's redo the KNN example, with centering and scaling of the x-feature, by combining {rsample}/{caret} with a recipe. 
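The tuning loop below assumes a `hyper_grid` with one candidate value of `k` per row, as set up earlier in the module for the KNN example; a minimal sketch of such a grid (the exact values are an assumption here):

```r
# candidate numbers of neighbours to evaluate (illustrative values)
hyper_grid <- data.frame(k = seq(2, 150, by = 4))
nrow(hyper_grid)
```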
```r RMSE <- numeric(nrow(hyper_grid)) SE <- numeric(nrow(hyper_grid)) for(i in 1:nrow(hyper_grid)){ cv_rsample$results <- map2(cv_rsample$splits, cv_rsample$recipes, holdout_results, hyper_grid[i, ]) res <- map_dbl(cv_rsample$results, function(x) mean((x$obs - x$pred)^2)) RMSE[i] <- mean(sqrt(res)) ; SE[i] <- sd(sqrt(res)) } ``` --- class: inverse, center, middle name: regression # Regression models in R and <br> tidy model output with {broom} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: models-in-R # Creating models in R The **formula** interface using R's [formula rules](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Formulae-for-statistical-models) to specify a *symbolic* representation of the terms: * response ~ variable, with `model_fn` referring to the specific model function you want to use, e.g. `lm` for linear regression ```r model_fn(Sale_Price ~ Gr_Liv_Area, data = ames) ``` * response ~ variable_1 + variable_2 ```r model_fn(Sale_Price ~ Gr_Liv_Area + Neighborhood, data = ames) ``` * response ~ variable_1 + variable_2 + their interaction ```r model_fn(Sale_Price ~ Gr_Liv_Area + Neighborhood + Neighborhood:Gr_Liv_Area, data = ames) ``` * shorthand for all predictors ```r model_fn(Sale_Price ~ ., data = ames) ``` --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will now fit some linear regression models on the `ames` housing data. <br> <br> You will explore the model fits with `base` R instructions as well as the functionalities offered by the {broom} package. <br> <br> .hi-pink[Q]: load the `ames` housing data set via `ames <- AmesHousing::make_ames()` 1. Fit a linear regression model with `Sale_Price` as response and `Gr_Liv_Area` as covariate. Store the resulting object as `model_1`. 2. Repeat your instruction, but now put it between brackets. What happens? 3. Inspect `model_1` with the following set of instructions - `summary(___)` - extract the fitted coefficients, using `___$coefficients` - what happens with `summary(___)$coefficients`? - extract fitted values, using `___$fitted.values` - now try to extract the R<sup>2</sup> of this model. ] --- class: clear .hi-pink[Q.1] Linear model with `Sale_Price` as a function of `Gr_Live_Area` ```r model_1 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames) ``` .hi-pink[Q.3] Check `model_1` - What happens - do you *like* this display? ```r summary(model_1) ``` Now let's extract some meaningful information from `model_1` (using `base` R instructions) .pull-left[ ```r model_1$coefficients ``` ``` ## (Intercept) Gr_Liv_Area ## 13289.6 111.7 ``` ```r summary(model_1)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 13289.6 3269.703 4.064 4.941e-05 ## Gr_Liv_Area 111.7 2.066 54.061 0.000e+00 ``` ] .pull-right[ ```r head(model_1$fitted.values) ``` ``` ## 1 2 3 4 5 6 ## 198255 113367 161731 248964 195239 192447 ``` ```r summary(model_1)$r.squared ``` ``` ## [1] 0.4995 ``` ] --- # Tidy model output <img src="img/broom.png" class="title-hex"> The package {broom} allows to summarize key information about statistical objects (e.g. a linear regression model) in so-called tidy tibbles. This makes it easy to report results, create plots and consistently work with large numbers of models at once. 
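As a quick sketch of that last point (assuming the `ames` data from the exercise above is still in memory), you could fit one linear model per `House_Style` and stack all tidied coefficient tables into a single tibble:

```r
library(dplyr)
library(purrr)
library(broom)

# one linear model per house style, all coefficient tables in one tibble
ames %>%
  split(.$House_Style) %>%
  map_dfr(~ lm(Sale_Price ~ Gr_Liv_Area, data = .x) %>% tidy(),
          .id = "House_Style")
```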
We briefly illustrate the three essential verbs of {broom}: `tidy()`, `glance()` and `augment()`. ```r model_1 %>% broom::tidy() ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 13289.6 </td> <td style="text-align:right;"> 3269.703 </td> <td style="text-align:right;"> 4.064 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Gr_Liv_Area </td> <td style="text-align:right;"> 111.7 </td> <td style="text-align:right;"> 2.066 </td> <td style="text-align:right;"> 54.061 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ```r model_1 %>% broom::glance() ``` <table> <thead> <tr> <th style="text-align:right;"> r.squared </th> <th style="text-align:right;"> adj.r.squared </th> <th style="text-align:right;"> sigma </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> logLik </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> BIC </th> <th style="text-align:right;"> deviance </th> <th style="text-align:right;"> df.residual </th> <th style="text-align:right;"> nobs </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.4995 </td> <td style="text-align:right;"> 0.4994 </td> <td style="text-align:right;"> 56524 </td> <td style="text-align:right;"> 2923 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -36218 </td> <td style="text-align:right;"> 72442 </td> <td style="text-align:right;"> 72460 </td> <td style="text-align:right;"> 9.355e+12 </td> <td style="text-align:right;"> 2928 </td> <td style="text-align:right;"> 2930 </td> </tr> </tbody> </table> --- # Tidy model output <img src="img/broom.png" class="title-hex"> The package {broom} allows to summarize key information about statistical objects (e.g. a linear regression model) in so-called tidy tibbles. This makes it easy to report results, create plots and consistently work with large numbers of models at once. We briefly illustrate the three essential verbs of `broom`: `tidy()`, `glance()` and `augment()`. 
```r model_1 %>% broom::augment() %>% slice(1:5) ``` <table> <thead> <tr> <th style="text-align:right;"> Sale_Price </th> <th style="text-align:right;"> Gr_Liv_Area </th> <th style="text-align:right;"> .fitted </th> <th style="text-align:right;"> .resid </th> <th style="text-align:right;"> .hat </th> <th style="text-align:right;"> .sigma </th> <th style="text-align:right;"> .cooksd </th> <th style="text-align:right;"> .std.resid </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 215000 </td> <td style="text-align:right;"> 1656 </td> <td style="text-align:right;"> 198255 </td> <td style="text-align:right;"> 16745 </td> <td style="text-align:right;"> 4e-04 </td> <td style="text-align:right;"> 56533 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.2963 </td> </tr> <tr> <td style="text-align:right;"> 105000 </td> <td style="text-align:right;"> 896 </td> <td style="text-align:right;"> 113367 </td> <td style="text-align:right;"> -8367 </td> <td style="text-align:right;"> 8e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.1481 </td> </tr> <tr> <td style="text-align:right;"> 172000 </td> <td style="text-align:right;"> 1329 </td> <td style="text-align:right;"> 161731 </td> <td style="text-align:right;"> 10269 </td> <td style="text-align:right;"> 4e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.1817 </td> </tr> <tr> <td style="text-align:right;"> 244000 </td> <td style="text-align:right;"> 2110 </td> <td style="text-align:right;"> 248964 </td> <td style="text-align:right;"> -4964 </td> <td style="text-align:right;"> 8e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.0879 </td> </tr> <tr> <td style="text-align:right;"> 189900 </td> <td style="text-align:right;"> 1629 </td> <td style="text-align:right;"> 195239 </td> <td style="text-align:right;"> -5339 </td> <td style="text-align:right;"> 4e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.0945 </td> </tr> </tbody> </table> --- class: clear .pull-left[ ```r g_lm_1 <- ggplot(data = ames, aes(Gr_Liv_Area, Sale_Price)) + theme_bw() + geom_point(size = 1, alpha = 0.3) + geom_smooth(se = TRUE, method = "lm") + #scale_y_continuous(labels = scales::dollar) + ggtitle("Regression with AMES housing data") g_lm_1 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-162-1.svg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ ```r g_lm_2 <- model_1 %>% broom::augment() %>% ggplot(aes(Gr_Liv_Area, Sale_Price)) + theme_bw() + geom_point(size = 1, alpha = 0.3) + geom_line(aes(y = .fitted), col = KULbg) + #scale_y_continuous(labels = scales::dollar) + ggtitle("Regression with AMES housing data") g_lm_2 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-164-1.svg" width="70%" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle # Generalized Linear Models <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Linear and Generalized Linear Models .pull-left-alt[ .center[ <img src="img/esbjorn_GLM.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/de_jong_GLM.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ With .hi-pink[linear regression models] `lm(.)` - model specification 
`$$\begin{eqnarray*} \color{#FFA500}{Y} = \color{#e64173}{x}^{'}\color{#20B2AA}{\beta} + \epsilon. \end{eqnarray*}$$` - `\(\epsilon\)` is normally distributed with mean 0 and common variance `\(\sigma^2\)`, thus: `\(\color{#FFA500}{Y}\)` is normal with mean `\(\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}\)` and variance `\(\sigma^2\)` With .hi-pink[generalized linear regression models] `glm(.)` - model specification `$$\begin{eqnarray*} g(E[\color{#FFA500}{Y}]) = \color{#e64173}{x}^{'}\color{#20B2AA}{\beta}. \end{eqnarray*}$$` - `\(g(.)\)` is the link function - `\(\color{#FFA500}{Y}\)` follows a distribution from the exponential family. ] --- name: mtpl # Motor Third Party Liability data <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> We will use the Motor Third Party Liability data set. There are 163,231 policyholders in this data set. The frequency of claiming (`nclaims`) and corresponding severity (`avg`, the amount paid on average per claim reported by a policyholder) are the .KULbginline[target variables] in this data set. Predictor variables are: * the exposure-to-risk, the duration of the insurance coverage (max. 1 year) * factor variables, e.g. gender, coverage, fuel * continuous, numeric variables, e.g. age of the policyholder, age of the car * spatial information: postal code (in Belgium) of the municipality where the policyholder resides. More details in [Henckaerts et al. (2018, Scandinavian Actuarial Journal)](https://katrienantonio.github.io/projects/2019/06/13/machine-learning/#data-driven) and [Henckaerts et al. (2020, North American Actuarial Journal)](https://katrienantonio.github.io/projects/2019/06/13/machine-learning/#tree-based-pricing). --- name: mtpl # Motor Third Party Liability data <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> You can load the data from the `data` folder as follows: ```r # install.packages("rstudioapi") dir <- dirname(rstudioapi::getActiveDocumentContext()$path) setwd(dir) mtpl_orig <- read.table('../data/PC_data.txt', header = TRUE, stringsAsFactors = TRUE) mtpl_orig <- as_tibble(mtpl_orig) ``` Alternatively, you can also go for: ```r # install.packages("here") dir <- here::here() setwd(dir) mtpl_orig <- read.table('../data/PC_data.txt', header = TRUE, stringsAsFactors = TRUE) mtpl_orig <- as_tibble(mtpl_orig) ``` Some basic exploratory steps with this data follow on the next sheet. 
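Note that a third loading option, assuming the `data` folder sits directly under the project root, is to let {here} build the full path, so that no `setwd()` call is needed at all:

```r
# install.packages("here")
# here::here() locates the project root (e.g. via the .Rproj file or .git folder)
mtpl_orig <- read.table(here::here("data", "PC_data.txt"),
                        header = TRUE, stringsAsFactors = TRUE)
mtpl_orig <- as_tibble(mtpl_orig)
```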
--- name: mtpl # Motor Third Party Liability data <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> Note that the data `mtpl_orig` uses capitals for the variable names ```r mtpl_orig %>% slice(1:3) %>% dplyr::select(-LONG, -LAT) ``` <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> NCLAIMS </th> <th style="text-align:right;"> AMOUNT </th> <th style="text-align:right;"> AVG </th> <th style="text-align:right;"> EXP </th> <th style="text-align:left;"> COVERAGE </th> <th style="text-align:left;"> FUEL </th> <th style="text-align:left;"> USE </th> <th style="text-align:left;"> FLEET </th> <th style="text-align:left;"> SEX </th> <th style="text-align:right;"> AGEPH </th> <th style="text-align:right;"> BM </th> <th style="text-align:right;"> AGEC </th> <th style="text-align:right;"> POWER </th> <th style="text-align:right;"> PC </th> <th style="text-align:left;"> TOWN </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1618 </td> <td style="text-align:right;"> 1618 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> PO </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> </tbody> </table> We change this to lower case variables, and rename `exp` to `expo`. 
```r mtpl <- mtpl_orig %>% rename_all(tolower) %>% rename(expo = exp) names(mtpl) ## [1] "id" "nclaims" "amount" "avg" "expo" "coverage" ## [7] "fuel" "use" "fleet" "sex" "ageph" "bm" ## [13] "agec" "power" "pc" "town" "long" "lat" ``` --- class: clear name: first-steps-MTPL .pull-left[ ```r dim(mtpl) ``` ``` ## [1] 163231 18 ``` ```r mtpl %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.1393 </td> </tr> </tbody> </table> ```r mtpl %>% group_by(sex) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484 </td> </tr> <tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361 </td> </tr> </tbody> </table> ] .pull-right[ ```r g <- ggplot(mtpl, aes(nclaims)) + theme_bw() + geom_bar(aes(weight = expo), alpha = .5, col = KULbg, fill = KULbg) + labs(y = "Abs freq (in exposure)") + ggtitle("MTPL - number of claims") g ``` <img src="ML_part1_files/figure-html/unnamed-chunk-173-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ To get warmed up, let's load the `mtpl` data and do some .KULbginline[basic investigations] into the variables. The idea is to get a feel for the data. .hi-pink[Q]: you will work through the following exploratory steps. 1. Visualize the distribution of the `ageph` with a histogram. 2. For each age recorded in the data set `mtpl`: what is the total number of observations, the total exposure, and the corresponding total number of claims reported? 3. Calculate the empirical claim frequency, per unit of exposure, for each age and picture it. Discuss this figure. 4. Repeat the above for `bm`, the level occupied by the policyholder in the Belgian bonus- malus scale. 
] --- class: clear .pull-left[ .hi-pink[Q.1] a histogram of `ageph` ```r ggplot(data = mtpl, aes(ageph)) + theme_bw() + geom_histogram(binwidth = 2, alpha = .5, col = KULbg, fill = KULbg) + labs(y = "Absolute frequency") + ggtitle("MTPL - age policyholder") ``` <img src="ML_part1_files/figure-html/unnamed-chunk-174-1.svg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ .hi-pink[Q.2] for each `ageph` recorded ```r mtpl %>% group_by(ageph) %>% summarize(tot_claims = sum(nclaims), tot_expo = sum(expo), tot_obs = n()) ``` <table> <thead> <tr> <th style="text-align:right;"> ageph </th> <th style="text-align:right;"> tot_claims </th> <th style="text-align:right;"> tot_expo </th> <th style="text-align:right;"> tot_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4.622 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:right;"> 93.022 </td> <td style="text-align:right;"> 116 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 113 </td> <td style="text-align:right;"> 342.285 </td> <td style="text-align:right;"> 393 </td> </tr> <tr> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 165 </td> <td style="text-align:right;"> 597.389 </td> <td style="text-align:right;"> 701 </td> </tr> <tr> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 202 </td> <td style="text-align:right;"> 778.827 </td> <td style="text-align:right;"> 952 </td> </tr> <tr> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 297 </td> <td style="text-align:right;"> 1165.359 </td> <td style="text-align:right;"> 1379 </td> </tr> <tr> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 426 </td> <td style="text-align:right;"> 1752.249 </td> <td style="text-align:right;"> 2028 </td> </tr> </tbody> </table> ] --- class: clear .pull-left[ .hi-pink[Q.3] for each `ageph` recorded ```r freq_by_age <- mtpl %>% group_by(ageph) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ggplot(data = freq_by_age, aes(x = ageph, y = emp_freq)) + theme_bw() + geom_bar(stat = 'identity', alpha = .5, color = KULbg, fill = KULbg) + ggtitle('MTPL - empirical claim freq per age policyholder') ``` .hi-pink[Q.4] recycle the above instructions and replace `ageph` with `bm` ] .pull-right[ <br> <img src="ML_part1_files/figure-html/unnamed-chunk-178-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- # Generalized Linear Models (GLMs) Modeling claim .hi-pink[frequency] and .hi-pink[severity] in the `mtpl` data set. .pull-left[ Target variable `nclaims` (frequency) <img src="ML_part1_files/figure-html/unnamed-chunk-179-1.svg" width="65%" style="display: block; margin: auto;" /> Suitable distributions: Poisson, Negative Binomial. ] .pull-right[ ... and `avg` (severity). <img src="ML_part1_files/figure-html/unnamed-chunk-180-1.svg" width="65%" style="display: block; margin: auto;" /> Suitable distributions: log-normal, gamma. ] --- # A Poisson GLM .pull-left[ ```r freq_glm_1 <- `glm`(nclaims ~ sex, offset = log(expo), `family = poisson(link = "log")`, `data = mtpl`) ``` ] .pull-right[ Fit a .KULbginline[Poisson GLM], with .KULbginline[logarithmic link] function. 
This implies: `\(\color{#FFA500}{Y}\)` ~ Poisson, with `$$\begin{eqnarray*} \log(E[\color{#FFA500}{Y}]) &=& \color{#e64173}{x}^{'}\color{#20B2AA}{\beta}, \end{eqnarray*}$$` or, `$$E[\color{#FFA500}{Y}] = \exp{(\color{#e64173}{x}^{'}\color{#20B2AA}{\beta})}.$$` Fit this model on `data = mtpl`. ] --- # A Poisson GLM (cont.) .pull-left[ ```r freq_glm_1 <- glm(`nclaims ~ sex`, `offset = log(expo)`, family = poisson(link = "log"), data = mtpl) ``` ] .pull-right[ Use `nclaims` as `\(\color{#FFA500}{Y}\)`. Use `gender` as the only (factor) variable in the linear predictor. Include `log(expo)` as an offset term in the linear predictor. Then, `$$\begin{eqnarray*} \color{#e64173}{x}^{'}\color{#20B2AA}{\beta} = \log{(\texttt{expo})}+\beta_0 + \beta_1 \mathbb{I}(\texttt{male}). \end{eqnarray*}$$` Put otherwise, `$$\begin{eqnarray*} E[\color{#FFA500}{Y}] = \texttt{expo} \cdot \exp{(\beta_0 + \beta_1 \mathbb{I}(\texttt{male}))} \end{eqnarray*},$$` where `\(\texttt{expo}\)` refers to `expo` the exposure variable. ] --- class: clear .pull-left[ ```r freq_glm_1 <- glm(nclaims ~ sex, `offset = log(expo)`, family = poisson(link = "log"), data = mtpl) ``` ```r freq_glm_1 %>% broom::tidy() ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.9076 </td> <td style="text-align:right;"> 0.0133 </td> <td style="text-align:right;"> -143.186 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> sexmale </td> <td style="text-align:right;"> -0.0866 </td> <td style="text-align:right;"> 0.0157 </td> <td style="text-align:right;"> -5.523 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> Mind the specification of `type.predict` when using `augment` with a GLM! ```r freq_glm_1 %>% broom::augment(type.predict = "response") ``` <table> <thead> <tr> <th style="text-align:right;"> nclaims </th> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> .fitted </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361 </td> </tr> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484 </td> </tr> </tbody> </table> ] .pull-right[ The `predict` function of a GLM object offers 3 options: `"link"`, `"response"` or `"terms"`. The same options hold when `augment()` is applied to a GLM object. Let's see how the fitted values at `"response"` level are constructed: ```r exp(coef(freq_glm_1)[1]) ## (Intercept) ## 0.1484 exp(coef(freq_glm_1)[1] + coef(freq_glm_1)[2]) ## (Intercept) ## 0.1361 ``` Do you recognize these numbers? Last step: try `freq_glm_1 %>% glance()` or `summary(freq_glm_1)` for deviances. ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will further explore GLMs in R with the `glm(.)` function. .hi-pink[Q]: continue with the `freq_glm_1` object that was created, you will now explicitly call the `predict()` function on this object. 1. Verify the arguments of `predict.glm` using `? predict.glm`. 2. 
The help reveals the following structure `predict(.object, .newdata, type = ("..."))` where `.object` is the fitted GLM object, `.newdata` is (optionally) a data frame to look for the features used in the model, and `type` is `"link"`, `"response"` or `"terms"`. <br> Use `predict` with `freq_glm_1` and a newly created data frame. <br> Explore the different options for `type`, and their connections. 3. Fit a gamma GLM for `avg` (the claim severity) with log link. <br> Use `sex` as the only variable in the model. What do you conclude? ] --- class: clear .pull-left[ .hi-pink[Q.1] You can access the documentation via `? predict.glm`. .hi-pink[Q.2] You create new data frames (or tibbles) as follows ```r male_driver <- data.frame(expo = 1, sex = "male") female_driver <- data.frame(expo = 1, sex = "female") ``` Next, you apply `predict` with the GLM object `freq_glm_1` and one of these data frames, e.g. ```r predict(freq_glm_1, newdata = male_driver, type = "response") ``` ``` ## 1 ## 0.1361164 ``` ] .pull-right[ .hi-pink[Q.2] Next, you apply `predict` with the GLM object `freq_glm_1` and one of these data frames, e.g. ```r predict(freq_glm_1, newdata = male_driver, type = "response") ``` ``` ## 1 ## 0.1361164 ``` At the level of the linear predictor: ```r predict(freq_glm_1, newdata = male_driver, type = "link") ``` ``` ## 1 ## -1.994245 ``` ```r exp(predict(freq_glm_1, newdata = male_driver, type = "link")) ``` ``` ## 1 ## 0.1361164 ``` ] --- class: clear .hi-pink[Q.3] For the gamma regression model ```r sev_glm_1 <- glm(avg ~ sex, family = Gamma(link = "log"), data = mtpl) sev_glm_1 ``` ``` ## ## Call: glm(formula = avg ~ sex, family = Gamma(link = "log"), data = mtpl) ## ## Coefficients: ## (Intercept) sexmale ## 7.5730 -0.2581 ## ## Degrees of Freedom: 18294 Total (i.e. Null); 18293 Residual ## (144936 observations deleted due to missingness) ## Null Deviance: 46690 ## Residual Deviance: 46440 AIC: 299700 ``` --- class: inverse, center, middle # Generalized Additive Models with {mgcv} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Generalized Additive Models (GAMs) .pull-left-alt[ .center[ <img src="img/GAM_Hastie_Tibshirani.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/GAM_Wood.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ With .hi-pink[GLMs] `glm(.)` - transformation of the mean modelled with a linear predictor `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}$$` - not well suited for continuous risk factors that relate to the response in a non-linear way. With .hi-pink[Generalized Additive Models (GAMs)] - the predictor allows for smooth effects of continuous risk factors and spatial covariates, next to the linear terms, e.g. `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}+\sum_j f_j(x_j) + f(\texttt{lat}, \texttt{long})$$` - predictor is still additive - preferred R package is {mgcv} by Simon Wood. ] --- # A Poisson GAM .pull-left[ We continue working with `mtpl` and now focus on `ageph`. <img src="ML_part1_files/figure-html/unnamed-chunk-201-1.svg" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ We will now explore .hi-pink[four different model specifications]: 1. `ageph` as linear effect in `glm` 2. `ageph` as factor variable in `glm` 3. `ageph` split manually into bins using `cut`, then used as factor in `glm` 4. a smooth effect of `ageph` in `mgcv::gam`. Let's go! 
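The panels on the right of the next four slides show each fitted effect with 95% confidence bounds. A minimal sketch of such a plot for model 1, assuming the age grid `a` (defined right below) and the prediction object `pred_glm_age` (constructed on the next slide) are available:

```r
library(ggplot2)

# collect the fitted effect of ageph and its 95% bounds on the link scale
plot_df <- data.frame(
  ageph = a,
  fit   = as.vector(pred_glm_age$fit),
  lwr   = as.vector(pred_glm_age$fit - qnorm(0.975) * pred_glm_age$se.fit),
  upr   = as.vector(pred_glm_age$fit + qnorm(0.975) * pred_glm_age$se.fit)
)

ggplot(plot_df, aes(x = ageph, y = fit)) + theme_bw() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line() +
  labs(y = "fitted effect of ageph (link scale)")
```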
Grid of observed `ageph` values ```r a <- min(mtpl$ageph):max(mtpl$ageph) ``` ] --- class: clear .pull-left[ .hi-pink[Model 1]: linear effect of `ageph` ```r freq_glm_age <- glm(nclaims ~ `ageph`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age <- predict(freq_glm_age, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age <- pred_glm_age$fit l_glm_age <- pred_glm_age$fit - qnorm(0.975)*pred_glm_age$se.fit u_glm_age <- pred_glm_age$fit + qnorm(0.975)*pred_glm_age$se.fit df <- data.frame(a, b_glm_age, l_glm_age, u_glm_age) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-205-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 2]: `ageph` as factor variable in `glm` ```r freq_glm_age_f <- glm(nclaims ~ `as.factor(ageph)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age_f <- predict(freq_glm_age_f, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age_f <- pred_glm_age_f$fit l_glm_age_f <- pred_glm_age_f$fit - qnorm(0.975)*pred_glm_age_f$se.fit u_glm_age_f <- pred_glm_age_f$fit + qnorm(0.975)*pred_glm_age_f$se.fit df <- data.frame(a, b_glm_age_f, l_glm_age_f, u_glm_age_f) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-208-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 3]: `ageph` split into 5-year bins and then used in `glm` ```r *level <- seq(min(mtpl$ageph), max(mtpl$ageph), by = 5) freq_glm_age_c <- glm(nclaims ~ `cut(ageph, level)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age_c <- predict(freq_glm_age_c, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age_c <- pred_glm_age_c$fit l_glm_age_c <- pred_glm_age_c$fit - qnorm(0.975)*pred_glm_age_c$se.fit u_glm_age_c <- pred_glm_age_c$fit + qnorm(0.975)*pred_glm_age_c$se.fit df <- data.frame(a, b_glm_age_c, l_glm_age_c, u_glm_age_c) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-211-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 4]: smooth effect of `ageph` in `mgcv::gam` ```r library(mgcv) freq_gam_age <- gam(nclaims ~ `s(ageph)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_gam_age <- predict(freq_gam_age, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_gam_age <- pred_gam_age$fit l_gam_age <- pred_gam_age$fit - qnorm(0.975)*pred_gam_age$se.fit u_gam_age <- pred_gam_age$fit + qnorm(0.975)*pred_gam_age$se.fit df <- data.frame(a, b_gam_age, l_gam_age, u_gam_age) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-214-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .hi-pink[Model 4] (revisited): picture smooth effect of `ageph` in `mgcv::gam` with built-in `plot`. ```r library(mgcv) freq_gam <- gam(nclaims ~ s(ageph), offset = log(expo), family = poisson(link = "log"), data = mtpl) plot(freq_gam, scheme = 4) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-215-1.svg" width="35%" height="10%" style="display: block; margin: auto;" /> --- # More on GAMs .pull-left[ So, a GAM is a GLM where the linear predictor depends on .hi-pink[smooth functions] of covariates. 
Consider a GAM with the following predictor: `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}+f_j(x_j).$$` GAMs use .hi-pink[basis functions] to estimate the smooth effect `\(f_j(.)\)` `$$f_j(x_j) = \sum_{m=1}^M \beta_{jm} b_{jm}(x_j),$$` where the `\(b_{jm}(x)\)` are known basis functions and `\(\beta_{jm}\)` are coefficients that have to be estimated. ] .pull-right[ GAMs avoid overfitting by adding a .hi-pink[wiggliness penalty] to the likelihood `$$\int \left( f_j(x)'' \right)^2 = \beta_j^t \mathbf{S}_j\beta_j.$$` GAMs then balance goodness-of-fit and wiggliness via `$$- \log \mathcal{L}(\beta,\beta_j) + \lambda_j \cdot \beta_j^t \mathbf{S}_j\beta_j,$$` with `\(\lambda_j\)` the .hi-pink[smoothing parameter]. The smoothing parameter `\(\lambda_j\)` controls the trade-off between fit & smoothness. ] --- class: clear Let's run some experiments to illustrate the effect of the smoothing parameter ( `sp = .`), the number (`k = .`) and type of basis functions (`bs = .`). We use the `mcycle` data from {MASS}. <img src="ML_part1_files/figure-html/unnamed-chunk-216-1.svg" style="display: block; margin: auto;" /> --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will further explore GAMs in R with the `gam(.)` function from the {mgcv} package. .hi-pink[Q]: you will combine insights from building `glm` as well as `gam` objects by working through the following coding steps. 1. Fit a `gam` including some factor variables as well as a smooth effect of `ageph` and `bm`. Visualize the fitted smooth effects. 2. Specify risk profiles of drivers. Calculate their expected annual claim frequency from the constructed `gam`. 3. Explain (in words) which profiles would represent high vs low risk according to the constructed model. ] --- class: clear .pull-left[ .hi-pink[Q.1] examine the following `gam` fit ```r freq_gam_2 <- gam(nclaims ~ sex + fuel + use + s(ageph) + s(bm), offset = log(expo), data = mtpl, family = poisson(link = "log")) ``` ```r summary(freq_gam_2) ## ## Family: poisson ## Link function: log ## ## Formula: ## nclaims ~ sex + fuel + use + s(ageph) + s(bm) ## ## Parametric coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.917790 0.018124 -105.817 <2e-16 *** ## sexmale 0.009167 0.016043 0.571 0.5677 ## fuelgasoline -0.152730 0.015100 -10.114 <2e-16 *** ## usework -0.055345 0.033090 -1.673 0.0944 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df Chi.sq p-value ## s(ageph) 5.451 6.502 228.8 <2e-16 *** ## s(bm) 8.323 8.778 1227.3 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.0155 Deviance explained = 2.85% ## UBRE = -0.46446 Scale est. 
= 1 n = 163231 ``` ] .pull-right[ ```r plot(freq_gam_2, select = 1) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-219-1.svg" width="45%" style="display: block; margin: auto;" /> ```r plot(freq_gam_2, select = 2) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-220-1.svg" width="45%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Q.2] define some risk profiles ```r drivers <- data.frame(expo = c(1, 1, 1), sex = c("female", "female", "female"), fuel = c("diesel", "diesel", "diesel"), use = c("private", "private", "private"), ageph = c(18, 45, 65), bm = c(20, 5, 0)) drivers ``` <table> <thead> <tr> <th style="text-align:right;"> expo </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> fuel </th> <th style="text-align:left;"> use </th> <th style="text-align:right;"> ageph </th> <th style="text-align:right;"> bm </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ] .pull-right[ Now, you predict the annual expected claim frequency for these profiles. ```r predict(freq_gam_2, newdata = drivers, type = "response") ``` <table> <thead> <tr> <th style="text-align:right;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.4031766 </td> </tr> <tr> <td style="text-align:right;"> 0.1727503 </td> </tr> <tr> <td style="text-align:right;"> 0.0951317 </td> </tr> </tbody> </table> ] --- class: inverse, center, middle # Regularized (G)LMs with {glmnet} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Statistical learning with sparsity .pull-left-alt[ .center[ <img src="img/sparsity_book.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/boehmke_greenwell.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ Why? * Sort through the mass of information and bring it down to .hi-pink[its bare essentials]. * One form of simplicity is .hi-pink[sparsity]. * Only a relatively small number of predictors play a role. How? .KULbginline[Automatic feature selection!] * Fit a model with all *p* predictors, but constrain or .hi-pink[regularize] the coefficient estimates. * Shrinking the coefficient estimates can significantly reduce their variance. * Some types of shrinkage put some of the coefficients .KULbginline[exactly equal to zero]! 
] --- # Ridge and lasso (least squares) regression .pull-left[ .KULbginline[Ridge] considers the least-squares optimization problem `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 = \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS} \end{eqnarray*}$$` subject to a .hi-pink[budget constraint] `$$\begin{eqnarray*} \sum_{j=1}^p \beta_j^2 \leq \color{#e64173}{t}, \end{eqnarray*}$$` i.e. an `\(\ell_2\)` penalty. Shrinks the coefficient estimates (not the intercept) towards zero. ] .pull-right[ .KULbginline[Lasso] considers the least-squares optimization problem `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 = \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS} \end{eqnarray*}$$` subject to a .hi-pink[budget constraint] `$$\begin{eqnarray*} \sum_{j=1}^p |\beta_j| \leq \color{#e64173}{t}, \end{eqnarray*}$$` i.e. an `\(\ell_1\)` penalty. Shrinks the coefficient estimates (not the intercept) towards zero and sets some of them exactly equal to zero, i.e. it performs variable selection! Lasso is for .hi-pink[L]east .hi-pink[a]bsolute .hi-pink[s]hrinkage and .hi-pink[s]election .hi-pink[o]perator. ] --- # Ridge and lasso (least squares) regression (cont.) .pull-left[ The .KULbginline[dual problem] formulation: * with ridge penalty: `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS} + \color{#e64173}{\lambda} \sum_{j=1}^p \beta_j^2 \end{eqnarray*}$$` * with lasso penalty: `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS}+\color{#e64173}{\lambda} \sum_{j=1}^p |\beta_j|. \end{eqnarray*}$$` `\(\color{#e64173}{\lambda}\)` is a tuning parameter; use resampling methods to pick a value! Both ridge and lasso require .hi-pink[centering and scaling] of the features. ] .pull-right[ .center[ <img src="img/6.7_ISL.png" width="100%" style="display: block; margin: auto;" /> ] Ellipses (around least-squares solution) represent regions of constant RSS. Lasso budget on the left and ridge budget on the right. Source: James et al. (2021) on [An introduction to statistical learning](https://www.statlearning.com/). ] --- # Regularized GLMs .pull-left[ We now focus on generalizations of linear models and the lasso. Minimize `$$\begin{eqnarray*} \min_{\beta_0,\ \beta} -\frac{1}{n} \log \mathcal{L}(\beta_0,\ \beta;\ y,\ X)+\lambda \|\boldsymbol{\beta}\|_1. \end{eqnarray*}$$` Here: * `\(\log \mathcal{L}\)` is the log-likelihood of a GLM. * `\(n\)` is the sample size. * `\(\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^p |\beta_j|\)` is the `\(\ell_1\)` penalty. What happens if: * `\(\lambda \to 0\)`? * `\(\lambda \to \infty\)`? ] .pull-right[ .center[ <img src="img/lasso_path.png" width="100%" style="display: block; margin: auto;" /> ] The R package {glmnet} fits linear, logistic and multinomial, Poisson, and Cox regression models. ] --- # Fit a GLM with lasso regularization in {glmnet} {glmnet} is a package that fits a generalized linear model via penalized maximum likelihood. Main function call (with a selection of arguments, see `? glmnet` for a complete list) ```r fit <- glmnet(x, y, family = ., alpha = ., weights = ., offset = ., nlambda = ., standardize = ., intercept = .) ``` where * `x` is the input matrix and `y` is the response variable * `family` the response type, e.g. `family = poisson` * `weights` optional observation weights and `offset` an optional offset for the linear predictor * `nlambda` is the number of `\(\lambda\)` values, default is 100 * `standardize` should `x` be standardized prior to fitting the model sequence? * `intercept` should an intercept be fitted? 
* `alpha` a value between 0 and 1, such that the penalty becomes `$$\begin{eqnarray*} \lambda P_{\alpha}(\boldsymbol{\beta}) = \lambda \cdot \sum_{j=1}^p \left\{\frac{(1-\alpha)}{2}\beta_j^2 + \alpha |\beta_j|\right\}. \end{eqnarray*}$$` Thus, with `\(\alpha = 1\)` the lasso penalty and `\(\alpha = 0\)` the ridge penalty results. --- # A first example of {glmnet} .pull-left[ Following [the vignette](https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf) we start with penalized linear regression ```r library(glmnet) data(QuickStartExample) ``` This example loads an input matrix `x` and vector `y` of outcomes. The input matrix `x` is not standardized yet (check this!). We calibrate a lasso linear regression model ```r fit <- glmnet(x, y, family = "gaussian", `alpha = 1`, standardize = TRUE, intercept = TRUE) summary(fit) ``` ] .pull-right[ Note that the formula notation `y ~ x` can not be used with `glmnet`. Some `tidy` instructions are available for `glmnet` objects (but not all), e.g. ```r library(broom) tidy(fit) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> step </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> lambda </th> <th style="text-align:right;"> dev.ratio </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.6607581 </td> <td style="text-align:right;"> 1.630762 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.6312350 </td> <td style="text-align:right;"> 1.485890 </td> <td style="text-align:right;"> 0.0552832 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.5874616 </td> <td style="text-align:right;"> 1.353887 </td> <td style="text-align:right;"> 0.1458910 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.5475769 </td> <td style="text-align:right;"> 1.233612 </td> <td style="text-align:right;"> 0.2211153 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0.5112354 </td> <td style="text-align:right;"> 1.124021 </td> <td style="text-align:right;"> 0.2835678 </td> </tr> </tbody> </table> ] --- class: clear .pull-left[ ```r plot(fit, `label = TRUE`) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-234-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ```r plot(fit, label = TRUE, `xvar = 'lambda'`) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-236-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r plot(fit, `xvar = 'dev'`, label = TRUE) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-238-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ```r print(fit) ## ## Call: glmnet(x = QuickStartExample$x, y = QuickStartExample$y, family = "gaussian", alpha = 1, standardize = TRUE, intercept = TRUE) ## ## Df %Dev Lambda ## 1 0 0.00 1.63100 ## 2 2 5.53 1.48600 ## 3 2 14.59 1.35400 ## 4 2 22.11 1.23400 ## 5 2 28.36 1.12400 ## 6 2 33.54 1.02400 ## 7 4 39.04 0.93320 ## 8 5 45.60 0.85030 ## 9 5 51.54 0.77470 ## 10 6 57.35 0.70590 ## 11 6 62.55 0.64320 ## 12 6 66.87 0.58610 ## 13 6 70.46 0.53400 ## 14 6 73.44 
0.48660 ## 15 7 76.21 0.44330 ## 16 7 78.57 0.40400 ## 17 7 80.53 0.36810 ## 18 7 82.15 0.33540 ## 19 7 83.50 0.30560 ## 20 7 84.62 0.27840 ## 21 7 85.55 0.25370 ## 22 7 86.33 0.23120 ## 23 8 87.06 0.21060 ## 24 8 87.69 0.19190 ## 25 8 88.21 0.17490 ## 26 8 88.65 0.15930 ## 27 8 89.01 0.14520 ## 28 8 89.31 0.13230 ## 29 8 89.56 0.12050 ## 30 8 89.76 0.10980 ## 31 9 89.94 0.10010 ## 32 9 90.10 0.09117 ## 33 9 90.23 0.08307 ## 34 9 90.34 0.07569 ## 35 10 90.43 0.06897 ## 36 11 90.53 0.06284 ## 37 11 90.62 0.05726 ## 38 12 90.70 0.05217 ## 39 15 90.78 0.04754 ## 40 16 90.86 0.04331 ## 41 16 90.93 0.03947 ## 42 16 90.98 0.03596 ## 43 17 91.03 0.03277 ## 44 17 91.07 0.02985 ## 45 18 91.11 0.02720 ## 46 18 91.14 0.02479 ## 47 19 91.17 0.02258 ## 48 19 91.20 0.02058 ## 49 19 91.22 0.01875 ## 50 19 91.24 0.01708 ## 51 19 91.25 0.01557 ## 52 19 91.26 0.01418 ## 53 19 91.27 0.01292 ## 54 19 91.28 0.01178 ## 55 19 91.29 0.01073 ## 56 19 91.29 0.00978 ## 57 19 91.30 0.00891 ## 58 19 91.30 0.00812 ## 59 19 91.31 0.00739 ## 60 19 91.31 0.00674 ## 61 19 91.31 0.00614 ## 62 20 91.31 0.00559 ## 63 20 91.31 0.00510 ## 64 20 91.31 0.00464 ## 65 20 91.32 0.00423 ## 66 20 91.32 0.00386 ## 67 20 91.32 0.00351 ``` ] --- class: clear .pull-left[ Get estimated coefficients for handpicked value ```r coef(fit, `s = 0.1`) ``` ``` ## 21 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) 0.150928072 ## V1 1.320597195 ## V2 . ## V3 0.675110234 ## V4 . ## V5 -0.817411518 ## V6 0.521436671 ## V7 0.004829335 ## V8 0.319415917 ## V9 . ## V10 . ## V11 0.142498519 ## V12 . ## V13 . ## V14 -1.059978702 ## V15 . ## V16 . ## V17 . ## V18 . ## V19 . ## V20 -1.021873704 ``` ] .pull-right[ `glmnet` returns a sequence of models for the users to choose from, i.e. a model for every `lambda`. How do we select the most appropriate model? Use cross-validation to pick a `lambda` value. The default is 10-folds cross-validation. ```r cv_fit <- cv.glmnet(QuickStartExample$x, QuickStartExample$y) ``` We can pick the `lambda` that minimizes the cross-validation error. ```r cv_fit$lambda.min ## [1] 0.07569327 ``` Or we use the one-standard-error-rule. ```r cv_fit$lambda.1se ## [1] 0.1593271 ``` ] --- class: clear We plot the cross-validation error for the inspected grid of `lambda` values. ```r plot(cv_fit) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-247-1.svg" width="45%" height="40%" style="display: block; margin: auto;" /> --- class: clear .pull-left[ For the selected `lambda` (via `cv_fit$lambda.min`) we inspect which parameters are non-zero (on the right). Now, compare this to the selected variables obtained via `cv_fit$lambda.1se`. ] .pull-right[ ```r coef(fit, s = cv_fit$lambda.min) ``` ``` ## 21 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) 0.14867414 ## V1 1.33377821 ## V2 . ## V3 0.69787701 ## V4 . ## V5 -0.83726751 ## V6 0.54334327 ## V7 0.02668633 ## V8 0.33741131 ## V9 . ## V10 . ## V11 0.17105029 ## V12 . ## V13 . ## V14 -1.07552680 ## V15 . ## V16 . ## V17 . ## V18 . ## V19 . ## V20 -1.05278699 ``` ] --- class: clear .pull-left[ The variables `V1`, `V3`, `V5-8`, `V11`, `V14` and `V20` are selected in the regression model. However, the corresponding estimates (on the left) are biased, and shrunk to zero. To remove this bias, we refit the model, only using the selected variables. 
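One way to collect the selected variables programmatically, rather than hard-coding the column picks as in the refit below, is sketched here (the names `coefs`, `idx` and `final_model_alt` are illustrative):

```r
# named vector of coefficients at the chosen lambda; keep the non-zero ones
coefs    <- coef(fit, s = cv_fit$lambda.min)[, 1]
selected <- setdiff(names(coefs)[coefs != 0], "(Intercept)")

# map the selected names back to column positions of the input matrix x
idx   <- match(selected, names(coefs)) - 1
x_sel <- QuickStartExample$x[, idx, drop = FALSE]
colnames(x_sel) <- selected

# unpenalized refit on the selected features only
final_model_alt <- lm(QuickStartExample$y ~ x_sel)
broom::tidy(final_model_alt)
```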
```r attach(QuickStartExample) subset <- data.frame(y = y, V1 = x[, 1], V3 = x[, 3], V5 = x[, 5], V6 = x[, 6], V7 = x[, 7], V8 = x[, 8], V11 = x[, 11], V14 = x[, 14], V20 = x[, 20]) final_model <- lm(y ~ V1 + V3 + V5 + V6 + V7 + V8 + V11 + V14 + V20, data = subset) final_model %>% broom::tidy() ``` What is your judgement about `V7` (see coefficients on the right)? ] .pull-right[ What do you observe when comparing the estimates below with those shown on the previous sheet? <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 0.1416891 </td> <td style="text-align:right;"> 0.0995658 </td> <td style="text-align:right;"> 1.4230704 </td> <td style="text-align:right;"> 0.1581730 </td> </tr> <tr> <td style="text-align:left;"> V1 </td> <td style="text-align:right;"> 1.3746695 </td> <td style="text-align:right;"> 0.0968211 </td> <td style="text-align:right;"> 14.1980421 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V3 </td> <td style="text-align:right;"> 0.7688247 </td> <td style="text-align:right;"> 0.0942568 </td> <td style="text-align:right;"> 8.1567012 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V5 </td> <td style="text-align:right;"> -0.8991610 </td> <td style="text-align:right;"> 0.1033747 </td> <td style="text-align:right;"> -8.6980793 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V6 </td> <td style="text-align:right;"> 0.6115910 </td> <td style="text-align:right;"> 0.0900882 </td> <td style="text-align:right;"> 6.7888025 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V7 </td> <td style="text-align:right;"> 0.0947279 </td> <td style="text-align:right;"> 0.0972959 </td> <td style="text-align:right;"> 0.9736059 </td> <td style="text-align:right;"> 0.3328618 </td> </tr> <tr> <td style="text-align:left;"> V8 </td> <td style="text-align:right;"> 0.3933822 </td> <td style="text-align:right;"> 0.0920456 </td> <td style="text-align:right;"> 4.2737767 </td> <td style="text-align:right;"> 0.0000477 </td> </tr> <tr> <td style="text-align:left;"> V11 </td> <td style="text-align:right;"> 0.2600734 </td> <td style="text-align:right;"> 0.0994215 </td> <td style="text-align:right;"> 2.6158659 </td> <td style="text-align:right;"> 0.0104367 </td> </tr> <tr> <td style="text-align:left;"> V14 </td> <td style="text-align:right;"> -1.1239616 </td> <td style="text-align:right;"> 0.0885267 </td> <td style="text-align:right;"> -12.6963039 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V20 </td> <td style="text-align:right;"> -1.1491267 </td> <td style="text-align:right;"> 0.1117142 </td> <td style="text-align:right;"> -10.2863111 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> </tbody> </table> ] --- # {glmnet} and the MTPL data set .pull-left[ Next, we fit a .KULbginline[Poisson regression model with lasso penalty] on the `mtpl` data set. The regularization penalty helps us to select the interesting features from the data set. `glmnet` requires the features as input matrix `x` and the target as a vector `y`. Recall: * `mtpl` has .hi-pink[continuous] features (e.g. 
`ageph`, `bm`, `power`) * `mtpl` has .hi-pink[factor] variables with .hi-pink[two levels] (e.g. `sex`, `fleet`) * but also factor variables with .hi-pink[more than 2 levels] (`coverage`) ] .pull-right[ Consider the different ways of coding factor variables. Apply the `contrasts` function to the variable `coverage`. ```r map(mtpl[, c("coverage")], contrasts, contrasts = FALSE) ## $coverage ## FO PO TPL ## FO 1 0 0 ## PO 0 1 0 ## TPL 0 0 1 ``` ```r map(mtpl[, c("coverage")], contrasts, contrasts = TRUE) ## $coverage ## PO TPL ## FO 0 0 ## PO 1 0 ## TPL 0 1 ``` What's the difference? ] --- # {glmnet} and the MTPL data set (cont.) We construct the input matrix for `glmnet`. ```r *y <- mtpl$nclaims x <- model.matrix( ~ coverage + fuel + use + fleet + sex + ageph + bm + agec + power, data = mtpl, contrasts.arg = map(mtpl[, c("coverage")], contrasts, contrasts = FALSE))[,-1] x[1:10,] ``` Put the response or outcome variable in `y`. In the `mtpl` data set we build a Poisson model for `nclaims`. --- # {glmnet} and the MTPL data set (cont.) We construct the input matrix for `glmnet`. ```r y <- mtpl$nclaims x <- `model.matrix`( ~ coverage + fuel + use + fleet + sex + ageph + bm + agec + power, data = mtpl, `contrasts.arg = map(mtpl[, c("coverage")], contrasts,` `contrasts = FALSE)`)[,-1] ``` Use `model.matrix` to create the input matrix `x`. We code the factor variable `coverage` with one-hot encoding. Here, three dummy variables will be created for the three levels of `coverage`. The other factor variables `fuel`, `use`, `fleet` and `sex` are dummy coded, each with a single dummy variable. --- # {glmnet} and the MTPL data set (cont.) We construct the input matrix for `glmnet`. ```r y <- mtpl$nclaims x <- model.matrix( ~ coverage + fuel + use + fleet + sex + ageph + bm + agec + power, data = mtpl, contrasts.arg = map(mtpl[, c("coverage")], contrasts, contrasts = FALSE))`[,-1]` ``` Use `model.matrix` to create the input matrix `x`. We remove the first column, representing the intercept, from the `model.matrix`. --- # {glmnet} and the MTPL data set (cont.) Let's check the input matrix `x` ``` ## coverageFO coveragePO coverageTPL fuelgasoline usework fleetY sexmale ageph ## 1 0 0 1 1 0 0 1 50 ## 2 0 1 0 1 0 0 0 64 ## 3 0 0 1 0 0 0 1 60 ## 4 0 0 1 1 0 0 1 77 ## 5 0 0 1 1 0 0 0 28 ## 6 0 0 1 1 0 0 1 26 ## bm agec power ## 1 5 12 77 ## 2 5 3 66 ## 3 0 10 70 ## 4 0 15 57 ## 5 9 7 70 ## 6 11 12 70 ``` You are now ready to fit a regularized Poisson GLM for `y` with input `x`. Let's go! --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will fit a regularized Poisson GLM on the `mtpl` data with the {glmnet} package. .hi-pink[Q]: using the constructed `y` and `x` 1. Fit a `glmnet` with lasso penalty and store the fitted object in `mtpl_glmnet`. Use the following arguments `family = "poisson", offset = ___`. 2. Display the order of the variables and their names via `row.names(mtpl_glmnet$beta)`. 3. Plot the solution path. Pick a meaningful value for `lambda` via cross-validation (see the hint below). 4. Which variables are selected in the lasso model? As a last step, you will fit a Poisson GLM with the selected variables. What do you see? 5. List some pros and cons of the above strategy.
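A hint for Q.3, mirroring the solution that follows: you can make the cross-validation reproducible by constructing the fold assignment yourself and passing it to `cv.glmnet` via its `foldid` argument. A minimal sketch, assuming the `mtpl` data set is loaded:

```r
set.seed(123)  # reproducible fold assignment
fold_id <- sample(rep(1:10, length.out = nrow(mtpl)), nrow(mtpl))
# then supply foldid = fold_id (and e.g. type.measure = "deviance") in your cv.glmnet() call
```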
] --- class: clear .pull-left[ .hi-pink[Q.1] fit a regularized Poisson GLM ```r alpha <- 1 # for lasso penalty mtpl_glmnet <- glmnet(x = x, y = y, family = "poisson", offset = log(mtpl$expo), alpha = alpha, standardize = TRUE, intercept = TRUE) ``` .hi-pink[Q.2] display the variables via ```r row.names(mtpl_glmnet$beta) ## [1] "coverageFO" "coveragePO" "coverageTPL" "fuelgasoline" "usework" ## [6] "fleetY" "sexmale" "ageph" "bm" "agec" ## [11] "power" ``` ] .pull-right[ .hi-pink[Q.3] plot the solution path ```r plot(mtpl_glmnet, xvar = 'lambda', label = TRUE) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-261-1.svg" width="85%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Q.3] pick a value for `lambda` ```r set.seed(123) fold_id <- sample(rep(1:10, length.out = nrow(mtpl)), nrow(mtpl)) mtpl_glmnet_cv <- cv.glmnet(x, y, family = "poisson", alpha = alpha, nfolds = 10, foldid = fold_id, type.measure = "deviance", standardize = TRUE, intercept = TRUE) plot(mtpl_glmnet_cv) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-262-1.svg" width="45%" style="display: block; margin: auto;" /> ] .pull-right[ ```r coef(mtpl_glmnet_cv, s = "lambda.min") ## 12 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) -2.106680932 ## coverageFO -0.006499730 ## coveragePO . ## coverageTPL 0.050002173 ## fuelgasoline -0.165864612 ## usework -0.069292342 ## fleetY -0.049283838 ## sexmale -0.013718073 ## ageph -0.006347490 ## bm 0.058564280 ## agec -0.002004356 ## power 0.003448081 ``` ] --- class: clear .pull-left[ .hi-pink[Q.3] pick a value for `lambda` ```r set.seed(123) fold_id <- sample(rep(1:10, length.out = nrow(mtpl)), nrow(mtpl)) mtpl_glmnet_cv <- cv.glmnet(x, y, family = "poisson", alpha = alpha, nfolds = 10, foldid = fold_id, type.measure = "deviance", standardize = TRUE, intercept = TRUE) plot(mtpl_glmnet_cv) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-264-1.svg" width="45%" style="display: block; margin: auto;" /> ] .pull-right[ ```r coef(mtpl_glmnet_cv, s = "lambda.1se") ## 12 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) -2.124039910 ## coverageFO . ## coveragePO . ## coverageTPL . ## fuelgasoline . ## usework . ## fleetY . ## sexmale . ## ageph -0.002916928 ## bm 0.046163778 ## agec . ## power .
``` ] --- class: clear .pull-left[ .hi-pink[Q.4] refit the models using only the selected features ```r mtpl$coverage <- relevel(mtpl$coverage, "PO") mtpl_formula_refit <- nclaims ~ 1 + coverage + fuel + use + fleet + sex + ageph + bm + agec + power mtpl_glm_refit <- glm(mtpl_formula_refit, data = mtpl, offset = log(mtpl$expo), family = poisson()) ``` ] .pull-right[ The selection obtained via `lambda.min` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.9892872 </td> <td style="text-align:right;"> 0.0401325 </td> <td style="text-align:right;"> -49.5679730 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> coverageFO </td> <td style="text-align:right;"> 0.0044293 </td> <td style="text-align:right;"> 0.0244274 </td> <td style="text-align:right;"> 0.1813238 </td> <td style="text-align:right;"> 0.8561134 </td> </tr> <tr> <td style="text-align:left;"> coverageTPL </td> <td style="text-align:right;"> 0.0743796 </td> <td style="text-align:right;"> 0.0172363 </td> <td style="text-align:right;"> 4.3152799 </td> <td style="text-align:right;"> 0.0000159 </td> </tr> <tr> <td style="text-align:left;"> fuelgasoline </td> <td style="text-align:right;"> -0.1731052 </td> <td style="text-align:right;"> 0.0153266 </td> <td style="text-align:right;"> -11.2944557 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> usework </td> <td style="text-align:right;"> -0.0862841 </td> <td style="text-align:right;"> 0.0334470 </td> <td style="text-align:right;"> -2.5797233 </td> <td style="text-align:right;"> 0.0098880 </td> </tr> <tr> <td style="text-align:left;"> fleetY </td> <td style="text-align:right;"> -0.1226498 </td> <td style="text-align:right;"> 0.0435289 </td> <td style="text-align:right;"> -2.8176618 </td> <td style="text-align:right;"> 0.0048375 </td> </tr> <tr> <td style="text-align:left;"> sexmale </td> <td style="text-align:right;"> -0.0253198 </td> <td style="text-align:right;"> 0.0162468 </td> <td style="text-align:right;"> -1.5584505 </td> <td style="text-align:right;"> 0.1191265 </td> </tr> <tr> <td style="text-align:left;"> ageph </td> <td style="text-align:right;"> -0.0074262 </td> <td style="text-align:right;"> 0.0005391 </td> <td style="text-align:right;"> -13.7764864 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> bm </td> <td style="text-align:right;"> 0.0639249 </td> <td style="text-align:right;"> 0.0017328 </td> <td style="text-align:right;"> 36.8902457 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> agec </td> <td style="text-align:right;"> -0.0004698 </td> <td style="text-align:right;"> 0.0019368 </td> <td style="text-align:right;"> -0.2425874 </td> <td style="text-align:right;"> 0.8083251 </td> </tr> <tr> <td style="text-align:left;"> power </td> <td style="text-align:right;"> 0.0038535 </td> <td style="text-align:right;"> 0.0003799 </td> <td style="text-align:right;"> 10.1421096 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> </tbody> </table> ] --- class: clear .pull-left[ .hi-pink[Q.4] refit the models using only the selected features ```r mtpl_formula_refit_2 <- nclaims ~ 1 + ageph + bm 
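# lambda.1se retains only ageph and bm, hence the much shorter formula above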
mtpl_glm_refit_2 <- glm(mtpl_formula_refit_2, data = mtpl, offset = log(mtpl$expo), family = poisson()) ``` ] .pull-right[ The selection obtained via `lambda.1se` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.8251292 </td> <td style="text-align:right;"> 0.0282345 </td> <td style="text-align:right;"> -64.64189 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> ageph </td> <td style="text-align:right;"> -0.0083839 </td> <td style="text-align:right;"> 0.0005274 </td> <td style="text-align:right;"> -15.89605 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> bm </td> <td style="text-align:right;"> 0.0625774 </td> <td style="text-align:right;"> 0.0017141 </td> <td style="text-align:right;"> 36.50764 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ] --- name: wrap-up # Thanks! <img src="img/xaringan.png" class="title-hex"> <br> <br> <br> <br> Slides created with the R package [xaringan](https://github.com/yihui/xaringan). <br> <br> <br> Course material available via <br>
https://github.com/katrienantonio/hands-on-machine-learning-R-module-1