class: bottom, left, title-slide, clear background-image: url('img/beamer_title_KUL.png') background-size: contain name: titlepage # Hands-on Machine Learning with R - Module 1 <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=100%></html> <span style="color:#FAFAFA;font-size:18.0pt"> Katrien Antonio & Roel Henckaerts & Jonas Crevecoeur <br> <br> </span> <span style="color:#FAFAFA;font-size:14.0pt"> Webinar | October - November, 2023<br> </span> --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: introduction # Introduction ### Course
https://github.com/katrienantonio/hands-on-machine-learning-R-module-1 The course repo on GitHub, where you can find the data sets, lecture sheets, R scripts and R markdown files. -- ### Us
[https://katrienantonio.github.io/](https://katrienantonio.github.io/) & [LinkedIn Jonas](https://be.linkedin.com/in/jonascrevecoeur) & [LinkedIn Roel](https://be.linkedin.com/in/roelhenckaerts)
[katrien.antonio@kuleuven.be](mailto:katrien.antonio@kuleuven.be) & [roel.henckaerts@kuleuven.be](mailto:roel.henckaerts@kuleuven.be)
(Katrien) Professor in insurance data science
(Jonas) PhD in insurance data science, now consultant in statistics, data science and data engineering with Data Minded
(Roel) PhD in insurance data science, now consultant in data science with AI start up [Prophecy Labs](https://www.prophecylabs.com/) --- name: checklist # Checklist ☑ Do you have a fairly recent version of R? ☑ Do you have a fairly recent version of RStudio? ☑ Have you installed the R packages listed in the software requirements? or ☑ Have you created an account on posit cloud (to avoid any local installation issues)? --- name: why-this-course # inspired by Grant McDermott intro lecture # Why this course? ### The goals of this course .font140[
] -- * develop practical .KULbginline[machine learning (ML) foundations] -- * .KULbginline[fill in the gaps] left by traditional training in actuarial science or econometrics -- * focus on the use of ML methods for the .KULbginline[analysis of frequency + severity data], but also .KULbginline[non-standard data] such as images -- * .KULbginline[explore] a substantial range of .KULbginline[methods (and data types)] (from GLMs to deep learning), but - most importantly - .KULbginline[build foundation] so that you can explore other methods (and data types) yourself. -- <br> > *"In short, we will cover things that we wish someone had taught us in our undergraduate programs."* > <br> > .font80[This quote is from the [Data science for economists course](http://github.com/uo-ec607/lectures) by Grant McDermott.] --- # Module 1's Outline .pull-left[ * [Prologue](#prologue) * [Knowing me, knowing you: <br> statistical and machine learning](#knowing) - Supervised and unsupervised learning - Regression and classification - Statistical modeling: the two cultures * [Model accuracy and loss functions](#accuracy) * [Overfitting and bias-variance tradeoff](#overfit) * [Data splitting, Resampling methods](#resampling) ] .pull-right[ * [Parameter tuning](#tuning) - with {caret}, {rsample} and {purrr} * [Target and feature engineering](#engineering) - Data leakage - Pre-processing steps - Specifying blue-prints with {recipes} - Putting it all together: {recipes} and {caret}/{rsample} * [Regression models](#regression) - Creating models in R and tidy model output with {broom} - GLMs with {glm} - GAMs with {mgcv} - Regularized (G)LMs with {glmnet}. ] --- name: map-ML-world class: right, middle, clear background-image: url("img/map_ML_world.jpg") background-size: 45% background-position: left .KULbginline[Some roadmaps to explore the ML landscape...] <img src = "img/AI_ML_DL.jpg" height = "350px" /> .font60[Source: [Machine Learning for Everyone In simple words. With real-world examples. Yes, again.](https://vas3k.com/blog/machine_learning/)] --- name: map-ML-world class: right, middle, clear background-image: url("img/main_types_ML.jpg") background-size: 85% background-position: middle --- class: inverse, center, middle name: knowing # Knowing me, knowing you: <br> <br> # statistical and machine learning <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: supervised-learning # Supervised learning .pull-left-alt[ Supervised learning builds ("learns") a .blue[model] `\(\color{#3b3b9a}{f}\)` (*the Signal*) such that the .orange[outcome or target] `\(\color{#fb6107}{Y}\)` can be written as `$$\color{#FFA500}{Y} = \color{#3b3b9a}{f}(\color{#e64173}{x_1, \ldots, x_p}) + \epsilon$$` with .pink[features] `\(\color{#e64173}{x_1, \ldots, x_p}\)` and error term `\(\epsilon\)` (*the Noise*). Supervised learners construct .KULbginline[predictive models]. <br> <br> .footnote[Picture taken from [Machine Learning for Everyone. In simple words. With real-world examples. Yes, again](https://vas3k.com/blog/machine_learning/)] ] .pull-right-alt[ .center[ <img src="img/supervised_unsupervised_drawing.jpg" width="100%" style="display: block; margin: auto;" /> ] ] --- name: unsupervised-learning # Unsupervised learning .pull-left-alt[ With unsupervised learning there is .KULbginline[NO] .orange[outcome or target] `\(\color{#FFA500}{Y}\)`, only the feature vector `\(\color{#e64173}{x = (x_1, \ldots, x_p)}\)`. Let `\(n\)` denote the sample size and `\(p\)` the number of features. 
Then, `\(\color{#e64173}{X}\)` is the `\(n \times p\)` matrix of features, with `\(\color{#e64173}{x}_{i,j}\)` the value of variable or feature `\(j\)` for observation `\(i\)`. Unsupervised learners construct .KULbginline[descriptive models], without any *supervising* output, letting the data "speak for itself". ] .pull-right-alt[ .center[ <img src="img/K-means_drawing.jpg" width="80%" height="55%" style="display: block; margin: auto;" /> ] .footnote[Picture taken from [Machine Learning for Everyone. In simple words. With real-world examples. Yes, again](https://vas3k.com/blog/machine_learning/)] ] --- class: clear <br> <br> .center[ <img src="img/supervised_unsupervised_robot.jpg" width="90%" style="display: block; margin: auto;" /> ] .footnote[Picture taken from [this source](https://twitter.com/athena_schools/status/1063013435779223553).] --- name: what's-in-a-name # What's in a name? .KULbginline[Machine learning] constructs algorithms that learn from data. -- .KULbginline[Statistical learning] emphasizes statistical models and the assessment of uncertainty. -- .KULbginline[Data science] applies mathematics, statistics, machine learning, engineering, etc. to extract knowledge from data. -- > *"Data Science is statistics on a Mac."* .center[ <img src="img/ElementsStatLearning.png" alt="Drawing" style="height: 250px;"/> <img src="img/ISL.png" alt="Drawing" style="height: 250px;"/> <img src="img/AppliedPredMod.png" alt="Drawing" style="height: 250px;"/> <img src="img/boehmke_greenwell.jpg" alt="Drawing" style="height: 250px;"/> <img src="img/molnar.png" alt="Drawing" style="height: 250px;"/> ] Source: Brandon M. Greenwell on [Introduction to Machine Learning in R
](https://github.com/bgreenwell/intro-ml-r). --- name: two-cultures # Statistical modeling: the two cultures Consider a vector of input variables `\(\color{#e64173}{x}\)`, being transformed into some vector of response variables `\(\color{#FFA500}{y}\)` via a black box algorithm. .center[ <img src="img/Breiman_nature.png" alt="Drawing" style="width: 300px;"/> ] -- .pull-left[ .KULbginline[Statistical learning or data modeling culture] * assume statistical model, estimate parameter values * validate with goodness-of-fit tests and residual inspection .center[ <img src="img/Breiman_data_modeling.png" alt="Drawing" style="width: 300px;"/> ] ] -- .pull-right[ .KULbginline[Machine learning or algo modeling culture] * inside of the box is complex and unknown * find algorithm `\(\color{#3b3b9a}{f}(\color{#e64173}{x})\)` to predict `\(\color{#FFA500}{y}\)` * measure performance by predictive accuracy .center[ <img src="img/Breiman_algo_modeling.png" alt="Drawing" style="width: 300px;"/> ] ] Source: Breiman (2001, Statistical Science) on *Statistical modeling: the two cultures.* --- name: statistical-machine-learning # Newspeak from the two cultures <br> <br> | Statistical learning | Machine learning :------:|:-------------------------:|:-------------------------: .KULbginline[origin] | statistics | computer science .KULbginline[*f(x)*] | model | algorithm .KULbginline[emphasis] | interpretability, precision and uncertainty | large scale applicability, prediction accuracy .KULbginline[jargon] | parameters, estimation | weights, learning .KULbginline[CI] | uncertainty of parameters | no notion of uncertainty .KULbginline[assumptions] | explicit a priori assumption | no prior assumption, learn from the data <br> Source: read the blog [Why a mathematician, statistician and machine learner solve the same problem differently](https://blog.galvanize.com/why-a-mathematician-statistician-machine-learner-solve-the-same-problem-differently-2/) --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ As discussed in the lecture, many problems in ML can be approached as a .KULbginline[regression], .KULbginline[classification] or .KULbginline[clustering] problem. <br> .hi-pink[Q]: consider the following .hi-pink[three problem settings] and .hi-pink[label them] as regression, classification or clustering. <br> 1. In disability insurance: how do disability rates depend on the state of the economy (e.g. GDP)? 2. In MTPL insurance: predict whether a claim is attritional or large, *in casu* a claim that exceeds the threshold of 100 000 EUR? 3. How can we group customers based on the insurance products they bought from the company? ] --- class: inverse, center, middle name: accuracy # Model accuracy and loss functions <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Predictive modeling How to use the observed data to learn or to estimate the unknown `\(\color{#3b3b9a}{f}(.)\)`? `$$\begin{eqnarray*} \color{#FFA500}{y} &=& \color{#3b3b9a}{f}(\color{#e64173}{x_1,x_2,\ldots,x_p})+\epsilon. \end{eqnarray*}$$` -- How do I .KULbginline[estimate] `\(\color{#3b3b9a}{f}(.)\)` - one way to phrase *all questions* that underly statistical & machine learning. -- .font140[.KULbginline[Take-aways]] .font160[
] - main reasons we want to .KULbginline[learn about] `\(\color{#3b3b9a}{f}(.)\)` -- .pull-left[ .font120[.KULbginline[prediction]] <br> predict the target `\(\color{#FFA500}{y}\)` as `\(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x})\)` <br> .font140[
] - as black box setting? <br> <br> .font120[.KULbginline[inference]] <br> how does target `\(\color{#FFA500}{y}\)` depend on features `\(\color{#e64173}{x}\)`? <br> .font140[
] - as white box setting? ] -- .pull-right[ <img src="img/prediction_inference.png" width="2200" style="display: block; margin: auto;" /> ] --- name: prediction-error # Prediction errors Why we're stuck with .KULbginline[irreducible error] assume `\(\hat{\color{#3b3b9a}{f}}\)` and `\(\color{#e64173}{x}\)` given, then $$ `\begin{aligned} \mathop{E}\left[ \left\{ \color{#FFA500}{y} - \hat{\color{#FFA500}{y}} \right\}^2 \right] &= \mathop{E}\left[ \left\{ \color{#3b3b9a}{f}(\color{#e64173}{x}) + \epsilon - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}) \right\}^2 \right] \\ &= \underbrace{\left[ \color{#3b3b9a}{f}(\color{#e64173}{x}) - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}) \right]^2}_{\text{Reducible}} + \underbrace{\mathop{\text{Var}} \left( \color{#e64173}{\epsilon} \right)}_{\text{Irreducible}} \end{aligned}` $$ In .KULbginline[less math]: - if `\(\epsilon\)` exists, then `\(\color{#e64173}{x}\)` cannot perfectly explain `\(\color{#FFA500}{y}\)` - so even if `\(\hat{\color{#3b3b9a}{f}} = \color{#3b3b9a}{f}\)`, we still have irreducible error. -- Thus, to form our .KULbginline[best predictors], we will .KULbginline[minimize reducible error]. --- name: model-accuracy # Model accuracy We assess .KULbginline[model] or .KULbginline[predictive accuracy] by evaluating how well predictions actually match observed data. -- Use .KULbginline[loss functions], i.e. metrics that compare predicted values to actual values. -- .pull-left[.KULbginline[Regression], use e.g. the .hi-pink[Mean Squared Error (MSE)] `$$\begin{eqnarray*} \frac{1}{n} \sum_{i=1}^n (\color{#FFA500}{y}_i - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_i))^2, \end{eqnarray*}$$` Recall: `\(\color{#FFA500}{y}_i - \hat{\color{#FFA500}{y}}_i = \color{#FFA500}{y}_i - \hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_i)\)` is the prediction error. Objective
: minimize!] -- .pull-right[.KULbginline[Classification], use e.g. the .hi-pink[cross-entropy] or .hi-pink[log loss] `$$\begin{eqnarray*} -\frac{1}{n} \sum_{i=1}^n \left(\color{#FFA500}{y}_i \cdot \log{(p_i)} + (1-\color{#FFA500}{y}_i) \cdot \log{(1-p_i)}\right). \end{eqnarray*}$$` with `\(p_i\)` the predicted probability that observation `\(i\)` belongs to class 1. Objective
: minimize! ] -- <br> .KULbginline[Many other useful loss functions] (e.g. deviance in regression, Gini index in classification). .font140[.KULbginline[Take-away]] .font160[
] - a loss function emphasizes certain types of errors over others, thus pick a meaningful one! --- class: inverse, center, middle name: overfit # Overfitting and bias-variance trade off <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Overfitting The .KULbginline[Signal and the Noise] discussion! -- Which of the following three models (in green-blue-ish) will best generalize to new data? <img src="ML_part1_files/figure-html/overfitting-linear-regression-1.svg" width="100%" style="display: block; margin: auto;" /> .footnote[Inspired by Brandon Greenwell's [Introduction to Machine Learning in
](https://github.com/bgreenwell/intro-ml-r).] --- # Overfitting (cont.) With a small training error, but large test error, the model is .KULbginline[overfitting] or working too hard! -- The expected value of the .hi-pink[test MSE]: `$$\begin{eqnarray*} E\left(\color{#FFA500}{y}_0-\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0)\right)^2 &=& \text{Var}(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0))+[\text{Bias}(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0))]^2+\text{Var}(\epsilon). \end{eqnarray*}$$` -- .font140[.KULbginline[In general]] - with more flexible methods * variance .font140[
↑] and bias .font140[↓
] * their relative rate of change determines whether the test error increases or decreases -- .font140[.KULbginline[Take-aways]] .font160[
] * U-shape curves of .hi-pink[test MSE] w.r.t model flexibility * the .hi-pink[bias-variance tradeoff] is central to quality prediction. --- # Bias-variance trade off <br> .center[ <img src="img/bias_variance_trade_off.png" alt="Drawing" style="width: 600px;"/> ] Source: James et al. (2021, 2nd edition) on [https://www.statlearning.com/](An introduction to statistical learning). --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Data are generated from: `\(\color{#FFA500}{y} = \color{#3b3b9a}{f}(\color{#e64173}{x})+\epsilon\)`, with the black curve as the true `\(\color{#3b3b9a}{f}\)`. The orange (linear regression), blue (smoothing splines) and green (smoothing splines) curves are three estimates for `\(\color{#3b3b9a}{f}\)`, with increasing level of complexity. .hi-pink[Q]: which model do you prefer (orange, blue, green) for each of the following examples? Why? .center[ <img src="img/2.9 ISL.png" alt="Drawing" style="width: 500px;"/> ] .footnote[Example from James et al. (2021) on [https://www.statlearning.com/](An introduction to statistical learning).] ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Data are generated from: `\(\color{#FFA500}{y} = \color{#3b3b9a}{f}(\color{#e64173}{x})+\epsilon\)`, with the black curve as the true `\(\color{#3b3b9a}{f}\)`. The orange (linear regression), blue (smoothing splines) and green (smoothing splines) curves are three estimates for `\(\color{#3b3b9a}{f}\)`, with increasing level of complexity. .hi-pink[Q]: which model do you prefer (orange, blue, green) for each of the following examples? Why? .center[ <img src="img/2.10 ISL.png" alt="Drawing" style="width: 500px;"/> ] .footnote[Example from James et al. (2021) on [https://www.statlearning.com/](An introduction to statistical learning).] ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Data are generated from: `\(\color{#FFA500}{y} = \color{#3b3b9a}{f}(\color{#e64173}{x})+\epsilon\)`, with the black curve as the true `\(\color{#3b3b9a}{f}\)`. The orange (linear regression), blue (smoothing splines) and green (smoothing splines) curves are three estimates for `\(\color{#3b3b9a}{f}\)`, with increasing level of complexity. .hi-pink[Q]: which model do you prefer (orange, blue, green) for each of the following examples? Why? .center[ <img src="img/2.11 ISL.png" alt="Drawing" style="width: 500px;"/> ] .footnote[Example from James et al. (2021) on [https://www.statlearning.com/](An introduction to statistical learning).] 
] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .font120[The *K*-nearest neighbors (KNN) classifier] - take the *K* observations in the training data set that are 'closest' to test observation `\(\color{#e64173}{x}_0\)`, calculate `$$\begin{eqnarray*} \text{Pr}(\color{#FFA500}{Y}=j|\color{#e64173}{X} = \color{#e64173}{x}_0) &=& \frac{1}{K} \sum_{i \in \mathcal{N}_0} \mathbb{I}(\color{#FFA500}{y}_i=j). \end{eqnarray*}$$` - KNN then assigns the test observation `\(\color{#e64173}{x}_0\)` to the class `\(j\)` with the highest probability, e.g. with *K=3* (from James et al., 2021) .center[ <img src="img/2.14 ISL.png" alt="Drawing" style="width: 300px;"/> ] .hi-pink[Q]: is KNN a supervised learning or unsupervised learning method? Discuss. ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .font120[The *K*-nearest neighbors (KNN) classifier (cont.)] Now compare KNN with *K* equals 1, 10 and 100. .center[ <img src="img/KNN_K_1.png" alt="Drawing" style="height: 250px;"/> <img src="img/KNN_K_10.png" alt="Drawing" style="height: 250px;"/> <img src="img/KNN_K_100.png" alt="Drawing" style="height: 250px;"/> ] .hi-pink[Q]: which classifier do you prefer? Which of these classifiers is under-fitting, which one is over-fitting? ] --- class: inverse, center, middle name: resampling # Data splitting and resampling methods <br> with {caret} and {rsample} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: ames # Ames Iowa housing data We will use the Ames Iowa housing data. There are 2,930 properties in the data set. The `Sale_Price` (target or response) was recorded along with 80 predictors, including: * location (e.g. neighborhood) and lot information * house components (garage, fireplace, pool, porch, etc.) * general assessments such as overall quality and condition * number of bedrooms, baths, and so on. More details in [De Cock (2011, Journal of Statistics Education)](http://ww2.amstat.org/publications/jse/v19n3/decock.pdf). The raw data are at [`http://bit.ly/2whgsQM`](http://bit.ly/2whgsQM) but we will use a processed version found in the [`AmesHousing`](https://github.com/topepo/AmesHousing) package. You will load the data with the `make_ames()` function from the `AmesHousing` library, and store the data in the object `ames`: ```r ames <- AmesHousing::make_ames() ``` --- name: data-splitting # Data splitting We fit our model on past data `\(\{(\color{#e64173}{x}_1,\color{#FFA500}{y}_1),(\color{#e64173}{x}_2,\color{#FFA500}{y}_2),\ldots, (\color{#e64173}{x}_n,\color{#FFA500}{y}_n)\}\)` and get `\(\hat{\color{#3b3b9a}{f}}\)`. *What we want*: how does our model .KULbginline[generalize] to new, unseen data `\((\color{#e64173}{x}_0,\color{#FFA500}{y}_0)\)`, or: is `\(\hat{\color{#3b3b9a}{f}}(\color{#e64173}{x}_0)\)` close to `\(\color{#FFA500}{y}_0\)`? .left-column[ .KULbginline[Training set] * to develop, to train, to tune, to compare different settings, ... .KULbginline[Test set] * to obtain unbiased estimate of final model's performance. 
] .right-column[ <img src="img/data_splitting.png" width="70%" style="display: block; margin: auto;" /> .footnote[Picture taken from [Introduction to Machine Learning in
](https://github.com/bgreenwell/intro-ml-r).] ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r *set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Use `set.seed()` for reproducibility. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) *index_1 <- sample(1 : nrow(ames), * size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Sample indices from `1 : nrow(ames)` such that in total 70% of the records is selected. Vector `index_1` now stores the row numbers of the selected records. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) *train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Put the selected records in training set `train_1` by subsetting the original data frame `ames` with the row numbers stored in `index_1`. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] *test_1 <- ames[-index_1, ] nrow(train_1)/nrow(ames) ``` ] .pull-right[ Put the not selected records in test set `test_1`. ] --- # Data splitting in base We first demonstrate the splitting of the `ames` housing data into a training and test set, using `base` R instructions. .pull-left[ ```r set.seed(123) index_1 <- sample(1 : nrow(ames), size = round(nrow(ames) * 0.7)) train_1 <- ames[index_1, ] test_1 <- ames[-index_1, ] *nrow(train_1)/nrow(ames) ## [1] 0.7 ``` ] .pull-right[ What is the ratio of the number of records in `train_1` versus original data set `ames`? ] --- # Data splitting in {caret} The {caret} package - short for Classification And REgression Training - contains functions to streamline the model training process for complex regression and classification problems. With the {caret} package, the function `createDataPartition` will do the job. .pull-left[ ```r *library(caret) *set.seed(123) index_2 <- caret::createDataPartition( y = ames$Sale_Price, p = 0.7, list = FALSE) train_2 <- ames[index_2, ] test_2 <- ames[-index_2, ] nrow(train_2)/nrow(ames) ``` ] .pull-right[ Load the library {caret}. Use `set.seed()` for reproducibility. ] --- # Data splitting in {caret} The {caret} package - short for Classification And REgression Training - contains functions to streamline the model training process for complex regression and classification problems. With the {caret} package, the function `createDataPartition` will do the job. .pull-left[ ```r library(caret) set.seed(123) *index_2 <- caret::createDataPartition( * y = ames$Sale_Price, * p = 0.7, * list = FALSE) train_2 <- ames[index_2, ] test_2 <- ames[-index_2, ] nrow(train_2)/nrow(ames) ``` ] .pull-right[ `createDataPartition` takes in `y` the vector of outcomes of the data set we wish to split. 
`createDataPartition` will do stratified sampling based on levels of `y` (for factor) or groups determined by the percentiles of `y` (for numeric). The percentage of data that goes to training is `p`. `list = FALSE` tells the function not to store the results in a list, but in a matrix (here: with 1 column) ] --- # Data splitting in {rsample} <img src="img/rsample.png" class="title-hex"> The {rsample} package, part of the {tidymodels} initiative of RStudio, is home to a wide variety of resampling functions. The documentation is at [rsample: the basics](https://tidymodels.github.io/rsample/articles/Basics.html). .pull-left[ ```r *library(rsample) *set.seed(123) split_1 <- rsample::initial_split(ames, prop = 0.7) train_3 <- training(split_1) test_3 <- testing(split_1) nrow(train_3)/nrow(ames) ``` ] .pull-right[ Load the `rsample` package. Use `set.seed()` for reproducibility. ] --- # Data splitting in {rsample} <img src="img/rsample.png" class="title-hex"> The {rsample} package, part of the {tidymodels} initiative of RStudio, is home to a wide very variety of resampling functions. The documentation is at [rsample: the basics](https://tidymodels.github.io/rsample/articles/Basics.html). .pull-left[ ```r library(rsample) set.seed(123) *split_1 <- rsample::initial_split(ames, prop = 0.7) train_3 <- training(split_1) test_3 <- testing(split_1) nrow(train_3)/nrow(ames) ``` ] .pull-right[ `initial_split` from the {rsample} package. Split the data `ames` into a training set and testing set. `prop` is the proportion of data to be retained as training ] --- # Data splitting in {rsample} <img src="img/rsample.png" class="title-hex"> The {rsample} package, part of the {tidymodels} initiative of RStudio, is home to a wide very variety of resampling functions. The documentation is at [rsample: the basics](https://tidymodels.github.io/rsample/articles/Basics.html). .pull-left[ ```r library(rsample) set.seed(123) split_1 <- rsample::initial_split(ames, prop = 0.7) *train_3 <- training(split_1) *test_3 <- testing(split_1) nrow(train_3)/nrow(ames) ``` ] .pull-right[ The result of `rsample::initial_split` is an `rset` object. It is stored in `split_1` and ready for inspection. Apply the functions `training` and `test` to this object to extract the data in each split. ] --- # Data splitting comparison As a check, we plot the `Sale_Price` as available in the train (in black) vs test (in red) data sets, created by each of the three demonstrated methods. <img src="ML_part1_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[Validation set] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - we hold out a subset of the training data (e.g. 30%) and then evaluate the model on this held out validation set - calculate the loss function on this validation set, as approximation of the true test error - .font140[
] high variability + inefficient use of data - picture .hi-KUL[validation set (30%)] and .hi-pink[training set (70%)] <img src="ML_part1_files/figure-html/plot-validation-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[*k* fold cross validation] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - divide training data into *k* equally sized groups (e.g. .hi-KUL[group 1] on the picture) - iterate over the *k* groups, treating each as validation set once (and train model on the other *k-1* groups) (e.g. get .hi-KUL[MSE<sub>1</sub>] corresponding to fold 1) - average the folds' loss to estimate the true test error - .font140[
] greater accuracy (compared to validation set). <img src="ML_part1_files/figure-html/plot-CV-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[*k* fold cross validation] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - divide training data into *k* equally sized groups (e.g. .hi-KUL[group 1] on the picture) - iterate over the *k* groups, treating each as validation set once (and train model on the other *k-1* groups) (e.g. get .hi-KUL[MSE<sub>1</sub>] corresponding to fold 1) - average the folds' loss to estimate the true test error - .font140[
] greater accuracy (compared to validation set). <img src="ML_part1_files/figure-html/plot-CV3-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[*k* fold cross validation] (picture from [Boehmke & Greenwell](https://koalaverse.github.io/homlr/)) .center[ <img src="img/k_fold_CV.png" width="95%" style="display: block; margin: auto;" /> ] --- # Resampling methods (cont.) In [Data splitting](#data-splitting), we discussed *training* and *test* set. Let's now dive deeper into *resampling* methods. .hi-KUL[Leave-one-out cross validation] (visual inspired by [Ed Rubin's course](https://github.com/edrubin/EC524W20)) - each observation takes a turn as the validation set (e.g. get .hi-KUL[MSE<sub>3</sub>]) - other *n-1* observations are the training set - average the folds' loss to estimate the true test error - .font140[
] very computationally demanding. <img src="ML_part1_files/figure-html/plot-LOOCV-set-1.svg" style="display: block; margin: auto;" /> --- # Resampling methods in {caret} We set up 5-fold cross validation using the {caret} package. .pull-left[ ```r set.seed(123) *cv_folds <- caret::createFolds(y = ames$Sale_Price, * k = 5, list = TRUE, * returnTrain = TRUE) ``` ```r str(cv_folds) ``` ``` ## List of 5 ## $ Fold1: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold2: int [1:2343] 2 3 4 6 7 8 9 11 13 14 ... ## $ Fold3: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold4: int [1:2344] 1 3 5 6 10 11 12 13 14 15 ... ## $ Fold5: int [1:2345] 1 2 4 5 7 8 9 10 11 12 ... ``` ] .pull-right[ The `createFolds` function from {caret} splits the data into `k` groups. `list = TRUE` indicates that the results should be stored in a list `returnTrain = TRUE` indicates that the values returned (and stored) in the elements of the list are - per fold - the row numbers of the observations selected for training. ] --- # Resampling methods in {caret} We set up 5-fold cross validation using the {caret} package. .pull-left[ ```r set.seed(123) cv_folds <- caret::createFolds(y = ames$Sale_Price, k = 5, list = TRUE, returnTrain = TRUE) ``` ```r *str(cv_folds) ``` ``` ## List of 5 ## $ Fold1: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold2: int [1:2343] 2 3 4 6 7 8 9 11 13 14 ... ## $ Fold3: int [1:2344] 1 2 3 4 5 6 7 8 9 10 ... ## $ Fold4: int [1:2344] 1 3 5 6 10 11 12 13 14 15 ... ## $ Fold5: int [1:2345] 1 2 4 5 7 8 9 10 11 12 ... ``` ] .pull-right[ Inspect the list `cv_folds` that was returned by `createFolds(.)`. This list has `k` elements, each storing the row numbers of the observations in the training set of the fold under consideration. ] --- # Resampling methods in {caret} <img src="img/purrr.png" class="title-hex"> .pull-left[ ```r *mean(ames[cv_folds$Fold1, ]$Sale_Price) ``` ``` ## [1] 180954.3 ``` ```r map_dbl(cv_folds, function(x) { mean(ames[x, ]$Sale_Price) }) ``` ``` ## Fold1 Fold2 Fold3 Fold4 Fold5 ## 180954.3 180781.8 180646.4 180563.0 181034.7 ``` ] .pull-right[ We calculate the average `Sale_Price` per fold, that is: we average the `Sale_Price` over all observations selected in the training set of a particular fold. That would go as follows, for `Fold1` in the list `cv_folds` ```r mean(ames[cv_folds$Fold1, ]$Sale_Price) ``` and similarly for `Fold2`, ..., `Fold5`. ] --- # Resampling methods in {caret} <img src="img/purrr.png" class="title-hex"> .pull-left[ ```r mean(ames[cv_folds$Fold1, ]$Sale_Price) ``` ``` ## [1] 180954.3 ``` ```r *map_dbl(cv_folds, * function(x) { * mean(ames[x, ]$Sale_Price) * }) ``` ``` ## Fold1 Fold2 Fold3 Fold4 Fold5 ## 180954.3 180781.8 180646.4 180563.0 181034.7 ``` ] .pull-right[ We apply the function `mean(ames[___, ]$Sale_Price)` over all `k` elements of the list `cv_folds`. `map_dbl(.x, .f)` is one of the `map` functions from the {purrr} package (part of {tidyverse}), used for functional programming in R. `map_dbl(.x, .f)` applies function `.f` to each element of list `.x`. The result is a double-precision vector, hence `map_dbl` and not just `map`. Btw, it is a historical anomaly that R has two names for its floating-point vectors, `double` and `numeric`. 
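A quick, hypothetical aside (toy values, not from the Ames data) to see what `map_dbl()` does:

```r
# map_dbl() applies a function to every element of a list
# and returns a numeric (double) vector
purrr::map_dbl(list(a = 1:3, b = 4:6), mean)  # returns c(a = 2, b = 5)
```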
] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) *cv_rsample <- rsample::vfold_cv(ames, v = 5) cv_rsample$splits ``` ``` ## [[1]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[2]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[3]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[4]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[5]] ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ] .pull-right[ The function `vfold_cv` splits the data into `v` groups (called folds) of equal size. ] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) cv_rsample <- rsample::vfold_cv(ames, v = 5) *cv_rsample$splits ``` ``` ## [[1]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[2]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[3]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[4]] ## <Analysis/Assess/Total> ## <2344/586/2930> ## ## [[5]] ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ] .pull-right[ The function `vfold_cv` splits the data into `v` groups (called folds) of equal size. We store the result of `vfold_cv` in the object `cv_rsample`. The resulting object stores `v` resamples of the original data set. ] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) cv_rsample <- rsample::vfold_cv(ames, v = 5) ``` ```r *cv_rsample$splits[[1]] ``` ``` ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ```r cv_rsample$splits[[1]] %>% analysis() %>% dim() ``` ``` ## [1] 2344 81 ``` ```r cv_rsample$splits[[1]] %>% assessment() %>% dim() ``` ``` ## [1] 586 81 ``` ] .pull-right[ Inspect the composition of the first resample: 2,344 (out of 2,930) observations go to the analysis data (for training, i.e. `v-1` folds), 586 (out of 2,930) observations go to the assessment data (for testing, the final fold). ] --- # Resampling methods in {rsample} <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r set.seed(123) cv_rsample <- rsample::vfold_cv(ames, v = 5) ``` ```r cv_rsample$splits[[1]] ``` ``` ## <Analysis/Assess/Total> ## <2344/586/2930> ``` ```r *cv_rsample$splits[[1]] %>% analysis() %>% dim() ``` ``` ## [1] 2344 81 ``` ```r *cv_rsample$splits[[1]] %>% assessment() %>% dim() ``` ``` ## [1] 586 81 ``` ] .pull-right[ Inspect the composition of the first resample: get the dimensions (`dim()`) of the analysis data (`analysis()`) of the first resample get the dimensions (`dim()`) of the assessment data (`assessment()`) of the first resample. ] --- # Resampling methods in {rsample} <img src="img/purrr.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> .pull-left[ ```r map_dbl(cv_rsample$splits, function(x) { mean(rsample::analysis(x)$Sale_Price) }) ``` ``` ## [1] 181310.8 180991.0 180840.0 181268.6 179569.9 ``` ```r map_dbl(cv_rsample$splits, function(x) { nrow(rsample::analysis(x)) }) ``` ``` ## [1] 2344 2344 2344 2344 2344 ``` ] .pull-right[ As before, use `map_dbl(.x, .f)` to apply a function `.f` over all elements of a list `.x`. Here the list is stored in `cv_rsample$splits`, with `v = 5` elements. 
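The same pattern also works with {purrr}'s formula shorthand; a hypothetical one-liner for the assessment sets (output not shown) would be:

```r
# average Sale_Price in each assessment (holdout) fold, using the ~ .x shorthand
map_dbl(cv_rsample$splits, ~ mean(rsample::assessment(.x)$Sale_Price))
```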
] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .hi-pink[Q]: Now you're going to combine data splitting and resampling to create training, validation and test folds in the Ames data. Use `caret` or `rsample` and make the validation folds of the same size as the test fold. ] --- class: clear .pull-left[ with `caret` ```r set.seed(5678) ind_caret <- caret::createDataPartition( y = ames$Sale_Price, p = 5/6, list = FALSE) train_caret <- ames[ind_caret, ] test_caret <- ames[-ind_caret, ] cv_caret <- caret::createFolds( y = train_caret$Sale_Price, k = 5, list = TRUE, returnTrain = FALSE) purrr::map_dbl(cv_caret, ~ nrow(ames[., ])) ## Fold1 Fold2 Fold3 Fold4 Fold5 ## 488 488 489 489 489 nrow(test_caret) ## [1] 487 ``` ] .pull-right[ with `rsample` ```r set.seed(5678) ind_rsample <- rsample::initial_split(ames, prop = 5/6) train_rsample <- rsample::training(ind_rsample) test_rsample <- rsample::testing(ind_rsample) cv_rsample <- rsample::vfold_cv(train_rsample, v = 5) map_dbl(cv_rsample$splits, ~ nrow(rsample::assessment(.))) ## [1] 489 488 488 488 488 nrow(test_rsample) ## [1] 489 ``` ] --- class: inverse, center, middle name: tuning # Parameter tuning with {caret}, {rsample} and {purrr} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Tuning parameters .pull-left[Finding the optimal level of flexibility highlights the bias-variance tradeoff. .hi-pink[Bias] : the error that comes from inaccurately estimating `\(\color{#3b3b9a}{f}\)`. .hi-pink[Variance] : the amount `\(\hat{\color{#3b3b9a}{f}}\)` would change with a different training sample. .font140[.KULbginline[Take-aways]] .font160[
] : high variance models more prone to overfitting * use .hi-pink[resampling methods] to reduce this risk * hyperparameters (or *tuning parameters*) control complexity, and thus the bias-variance trade-off * identify their optimal setting, e.g. with a *grid search* * no analytic expression for these hyperparameters. ] .pull-right[ <img src="ML_part1_files/figure-html/bias-variance-knn-1.svg" width="100%" style="display: block; margin: auto;" /><img src="ML_part1_files/figure-html/bias-variance-knn-2.svg" width="100%" style="display: block; margin: auto;" /> .footnote[Code from Boehmke & Greenwell (2019, Chapter 2) on [Hands-on machine learning with R](https://koalaverse.github.io/homlr/).] ] --- # Tuning parameters via grid search .pull-left-alt[ .center[ <img src="img/flow_chart_applied_predictive_modeling.jpg" width="85%" style="display: block; margin: auto;" /> ] ] .pull-right-alt[ .hi-pink[Model training & validation phase] - define a set of candidate values (a *grid*) - assess model utility across the candidates (use clever *resampling*) - choose the optimal settings (optimize *loss*) - refit the model on entire training data with final tuning parameters - evaluate performance of the model on the test data (under
). .hi-pink[Model selection] - repeat the above steps for different models - compare how well these models generalize to new data (via test data, under
). .footnote[Flow chart from Kuhn & Johnson (2013) on [Applied predictive modeling](http://appliedpredictivemodeling.com/).] ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) *cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ Use `trainControl` from {caret} to set some control parameters that will be used in the actual `train` function. Here, we use `method = cv` and `number = 5` for 5-fold cross validation. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, * returnResamp = "all", * selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ In `trainControl` we put `returnResamp = "all"` to store all resampled summary metrics. `selectionFunction = "best"` specifies how we select the optimal tuning parameter. With `"best"` the value that minimizes the performance (here: RMSE) is selected. Alternative: `selectionFunction = "oneSE"` applies the one standard error rule. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") *hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ Set the grid of *K*-values that will be searched. `expand.grid` creates a data frame with one row for each value of *K* to consider. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) *knn_fit <- train(y ~ x, data = df, method = "knn", * trControl = cv, * tuneGrid = hyper_grid) knn_fit$bestTune ``` ] .pull-right[ {caret} will `train` the method `knn` using the settings in `trControl = cv`, across the values of *K* stored in `tuneGrid = hyper_grid`. The data `df` and formula `y ~ x` are used. ] --- # Training a model with {caret} .pull-left[ ```r set.seed(123) cv <- trainControl(method = "cv", number = 5, returnResamp = "all", selectionFunction = "best") hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) knn_fit <- train(y ~ x, data = df, method = "knn", trControl = cv, tuneGrid = hyper_grid) *knn_fit$bestTune ``` ``` ## k ## 18 36 ``` ] .pull-right[ We retrieve the optimal value of the tuning parameter, according to the `selectionFunction`. For the folds created here and with `selectionFunction = "best"` the optimal *K* value is 36. What happens when you change to `selectionFunction = "oneSE"`? ] --- # Training a model with {caret} .pull-left[ ``` ## k ## 18 36 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-67-1.svg" style="display: block; margin: auto;" /> ] .pull-right[ ``` ## k ## 27 54 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-68-1.svg" style="display: block; margin: auto;" /> ] --- # Training a model with {rsample} <img src="img/rsample.png" class="title-hex"> <img src="img/purrr.png" class="title-hex"> .pull-left[ Our starting point is the simulated data stored in `df`, resampled with 5-fold cross-validation. 
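(`df` itself is not created on these slides; purely as an assumption, a minimal sketch of a comparable simulated data set with columns `x` and `y` could be:)

```r
# hypothetical stand-in for the simulated data frame df (358 rows, columns x and y)
set.seed(42)
df <- data.frame(x = runif(358, 0, 2 * pi))
df$y <- sin(df$x) + rnorm(358, sd = 0.3)
```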
```r set.seed(123) # for reproducibility cv_rsample <- vfold_cv(df, 5) cv_rsample$splits[1:3] ``` ``` ## [[1]] ## <Analysis/Assess/Total> ## <286/72/358> ## ## [[2]] ## <Analysis/Assess/Total> ## <286/72/358> ## ## [[3]] ## <Analysis/Assess/Total> ## <286/72/358> ``` ] .pull-right[ We fit the *KNN* on the holdout data in split *s*, using a given *K* value. ```r holdout_results <- function(s, k_val) { # Fit the model to the analysis data in split s df_train <- analysis(s) mod <- knnreg(y ~ x, k = k_val, data = df_train) # Get the remaining group holdout <- assessment(s) # Get predictions with the holdout data set res <- predict(mod, newdata = holdout) # Return observed and predicted values # on holdout set res <- tibble(obs = holdout$y, pred = res) res } ``` ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Now you're going to combine the .hi-pink[resampling and model fitting instructions] and set up a first example of .hi-pink[tuning a parameter] over a grid of possible values: the *K* in a .hi-pink[KNN regression model]. .hi-pink[Q]: use the function `holdout_results(.s, .k)` as defined on the previous sheet. You will use this function to calculate the RMSE<sub>k</sub> of fold *k*. 1. Specify a grid of values of *K*, store it in `hyper_grid`. Use `expand.grid(.)` 2. Pick one of the resamples stored in `cv_rsample$splits` and pick a value from the grid. Calculate the RMSE on the holdout data of this split. 3. For all values in the tuning grid, calculate the RMSE averaged over all folds, and the corresponding standard error. 4. Use the results from .hi-pink[Q.3] to pick the value of *K* via minimal RMSE. 5. Pick the largest value of *K* such that the corresponding RMSE is below the minimal RMSE from .hi-pink[Q.4] plus its corresponding SE. ] --- class: clear .pull-left[ .hi-pink[Q.1] We set up the grid ```r hyper_grid <- expand.grid(k = seq(2, 150, by = 2)) hyper_grid %>% slice(1:3) ``` | k| |--:| | 2| | 4| | 6| .hi-pink[Q.2] We apply the function `holdout_results(.s, .k)` on the third resample, with the first value for *K* in the grid. ```r res <- holdout_results(cv_rsample$splits[[3]], hyper_grid[1, ]) sqrt(sum((res$obs - res$pred)^2)/nrow(res)) ``` ``` ## [1] 0.3608923 ``` ] .pull-right[ .hi-pink[Q.3] Mean RMSE over the 5 folds and corresponding SE. 
```r RMSE <- numeric(nrow(hyper_grid)) SE <- numeric(nrow(hyper_grid)) for(i in 1:nrow(hyper_grid)){ cv_rsample$results <- map(cv_rsample$splits, holdout_results, hyper_grid[i, ]) res <- map_dbl(cv_rsample$results, function(x) mean((x$obs - x$pred)^2)) RMSE[i] <- mean(sqrt(res)) ; SE[i] <- sd(sqrt(res)) } ``` .hi-pink[Q.4] Choose *K* via minimal RMSE <table> <thead> <tr> <th style="text-align:right;"> RMSE </th> <th style="text-align:right;"> SE </th> <th style="text-align:right;"> k </th> <th style="text-align:right;"> lower </th> <th style="text-align:right;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.2917121 </td> <td style="text-align:right;"> 0.0247127 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 0.2669995 </td> <td style="text-align:right;"> 0.3164248 </td> </tr> </tbody> </table> .hi-pink[Q.5] Choose *K* via the one-standard-error rule <table> <thead> <tr> <th style="text-align:right;"> RMSE </th> <th style="text-align:right;"> SE </th> <th style="text-align:right;"> k </th> <th style="text-align:right;"> lower </th> <th style="text-align:right;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.3157639 </td> <td style="text-align:right;"> 0.0284855 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 0.2872784 </td> <td style="text-align:right;"> 0.3442495 </td> </tr> </tbody> </table> ] --- class: clear <img src="ML_part1_files/figure-html/one-SE-rule-1.svg" style="display: block; margin: auto;" /> --- # Putting it all together .pull-left[ During the tuning process we inspect plots like the one on the right. .font140[.KULbginline[Take-aways]] .font160[
] *Less is more*: - we prefer simple over more complex - choose tuning parameters based on the numerically optimal value .KULbginline[OR] - choose a simpler model that is within a certain tolerance of the numerically best value - use the .hi-pink['one-standard-error' rule]. With the selected tuning parameters, we refit the model on the complete training set and use it to predict the test set (under
). ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-79-1.svg" width="85%" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle name: engineering # Target and feature engineering: <br> data pre-processing steps <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # What is feature engineering? .pull-left-alt[ .center[ <img src="img/feature_engineering_Kuhn.jpg" alt="Drawing" style="width: 165px;"/> <br> <br> <img src="img/boehmke_greenwell.jpg" alt="Drawing" style="width: 165px;"/> ] ] .pull-right-alt[ Feature engineering: - applies .hi-pink[pre-processing steps] to predictor (features) variables - .hi-pink[creates new input features] from your existing ones (e.g. network features derived from a social network in a fraud detection model). Target engineering: * transforms the response variable (or target) to improve the performance of a predictive model. The goal is to .KULbginline[make models more effective]. See Kuhn & Johnson (2019) on [Feature Engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/) for a detailed discussion. ] --- name: black-white-box class: clear .font140[.KULbginline[Take-aways]] .font160[
] : *different models* have *different sensitivities* to the type of target and feature values in the model. .center[ <img src="img/Applied_Pred_overview_models.jpg" width="65%" style="display: block; margin: auto;" /> ] Source: Kuhn & Johnson (2013) on [Applied predictive modeling](http://appliedpredictivemodeling.com/). --- # Target engineering <img src="img/rsample.png" class="title-hex"> We load the `ames` data set from the {AmesHousing} package and apply a .hi-pink[stratified split] of the data into a training (70%) and test (30%) set. We stratify on the distribution of the target variable `Sale_Price` using the `strata` argument in `rsample::initial_split`. ```r ames <- AmesHousing::make_ames() set.seed(123) split <- rsample::initial_split(ames, prop = 0.7, `strata = "Sale_Price"`) ames_train <- rsample::training(split) ames_test <- rsample::testing(split) ``` We check the distribution of `Sale_Price` in both `ames_train` and `ames_test`. ```r summary(ames_train$Sale_Price) summary(ames_test$Sale_Price) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 12789 129500 160000 180923 213500 755000 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 35000 129500 160000 180502 213500 745000 ``` --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Inference with linear models often assumes that the target is generated from a normal distribution. .hi-pink[Q]: let's examine whether the `Sale_Price` target satisfies this assumption. 1. Plot a histogram of `Sale_Price`. Is normality a meaningful assumption? 2. Try some transformation functions such that the transformed target approaches a normal distribution. ] --- class: clear .pull-left[ .hi-pink[Q.1] original target <img src="ML_part1_files/figure-html/unnamed-chunk-85-1.svg" width="80%" style="display: block; margin: auto;" /> ```r summary(ames_train$Sale_Price) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 12789 129500 160000 180923 213500 755000 ``` ] .pull-right[ .hi-pink[Q.2] log-transformed target <img src="ML_part1_files/figure-html/unnamed-chunk-87-1.svg" width="80%" style="display: block; margin: auto;" /> ```r summary(log(ames_train$Sale_Price)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 9.456 11.771 11.983 12.020 12.271 13.534 ``` ] --- # Feature engineering steps Examples of common pre-processing steps: * Some models (e.g. KNN, Lasso, neural networks) require that the predictor variables are on the same scale. <br> .hi-pink[Centering (C)] and .hi-pink[scaling (S)] the predictors can be used for this purpose. * Other models are very sensitive to correlations between the predictors and filters or PCA signal extraction can improve these models. * Some models find .hi-pink[(near) zero-variance (NZV)] predictors problematic, and these should be removed before fitting the model. * In other cases, the data should be .hi-pink[encoded] in a specific way to make sure all predictors are numeric (e.g. one-hot encoding of factor variables in neural networks). * Many models cannot cope with .hi-pink[missing data] so .hi-pink[imputation strategies] might be necessary. * Development of new features that represent something important to the outcome. * (add your own example here!) This list is inspired by Max Kuhn (2019) on [Applied Machine Learning](https://github.com/topepo/aml-london-2019). 
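---

# Feature engineering steps (cont.)

To make a few of these steps concrete, here is a minimal, illustrative sketch with base R and {caret}; the object names are ours and the blueprint we actually build in the next slides relies on {recipes}.

```r
library(caret)

# numeric predictors (leave the target Sale_Price untouched)
num_vars <- setdiff(names(ames_train)[sapply(ames_train, is.numeric)], "Sale_Price")

# centering (C) and scaling (S): estimate the constants on the training data only
pre_proc <- caret::preProcess(ames_train[, num_vars], method = c("center", "scale"))
train_num_scaled <- predict(pre_proc, newdata = ames_train[, num_vars])
test_num_scaled  <- predict(pre_proc, newdata = ames_test[, num_vars]) # reuse the training constants

# one-hot encoding of a categorical feature, e.g. House_Style
house_style_dummies <- model.matrix(~ House_Style - 1, data = ames_train)
```

Estimating the centering and scaling constants on the training data only, and re-using them on the test data, is exactly what avoids the data leakage discussed on the next slide.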
--- # A blueprint for feature engineering .font140[.KULbginline[Take-aways]] .font160[
] : *a proper implementation* .pull-left[ * draft a .hi-pink[blueprint] of the necessary pre-processing steps, and their order * [Boehme & Greenwell (2019)](https://bradleyboehmke.github.io/HOML/engineering.html#proper-implementation) suggest 1. Filter out zero or near-zero variance features. <br> 2. Perform imputation if required. <br> 3. Normalize to resolve numeric feature skewness. <br> 4. Standardize (center and scale) numeric features. <br> 5. Perform dimension reduction (e.g., PCA) on <br> numeric features. <br> 6. One-hot or dummy encode categorical features. ] .pull-right[ * avoid .hi-pink[data leakage] in the pre-processing steps when applied to resampled data sets! .center[ <img src="img/data_leakage.png" width="100%" style="display: block; margin: auto;" /> ] ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ We already detected the necessity of log-transforming `Sale_Price` when building linear models. We add another pre-processing step, inspired by the .hi-pink[high cardinality] feature `Neighborhood`. ```r ames_train %>% group_by(Neighborhood) %>% summarize(n_obs = n()) %>% arrange(n_obs) %>% slice(1:4) ``` <table> <thead> <tr> <th style="text-align:left;"> Neighborhood </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Landmark </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Green_Hills </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Greens </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Blueste </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-92-1.svg" width="95%" style="display: block; margin: auto;" /> ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ We'll use `recipe()` from the {recipes} package. The main idea is to .hi-pink[preprocess multiple datasets] using a single `recipe()`. Before we start, keep the following .KULbginline[fundamentals] of {recipes} in mind! ] -- .pull-right[ Creating a `recipe` takes the following steps: * get the *ingredients* (`recipe()`): specify the response and predictor variables * write the recipe (`step_zzz()`): define the *pre-processing steps*, such as imputation, creating dummy variables, scaling, and more * *prepare* the recipe (`prep()`): provide a dataset to base each step on (e.g. *calculate* constants to do centering and scaling) * *bake* the recipe (`bake()`): *apply* the pre-processing steps to your datasets. .footnote[Source: [Rebecca Barter's blog](http://www.rebeccabarter.com/blog/2019-06-06_pre_processing/)] ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ Use `recipe()` to create the preprocessing blueprint (to be applied later) ```r library(recipes) mod_rec <- recipe(Sale_Price ~ ., data = ames_train) mod_rec ``` Now, `mod_rec` knows the role of each variable (`predictor` or `outcome`). We can use selectors such as `all_predictors()`, `all_outcomes()` or `all_nominal()`. ] .pull-right[ Extend `mod_rec` with two pre-processing steps: `step_log(all_outcomes())` `step_other(Neighborhood, threshold = 0.05)` to lump the levels that occur in less than 5% of data as "other". 
```r mod_rec <- mod_rec %>% step_log(all_outcomes()) %>% step_other(Neighborhood, threshold = 0.05) mod_rec ``` ] --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .center[ <img src="img/recipes_verbs.png" width="100%" style="display: block; margin: auto;" /> ] .footnote[Source Max Kuhn (2019) on [Applied Machine Learning](https://github.com/topepo/aml-london-2019).] Now that we have a preprocessing *specification*, we run on it on the `ames_train` to *prepare* (or `prep()`) the recipe. ```r mod_rec_trained <- prep(mod_rec, training = ames_train, verbose = TRUE, retain = TRUE) ``` ```r mod_rec_trained <- prep(mod_rec, training = ames_train, verbose = TRUE, retain = TRUE) ## oper 1 step log [training] ## oper 2 step other [training] ## The retained training set is ~ 0.82 Mb in memory. ``` The `retain = TRUE` indicates that the preprocessed training set should be saved. --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> ```r mod_rec_trained ``` Once the recipe is prepared, it can be applied to any data set using `bake()`. There is no need to `bake()` the data used in the `prep()` step; you get the processed training set with `juice()`. ```r ames_test_prep <- bake(mod_rec_trained, new_data = ames_test) ``` --- # Feature engineering with {recipes} <img src="img/recipes.png" class="title-hex"> .pull-left[ ```r ames_test_prep %>% group_by(Neighborhood) %>% summarize(n_obs = n()) %>% arrange(n_obs) ``` <table> <thead> <tr> <th style="text-align:left;"> Neighborhood </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Sawyer </td> <td style="text-align:right;"> 43 </td> </tr> <tr> <td style="text-align:left;"> Northridge_Heights </td> <td style="text-align:right;"> 50 </td> </tr> <tr> <td style="text-align:left;"> Gilbert </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> Somerset </td> <td style="text-align:right;"> 60 </td> </tr> <tr> <td style="text-align:left;"> Edwards </td> <td style="text-align:right;"> 63 </td> </tr> <tr> <td style="text-align:left;"> College_Creek </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> Old_Town </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> North_Ames </td> <td style="text-align:right;"> 137 </td> </tr> <tr> <td style="text-align:left;"> other </td> <td style="text-align:right;"> 331 </td> </tr> </tbody> </table> ] .pull-right[ ```r juice(mod_rec_trained) %>% group_by(Neighborhood) %>% summarize(n_obs = n()) %>% arrange(n_obs) ``` <table> <thead> <tr> <th style="text-align:left;"> Neighborhood </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Sawyer </td> <td style="text-align:right;"> 108 </td> </tr> <tr> <td style="text-align:left;"> Gilbert </td> <td style="text-align:right;"> 111 </td> </tr> <tr> <td style="text-align:left;"> Northridge_Heights </td> <td style="text-align:right;"> 116 </td> </tr> <tr> <td style="text-align:left;"> Somerset </td> <td style="text-align:right;"> 122 </td> </tr> <tr> <td style="text-align:left;"> Edwards </td> <td style="text-align:right;"> 131 </td> </tr> <tr> <td style="text-align:left;"> Old_Town </td> <td style="text-align:right;"> 164 </td> </tr> <tr> <td style="text-align:left;"> College_Creek </td> <td style="text-align:right;"> 199 </td> </tr> <tr> <td style="text-align:left;"> North_Ames </td> <td style="text-align:right;"> 306 
</td> </tr> <tr> <td style="text-align:left;"> other </td> <td style="text-align:right;"> 792 </td> </tr> </tbody> </table> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ Now you will extend the existing recipe in `mod_rec`, prepare and bake it again! .hi-pink[Q]: consult the [{recipes} manual](https://tidymodels.github.io/recipes/reference/index.html) and specify a recipe for the housing data that includes the following pre-processing steps (in this order) * log-transform the outcome variable * remove any zero-variance predictors * lump factor levels that occur in <= 5% of data as "other" for both `Neighborhood` as well as `House_Style` * center and scale all numeric features. 1. Specify the above recipe on the training set and store it in the object `mod_rec`. 2. Inspect the object `mod_rec` using `summary(mod_rec)`. What can you learn from this summary? 3. Prepare the recipe on the training data and then apply it to the test set. ] --- class: clear .pull-left[ First, let's try to get a grasp of the `House_Style` feature as well as the presence of zero-variance predictors. ```r ames_train %>% group_by(House_Style) %>% summarize(n_obs = n()) %>% arrange(n_obs) ``` <table> <thead> <tr> <th style="text-align:left;"> House_Style </th> <th style="text-align:right;"> n_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Two_and_Half_Fin </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> One_and_Half_Unf </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> Two_and_Half_Unf </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> SFoyer </td> <td style="text-align:right;"> 61 </td> </tr> <tr> <td style="text-align:left;"> SLvl </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> One_and_Half_Fin </td> <td style="text-align:right;"> 214 </td> </tr> <tr> <td style="text-align:left;"> Two_Story </td> <td style="text-align:right;"> 609 </td> </tr> <tr> <td style="text-align:left;"> One_Story </td> <td style="text-align:right;"> 1036 </td> </tr> </tbody> </table> ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-107-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear To detect the presence of zero-variance and near-zero-variance features the `caret` library has the function `nearZeroVar` ```r library(caret) nzv <- caret::nearZeroVar(ames_train, saveMetrics = TRUE) ``` ```r names(ames_train)[nzv$zeroVar] ``` ``` ## character(0) ``` ```r names(ames_train)[nzv$nzv] ``` ``` ## [1] "Street" "Alley" "Land_Contour" ## [4] "Utilities" "Land_Slope" "Condition_2" ## [7] "Roof_Matl" "Bsmt_Cond" "BsmtFin_Type_2" ## [10] "BsmtFin_SF_2" "Heating" "Low_Qual_Fin_SF" ## [13] "Kitchen_AbvGr" "Functional" "Enclosed_Porch" ## [16] "Three_season_porch" "Screen_Porch" "Pool_Area" ## [19] "Pool_QC" "Misc_Feature" "Misc_Val" ``` So, no features have zero- variance, but 20 features have near-zero-variance. 
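To see why these features are flagged, you can dig a bit deeper into the `saveMetrics = TRUE` output: a feature is marked `nzv` when its frequency ratio (most common value over second most common) exceeds `freqCut` (default 95/5) and its percentage of unique values is below `uniqueCut` (default 10). A minimal sketch, reusing the `nzv` object from above:

```r
library(dplyr)

# inspect the metrics behind the near-zero-variance flags
nzv %>%
  tibble::rownames_to_column("feature") %>%
  filter(nzv) %>%                  # keep the flagged features only
  arrange(desc(freqRatio)) %>%     # most extreme features first
  slice(1:5)
```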
--- class: clear .pull-left[ We put the recipe together with the following steps ```r mod_rec <- recipe(Sale_Price ~ ., data = ames_train) %>% step_log(all_outcomes()) %>% step_other(Neighborhood, threshold = 0.05) %>% step_other(House_Style, threshold = 0.05) %>% step_zv(all_predictors()) %>% step_nzv(all_predictors()) %>% step_center(all_numeric(), -all_outcomes()) %>% step_scale(all_numeric(), -all_outcomes()) summary(mod_rec) %>% slice(1:6) ``` <table> <thead> <tr> <th style="text-align:left;"> variable </th> <th style="text-align:left;"> type </th> <th style="text-align:left;"> role </th> <th style="text-align:left;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> MS_SubClass </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> MS_Zoning </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Lot_Frontage </td> <td style="text-align:left;"> double , numeric </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Lot_Area </td> <td style="text-align:left;"> integer, numeric </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Street </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> Alley </td> <td style="text-align:left;"> factor , unordered, nominal </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> </tbody> </table> ] .pull-right[ ```r mod_rec ``` ] --- class: clear .pull-left[ We prep the recipe on `ames_train` ```r mod_rec_trained <- prep(mod_rec, training = ames_train, verbose = TRUE, retain = TRUE) ## oper 1 step log [training] ## oper 2 step other [training] ## oper 3 step other [training] ## oper 4 step zv [training] ## oper 5 step nzv [training] ## oper 6 step center [training] ## oper 7 step scale [training] ## The retained training set is ~ 0.75 Mb in memory. 
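# note: one 'oper' line appears per step in the recipe
# (log, other, other, zv, nzv, center, scale), applied in this order to ames_train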
``` and bake it on the `ames_test` data ```r ames_test_prep <- bake(mod_rec_trained, new_data = ames_test) ``` We inspect the processed training and test set ```r dim(juice(mod_rec_trained)) ``` ``` ## [1] 2049 60 ``` ] .pull-right[ Verify that `Sale_Price` is log-transformed (but not centred and scaled) ```r head(juice(mod_rec_trained)$Sale_Price) head(ames_train$Sale_Price) head(ames_test_prep$Sale_Price) head(ames_test$Sale_Price) ``` ``` ## [1] 11.57 11.39 11.70 11.74 11.12 11.63 ``` ``` ## [1] 105500 88000 120000 125000 67500 112000 ``` ``` ## [1] 11.56 12.15 12.18 12.16 12.37 12.15 ``` ``` ## [1] 105000 189900 195500 191500 236500 189000 ``` ```r levels(juice(mod_rec_trained)$House_Style) ``` ```r levels(ames_test_prep$House_Style) ``` ``` ## [1] "One_and_Half_Fin" "One_Story" ## [1] "Two_Story" "other" ``` ``` ## [1] "One_and_Half_Fin" "One_Story" ## [1] "Two_Story" "other" ``` ] --- # Putting it all together {rsample} and {recipes} <img src="img/recipes.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> Let's redo the KNN example, with centering and scaling of the x-feature, by combining {rsample}/{caret} with a recipe. .pull-left[ ```r # get the simulated data set.seed(123) # for reproducibility x <- seq(from = 0, to = 2 * pi, length = 500) y <- sin(x) + rnorm(length(x), sd = 0.3) df <- data.frame(x, y) %>% filter(x < 4.5) ``` ```r # specify the recipe library(recipes) rec <- recipe(y ~ x, data = df) rec <- rec %>% step_center(all_predictors()) %>% step_scale(all_predictors()) ``` ] .pull-right[ ```r # doing this on complete data set df rec_df <- prep(rec, training = df) mean(juice(rec_df)$x) # centered! ## [1] 1.473e-16 sd(juice(rec_df)$x) # scaled! ## [1] 1 ``` ```r # now we combine the recipe with rsample steps library(rsample) set.seed(123) # for reproducibility cv_rsample <- vfold_cv(df, 5) ``` ```r # we apply the steps in the recipe to each fold library(purrr) cv_rsample$recipes <- map(cv_rsample$splits, prepper, recipe = rec) # check `?prepper` ``` ] --- # Putting it all together {rsample} and {recipes} <img src="img/recipes.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> Let's redo the KNN example, with centering and scaling of the x-feature, by combining {rsample}/{caret} with a recipe. .pull-left[ Now you can inspect `cv_rsample` as follows ```r cv_rsample$recipes[[1]] juice(cv_rsample$recipes[[1]]) bake(cv_rsample$recipes[[1]], new_data = assessment(cv_rsample$splits[[1]])) ``` ] .pull-right[ ```r holdout_results <- function(s, rec, k_val) { # Fit the model to the analysis data in split s df_train <- juice(rec) mod <- knnreg(y ~ x, k = k_val, data = df_train) # Get the remaining group holdout <- bake(rec, new_data = assessment(s)) # Get predictions with the holdout data set res <- predict(mod, newdata = holdout) # Return observed and predicted values # on holdout set res <- tibble(obs = holdout$y, pred = res) res } ``` ```r res <- holdout_results(cv_rsample$splits[[2]], cv_rsample$recipes[[2]], k_val = 58) sqrt(sum((res$obs - res$pred)^2)/nrow(res)) ## [1] 0.3505 ``` ] --- # Putting it all together {rsample} and {recipes} <img src="img/recipes.png" class="title-hex"> <img src="img/rsample.png" class="title-hex"> Let's redo the KNN example, with centering and scaling of the x-feature, by combining {rsample}/{caret} with a recipe. 
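The tuning loop below assumes a `hyper_grid` with one candidate value of `k` per row, as set up earlier in the module for the KNN example; a minimal sketch of such a grid (the exact values are an assumption here):

```r
# candidate numbers of neighbours to evaluate (illustrative values)
hyper_grid <- data.frame(k = seq(2, 150, by = 4))
nrow(hyper_grid)
```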
```r RMSE <- numeric(nrow(hyper_grid)) SE <- numeric(nrow(hyper_grid)) for(i in 1:nrow(hyper_grid)){ cv_rsample$results <- map2(cv_rsample$splits, cv_rsample$recipes, holdout_results, hyper_grid[i, ]) res <- map_dbl(cv_rsample$results, function(x) mean((x$obs - x$pred)^2)) RMSE[i] <- mean(sqrt(res)) ; SE[i] <- sd(sqrt(res)) } ``` --- class: inverse, center, middle name: regression # Regression models in R and <br> tidy model output with {broom} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: models-in-R # Creating models in R The **formula** interface using R's [formula rules](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Formulae-for-statistical-models) to specify a *symbolic* representation of the terms: * response ~ variable, with `model_fn` referring to the specific model function you want to use, e.g. `lm` for linear regression ```r model_fn(Sale_Price ~ Gr_Liv_Area, data = ames) ``` * response ~ variable_1 + variable_2 ```r model_fn(Sale_Price ~ Gr_Liv_Area + Neighborhood, data = ames) ``` * response ~ variable_1 + variable_2 + their interaction ```r model_fn(Sale_Price ~ Gr_Liv_Area + Neighborhood + Neighborhood:Gr_Liv_Area, data = ames) ``` * shorthand for all predictors ```r model_fn(Sale_Price ~ ., data = ames) ``` --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will now fit some linear regression models on the `ames` housing data. <br> <br> You will explore the model fits with `base` R instructions as well as the functionalities offered by the {broom} package. <br> <br> .hi-pink[Q]: load the `ames` housing data set via `ames <- AmesHousing::make_ames()` 1. Fit a linear regression model with `Sale_Price` as response and `Gr_Liv_Area` as covariate. Store the resulting object as `model_1`. 2. Repeat your instruction, but now put it between brackets. What happens? 3. Inspect `model_1` with the following set of instructions - `summary(___)` - extract the fitted coefficients, using `___$coefficients` - what happens with `summary(___)$coefficients`? - extract fitted values, using `___$fitted.values` - now try to extract the R<sup>2</sup> of this model. ] --- class: clear .hi-pink[Q.1] Linear model with `Sale_Price` as a function of `Gr_Live_Area` ```r model_1 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames) ``` .hi-pink[Q.3] Check `model_1` - What happens - do you *like* this display? ```r summary(model_1) ``` Now let's extract some meaningful information from `model_1` (using `base` R instructions) .pull-left[ ```r model_1$coefficients ``` ``` ## (Intercept) Gr_Liv_Area ## 13289.6 111.7 ``` ```r summary(model_1)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 13289.6 3269.703 4.064 4.941e-05 ## Gr_Liv_Area 111.7 2.066 54.061 0.000e+00 ``` ] .pull-right[ ```r head(model_1$fitted.values) ``` ``` ## 1 2 3 4 5 6 ## 198255 113367 161731 248964 195239 192447 ``` ```r summary(model_1)$r.squared ``` ``` ## [1] 0.4995 ``` ] --- # Tidy model output <img src="img/broom.png" class="title-hex"> The package {broom} allows to summarize key information about statistical objects (e.g. a linear regression model) in so-called tidy tibbles. This makes it easy to report results, create plots and consistently work with large numbers of models at once. 
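As a quick sketch of that last point (assuming the `ames` data from the exercise above is still in memory), you could fit one linear model per `House_Style` and stack all tidied coefficient tables into a single tibble:

```r
library(dplyr)
library(purrr)
library(broom)

# one linear model per house style, all coefficient tables in one tibble
ames %>%
  split(.$House_Style) %>%
  map_dfr(~ lm(Sale_Price ~ Gr_Liv_Area, data = .x) %>% tidy(),
          .id = "House_Style")
```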
We briefly illustrate the three essential verbs of {broom}: `tidy()`, `glance()` and `augment()`. ```r model_1 %>% broom::tidy() ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 13289.6 </td> <td style="text-align:right;"> 3269.703 </td> <td style="text-align:right;"> 4.064 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Gr_Liv_Area </td> <td style="text-align:right;"> 111.7 </td> <td style="text-align:right;"> 2.066 </td> <td style="text-align:right;"> 54.061 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ```r model_1 %>% broom::glance() ``` <table> <thead> <tr> <th style="text-align:right;"> r.squared </th> <th style="text-align:right;"> adj.r.squared </th> <th style="text-align:right;"> sigma </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> logLik </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> BIC </th> <th style="text-align:right;"> deviance </th> <th style="text-align:right;"> df.residual </th> <th style="text-align:right;"> nobs </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.4995 </td> <td style="text-align:right;"> 0.4994 </td> <td style="text-align:right;"> 56524 </td> <td style="text-align:right;"> 2923 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -36218 </td> <td style="text-align:right;"> 72442 </td> <td style="text-align:right;"> 72460 </td> <td style="text-align:right;"> 9.355e+12 </td> <td style="text-align:right;"> 2928 </td> <td style="text-align:right;"> 2930 </td> </tr> </tbody> </table> --- # Tidy model output <img src="img/broom.png" class="title-hex"> The package {broom} allows to summarize key information about statistical objects (e.g. a linear regression model) in so-called tidy tibbles. This makes it easy to report results, create plots and consistently work with large numbers of models at once. We briefly illustrate the three essential verbs of `broom`: `tidy()`, `glance()` and `augment()`. 
```r model_1 %>% broom::augment() %>% slice(1:5) ``` <table> <thead> <tr> <th style="text-align:right;"> Sale_Price </th> <th style="text-align:right;"> Gr_Liv_Area </th> <th style="text-align:right;"> .fitted </th> <th style="text-align:right;"> .resid </th> <th style="text-align:right;"> .hat </th> <th style="text-align:right;"> .sigma </th> <th style="text-align:right;"> .cooksd </th> <th style="text-align:right;"> .std.resid </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 215000 </td> <td style="text-align:right;"> 1656 </td> <td style="text-align:right;"> 198255 </td> <td style="text-align:right;"> 16745 </td> <td style="text-align:right;"> 4e-04 </td> <td style="text-align:right;"> 56533 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.2963 </td> </tr> <tr> <td style="text-align:right;"> 105000 </td> <td style="text-align:right;"> 896 </td> <td style="text-align:right;"> 113367 </td> <td style="text-align:right;"> -8367 </td> <td style="text-align:right;"> 8e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.1481 </td> </tr> <tr> <td style="text-align:right;"> 172000 </td> <td style="text-align:right;"> 1329 </td> <td style="text-align:right;"> 161731 </td> <td style="text-align:right;"> 10269 </td> <td style="text-align:right;"> 4e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.1817 </td> </tr> <tr> <td style="text-align:right;"> 244000 </td> <td style="text-align:right;"> 2110 </td> <td style="text-align:right;"> 248964 </td> <td style="text-align:right;"> -4964 </td> <td style="text-align:right;"> 8e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.0879 </td> </tr> <tr> <td style="text-align:right;"> 189900 </td> <td style="text-align:right;"> 1629 </td> <td style="text-align:right;"> 195239 </td> <td style="text-align:right;"> -5339 </td> <td style="text-align:right;"> 4e-04 </td> <td style="text-align:right;"> 56534 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.0945 </td> </tr> </tbody> </table> --- class: clear .pull-left[ ```r g_lm_1 <- ggplot(data = ames, aes(Gr_Liv_Area, Sale_Price)) + theme_bw() + geom_point(size = 1, alpha = 0.3) + geom_smooth(se = TRUE, method = "lm") + #scale_y_continuous(labels = scales::dollar) + ggtitle("Regression with AMES housing data") g_lm_1 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-162-1.svg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ ```r g_lm_2 <- model_1 %>% broom::augment() %>% ggplot(aes(Gr_Liv_Area, Sale_Price)) + theme_bw() + geom_point(size = 1, alpha = 0.3) + geom_line(aes(y = .fitted), col = KULbg) + #scale_y_continuous(labels = scales::dollar) + ggtitle("Regression with AMES housing data") g_lm_2 ``` <img src="ML_part1_files/figure-html/unnamed-chunk-164-1.svg" width="70%" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle # Generalized Linear Models <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Linear and Generalized Linear Models .pull-left-alt[ .center[ <img src="img/esbjorn_GLM.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/de_jong_GLM.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ With .hi-pink[linear regression models] `lm(.)` - model specification 
`$$\begin{eqnarray*} \color{#FFA500}{Y} = \color{#e64173}{x}^{'}\color{#20B2AA}{\beta} + \epsilon. \end{eqnarray*}$$` - `\(\epsilon\)` is normally distributed with mean 0 and common variance `\(\sigma^2\)`, thus: `\(\color{#FFA500}{Y}\)` is normal with mean `\(\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}\)` and variance `\(\sigma^2\)` With .hi-pink[generalized linear regression models] `glm(.)` - model specification `$$\begin{eqnarray*} g(E[\color{#FFA500}{Y}]) = \color{#e64173}{x}^{'}\color{#20B2AA}{\beta}. \end{eqnarray*}$$` - `\(g(.)\)` is the link function - `\(\color{#FFA500}{Y}\)` follows a distribution from the exponential family. ] --- name: mtpl # Motor Third Party Liability data <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> We will use the Motor Third Party Liability data set. There are 163,231 policyholders in this data set. The frequency of claiming (`nclaims`) and corresponding severity (`avg`, the amount paid on average per claim reported by a policyholder) are the .KULbginline[target variables] in this data set. Predictor variables are: * the exposure-to-risk, the duration of the insurance coverage (max. 1 year) * factor variables, e.g. gender, coverage, fuel * continuous, numeric variables, e.g. age of the policyholder, age of the car * spatial information: postal code (in Belgium) of the municipality where the policyholder resides. More details in [Henckaerts et al. (2018, Scandinavian Actuarial Journal)](https://katrienantonio.github.io/projects/2019/06/13/machine-learning/#data-driven) and [Henckaerts et al. (2020, North American Actuarial Journal)](https://katrienantonio.github.io/projects/2019/06/13/machine-learning/#tree-based-pricing). --- name: mtpl # Motor Third Party Liability data <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> You can load the data from the `data` folder as follows: ```r # install.packages("rstudioapi") dir <- dirname(rstudioapi::getActiveDocumentContext()$path) setwd(dir) mtpl_orig <- read.table('../data/PC_data.txt', header = TRUE, stringsAsFactors = TRUE) mtpl_orig <- as_tibble(mtpl_orig) ``` Alternatively, you can also go for: ```r # install.packages("here") dir <- here::here() setwd(dir) mtpl_orig <- read.table('../data/PC_data.txt', header = TRUE, stringsAsFactors = TRUE) mtpl_orig <- as_tibble(mtpl_orig) ``` Some basic exploratory steps with this data follow on the next sheet. 
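Note that a third loading option, assuming the `data` folder sits directly under the project root, is to let {here} build the full path, so that no `setwd()` call is needed at all:

```r
# install.packages("here")
# here::here() locates the project root (e.g. via the .Rproj file or .git folder)
mtpl_orig <- read.table(here::here("data", "PC_data.txt"),
                        header = TRUE, stringsAsFactors = TRUE)
mtpl_orig <- as_tibble(mtpl_orig)
```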
--- name: mtpl # Motor Third Party Liability data <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> Note that the data `mtpl_orig` uses capitals for the variable names ```r mtpl_orig %>% slice(1:3) %>% dplyr::select(-LONG, -LAT) ``` <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> NCLAIMS </th> <th style="text-align:right;"> AMOUNT </th> <th style="text-align:right;"> AVG </th> <th style="text-align:right;"> EXP </th> <th style="text-align:left;"> COVERAGE </th> <th style="text-align:left;"> FUEL </th> <th style="text-align:left;"> USE </th> <th style="text-align:left;"> FLEET </th> <th style="text-align:left;"> SEX </th> <th style="text-align:right;"> AGEPH </th> <th style="text-align:right;"> BM </th> <th style="text-align:right;"> AGEC </th> <th style="text-align:right;"> POWER </th> <th style="text-align:right;"> PC </th> <th style="text-align:left;"> TOWN </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1618 </td> <td style="text-align:right;"> 1618 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> PO </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> </tbody> </table> We change this to lower case variables, and rename `exp` to `expo`. 
```r mtpl <- mtpl_orig %>% rename_all(tolower) %>% rename(expo = exp) names(mtpl) ## [1] "id" "nclaims" "amount" "avg" "expo" "coverage" ## [7] "fuel" "use" "fleet" "sex" "ageph" "bm" ## [13] "agec" "power" "pc" "town" "long" "lat" ``` --- class: clear name: first-steps-MTPL .pull-left[ ```r dim(mtpl) ``` ``` ## [1] 163231 18 ``` ```r mtpl %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.1393 </td> </tr> </tbody> </table> ```r mtpl %>% group_by(sex) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484 </td> </tr> <tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361 </td> </tr> </tbody> </table> ] .pull-right[ ```r g <- ggplot(mtpl, aes(nclaims)) + theme_bw() + geom_bar(aes(weight = expo), alpha = .5, col = KULbg, fill = KULbg) + labs(y = "Abs freq (in exposure)") + ggtitle("MTPL - number of claims") g ``` <img src="ML_part1_files/figure-html/unnamed-chunk-173-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ To get warmed up, let's load the `mtpl` data and do some .KULbginline[basic investigations] into the variables. The idea is to get a feel for the data. .hi-pink[Q]: you will work through the following exploratory steps. 1. Visualize the distribution of the `ageph` with a histogram. 2. For each age recorded in the data set `mtpl`: what is the total number of observations, the total exposure, and the corresponding total number of claims reported? 3. Calculate the empirical claim frequency, per unit of exposure, for each age and picture it. Discuss this figure. 4. Repeat the above for `bm`, the level occupied by the policyholder in the Belgian bonus- malus scale. 
] --- class: clear .pull-left[ .hi-pink[Q.1] a histogram of `ageph` ```r ggplot(data = mtpl, aes(ageph)) + theme_bw() + geom_histogram(binwidth = 2, alpha = .5, col = KULbg, fill = KULbg) + labs(y = "Absolute frequency") + ggtitle("MTPL - age policyholder") ``` <img src="ML_part1_files/figure-html/unnamed-chunk-174-1.svg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ .hi-pink[Q.2] for each `ageph` recorded ```r mtpl %>% group_by(ageph) %>% summarize(tot_claims = sum(nclaims), tot_expo = sum(expo), tot_obs = n()) ``` <table> <thead> <tr> <th style="text-align:right;"> ageph </th> <th style="text-align:right;"> tot_claims </th> <th style="text-align:right;"> tot_expo </th> <th style="text-align:right;"> tot_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4.622 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:right;"> 93.022 </td> <td style="text-align:right;"> 116 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 113 </td> <td style="text-align:right;"> 342.285 </td> <td style="text-align:right;"> 393 </td> </tr> <tr> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 165 </td> <td style="text-align:right;"> 597.389 </td> <td style="text-align:right;"> 701 </td> </tr> <tr> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 202 </td> <td style="text-align:right;"> 778.827 </td> <td style="text-align:right;"> 952 </td> </tr> <tr> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 297 </td> <td style="text-align:right;"> 1165.359 </td> <td style="text-align:right;"> 1379 </td> </tr> <tr> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 426 </td> <td style="text-align:right;"> 1752.249 </td> <td style="text-align:right;"> 2028 </td> </tr> </tbody> </table> ] --- class: clear .pull-left[ .hi-pink[Q.3] for each `ageph` recorded ```r freq_by_age <- mtpl %>% group_by(ageph) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ggplot(data = freq_by_age, aes(x = ageph, y = emp_freq)) + theme_bw() + geom_bar(stat = 'identity', alpha = .5, color = KULbg, fill = KULbg) + ggtitle('MTPL - empirical claim freq per age policyholder') ``` .hi-pink[Q.4] recycle the above instructions and replace `ageph` with `bm` ] .pull-right[ <br> <img src="ML_part1_files/figure-html/unnamed-chunk-178-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- # Generalized Linear Models (GLMs) Modeling claim .hi-pink[frequency] and .hi-pink[severity] in the `mtpl` data set. .pull-left[ Target variable `nclaims` (frequency) <img src="ML_part1_files/figure-html/unnamed-chunk-179-1.svg" width="65%" style="display: block; margin: auto;" /> Suitable distributions: Poisson, Negative Binomial. ] .pull-right[ ... and `avg` (severity). <img src="ML_part1_files/figure-html/unnamed-chunk-180-1.svg" width="65%" style="display: block; margin: auto;" /> Suitable distributions: log-normal, gamma. ] --- # A Poisson GLM .pull-left[ ```r freq_glm_1 <- `glm`(nclaims ~ sex, offset = log(expo), `family = poisson(link = "log")`, `data = mtpl`) ``` ] .pull-right[ Fit a .KULbginline[Poisson GLM], with .KULbginline[logarithmic link] function. 
This implies: `\(\color{#FFA500}{Y}\)` ~ Poisson, with `$$\begin{eqnarray*} \log(E[\color{#FFA500}{Y}]) &=& \color{#e64173}{x}^{'}\color{#20B2AA}{\beta}, \end{eqnarray*}$$` or, `$$E[\color{#FFA500}{Y}] = \exp{(\color{#e64173}{x}^{'}\color{#20B2AA}{\beta})}.$$` Fit this model on `data = mtpl`. ] --- # A Poisson GLM (cont.) .pull-left[ ```r freq_glm_1 <- glm(`nclaims ~ sex`, `offset = log(expo)`, family = poisson(link = "log"), data = mtpl) ``` ] .pull-right[ Use `nclaims` as `\(\color{#FFA500}{Y}\)`. Use `gender` as the only (factor) variable in the linear predictor. Include `log(expo)` as an offset term in the linear predictor. Then, `$$\begin{eqnarray*} \color{#e64173}{x}^{'}\color{#20B2AA}{\beta} = \log{(\texttt{expo})}+\beta_0 + \beta_1 \mathbb{I}(\texttt{male}). \end{eqnarray*}$$` Put otherwise, `$$\begin{eqnarray*} E[\color{#FFA500}{Y}] = \texttt{expo} \cdot \exp{(\beta_0 + \beta_1 \mathbb{I}(\texttt{male}))} \end{eqnarray*},$$` where `\(\texttt{expo}\)` refers to `expo` the exposure variable. ] --- class: clear .pull-left[ ```r freq_glm_1 <- glm(nclaims ~ sex, `offset = log(expo)`, family = poisson(link = "log"), data = mtpl) ``` ```r freq_glm_1 %>% broom::tidy() ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.9076 </td> <td style="text-align:right;"> 0.0133 </td> <td style="text-align:right;"> -143.186 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> sexmale </td> <td style="text-align:right;"> -0.0866 </td> <td style="text-align:right;"> 0.0157 </td> <td style="text-align:right;"> -5.523 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> Mind the specification of `type.predict` when using `augment` with a GLM! ```r freq_glm_1 %>% broom::augment(type.predict = "response") ``` <table> <thead> <tr> <th style="text-align:right;"> nclaims </th> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> .fitted </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361 </td> </tr> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484 </td> </tr> </tbody> </table> ] .pull-right[ The `predict` function of a GLM object offers 3 options: `"link"`, `"response"` or `"terms"`. The same options hold when `augment()` is applied to a GLM object. Let's see how the fitted values at `"response"` level are constructed: ```r exp(coef(freq_glm_1)[1]) ## (Intercept) ## 0.1484 exp(coef(freq_glm_1)[1] + coef(freq_glm_1)[2]) ## (Intercept) ## 0.1361 ``` Do you recognize these numbers? Last step: try `freq_glm_1 %>% glance()` or `summary(freq_glm_1)` for deviances. ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will further explore GLMs in R with the `glm(.)` function. .hi-pink[Q]: continue with the `freq_glm_1` object that was created, you will now explicitly call the `predict()` function on this object. 1. Verify the arguments of `predict.glm` using `? predict.glm`. 2. 
The help reveals the following structure `predict(.object, .newdata, type = ("..."))` where `.object` is the fitted GLM object, `.newdata` is (optionally) a data frame to look for the features used in the model, and `type` is `"link"`, `"response"` or `"terms"`. <br> Use `predict` with `freq_glm_1` and a newly created data frame. <br> Explore the different options for `type`, and their connections. 3. Fit a gamma GLM for `avg` (the claim severity) with log link. <br> Use `sex` as the only variable in the model. What do you conclude? ] --- class: clear .pull-left[ .hi-pink[Q.1] You can access the documentation via `? predict.glm`. .hi-pink[Q.2] You create new data frames (or tibbles) as follows ```r male_driver <- data.frame(expo = 1, sex = "male") female_driver <- data.frame(expo = 1, sex = "female") ``` Next, you apply `predict` with the GLM object `freq_glm_1` and one of these data frames, e.g. ```r predict(freq_glm_1, newdata = male_driver, type = "response") ``` ``` ## 1 ## 0.1361164 ``` ] .pull-right[ .hi-pink[Q.2] Next, you apply `predict` with the GLM object `freq_glm_1` and one of these data frames, e.g. ```r predict(freq_glm_1, newdata = male_driver, type = "response") ``` ``` ## 1 ## 0.1361164 ``` At the level of the linear predictor: ```r predict(freq_glm_1, newdata = male_driver, type = "link") ``` ``` ## 1 ## -1.994245 ``` ```r exp(predict(freq_glm_1, newdata = male_driver, type = "link")) ``` ``` ## 1 ## 0.1361164 ``` ] --- class: clear .hi-pink[Q.3] For the gamma regression model ```r sev_glm_1 <- glm(avg ~ sex, family = Gamma(link = "log"), data = mtpl) sev_glm_1 ``` ``` ## ## Call: glm(formula = avg ~ sex, family = Gamma(link = "log"), data = mtpl) ## ## Coefficients: ## (Intercept) sexmale ## 7.5730 -0.2581 ## ## Degrees of Freedom: 18294 Total (i.e. Null); 18293 Residual ## (144936 observations deleted due to missingness) ## Null Deviance: 46690 ## Residual Deviance: 46440 AIC: 299700 ``` --- class: inverse, center, middle # Generalized Additive Models with {mgcv} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Generalized Additive Models (GAMs) .pull-left-alt[ .center[ <img src="img/GAM_Hastie_Tibshirani.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/GAM_Wood.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ With .hi-pink[GLMs] `glm(.)` - transformation of the mean modelled with a linear predictor `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}$$` - not well suited for continuous risk factors that relate to the response in a non-linear way. With .hi-pink[Generalized Additive Models (GAMs)] - the predictor allows for smooth effects of continuous risk factors and spatial covariates, next to the linear terms, e.g. `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}+\sum_j f_j(x_j) + f(\texttt{lat}, \texttt{long})$$` - predictor is still additive - preferred R package is {mgcv} by Simon Wood. ] --- # A Poisson GAM .pull-left[ We continue working with `mtpl` and now focus on `ageph`. <img src="ML_part1_files/figure-html/unnamed-chunk-201-1.svg" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ We will now explore .hi-pink[four different model specifications]: 1. `ageph` as linear effect in `glm` 2. `ageph` as factor variable in `glm` 3. `ageph` split manually into bins using `cut`, then used as factor in `glm` 4. a smooth effect of `ageph` in `mgcv::gam`. Let's go! 
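The panels on the right of the next four slides show each fitted effect with 95% confidence bounds. A minimal sketch of such a plot for model 1, assuming the age grid `a` (defined right below) and the prediction object `pred_glm_age` (constructed on the next slide) are available:

```r
library(ggplot2)

# collect the fitted effect of ageph and its 95% bounds on the link scale
plot_df <- data.frame(
  ageph = a,
  fit   = as.vector(pred_glm_age$fit),
  lwr   = as.vector(pred_glm_age$fit - qnorm(0.975) * pred_glm_age$se.fit),
  upr   = as.vector(pred_glm_age$fit + qnorm(0.975) * pred_glm_age$se.fit)
)

ggplot(plot_df, aes(x = ageph, y = fit)) + theme_bw() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line() +
  labs(y = "fitted effect of ageph (link scale)")
```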
Grid of observed `ageph` values ```r a <- min(mtpl$ageph):max(mtpl$ageph) ``` ] --- class: clear .pull-left[ .hi-pink[Model 1]: linear effect of `ageph` ```r freq_glm_age <- glm(nclaims ~ `ageph`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age <- predict(freq_glm_age, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age <- pred_glm_age$fit l_glm_age <- pred_glm_age$fit - qnorm(0.975)*pred_glm_age$se.fit u_glm_age <- pred_glm_age$fit + qnorm(0.975)*pred_glm_age$se.fit df <- data.frame(a, b_glm_age, l_glm_age, u_glm_age) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-205-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 2]: `ageph` as factor variable in `glm` ```r freq_glm_age_f <- glm(nclaims ~ `as.factor(ageph)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age_f <- predict(freq_glm_age_f, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age_f <- pred_glm_age_f$fit l_glm_age_f <- pred_glm_age_f$fit - qnorm(0.975)*pred_glm_age_f$se.fit u_glm_age_f <- pred_glm_age_f$fit + qnorm(0.975)*pred_glm_age_f$se.fit df <- data.frame(a, b_glm_age_f, l_glm_age_f, u_glm_age_f) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-208-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 3]: `ageph` split into 5-year bins and then used in `glm` ```r *level <- seq(min(mtpl$ageph), max(mtpl$ageph), by = 5) freq_glm_age_c <- glm(nclaims ~ `cut(ageph, level)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age_c <- predict(freq_glm_age_c, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age_c <- pred_glm_age_c$fit l_glm_age_c <- pred_glm_age_c$fit - qnorm(0.975)*pred_glm_age_c$se.fit u_glm_age_c <- pred_glm_age_c$fit + qnorm(0.975)*pred_glm_age_c$se.fit df <- data.frame(a, b_glm_age_c, l_glm_age_c, u_glm_age_c) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-211-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 4]: smooth effect of `ageph` in `mgcv::gam` ```r library(mgcv) freq_gam_age <- gam(nclaims ~ `s(ageph)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_gam_age <- predict(freq_gam_age, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_gam_age <- pred_gam_age$fit l_gam_age <- pred_gam_age$fit - qnorm(0.975)*pred_gam_age$se.fit u_gam_age <- pred_gam_age$fit + qnorm(0.975)*pred_gam_age$se.fit df <- data.frame(a, b_gam_age, l_gam_age, u_gam_age) ``` ] .pull-right[ <img src="ML_part1_files/figure-html/unnamed-chunk-214-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .hi-pink[Model 4] (revisited): picture smooth effect of `ageph` in `mgcv::gam` with built-in `plot`. ```r library(mgcv) freq_gam <- gam(nclaims ~ s(ageph), offset = log(expo), family = poisson(link = "log"), data = mtpl) plot(freq_gam, scheme = 4) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-215-1.svg" width="35%" height="10%" style="display: block; margin: auto;" /> --- # More on GAMs .pull-left[ So, a GAM is a GLM where the linear predictor depends on .hi-pink[smooth functions] of covariates. 
Consider a GAM with the following predictor: `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}+f_j(x_j).$$` GAMs use .hi-pink[basis functions] to estimate the smooth effect `\(f_j(.)\)` `$$f_j(x_j) = \sum_{m=1}^M \beta_{jm} b_{jm}(x_j),$$` where the `\(b_{jm}(x)\)` are known basis functions and `\(\beta_{jm}\)` are coefficients that have to be estimated. ] .pull-right[ GAMs avoid overfitting by adding a .hi-pink[wiggliness penalty] to the likelihood `$$\int \left( f_j(x)'' \right)^2 = \beta_j^t \mathbf{S}_j\beta_j.$$` GAMs then balance goodness-of-fit and wiggliness via `$$- \log \mathcal{L}(\beta,\beta_j) + \lambda_j \cdot \beta_j^t \mathbf{S}_j\beta_j,$$` with `\(\lambda_j\)` the .hi-pink[smoothing parameter]. The smoothing parameter `\(\lambda_j\)` controls the trade-off between fit & smoothness. ] --- class: clear Let's run some experiments to illustrate the effect of the smoothing parameter ( `sp = .`), the number (`k = .`) and type of basis functions (`bs = .`). We use the `mcycle` data from {MASS}. <img src="ML_part1_files/figure-html/unnamed-chunk-216-1.svg" style="display: block; margin: auto;" /> --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will further explore GAMs in R with the `gam(.)` function from the {mgcv} package. .hi-pink[Q]: you will combine insights from building `glm` as well as `gam` objects by working through the following coding steps. 1. Fit a `gam` including some factor variables as well as a smooth effect of `ageph` and `bm`. Visualize the fitted smooth effects. 2. Specify risk profiles of drivers. Calculate their expected annual claim frequency from the constructed `gam`. 3. Explain (in words) which profiles would represent high vs low risk according to the constructed model. ] --- class: clear .pull-left[ .hi-pink[Q.1] examine the following `gam` fit ```r freq_gam_2 <- gam(nclaims ~ sex + fuel + use + s(ageph) + s(bm), offset = log(expo), data = mtpl, family = poisson(link = "log")) ``` ```r summary(freq_gam_2) ## ## Family: poisson ## Link function: log ## ## Formula: ## nclaims ~ sex + fuel + use + s(ageph) + s(bm) ## ## Parametric coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.917790 0.018124 -105.817 <2e-16 *** ## sexmale 0.009167 0.016043 0.571 0.5677 ## fuelgasoline -0.152730 0.015100 -10.114 <2e-16 *** ## usework -0.055345 0.033090 -1.673 0.0944 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df Chi.sq p-value ## s(ageph) 5.451 6.502 228.8 <2e-16 *** ## s(bm) 8.323 8.778 1227.3 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.0155 Deviance explained = 2.85% ## UBRE = -0.46446 Scale est. 
= 1 n = 163231 ``` ] .pull-right[ ```r plot(freq_gam_2, select = 1) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-219-1.svg" width="45%" style="display: block; margin: auto;" /> ```r plot(freq_gam_2, select = 2) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-220-1.svg" width="45%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Q.2] define some risk profiles ```r drivers <- data.frame(expo = c(1, 1, 1), sex = c("female", "female", "female"), fuel = c("diesel", "diesel", "diesel"), use = c("private", "private", "private"), ageph = c(18, 45, 65), bm = c(20, 5, 0)) drivers ``` <table> <thead> <tr> <th style="text-align:right;"> expo </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> fuel </th> <th style="text-align:left;"> use </th> <th style="text-align:right;"> ageph </th> <th style="text-align:right;"> bm </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ] .pull-right[ Now, you predict the annual expected claim frequency for these profiles. ```r predict(freq_gam_2, newdata = drivers, type = "response") ``` <table> <thead> <tr> <th style="text-align:right;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.4031766 </td> </tr> <tr> <td style="text-align:right;"> 0.1727503 </td> </tr> <tr> <td style="text-align:right;"> 0.0951317 </td> </tr> </tbody> </table> ] --- class: inverse, center, middle # Regularized (G)LMs with {glmnet} <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Statistical learning with sparsity .pull-left-alt[ .center[ <img src="img/sparsity_book.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/boehmke_greenwell.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ Why? * Sort through the mass of information and bring it down to .hi-pink[its bare essentials]. * One form of simplicity is .hi-pink[sparsity]. * Only a relatively small number of predictors play a role. How? .KULbginline[Automatic feature selection!] * Fit a model with all *p* predictors, but constrain or .hi-pink[regularize] the coefficient estimates. * Shrinking the coefficient estimates can significantly reduce their variance. * Some types of shrinkage put some of the coefficients .KULbginline[exactly equal to zero]! 
] --- # Ridge and lasso (least squares) regression .pull-left[ .KULbginline[Ridge] considers the least-squares optimization problem `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 = \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS} \end{eqnarray*}$$` subject to a .hi-pink[budget constraint] `$$\begin{eqnarray*} \sum_{j=1}^p \beta_j^2 \leq \color{#e64173}{t}, \end{eqnarray*}$$` i.e. an `\(\ell_2\)` penalty. Shrinks the coefficient estimates (not the intercept) towards zero. ] .pull-right[ .KULbginline[Lasso] considers the least-squares optimization problem `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 = \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS} \end{eqnarray*}$$` subject to a .hi-pink[budget constraint] `$$\begin{eqnarray*} \sum_{j=1}^p |\beta_j| \leq \color{#e64173}{t}, \end{eqnarray*}$$` i.e. an `\(\ell_1\)` penalty. Shrinks the coefficient estimates (not the intercept) towards zero and sets some of them exactly equal to zero, i.e. it performs variable selection! Lasso is for .hi-pink[L]east .hi-pink[a]bsolute .hi-pink[s]hrinkage and .hi-pink[s]election .hi-pink[o]perator. ] --- # Ridge and lasso (least squares) regression (cont.) .pull-left[ The .KULbginline[dual problem] formulation: * with ridge penalty: `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS} + \color{#e64173}{\lambda} \sum_{j=1}^p \beta_j^2 \end{eqnarray*}$$` * with lasso penalty: `$$\begin{eqnarray*} \min_{\beta_0,\boldsymbol{\beta}} \ \text{RSS}+\color{#e64173}{\lambda} \sum_{j=1}^p |\beta_j|. \end{eqnarray*}$$` `\(\color{#e64173}{\lambda}\)` is a tuning parameter; use resampling methods to pick a value! Both ridge and lasso require .hi-pink[centering and scaling] of the features. ] .pull-right[ .center[ <img src="img/6.7_ISL.png" width="100%" style="display: block; margin: auto;" /> ] Ellipses (around least-squares solution) represent regions of constant RSS. Lasso budget on the left and ridge budget on the right. Source: James et al. (2021) on [An introduction to statistical learning](https://www.statlearning.com/). ] --- # Regularized GLMs .pull-left[ We now focus on generalizations of linear models and the lasso. Minimize `$$\begin{eqnarray*} \min_{\beta_0,\ \beta} -\frac{1}{n} \log \mathcal{L}(\beta_0,\ \beta;\ y,\ X)+\lambda \|\boldsymbol{\beta}\|_1. \end{eqnarray*}$$` Here: * `\(\log \mathcal{L}\)` is the log-likelihood of a GLM. * `\(n\)` is the sample size. * `\(\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^p |\beta_j|\)` is the `\(\ell_1\)` penalty. What happens if: * `\(\lambda \to 0\)`? * `\(\lambda \to \infty\)`? ] .pull-right[ .center[ <img src="img/lasso_path.png" width="100%" style="display: block; margin: auto;" /> ] The R package {glmnet} fits linear, logistic and multinomial, Poisson, and Cox regression models. ] --- # Fit a GLM with lasso regularization in {glmnet} {glmnet} is a package that fits a generalized linear model via penalized maximum likelihood. Main function call (with a selection of arguments, see `? glmnet` for a complete list) ```r fit <- glmnet(x, y, family = ., alpha = ., weights = ., offset = ., nlambda = ., standardize = ., intercept = .) ``` where * `x` is the input matrix and `y` is the response variable * `family` the response type, e.g. `family = poisson` * `weights` optional observation weights and `offset` an optional offset for the linear predictor * `nlambda` is the number of `\(\lambda\)` values, default is 100 * `standardize` should `x` be standardized prior to fitting the model sequence? * `intercept` should an intercept be fitted? 
* `alpha` a value between 0 and 1, such that the penalty becomes `$$\begin{eqnarray*} \lambda P_{\alpha}(\boldsymbol{\beta}) = \lambda \cdot \sum_{j=1}^p \left\{\frac{(1-\alpha)}{2}\beta_j^2 + \alpha |\beta_j|\right\}. \end{eqnarray*}$$` Thus, with `\(\alpha = 1\)` the lasso penalty and `\(\alpha = 0\)` the ridge penalty results. --- # A first example of {glmnet} .pull-left[ Following [the vignette](https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf) we start with penalized linear regression ```r library(glmnet) data(QuickStartExample) ``` This example loads an input matrix `x` and vector `y` of outcomes. The input matrix `x` is not standardized yet (check this!). We calibrate a lasso linear regression model ```r fit <- glmnet(x, y, family = "gaussian", `alpha = 1`, standardize = TRUE, intercept = TRUE) summary(fit) ``` ] .pull-right[ Note that the formula notation `y ~ x` can not be used with `glmnet`. Some `tidy` instructions are available for `glmnet` objects (but not all), e.g. ```r library(broom) tidy(fit) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> step </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> lambda </th> <th style="text-align:right;"> dev.ratio </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.6607581 </td> <td style="text-align:right;"> 1.630762 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.6312350 </td> <td style="text-align:right;"> 1.485890 </td> <td style="text-align:right;"> 0.0552832 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.5874616 </td> <td style="text-align:right;"> 1.353887 </td> <td style="text-align:right;"> 0.1458910 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.5475769 </td> <td style="text-align:right;"> 1.233612 </td> <td style="text-align:right;"> 0.2211153 </td> </tr> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0.5112354 </td> <td style="text-align:right;"> 1.124021 </td> <td style="text-align:right;"> 0.2835678 </td> </tr> </tbody> </table> ] --- class: clear .pull-left[ ```r plot(fit, `label = TRUE`) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-234-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ```r plot(fit, label = TRUE, `xvar = 'lambda'`) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-236-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r plot(fit, `xvar = 'dev'`, label = TRUE) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-238-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ```r print(fit) ## ## Call: glmnet(x = QuickStartExample$x, y = QuickStartExample$y, family = "gaussian", alpha = 1, standardize = TRUE, intercept = TRUE) ## ## Df %Dev Lambda ## 1 0 0.00 1.63100 ## 2 2 5.53 1.48600 ## 3 2 14.59 1.35400 ## 4 2 22.11 1.23400 ## 5 2 28.36 1.12400 ## 6 2 33.54 1.02400 ## 7 4 39.04 0.93320 ## 8 5 45.60 0.85030 ## 9 5 51.54 0.77470 ## 10 6 57.35 0.70590 ## 11 6 62.55 0.64320 ## 12 6 66.87 0.58610 ## 13 6 70.46 0.53400 ## 14 6 73.44 
0.48660 ## 15 7 76.21 0.44330 ## 16 7 78.57 0.40400 ## 17 7 80.53 0.36810 ## 18 7 82.15 0.33540 ## 19 7 83.50 0.30560 ## 20 7 84.62 0.27840 ## 21 7 85.55 0.25370 ## 22 7 86.33 0.23120 ## 23 8 87.06 0.21060 ## 24 8 87.69 0.19190 ## 25 8 88.21 0.17490 ## 26 8 88.65 0.15930 ## 27 8 89.01 0.14520 ## 28 8 89.31 0.13230 ## 29 8 89.56 0.12050 ## 30 8 89.76 0.10980 ## 31 9 89.94 0.10010 ## 32 9 90.10 0.09117 ## 33 9 90.23 0.08307 ## 34 9 90.34 0.07569 ## 35 10 90.43 0.06897 ## 36 11 90.53 0.06284 ## 37 11 90.62 0.05726 ## 38 12 90.70 0.05217 ## 39 15 90.78 0.04754 ## 40 16 90.86 0.04331 ## 41 16 90.93 0.03947 ## 42 16 90.98 0.03596 ## 43 17 91.03 0.03277 ## 44 17 91.07 0.02985 ## 45 18 91.11 0.02720 ## 46 18 91.14 0.02479 ## 47 19 91.17 0.02258 ## 48 19 91.20 0.02058 ## 49 19 91.22 0.01875 ## 50 19 91.24 0.01708 ## 51 19 91.25 0.01557 ## 52 19 91.26 0.01418 ## 53 19 91.27 0.01292 ## 54 19 91.28 0.01178 ## 55 19 91.29 0.01073 ## 56 19 91.29 0.00978 ## 57 19 91.30 0.00891 ## 58 19 91.30 0.00812 ## 59 19 91.31 0.00739 ## 60 19 91.31 0.00674 ## 61 19 91.31 0.00614 ## 62 20 91.31 0.00559 ## 63 20 91.31 0.00510 ## 64 20 91.31 0.00464 ## 65 20 91.32 0.00423 ## 66 20 91.32 0.00386 ## 67 20 91.32 0.00351 ``` ] --- class: clear .pull-left[ Get estimated coefficients for handpicked value ```r coef(fit, `s = 0.1`) ``` ``` ## 21 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) 0.150928072 ## V1 1.320597195 ## V2 . ## V3 0.675110234 ## V4 . ## V5 -0.817411518 ## V6 0.521436671 ## V7 0.004829335 ## V8 0.319415917 ## V9 . ## V10 . ## V11 0.142498519 ## V12 . ## V13 . ## V14 -1.059978702 ## V15 . ## V16 . ## V17 . ## V18 . ## V19 . ## V20 -1.021873704 ``` ] .pull-right[ `glmnet` returns a sequence of models for the users to choose from, i.e. a model for every `lambda`. How do we select the most appropriate model? Use cross-validation to pick a `lambda` value. The default is 10-folds cross-validation. ```r cv_fit <- cv.glmnet(QuickStartExample$x, QuickStartExample$y) ``` We can pick the `lambda` that minimizes the cross-validation error. ```r cv_fit$lambda.min ## [1] 0.07569327 ``` Or we use the one-standard-error-rule. ```r cv_fit$lambda.1se ## [1] 0.1593271 ``` ] --- class: clear We plot the cross-validation error for the inspected grid of `lambda` values. ```r plot(cv_fit) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-247-1.svg" width="45%" height="40%" style="display: block; margin: auto;" /> --- class: clear .pull-left[ For the selected `lambda` (via `cv_fit$lambda.min`) we inspect which parameters are non-zero (on the right). Now, compare this to the selected variables obtained via `cv_fit$lambda.1se`. ] .pull-right[ ```r coef(fit, s = cv_fit$lambda.min) ``` ``` ## 21 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) 0.14867414 ## V1 1.33377821 ## V2 . ## V3 0.69787701 ## V4 . ## V5 -0.83726751 ## V6 0.54334327 ## V7 0.02668633 ## V8 0.33741131 ## V9 . ## V10 . ## V11 0.17105029 ## V12 . ## V13 . ## V14 -1.07552680 ## V15 . ## V16 . ## V17 . ## V18 . ## V19 . ## V20 -1.05278699 ``` ] --- class: clear .pull-left[ The variables `V1`, `V3`, `V5-8`, `V11`, `V14` and `V20` are selected in the regression model. However, the corresponding estimates (on the left) are biased, and shrunk to zero. To remove this bias, we refit the model, only using the selected variables. 
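One way to collect the selected variables programmatically, rather than hard-coding the column picks as in the refit below, is sketched here (the names `coefs`, `idx` and `final_model_alt` are illustrative):

```r
# named vector of coefficients at the chosen lambda; keep the non-zero ones
coefs    <- coef(fit, s = cv_fit$lambda.min)[, 1]
selected <- setdiff(names(coefs)[coefs != 0], "(Intercept)")

# map the selected names back to column positions of the input matrix x
idx   <- match(selected, names(coefs)) - 1
x_sel <- QuickStartExample$x[, idx, drop = FALSE]
colnames(x_sel) <- selected

# unpenalized refit on the selected features only
final_model_alt <- lm(QuickStartExample$y ~ x_sel)
broom::tidy(final_model_alt)
```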
```r attach(QuickStartExample) subset <- data.frame(y = y, V1 = x[, 1], V3 = x[, 3], V5 = x[, 5], V6 = x[, 6], V7 = x[, 7], V8 = x[, 8], V11 = x[, 11], V14 = x[, 14], V20 = x[, 20]) final_model <- lm(y ~ V1 + V3 + V5 + V6 + V7 + V8 + V11 + V14 + V20, data = subset) final_model %>% broom::tidy() ``` What is your judgement about `V7` (see coefficients on the right)? ] .pull-right[ What do you observe when comparing the estimates below with those shown on the previous sheet? <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 0.1416891 </td> <td style="text-align:right;"> 0.0995658 </td> <td style="text-align:right;"> 1.4230704 </td> <td style="text-align:right;"> 0.1581730 </td> </tr> <tr> <td style="text-align:left;"> V1 </td> <td style="text-align:right;"> 1.3746695 </td> <td style="text-align:right;"> 0.0968211 </td> <td style="text-align:right;"> 14.1980421 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V3 </td> <td style="text-align:right;"> 0.7688247 </td> <td style="text-align:right;"> 0.0942568 </td> <td style="text-align:right;"> 8.1567012 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V5 </td> <td style="text-align:right;"> -0.8991610 </td> <td style="text-align:right;"> 0.1033747 </td> <td style="text-align:right;"> -8.6980793 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V6 </td> <td style="text-align:right;"> 0.6115910 </td> <td style="text-align:right;"> 0.0900882 </td> <td style="text-align:right;"> 6.7888025 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V7 </td> <td style="text-align:right;"> 0.0947279 </td> <td style="text-align:right;"> 0.0972959 </td> <td style="text-align:right;"> 0.9736059 </td> <td style="text-align:right;"> 0.3328618 </td> </tr> <tr> <td style="text-align:left;"> V8 </td> <td style="text-align:right;"> 0.3933822 </td> <td style="text-align:right;"> 0.0920456 </td> <td style="text-align:right;"> 4.2737767 </td> <td style="text-align:right;"> 0.0000477 </td> </tr> <tr> <td style="text-align:left;"> V11 </td> <td style="text-align:right;"> 0.2600734 </td> <td style="text-align:right;"> 0.0994215 </td> <td style="text-align:right;"> 2.6158659 </td> <td style="text-align:right;"> 0.0104367 </td> </tr> <tr> <td style="text-align:left;"> V14 </td> <td style="text-align:right;"> -1.1239616 </td> <td style="text-align:right;"> 0.0885267 </td> <td style="text-align:right;"> -12.6963039 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> V20 </td> <td style="text-align:right;"> -1.1491267 </td> <td style="text-align:right;"> 0.1117142 </td> <td style="text-align:right;"> -10.2863111 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> </tbody> </table> ] --- # {glmnet} and the MTPL data set .pull-left[ Next, we fit a .KULbginline[Poisson regression model with lasso penalty] on the `mtpl` data set. The regularization penalty helps us to select the interesting features from the data set. `glmnet` requires the features as input matrix `x` and the target as a vector `y`. Recall: * `mtpl` has .hi-pink[continuous] features (e.g. 
`ageph`, `bm`, `power`) * `mtpl` has .hi-pink[factor] variables with .hi-pink[two levels] (e.g. `sex`, `fleet`) * but also factor variables with .hi-pink[more than 2 levels] (`coverage`) ] .pull-right[ Consider the different ways of coding factor variables. Apply the `contrasts` function to the variable `coverage`. ```r map(mtpl[, c("coverage")], contrasts, contrasts = FALSE) ## $coverage ## FO PO TPL ## FO 1 0 0 ## PO 0 1 0 ## TPL 0 0 1 ``` ```r map(mtpl[, c("coverage")], contrasts, contrasts = TRUE) ## $coverage ## PO TPL ## FO 0 0 ## PO 1 0 ## TPL 0 1 ``` What's the difference? ] --- # {glmnet} and the MTPL data set (cont.) We construct the input matrix for `glmnet`. ```r *y <- mtpl$nclaims x <- model.matrix( ~ coverage + fuel + use + fleet + sex + ageph + bm + agec + power, data = mtpl, contrasts.arg = map(mtpl[, c("coverage")], contrasts, contrasts = FALSE))[,-1] x[1:10,] ``` Put the response or outcome variable in `y`. In the `mtpl` data set we build a Poisson model for `nclaims`. --- # {glmnet} and the MTPL data set (cont.) We construct the input matrix for `glmnet`. ```r y <- mtpl$nclaims x <- `model.matrix`( ~ coverage + fuel + use + fleet + sex + ageph + bm + agec + power, data = mtpl, `contrasts.arg = map(mtpl[, c("coverage")], contrasts,` `contrasts = FALSE)`)[,-1] ``` Use `model.matrix` to create the input matrix `x`. We code the factor variable `coverage` with one-hot encoding. Here, three dummy variables will be created for the three levels of `coverage`. The other factor variables `fuel`, `use`, `fleet` and `sex` are dummy coded, each with a single dummy variable. --- # {glmnet} and the MTPL data set (cont.) We construct the input matrix for `glmnet`. ```r y <- mtpl$nclaims x <- model.matrix( ~ coverage + fuel + use + fleet + sex + ageph + bm + agec + power, data = mtpl, contrasts.arg = map(mtpl[, c("coverage")], contrasts, contrasts = FALSE))`[,-1]` ``` Use `model.matrix` to create the input matrix `x`. We remove the first column, representing the intercept, from the `model.matrix`. --- # {glmnet} and the MTPL data set (cont.) Let's check the input matrix `x` ``` ## coverageFO coveragePO coverageTPL fuelgasoline usework fleetY sexmale ageph ## 1 0 0 1 1 0 0 1 50 ## 2 0 1 0 1 0 0 0 64 ## 3 0 0 1 0 0 0 1 60 ## 4 0 0 1 1 0 0 1 77 ## 5 0 0 1 1 0 0 0 28 ## 6 0 0 1 1 0 0 1 26 ## bm agec power ## 1 5 12 77 ## 2 5 3 66 ## 3 0 10 70 ## 4 0 15 57 ## 5 9 7 70 ## 6 11 12 70 ``` You are now ready to fit a regularized Poisson GLM for `y` with input `x`. Let's go! --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will fit a regularized Poisson GLM on the `mtpl` data with the {glmnet} package. .hi-pink[Q]: using the constructed `y` and `x` 1. Fit a `glmnet` with lasso penalty and store the fitted object in `mtpl_glmnet`. Use the following arguments `family = "poisson", offset = ___`. 2. Display the order of the variables and their names via `row.names(mtpl_glmnet$beta)`. 3. Plot the solution path. Pick a meaningful value for `lambda` via cross-validation (see the hint below). 4. Which variables are selected in the lasso model? As a last step, you will fit a Poisson GLM with the selected variables. What do you see? 5. List some pros and cons of the above strategy.
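A hint for Q.3, mirroring the solution that follows: you can make the cross-validation reproducible by constructing the fold assignment yourself and passing it to `cv.glmnet` via its `foldid` argument. A minimal sketch, assuming the `mtpl` data set is loaded:

```r
set.seed(123)  # reproducible fold assignment
fold_id <- sample(rep(1:10, length.out = nrow(mtpl)), nrow(mtpl))
# then supply foldid = fold_id (and e.g. type.measure = "deviance") in your cv.glmnet() call
```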
] --- class: clear .pull-left[ .hi-pink[Q.1] fit a regularized Poisson GLM ```r alpha <- 1 # for lasso penalty mtpl_glmnet <- glmnet(x = x, y = y, family = "poisson", offset = log(mtpl$expo), alpha = alpha, standardize = TRUE, intercept = TRUE) ``` .hi-pink[Q.2] display the variables via ```r row.names(mtpl_glmnet$beta) ## [1] "coverageFO" "coveragePO" "coverageTPL" "fuelgasoline" "usework" ## [6] "fleetY" "sexmale" "ageph" "bm" "agec" ## [11] "power" ``` ] .pull-right[ .hi-pink[Q.3] plot the solution path ```r plot(mtpl_glmnet, xvar = 'lambda', label = TRUE) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-261-1.svg" width="85%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Q.3] pick a value for `lambda` ```r set.seed(123) fold_id <- sample(rep(1:10, length.out = nrow(mtpl)), nrow(mtpl)) mtpl_glmnet_cv <- cv.glmnet(x, y, family = "poisson", alpha = alpha, nfolds = 10, foldid = fold_id, type.measure = "deviance", standardize = TRUE, intercept = TRUE) plot(mtpl_glmnet_cv) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-262-1.svg" width="45%" style="display: block; margin: auto;" /> ] .pull-right[ ```r coef(mtpl_glmnet_cv, s = "lambda.min") ## 12 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) -2.106680932 ## coverageFO -0.006499730 ## coveragePO . ## coverageTPL 0.050002173 ## fuelgasoline -0.165864612 ## usework -0.069292342 ## fleetY -0.049283838 ## sexmale -0.013718073 ## ageph -0.006347490 ## bm 0.058564280 ## agec -0.002004356 ## power 0.003448081 ``` ] --- class: clear .pull-left[ .hi-pink[Q.3] pick a value for `lambda` ```r set.seed(123) fold_id <- sample(rep(1:10, length.out = nrow(mtpl)), nrow(mtpl)) mtpl_glmnet_cv <- cv.glmnet(x, y, family = "poisson", alpha = alpha, nfolds = 10, foldid = fold_id, type.measure = "deviance", standardize = TRUE, intercept = TRUE) plot(mtpl_glmnet_cv) ``` <img src="ML_part1_files/figure-html/unnamed-chunk-264-1.svg" width="45%" style="display: block; margin: auto;" /> ] .pull-right[ ```r coef(mtpl_glmnet_cv, s = "lambda.1se") ## 12 x 1 sparse Matrix of class "dgCMatrix" ## s1 ## (Intercept) -2.124039910 ## coverageFO . ## coveragePO . ## coverageTPL . ## fuelgasoline . ## usework . ## fleetY . ## sexmale . ## ageph -0.002916928 ## bm 0.046163778 ## agec . ## power .
``` ] --- class: clear .pull-left[ .hi-pink[Q.4] refit the models using only the selected features ```r mtpl$coverage <- relevel(mtpl$coverage, "PO") mtpl_formula_refit <- nclaims ~ 1 + coverage + fuel + use + fleet + sex + ageph + bm + agec + power mtpl_glm_refit <- glm(mtpl_formula_refit, data = mtpl, offset = log(mtpl$expo), family = poisson()) ``` ] .pull-right[ The selection obtained via `lambda.min` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.9892872 </td> <td style="text-align:right;"> 0.0401325 </td> <td style="text-align:right;"> -49.5679730 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> coverageFO </td> <td style="text-align:right;"> 0.0044293 </td> <td style="text-align:right;"> 0.0244274 </td> <td style="text-align:right;"> 0.1813238 </td> <td style="text-align:right;"> 0.8561134 </td> </tr> <tr> <td style="text-align:left;"> coverageTPL </td> <td style="text-align:right;"> 0.0743796 </td> <td style="text-align:right;"> 0.0172363 </td> <td style="text-align:right;"> 4.3152799 </td> <td style="text-align:right;"> 0.0000159 </td> </tr> <tr> <td style="text-align:left;"> fuelgasoline </td> <td style="text-align:right;"> -0.1731052 </td> <td style="text-align:right;"> 0.0153266 </td> <td style="text-align:right;"> -11.2944557 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> usework </td> <td style="text-align:right;"> -0.0862841 </td> <td style="text-align:right;"> 0.0334470 </td> <td style="text-align:right;"> -2.5797233 </td> <td style="text-align:right;"> 0.0098880 </td> </tr> <tr> <td style="text-align:left;"> fleetY </td> <td style="text-align:right;"> -0.1226498 </td> <td style="text-align:right;"> 0.0435289 </td> <td style="text-align:right;"> -2.8176618 </td> <td style="text-align:right;"> 0.0048375 </td> </tr> <tr> <td style="text-align:left;"> sexmale </td> <td style="text-align:right;"> -0.0253198 </td> <td style="text-align:right;"> 0.0162468 </td> <td style="text-align:right;"> -1.5584505 </td> <td style="text-align:right;"> 0.1191265 </td> </tr> <tr> <td style="text-align:left;"> ageph </td> <td style="text-align:right;"> -0.0074262 </td> <td style="text-align:right;"> 0.0005391 </td> <td style="text-align:right;"> -13.7764864 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> bm </td> <td style="text-align:right;"> 0.0639249 </td> <td style="text-align:right;"> 0.0017328 </td> <td style="text-align:right;"> 36.8902457 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> agec </td> <td style="text-align:right;"> -0.0004698 </td> <td style="text-align:right;"> 0.0019368 </td> <td style="text-align:right;"> -0.2425874 </td> <td style="text-align:right;"> 0.8083251 </td> </tr> <tr> <td style="text-align:left;"> power </td> <td style="text-align:right;"> 0.0038535 </td> <td style="text-align:right;"> 0.0003799 </td> <td style="text-align:right;"> 10.1421096 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> </tbody> </table> ] --- class: clear .pull-left[ .hi-pink[Q.4] refit the models using only the selected features ```r mtpl_formula_refit_2 <- nclaims ~ 1 + ageph + bm 
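# lambda.1se retains only ageph and bm, hence the much shorter formula above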
mtpl_glm_refit_2 <- glm(mtpl_formula_refit_2, data = mtpl, offset = log(mtpl$expo), family = poisson()) ``` ] .pull-right[ The selection obtained via `lambda.1se` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.8251292 </td> <td style="text-align:right;"> 0.0282345 </td> <td style="text-align:right;"> -64.64189 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> ageph </td> <td style="text-align:right;"> -0.0083839 </td> <td style="text-align:right;"> 0.0005274 </td> <td style="text-align:right;"> -15.89605 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> bm </td> <td style="text-align:right;"> 0.0625774 </td> <td style="text-align:right;"> 0.0017141 </td> <td style="text-align:right;"> 36.50764 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ] --- name: wrap-up # Thanks! <img src="img/xaringan.png" class="title-hex"> <br> <br> <br> <br> Slides created with the R package [xaringan](https://github.com/yihui/xaringan). <br> <br> <br> Course material available via <br>
https://github.com/katrienantonio/hands-on-machine-learning-R-module-1