class: center, middle, inverse, title-slide # Loss modelling, reserving and fraud analytics in R ## A hands-on workshop
### Katrien Antonio & Jonas Crevecoeur ###
IA|BE workshop
| June, 2021 --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: introduction # Introduction ### Course <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> https://github.com/katrienantonio/workshop-loss-reserv-fraud The course repo on GitHub, where you can find the data sets, lecture sheets, R scripts and R markdown files. 
-- ### Us <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> [https://katrienantonio.github.io/](https://katrienantonio.github.io/) <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> [katrien.antonio@kuleuven.be](mailto:katrien.antonio@kuleuven.be) & [jonas.crevecoeur@kuleuven.be](mailto:jonas.crevecoeur@kuleuven.be) <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 640 512"><path d="M622.34 153.2L343.4 67.5c-15.2-4.67-31.6-4.67-46.79 0L17.66 153.2c-23.54 7.23-23.54 38.36 0 45.59l48.63 14.94c-10.67 13.19-17.23 29.28-17.88 46.9C38.78 266.15 32 276.11 32 288c0 10.78 5.68 19.85 13.86 25.65L20.33 428.53C18.11 438.52 25.71 448 35.94 448h56.11c10.24 0 17.84-9.48 15.62-19.47L82.14 313.65C90.32 307.85 96 298.78 96 288c0-11.57-6.47-21.25-15.66-26.87.76-15.02 8.44-28.3 20.69-36.72L296.6 284.5c9.06 2.78 26.44 6.25 46.79 0l278.95-85.7c23.55-7.24 23.55-38.36 0-45.6zM352.79 315.09c-28.53 8.76-52.84 3.92-65.59 0l-145.02-44.55L128 384c0 35.35 85.96 64 192 64s192-28.65 192-64l-14.18-113.47-145.03 44.56z"/></svg> (Katrien, PhD) Professor in insurance data science at KU Leuven and University of Amsterdam <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 640 512"><path d="M622.34 153.2L343.4 67.5c-15.2-4.67-31.6-4.67-46.79 0L17.66 153.2c-23.54 7.23-23.54 38.36 0 45.59l48.63 14.94c-10.67 13.19-17.23 29.28-17.88 46.9C38.78 266.15 32 276.11 32 288c0 10.78 5.68 19.85 13.86 25.65L20.33 428.53C18.11 438.52 25.71 448 35.94 448h56.11c10.24 0 17.84-9.48 15.62-19.47L82.14 313.65C90.32 307.85 96 298.78 96 288c0-11.57-6.47-21.25-15.66-26.87.76-15.02 8.44-28.3 20.69-36.72L296.6 284.5c9.06 2.78 26.44 6.25 46.79 0l278.95-85.7c23.55-7.24 23.55-38.36 0-45.6zM352.79 315.09c-28.53 8.76-52.84 3.92-65.59 0l-145.02-44.55L128 384c0 35.35 85.96 64 192 64s192-28.65 192-64l-14.18-113.47-145.03 44.56z"/></svg> (Jonas, PhD) Post-doctoral researcher in biostatistics at KU Leuven --- name: checklist # Checklist ☑ Do you have a 
fairly recent version of R? ```r version$version.string ## [1] "R version 4.0.3 (2020-10-10)" ``` ☑ Do you have a fairly recent version of RStudio? ```r RStudio.Version()$version ## Requires an interactive session but should return something like "[1] ‘1.3.1093’" ``` ☑ Have you installed the R packages listed in the software requirements? or ☑ Have you created an account on RStudio Cloud (to avoid any local installation issues)? --- name: why-this-course # inspired by Grant McDermott intro lecture # Why this training? ### The goals of this training .font140[<svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 512 512"><path d="M505.05 19.1a15.89 15.89 0 0 0-12.2-12.2C460.65 0 435.46 0 410.36 0c-103.2 0-165.1 55.2-211.29 128H94.87A48 48 0 0 0 52 154.49l-49.42 98.8A24 24 0 0 0 24.07 288h103.77l-22.47 22.47a32 32 0 0 0 0 45.25l50.9 50.91a32 32 0 0 0 45.26 0L224 384.16V488a24 24 0 0 0 34.7 21.49l98.7-49.39a47.91 47.91 0 0 0 26.5-42.9V312.79c72.59-46.3 128-108.4 128-211.09.1-25.2.1-50.4-6.85-82.6zM384 168a40 40 0 1 1 40-40 40 40 0 0 1-40 40z"/></svg>] -- * develop practical .KULbginline[data handling foundations], e.g. going from granular claims development data to aggregated figures -- * visualize and explore data -- * cover essential actuarial modelling tasks, including loss modelling for pricing, claims reserving and building a fraud detection model -- * learn by doing, get you started (in particular when you have limited experience in R). -- <br> > *"In short, we will cover things that we wish someone had taught us in our undergraduate programs."* > <br> > .font80[This quote is from the [Data science for economists course](http://github.com/uo-ec607/lectures) by Grant McDermott.] --- name: universe # Why R and RStudio? ### Data science positivism - Next to Python, R has become the *de facto* language for data science, with a cutting edge *machine learning toolbox*. - See: [The Popularity of Data Science Software](http://r4stats.com/articles/popularity/) - R is open-source with a very active community of users spanning academia and industry. -- ### Bridge to actuarial science, econometrics and other tools - R has all of the statistics and econometrics support, and is amazingly adaptable as a “glue” language to other programming languages and APIs. - R does not try to be everything to everyone. The RStudio IDE and ecosystem allow for further, seemless integration (with e.g. python, keras, tensorflow or C). - Widely used in actuarial undergraduate programs -- ### Disclaimer + Read more - It's also the language that we know best. - If you want to read more: [R-vs-Python](https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/), [when to use Python or R](https://www.datacamp.com/community/blog/when-to-use-python-or-r) or [Hadley Wickham on the future of R](https://qz.com/1661487/hadley-wickham-on-the-future-of-r-python-and-the-tidyverse/) --- class: clear, center, middle background-image: url("image/tidyverse2.1.png") background-size: cover background-size: 65% background-position: center --- name: tidyverse # Welcome to the tidyverse! ><p align="justify">The .KULbginline[tidyverse] is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. </p> <center> <img src="image/tidyverse_wide.png" width="750"/> </center> More on: [tidyverse](https://www.tidyverse.org). Install the packages with `install.packages("tidyverse")`. 
Then run `library(tidyverse)` to load the core tidyverse. --- # Today's Outline .pull-left[ * [Prologue](#prologue) * [What's out there: the R universe](#universe) (see the prework) - why R and RStudio? - welcome to the tidyverse! - principles of tidy data - workflow of a data scientist * [Data wrangling and visualisation](#tidyverse) (see the prework) - tibbles - pipe operator `%>%` - {dplyr} instructions - {ggplot2} for data visualisation - what else is there? ] .pull-right[ * [Data sets used in the session](#data-sets) - MTPL data on frequency and severity of claims - Secura Re losses * [Fitting frequency models in R](#frequency) - frequency - severity approach - the `\((a,b,0)\)` class - MLE, model selection and evaluation tools - excess zeroes * [Fitting severity models in R](#severity) - from simple to complex parametric distributions - be aware of truncation and censoring - a global fit via splicing - case-study on the Secura Re losses - what else is there? ] --- class: inverse, center, middle name: data-sets # Data sets used in the session <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Data sets used in this session - MTPL <img src="image/pipe.png" class="title-hex"> <img src="image/dplyr.png" class="title-hex"> <img src="image/ggplot2.png" class="title-hex"> We illustrate some first data handling steps on the Motor Third Party Liability data set. There are 163 231 policyholders in this data set. The frequency of claiming (`nclaims`) and corresponding severity (`avg`, the amount paid on average per claim reported by a policyholder) are the .KULbginline[target variables] in this data set. Predictor variables are: * the exposure-to-risk, the duration of the insurance coverage (max. 1 year) * factor variables, e.g. gender, coverage, fuel * continuous, numeric variables, e.g. age of the policyholder, age of the car * spatial information: postal code (in Belgium) of the municipality where the policyholder resides. More details in [Henckaerts et al. (2018, Scandinavian Actuarial Journal)](https://katrienantonio.github.io/portfolio/machine-learning) and [Henckaerts et al. (2020, North American Actuarial Journal)](https://katrienantonio.github.io/portfolio/machine-learning). --- # Data sets used in this session - MTPL <img src="image/pipe.png" class="title-hex"> <img src="image/dplyr.png" class="title-hex"> <img src="image/ggplot2.png" class="title-hex"> You can load the data from a .R script in the course material: ```r # install.packages("rstudioapi") dir <- dirname(rstudioapi::getActiveDocumentContext()$path) setwd(dir) mtpl_orig <- read.table('./data/PC_data.txt', header = TRUE) mtpl_orig <- as_tibble(mtpl_orig) ``` If you work in an R notebook or R markdown file, you can also go for: ```r # install.packages("here") library(here) dir <- here::here() setwd(dir) mtpl_orig <- read.table('./data/PC_data.txt', header = TRUE) mtpl_orig <- as_tibble(mtpl_orig) ``` The last instruction transforms `mtpl_orig` into a `tibble`. --- class: clear These instructions are recommended because they .KULbginline[avoid referring to a specific working directory on your computer], e.g. ```r setwd("C:\\Users\\u0043788\\Dropbox\\Arcturus course\\data") mtpl_orig <- read.table('PC_data.txt', header = TRUE) ``` -- If you organize your data analysis or project in a folder on your computer that holds all relevant files, then the above instructions allow you (and your colleagues) to get started right away. The only thing you need is an organized file structure. 
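For instance, a sketch of ours (assuming the data sits in the `data/` subfolder of the project) that skips `setwd()` altogether and builds the path relative to the project root:

```r
# install.packages("here")
library(tidyverse)  # for as_tibble()
library(here)       # here() points to the top level of your project

mtpl_orig <- read.table(here("data", "PC_data.txt"), header = TRUE)
mtpl_orig <- as_tibble(mtpl_orig)
```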
-- The {rstudioapi} package is developed for RStudio. The {here} library is more general. Do note that when working in a .Rmd, {here} will use the folder that markdown lives in as the working directory, but if you are working in a script (.R) the working directory is the top level of the project file. --- class: clear name: explore-data First of all, we explore the structure of `mtpl_orig` with `str()`. ```r str(mtpl_orig) ## tibble [163,231 x 18] (S3: tbl_df/tbl/data.frame) ## $ ID : int [1:163231] 1 2 3 4 5 6 7 8 9 10 ... ## $ NCLAIMS : int [1:163231] 1 0 0 0 1 0 1 0 0 0 ... ## $ AMOUNT : num [1:163231] 1618 0 0 0 156 ... ## $ AVG : num [1:163231] 1618 NA NA NA 156 ... ## $ EXP : num [1:163231] 1 1 1 1 0.0466 ... ## $ COVERAGE: chr [1:163231] "TPL" "PO" "TPL" "TPL" ... ## $ FUEL : chr [1:163231] "gasoline" "gasoline" "diesel" "gasoline" ... ## $ USE : chr [1:163231] "private" "private" "private" "private" ... ## $ FLEET : chr [1:163231] "N" "N" "N" "N" ... ## $ SEX : chr [1:163231] "male" "female" "male" "male" ... ## $ AGEPH : int [1:163231] 50 64 60 77 28 26 26 58 59 34 ... ## $ BM : int [1:163231] 5 5 0 0 9 11 11 11 0 7 ... ## $ AGEC : int [1:163231] 12 3 10 15 7 12 8 14 3 6 ... ## $ POWER : int [1:163231] 77 66 70 57 70 70 55 47 98 74 ... ## $ PC : int [1:163231] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ... ## $ TOWN : chr [1:163231] "BRUSSEL" "BRUSSEL" "BRUSSEL" "BRUSSEL" ... ## $ LONG : num [1:163231] 4.36 4.36 4.36 4.36 4.36 ... ## $ LAT : num [1:163231] 50.8 50.8 50.8 50.8 50.8 ... ``` --- class: clear name: explore-data Or with `head()`... ```r head(mtpl_orig) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> NCLAIMS </th> <th style="text-align:right;"> AMOUNT </th> <th style="text-align:right;"> AVG </th> <th style="text-align:right;"> EXP </th> <th style="text-align:left;"> COVERAGE </th> <th style="text-align:left;"> FUEL </th> <th style="text-align:left;"> USE </th> <th style="text-align:left;"> FLEET </th> <th style="text-align:left;"> SEX </th> <th style="text-align:right;"> AGEPH </th> <th style="text-align:right;"> BM </th> <th style="text-align:right;"> AGEC </th> <th style="text-align:right;"> POWER </th> <th style="text-align:right;"> PC </th> <th style="text-align:left;"> TOWN </th> <th style="text-align:right;"> LONG </th> <th style="text-align:right;"> LAT </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1618.0010 </td> <td style="text-align:right;"> 1618.0010 </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> <td style="text-align:right;"> 4.355223 </td> <td style="text-align:right;"> 50.84539 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:left;"> PO </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td 
style="text-align:left;"> N </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> <td style="text-align:right;"> 4.355223 </td> <td style="text-align:right;"> 50.84539 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> <td style="text-align:right;"> 4.355223 </td> <td style="text-align:right;"> 50.84539 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> <td style="text-align:right;"> 4.355223 </td> <td style="text-align:right;"> 50.84539 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 155.9746 </td> <td style="text-align:right;"> 155.9746 </td> <td style="text-align:right;"> 0.0465753 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 28 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> <td style="text-align:right;"> 4.355223 </td> <td style="text-align:right;"> 50.84539 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> <td style="text-align:right;"> 4.355223 </td> <td style="text-align:right;"> 50.84539 </td> </tr> </tbody> </table> --- class: clear name: data-sets-used Note that 
the data `mtpl_orig` uses capitals for the variable names ```r mtpl_orig %>% slice(1:3) %>% select(-LONG, -LAT) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> NCLAIMS </th> <th style="text-align:right;"> AMOUNT </th> <th style="text-align:right;"> AVG </th> <th style="text-align:right;"> EXP </th> <th style="text-align:left;"> COVERAGE </th> <th style="text-align:left;"> FUEL </th> <th style="text-align:left;"> USE </th> <th style="text-align:left;"> FLEET </th> <th style="text-align:left;"> SEX </th> <th style="text-align:right;"> AGEPH </th> <th style="text-align:right;"> BM </th> <th style="text-align:right;"> AGEC </th> <th style="text-align:right;"> POWER </th> <th style="text-align:right;"> PC </th> <th style="text-align:left;"> TOWN </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1618.001 </td> <td style="text-align:right;"> 1618.001 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> PO </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> </tbody> </table> We change this to lower case variables, and rename `exp` to `expo`. ```r mtpl <- mtpl_orig %>% # rename all columns rename_all(function(.name) { .name %>% # replace all names with the lowercase versions tolower }) mtpl <- rename(mtpl, expo = exp) ``` Check `rename_all()` in {dplyr}. --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ To get warmed up, let's load the `mtpl` data and do some .KULbginline[basic investigations] into the variables. 
The idea is to get a feel for the data. Your starting point are the instructions in the R script on the course website. .hi-pink[Q]: you will work through the following exploratory steps. 1. Calculate the empirical claim frequency, per unit of exposure. Then do the same per gender. What do you conclude? 2. Visualize the distribution of `nclaims` with a bar plot, use `geom_bar` as a layer in `ggplot`. Try different versions of this bar plot. 3. Visualize the distribution of `avg` with a density plot, use `geom_density` as a layer. Restrict the range of values displayed in the plot. 4. Visualize the distribution of `avg` with a histogram, use `geom_histogram`. Play with the `binwidth`. ] --- class: clear name: first-steps-MTPL .pull-left[ ```r dim(mtpl) ``` ``` ## [1] 163231 18 ``` ```r mtpl %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.1393352 </td> </tr> </tbody> </table> ```r mtpl %>% group_by(sex) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484325 </td> </tr> <tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361164 </td> </tr> </tbody> </table> ] .pull-right[ ```r g <- ggplot(mtpl, aes(nclaims)) + theme_bw() + geom_bar(aes(weight = expo), col = KULbg, fill = KULbg, alpha = 0.5) + labs(y = "Abs freq (in exposure)") + ggtitle("MTPL - number of claims") g ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-15-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r g <- ggplot(mtpl, aes(nclaims)) + theme_bw() + geom_bar(aes(weight = expo), col = KULbg, fill = KULbg, alpha = 0.5) + labs(y = "Abs freq (in exposure)") + ggtitle("MTPL - number of claims") g ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-16-1.svg" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ ```r g <- ggplot(mtpl, aes(nclaims)) + theme_bw() g + geom_bar(aes(y = (..count..)/sum(..count..)), col = KULbg, fill = KULbg, alpha = 0.5) + labs(y = "Relative frequency") + ggtitle("MTPL - relative number of claims") ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-18-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ With a density plot: mind the `filter(.)` instruction and the use of `xlim(.,.)` ```r g_dens <- mtpl %>% filter(avg > 0 & avg <= 81000) %>% ggplot(aes(x = avg)) + theme_bw() + geom_density(adjust = 3, col = KULbg, fill = KULbg, alpha = 0.5) + xlim(0, 1e4) + ylab("") + xlab("severity") + ggtitle("MTPL data - average amount per claim") g_dens ``` ] .pull-right[ <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-20-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ With a histogram: ```r g <- mtpl %>% filter(avg > 0 & avg <= 81000) %>% ggplot(aes(x = avg)) + theme_bw() + xlab("severity") + geom_histogram(binwidth = 100, col = KULbg, fill = KULbg, alpha = 0.5) + xlim(0, 1e4) + labs(y = "Frequency histogram") g ``` ] .pull-right[ <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-22-1.svg" width="80%" style="display: block; margin: 
auto;" /> ] --- # Data sets used in this session - Secura Re losses .pull-left[ Secura Belgian Re automobile claims from 1988 until 2001, gathered from several European insurance companies, .KULbginline[exceeding 1 200 000 Euro]. The data were, among other things, corrected for inflation; see the {ReIns} library. ```r secura <- read.table(file="./data/SecuraRe.txt", header = TRUE, sep = "\t") ``` The density plots on the right show typical features of insurance loss data: - positive - skewed to the right - heavy right tail - (possibly) subject to truncation and censoring. ] .pull-right[ <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" /> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will now use the instructions covered in the exploration of the `mtpl` data to picture the `secura` losses. .hi-pink[Q]: you will work through the following exploratory steps. 1. Visualize the distribution of `Loss` in `secura` with a density plot, use `geom_density` as a layer. 2. Do the same for `log(Loss)`. 3. Plot the `Loss` versus `Year`. Use the `geom_point` layer. ] --- class: clear The `secura` losses over time... .pull-left[ ```r ggplot(secura, aes(Year, Loss)) + theme_bw() + geom_point(colour = KULbg, size = 2, alpha = 0.5) + ggtitle('Secura Re - losses arriving over time') ``` ] .pull-right[ <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-26-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle name: frequency # Fitting frequency models in R --- name: freq-sev approach # The frequency - severity approach .pull-left[ <center> <img src="image/freq-sev.png" width="600"/> </center> .KULbginline[Assumptions:] * claim severity and claim frequency are independent * we ignore any additional covariates (more on that next week). ] .pull-right[ Each record `\(i\)` in the data set registers * .KULbginline[exposure], `\(e_i\)`, fraction of the year in which the contract was active * .KULbginline[number of claims], `\(N_i\)`, observed claim count in the period of exposure * .KULbginline[average claim size] per reported claim (or: severity), `\(S_i\)`, total losses in the period of exposure divided by the number of claims, if `\(N_i>0\)`. ] --- name: freq-sev approach # The (a, b, 0) class .pull-left[ Let `\(N\)` be a count or frequency random variable, with probability function (pf) `\(Pr(N = k) = p_k\)`. The `\((a,b,0)\)`-class of frequency distributions is a two-parameter class of distributions that satisfy `$$k \cdot \frac{p_k}{p_{k-1}} = a \cdot k + b \quad k \geq 1.$$` The probability `\(p_0\)` then follows from `\(\sum_{k=0}^{+\infty} p_k = 1\)`. Members of this class are the: * Poisson * Binomial * Negative Binomial * Geometric distribution.
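For example, for the Poisson distribution with mean `\(\lambda\)` we have `\(p_k = e^{-\lambda}\lambda^k/k!\)`, so that `$$k \cdot \frac{p_k}{p_{k-1}} = k \cdot \frac{\lambda^k/k!}{\lambda^{k-1}/(k-1)!} = \lambda,$$` i.e. the Poisson is the member with `\(a = 0\)` and `\(b = \lambda\)`.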
] .pull-right[ <center> <img src="image/LossModels.jpg" width="300"/> </center> ] --- class: clear Relevant claim frequency distributions in the `\((a,b,0)\)` class: <br/> Member | slope `\(a\)` | Probability function `\(p_k\)` | Parameter(s) | Moments :-------------------------|:-------------------------|:-------------------------|:-------------------------|:------------------------- Binomial | `\(a < 0\)` | `\(\frac{n!}{k!(n-k)!} \cdot p^k \cdot (1-p)^{n-k}\)` | `\(p \in (0, 1)\)` | `\(E(N) = n \cdot p\)` | | | `\(n \in \mathbb{N}\)` | `\(\text{Var}(N) = n \cdot p \cdot (1-p)\)` Poisson | `\(a = 0\)` | `\(e^{-\lambda} \cdot \frac{\lambda^k}{k!}\)` | `\(\lambda \in (0, \infty)\)` | `\(E(N) = \lambda\)` | | | | `\(\text{Var}(N) = \lambda\)` Negative Binomial | `\(a > 0\)` | `\(\frac{\Gamma(r+k)}{k! \Gamma(r)} \cdot \left( \frac{r}{r+\mu}\right)^r \cdot \left( \frac{\mu}{r+\mu} \right)^k\)` | `\(\mu \in (0, \infty)\)` | `\(E(N) = \mu\)` | | | `\(r \in (0, \infty)\)` | `\(\text{Var}(N) = \mu + \frac{\mu^2}{r}\)` --- # The (a, b, 0) class: a visual check .pull-left[ To check the `\((a,b,0)\)` relation for `mtpl$nclaims` we start from the empirical count distribution ```r mtpl$nclaims %>% table %>% prop.table ``` We store the empirical probabilities `\(\hat{p}_k\)` in a vector ```r empirical <- mtpl$nclaims %>% table %>% prop.table %>% as.numeric ``` and the `\(k \cdot \frac{p_k}{p_{k-1}}\)` ```r k <- 1:(length(empirical) - 1) ab0_relation <- empirical[k+1] / empirical[k] * k ``` ```r ab0_data <- tibble(k = k, ab0_rel = ab0_relation) ``` ] .pull-right[ The positive slope when fitting `\(a \cdot k + b\)` indicates a NegBin distribution for `mtpl$nclaims`, see ```r ab0_data %>% lm(ab0_rel ~ k, data = .) ``` and visually <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" /> ] --- # Empirical mean and variance of the claim frequency .pull-left[ Let's explore different ways to calculate the .KULbginline[empirical mean claim frequency]. Inspect the following instructions and list pros and cons: ```r mean(mtpl$nclaims) sum(mtpl$nclaims)/sum(mtpl$expo) weighted.mean(mtpl$nclaims/mtpl$expo, mtpl$expo) mtpl %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` ``` ## [1] 0.1239715 ## [1] 0.1393352 ## [1] 0.1393352 ``` <table> <thead> <tr> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.1393352 </td> </tr> </tbody> </table> ] .pull-right[ What about the .KULbginline[empirical variance]? .KULbginline[Overdispersion]! ```r m <- sum(mtpl$nclaims)/sum(mtpl$expo) m ## [1] 0.1393352 var <- sum((mtpl$nclaims - m * mtpl$expo)^2)/ sum(mtpl$expo) var ## [1] 0.1517246 ``` Here we use the expression for the variance: $$\text{Var}(X) = \text{E}[(X-\text{E}X)^2] $$ We calculate the outer expectation empirically by weighting with the exposure, and we compare each realization of `nclaims` with its expected value, i.e. `m * expo`, where `m` is the empirical mean claim frequency.
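A quick extra check (our own addition, reusing the `m` and `var` objects computed above) is the variance-to-mean ratio:

```r
# dispersion (variance-to-mean) ratio; a value above 1 signals overdispersion,
# i.e. the Poisson assumption (mean = variance) is too restrictive
var / m   # approximately 1.09 for the MTPL data
```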
] --- # Maximum likelihood estimation (MLE) .pull-left[ Fit a distribution to the data by .KULbginline[maximizing the likelihood] over the unknown parameter vector `\(\theta\)` $$ \mathcal{L}(\theta) = \prod_{i} Pr(N_i = n_i \mid \theta).$$ In practice, we minimize the negative log-likelihood `$$L(\theta) = - \sum_{i} \log(Pr(N_i = n_i \mid \theta)).$$` If exposure is available, we'll typically incorporate the exposure measure in the mean of the distribution `$$L(\theta) = - \sum_{i} \log(Pr(N_i = n_i \mid \theta,\ e_i)).$$` ] .pull-right[ Let's make this more concrete for the Poisson distribution: `$$N_i \sim \texttt{Poi}(e_i \cdot \lambda).$$` The corresponding log-likelihood becomes: `$$\sum_i n_i \cdot \log(e_i \cdot \lambda) - e_i \cdot \lambda - \log(n_i!).$$` How about .KULbginline[fitting a parametric count distribution] to data in R? ] --- # Maximum likelihood estimation (MLE) in R .pull-left[ A (naive) first solution uses `fitdistr(.)` from the {MASS} library ```r library(MASS) fitdistr(mtpl$nclaims, "poisson") ## lambda ## 0.1239715495 ## (0.0008714846) ``` Which number do you recognize here for `\(\hat{\lambda}\)`? Alternatively, we can use the `glm(.)` function in a smart way: ```r freq_glm_poi <- `glm`(nclaims ~ 1, `family = poisson(link = "log")`, `data = mtpl`) ``` ] .pull-right[ Fit a .KULbginline[Poisson GLM], with .KULbginline[logarithmic link] function. This implies: `\(\color{#FFA500}{Y}\)` ~ Poisson, with only an intercept in the linear predictor `$$\begin{eqnarray*} \log(E[\color{#FFA500}{Y}]) &=& \color{#20B2AA}{\beta_0}, \end{eqnarray*}$$` or, `$$E[\color{#FFA500}{Y}] = \lambda = \exp{(\color{#20B2AA}{\beta_0})}.$$` Fit this model on `data = mtpl`. ] --- class: clear .pull-left[ ```r freq_glm_poi <- glm(`nclaims ~ 1`, `offset = log(expo)`, family = poisson(link = "log"), data = mtpl) freq_glm_poi %>% broom::tidy() ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.970873 </td> <td style="text-align:right;"> 0.0070297 </td> <td style="text-align:right;"> -280.3632 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> What is your estimate for the expected annual claim frequency? ] .pull-right[ Use `nclaims` as `\(\color{#FFA500}{Y}\)`. Use only an intercept in the linear predictor `~ 1`. Include `log(expo)` as an offset term in the linear predictor. Then, `$$\begin{eqnarray*} \log{(\texttt{expo})}+\beta_0. \end{eqnarray*}$$` Put otherwise, `$$\begin{eqnarray*} E[\color{#FFA500}{Y}] = \texttt{expo} \cdot \exp{(\beta_0)} \end{eqnarray*},$$` where `\(\texttt{expo}\)` refers to `expo` the exposure variable. ] --- # Maximum likelihood estimation (MLE) in R (cont.) .pull-left[ .KULbginline[Writing a function] for the negative log-likelihood is another option. The general framework for a one-parameter distribution then becomes ```r neg_loglik_distr <- function(par, freq, expo){ -sum(ddistr(freq, par, log = T)) } nlm(neg_loglik_distr, 1, hessian = TRUE, freq = mtpl$nclaims, expo = mtpl$expo) ``` Tweaking the general framework to the Poisson setting then becomes ... 
] .pull-right[ ```r neg_loglik_pois <- function(par, freq, expo){ lambda <- expo*exp(par) -sum(dpois(freq, lambda, log = T)) } sol_poi <- nlm(neg_loglik_pois, 1, hessian = TRUE, freq = mtpl$nclaims, expo = mtpl$expo) ``` Inspect the results ```r exp(sol_poi$estimate) ## [1] 0.1393351 sqrt(diag(solve(sol_poi$hessian))) ## [1] 0.007029369 sol_poi$minimum ## [1] 63837.82 ``` Do you recognize these numbers? ] --- class: clear .pull-left[ We now focus on fitting the Negative Binomial distribution to the `mtpl$nclaims` frequency data, using `glm.nb()` from the {MASS} library. ```r library(MASS) freq_glm_nb <- glm.nb(`nclaims ~ 1 +` `offset(log(expo))`, link = log, data = mtpl) freq_glm_nb %>% broom::tidy() ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.968155 </td> <td style="text-align:right;"> 0.0073391 </td> <td style="text-align:right;"> -268.174 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> The additional parameter `\(r\)` is estimated as (point estimate and standard error) ``` ## [1] 1.435807 ## [1] 0.08039722 ``` ] .pull-right[ Inspect the model output via ```r summary(freq_glm_nb) ## ## Call: ## glm.nb(formula = nclaims ~ 1 + offset(log(expo)), data = mtpl, ## link = log, init.theta = 1.435806632) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.5164 -0.5164 -0.5164 -0.4263 4.8511 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.968155 0.007339 -268.2 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for Negative Binomial(1.4358) family taken to be 1) ## ## Null deviance: 78254 on 163230 degrees of freedom ## Residual deviance: 78254 on 163230 degrees of freedom ## AIC: 127192 ## ## Number of Fisher Scoring iterations: 1 ## ## ## Theta: 1.4358 ## Std. Err.: 0.0804 ## ## 2 x log-likelihood: -127187.8520 ``` ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .hi-pink[Q]: you will work through the following steps 1. Starting from the handcrafted Poisson likelihood and its numerical optimization, can you write similar instructions for the Negative Binomial? 2. Can you match the estimates obtained with your routine with those of the `glm.nb()` call? 
] --- class: clear .pull-left[ Let's now focus on the handcrafted Negative Binomial likelihood optimization: ```r neg_loglik_nb <- function(par, freq, expo){ mu <- expo*exp(par[1]) r <- exp(par[2]) -sum(dnbinom(freq, size = r, mu = mu, log = TRUE)) } sol_nb <- nlm(neg_loglik_nb, c(1, 1), hessian = TRUE, freq = mtpl$nclaims, expo = mtpl$expo) ``` Inspect the resulting estimates ```r sol_nb$estimate ## [1] -1.9681557 0.3617181 ``` ] .pull-right[ We inspect the results ```r exp(sol_nb$estimate) ## [1] 0.1397143 1.4357941 sqrt(diag(solve(sol_nb$hessian))) ## [1] 0.007349782 0.056009902 sol_nb$minimum ## [1] 63593.93 ``` The variance estimated by this Negative Binomial model is then ```r exp(sol_nb$estimate[1]) + (exp(sol_nb$estimate[1])^2)/(exp(sol_nb$estimate[2])) ## [1] 0.1533096 ``` which is very close to the empirical variance! ] --- # Model selection and evaluation tools The .KULbginline[AIC] can be used to compare models `$$AIC = 2 \cdot \#\text{param} - 2 \cdot \text{log-likelihood}.$$` ```r AIC_poi <- 2*length(sol_poi$estimate) + 2*sol_poi$minimum AIC_nb <- 2*length(sol_nb$estimate) + 2*sol_nb$minimum c(AIC_nb = AIC_nb, AIC_poi = AIC_poi) ## AIC_nb AIC_poi ## 127191.9 127677.6 ``` Smaller is better, so based on the AIC the Negative Binomial is the preferred distribution! --- class: clear In addition, we compare the .KULbginline[observed and fitted claim count] distribution. For example, with the fitted Negative Binomial distribution ```r observed_count = rep(0, 5) model_based_count = rep(0, 5) for(i in 1:5) { observed_count[i] = sum(mtpl$nclaims == i-1) model_based_count[i] = sum(dnbinom(i-1, size = freq_glm_nb$theta, mu = fitted(freq_glm_nb))) } data.frame(frequency = 0:4, observed_count = observed_count, model_based_count = round(model_based_count)) ## frequency observed_count model_based_count ## 1 0 144936 145014 ## 2 1 16556 16347 ## 3 2 1558 1685 ## 4 3 162 167 ## 5 4 17 16 ``` How would you explain the code (in your own words)? Can you adjust the code to the Poisson distribution? --- # Excess zeroes in claim count data Standard count distributions often have difficulty capturing the large number of zeroes in claim frequency data. We propose two strategies to model an .KULbginline[excessive number of zeroes]. With a .KULbginline[Zero-Inflated (ZI)] distribution: $$ P(N^{ZI} = k) = `\begin{cases} \pi + (1-\pi) \cdot P(N=0) & k = 0 \\ (1-\pi) \cdot P(N = k) & k \gt 0. \end{cases}` $$ Zero-inflated count models are two-component mixture models combining a point mass at zero with a proper count distribution. Thus, there are two sources of zeroes: zeroes may come from both the point mass and from the count component. With a .KULbginline[Hurdle] distribution: $$ P(N^{hurdle} = k) = `\begin{cases} \pi & k = 0 \\ (1-\pi)\cdot \frac{P(N=k)}{1-P(N = 0)} & k \gt 0. \end{cases}` $$ How can we fit these distributions to given data with R? --- class: clear .pull-left[ The [library {pscl}](https://github.com/atahk/pscl) enables maximum likelihood estimation of zero-inflated and hurdle models for count data. ```r library(pscl) f_ZIP <- pscl::zeroinfl(nclaims ~ 1, offset = log(expo), dist = "poisson", data = mtpl) summary(f_ZIP) ## ## Call: ## pscl::zeroinfl(formula = nclaims ~ 1, data = mtpl, offset = log(expo), ## dist = "poisson") ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -0.3584 -0.3584 -0.3584 -0.2976 36.1142 ## ## Count model coefficients (poisson with log link): ## Estimate Std.
Error z value Pr(>|z|) ## (Intercept) -1.48286 0.02226 -66.63 <2e-16 *** ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.46932 0.05483 -8.56 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Number of iterations in BFGS optimization: 11 ## Log-likelihood: -6.362e+04 on 2 Df ``` ] .pull-right[ The `dist = .` available in `zeroinfl(.)` are the Poisson, Negative Binomial and the geometric distribution. The `offset = .` is an a priori known component to be included in the linear predictor of the count model (as we did before). Usually the count model is a Poisson or negative binomial regression (with log link). A binary model is used that captures the probability of zero inflation. In the simplest case only with an intercept but potentially containing regressors. For this zero-inflation model, a binomial model with different links can be used, typically logit or probit. ] --- class: clear More detailed inspection of the output stored in `f_ZIP` ```r f_ZIP$coefficients ## $count ## (Intercept) ## -1.482859 ## ## $zero ## (Intercept) ## -0.4693167 (f_ZIP_lambda <- exp(as.numeric(f_ZIP$coefficients[1]))) ## [1] 0.2269878 (f_ZIP_pi <- exp(as.numeric(f_ZIP$coefficients[2]))/(1+exp(as.numeric(f_ZIP$coefficients[2])))) ## [1] 0.384778 (AIC_ZIP <- 2*length(f_ZIP$coefficients) - 2*f_ZIP$loglik) ## [1] 127241.3 ``` The mean and variance (for `expo == 1`) as captured by the ZIP ```r (ZIP_mean <- (1-f_ZIP_pi)*f_ZIP_lambda) ## [1] 0.1396479 (ZIP_var <- (1-f_ZIP_pi)*f_ZIP_lambda*(1+f_ZIP_pi*f_ZIP_lambda)) ## [1] 0.1518447 ``` --- class: clear A handcrafted MLE with the .KULbginline[ZIP distribution] can be put together as ```r neg_loglik_ZIP <- function(par, freq, expo){ lambda <- expo*exp(par[1]) p <- exp(par[2])/(1+exp(par[2])) -sum((freq == 0) * (log(p + (1-p)*dpois(0, lambda, log = FALSE)))) - sum((freq != 0) * (log((1-p)) + dpois(freq, lambda, log = TRUE))) } sol_ZIP <- nlm(neg_loglik_ZIP, c(1, 1), hessian = TRUE, freq = mtpl$nclaims, expo = mtpl$expo) sol_ZIP$estimate ## [1] -1.4829209 -0.4694681 ``` and their corresponding standard errors ```r (sol_ZIP_se <- sqrt(diag(solve(sol_ZIP$hessian)))) ## [1] 0.02224688 0.05480811 ``` --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ .hi-pink[Q]: as a final challenge 1. Adjust the code of the ZIP to fit the hurdle Poisson model, use `pscl::hurdle()`. 2. Starting from the handcrafted ZIP likelihood, write code for the hurdle Poisson model. ] --- class: inverse, center, middle name: severity # Fitting severity models in R --- name: freq-sev approach # From simple to complex parametric distributions .pull-left[ Famous examples of .KULbginline[1 and 2 parameter distributions] used in loss modelling: - Exponential - Lognormal - Gamma and Inverse Gaussian. Some less straightforward .KULbginline[more flexible parametric distributions]: - Burr (3 parameters) - GB2 (4 parameters). Typically challenging to fit, picking meaningful starting values in numerical optimization routines matters! 
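As a small illustration of this point (our own sketch, not part of the course scripts): moment-based starting values for a simple Gamma fit to the positive MTPL severities, optimized with `nlm()`.

```r
# sketch: fit a Gamma distribution to the positive average claim amounts via MLE
sev <- mtpl$avg[!is.na(mtpl$avg) & mtpl$avg > 0]

neg_loglik_gamma <- function(par, y){
  # optimize on the log scale to keep shape and rate positive
  shape <- exp(par[1]); rate <- exp(par[2])
  -sum(dgamma(y, shape = shape, rate = rate, log = TRUE))
}

# method-of-moments starting values: shape = mean^2/var, rate = mean/var
m_sev <- mean(sev); s2_sev <- stats::var(sev)
start <- log(c(m_sev^2 / s2_sev, m_sev / s2_sev))

sol_gamma <- nlm(neg_loglik_gamma, start, hessian = TRUE, y = sev)
exp(sol_gamma$estimate)   # fitted (shape, rate)
```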
] .pull-right[ <center> <img src="image/LossModels.jpg" width="300"/> </center> ] --- class: clear <br/> <br/> Distribution | Density | Mean | R | :-------------------------|:-------------------------|:-------------------------|:-------------------------| Exponential | `\(f(y) = \displaystyle \lambda e^{-\lambda y}\)` | `\(\text{E}[Y] = \displaystyle 1/\lambda\)` | `dexp` ... Gamma | `\(f(y) = \displaystyle \frac{1}{\Gamma(\alpha)}\beta^{\alpha}y^{\alpha-1}e^{-\beta y}\)` | `\(\text{E}[Y] = \displaystyle \alpha/\beta\)` | `dgamma` ... Inverse Gaussian | `\(f(y) = \displaystyle \left(\frac{\lambda}{2\pi y^3}\right)^{1/2}\exp{\left[\frac{-\lambda(y-\mu)^2}{2\mu^2y}\right]}\)` | `\(\text{E}[Y] = \mu\)` | `dinvgauss` package {statmod} Lognormal | `\(f(y) = \displaystyle \frac{1}{\sqrt{2\pi} \sigma y}\exp{\left[-\frac{1}{2} \left(\frac{\log{y}-\mu}{\sigma}\right)^2\right]}\)` | `\(\text{E}[Y] = \displaystyle \exp{\left(\mu+\frac{1}{2}\sigma^2\right)}\)` | `dlnorm` ... The {actuar} package has some functions related to the Burr and GB2 distributions, e.g. `dburr` and `dgenbeta`. Mind the .KULbginline[different parametrizations] of these distributions! --- # Be aware of truncation! In `secura` only losses .KULbginline[above 1.2M EUR] are registered. We can picture this as follows: <center> <img src="image/trunc_pic.png" width="500"/> </center> Lower/left truncation: deductible, reinsurance. Upper/right truncation: earthquake magnitudes. How would you tackle this left-truncation in `secura` when fitting a loss distribution to the data? --- # Be aware of truncation and censoring! <center> <img src="image/cens_trunc_pic.png" width="900"/> </center> --- # Global fitting distributions .pull-left[ How to find or construct a .KULbginline[global fit] of insurance losses? That's .KULbginline[hard], because: - different behaviour of attritional and large losses (body vs tail). Typically, no standard parametric distribution provides a suitable global fit: - Lognormal, Weibull, etc. underestimate tail risk - Pareto, Generalised Pareto Distribution (GPD), etc. can only be used for large losses - complex distributions (e.g. GB2). ] .pull-right[ .KULbginline[Splicing (or composite modeling)] offers a useful strategy to construct a global fit. Intuitively: - different components on different intervals of losses - glue these components together into a well-defined density. This technique is popular e.g. in modelling operational risk data. ] --- # Splicing: a body-tail example .pull-left[ Combine two distributions (one for the .KULbginline[body] and one for the .KULbginline[tail]) in a splicing model: `$$f(x;\boldsymbol{\Theta}) = \begin{cases} 0 & \quad \text{if }x< t^l \\ \pi \frac{f_1^*(x;t,\boldsymbol{\Theta}_1)}{F_1^*(t;\boldsymbol{\Theta}_1)-F_1^*(t^l;\boldsymbol{\Theta}_1)} & \quad \text{if }t^l \leq x\leq t \\ (1-\pi) \frac{f_2^*(x;t,\boldsymbol{\Theta}_2)}{1-F_2^*(t;\boldsymbol{\Theta}_2)} &\quad \text{if } x > t, \end{cases}$$` where `\(t^l\)` is the .KULbginline[left-truncation point] and `\(t\)` the .KULbginline[split point] (body vs. tail). Questions for you: - which of the above should be estimated from the data? - why the conditional densities, constructed from the original pdfs `\(f_1^*(.)\)` and `\(f_2^*(.)\)` on the positive real line?
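To make the spliced density concrete, here is a minimal sketch of ours (a generic helper, not a {ReIns} function); `dbody`/`pbody` and `dtail`/`ptail` are placeholders for the density and CDF of the body and tail components, and `prob` plays the role of `\(\pi\)`.

```r
# generic two-component spliced density, following the formula above
dsplice_generic <- function(x, prob, tl, t, dbody, pbody, dtail, ptail){
  dens <- numeric(length(x))          # zero below the left-truncation point tl
  in_body <- (x >= tl) & (x <= t)
  in_tail <- x > t
  dens[in_body] <- prob * dbody(x[in_body]) / (pbody(t) - pbody(tl))
  dens[in_tail] <- (1 - prob) * dtail(x[in_tail]) / (1 - ptail(t))
  dens
}
```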
] .pull-right[ Some examples of body-tail fits: Body | Tail :-------------------------|:-------------------------| Exponential | Pareto Lognormal | Pareto Weibull | Pareto Mixture of two exponentials | Generalised Pareto ] --- # A body-tail fit for the Secura Re losses Let's combine an .KULbginline[exponential] (body - below threshold `\(t\)`) and a .KULbginline[Pareto] (tail - above `\(t\)`) `$$\begin{eqnarray*} f(x) &=& \begin{cases} 0 & x \leq t^l \\ \pi \cdot \color{blue}{\frac{f_1^{\star}(x)}{F_1^{\star}(t)-F_1^{\star}(t^l)}} & t^l \leq x \leq t \\ (1-\pi) \cdot \color{orange}{\frac{f_2^{\star}(x)}{1-F_2^{\star}(t)}} & x > t \end{cases}, \end{eqnarray*}$$` with - `\(f_1^{\star}\)` and `\(F_1^{\star}\)` the PDF and CDF of an exponential distribution `$$\begin{eqnarray*} \frac{f_1^{\star}(x)}{F_1^{\star}(t)-F_1^{\star}(t^l)} &=& \frac{\lambda \exp\{-\lambda x\}}{\exp\{-\lambda \cdot t^l\}-\exp\{-\lambda \cdot t\}}=\frac{\lambda \exp\{-\lambda (x-t^l)\}}{1-\exp\{-\lambda(t-t^l)\}}, \end{eqnarray*}$$` - `\(f_2^{\star}\)` and `\(F_2^{\star}\)` the PDF and CDF of a Pareto distribution with unit scale `$$\begin{eqnarray*} \frac{f_2^{\star}(x)}{1-F_2^{\star}(t)} &=& \frac{\frac1{\gamma} x^{-\frac1{\gamma}-1}}{t^{-\frac1{\gamma}}}. \end{eqnarray*}$$` --- # Choice of threshold .pull-left[ Techniques from .KULbginline[extreme value theory (EVT)] are necessary to estimate the .KULbginline[split point] above which a Pareto distribution is plausible. A first visual tool is the .KULbginline[mean excess plot]: - the mean excess function or mean residual life function is `$$e(t) = E(X-t|X>t)$$` - empirically, using a sample `\((x_1,\ldots,x_n)\)` `$$\hat{e}_n(t) = \frac{\sum_{i=1}^n x_i \boldsymbol{1}_{(t,\infty)}(x_i)}{\sum_{i=1}^n \boldsymbol{1}_{(t,\infty)}(x_i)}-t.$$` - the mean excess plot evaluates the above at values `\(t=x_{n-k,n}\)`, the `\((k+1)\)`th largest observation. ] .pull-right[ The {ReIns} package implements many useful functions from this book: <center> <img src="image/ReIns.jpg" width="300"/> </center> ] --- class: clear .pull-left[ Via {ReIns} we can easily plot the mean excess function: ```r library(ReIns) ReIns::MeanExcess(secura$Loss) ``` By default (`k = FALSE`), the mean excess scores are plotted as a function of the data. With `k = TRUE` you can plot them as a function of the tail parameter `k`. The mean excess function of an exponential (EXP) distribution is constant; hence: - .KULbginline[HTE] ('Heavier than exponential'), when the mean excess function .KULbginline[ultimately increases] - .KULbginline[LTE] ('Lighter than exponential'), when the mean excess function .KULbginline[ultimately decreases]. ] .pull-right[ <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-59-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ A second visual tool is a QQplot, e.g. the .KULbginline[Exponential QQplot]: - recall `\(F_{\lambda}(x) = 1-\exp{(-\lambda x)}\)` with `\(x>0\)` - then `\(Q_{\lambda}(p) = -\frac{1}{\lambda} \log{(1-p)}\)` for `\(p \in (0,1)\)` - use `\(p_{i,n} := i/(n+1)\)` and plot `$$(-\log{(1-p_{i,n})},x_{i,n}).$$` ] .pull-right[ ```r ReIns::ExpQQ(secura$Loss) ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-61-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ A second visual tool is a QQplot, e.g.
the .KULbginline[Pareto QQplot]: - recall `\(F_{\gamma}(x) = 1-x^{-1/\gamma}\)` with `\(x>1\)` - then `\(-\log{(1-p)} = \frac{1}{\gamma} \log x\)` for `\(p \in (0,1)\)` - use `\(p_{i,n} := i/(n+1)\)` and plot `$$(-\log{(1-p_{i,n})},\log x_{i,n}).$$` A distribution is of Pareto-type when the Pareto QQplot is ultimately linear near the largest observations. ] .pull-right[ ```r ReIns::ParetoQQ(secura$Loss) ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-63-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- # Choice of threshold - the Hill estimator .pull-left[ We estimate the slope of the Pareto QQplot to the right of the reference point `$$(-\log{(\frac{k+1}{n+1})},\log x_{n-k,n}).$$` This slope can be expressed as `$$H_{k,n} = \frac{1}{k} \sum_{j=1}^k \log X_{n-j+1,n} - \log X_{n-k,n}.$$` When the data has a Pareto-type tail, the Hill estimates stabilize around a constant value. But mind the bias and variance trade-off! ] .pull-right[ ```r H <- ReIns::Hill(secura$Loss, k = FALSE, lwd = 2, plot = TRUE, main = "") ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-65-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- # Choice of threshold - the Hill estimator .pull-left[ We estimate the slope of the Pareto QQplot to the right of the reference point `$$(-\log{(\frac{k+1}{n+1})},\log x_{n-k,n}).$$` This slope can be expressed as `$$H_{k,n} = \frac{1}{k} \sum_{j=1}^k \log X_{n-j+1,n} - \log X_{n-k,n}.$$` When the data has a Pareto-type tail, the Hill estimates stabilize around a constant value. But mind the bias and variance trade-off! ] .pull-right[ ```r H <- ReIns::Hill(secura$Loss, k = TRUE, lwd = 2, plot = TRUE, main = "") ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-67-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ The Hill plot shows many possible estimates for `\(\gamma\)` in the Pareto distribution `\(F(x)=1-x^{-1/\gamma}\)` that we want to use for the tail of the distribution. .KULbginline[How to pick the split point?] We consider the Asymptotic Mean Squared Error of `\(H_{k,n}\)` (i.e. variance + squared bias). We plot `$$\{(k,\widehat{\text{AMSE}}(H_{k,n}));\, k=1,\ldots, n-1\}$$` and pick the `\(k\)` that minimizes the AMSE. ] .pull-right[ ```r H <- ReIns::Hill(secura$Loss, lwd = 2, plot = TRUE, main = "") kopt <- ReIns::Hill.kopt(secura$Loss, plot = FALSE)$kopt abline(v = kopt, lwd = 2, lty = 2) ``` <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-69-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ Using these insights from EVT we arrive at a .KULbginline[final choice] for the split point or threshold `\(t\)`. We pick the optimal `\(k\)` by minimizing the AMSE of the Hill estimator. The corresponding split point or threshold is then `\(t := X_{n-k,n}\)`, the `\((k+1)\)`th largest observation in the sample. This reasoning also gives `\(\hat{\gamma}\)` via the corresponding `\(H_{k,n}\)`.
] .pull-right[ ```r # sample size n <- length(secura$Loss) n ## [1] 371 # Hill estimator kopt <- Hill.kopt(secura$Loss, plot = FALSE)$kopt kopt ## [1] 95 # chosen threshold threshold <- sort(secura$Loss)[n - kopt] threshold ## [1] 2580026 # estimate for gamma using Hill estimator gamma <- H$gamma[kopt] gamma ## [1] 0.2710874 ``` ] --- # Exponential - Pareto splicing model We are now ready to fit the .KULbginline[body-tail] model proposed earlier on: `$$\begin{eqnarray*} f(x) &=& \begin{cases} 0 & x \leq t^l \\ \pi \cdot \color{blue}{\frac{f_1^{\star}(x)}{F_1^{\star}(t)-F_1^{\star}(t^l)}} & t^l \leq x \leq t \\ (1-\pi) \cdot \color{orange}{\frac{f_2^{\star}(x)}{1-F_2^{\star}(t)}} & x > t \end{cases}, \end{eqnarray*}$$` with - `\(f_1^{\star}\)` and `\(F_1^{\star}\)` the PDF and CDF of an exponential distribution `$$\begin{eqnarray*} \frac{f_1^{\star}(x)}{F_1^{\star}(t)-F_1^{\star}(t^l)} &=& \frac{\lambda \exp\{-\lambda x\}}{\exp\{-\lambda \cdot t^l\}-\exp\{-\lambda \cdot t\}}=\frac{\lambda \exp\{-\lambda (x-t^l)\}}{1-\exp\{-\lambda(t-t^l)\}}, \end{eqnarray*}$$` - `\(f_2^{\star}\)` and `\(F_2^{\star}\)` the PDF and CDF of a Pareto distribution with unit scale `$$\begin{eqnarray*} \frac{f_2^{\star}(x)}{1-F_2^{\star}(t)} &=& \frac{\frac1{\gamma} x^{-\frac1{\gamma}-1}}{t^{-\frac1{\gamma}}}. \end{eqnarray*}$$` --- class: clear In fact, we now have meaningful choices/estimates for: - `\(t^l\)` left truncation point at 1.2M EUR - `\(t\)` the split point estimated at `threshold` ```r threshold ## [1] 2580026 ``` - the Hill estimator in `gamma` ```r gamma ## [1] 0.2710874 ``` - the fraction of losses below the split point ```r (p <- sum(secura$Loss <= threshold)/n) ## [1] 0.7439353 ``` or ```r (n-kopt)/n ## [1] 0.7439353 ``` --- class:clear ```r sh <- 1200000 # shift I <- secura$Loss <= threshold # indicator neg.loglik <- function(par) { lambda <- exp(par[1]); alpha <- exp(par[2]) # likelihood L <- I * (n-kopt)/n * lambda * exp(-lambda*(secura$Loss-sh))/(1 - exp(-lambda*(threshold-sh))) + (1-I) * kopt/n * alpha * (secura$Loss)^(-alpha-1)/(threshold)^(-alpha) # negative log-likelihood -sum(log(L)) } m1 <- mean(secura$Loss) par.init <- c(1/(m1-sh), 1/m1) oo <- nlm(neg.loglik, log(par.init)) (lambda <- exp(oo$estimate[1])) ## [1] 6.710413e-07 (alpha <- exp(oo$estimate[2])) ## [1] 3.688812 (gamma <- 1/alpha) ## [1] 0.27109 ``` --- class: clear By working out the integrals starting from the spliced density, we find: for `\(t^l=1,200,000\leq x \leq t\)`: `$$\begin{eqnarray*} F(x) &=& \pi \cdot \frac{1-\exp\{-\lambda(x-\text{1,200,000})\}}{1-\exp\{-\lambda(t-\text{1,200,000})\}}. \end{eqnarray*}$$` and for `\(x>t\)` we find `$$F(x)=1-(1-\pi) \left( \frac{x}{t} \right)^{-\frac1{\gamma}}.$$` Note the different expressions for the CDF for the different ranges of `\(x\)` values! --- class: clear Following function can be used the compute the .KULbginline[CDF of the spliced distribution] (try the integrals!) ```r # CDF for Exp-Pa splicing model ExpPa_cdf <- function(x, sh, threshold, lambda, gamma) { p <- numeric(length(x)) p[x <= threshold] <- (n - kopt) / n * pexp(x[x <= threshold] - sh, rate = lambda) / pexp(threshold - sh, rate = lambda) p[x > threshold] <- 1 - kopt / n * (x[x > threshold] / threshold) ^ (-1/gamma) return(p) } ``` --- class: clear To assess the GoF of the fitted spliced distribution, we plot the empirical CDF together with the fitted CDF. 
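The code behind the figure is not shown on the slide; a minimal base R sketch of ours, reusing the `ExpPa_cdf()` helper and the estimates obtained above, could be:

```r
# empirical CDF of the Secura Re losses vs. the fitted Exp-Pa spliced CDF
x_grid <- seq(min(secura$Loss), max(secura$Loss), length.out = 500)
plot(ecdf(secura$Loss), do.points = FALSE, verticals = TRUE,
     main = "Secura Re - ECDF vs fitted spliced CDF", xlab = "loss", ylab = "CDF")
lines(x_grid, ExpPa_cdf(x_grid, sh = sh, threshold = threshold,
                        lambda = lambda, gamma = gamma), col = "blue", lwd = 2)
abline(v = threshold, lty = 2)   # split point between body and tail
```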
<img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-77-1.svg" width="40%" style="display: block; margin: auto;" /> --- # A mixture of Erlangs for the body .pull-left[ What if it is not straightforward to find a suitable distribution for the .KULbginline[body of the data]? .KULbginline[Mixtures of Erlangs]: - versatile class of distributions, i.e. dense in the space of positive, continuous distributions - mathematically tractable - fitting procedure in {ReIns} package. But, a ME distribution has an .KULbginline[exponential tail], thus no heavy tails! More details in [Verbelen et al. (2015, ASTIN Bulletin)](https://katrienantonio.github.io/portfolio/loss-models). ] .pull-right[ The pdf of a mixture of Erlangs: `$$\begin{align*} f_X(x ; \boldsymbol{\alpha}, \boldsymbol{r}, \theta) &= \sum_{j=1}^{M} \alpha_j f_E(x;r_j, \theta) \\ &= \sum_{j=1}^{M} \alpha_j \frac{x^{r_j-1} e^{-x/\theta}}{\theta^{r_j}(r_j-1)!}, \end{align*}$$` with number of Erlangs `\(M\)`, mixing weights `\(\boldsymbol{\alpha}\)`, common scale `\(\theta\)` and positive integer shape parameters `\((r_1,\ldots, r_M)\)`. The mixing weights should satisfy `\(\alpha_j>0\)` and `\(\sum_j \alpha_j = 1\)`. Question for you: the Erlang distribution may remind you of a better known distribution among actuaries. Which one? ] --- class: clear .pull-left[ ```r # Fit ME model using internal function from # ReIns package MEfit_sec <- ReIns:::.ME_tune(secura$Loss, trunclower = sh, criterium = "BIC", M = 10, s = 1:40)$best_model # Fitted parameters MEfit_sec$alpha ## [1] 0.9707281 0.0292719 MEfit_sec$shape ## [1] 5 16 MEfit_sec$theta ## [1] 359731.4 ``` ] .pull-right[ <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-79-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r ME_VaR <- Vectorize(ReIns:::.ME_VaR, vectorize.args = "p") # QQ-plot plot(ME_VaR((1:n) / (n+1), theta = MEfit_sec$theta, shape = MEfit_sec$shape, alpha = MEfit_sec$alpha, trunclower = sh), sort(secura$Loss), main = "ME QQ-plot", xlab = "Fitted quantiles", ylab = "Empirical quantiles") abline(0, 1) ``` ] .pull-right[ <img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-81-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ Next, we combine what we learned on splicing with the Mixture of Erlangs distribution, to construct .KULbginline[a ME-Pa splicing model] with the splicing point `threshold` estimated earlier on. Again, we start from `\(M=10\)` Erlangs in the mixture part for the body. ```r # Fit ME-Pa splicing model splicefit_sec <- ReIns::SpliceFitPareto( secura$Loss, tsplice = threshold, M = 10, trunclower = sh) # Summary summary(splicefit_sec) ``` How would you describe the resulting model? ] .pull-right[ .tiny[ ``` ## ## ---------------------------------------- ## Summary of splicing fit ## ---------------------------------------- ## ## const = 0.744 ## ## pi = (0.744, 0.256) ## ## t0 = 1 200 000 ## ## t = 2 580 026 ## ## type = (ME, Pa) ## ## * * * * * * * * * * * * * * * * * * * * ## ## p = 1 ## ## r = 7 ## ## theta = 249 142 ## ## M = 1 ## ## M_initial = 10 ## ## * * * * * * * * * * * * * * * * * * * * ## ## gamma = 0.271 ## ## endpoint = Inf ``` ] ] --- class: clear .pull-left[ The first way to assess the GoF is a plot of the ECDF and the fitted CDF. 
```r
# ECDF plot
x <- seq(10^6, 10^7, 10^3)
SpliceECDF(x, secura$Loss, splicefit_sec, lwd = 2)
abline(v = splicefit_sec$t, lty = 2, lwd = 2)
# Zoomed ECDF plot
SpliceECDF(x, secura$Loss, splicefit_sec, lwd = 2, ylim = c(0,0.3))
abline(v = splicefit_sec$t, lty = 2, lwd = 2)
```

]

.pull-right[

<img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-85-1.svg" style="display: block; margin: auto;" />

]

---

class: clear

.pull-left[

The first way to assess the GoF is a plot of the ECDF and the fitted CDF.

```r
# ECDF plot
x <- seq(10^6, 10^7, 10^3)
SpliceECDF(x, secura$Loss, splicefit_sec, lwd = 2)
abline(v = splicefit_sec$t, lty = 2, lwd = 2)
# Zoomed ECDF plot
SpliceECDF(x, secura$Loss, splicefit_sec, lwd = 2, ylim = c(0,0.3))
abline(v = splicefit_sec$t, lty = 2, lwd = 2)
```

]

.pull-right[

<img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-87-1.svg" style="display: block; margin: auto;" />

]

---

class: clear

.pull-left[

Additionally, a PP-plot can be used, which plots the empirical survival function vs. the fitted survival function. To focus on the tails, a version with minus log-scale is also used, i.e. `$$-\log(1-\hat{F}(x_{i,n}))\ \text{vs.}\ -\log(1-F(x_{i,n})).$$`

```r
# PP-plot
SplicePP(secura$Loss, splicefit_sec)
# PP-plot with minus log-scale
SplicePP(secura$Loss, splicefit_sec, log = TRUE)
```

]

.pull-right[

<img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-89-1.svg" width="50%" style="display: block; margin: auto;" />
<img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-90-1.svg" width="50%" style="display: block; margin: auto;" />

]

---

class: clear

.pull-left[

Finally, a QQ-plot can be used again.

```r
# QQ-plot
SpliceQQ(secura$Loss, splicefit_sec)
```

]

.pull-right[

<img src="loss_modelling_reserving_analytics_day_1_files/figure-html/unnamed-chunk-92-1.svg" style="display: block; margin: auto;" />

]

---

# Risk measures

Some popular risk measures are:

* Value-at-Risk
$$ VaR_{1-p} = F^{-1}(1-p)$$

* Value-at-Risk quantifies the size of tail events.

<br>

* Conditional Tail Expectation
`\begin{align} CTE_{1-p} &= E(X \mid X > VaR_{1-p}) \\ &= \frac{1}{p} \int_{1-p}^{1} VaR_u(X)\, du. \end{align}`

* Conditional Tail Expectation quantifies the expected severity of tail events (.KULbginline[what if?]), as illustrated in the sketch below.
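To make these definitions concrete, here is a small empirical illustration on the `secura` losses; it is a sketch only, where the 95% level and the use of the empirical distribution are choices made here rather than part of any fitted model.

```r
# sketch: empirical 95% VaR and CTE for the secura losses
p_tail <- 0.05
VaR_emp <- quantile(secura$Loss, probs = 1 - p_tail, names = FALSE)
CTE_emp <- mean(secura$Loss[secura$Loss > VaR_emp])
c(VaR = VaR_emp, CTE = CTE_emp)
```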
---

# Risk measures for a body-tail fit

We give a general expression for the CDF and quantile function of the body-tail model:

`$$\begin{eqnarray*} f(x) &=& \begin{cases} 0 & x \leq t^l \\ \pi \cdot \color{blue}{\frac{f_1^{\star}(x)}{F_1^{\star}(t)-F_1^{\star}(t^l)}} & t^l \leq x \leq t \\ (1-\pi) \cdot \color{orange}{\frac{f_2^{\star}(x)}{1-F_2^{\star}(t)}} & x > t \end{cases} \end{eqnarray*}$$`

<br>

.pull-left[

The CDF is found by integrating the density:

$$ `\begin{eqnarray*} F(x) &=& \begin{cases} 0 & x \leq t^l \\ \pi \cdot \color{blue}{\frac{F_1^{\star}(x) - F_1^{\star}(t^l)}{F_1^{\star}(t)-F_1^{\star}(t^l)}} & t^l \leq x \leq t \\ \pi + (1-\pi) \cdot \color{orange}{\frac{F_2^{\star}(x) - F_2^{\star}(t)}{1-F_2^{\star}(t)}} & x > t \end{cases} \end{eqnarray*}` $$

]

.pull-right[

The quantile function is obtained by inverting the CDF:

$$ `\begin{eqnarray*} F^{-1}(p) &=& \begin{cases} Q_1^{*} \left( \frac{p}{\pi} \cdot \color{blue}{(F_1^{\star}(t) - F_1^{\star}(t^l))} + F_1^{\star}(t^l) \right) & 0 \leq p \leq \pi \\ Q_2^{*} \left( \frac{p-\pi}{1-\pi} \cdot \color{orange}{(1 - F_2^{\star}(t))} + F_2^{\star}(t) \right) & \pi \leq p \leq 1 \end{cases} \end{eqnarray*}` $$

where `\(Q_1^{*}\)` and `\(Q_2^{*}\)` denote the quantile functions of the body and tail distributions, respectively, used in the splicing model.

]

---

# Risk measures for a body-tail fit

```r
# Quantile function for Exp-Pa splicing model
ExpPa_quantile <- function(p, sh, threshold, lambda, gamma) {
  pi <- (n-kopt)/n
  x <- numeric(length(p))
  x[p <= pi] <- qexp(p[p <= pi] / pi * 
                       (pexp(threshold, rate = lambda) - pexp(sh, rate = lambda)) + 
                       pexp(sh, rate = lambda), rate = lambda)
  x[p > pi] <- threshold * (1 - (p[p > pi] - pi) / (1-pi))^(-gamma)
  return(x)
}
```

---

# Risk measures for a body-tail fit

.pull-left[

95% Value-at-Risk

```r
ExpPa_quantile(0.95, sh, threshold, lambda, gamma)
## [1] 4017259
```

and check

```r
ExpPa_cdf(ExpPa_quantile(0.95, sh, threshold, lambda, gamma), 
          sh, threshold, lambda, gamma)
## [1] 0.95
```

]

.pull-right[

95% Conditional Tail Expectation

```r
qf <- function(p) {
  ExpPa_quantile(p, sh, threshold, lambda, gamma)
}
int <- integrate(qf, lower = 0.95, upper = 1)
int$value / 0.05
## [1] 5511323
```

<br/>

The {ReIns} package has built-in functions to calculate these (and other) risk measures for the ME-Pa spliced fit, see e.g. `ReIns::VaR(p, splicefit_sec)` and `ReIns::CTE(p, splicefit_sec)`.

]

---

name: wrap-up

# Thanks!

<img src="image/xaringan.png" class="title-hex">

<br>
<br>
<br>
<br>

Slides created with the R package [xaringan](https://github.com/yihui/xaringan).
<br> <br> <br> Course material available via <br> <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> https://github.com/katrienantonio/workshop-loss-reserv-fraud