You’ll work through the essential steps of the implementation in R
of the methodology outlined in the paper “A data driven binning strategy for the construction of insurance tariff classes”, by Henckaerts, Antonio, Clijsters and Verbelen in Scandinavian Actuarial Journal (accepted in 2018). The methodology is illustrated with the same data set as the one used by Henckaerts et al. (2018).
First of all, you specify the path where data and output will be stored. Pay attention to the way how directories are specified in R
(with forward slash or double back slash)
path <- file.path('C:/Users/u0043788/Dropbox/APC Module Data Science/computer labs/pricing analytics')
Within this folder you store the data set P&Cdata.txt
. You also create a subfolder called Shape file Belgie postcodes
where you unpack the zip file with the shape file of Belgium at postcode level. You now download, install and load the packages that will be used throughout the workshop: data.table
, dplyr
, mgcv
, evtree
, classInt
, rgdal
, RColorBrewer
, ggplot2
, ggmap
, grid
and gridExtra
. You can use the following instructions to install (if necessary) and load the packages.
packages <- c("data.table", "dplyr", "mgcv", "evtree", "classInt", "rgdal", "RColorBrewer", "ggplot2", "ggmap", "grid", "gridExtra")
suppressMessages(packages <- lapply(packages, FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x)
library(x, character.only = TRUE)
}
}))
You are now ready to load the data and build predictive models.
You will work with the MTPL data analyzed in Henckaerts et al. (2018). The data are stored in .txt
format. The relevant object is called DT
and we will approach it as a data.frame
.
path.MTPL <- file.path(path, "P&Cdata.txt")
path.MTPL
## [1] "C:/Users/u0043788/Dropbox/APC Module Data Science/computer labs/pricing analytics/P&Cdata.txt"
DT <- read.table(path.MTPL, header = TRUE)
DT <- as.data.frame(DT)
Having the data frame DT
available, you start exploring this object. Verify what the following instructions do.
str(DT)
names(DT)
summary(DT)
head(DT)
Exercises:
To build severity models (later on), you take a subset of DT
with all observations for which AMOUNT
is strictly positive and AVG
is less than 81000 EUR. This choice is discussed in Section 3.2 of the paper, and we apply here the same specification to our simulated data (by means of example).
DT.sev <- DT[DT$AMOUNT>0 & DT$AVG<=81000, ]
Exercise:
DT.sev
using the functions introduced above.You start with exploring the data. First, you focus on the empirical, average claim frequency.
mean(DT$NCLAIMS)
## [1] 0.1239715
sum(DT$NCLAIMS)/sum(DT$EXP)
## [1] 0.1393352
# using the pipe operator
library(dplyr)
DT %>% summarize(tot_claims = sum(NCLAIMS))
## tot_claims
## 1 20236
DT %>% summarize(emp_freq = sum(NCLAIMS) / sum(EXP))
## emp_freq
## 1 0.1393352
What is the difference between these instructions to calculate the empirical, average claim frequency?
What about the empirical variance?
m <- sum(DT$NCLAIMS)/sum(DT$EXP)
m
## [1] 0.1393352
var <- sum((DT$NCLAIMS - m*DT$EXP)^2)/sum(DT$EXP)
var
## [1] 0.1517246
Exercise:
You now want to find out how many policyholders reported 0 claims, 1 claim, and so on. Thus, you want to explore the empirical frequency distribution.
table(DT$NCLAIMS)
##
## 0 1 2 3 4 5
## 144936 16556 1558 162 17 2
prop.table(table(DT$NCLAIMS))
##
## 0 1 2 3 4
## 8.879196e-01 1.014268e-01 9.544756e-03 9.924585e-04 1.041469e-04
## 5
## 1.225257e-05
Exercises:
EXP
, the exposure, and COVERAGE
, the type of coverage of each policyholder.To reproduce the graphs from the paper (though now on your simulated data), you will use the ggplot2
package. Let’s start with a simple bar plot of the NCLAIMS
in the black and white theme of ggplot2
.
library("colorspace")
library("ggplot2")
g <- ggplot(data = DT, aes(NCLAIMS)) + theme_bw()
g + geom_bar()
The weight
argument allows you to weight the number of policyholders who file 0 claims, 1 claim and so on by EXP
instead of simply counting the number of policyholders.
g <- ggplot(data = DT, aes(NCLAIMS)) + theme_bw()
g + geom_bar(aes(weight = EXP))
And you can also plot the relative frequency of the policyholders filing 0 claims, 1 claim and so on.
g <- ggplot(data = DT, aes(NCLAIMS)) + theme_bw()
g + geom_bar(aes(y = (..count..)/sum(..count..))) + labs(y = "Relative frequency")
You should check ggplot2 barplot to learn more. To specify your own theme, you define some visualisation parameters and colors that will be used in your ggplot
calls.
col <- "#003366"
fill <- "#99CCFF"
ylab <- "Relative frequency"
This creates the plots in blue as used in the paper
g <- ggplot(data = DT, aes(NCLAIMS)) + theme_bw()
g + geom_bar(col = col, fill = fill)
You will now step from the barplot to a histogram (in ggplot2
), see ggplot2 histogram. Here is the histogram of AGEPH
showing the age composition of the policyholders
g <- ggplot(data = DT, aes(AGEPH)) + theme_bw()
g + geom_histogram(binwidth = 2, col = col, fill = fill)
and the relative frequency histogram
g <- ggplot(data = DT, aes(AGEPH)) + theme_bw()
g + geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 2, col = col, fill = fill) + labs(y = "Relative frequency")
To reproduce the graphs from the paper, we write functions as a wrapper for the instructions used above.
ggplot.bar <- function(DT, variable, xlab){
ggplot(data = DT, aes(as.factor(variable)), environment = environment()) + theme_bw() +
geom_bar(aes(y = (..count..)/sum(..count..)), col = col, fill = fill) + labs(x = xlab, y = ylab)
}
ggplot.hist <- function(DT, variable, xlab, binwidth){
ggplot(data = DT, aes(variable), environment = environment()) + theme_bw() +
geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = binwidth, col = col, fill = fill) +
labs(x = xlab, y = ylab)
}
You call these functions to create a bunch of ggplot
s, store each of them and finally you arrange the plots in a nice way.
# Frequency, exposure and total severity
plot.eda.nclaims <- ggplot.bar(DT, variable = DT$NCLAIMS, "nclaims")
plot.eda.exp <- ggplot.hist(DT, DT$EXP, "exp", 0.05)
plot.eda.amount <- ggplot(data = DT.sev, aes(AMOUNT)) + geom_density(adjust = 3, col = col, fill = fill) + xlim(0,1e4) + ylab(ylab) + xlab("amount") + theme_bw()
# Bar plots of factor variables
plot.eda.coverage <- ggplot.bar(DT, DT$COVERAGE, "coverage")
plot.eda.fuel <- ggplot.bar(DT, DT$FUEL, "fuel")
plot.eda.sex <- ggplot.bar(DT, DT$SEX, "sex")
plot.eda.use <- ggplot.bar(DT, DT$USE, "use")
plot.eda.fleet <- ggplot.bar(DT, DT$FLEET, "fleet")
# Histograms of continuous variables
plot.eda.ageph <- ggplot.hist(DT, DT$AGEPH, "ageph", 2)
plot.eda.agec <- ggplot.hist(DT, DT$AGEC, "agec", 1)
plot.eda.bm <- ggplot.bar(DT, DT$BM, "bm")
plot.eda.power <- ggplot.hist(DT, DT$POWER, "power", 10)
# Putting these together
library(grid)
library(gridExtra)
grid.arrange(plot.eda.nclaims, plot.eda.exp, plot.eda.amount, plot.eda.coverage, plot.eda.fuel, plot.eda.sex, plot.eda.use, plot.eda.fleet, plot.eda.ageph, plot.eda.power, plot.eda.agec, plot.eda.bm, ncol = 4)
To visualize the composition of the portfolio along the combination of AGEPH
and POWER
you will use a two dimensional kernel density estimate. The results will be displayed with contours. More details on geom_density_2d.
plot.eda.agephpower <- ggplot(DT, aes(x = AGEPH, y = POWER)) + stat_density2d(aes(fill = ..level..), geom="polygon")
plot.eda.agephpower <- plot.eda.agephpower + scale_fill_gradient("Density", low="#99CCFF", high="#003366")
plot.eda.agephpower <- plot.eda.agephpower + theme_bw() + xlab("ageph") + ylab("power")
plot.eda.agephpower
You now want to visualize the exposure in each municipality relative to the area of the municipality. First, you calculate the total exposure per postal code. This can be done in several ways (using the data.table
or tidyverse
packages), but with base R
functions:
DT_PC <- aggregate(EXP ~ PC, data=DT, sum)
DT_PC$N <- DT_PC$EXP
head(DT_PC)
## PC EXP N
## 1 1000 961.3178 961.3178
## 2 1030 738.3397 738.3397
## 3 1040 357.2904 357.2904
## 4 1050 524.6164 524.6164
## 5 1060 279.9452 279.9452
## 6 1070 898.8301 898.8301
Exercise:
aggregate
instruction using the %>%
operator?You load the shapefile of Belgium at postal code level with the following function:
library(rgdal)
setwd('C://Users/u0043788/Dropbox/APC Module Data Science/computer labs/pricing analytics')
readShapefile = function(){
belgium_shape <- readOGR(dsn = path.expand(paste(getwd(), "/Shape file Belgie postcodes", sep = "")), layer = "npc96_region_Project1")
belgium_shape <- spTransform(belgium_shape, CRS("+proj=longlat +datum=WGS84"))
belgium_shape$id <- row.names(belgium_shape)
return(belgium_shape)
}
belgium_shape <- readShapefile()
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\u0043788\Dropbox\APC Module Data Science\computer labs\pricing analytics\Shape file Belgie postcodes", layer: "npc96_region_Project1"
## with 1146 features
## It has 6 fields
str(belgium_shape@data)
## 'data.frame': 1146 obs. of 7 variables:
## $ POSTCODE : int 1300 1301 1310 1315 1320 1325 1330 1331 1332 1340 ...
## $ NAAM1 : Factor w/ 22 levels "ANDERLECHT","BRUSSEL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ NAAM2 : Factor w/ 1138 levels "'s Gravenvoeren",..: 1070 118 561 499 82 204 844 854 358 778 ...
## $ POSTCODELA: Factor w/ 1146 levels "1000","1020",..: 23 24 25 26 27 28 29 30 31 32 ...
## $ Shape_Leng: num 42814 15954 20276 43081 29648 ...
## $ Shape_Area: num 32629722 9641834 15381287 38638405 38812269 ...
## $ id : chr "0" "1" "2" "3" ...
You merge this data frame and DT_PC
by postal code such that variable N
is now available in the data frame with the shapefile information. You then calculate the exposure per area unit. Using the cut
function you determine which municipalities have a low, average or high relative exposure.
belgium_shape@data <- merge(belgium_shape@data, DT_PC, by.x = "POSTCODE", by.y = "PC", all.x = TRUE)
belgium_shape@data$freq <- belgium_shape@data$N/belgium_shape@data$Shape_Area
belgium_shape@data$freq_class <- cut(belgium_shape@data$freq, breaks = quantile(belgium_shape@data$freq, c(0, 0.2, 0.8, 1), na.rm = TRUE), right = FALSE, include.lowest = TRUE, labels = c("low","average","high"))
belgium_shape_f <- fortify(belgium_shape)
belgium_shape_f <- merge(belgium_shape_f, belgium_shape@data, all.x = TRUE)
Finally, each postal code gets a color on the map of Belgium based on its low, average or high exposure per area unit.
library(ggmap)
plot.eda.map <- ggplot(belgium_shape_f, aes(long, lat, group = group)) + geom_polygon(aes(fill = belgium_shape_f$freq_class), colour = "black", size=0.1)
plot.eda.map <- plot.eda.map + theme_bw() + labs(fill = "Relative\nfrequency") + scale_fill_brewer(palette = "Blues", na.value = "white")
plot.eda.map
To learn more about maps: see maps with ggplot2.
Exercises:
In the previous Section you explored the data that are available. However, you did not yet link covariates to claims. In this Section you focus on marginal, empirical models by calculating the empirical claim frequency per level of a covariate. Here you see how this works.
DT %>% group_by(AGEPH) %>%
summarize(emp_freq = sum(NCLAIMS) / sum(EXP))
## # A tibble: 78 x 2
## AGEPH emp_freq
## <int> <dbl>
## 1 18 1.08
## 2 19 0.301
## 3 20 0.330
## 4 21 0.276
## 5 22 0.259
## 6 23 0.255
## 7 24 0.243
## 8 25 0.233
## 9 26 0.216
## 10 27 0.192
## # ... with 68 more rows
And you can directly visualize these empirical frequencies.
DT %>% group_by(AGEPH) %>%
summarize(emp_freq = sum(NCLAIMS) / sum(EXP)) %>%
ggplot(aes(x = AGEPH, y = emp_freq)) + theme_bw() +
geom_point(color = "#003366")
Exercises:
You estimate a first Poisson GLM for NCLAIMS
, using the log of EXP
as offset, and FUEL
as the only covariate.
glm.freq1 <- glm(NCLAIMS ~ FUEL, offset = log(EXP), data = DT, family = poisson(link = "log"))
summary(glm.freq1)
##
## Call:
## glm(formula = NCLAIMS ~ FUEL, family = poisson(link = "log"),
## data = DT, offset = log(EXP))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5617 -0.5122 -0.5122 -0.4323 5.7418
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.84669 0.01193 -154.8 <2e-16 ***
## FUELgasoline -0.18451 0.01476 -12.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89944 on 163230 degrees of freedom
## Residual deviance: 89791 on 163229 degrees of freedom
## AIC: 127527
##
## Number of Fisher Scoring iterations: 6
Note that the same model can be estimated with gam
in the mgcv
package. Later on you will need the extra flexibility offered by the additive models from this package.
library(mgcv)
gam.freq1 <- gam(NCLAIMS ~ FUEL, offset = log(EXP), data = DT, family = poisson(link = "log"))
summary(gam.freq1)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ FUEL
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.84669 0.01193 -154.8 <2e-16 ***
## FUELgasoline -0.18451 0.01476 -12.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.00205 Deviance explained = 0.17%
## UBRE = -0.44989 Scale est. = 1 n = 163231
With a continuous covariate (like AGEPH
) you do not want to bin ages upfront, but you first check how the annual expected number of claims changes as a function of AGEPH
(only). The s(AGEPH)
includes a smooth effect of AGEPH
in the regression model.
gam.freq2 <- gam(NCLAIMS ~ s(AGEPH), offset = log(EXP), data = DT, family = poisson(link = "log"))
summary(gam.freq2)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ s(AGEPH)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.995266 0.007194 -277.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(AGEPH) 5.917 6.946 1390 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.00853 Deviance explained = 1.47%
## UBRE = -0.45702 Scale est. = 1 n = 163231
You now want to visualize the estimated effect with tailor made graphs. Use the predict
function with type= "terms"
to obtain \(\hat{s}(\text{AGEPH})\) for all observations in the data set.
pred <- predict(gam.freq2, type = "terms", se = TRUE)
str(pred)
## List of 2
## $ fit : num [1:163231, 1] -0.0359 -0.2976 -0.2273 -0.2948 0.3518 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:163231] "1" "2" "3" "4" ...
## .. ..$ : chr "s(AGEPH)"
## $ se.fit: num [1:163231, 1] 0.0135 0.0179 0.0173 0.0313 0.0139 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:163231] "1" "2" "3" "4" ...
## .. ..$ : chr "s(AGEPH)"
## - attr(*, "constant")= Named num -2
## ..- attr(*, "names")= chr "(Intercept)"
For each unique policyholder age in the data set you extract: the age, the fitted effect and a 95% confidence interval.
b <- pred$fit[,1]
l <- pred$fit[,1] - qnorm(0.975) * pred$se.fit[,1]
u <- pred$fit[,1] + qnorm(0.975) * pred$se.fit[,1]
x <- DT$AGEPH
df <- unique(data.frame(x, b, l, u))
Visualize the fitted effect and the pointwise confidence intervals with ggplot
.
p <- ggplot(df, aes(x = x))
p <- p + geom_line(aes(y = b), size = 1, col="#003366")
p <- p + geom_line(aes(y = l), size = 0.5, linetype = 2, col="#99CCFF")
p <- p + geom_line(aes(y = u), size = 0.5, linetype = 2, col="#99CCFF")
p <- p + xlab("ageph") + ylab(expression(hat(f)(ageph))) + theme_bw()
p
Wrap these instructions in a function ggplot.gam
to visualize the smooth effects estimated with gam
.
ggplot.gam <- function(model,variable,gam_term,xlabel,ylabel){
pred <- predict(model, type = "terms", se = TRUE)
col_index <- which(colnames(pred$fit)==gam_term)
x <- variable
b <- pred$fit[, col_index]
l <- pred$fit[, col_index] - qnorm(0.975) * pred$se.fit[, col_index]
u <- pred$fit[, col_index] + qnorm(0.975) * pred$se.fit[, col_index]
df <- unique(data.frame(x, b, l, u))
p <- ggplot(df, aes(x = x))
p <- p + geom_line(aes(y = b), size = 1,col="#003366")
p <- p + geom_line(aes(y = l), size = 0.5, linetype = 2,col="#99CCFF")
p <- p + geom_line(aes(y = u), size = 0.5, linetype = 2,col="#99CCFF")
p <- p + xlab(xlabel) + ylab(ylabel) + theme_bw()
p
}
plot.gam.freq.ageph <- ggplot.gam(gam.freq2, DT$AGEPH, "s(AGEPH)", "ageph", expression(hat(f)(ageph)))
plot.gam.freq.ageph
Exercises:
pred
when type
is response
, link
and terms
;age
, age
as factor, ad hoc age
bins and smooth effect of age
.You now fit a GAM where NCLAIMS
is modelled with a smooth effect of AGEPH
, a smooth effect of POWER
and an interaction effect of AGEPH
and POWER
that is constructed as a correction on top of the main smooth effects of the continuous risk factors.
gam.freq3 <- gam(NCLAIMS ~ s(AGEPH) + s(POWER) + ti(AGEPH, POWER, bs = "tp"), offset = log(EXP), data = DT, family = poisson(link = "log"))
summary(gam.freq3)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ s(AGEPH) + s(POWER) + ti(AGEPH, POWER, bs = "tp")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.997175 0.007221 -276.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(AGEPH) 6.028 7.047 1419.733 < 2e-16 ***
## s(POWER) 6.821 7.666 76.713 2.94e-13 ***
## ti(AGEPH,POWER) 1.989 2.740 5.951 0.0989 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.00924 Deviance explained = 1.58%
## UBRE = -0.45751 Scale est. = 1 n = 163231
You visualize the main smooth effects with the function ggplot.gam
as defined before.
plot.gam.freq.ageph <- ggplot.gam(gam.freq3, DT$AGEPH, "s(AGEPH)", "ageph", expression(hat(f)(ageph)))
plot.gam.freq.ageph
plot.gam.freq.power <- ggplot.gam(gam.freq3, DT$POWER, "s(POWER)", "power", expression(hat(f)(power)))
plot.gam.freq.power
To visualize the fitted interaction effect, you first create a 2d grid along the range of values observed for the AGEPH
and POWER
variables.
getExtendedAgephPower <- function(){
ageph <- seq(min(DT$AGEPH), max(DT$AGEPH))
power <- seq(min(DT$POWER), max(DT$POWER))
agephpower <- expand.grid(ageph, power)
DText_agephpower <- data.frame("AGEPH" = agephpower$Var1, "POWER" = agephpower$Var2)
return(DText_agephpower)
}
DText_agephpower <- getExtendedAgephPower()
You then extract the fitted values of the interaction effect ti(AGEPH,POWER)
across this grid and you visualize the fitted effect.
pred <- predict(gam.freq3, DText_agephpower, type = "terms", terms = "ti(AGEPH,POWER)")
GAMext.freq.AGEPHPOWER <- data.frame(DText_agephpower$AGEPH, DText_agephpower$POWER,pred)
names(GAMext.freq.AGEPHPOWER) <- c("ageph","power","s")
plot.gam.freq.agephpower <- ggplot(data=GAMext.freq.AGEPHPOWER, aes(ageph, power, z = s)) + geom_raster(aes(fill = s)) + theme_bw()
plot.gam.freq.agephpower <- plot.gam.freq.agephpower + scale_fill_gradient(expression(hat(f)(ageph, power)), low="#99CCFF", high="#003366")
plot.gam.freq.agephpower <- plot.gam.freq.agephpower + stat_contour(breaks = seq(-0.7, 0.7, 0.05), lty = 2, colour = "white") + stat_contour(breaks = 0, lty = 1, colour = "white")
plot.gam.freq.agephpower
This essentially shows you how to deal with an interaction effect of two continuous risk factors.
You now model the expected value of the NCLAIM
with a (2d) smooth effect of the LAT
and LONG
of the centre of the postal code. This approach captures spatial heterogeneity in the data.
gam.freq4 <- gam(NCLAIMS ~ s(LONG,LAT), offset = log(EXP), data = DT, family = poisson(link = "log"))
summary(gam.freq4)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ s(LONG, LAT)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.979195 0.007089 -279.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(LONG,LAT) 26.52 28.54 493.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.00371 Deviance explained = 0.552%
## UBRE = -0.45168 Scale est. = 1 n = 163231
To visualize the fitted effect s(LONG,LAT)
you extract the LAT
and LONG
of all postal codes in Belgium.
belgium_shape = readShapefile()
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\u0043788\Dropbox\APC Module Data Science\computer labs\pricing analytics\Shape file Belgie postcodes", layer: "npc96_region_Project1"
## with 1146 features
## It has 6 fields
str(belgium_shape@data)
## 'data.frame': 1146 obs. of 7 variables:
## $ POSTCODE : int 1300 1301 1310 1315 1320 1325 1330 1331 1332 1340 ...
## $ NAAM1 : Factor w/ 22 levels "ANDERLECHT","BRUSSEL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ NAAM2 : Factor w/ 1138 levels "'s Gravenvoeren",..: 1070 118 561 499 82 204 844 854 358 778 ...
## $ POSTCODELA: Factor w/ 1146 levels "1000","1020",..: 23 24 25 26 27 28 29 30 31 32 ...
## $ Shape_Leng: num 42814 15954 20276 43081 29648 ...
## $ Shape_Area: num 32629722 9641834 15381287 38638405 38812269 ...
## $ id : chr "0" "1" "2" "3" ...
str(coordinates(belgium_shape))
## num [1:1146, 1:2] 4.58 4.57 4.45 4.77 4.75 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1146] "0" "1" "2" "3" ...
## ..$ : NULL
postcode_DT <- data.frame(PC = belgium_shape@data$POSTCODE, LONG = coordinates(belgium_shape)[,1], LAT = coordinates(belgium_shape)[,2])
Now you use predict
to extract the fitted smooth effect for all these postal codes.
pred <- predict(gam.freq4, newdata = postcode_DT, type = "terms", terms = "s(LONG,LAT)")
DT_pred <- data.frame(PC = postcode_DT$PC, LONG = postcode_DT$LONG, LAT = postcode_DT$LAT, pred)
names(DT_pred)[4] <- "s(LONG,LAT)"
Merge the fitted smooth effects with the shape file data and visualize on the map of Belgium.
belgium_shape@data <- merge(belgium_shape@data, DT_pred, by.x = "POSTCODE", by.y = "PC", all.x = TRUE)
belgium_shape_f <- fortify(belgium_shape)
belgium_shape_f <- merge(belgium_shape_f, belgium_shape@data, by = "id", all.x = TRUE)
plot.gam.freq.map <- ggplot(belgium_shape_f, aes(long, lat, group = group)) + geom_polygon(aes(fill = belgium_shape_f$`s(LONG,LAT)`))
plot.gam.freq.map <- plot.gam.freq.map + theme_bw() + scale_fill_gradient(low="#99CCFF",high="#003366") + labs(fill = expression(hat(f)(long,lat)))
plot.gam.freq.map
You are now ready to fit a GAM with factor variables, smooth effects of continuous risk factor, interaction effects of continuous risk factors and a spatial effect.
gam.freq <- gam(NCLAIMS ~ COVERAGE + FUEL + s(AGEPH) + s(BM) + s(POWER) + s(LONG,LAT) + ti(AGEPH,POWER,bs="tp"), offset = log(EXP), data = DT, family = poisson(link = "log"))
summary(gam.freq)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ COVERAGE + FUEL + s(AGEPH) + s(BM) + s(POWER) + s(LONG,
## LAT) + ti(AGEPH, POWER, bs = "tp")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9745911 0.0216710 -91.117 < 2e-16 ***
## COVERAGEPO 0.0008931 0.0239860 0.037 0.97
## COVERAGETPL 0.1221058 0.0221623 5.510 3.6e-08 ***
## FUELgasoline -0.1817652 0.0158127 -11.495 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(AGEPH) 5.709 6.754 295.74 < 2e-16 ***
## s(BM) 6.458 7.373 991.37 < 2e-16 ***
## s(POWER) 6.393 7.282 107.33 < 2e-16 ***
## s(LONG,LAT) 25.754 28.243 418.77 < 2e-16 ***
## ti(AGEPH,POWER) 1.731 2.154 15.88 0.000464 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0196 Deviance explained = 3.5%
## UBRE = -0.46764 Scale est. = 1 n = 163231
You visualize the fitted smooth effects using the function ggplot.gam
specified above. You store the resulting ggplot
s as objects such that these plots can be arranged in a grid of plots later on.
plot.gam.freq.ageph <- ggplot.gam(gam.freq,DT$AGEPH, "s(AGEPH)", "ageph", expression(hat(f)[1](ageph)))
plot.gam.freq.power <- ggplot.gam(gam.freq,DT$POWER, "s(POWER)", "power", expression(hat(f)[2](power)))
plot.gam.freq.bm <- ggplot.gam(gam.freq, DT$BM, "s(BM)", "bm", expression(hat(f)[3](bm)))
To visualize the interaction effect you use the grid of AGEPH
and POWER
values as constructed above. To obtain fitted values over this grid with the GAM stored in object gam.freq
, you have to make sure that the data set DText_agephpower
has the same structure as the original data set DT
. Thus, you should also include variables COVERAGE
, FUEL
, BM
, LONG
, LAT
and EXP
.
DText_agephpower <- getExtendedAgephPower()
DText_agephpower$COVERAGE <- DT$COVERAGE[1]
DText_agephpower$FUEL <- DT$FUEL[1]
DText_agephpower[c("BM", "LONG", "LAT","EXP")] <- c(DT$BM[1], DT$LONG[1], DT$LAT[1], DT$EXP[1])
pred <- predict(gam.freq, DText_agephpower, type = "terms", terms = "ti(AGEPH,POWER)")
GAMext.freq.AGEPHPOWER <- data.frame(DText_agephpower$AGEPH, DText_agephpower$POWER, pred)
names(GAMext.freq.AGEPHPOWER) <- c("ageph", "power", "s")
Visualize the fitted interaction effect and store the plot as plot.gam.freq.agephpower
.
plot.gam.freq.agephpower <- ggplot(data = GAMext.freq.AGEPHPOWER, aes(ageph, power, z=s)) + geom_raster(aes(fill = s)) + theme_bw()
plot.gam.freq.agephpower <- plot.gam.freq.agephpower + scale_fill_gradient(expression(hat(f)[4](ageph,power)), low = "#99CCFF", high = "#003366")
plot.gam.freq.agephpower <- plot.gam.freq.agephpower + stat_contour(breaks = seq(-0.7, 0.7, 0.05), lty = 2, colour = "white") + stat_contour(breaks = 0, lty = 1, colour = "white")
With a similar reasoning you obtain the fitted spatial effect over all postal codes in Belgium.
belgium_shape = readShapefile()
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\u0043788\Dropbox\APC Module Data Science\computer labs\pricing analytics\Shape file Belgie postcodes", layer: "npc96_region_Project1"
## with 1146 features
## It has 6 fields
DT_maps <- data.frame(PC = belgium_shape@data$POSTCODE, LONG = coordinates(belgium_shape)[,1], LAT = coordinates(belgium_shape)[,2])
DT_maps$COVERAGE <- DT$COVERAGE[1]
DT_maps$FUEL <- DT$FUEL[1]
DT_maps[c("BM", "AGEPH", "POWER", "EXP")] <- c(DT$BM[1], DT$AGEPH[1], DT$POWER[1], DT$EXP[1])
pred = predict(gam.freq,newdata = DT_maps, type = "terms", terms = "s(LONG,LAT)")
DT_pred = data.frame(PC = DT_maps$PC, LONG = DT_maps$LONG, LAT = DT_maps$LAT,pred)
names(DT_pred)[4] <- "s(LONG,LAT)"
belgium_shape@data <- merge(belgium_shape@data, DT_pred, by.x = "POSTCODE", by.y = "PC", all.x = TRUE)
belgium_shape_f <- fortify(belgium_shape)
belgium_shape_f <- merge(belgium_shape_f, belgium_shape@data, by = "id", all.x=TRUE)
Visualize the fitted spatial effect and store the plot as plot.gam.freq.map
.
plot.gam.freq.map <- ggplot(belgium_shape_f, aes(long, lat, group = group)) + geom_polygon(aes(fill = belgium_shape_f$`s(LONG,LAT)`))
plot.gam.freq.map <- plot.gam.freq.map + theme_bw() + scale_fill_gradient(low="#99CCFF",high="#003366") + labs(fill = expression(hat(f)[5](long,lat)))
Now you arrange the plots created in a matrix and print the resulting graph.
layout <- rbind(c(1,1,2,2,3,3),
c(4,4,4,5,5,5))
grid.arrange(plot.gam.freq.ageph, plot.gam.freq.power, plot.gam.freq.bm, plot.gam.freq.agephpower, plot.gam.freq.map, layout_matrix = layout)
You construct a data set with all Belgian postal codes, their corresponding LAT
and LONG
(of the center of the postal code) and the fitted smooth spatial effect from gam.freq
the optimal GAM.
pred <- predict(gam.freq, newdata = DT_maps, type = "terms", terms = "s(LONG,LAT)")
GAM.freq.LONGLAT = data.frame("pc" = factor(DT_maps$PC), "long" = DT_maps$LONG, "lat" = DT_maps$LAT,pred)
names(GAM.freq.LONGLAT) <- c("pc","long","lat","s")
GAM.freq.LONGLAT <- GAM.freq.LONGLAT[order(GAM.freq.LONGLAT$pc), ]
str(GAM.freq.LONGLAT)
## 'data.frame': 1146 obs. of 4 variables:
## $ pc : Factor w/ 1146 levels "1000","1020",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ long: num 4.36 4.34 4.37 4.38 4.37 ...
## $ lat : num 50.8 50.9 50.9 50.8 50.8 ...
## $ s : num 0.311 0.264 0.304 0.327 0.329 ...
head(GAM.freq.LONGLAT)
## pc long lat s
## 1145 1000 4.355223 50.84539 0.3109892
## 1142 1020 4.338003 50.88874 0.2643301
## 1137 1030 4.374677 50.86217 0.3042472
## 1127 1040 4.382733 50.83333 0.3266787
## 1131 1050 4.369740 50.82342 0.3287719
## 1135 1060 4.332471 50.83013 0.3067092
You now cluster the fitted spatial effect GAM.freq.LONGLAT$s
in - say - 5 bins or clusters, using the Fisher-Jenks clustering technique. This clustering method is available in R
in the package classInt
.
num_bins = 5
library(classInt)
classint.fisher = classIntervals(GAM.freq.LONGLAT$s, num_bins, style = "fisher")
str(classint.fisher)
## List of 2
## $ var : num [1:1146] 0.311 0.264 0.304 0.327 0.329 ...
## $ brks: num [1:6] -0.4809 -0.2701 -0.1411 -0.0358 0.1096 ...
## - attr(*, "style")= chr "fisher"
## - attr(*, "parameters")= num [1:5, 1:4] 0.113 -0.035 -0.14 -0.267 -0.481 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "min" "max" "class mean" "class sd"
## - attr(*, "nobs")= int 1146
## - attr(*, "call")= language classIntervals(var = GAM.freq.LONGLAT$s, n = num_bins, style = "fisher")
## - attr(*, "intervalClosure")= chr "left"
## - attr(*, "class")= chr "classIntervals"
classint.fisher$brks
## [1] -0.48086349 -0.27012356 -0.14106325 -0.03575633 0.10961949 0.34051818
min(GAM.freq.LONGLAT$s)
## [1] -0.4808635
max(GAM.freq.LONGLAT$s)
## [1] 0.3405182
You visualize the resulting object classint.fisher
with the built-in plot
function from the classInt
package.
crp <- colorRampPalette(c("#99CCFF","#003366"))
plot(classint.fisher, crp(num_bins), xlab = expression(hat(f)[5](long,lat)), main = "Fisher")
You visualize the clusters obtained with Fisher-Jenks on the map of Belgium.
belgium_shape <- readShapefile()
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\u0043788\Dropbox\APC Module Data Science\computer labs\pricing analytics\Shape file Belgie postcodes", layer: "npc96_region_Project1"
## with 1146 features
## It has 6 fields
str(belgium_shape@data)
## 'data.frame': 1146 obs. of 7 variables:
## $ POSTCODE : int 1300 1301 1310 1315 1320 1325 1330 1331 1332 1340 ...
## $ NAAM1 : Factor w/ 22 levels "ANDERLECHT","BRUSSEL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ NAAM2 : Factor w/ 1138 levels "'s Gravenvoeren",..: 1070 118 561 499 82 204 844 854 358 778 ...
## $ POSTCODELA: Factor w/ 1146 levels "1000","1020",..: 23 24 25 26 27 28 29 30 31 32 ...
## $ Shape_Leng: num 42814 15954 20276 43081 29648 ...
## $ Shape_Area: num 32629722 9641834 15381287 38638405 38812269 ...
## $ id : chr "0" "1" "2" "3" ...
belgium_shape@data <- merge(belgium_shape@data, GAM.freq.LONGLAT[c("pc","s")], by.x = "POSTCODE", by.y = "pc", all.x = TRUE)
belgium_shape@data$class_fisher <- cut(as.numeric(belgium_shape@data$s), breaks = classint.fisher$brks, right = FALSE, include.lowest=TRUE, dig.lab = 2)
belgium_shape_f <- fortify(belgium_shape)
belgium_shape_f <- merge(belgium_shape_f, belgium_shape@data, by="id", all.x=TRUE)
plot.bin.map.fisher <- ggplot(belgium_shape_f, aes(long,lat, group = group)) + geom_polygon(aes(fill = belgium_shape_f$class_fisher)) + theme_bw() + labs(fill = "Fisher") + scale_fill_brewer(palette="Blues", na.value = "white")
plot.bin.map.fisher
Exercise:
You now create a new data frame DT.geo
with the relevant variables from the original DT
data frame and merge this data frame with the GAM.freq.LONGLAT
data by postal code. You then split the fitted spatial effect DT.geo$s
with the cut
function where the breaks
are stored in classint.fisher$brks
. These define the clusters as calculated with the Fisher-Jenks technique.
DT.geo <- DT[c("NCLAIMS", "EXP", "COVERAGE", "FUEL", "AGEPH", "BM", "POWER", "PC")]
DT.geo <- merge(DT.geo, GAM.freq.LONGLAT, by.x = "PC", by.y = "pc", all.x = TRUE)
DT.geo$GEO <- as.factor(cut(DT.geo$s, breaks = classint.fisher$brks, right = FALSE, include.lowest=TRUE, dig.lab = 2))
head(DT.geo$GEO)
## [1] [0.11,0.34] [0.11,0.34] [0.11,0.34] [0.11,0.34] [0.11,0.34] [0.11,0.34]
## 5 Levels: [-0.48,-0.27) [-0.27,-0.14) [-0.14,-0.036) ... [0.11,0.34]
The data frame DT.geo
now stores the original data plus the geographical bins (1 to 5) calculated with Fisher-Jenks. You refit the GAM, but instead of using a smooth spatial effect of LAT
and LONG
you include GEO
as a factor variable.
gam.freq.geo <- gam(NCLAIMS ~ COVERAGE + FUEL + s(AGEPH) + s(BM) + s(POWER) + ti(AGEPH,POWER,bs="tp") +
GEO, offset=log(EXP) , data = DT.geo, family = poisson(link = "log"))
summary(gam.freq.geo)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ COVERAGE + FUEL + s(AGEPH) + s(BM) + s(POWER) + ti(AGEPH,
## POWER, bs = "tp") + GEO
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.273459 0.055655 -40.849 < 2e-16 ***
## COVERAGEPO -0.004242 0.023915 -0.177 0.85921
## COVERAGETPL 0.118691 0.022002 5.395 6.86e-08 ***
## FUELgasoline -0.179976 0.015758 -11.421 < 2e-16 ***
## GEO[-0.27,-0.14) 0.124469 0.056027 2.222 0.02631 *
## GEO[-0.14,-0.036) 0.173899 0.054566 3.187 0.00144 **
## GEO[-0.036,0.11) 0.330240 0.053198 6.208 5.37e-10 ***
## GEO[0.11,0.34] 0.531954 0.054329 9.791 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(AGEPH) 5.663 6.709 295.71 < 2e-16 ***
## s(BM) 6.709 7.587 1019.68 < 2e-16 ***
## s(POWER) 6.409 7.296 111.05 < 2e-16 ***
## ti(AGEPH,POWER) 1.795 2.264 14.96 0.000856 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0195 Deviance explained = 3.46%
## UBRE = -0.46769 Scale est. = 1 n = 163231
You first extract the data set on which the regression tree will be constructed. For instance, for the tree that splits the smooth effect of AGEPH
you need a data frame with each unique AGEPH
, the corresponding fitted smooth effect and the corresponding weight. For the weight you use the number of policyholders (obtained with aggregate
) in the original data set with this specific age. You can use the function getGAMdata_single
to extract the data to bin the smooth effect of a single covariate.
getGAMdata_single = function(GAMmodel,term,var,varname){
pred = predict(GAMmodel, type = "terms", terms = term)
DT_pred = data.frame("x"=var, pred)
DT_pred = DT_pred[order(DT_pred$x),]
names(DT_pred) = c("x","s")
DT_unique = unique(DT_pred)
DT_exp <- aggregate(s ~ x, data=DT_pred, length)
DT_exp$exp <- DT_exp$s
DT_exp <- DT_exp[c("x","exp")]
GAM_data = merge(DT_unique,DT_exp,by="x")
names(GAM_data) = c(varname,"s","exp")
GAM_data = GAM_data[which(GAM_data$exp!=0),]
return(GAM_data)
}
Extract the relevant data set to split AGEPH
, BM
and POWER
:
GAM.freq.AGEPH <- getGAMdata_single(gam.freq.geo, "s(AGEPH)", DT.geo$AGEPH, "ageph")
GAM.freq.BM <- getGAMdata_single(gam.freq.geo, "s(BM)", DT.geo$BM, "bm")
GAM.freq.POWER <- getGAMdata_single(gam.freq.geo, "s(POWER)", DT.geo$POWER, "power")
head(GAM.freq.AGEPH)
## ageph s exp
## 1 18 0.4855297 16
## 2 19 0.4512298 116
## 3 20 0.4169813 393
## 4 21 0.3828736 701
## 5 22 0.3490415 952
## 6 23 0.3156652 1379
The function getGAMdata_int
extracts the relevant information for interaction effects and is therefore applied to split the ti(AGEPH,POWER)
.
getGAMdata_int = function(GAMmodel,term,var1,var2,varname1,varname2){
pred <- predict(GAMmodel, type = "terms",terms = term)
DT_pred <- data.frame("x"=var1,"y"=var2, pred)
DT_pred <- with(DT_pred, DT_pred[order(x,y),])
names(DT_pred) = c("x","y","s")
DT_unique = unique(DT_pred)
DT_exp <- aggregate(s ~ x+y, data=DT_pred, length)
DT_exp$exp <- DT_exp$s
DT_exp <- DT_exp[c("x","y","exp")]
GAM_data = merge(DT_unique,DT_exp,by=c("x","y"))
names(GAM_data) = c(varname1,varname2,"s","exp")
GAM_data = GAM_data[which(GAM_data$exp!=0),]
return(GAM_data)
}
GAM.freq.AGEPHPOWER = getGAMdata_int(gam.freq.geo,"ti(AGEPH,POWER)",DT.geo$AGEPH, DT.geo$POWER,"ageph","power")
head(GAM.freq.AGEPHPOWER)
## ageph power s exp
## 1 18 21 -0.12567090 1
## 2 18 34 -0.07475940 1
## 3 18 36 -0.06703996 1
## 4 18 37 -0.06319774 2
## 5 18 39 -0.05555274 1
## 6 18 40 -0.05175175 3
You use the package evtree
to construct the regression trees. While working on the paper we slightly changed the original code (on line 46) to accommodate the fact that we build trees on data sets where each covariate value (e.g. AGEPH
) is observed once, with a certain weight. Therefore you source the function evtree.R
that is available in your documentation.
library(evtree)
source("evtree.R")
You specify the control parameters such each bin (or leaf node) should at least contain 5% of the policyholders in the entire portfolio, the tuning parameter \(\alpha\) is put equal to 550 and trees have a maximum depth of 5 leaf nodes. Do note thate the optimal tuning parameter should be tuned for the specific, simulated data set you are using. However, by means of example, we just use the optimal value obtained for the data set analyzed in the paper.
ctrl.freq = evtree.control(minbucket = 0.05*nrow(DT), alpha = 550, maxdepth = 5)
By means of example, you construct the tree on the GAM.freq.AGEPH
data set wheree s
is the response, AGEPH
the covariate to split on and exp
stores the weights.
evtree.freq.AGEPH <- evtree(s ~ ageph, data = GAM.freq.AGEPH, weights = exp, control = ctrl.freq)
evtree.freq.AGEPH
##
## Model formula:
## s ~ ageph
##
## Fitted party:
## [1] root
## | [2] ageph < 56
## | | [3] ageph < 26: 0.303 (n = 8258, err = 22.9)
## | | [4] ageph >= 26
## | | | [5] ageph < 51
## | | | | [6] ageph < 33
## | | | | | [7] ageph < 29: 0.191 (n = 9265, err = 4.8)
## | | | | | [8] ageph >= 29: 0.107 (n = 13939, err = 6.9)
## | | | | [9] ageph >= 33: 0.036 (n = 69055, err = 9.8)
## | | | [10] ageph >= 51: -0.031 (n = 15674, err = 7.2)
## | [11] ageph >= 56
## | | [12] ageph < 61: -0.115 (n = 12591, err = 6.3)
## | | [13] ageph >= 61
## | | | [14] ageph < 73: -0.197 (n = 25323, err = 6.8)
## | | | [15] ageph >= 73: -0.148 (n = 9126, err = 23.2)
##
## Number of inner nodes: 7
## Number of terminal nodes: 8
plot(evtree.freq.AGEPH)
This can be done in the exact same way for the other smooth effect. To split the interaction effect, splits can be made both on AGEPH
and POWER
.
evtree.freq.BM <- evtree(s ~ bm, data = GAM.freq.BM, weights = exp, control = ctrl.freq)
evtree.freq.BM
##
## Model formula:
## s ~ bm
##
## Fitted party:
## [1] root
## | [2] bm < 3
## | | [3] bm < 2
## | | | [4] bm < 1: -0.226 (n = 61650, err = 0.0)
## | | | [5] bm >= 1: -0.099 (n = 26971, err = 0.0)
## | | [6] bm >= 2: 0.001 (n = 9522, err = 0.0)
## | [7] bm >= 3
## | | [8] bm < 9
## | | | [9] bm < 7: 0.106 (n = 29529, err = 19.0)
## | | | [10] bm >= 7: 0.213 (n = 9789, err = 7.4)
## | | [11] bm >= 9
## | | | [12] bm < 11: 0.360 (n = 12868, err = 24.6)
## | | | [13] bm >= 11: 0.524 (n = 12902, err = 37.0)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
evtree.freq.POWER <- evtree(s ~ power, data = GAM.freq.POWER, weights = exp, control = ctrl.freq)
evtree.freq.POWER
##
## Model formula:
## s ~ power
##
## Fitted party:
## [1] root
## | [2] power < 36: -0.174 (n = 13726, err = 80.9)
## | [3] power >= 36
## | | [4] power < 46: -0.040 (n = 43411, err = 18.4)
## | | [5] power >= 46
## | | | [6] power < 75: 0.017 (n = 82594, err = 17.6)
## | | | [7] power >= 75: 0.116 (n = 23500, err = 59.0)
##
## Number of inner nodes: 3
## Number of terminal nodes: 4
evtree.freq.AGEPHPOWER <- evtree(s ~ ageph + power,data = GAM.freq.AGEPHPOWER, weights = exp, control = ctrl.freq)
evtree.freq.AGEPHPOWER
##
## Model formula:
## s ~ ageph + power
##
## Fitted party:
## [1] root
## | [2] ageph < 57
## | | [3] ageph < 40
## | | | [4] power < 49: -0.029 (n = 25328, err = 5.8)
## | | | [5] power >= 49
## | | | | [6] power < 72: 0.008 (n = 24166, err = 3.2)
## | | | | [7] power >= 72: 0.047 (n = 8544, err = 3.2)
## | | [8] ageph >= 40: 0.000 (n = 60426, err = 4.8)
## | [9] ageph >= 57
## | | [10] power < 68
## | | | [11] power < 47: 0.039 (n = 16548, err = 7.1)
## | | | [12] power >= 47: -0.005 (n = 19982, err = 4.0)
## | | [13] power >= 68: -0.052 (n = 8237, err = 4.7)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
The next lines of code visualize the splits on top of the fitted smooth effects. The syntax is rather technical, but uses similar instructions as demonstrated above.
splits_evtree = function(evtreemodel,GAMvar,DTvar){
preds=predict(evtreemodel,type="node")
nodes=data.frame("x"=GAMvar,"nodes"=preds)
nodes$change=c(0,pmin(1,diff(nodes$nodes)))
splits_evtree=unique(c(min(DTvar),nodes$x[which(nodes$change==1)],max(DTvar)))
return(splits_evtree)
}
splits2D_evtree = function(evtreemodel,GAMdata,GAMdata_X,GAMdata_Y){
pred = predict(evtreemodel,GAMdata,type="response")
values <- data.frame("X"=GAMdata_X,"Y"=GAMdata_Y,"pred"=pred)
min.X <- as.numeric(tapply(values$X, values$pred, min))
min.Y <- as.numeric(tapply(values$Y, values$pred, min))
max.X <- as.numeric(tapply(values$X, values$pred, max))
max.Y <- as.numeric(tapply(values$Y, values$pred, max))
splits_2D_evtree <- data.frame("xmin"=min.X,"xmax"=max.X,"ymin"=min.Y,"ymax"=max.Y)
return(splits_2D_evtree)
}
splits.freq.AGEPH = splits_evtree(evtree.freq.AGEPH,GAM.freq.AGEPH$ageph,DT$AGEPH)
splits.freq.AGEPH
## [1] 18 26 29 33 51 56 61 73 95
splits.freq.BM = splits_evtree(evtree.freq.BM,GAM.freq.BM$bm,DT$BM)
splits.freq.BM
## [1] 0 1 2 3 7 9 11 22
splits.freq.POWER = splits_evtree(evtree.freq.POWER,GAM.freq.POWER$power,DT$POWER)
splits.freq.POWER
## [1] 10 36 46 75 243
DText_agephpower <- getExtendedAgephPower()
DText_agephpower$COVERAGE <- DT$COVERAGE[1]
DText_agephpower$FUEL <- DT$FUEL[1]
DText_agephpower[c("BM", "LONG", "LAT","EXP")] <- c(DT$BM[1], DT$LONG[1], DT$LAT[1], DT$EXP[1])
pred <- predict(gam.freq,DText_agephpower,type = "terms",terms = "ti(AGEPH,POWER)")
GAMext.freq.AGEPHPOWER <- data.frame(DText_agephpower$AGEPH,DText_agephpower$POWER,pred)
names(GAMext.freq.AGEPHPOWER) <- c("ageph","power","s")
splits.freq.AGEPHPOWER = splits2D_evtree(evtree.freq.AGEPHPOWER,GAMext.freq.AGEPHPOWER,GAMext.freq.AGEPHPOWER$ageph,GAMext.freq.AGEPHPOWER$power)
splits.freq.AGEPHPOWER
## xmin xmax ymin ymax
## 1 57 95 68 243
## 2 18 39 10 48
## 3 57 95 47 67
## 4 40 56 10 243
## 5 18 39 49 71
## 6 57 95 10 46
## 7 18 39 72 243
plot.bin.freq.ageph = ggplot.gam(gam.freq,DT$AGEPH,"s(AGEPH)","ageph",expression(hat(f)[1](ageph))) + geom_vline(xintercept = splits.freq.AGEPH[2:(length(splits.freq.AGEPH)-1)])
plot.bin.freq.power = ggplot.gam(gam.freq,DT$POWER,"s(POWER)","power",expression(hat(f)[2](power))) + geom_vline(xintercept = splits.freq.POWER[2:(length(splits.freq.POWER)-1)])
plot.bin.freq.bm = ggplot.gam(gam.freq,DT$BM,"s(BM)","bm",expression(hat(f)[3](bm))) + geom_vline(xintercept = splits.freq.BM[2:(length(splits.freq.BM)-1)])
plot.bin.freq.agephpower <- ggplot(data=GAMext.freq.AGEPHPOWER,aes(ageph,power)) + geom_raster(aes(fill=s)) + theme_bw() +
scale_fill_gradient(expression(hat(f)[4](ageph,power)),low="#99CCFF",high="#003366") +
stat_contour(aes(z=s),breaks=seq(-0.7,0.7,0.05),lty=2,colour="white") + stat_contour(aes(z=s),breaks=0,lty=1,colour="white") +
geom_segment(aes(x=xmin,y=ymin,xend=xmin,yend=ymax),data=splits.freq.AGEPHPOWER) +
geom_segment(aes(x=xmin,y=ymin,xend=xmax,yend=ymin),data=splits.freq.AGEPHPOWER) +
geom_segment(aes(x=xmin,y=ymax,xend=xmax,yend=ymax),data=splits.freq.AGEPHPOWER) +
geom_segment(aes(x=xmax,y=ymin,xend=xmax,yend=ymax),data=splits.freq.AGEPHPOWER)
grid.arrange(plot.bin.freq.ageph,plot.bin.freq.power,plot.bin.freq.bm,plot.bin.freq.agephpower,ncol=2)
As a final step, you construct a data frame with the response NCLAIMS
, the exposure EXP
, the factor variables COVERAGE
, FUEL
and GEO
, and you add the binned AGEPH
, BM
, POWER
and the interaction of AGEPH
and POWER
to this data set.
DT.freq.bin <- DT.geo[c("NCLAIMS", "EXP", "COVERAGE", "FUEL", "GEO")]
DT.freq.bin$AGEPH <- cut(DT.geo$AGEPH, splits.freq.AGEPH, right = FALSE, include.lowest = TRUE)
summary(DT.freq.bin$AGEPH)
## [18,26) [26,29) [29,33) [33,51) [51,56) [56,61) [61,73) [73,95]
## 8258 9265 13939 69055 15674 12591 25323 9126
DT.freq.bin$BM <- cut(DT.geo$BM, splits.freq.BM, right = FALSE, include.lowest = TRUE)
summary(DT.freq.bin$BM)
## [0,1) [1,2) [2,3) [3,7) [7,9) [9,11) [11,22]
## 61650 26971 9522 29529 9789 12868 12902
DT.freq.bin$POWER <- cut(DT.geo$POWER, splits.freq.POWER, right = FALSE, include.lowest=TRUE)
summary(DT.freq.bin$POWER)
## [10,36) [36,46) [46,75) [75,243]
## 13726 43411 82594 23500
DT.freq.bin$AGEPHPOWER <- round(predict(evtree.freq.AGEPHPOWER, data.frame("ageph" = DT.geo$AGEPH, "power" = DT.geo$POWER), type = "response"), digits=3)
DT.freq.bin$AGEPHPOWER[abs(DT.freq.bin$AGEPHPOWER) < 0.01] <- 0
DT.freq.bin$AGEPHPOWER <- as.factor(DT.freq.bin$AGEPHPOWER)
summary(DT.freq.bin$AGEPHPOWER)
## -0.052 -0.029 0 0.039 0.047
## 8237 25328 104574 16548 8544
and for the factor variables
summary(DT.freq.bin$COVERAGE)
## FO PO TPL
## 22107 45988 95136
summary(DT.freq.bin$FUEL)
## diesel gasoline
## 50398 112833
summary(DT.freq.bin$GEO)
## [-0.48,-0.27) [-0.27,-0.14) [-0.14,-0.036) [-0.036,0.11) [0.11,0.34]
## 4106 21825 34241 71374 31685
You now fit a GLM with only factor variables, and you first choose meaningful reference level for each of the factor variables used.
DT.freq.bin$AGEPH <- relevel(DT.freq.bin$AGEPH, ref = "[33,51)")
DT.freq.bin$BM <- relevel(DT.freq.bin$BM, ref = "[0,1)")
DT.freq.bin$POWER <- relevel(DT.freq.bin$POWER, ref = "[46,75)")
DT.freq.bin$AGEPHPOWER <- relevel(DT.freq.bin$AGEPHPOWER, ref = "0")
DT.freq.bin$GEO <- relevel(DT.freq.bin$GEO, ref = "[-0.036,0.11)")
DT.freq.bin$COVERAGE <- relevel(DT.freq.bin$COVERAGE, ref = "TPL")
DT.freq.bin$FUEL <- relevel(DT.freq.bin$FUEL, ref = "gasoline")
** Exercise**:
relevel
function. How would you choose the reference levels?Resulting GLM for NCLAIMS
is stored in glm.freq
and uses only factor variables.
glm.freq <- gam(NCLAIMS ~ COVERAGE + FUEL + AGEPH + BM + POWER + AGEPHPOWER + GEO, offset = log(EXP) , data = DT.freq.bin, family = poisson(link = "log"))
summary(glm.freq)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ COVERAGE + FUEL + AGEPH + BM + POWER + AGEPHPOWER +
## GEO
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17531 0.02138 -101.734 < 2e-16 ***
## COVERAGEFO -0.11399 0.02187 -5.212 1.87e-07 ***
## COVERAGEPO -0.11964 0.01678 -7.129 1.01e-12 ***
## FUELdiesel 0.17802 0.01574 11.313 < 2e-16 ***
## AGEPH[18,26) 0.27717 0.02967 9.341 < 2e-16 ***
## AGEPH[26,29) 0.14379 0.02925 4.917 8.80e-07 ***
## AGEPH[29,33) 0.06492 0.02612 2.485 0.01295 *
## AGEPH[51,56) -0.06093 0.02648 -2.301 0.02142 *
## AGEPH[56,61) -0.16279 0.03330 -4.888 1.02e-06 ***
## AGEPH[61,73) -0.24733 0.02963 -8.348 < 2e-16 ***
## AGEPH[73,95] -0.19082 0.04086 -4.670 3.01e-06 ***
## BM[1,2) 0.12522 0.02296 5.454 4.93e-08 ***
## BM[2,3) 0.18461 0.03314 5.570 2.55e-08 ***
## BM[3,7) 0.34292 0.02125 16.140 < 2e-16 ***
## BM[7,9) 0.48463 0.02978 16.271 < 2e-16 ***
## BM[9,11) 0.54521 0.02702 20.176 < 2e-16 ***
## BM[11,22] 0.78219 0.02592 30.175 < 2e-16 ***
## POWER[10,36) -0.20689 0.03305 -6.259 3.88e-10 ***
## POWER[36,46) -0.06471 0.02268 -2.853 0.00434 **
## POWER[75,243] 0.11630 0.02807 4.143 3.43e-05 ***
## AGEPHPOWER-0.052 -0.07053 0.04641 -1.520 0.12859
## AGEPHPOWER-0.029 -0.02585 0.02609 -0.991 0.32176
## AGEPHPOWER0.039 0.05843 0.03967 1.473 0.14078
## AGEPHPOWER0.047 0.03483 0.03730 0.934 0.35051
## GEO[-0.48,-0.27) -0.32882 0.05320 -6.181 6.37e-10 ***
## GEO[-0.27,-0.14) -0.20339 0.02326 -8.744 < 2e-16 ***
## GEO[-0.14,-0.036) -0.15519 0.01944 -7.985 1.40e-15 ***
## GEO[0.11,0.34] 0.19847 0.01822 10.894 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.0193 Deviance explained = 3.43%
## UBRE = -0.46754 Scale est. = 1 n = 163231
anova(glm.freq)
##
## Family: poisson
## Link function: log
##
## Formula:
## NCLAIMS ~ COVERAGE + FUEL + AGEPH + BM + POWER + AGEPHPOWER +
## GEO
##
## Parametric Terms:
## df Chi.sq p-value
## COVERAGE 2 62.92 2.17e-14
## FUEL 1 127.99 < 2e-16
## AGEPH 7 193.42 < 2e-16
## BM 6 1058.05 < 2e-16
## POWER 3 64.95 5.13e-14
## AGEPHPOWER 4 12.55 0.0137
## GEO 4 388.92 < 2e-16
Now it’s your turn to explore the severity models proposed in the paper!