Machine learning for insurance pricing

Picture from Henckaerts et al., 2021

In a series of papers we focus on:

  • smart, data-driven preprocessing of continuous and spatial risk factors for use in GLMs, in Henckaerts et al., 2018
  • a comparative analysis of GLMs, GAMs, and tree-based machine learning methods (trees, RFs, GBMs) in insurance pricing with frequency and severity modelling, see Henckaerts et al., 2021
  • the smart engineering of a GLM as a global surrogate model for a machine learning method (e.g., a GBM), developed in Henckaerts et al., 2022.

Henckaerts et al., 2018 develops a fully data-driven strategy to incorporate continuous risk factors and geographical information in an insurance tariff. This strategy elegantly combines GAMs, trees, clustering methods and GLMs.
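To give a feel for how a tree can turn a continuous risk factor into tariff classes, here is a minimal sketch of a single regression-tree split. It assumes a smooth age effect has already been estimated (for instance by a GAM); the effect values are made up for illustration, and the helper names `sse` and `best_split` are hypothetical. It is not the binning procedure from the paper, only the basic idea of choosing a cut point that keeps each bin as homogeneous as possible.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, effect):
    """Return the cut point that minimises the within-bin squared error.

    x must be sorted; the candidate cut points are the midpoints between
    consecutive x values, as in a one-level regression tree.
    """
    best_loss, best_cut = None, None
    for i in range(1, len(x)):
        loss = sse(effect[:i]) + sse(effect[i:])
        if best_loss is None or loss < best_loss:
            best_loss, best_cut = loss, (x[i - 1] + x[i]) / 2
    return best_cut


# Made-up smooth effect of age on claim frequency (illustration only).
ages = [20, 25, 30, 35, 40, 45, 50]
effect = [0.60, 0.55, 0.20, 0.15, 0.12, 0.10, 0.11]

cut = best_split(ages, effect)
print(f"split age into two bins at {cut}")
```

Applying the same split recursively within each bin would yield more than two classes; the resulting categorical factor can then enter a GLM.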

A notebook reproducing the results of the Henckaerts et al., 2018 paper is available on GitHub.

Henckaerts et al., 2021 presents a detailed discussion of the use of tree-based machine learning methods (i.e. trees, random forests and gradient boosting machines) to model claim frequency and severity data. The goal of this paper is to investigate how tree-based pricing models perform compared to the classical actuarial approach with GLMs and GAMs. This comparison focuses on statistical performance, interpretation and business implications.
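Comparing frequency models requires a loss function suited to claim counts. The sketch below assumes the Poisson deviance, a standard choice for count data, and uses made-up claim counts and predictions; the function name `poisson_deviance` and the two toy models are hypothetical and do not reproduce the paper's evaluation setup.

```python
import math


def poisson_deviance(y, mu):
    """Poisson deviance 2 * sum(y * log(y / mu) - (y - mu)).

    y  : observed claim counts (non-negative)
    mu : predicted expected counts (strictly positive)
    The term y * log(y / mu) is taken as 0 when y == 0.
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        term = mi - yi
        if yi > 0:
            term += yi * math.log(yi / mi)
        total += term
    return 2.0 * total


# Made-up portfolio: observed counts and exposure in policy years.
claims = [0, 0, 1, 0, 2, 0]
exposure = [1.0, 0.5, 1.0, 1.0, 1.0, 0.5]

# Model A: one flat claim frequency for everyone.
flat_rate = sum(claims) / sum(exposure)
mu_flat = [flat_rate * e for e in exposure]

# Model B: hypothetical segmented frequencies per policyholder.
rates = [0.2, 0.2, 0.9, 0.2, 1.5, 0.2]
mu_segmented = [r * e for r, e in zip(rates, exposure)]

print("flat model     :", poisson_deviance(claims, mu_flat))
print("segmented model:", poisson_deviance(claims, mu_segmented))
```

A lower deviance indicates a better fit on the data at hand; in practice the comparison would be made on held-out data rather than on the data used for fitting.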

Essential coding steps to reproduce the main findings from the Henckaerts et al., 2021 paper are here. Claim severity modelling with tree-based methods is illustrated in this notebook. The distRforest package provides an R implementation for fitting random forests with a variety of loss functions.

The Henckaerts et al., 2022 paper proposes a procedure to develop an interpretable global surrogate for a complex system. Knowledge is extracted from a black box via partial dependence effects. These are used to perform smart feature engineering by grouping variable values. This results in a segmentation of the feature space with automatic variable selection. A transparent generalized linear model (GLM) is fit to the features in categorical format and their relevant interactions. This GLM serves as a global surrogate to the original black box and replaces it in production.
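The first two steps of this procedure, extracting a partial dependence effect and grouping variable values, can be sketched as follows. Everything here is an illustrative assumption: `black_box` is a hypothetical stand-in for a fitted model, the data are made up, and the simple tolerance-based merging of neighbouring values is a simplification, not the grouping algorithm implemented in maidrr.

```python
from statistics import mean


def black_box(age, power):
    """Hypothetical stand-in for a fitted black box (e.g., a GBM)."""
    base = 0.20 if age < 25 else 0.10 if age < 60 else 0.13
    return base * (1.0 + 0.002 * power)


def partial_dependence(model, data, grid):
    """Average prediction when age is forced to each grid value.

    data holds (age, power) pairs; the observed power values are kept
    while the age value is overwritten by the grid value.
    """
    return [mean(model(a, power) for _, power in data) for a in grid]


def group_adjacent(grid, effects, tol):
    """Merge neighbouring grid values whose effects differ by at most tol."""
    groups = [[grid[0]]]
    for prev, curr, value in zip(effects, effects[1:], grid[1:]):
        if abs(curr - prev) <= tol:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


# Made-up observations: (age, engine power).
data = [(22, 50), (35, 80), (47, 110), (63, 70), (71, 60)]
grid = list(range(18, 81))

pd_age = partial_dependence(black_box, data, grid)
age_groups = group_adjacent(grid, pd_age, tol=1e-6)
bins = [(g[0], g[-1]) for g in age_groups]
print(bins)
```

Each resulting bin becomes one level of a categorical feature. A GLM fit on such features, together with relevant interactions, then plays the role of the transparent surrogate.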

Source code is available from the maidrr package for R.

This is joint work with Roel Henckaerts, Marie-Pier Côté and Roel Verbelen.

Katrien Antonio
professor in actuarial science and insurance analytics

I’m a professor in actuarial science who loves data science, programming and teaching.