Prediction region for average claim occurrence rate and average claim size in motor insurance

The third-party motor insurance data from Sweden for 1977, described by Andrews and Herzberg in 1985, contain the average claim occurrence rate (Pc) and the average claim size (Ca) for each category of vehicles specified by the kilometres travelled per year (K), geographical zone (Z), no claims bonus (B) and make of car (M). The categorical variables Z and M may first be represented respectively by the vectors (Z1, Z2, ..., Z6) and (M1, M2, ..., M8) of binary variables. The variable (Pc, Ca) is next modelled to be dependent on X = (K, Z1, Z2, ..., Z6, B, M1, M2, ..., M8) via a conditional distribution which is derived from an 18-dimensional power-normal distribution. From the conditional distribution, a prediction region for (Pc, Ca) can be obtained to provide useful information on the possible ranges of average claim occurrence rate and average claim size for a given category of vehicles.


Introduction
Determination of fair and accurate tariffs is an important issue in the car insurance industry. Decision trees and combinations of regression techniques have been used to analyse car insurance data [1][2][3][4]. Jørgensen and de Souza [5] assumed that the claim number is Poisson distributed while the cost of individual claims is gamma distributed, and modelled the expected cost of claims per insured unit as a function of the explanatory variables. Double generalized linear models whose underlying exponential family of distributions is restricted to the Tweedie family were investigated by Jørgensen and Smyth [6] and Andersen and Bonat [7]. These authors modelled the dispersion and mean of the costs simultaneously and produced the tariffs based on the model fitted to the Swedish third-party automobile portfolio of 1977. In the double generalized linear models, the logarithm of the mean is very often set to be a linear function of the explanatory variables. Yang et al. [8] used a gradient tree-boosting algorithm to replace the logarithmic mean by a highly complex functional form. An extension of the lasso which imposes the grouped elastic net penalty was used by Qian et al. [9] to select the explanatory variables to be included in the final Tweedie's compound Poisson model. Another approach to determining the motor insurance rate was proposed by Pan et al. [10]. They used the multivariate power-normal distribution to fit the Swedish third-party motor insurance data for 1977 and proposed using a conditional distribution for the payment per insured to determine the motor insurance rate.
The approach based on the Tweedie family of generalized linear models implicitly assumes that the claim counts and amounts are independent. However, in practice, claim frequency and severity may be dependent. Several authors attempted to deal with the case when dependency between the claim frequency and severity exists [11][12][13][14][15][16].
In this paper, without assuming that claim counts and amounts are independent, we derive a two-dimensional conditional distribution for the average claim occurrence rate and the average claim size. This conditional distribution is obtained from the multivariate power-normal distribution for the vector of variables given by the average claim occurrence rate, the average claim size, and those formed from the various characteristics of the insured vehicles. From the conditional distribution, we construct a prediction region for the average claim occurrence rate and the average claim size. This region provides useful information on the possible ranges of average claim occurrence rate and average claim size for a given category of vehicles. From the conditional distribution, we next find the distribution for the payment per insured, which equals the product of the average claim occurrence rate and the average claim size. From the distribution for the payment per insured, we construct a prediction interval for the payment per insured. The first 100 resulting prediction intervals are found to have a shorter average length, but comparable estimated coverage probability, when compared with those given in Pan et al. [10]. Thus, the conditional distribution derived in this paper provides a good alternative method for determining the motor insurance rate.
This paper contains five sections. The second section outlines a numerical method for finding a two-dimensional conditional distribution from a multivariate power-normal (MPN) distribution. The third section describes the construction of a two-dimensional prediction region for (Pc, Ca) = (average claim occurrence rate, average claim size) as well as in-sample and out-of-sample prediction intervals for the claim size per insured, which equals the product of Pc and Ca. In Section 4, we fit an MPN distribution to the Swedish third-party motor insurance data for 1977 and derive a two-dimensional conditional distribution for (Pc, Ca). A prediction region for (Pc, Ca) is next derived from the two-dimensional conditional distribution. From the Swedish data for 1977 we also find prediction intervals for the claim size per insured using the two-dimensional conditional distribution. Finally, Section 5 concludes the paper.

Evaluation of two-dimensional conditional distribution
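Yeo and Johnson [17] introduced the following power transformation of the standard normal random variable $z$:
$$\tilde{\varepsilon} = \begin{cases} [(z+1)^{\lambda^{+}} - 1]/\lambda^{+}, & z \ge 0,\ \lambda^{+} \ne 0, \\ \log(z+1), & z \ge 0,\ \lambda^{+} = 0, \\ -[(-z+1)^{\lambda^{-}} - 1]/\lambda^{-}, & z < 0,\ \lambda^{-} \ne 0, \\ -\log(-z+1), & z < 0,\ \lambda^{-} = 0. \end{cases} \tag{1}$$
In Equation (1), the variable $\tilde{\varepsilon}$ is said to have a power-normal distribution with parameters $\lambda^{+}$ and $\lambda^{-}$. Suppose $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k)$ is a vector of random variables and the $i$-th variable is given by
$$\varepsilon_i = \sigma_i\,[\tilde{\varepsilon}_i - E(\tilde{\varepsilon}_i)]\,/\,[\mathrm{var}(\tilde{\varepsilon}_i)]^{1/2},$$
where $\sigma_i > 0$ is a constant and $\tilde{\varepsilon}_i$ has a power-normal distribution with parameters $\lambda_i^{+}$ and $\lambda_i^{-}$. Furthermore let $\mu$ be a $k \times 1$ vector of constants and $H$ a $k \times k$ orthogonal matrix. Then $y = \mu + H\varepsilon$ is said to have a $k$-dimensional power-normal distribution with parameters $\mu$, $H$, $\lambda_i^{+}$, $\lambda_i^{-}$, $\sigma_i$, $1 \le i \le k$.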
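As an illustration of a two-parameter power transformation of this kind, the following Python sketch assumes a Yeo-Johnson-type form with separate exponents for non-negative and negative arguments. The function name `power_normal_transform` and the argument names `lam_pos` and `lam_neg` are hypothetical and do not come from the paper.

```python
import math

def power_normal_transform(z, lam_pos, lam_neg):
    """Map a standard normal value z to a power-normal value.

    Assumed form (not confirmed by the source): a Yeo-Johnson-type
    transformation with exponent lam_pos for z >= 0 and lam_neg for z < 0.
    """
    if z >= 0:
        if lam_pos == 0:
            return math.log1p(z)                      # limit as lam_pos -> 0
        return ((z + 1.0) ** lam_pos - 1.0) / lam_pos
    if lam_neg == 0:
        return -math.log1p(-z)                        # limit as lam_neg -> 0
    return -((1.0 - z) ** lam_neg - 1.0) / lam_neg

# With both exponents equal to 1 the transformation is the identity,
# so the power-normal variable reduces to the standard normal itself.
print(power_normal_transform(0.7, 1.0, 1.0))
print(power_normal_transform(-0.7, 1.0, 1.0))
```

Values of the exponents other than 1 stretch or compress each tail separately, which is what allows the resulting distribution to be skewed or heavy-tailed.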
When the values of the initial $k-2$ components of $y$ are given, it is possible to find numerically the conditional distribution of the last two components. The required procedure is as follows:
…
(2) Find the matrix formed by the eigenvectors of …
…
(6) When $y_1, y_2, \ldots, y_{k-2}$ are given, the conditional joint pdf of $(y_{k-1}, y_k)$ evaluated at … is then given by …
A nominally $100(1-\alpha)\%$ prediction region for $(y_{k-1}, y_k)$ may then be expressed as … where $\chi^2_{2,\alpha}$ is the $(1-\alpha)$-quantile of a chi-square distribution with two degrees of freedom and … is given by $\tilde{\varepsilon}$ in Equation (1) with $z$, $\lambda^{+}$ and $\lambda^{-}$ changed respectively to …, … and ….
The conditional joint pdf of $(y_{k-1}, y_k)$ also provides an alternative method for finding a prediction interval for the variable $y^{*} = y_{k-1} \times y_k$ given by the product of $y_{k-1}$ and $y_k$. The alternative method is as follows:
(a) Find …
… where $\sigma^{*} = [\mathrm{var}(y^{*})]^{1/2}$ and $\tilde{\varepsilon}^{*}$ has a power-normal distribution of which the parameters $\lambda^{*+}$ and $\lambda^{*-}$ are chosen such that the first four moments of $y^{*}$ are equal to those in (a).
A nominally $100(1-\alpha)\%$ prediction interval for $y^{*}$ is then given by … where $z_{\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution and … is given by $\tilde{\varepsilon}$ in Equation (1) with $z$, $\lambda^{+}$ and $\lambda^{-}$ changed respectively to …, $\lambda^{*+}$ and $\lambda^{*-}$.
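The moment-matching step relies on the first four moments of the product of the last two components. A minimal Python sketch of how such moments might be approximated from a joint pdf is given below. It assumes the pdf can be evaluated on a rectangular grid and uses the midpoint rule; the function name `product_moments`, its arguments and the grid size are hypothetical choices, not the authors' numerical scheme.

```python
def product_moments(pdf, u_range, v_range, n=200):
    """Approximate E[(u*v)**m] for m = 1..4 under a joint pdf of (u, v).

    Uses the midpoint rule on an n-by-n grid over u_range x v_range and
    renormalises by the total mass found on the grid.
    """
    (u_lo, u_hi), (v_lo, v_hi) = u_range, v_range
    du = (u_hi - u_lo) / n
    dv = (v_hi - v_lo) / n
    total = 0.0
    raw = [0.0, 0.0, 0.0, 0.0]
    for i in range(n):
        u = u_lo + (i + 0.5) * du
        for j in range(n):
            v = v_lo + (j + 0.5) * dv
            w = pdf(u, v) * du * dv      # probability mass of this cell
            total += w
            p = u * v
            for m in range(4):
                raw[m] += w * p ** (m + 1)
    return [r / total for r in raw]

# Toy check with a uniform density on the unit square, for which
# E[(u*v)**m] = 1 / (m + 1)**2.
moments = product_moments(lambda u, v: 1.0, (0.0, 1.0), (0.0, 1.0))
print(moments)
```

In practice the grid would have to cover the region where the conditional pdf is non-negligible, and the four moments returned would then be the targets for choosing the two power-normal parameters and the scale.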
Under the category specified by (K, Z, B, M), the estimated probability that an insured will make a claim during the policy year may be approximated by Pc = (number of claims)/(number of insured in policy-years), while the average claim size may be estimated by Ca = (total payment)/(number of claims). As the probability is defined over an interval of one year, we may also refer to this probability as the average claim rate.
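The two ratios can be illustrated with a short Python sketch. It assumes that each category of vehicles records the number of insured in policy-years, the number of claims and the total payment, which are the usual fields of this data set; the function and argument names are hypothetical.

```python
def claim_rate_and_size(insured, claims, payment):
    """Return (Pc, Ca) for one category of vehicles.

    insured : number of insured in policy-years (assumed field)
    claims  : number of claims (assumed field)
    payment : total payment for the claims (assumed field)
    Ca is returned as None when the category has no claims.
    """
    if insured <= 0:
        raise ValueError("insured must be positive")
    pc = claims / insured
    ca = payment / claims if claims > 0 else None
    return pc, ca

# Illustrative numbers only, not taken from the Swedish data.
print(claim_rate_and_size(insured=100.0, claims=10, payment=30000.0))
```

Categories without any claim leave the average claim size undefined, so a real analysis would need a rule for handling them before fitting the distribution.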
By following the method in Section 3, the observed values of the vector y = (K, Z1, Z2, ..., Z6, B, M1, M2, ..., M8, Pc, Ca) may be used to construct a prediction region for the average claim rate and average claim size.
Based on the 2,183 observed values of y, an 18-dimensional power-normal distribution for y is formed. When the value of X = (K, Z1, Z2, ..., Z6, B, M1, M2, ..., M8) is given by the i-th row of the observed values of X, a nominally 95% in-sample prediction region for (Pc, Ca) is found by using the method in Section 3. We refer to the prediction region which corresponds to the i-th row of the observed values of X as the i-th prediction region. The probability that the prediction region will cover the observed value of (Pc, Ca) is called the coverage probability of the prediction region. Among the first 100 prediction regions, 90 cover the observed (Pc, Ca). Thus, the estimated coverage probability of the prediction region is 0.90, which is not too far from the target value of 0.95. Figs. 1 and 2 show two examples of the prediction region for (Pc, Ca). The prediction region in Fig. 1 shows that the average claim rate is likely to lie between 0.09 and 0.43. Furthermore, when the average claim rate is large, the average claim size is likely to be around 3,000 with a relatively small range of variation, but when the average claim rate becomes smaller, the range of variation of the average claim size becomes larger. The prediction region in Fig. 2 shows that when the no claims bonus B is changed from 1 to 4 while the values of K, Z and M remain unchanged, the average claim rate tends to be smaller while the average claim size appears to be about the same as before.
Thus, when the values of K, Z, B and M are given, the prediction region gives an idea of the possible ranges of the average claim rate and average claim size.
By using the method given in Section 3, a nominally 95% prediction interval for the claim amount per insured can be found when the values of K, Z, B and M are given. The lower and upper limits of the first 100 in-sample prediction intervals are shown in Fig. 3. The estimated coverage probability and average length found by using these 100 in-sample prediction intervals are 0.98 and 1,318.17 respectively. Fig. 4 shows the corresponding 100 in-sample prediction intervals for claim amount per insured given in Pan et al. [10]. The estimated coverage probability and average length based on the prediction intervals in Fig. 4 are 0.99 and 1,476.08 respectively.
Thus, compared to the 100 prediction intervals presented in this paper, those given in Pan et al. [10] have comparable estimated coverage probability, but longer average length.
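The two summary figures used in this comparison, estimated coverage probability and average length, can be computed as in the following Python sketch. The function name `interval_summary` and the toy numbers are hypothetical; the sketch only shows the bookkeeping, not how the intervals themselves are obtained.

```python
def interval_summary(intervals, observed):
    """Return (estimated coverage probability, average length).

    intervals : list of (lower, upper) prediction limits
    observed  : list of observed values, one per interval
    """
    if len(intervals) != len(observed) or not intervals:
        raise ValueError("need one observed value per interval")
    covered = sum(1 for (lo, hi), y in zip(intervals, observed) if lo <= y <= hi)
    coverage = covered / len(intervals)
    avg_length = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return coverage, avg_length

# Toy example: three of the four intervals contain the observed value.
toy_intervals = [(0.0, 10.0), (5.0, 15.0), (0.0, 2.0), (1.0, 3.0)]
toy_observed = [5.0, 20.0, 1.0, 2.0]
print(interval_summary(toy_intervals, toy_observed))
```

With only 100 intervals the estimated coverage has a coarse resolution of 0.01, so differences such as 0.98 against 0.99 correspond to a single interval.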
To investigate the performance of out-of-sample prediction regions and intervals, we initially form a table of 2,183 rows with the values of its $i$-th row denoted by $y^{(i)}$. Next we choose a particular value $i^{*}$ of $i$, treat the initial 16 components $y_1^{(i^{*})}, y_2^{(i^{*})}, \ldots, y_{16}^{(i^{*})}$ of $y^{(i^{*})}$ as the given values, and wish to predict $(y_{17}^{(i^{*})}, y_{18}^{(i^{*})})$. We choose 200 rows from the remaining 2,180 rows in the table such that the $j$-th chosen row has the $j$-th smallest value among the distances computed using the remaining 2,180 rows.
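The row-selection step can be sketched in Python as a nearest-neighbour search. The sketch assumes that the distance is the Euclidean distance between the first 16 components of a row and those of the target row, which the text does not specify. It also drops only the target row from the candidates, since the text does not say which other rows are excluded. The function name `nearest_rows` and its arguments are hypothetical.

```python
import math

def nearest_rows(table, target_index, n_given=16, n_rows=200):
    """Return indices of the n_rows rows closest to the target row.

    Distance is Euclidean over the first n_given components (an assumed
    choice). The returned indices are ordered from nearest to farthest.
    """
    target = table[target_index][:n_given]
    candidates = [
        (math.dist(row[:n_given], target), idx)
        for idx, row in enumerate(table)
        if idx != target_index
    ]
    candidates.sort()                     # smallest distance first
    return [idx for _, idx in candidates[:n_rows]]

# Toy table with two given components and one response component.
toy = [[0, 0, 9], [1, 0, 9], [5, 5, 9], [0, 2, 9]]
print(nearest_rows(toy, target_index=0, n_given=2, n_rows=2))
```

Because many of the 16 given components are binary indicators, the scale of K and B relative to those indicators would influence which rows count as near under this assumed distance.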
The chosen 200 rows may be used as the data for fitting an 18-dimensional power-normal distribution. The methods in Sections 2 and 3 may next be used to construct the out-of-sample prediction region for $(y_{17}^{(i^{*})}, y_{18}^{(i^{*})})$ (or interval for $y_{17}^{(i^{*})} \times y_{18}^{(i^{*})}$). By choosing $i^{*} = 1, 2, \ldots, 100$, we can obtain 100 out-of-sample prediction regions and another 100 out-of-sample prediction intervals. The estimated coverage probabilities of the prediction regions and intervals are found to be 0.85 and 0.96 respectively, while the average length of the prediction intervals is 1,083.61. Fig. 5 shows that the 100 out-of-sample prediction intervals tend to have shorter lengths than those given in Figs

Concluding remarks
Given the profile of a customer, we are interested in the probability that the customer will make a claim (Pc), and the size of the claim (Ca). As the prediction region for (Pc, Ca) provides a set of likely values of (Pc, Ca), it may be used to perform the customer risk assessment. The method in this paper may also be used when the profile of a customer includes the data collected via smart sensors on the driving behaviour of the customer.