Chapter 4 Model Selection and Estimation
Chapter Description
Chapters 2 and 3 described how to fit parametric models to frequency and severity data, respectively. This chapter begins with the selection of models. To compare alternative parametric models, it is helpful to summarize data without reference to a specific parametric distribution. Section 4.1 describes nonparametric estimation, how we can use it for model comparisons, and how it can provide starting values for parametric procedures. The process of model selection is then summarized in Sections 4.2 and 4.3. Although our focus is on data from continuous distributions, the same process can be used for discrete distributions or data that come from a hybrid combination of discrete and continuous distributions.
Model selection and estimation are fundamental aspects of statistical modeling. To provide a flavor of how they can be adapted to alternative sampling schemes, Sections 4.4 and 4.5 describe estimation for grouped, censored, and truncated data. To see how they can be adapted to alternative models, the chapter closes with Section 4.6 on Bayesian inference, an alternative procedure in which the (typically unknown) parameters are treated as random variables.
- Although not needed to go through the tutorials, some users may wish to download the overheads that the videos are based on.
- By watching the videos and working through the tutorial exercises, you will get an appreciation for model selection and estimation. For a deeper dive, see the corresponding chapter in the textbook, Chapter Four of Loss Data Analytics.
4.1 Nonparametric Inference
In this section, you learn how to:
- Estimate moments, quantiles, and distributions without reference to a parametric distribution.
Video: Nonparametric Estimation Tools
Overheads: Nonparametric Estimation Tools (Click Tab to View)
4.1.1 Exercise. Empirical Distribution Function and Quartiles, Percentiles and Quantiles
Assignment Text
In this tutorial, you will find the empirical (cumulative) distribution function and quantiles for the losses. The Wisconsin Property Fund data has already been read into a data frame called Insample. These data consist of claim experience for fund members over the years 2006-2010, inclusive. The data frame includes the claim year Year and the claim loss y.
Instructions
- Use the function ecdf() to create the empirical (cumulative) distribution for the loss amount y and plot it.
- Use the function quantile() to produce sample quantiles corresponding to the given percentiles for each year. Which year has the largest 90th percentile? (A sketch of both steps follows this list.)
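The following is a minimal sketch of these two steps, assuming the data frame Insample with columns Year and y as described above; the percentile choices below are illustrative only and may differ from the prepared exercise code.

```r
# Empirical cdf of the loss amounts, and a plot of it
Fn <- ecdf(Insample$y)
plot(Fn, xlab = "Loss amount y", main = "Empirical distribution function")

# Sample quantiles of loss by claim year (illustrative percentiles)
by(Insample$y, Insample$Year,
   function(x) quantile(x, probs = c(0.50, 0.80, 0.90, 0.95)))
```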
4.1.2 Exercise. Density Estimators
Assignment Text
Nonparametric density estimators, such as the kernel estimator, are regularly used in practice. In this tutorial, we will use the Wisconsin Property Fund data, which has already been read into a data frame called Insample, to plot a histogram and overlay density curves based on various density estimators.
Instructions
- Plot a histogram of the logarithmic loss amount y.
- Use the function density() to compute kernel density estimates.
- Overlay density curves on top of the histogram (a sketch follows this list):
  - a thick blue curve for a Gaussian kernel density with the bandwidth selected automatically using a rule of thumb based on the sample size and variability of the data,
  - a thick green curve for a Gaussian kernel density with a bandwidth equal to 1,
  - a thick red curve for a triangular kernel density with a bandwidth of 0.1.
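Here is a minimal sketch of the overlay, assuming the raw losses are in Insample$y (restricted to positive values before taking logs); if the data frame already contains a log-loss variable, use that instead.

```r
logy <- log(Insample$y[Insample$y > 0])   # work on the log scale

hist(logy, freq = FALSE, xlab = "log(y)", main = "Histogram of log losses")

# Gaussian kernel, bandwidth chosen by the default rule of thumb
lines(density(logy), col = "blue", lwd = 3)
# Gaussian kernel, bandwidth fixed at 1
lines(density(logy, bw = 1), col = "green", lwd = 3)
# Triangular kernel, bandwidth fixed at 0.1
lines(density(logy, kernel = "triangular", bw = 0.1), col = "red", lwd = 3)
```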
4.1.3 Exercise. Comparing Poisson and Negative Binomial Distributions
In this tutorial, you will model claim count data and decide whether a Poisson\((\lambda)\) or a negative binomial\((r=3, \beta)\) distribution better fits the data. Because of the number of steps involved, we split this tutorial into several distinct stages.
Assignment Text
The raw data are the number of claims filed by \(n=100\) policyholders. Recall that claim count data are discrete and take values \(0, 1, 2, \dots\). So, we consider two discrete probability models. You will apply tools learned in Section 4.1 to make a decision on which model fits these data better.
We start with exploratory data analysis, making graphical and numeric summaries of data before fitting a model.
Instructions
Observe that the distribution is heavily right-skewed, and the sample variance is much larger than the sample mean. You may recall that if \(X \sim\) Poisson\((\lambda)\), \[ \mathrm{E}(X) = \mathrm{Var}(X) = \lambda. \] We would expect the sample mean and sample variance to be roughly equal if the claims data came from a Poisson distribution; this is our first clue that the Poisson may not be the best fit!
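As a sketch of these numerical summaries, assuming the claim counts are stored in a vector named counts (a hypothetical name):

```r
table(counts)     # frequency table of the observed claim counts
mean(counts)      # sample mean
var(counts)       # sample variance; here it is much larger than the mean
barplot(table(counts), xlab = "Number of claims", ylab = "Frequency")
```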
Instructions
- Fit the Poisson distribution by estimating \(\lambda\) using maximum likelihood estimation.
- Make a \(qq\) plot comparing the observed data to the fitted Poisson.
Fitting the distribution involves estimating the parameter \(\lambda\). Do you recall the maximum likelihood estimator for a Poisson? If not, this is shown in the companion video for this tutorial.
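A minimal sketch of the fit and the \(qq\) plot, again assuming the counts are in a hypothetical vector counts; recall that the Poisson maximum likelihood estimator of \(\lambda\) is the sample mean.

```r
lambda_hat <- mean(counts)           # Poisson MLE of lambda
probs <- ppoints(length(counts))     # plotting positions

plot(qpois(probs, lambda = lambda_hat), sort(counts),
     xlab = "Fitted Poisson quantiles", ylab = "Empirical quantiles",
     main = "qq plot, Poisson fit")
abline(0, 1)                         # reference line y = x
```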
Do you notice how the empirical quantiles do not match the model for some of the larger values in the right tail? The empirical quantiles are too large, which is why they lie above the diagonal line \(y=x\). Our claims data are too right-skewed for the Poisson, which is why the \(qq\) plot looks off for the larger quantiles. Might the negative binomial be more promising?
Instructions
- Fit the negative binomial by using the method of moments estimator to estimate \(\beta\).
- Make a \(qq\) plot comparing the observed data to the fitted negative binomial.
To fit the Negative Binomial(\(r=3, \beta\)), we need to estimate the parameter \(\beta\) using the method of moments. Recall that the probability mass function for a negative binomial (Sec 2.2.3.3 of Loss Data Analytics) is \[ \Pr(X = k) = \frac{(k+r-1)!}{k!(r-1)!}\left(\frac{1}{1+\beta}\right)^r \left(\frac{\beta}{1+\beta} \right)^k, \hspace{5mm} k=0, 1, ... \] with parameters \(r\) and \(\beta\). Also recall that in the method of moments we equate the \(t^{th}\) sample moment and \(t^{th}\) model moment, \[ \mathrm{E}(X^t) = \frac{1}{n}\sum_{i=1}^{n}x_{i}^t, \] where \(t=1, 2, ...\). If we use \(t=1\) to equate the first moments, we get the expression \[ \mathrm{E}(X) = r\beta = 3\beta = \overline{x} = 3 \] Solving the above for \(\beta\), we see \(\hat{\beta} = 1\). Therefore the Negative Binomial(\(r=3, \beta=1\)) is the fitted model.
Next, we use R to fit this model and make a \(qq\) plot. It is important to understand that distributions are not uniquely parameterized, and there can be different conventions across fields, textbooks, and software. Indeed, R's built-in negative binomial functions define the distribution as
\[
\Pr(X = k) = \frac{(k+n-1)!}{(n-1)!k!}p^n(1-p)^k, \hspace{5mm} k=0, 1, ...
\]
This matches the definition in the textbook upon equating the probability \(p = \frac{1}{1+\beta}\) and the size \(n = r\). It is very important to carefully consider the parameterization of any distribution you use in R.
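A sketch of the negative binomial fit and \(qq\) plot under this parameterization, assuming the same hypothetical counts vector; note the conversion from \(\beta\) to R's prob argument.

```r
r <- 3
beta_hat <- mean(counts) / r            # method of moments: E(X) = r * beta
p_hat <- 1 / (1 + beta_hat)             # convert to R's 'prob' parameterization
probs <- ppoints(length(counts))

plot(qnbinom(probs, size = r, prob = p_hat), sort(counts),
     xlab = "Fitted negative binomial quantiles", ylab = "Empirical quantiles",
     main = "qq plot, negative binomial fit")
abline(0, 1)
```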
4.2 Tools for Model Selection
In this section, you learn how to:
- Summarize the data graphically without reference to a parametric distribution.
- Determine measures that summarize deviations of a parametric fit from a nonparametric fit.
- Use nonparametric estimators to approximate parameters that can be used to start a parametric estimation procedure.
Video: Tools for Model Selection
Overheads: Tools for Model Selection (Click Tab to View)
4.2.1 Exercise. Probability-Probability (\(pp\)) plot
Assignment Text
Probability-probability (\(pp\)) plots can be used to corroborate the selection of parametric models. The Wisconsin Property Fund data has already been read into a data frame called Insample, and we have created a subset containing the positive loss data, called positive_loss.

In this tutorial, you will generate \(pp\) plots for the Gaussian and gamma distributions. Based on the \(pp\) plots, which fitted parametric model is closer to the empirical distribution?
Instructions
- Use the function glm(response~1,data=,family=Gamma(link=log)) to fit a generalized linear model (glm) with the gamma family.
- Find the shape and rate parameters from the fitted gamma glm. (No action required on your part.)
- Use the function glm(response~1,data=,family=gaussian(link=log)) to fit a generalized linear model (glm) with the Gaussian distribution.
- Find the mean and standard deviation from the fitted Gaussian glm. (No action required on your part.)
- Use the function ecdf() to create the empirical distribution for the positive losses.
- Use the function pgamma() to create the distribution function for the gamma distribution. Generate a \(pp\) plot for the positive loss amount y using the gamma distribution.
- Use the function pnorm() to create the distribution function for the Gaussian distribution. Generate a \(pp\) plot for the positive loss amount y using the Gaussian distribution. (A sketch of these steps follows this list.)
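Here is a minimal sketch of the workflow, assuming positive_loss has a column y; the way the gamma shape and the Gaussian standard deviation are recovered from the glm output below is one common choice and may differ from the prepared exercise code.

```r
# Gamma glm (intercept only, log link); recover shape and rate
fit_gamma <- glm(y ~ 1, data = positive_loss, family = Gamma(link = log))
mu_gamma  <- exp(coef(fit_gamma))                # fitted mean
shape     <- 1 / summary(fit_gamma)$dispersion   # shape = 1 / dispersion
rate      <- shape / mu_gamma                    # rate = shape / mean

# Gaussian glm (intercept only, log link); recover mean and standard deviation
fit_norm <- glm(y ~ 1, data = positive_loss, family = gaussian(link = log))
mu_norm  <- exp(coef(fit_norm))
sd_norm  <- sqrt(summary(fit_norm)$dispersion)

# pp plots: fitted probabilities versus empirical probabilities
Fn <- ecdf(positive_loss$y)
ys <- sort(positive_loss$y)
plot(pgamma(ys, shape = shape, rate = rate), Fn(ys),
     xlab = "Fitted gamma probabilities", ylab = "Empirical probabilities")
abline(0, 1)
plot(pnorm(ys, mean = mu_norm, sd = sd_norm), Fn(ys),
     xlab = "Fitted Gaussian probabilities", ylab = "Empirical probabilities")
abline(0, 1)
```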
Editorial Note. This exercise uses the glm() function to produce maximum likelihood estimates for certain classes of distributions, including the gamma and Gaussian (normal). Section 3.5 of Loss Data Analytics employs a comparable function, vglm(). These functions are even more useful when employed in regression modeling contexts. Although regression modeling is outside the scope of this course, the exercise provides an initial exposure to functions that are very useful in actuarial practice.
4.2.2 Exercise. Quantile-Quantile (\(qq\)) plot
Assignment Text
Because the \(pp\) plot compares cumulative distribution functions, it can sometimes be difficult to detect where a fitted parametric distribution is deficient. As an alternative, it is common to use a quantile-quantile (\(qq\)) plot. The Wisconsin Property Fund data has already been read into a data frame called Insample, and we have also created a subset containing only the positive loss data, called positive_loss. In this exercise, parameter estimates are determined by the method of moments, in contrast to the maximum likelihood approach (with the glm function) taken in Exercise 4.2.1. This approach is simpler and demonstrates an alternative method. Recall that we used the method of moments for the gamma distribution in Exercise 3.2.2.

In this tutorial, you will generate \(qq\) plots for the Gaussian and gamma distributions. Based on the \(qq\) plots, which fitted parametric model is closer to the empirical distribution?
Instructions
- Find the sample size, mean, variance, and standard deviation for the positive losses.
- Determine the method of moments estimators for the gamma distribution.
- Calculate p_value, a list of probabilities used for plotting quantiles.
- Use the function qnorm() to create the quantiles for the Gaussian distribution. Generate a \(qq\) plot for the positive loss amount y using the Gaussian distribution.
- Use the function qgamma() to create the quantiles for the gamma distribution. Generate a \(qq\) plot for the positive loss amount y using the gamma distribution. (A sketch of these steps follows this list.)
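A minimal sketch, assuming positive_loss$y as before; the method of moments matches the sample mean and variance to their gamma counterparts, mean \(=\) shape/rate and variance \(=\) shape/rate\(^2\).

```r
y <- positive_loss$y
n <- length(y)
m <- mean(y)
v <- var(y)

# Method of moments for the gamma: mean = shape/rate, variance = shape/rate^2
shape_mm <- m^2 / v
rate_mm  <- m / v

p_value <- ppoints(n)    # probabilities used for plotting quantiles

# Gaussian qq plot
plot(qnorm(p_value, mean = m, sd = sqrt(v)), quantile(y, p_value),
     xlab = "Fitted Gaussian quantiles", ylab = "Empirical quantiles")
abline(0, 1)
# Gamma qq plot
plot(qgamma(p_value, shape = shape_mm, rate = rate_mm), quantile(y, p_value),
     xlab = "Fitted gamma quantiles", ylab = "Empirical quantiles")
abline(0, 1)
```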
4.2.3 Exercise. Kolmogorov-Smirnov statistic
Assignment Text
For reporting results, it can be effective to supplement the graphical displays with selected statistics that summarize model goodness of fit. For background, you can learn more about the Kolmogorov-Smirnov statistic and other goodness of fit statistics in Section 4.1.2 of Loss Data Analytics.
The Wisconsin Property Fund data has already been read into a data frame called Insample. Previously, we also created a subset, positive_loss, which includes only the positive losses. Perform Kolmogorov-Smirnov tests.
Instructions
- Compare the empirical distribution of the positive loss amount y to the fitted gamma distribution by performing the Kolmogorov-Smirnov test.
- Use the function ks.test() to perform the Kolmogorov-Smirnov test (a sketch follows this list).
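A one-line sketch, assuming the method of moments gamma parameters shape_mm and rate_mm from Exercise 4.2.2 (hypothetical names):

```r
# Kolmogorov-Smirnov test of the positive losses against the fitted gamma
ks.test(positive_loss$y, "pgamma", shape = shape_mm, rate = rate_mm)
```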
4.3 Model Selection: Likelihood Ratio Tests and Goodness of Fit
Video: Likelihood Ratio Tests and Goodness of Fit
Overheads: Likelihood Ratio Tests and Goodness of Fit (Click Tab to View)
4.4 Estimation using Modified Data: Nonparametric Approach
In this section, you learn how to:
- Describe grouped, censored, and truncated data.
- Estimate distributions nonparametrically based on grouped and censored data.
Video: Nonparametric Estimation using Modified Data
Overheads: Nonparametric Estimation using Modified Data (Click Tab to View)
4.4.1 Exercise. Kaplan-Meier Estimation based on Censored Data
Assignment Text
This exercise is based on the Wisconsin Property Fund data. These data consist of claim experience for fund members over the years 2006-2010, inclusive. A subset that includes only the positive loss data has been read into a data frame called positive_loss. In part because the units of observation are governmental units, there are no upper limits on these policies. So, simply to illustrate coding for censored data, in this exercise we impose an upper limit of 50,000 on the first 100 claims and 100,000 on the remainder. (As an analyst, this type of exercise can be useful when you test sensitivity to policy limits.)

After adding censoring information to the data, we use the Kaplan-Meier technique to estimate the claims survival distribution (recall that the survival function is one minus the distribution function).
Instructions
- Introduce policy limits as part of the positive_loss data.
- Create a new variable, AmountPaid, that is the smaller of the policy limit and the claim amount. In the following sample code, we show how to use the dplyr data wrangling package.
- Also create a binary variable, UnCensored, that is one if the claim is less than or equal to the policy limit, and zero otherwise.
- To determine the Kaplan-Meier product limit estimator, use survfit() to estimate the survival function for the right-censored empirical distribution of the positive loss amount y. Plot the result.
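A minimal sketch of these steps, assuming positive_loss has a column y and using the dplyr and survival packages; the variable name Limit is introduced here for illustration and may differ from the prepared exercise code.

```r
library(dplyr)
library(survival)

positive_loss <- positive_loss %>%
  mutate(
    Limit      = ifelse(row_number() <= 100, 50000, 100000),  # illustrative policy limits
    AmountPaid = pmin(y, Limit),                              # smaller of limit and claim
    UnCensored = as.numeric(y <= Limit)                       # 1 = fully observed, 0 = censored
  )

# Kaplan-Meier product limit estimate of the survival function
km_fit <- survfit(Surv(AmountPaid, UnCensored) ~ 1, data = positive_loss)
plot(km_fit, xlab = "Claim amount", ylab = "Survival probability")
```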
4.4.2 Exercise. Nonparametric Estimation using Truncated and Censored Information
As described in Section 4.3.2.2 of Loss Data Analytics, it is very common to have data subject to both deductibles and upper limits. As emphasized in the previous exercise, upper limits typically mean that the data are censored from the right. In the presence of deductibles, the data are typically truncated from the left (that is, we usually do not see a claim less than the deductible). We make that assumption for this exercise, using the data from Exercise 4.4.1, where the deductible variable is called Deduct.
With these modified data, we again estimate the survival distribution. The following code guides you through the estimation using both the Kaplan-Meier product limit and the Nelson-Aalen estimators. These methods are described in more detail in Section 4.3.2.2 of Loss Data Analytics.
Instructions
- Calculate and plot the KM estimate of the survival function without deductible information, that is, as in the prior exercise.
- Calculate and plot the KM estimate with deductible information.
- Calculate and plot the Nelson-Aalen estimate of the survival function without deductible information.
- Calculate and plot the Nelson-Aalen estimate with deductible information. (A sketch of all four estimates follows this list.)
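A sketch of the four estimates, assuming the variables AmountPaid, UnCensored, and Deduct from Exercise 4.4.1; the counting-process form of Surv() handles the left truncation, and the stype/ctype arguments of survfit() are one way to request a Nelson-Aalen based estimate.

```r
library(survival)

# Kaplan-Meier, ignoring deductibles (right censoring only)
km_nodeduct <- survfit(Surv(AmountPaid, UnCensored) ~ 1, data = positive_loss)
# Kaplan-Meier with deductibles treated as left truncation
km_deduct   <- survfit(Surv(Deduct, AmountPaid, UnCensored) ~ 1, data = positive_loss)

# Nelson-Aalen based survival estimates, S(t) = exp(-cumulative hazard)
na_nodeduct <- survfit(Surv(AmountPaid, UnCensored) ~ 1, data = positive_loss,
                       stype = 2, ctype = 1)
na_deduct   <- survfit(Surv(Deduct, AmountPaid, UnCensored) ~ 1, data = positive_loss,
                       stype = 2, ctype = 1)

plot(km_deduct, xlab = "Claim amount", ylab = "Survival probability")
lines(na_deduct, col = "red")
```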
4.5 Estimation using Modified Data: Parametric Approach
In this section, you learn how to:
- Describe grouped, censored, and truncated data.
- Estimate parametric distributions based on grouped, censored, and truncated data.
Video: Parametric Estimation using Modified Data
Overheads: Parametric Estimation using Modified Data (Click Tab to View)
4.6 Bayesian Inference
In this section, you learn how to:
- Describe the Bayesian model as an alternative to the frequentist approach and summarize the five components of this modeling approach.
- Summarize posterior distributions of parameters and use these posterior distributions to predict new outcomes.
- Use conjugate distributions to determine posterior distributions of parameters.
Video: Bayesian Inference
Overheads: Bayesian Inference (Click Tab to View)
4.6.1 Visualizing Bayesian Methods
Assignment Text
In this tutorial, we will visually explore the prior and posterior distributions from Example 4.4.1 of Loss Data Analytics, and also explore what would happen if we observed one additional data point. Recall that for this problem, the prior distribution was \[ \pi(q) = q^3/0.07, \hspace{5mm} 0.6 \leq q \leq 0.8. \]

Two observations from a Bernoulli(\(q\)) distribution were \(X_1 = 1, X_2 = 0\), and the posterior was found (Example 4.4.1) to be \[ \pi(q \mid 1,0) = \frac{q^4 - q^5}{0.014069}. \]
Instructions
Notice that the prior distribution sets the support for \(q\), which is bounded between 0.6 and 0.8. The posterior will necessarily have this same support. Notice also that the prior distribution places more mass on larger values of \(q\). This says that, prior to observing data, we think parameter values near 0.8 are more likely than smaller values near 0.6.
The observed data have a claim in year 1 (\(X_1 = 1\)) and no claim in year 2 (\(X_2 = 0\)). So, we have two independent samples drawn from a Bernoulli(\(q\)) distribution, and \(\overline{X} = 0.5\). If we were using only the likelihood and a method such as maximum likelihood estimation, we would estimate \(\widehat{q} = \overline{X} = 0.5\). So these data suggest that values near \(q=0.5\) are most consistent with what we observed. But Bayesian methods blend information from the data with information from the prior to produce a posterior. So, expect our posterior distribution to place more mass towards smaller values of \(q\) and less towards larger values. Let's see if that is true.
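A minimal sketch that plots the prior and the posterior over the support \([0.6, 0.8]\), using the formulas above:

```r
q <- seq(0.6, 0.8, length.out = 200)
prior     <- q^3 / 0.07                 # prior density
posterior <- (q^4 - q^5) / 0.014069     # posterior after observing (1, 0)

plot(q, prior, type = "l", col = "blue", lwd = 2,
     ylim = range(prior, posterior), xlab = "q", ylab = "Density")
lines(q, posterior, col = "red", lwd = 2)
legend("topleft", legend = c("Prior", "Posterior after (1, 0)"),
       col = c("blue", "red"), lwd = 2)
```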
Indeed, the posterior has shifted the mass towards lower values, balancing what the data tell us and what the prior tells us. Since this is only two data points, the shift is not especially dramatic.
Assignment Extension
Let’s continue with the prior example, but now imagine we next observe the third year, and there was again no claim, \(X_3 = 0\). We will compute the new posterior distribution, and compare it to the posterior in the previous problem.
Instructions
- Make a plot showing the original prior, the posterior from Example 4.4.1, and the posterior for the new case.
The posterior distribution after observing three data points is:
\[ \pi(q \mid 1,0,0) \propto f(1\mid q)f(0\mid q)f(0\mid q)\pi(q) = q(1-q)(1-q)q^3 = q^4 - 2q^5 + q^6 \]
Recall the posterior is proportional to \(q^4 - 2q^5 + q^6\), so that means \[ \pi(q \mid 1,0,0) = c \left(q^4 - 2q^5 + q^6 \right). \]
In order to find the normalizing constant \(c\), we would need to integrate this and make sure the integral is 1:
\[ \begin{array}{ll} 1 &= \int_{q=0.6}^{0.8} c \left(q^4 - 2q^5 + q^6\right)dq\\ &= c \left(\frac{1}{5}q^5 - \frac{2}{6} q^6 +\frac{1}{7} q^7 \right) \rvert_{q=0.6}^{q=0.8}\\ &= c \left( 0.0081 - 0.00400 \right)\\ &= c \left(0.0041 \right)\\ \end{array} \]
Thus, we see the normalizing constant is \(c = \frac{1}{.0041}\). We can now revisit the code and add the new posterior line:
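A sketch of the updated plot, adding the new posterior with the normalizing constant \(c = 1/0.0041\):

```r
q <- seq(0.6, 0.8, length.out = 200)
prior      <- q^3 / 0.07
posterior2 <- (q^4 - q^5) / 0.014069          # posterior after observing (1, 0)
posterior3 <- (q^4 - 2*q^5 + q^6) / 0.0041    # posterior after observing (1, 0, 0)

plot(q, prior, type = "l", col = "blue", lwd = 2,
     ylim = range(prior, posterior2, posterior3), xlab = "q", ylab = "Density")
lines(q, posterior2, col = "red", lwd = 2)
lines(q, posterior3, col = "darkgreen", lwd = 2)
legend("topleft",
       legend = c("Prior", "Posterior after (1, 0)", "Posterior after (1, 0, 0)"),
       col = c("blue", "red", "darkgreen"), lwd = 2)
```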
As expected, the new posterior has shifted even more mass towards lower parameter values in the posterior, since \(\overline{X} = \frac{1}{3}\) and the data are more consistent with low parameter values for \(q\).
A Practicing Actuary’s Perspective
Here is a perspective on Chapter Four from Roosevelt Mosley, a principal with Pinnacle Actuarial Resources.
Chapter Contributors
- Authors. Rob Erhardt, Wake Forest University, Brian Hartman, Brigham Young University, and Zhiyu (Frank) Quan, University of Illinois at Urbana-Champaign, are the principal authors of the initial version of this chapter.
- Chapter Maintainers. Please contact Frank at zquan@illinois.edu and/or Jed at jfrees@bus.wisc.edu for chapter comments and suggested improvements.