Chapter 2 Appendix. Data Resources

This appendix section describes the datasets used in this book and others that you may wish to explore.

For each set of data, we provide download buttons so that you can easily access the data in standard .csv (comma separated value) format. This allows you replicate and experiment with the methods developed in the book as well as sharpen your understanding through exercises.

We provide the source of each dataset. We also recommend, for deeper understanding, that you occasionally refer to these original sources to further develop your appreciation of the data underpinning the analytics developed in this book.

2.1 Wisconsin Property Fund

Description: The Wisconsin Local Government Property Insurance Fund (LGPIF) is an insurance pool administered by the Wisconsin Office of the Insurance Commissioner. The LGPIF was established to provide property insurance for local government entities that include counties, cities, towns, villages, school districts, and library boards. The fund insures local government property such as government buildings, schools, libraries, and motor vehicles. It covers all property losses except those resulting from flood, earthquake, wear and tear, extremes in temperature, mold, war, nuclear reactions, and embezzlement or theft by an employee.

The data are available using this download button:

Table 2.1: Variables in the Wisconsin Property Fund Dataset
Variable Description
PolicyNum Policy number
Year Contract year
Premium Premium
Deduct Deductible
BCcov Coverage for building and contents
Freq Number of claims during the year (frequency)
Fire5 Binary variable to indicate the fire class is below 5
NoClaimCredit Binary variable to indicate no claims in the past two years
EntityType Categorical variable that is one of six types: 1=Village, 2=City,3=County, 4=Misc, 5=School, or Town)
AlarmCredit Categorical variable that is one of four types: (0, 5, 10, or 15) for automatic smoke alarms in main rooms
BCClaim Builing and contents claims
Table 2.2: Wisconsin Property Fund First Five Rows
PolicyNum Year Premium Deduct BCcov Freq Fire5 NoClaimCredit EntityType AlarmCredit BCClaim
120002 2006 9313 1000 22714456 0 1 0 3 1 0.00
120002 2007 8767 1000 25046646 0 1 0 3 1 0.00
120002 2008 7090 1000 20851525 0 1 1 3 1 0.00
120002 2009 8522 1000 21852696 0 1 1 3 1 0.00
120002 2010 7994 1000 23511493 1 1 1 3 1 6838.87
Table 2.2: Wisconsin Property Fund Last Five Rows
PolicyNum Year Premium Deduct BCcov Freq Fire5 NoClaimCredit EntityType AlarmCredit BCClaim
180787 2010 199 5e+02 285000 0 1 1 4 1 0.00
180788 2010 58344 1e+05 416739800 1 1 0 4 1 168304.05
180789 2010 295 5e+02 500988 1 1 0 4 1 1034.33
180790 2010 2077 1e+03 3580665 0 1 0 4 4 0.00
180791 2010 81 5e+02 118800 0 1 0 4 1 0.00

2.2 ANU Corporate Travel Data

Universities purchase corporate travel policies to cover employees and students traveling on official university business for a wide variety of accidents and incidents while away from the campus or primary workplace. This broad coverage includes medical care and evacuation, loss of personal property, extraction for political and weather related reasons, and more. See Frees and Butt (2022) for more information about this coverage.

There are 2107 observations in this dataset. The variable names are described in Table 2.3 and the first and last five observations are in Table 2.4.

Data are available using this button: .

Table 2.3: Variables in the Corporate Travel Dataset
Variable Description
UW Year Underwriting Year
Loss Date Date that the loss occurred
Reported Date Date that the loss was reported
Last Trans Date Last date in which there was a transaction regarding the loss
Paid Loss Cumulative amount paid on the loss
Outstanding Reserve Estimate of the loss amount yet to be paid
Incurred Loss Sum of the amount paid and the estimate of future payments
Status An indicator as to whether the claim has been deemed settled (closed) or not settled (open)
Table 2.4: Corporate Travel Data First Five Rows
UW.Year Loss.Date Reported.Date Last.Trans.Date Paid.Loss Outstanding.Reserve Incurred.Loss Status
2021 19/12/2021 20/12/2021 24/12/2021 10000.00 0 10000.00 Closed
2021 9/4/2022 29/04/2022 30/05/2022 423.08 0 423.08 Closed
2021 2/5/2022 4/5/2022 0.00 500 500.00 Open
2021 5/5/2022 17/05/2022 0.00 562 562.00 Open
2021 30/04/2022 27/05/2022 10/6/2022 1500.00 0 1500.00 Closed
Table 2.4: Corporate Travel Data Last Five Rows
UW.Year Loss.Date Reported.Date Last.Trans.Date Paid.Loss Outstanding.Reserve Incurred.Loss Status
2006 1/11/2006 19/06/2007 0.00 0 0.00 Closed
2006 24/06/2007 26/06/2007 8/1/2008 6278.10 0 6278.10 Closed
2006 4/7/2007 6/7/2007 11/9/2007 114.50 0 114.50 Closed
2006 20/05/2007 26/06/2007 14/07/2007 135.65 0 135.65 Closed
2006 15/02/2007 27/06/2007 14/07/2007 1207.75 0 1207.75 Closed


Source: Frees, Edward and Butt, Adam (2022). “ANU Corporate Travel Insurance Claims 2022”. Australian National University Data Commons. DOI https://doi.org/10.25911/vrdw-9f32.

2.3 ANU Group Personal Accident Data

Group personal accident insurance offers financial protection in case of injury or death resulting from an incident that occurs on the job. Like workers’ compensation, group personal accident offers insurance coverage and liability insurance protection against accidental death or injury. Unlike workers’ compensation, group personal accident covers students and ANU’s voluntary workers. See Frees and Butt (2022) for more information about this coverage.

There are 148 observations in this dataset. The variable names are described in Table 2.5 and the first and last five observations are in Table 2.6.

Data are available using this button: .

Table 2.5: Variables in the Group Personal Accident Dataset
Variable Description
UW Year Underwriting Year
Loss Date Date that the loss occurred
Last Trans Date Last date in which there was a transaction regarding the loss.
Paid Loss Cumulative amount paid on the loss
Outstanding Reserve Estimate of the loss amount yet to be paid
Incurred Loss Sum of the amount paid and the estimate of future payments
Status An indicator as to whether the claim has been deemed settled (closed) or not settled (open)
Table 2.6: Group Personal Accident Data First Five Rows
UW.Year Loss.Date Last.Trans.Date Paid.Loss Outstanding.Reserve Incurred.Loss Status
2021 6/12/2021 3/6/2022 805.0 0.0 805 Closed
2021 15/11/2021 0.0 0.0 0 Closed
2021 15/11/2021 0.0 0.0 0 Closed
2021 22/03/2022 4/5/2022 396.0 0.0 396 Closed
2021 11/4/2022 2/8/2022 740.1 359.9 1100 Open
Table 2.6: Group Personal Accident Data Last Five Rows
UW.Year Loss.Date Last.Trans.Date Paid.Loss Outstanding.Reserve Incurred.Loss Status
2010 6/3/2011 26/07/2011 776.00 0 776.00 Closed
2010 22/07/2011 23/01/2012 4624.54 0 4624.54 Closed
2010 5/6/2011 30/01/2012 1503.65 0 1503.65 Closed
2007 11/1/2008 23/02/2008 0.00 0 0.00 Closed
2007 29/08/2008 0.00 0 0.00 Closed


Source: Frees, Edward and Butt, Adam (2022). “ANU Group Personal Accident Claims 2022”. Australian National University Data Commons. https://doi.org/10.25911/jcfx-zj56.

2.4 ANU Motor Vehicle Data

This policy covers ANU’s vehicles including cars, vans, utilities, and motorcycles. See Frees and Butt (2022) for more information about this coverage.

There are 318 observations in this dataset. The variable names are described in Table 2.7 and the first and last five observations are in Table 2.8.

Data are available using this button: .

Table 2.7: Variables in the Motor Vehicle Dataset
Variable Description
Policy Term Start Date Start date of the contract year in which the loss occurred
Loss Date Date that the loss occurred
Reported Date Date that the loss was reported
Motor Fault Party responsible for the loss
Driver Age Age of the driver
Vehicle Description Type of vehicle
Loss Postcode Postal code where the loss occurred
Excess The deductible applied to the loss
Motor Net Paid Amount paid to the insured (ANU)
Outstanding Estimate Estimate of the loss amount yet to be paid
Motor Net Incurred Sum of the amount paid and the estimate of future payments
Third Party Identified Indicates whether a responsible third party could be identified
Third Party Insured Indicates whether a responsible third party was insured
Table 2.8: Motor Vehicle Data First Five Rows
Policy.Term.Start.Date Loss.Date Reported.Date Motor.Fault Driver.Age Vehicle.Description Loss.Postcode
1/11/2011 6/6/2012 4/10/2012 THIRD PARTY RESPONSIBLE NA FORD TRANSIT VAN 2600
1/11/2011 16/08/2012 14/11/2013 INSURED RESPONSIBLE 39 TOYOTA HIACE 2612
1/11/2011 4/9/2012 17/01/2013 INSURED RESPONSIBLE 52 HYUNDAI IX35 2600
1/11/2011 21/09/2012 28/09/2012 THIRD PARTY RESPONSIBLE 59 HOLDEN COMMODORE 2518
1/11/2011 22/09/2012 12/10/2012 INSURED RESPONSIBLE NA SUBARU FORESTER 2612
Excess Motor.Net.Paid Outstanding.Estimate Motor.Net.Incurred Third.Party.Identified Third.Party.Insured
1000 384.88 0 384.88 IDENTIFIED
1000 901.21 0 901.21
1000 1225.71 0 1225.71
NA 1671.76 0 1671.76 IDENTIFIED NOT INSURED
1000 3418.86 0 3418.86 INSURED
Table 2.8: Motor Vehicle Data Last Five Rows
Policy.Term.Start.Date Loss.Date Reported.Date Motor.Fault Driver.Age Vehicle.Description Loss.Postcode
1/11/2021 4/4/2022 5/4/2022 INSURED RESPONSIBLE 66 VOLKSWAGEN TIGUAN 2604
11/1/2021 11/4/2022 9/5/2022 INSURED RESPONSIBLE 27 TOYOTA HILUX 2540
1/11/2021 11/4/2022 9/5/2022 INSURED RESPONSIBLE 27 TOYOTA HILUX 2540
11/1/2021 15/04/2022 11/7/2022 INSURED RESPONSIBLE 21 TOYOTA HILVX 2601
1/11/2021 18/07/2022 18/07/2022 NO-ONE RESPONSIBLE NA TOYOTA HILUX 2601
Excess Motor.Net.Paid Outstanding.Estimate Motor.Net.Incurred Third.Party.Identified Third.Party.Insured
0 2373.49 1056.00 3429.49
0 210.00 25000.00 25210.00
0 0.00 31927.27 31927.27
0 0.00 2750.00 2750.00
0 0.00 299.00 299.00


Source: Frees, Edward and Butt, Adam (2022). “ANU Motor Vehicle Claims 2022”. Australian National University Data Commons. DOI https://doi.org/10.25911/g7e4-9e46.

2.5 Spanish Personal Insurance Data

This dataset consists of 10,000 insurance private customers of a real portfolio of insurance policy holders in Spain with a motor insurance and a homeowners insurance contract for policy year 2014. The data contain information on each customer, policies and yearly claims by type of contract.

The data are available using this download button:

The description of the data is the following:

Table 2.9: Variable and Description of Spanish Personal Insurance Data
Variable Description
gender 1 for male and 0 for female
Age_client the age of the customer in years
year Policy year. Equals 5 corresponding to 2014.
age_of_car_M the number of years since the vehicle was bought by the customer
Car_power_M the power of the vehicle
Car_2ndDriver_M 1 if the customer has informed the insurance company that a second occasional driver uses the vehicle, and 0 otherwise
num_policiesC the total number of policies held by the same customer in the insurance company
metro_code 1 for urban or metropolitan and 0 for rural
Policy_PaymentMethodA 1 for annual payment and 0 for monthly payment in the motor policy
Policy_PaymentMethodH 1 for annual payment and 0 for monthly payment in the homeowners policy
Insuredcapital_content_re the value of content in homeowners insurance
Insuredcapital_continent_re the value of building in homeowners insurance
appartment 1 if the homeowners insurance correspond to an apartment and 0 otherwise
Client_Seniority the number of years that the customer has been in the company
Retention 1 if the policy is renewed and 0 otherwise
NClaims1 the number of claims in the motor insurance policy for the corresponding year
NClaims2 the number of claims in the homeowners insurance policy for the corresponding year
Claims1 the sum of claims cost in the motor insurance policy for the corresponding year
Claims2 the sum of claims cost in the homeowners insurance policy for the corresponding year
Types 1 when neither an auto nor a home claim, it is equal to 2 when the customer has an auto but not a home claim, it is equal to 3 when the customer does not have not an auto but a home claim and it is equal to 4 when both an auto and a home claim.
PolID Policy Identification Number

All monetary units are expressed in Euros. In motor insurance, only claims at fault are considered.

Table 2.10: Spanish Personal Insurance Data First Five Rows
gender Age_client year age_of_car_M Car_power_M Car_2ndDriver_M num_policiesC metro_code Policy_PaymentMethodA Policy_PaymentMethodH
1 47 5 12 163 0 0 0 1 1
1 52 5 13 80 0 1 0 1 1
0 66 5 7 97 0 1 1 1 1
1 70 5 17 95 0 1 0 1 1
1 67 5 13 110 0 1 0 1 1
Insuredcapital_content_re Insuredcapital_continent_re appartment Client_Seniority Retention NClaims1 NClaims2 Claims1 Claims2 Types PolID
10.189202 12.07322 1 6.581793 1 0 0 0 0.00 1 12476
9.571442 11.44319 0 18.480493 1 0 0 0 0.00 1 29232
9.330216 11.27605 1 15.085558 1 0 0 0 0.00 1 23770
10.484897 11.13662 1 15.523614 1 0 1 0 57.97 3 8228
10.961310 12.34759 0 6.108145 1 0 0 0 0.00 1 37088
Table 2.10: Spanish Personal Insurance Data Last Five Rows
gender Age_client year age_of_car_M Car_power_M Car_2ndDriver_M num_policiesC metro_code Policy_PaymentMethodA Policy_PaymentMethodH
1 66 5 8 143 0 1 0 1 1
1 55 5 18 125 1 1 0 1 1
0 41 5 10 190 0 1 0 1 1
1 50 5 5 140 0 1 0 1 1
1 55 5 12 90 0 1 1 1 1
Insuredcapital_content_re Insuredcapital_continent_re appartment Client_Seniority Retention NClaims1 NClaims2 Claims1 Claims2 Types PolID
10.305182 11.40377 1 19.731691 1 0 0 0 0 1 2967
10.888420 11.07072 1 15.334702 1 0 0 0 0 1 9387
9.224866 11.63272 1 6.006845 1 0 0 0 0 1 36519
9.969163 12.27171 0 8.391513 1 0 0 0 0 1 33276
11.127278 12.73670 0 6.422998 1 0 0 0 0 1 25370

These data were drawn from a larger database of 40,284 insurance private customers. These customers are tracked from 2010 to 2014. Some customers do not renew their policies, so that they do not stay in the sample for five years. For the smaller data, only the 2014 policy year was used and from this, a random sample of 10,000 customers was drawn.

See Frees et al. (2021) for more information about this dataset. The larger database contains 122935 rows and is freely available at:

Source:

Guillen, Montserrat; Bolancé, Catalina; Frees, Edward W.; Valdez, Emiliano A. (2021), “Insurance data for homeowners and motor insurance customers monitored over five years”, Mendeley Data, V1, DOI https://doi.org/10.17632/vfchtm5y7j.1

2.6 ‘R’ Package CASdatasets

The R package CASdatasets provides a convenient way to access many well-known insurance datasets. This package was originally created to support the book Computational Actuarial Science with R, edited by Arthur Charpentier, Charpentier (2014).

To install the package, here is a bit of R code:

install.packages("CASdatasets", repos = "http://cas.uqam.ca/pub/", type="source")
library(CASdatasets)
?CASdatasets
?sgautonb # See the documentation of the Singapore Auto Data
?lossalae # See the documentation of the Loss and Expense Data

Note that this package assumes that you have already installed a few other packages, including xts, sp, and zoo.

To illustrate,

  • in Chapter 4 we use the Singapore data (referred to as sgautonb in the package) and
  • in Chapter 15 we use the loss and expense data (referred to as lossalae in the package).

2.7 Other Data Sources

There exists man other (non-actarial) data sources. First, data can be obtained from university-based researchers who collect primary data. Second, data can be obtained from organizations that are set up for the purpose of releasing secondary data for the general research community. Third, data can be obtained from national and regional statistical institutes that collect data. Finally, companies have corporate data that can be obtained for research purposes.

While it might be difficult to obtain data to address a specific research problem or answer a business question, it is relatively easy to obtain data to test a model or an algorithm for data analysis. In the modern era, readers can obtain datasets from the Internet. The following is a list of some websites to obtain real-world data:

  • UCI Machine Learning Repository. This website (url: http://archive.ics.uci.edu/ml/index.php) maintains more than 400 datasets that can be used to test machine learning algorithms.
  • Kaggle. The Kaggle website (url: https://www.kaggle.com/) include real-world datasets used for data science competitions. Readers can download data from Kaggle by registering an account.
  • DrivenData. DrivenData aims at bringing cutting-edge practices in data science to solve some of the world’s biggest social challenges. In its website (url: https://www.drivendata.org/), readers can participate in data science competitions and download datasets.
  • Analytics Vidhya. This website (url: https://datahack.analyticsvidhya.com/contest/all/) allows you to participate and download datasets from practice problems and hackathon problems.
  • KDD Cup. KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. This website (url: http://www.kdd.org/kdd-cup) contains the datasets used in past KDD Cup competitions since 1997.
  • U.S. Government’s open data. This website (url: https://www.data.gov/) contains about 200,000 datasets covering a wide range of areas including climate, education, energy, and finance.
  • AWS Public Datasets. In this website (url: https://aws.amazon.com/datasets/), Amazon provides a centralized repository of public datasets, including some huge datasets.

Bibliography

Charpentier, Arthur. 2014. Computational Actuarial Science with R. CRC press.

Frees, Edward W, Catalina Bolancé, Montserrat Guillen, and Emiliano A Valdez. 2021. “Dependence Modeling of Multivariate Longitudinal Hybrid Insurance Data with Dropout.” Expert Systems with Applications 185: 115552.

Frees, Edward W, and Adam Butt. 2022. “ANU Insurable Risks.” https://doi.org/10.25911/0SE7-N746.