Chapter 22 Appendix. Data Resources
This appendix section describes the datasets used in this book and others that you may wish to explore.
For each set of data, we provide download buttons so that you can easily access the data in standard .csv (comma separated value) format. This allows you replicate and experiment with the methods developed in the book as well as sharpen your understanding through exercises.
We provide the source of each dataset. We also recommend, for deeper understanding, that you occasionally refer to these original sources to further develop your appreciation of the data underpinning the analytics developed in this book.
22.1 Wisconsin Property Fund
Description: The Wisconsin Local Government Property Insurance Fund (LGPIF) is an insurance pool administered by the Wisconsin Office of the Insurance Commissioner. The LGPIF was established to provide property insurance for local government entities that include counties, cities, towns, villages, school districts, and library boards. The fund insures local government property such as government buildings, schools, libraries, and motor vehicles. It covers all property losses except those resulting from flood, earthquake, wear and tear, extremes in temperature, mold, war, nuclear reactions, and embezzlement or theft by an employee.
The data are available using this download button:
Variable | Description |
---|---|
PolicyNum | Policy number |
Year | Contract year |
Premium | Premium |
Deduct | Deductible |
BCcov | Coverage for building and contents |
Freq | Number of claims during the year (frequency) |
Fire5 | Binary variable to indicate the fire class is below 5 |
NoClaimCredit | Binary variable to indicate no claims in the past two years |
EntityType | Categorical variable that is one of six types: 1=Village, 2=City,3=County, 4=Misc, 5=School, or Town) |
AlarmCredit | Categorical variable that is one of four types: (0, 5, 10, or 15) for automatic smoke alarms in main rooms |
BCClaim | Builing and contents claims |
PolicyNum | Year | Premium | Deduct | BCcov | Freq | Fire5 | NoClaimCredit | EntityType | AlarmCredit | BCClaim |
---|---|---|---|---|---|---|---|---|---|---|
120002 | 2006 | 9313 | 1000 | 22714456 | 0 | 1 | 0 | 3 | 1 | 0 |
120002 | 2007 | 8767 | 1000 | 25046646 | 0 | 1 | 0 | 3 | 1 | 0 |
120002 | 2008 | 7090 | 1000 | 20851525 | 0 | 1 | 1 | 3 | 1 | 0 |
120002 | 2009 | 8522 | 1000 | 21852696 | 0 | 1 | 1 | 3 | 1 | 0 |
120002 | 2010 | 7994 | 1000 | 23511493 | 1 | 1 | 1 | 3 | 1 | 6839 |
PolicyNum | Year | Premium | Deduct | BCcov | Freq | Fire5 | NoClaimCredit | EntityType | AlarmCredit | BCClaim |
---|---|---|---|---|---|---|---|---|---|---|
180787 | 2010 | 199 | 500 | 285000 | 0 | 1 | 1 | 4 | 1 | 0 |
180788 | 2010 | 58344 | 100000 | 416739800 | 1 | 1 | 0 | 4 | 1 | 168304 |
180789 | 2010 | 295 | 500 | 500988 | 1 | 1 | 0 | 4 | 1 | 1034 |
180790 | 2010 | 2077 | 1000 | 3580665 | 0 | 1 | 0 | 4 | 4 | 0 |
180791 | 2010 | 81 | 500 | 118800 | 0 | 1 | 0 | 4 | 1 | 0 |
22.2 ANU Corporate Travel Data
Universities purchase corporate travel policies to cover employees and students traveling on official university business for a wide variety of accidents and incidents while away from the campus or primary workplace. This broad coverage includes medical care and evacuation, loss of personal property, extraction for political and weather related reasons, and more. See Frees and Butt (2022) for more information about this coverage.
There are 2107 observations in this dataset. The variable names are described in Table 22.3 and the first and last five observations are in Table 22.4.
Data are available using this button: .
Variable | Description |
---|---|
UW Year | Underwriting Year |
Loss Date | Date that the loss occurred |
Reported Date | Date that the loss was reported |
Last Trans Date | Last date in which there was a transaction regarding the loss |
Paid Loss | Cumulative amount paid on the loss |
Outstanding Reserve | Estimate of the loss amount yet to be paid |
Incurred Loss | Sum of the amount paid and the estimate of future payments |
Status | An indicator as to whether the claim has been deemed settled (closed) or not settled (open) |
UW.Year | Loss.Date | Reported.Date | Last.Trans.Date | Paid.Loss | Outstanding.Reserve | Incurred.Loss | Status |
---|---|---|---|---|---|---|---|
2021 | 19/12/2021 | 20/12/2021 | 24/12/2021 | 10000 | 0 | 10000 | Closed |
2021 | 9/4/2022 | 29/04/2022 | 30/05/2022 | 423 | 0 | 423 | Closed |
2021 | 2/5/2022 | 4/5/2022 | 0 | 500 | 500 | Open | |
2021 | 5/5/2022 | 17/05/2022 | 0 | 562 | 562 | Open | |
2021 | 30/04/2022 | 27/05/2022 | 10/6/2022 | 1500 | 0 | 1500 | Closed |
UW.Year | Loss.Date | Reported.Date | Last.Trans.Date | Paid.Loss | Outstanding.Reserve | Incurred.Loss | Status |
---|---|---|---|---|---|---|---|
2006 | 1/11/2006 | 19/06/2007 | 0 | 0 | 0 | Closed | |
2006 | 24/06/2007 | 26/06/2007 | 8/1/2008 | 6278 | 0 | 6278 | Closed |
2006 | 4/7/2007 | 6/7/2007 | 11/9/2007 | 114 | 0 | 114 | Closed |
2006 | 20/05/2007 | 26/06/2007 | 14/07/2007 | 136 | 0 | 136 | Closed |
2006 | 15/02/2007 | 27/06/2007 | 14/07/2007 | 1208 | 0 | 1208 | Closed |
Source: Frees, Edward and Butt, Adam (2022). “ANU Corporate Travel Insurance Claims 2022”. Australian National University Data Commons. DOI https://doi.org/10.25911/vrdw-9f32.
22.3 ANU Group Personal Accident Data
Group personal accident insurance offers financial protection in case of injury or death resulting from an incident that occurs on the job. Like workers’ compensation, group personal accident offers insurance coverage and liability insurance protection against accidental death or injury. Unlike workers’ compensation, group personal accident covers students and ANU’s voluntary workers. See Frees and Butt (2022) for more information about this coverage.
There are 148 observations in this dataset. The variable names are described in Table 22.5 and the first and last five observations are in Table 22.6.
Data are available using this button: .
Variable | Description |
---|---|
UW Year | Underwriting Year |
Loss Date | Date that the loss occurred |
Last Trans Date | Last date in which there was a transaction regarding the loss. |
Paid Loss | Cumulative amount paid on the loss |
Outstanding Reserve | Estimate of the loss amount yet to be paid |
Incurred Loss | Sum of the amount paid and the estimate of future payments |
Status | An indicator as to whether the claim has been deemed settled (closed) or not settled (open) |
UW.Year | Loss.Date | Last.Trans.Date | Paid.Loss | Outstanding.Reserve | Incurred.Loss | Status |
---|---|---|---|---|---|---|
2021 | 6/12/2021 | 3/6/2022 | 805 | 0 | 805 | Closed |
2021 | 15/11/2021 | 0 | 0 | 0 | Closed | |
2021 | 15/11/2021 | 0 | 0 | 0 | Closed | |
2021 | 22/03/2022 | 4/5/2022 | 396 | 0 | 396 | Closed |
2021 | 11/4/2022 | 2/8/2022 | 740 | 360 | 1100 | Open |
UW.Year | Loss.Date | Last.Trans.Date | Paid.Loss | Outstanding.Reserve | Incurred.Loss | Status |
---|---|---|---|---|---|---|
2010 | 6/3/2011 | 26/07/2011 | 776 | 0 | 776 | Closed |
2010 | 22/07/2011 | 23/01/2012 | 4625 | 0 | 4625 | Closed |
2010 | 5/6/2011 | 30/01/2012 | 1504 | 0 | 1504 | Closed |
2007 | 11/1/2008 | 23/02/2008 | 0 | 0 | 0 | Closed |
2007 | 29/08/2008 | 0 | 0 | 0 | Closed |
Source: Frees, Edward and Butt, Adam (2022). “ANU Group Personal Accident Claims 2022”. Australian National University Data Commons. https://doi.org/10.25911/jcfx-zj56.
22.4 ANU Motor Vehicle Data
This policy covers ANU’s vehicles including cars, vans, utilities, and motorcycles. See Frees and Butt (2022) for more information about this coverage.
There are 318 observations in this dataset. The variable names are described in Table 22.7 and the first and last five observations are in Table 22.8.
Data are available using this button: .
Variable | Description |
---|---|
Policy Term Start Date | Start date of the contract year in which the loss occurred |
Loss Date | Date that the loss occurred |
Reported Date | Date that the loss was reported |
Motor Fault | Party responsible for the loss |
Driver Age | Age of the driver |
Vehicle Description | Type of vehicle |
Loss Postcode | Postal code where the loss occurred |
Excess | The deductible applied to the loss |
Motor Net Paid | Amount paid to the insured (ANU) |
Outstanding Estimate | Estimate of the loss amount yet to be paid |
Motor Net Incurred | Sum of the amount paid and the estimate of future payments |
Third Party Identified | Indicates whether a responsible third party could be identified |
Third Party Insured | Indicates whether a responsible third party was insured |
Policy.Term.Start.Date | Loss.Date | Reported.Date | Motor.Fault | Driver.Age | Vehicle.Description | Loss.Postcode |
---|---|---|---|---|---|---|
1/11/2011 | 6/6/2012 | 4/10/2012 | THIRD PARTY RESPONSIBLE | NA | FORD TRANSIT VAN | 2600 |
1/11/2011 | 16/08/2012 | 14/11/2013 | INSURED RESPONSIBLE | 39 | TOYOTA HIACE | 2612 |
1/11/2011 | 4/9/2012 | 17/01/2013 | INSURED RESPONSIBLE | 52 | HYUNDAI IX35 | 2600 |
1/11/2011 | 21/09/2012 | 28/09/2012 | THIRD PARTY RESPONSIBLE | 59 | HOLDEN COMMODORE | 2518 |
1/11/2011 | 22/09/2012 | 12/10/2012 | INSURED RESPONSIBLE | NA | SUBARU FORESTER | 2612 |
Excess | Motor.Net.Paid | Outstanding.Estimate | Motor.Net.Incurred | Third.Party.Identified | Third.Party.Insured |
---|---|---|---|---|---|
1000 | 385 | 0 | 385 | IDENTIFIED | |
1000 | 901 | 0 | 901 | ||
1000 | 1226 | 0 | 1226 | ||
NA | 1672 | 0 | 1672 | IDENTIFIED | NOT INSURED |
1000 | 3419 | 0 | 3419 | INSURED |
Policy.Term.Start.Date | Loss.Date | Reported.Date | Motor.Fault | Driver.Age | Vehicle.Description | Loss.Postcode |
---|---|---|---|---|---|---|
1/11/2021 | 4/4/2022 | 5/4/2022 | INSURED RESPONSIBLE | 66 | VOLKSWAGEN TIGUAN | 2604 |
11/1/2021 | 11/4/2022 | 9/5/2022 | INSURED RESPONSIBLE | 27 | TOYOTA HILUX | 2540 |
1/11/2021 | 11/4/2022 | 9/5/2022 | INSURED RESPONSIBLE | 27 | TOYOTA HILUX | 2540 |
11/1/2021 | 15/04/2022 | 11/7/2022 | INSURED RESPONSIBLE | 21 | TOYOTA HILVX | 2601 |
1/11/2021 | 18/07/2022 | 18/07/2022 | NO-ONE RESPONSIBLE | NA | TOYOTA HILUX | 2601 |
Excess | Motor.Net.Paid | Outstanding.Estimate | Motor.Net.Incurred | Third.Party.Identified | Third.Party.Insured |
---|---|---|---|---|---|
0 | 2373 | 1056 | 3429 | ||
0 | 210 | 25000 | 25210 | ||
0 | 0 | 31927 | 31927 | ||
0 | 0 | 2750 | 2750 | ||
0 | 0 | 299 | 299 |
Source: Frees, Edward and Butt, Adam (2022). “ANU Motor Vehicle Claims 2022”. Australian National University Data Commons. DOI https://doi.org/10.25911/g7e4-9e46.
22.5 Spanish Personal Insurance Data
This dataset consists of 10,000 insurance private customers of a real portfolio of insurance policy holders in Spain with a motor insurance and a homeowners insurance contract for policy year 2014. The data contain information on each customer, policies and yearly claims by type of contract.
The data are available using this download button:
The description of the data appears in Table 22.9.
Variable | Description |
---|---|
gender | 1 for male and 0 for female |
Age_client | the age of the customer in years |
year | Policy year. Equals 5 corresponding to 2014. |
age_of_car_M | the number of years since the vehicle was bought by the customer |
Car_power_M | the power of the vehicle |
Car_2ndDriver_M | 1 if the customer has informed the insurance company that a second occasional driver uses the vehicle, and 0 otherwise |
num_policiesC | the total number of policies held by the same customer in the insurance company |
metro_code | 1 for urban or metropolitan and 0 for rural |
Policy_PaymentMethodA | 1 for annual payment and 0 for monthly payment in the motor policy |
Policy_PaymentMethodH | 1 for annual payment and 0 for monthly payment in the homeowners policy |
Insuredcapital_content_re | the value of content in homeowners insurance |
Insuredcapital_continent_re | the value of building in homeowners insurance |
appartment | 1 if the homeowners insurance correspond to an apartment and 0 otherwise |
Client_Seniority | the number of years that the customer has been in the company |
Retention | 1 if the policy is renewed and 0 otherwise |
NClaims1 | the number of claims in the motor insurance policy for the corresponding year |
NClaims2 | the number of claims in the homeowners insurance policy for the corresponding year |
Claims1 | the sum of claims cost in the motor insurance policy for the corresponding year |
Claims2 | the sum of claims cost in the homeowners insurance policy for the corresponding year |
Types | 1 when neither an auto nor a home claim, it is equal to 2 when the customer has an auto but not a home claim, it is equal to 3 when the customer does not have not an auto but a home claim and it is equal to 4 when both an auto and a home claim. |
PolID | Policy Identification Number |
All monetary units are expressed in Euros. In motor insurance, only claims at fault are considered.
gender | Age_client | year | age_of_car_M | Car_power_M | Car_2ndDriver_M | num_policiesC |
---|---|---|---|---|---|---|
1 | 47 | 5 | 12 | 163 | 0 | 0 |
1 | 52 | 5 | 13 | 80 | 0 | 1 |
0 | 66 | 5 | 7 | 97 | 0 | 1 |
1 | 70 | 5 | 17 | 95 | 0 | 1 |
1 | 67 | 5 | 13 | 110 | 0 | 1 |
metro_code | Policy_PaymentMethodA | Policy_PaymentMethodH | Insuredcapital_content_re | Insuredcapital_continent_re | appartment |
---|---|---|---|---|---|
0 | 1 | 1 | 10 | 12 | 1 |
0 | 1 | 1 | 10 | 11 | 0 |
1 | 1 | 1 | 9 | 11 | 1 |
0 | 1 | 1 | 10 | 11 | 1 |
0 | 1 | 1 | 11 | 12 | 0 |
Client_Seniority | Retention | NClaims1 | NClaims2 | Claims1 | Claims2 | Types | PolID |
---|---|---|---|---|---|---|---|
7 | 1 | 0 | 0 | 0 | 0 | 1 | 12476 |
18 | 1 | 0 | 0 | 0 | 0 | 1 | 29232 |
15 | 1 | 0 | 0 | 0 | 0 | 1 | 23770 |
16 | 1 | 0 | 1 | 0 | 58 | 3 | 8228 |
6 | 1 | 0 | 0 | 0 | 0 | 1 | 37088 |
gender | Age_client | year | age_of_car_M | Car_power_M | Car_2ndDriver_M | num_policiesC |
---|---|---|---|---|---|---|
1 | 66 | 5 | 8 | 143 | 0 | 1 |
1 | 55 | 5 | 18 | 125 | 1 | 1 |
0 | 41 | 5 | 10 | 190 | 0 | 1 |
1 | 50 | 5 | 5 | 140 | 0 | 1 |
1 | 55 | 5 | 12 | 90 | 0 | 1 |
metro_code | Policy_PaymentMethodA | Policy_PaymentMethodH | Insuredcapital_content_re | Insuredcapital_continent_re | appartment |
---|---|---|---|---|---|
0 | 1 | 1 | 10 | 11 | 1 |
0 | 1 | 1 | 11 | 11 | 1 |
0 | 1 | 1 | 9 | 12 | 1 |
0 | 1 | 1 | 10 | 12 | 0 |
1 | 1 | 1 | 11 | 13 | 0 |
Client_Seniority | Retention | NClaims1 | NClaims2 | Claims1 | Claims2 | Types | PolID |
---|---|---|---|---|---|---|---|
20 | 1 | 0 | 0 | 0 | 0 | 1 | 2967 |
15 | 1 | 0 | 0 | 0 | 0 | 1 | 9387 |
6 | 1 | 0 | 0 | 0 | 0 | 1 | 36519 |
8 | 1 | 0 | 0 | 0 | 0 | 1 | 33276 |
6 | 1 | 0 | 0 | 0 | 0 | 1 | 25370 |
These data were drawn from a larger database of 40,284 insurance private customers. These customers are tracked from 2010 to 2014. Some customers do not renew their policies, so that they do not stay in the sample for five years. For the smaller data, only the 2014 policy year was used and from this, a random sample of 10,000 customers was drawn.
See Frees et al. (2021) for more information about this dataset. The larger database contains 122935 rows and is freely available at:
Source: Guillen, Montserrat; Bolancé, Catalina; Frees, Edward W.; Valdez, Emiliano A. (2021), “Insurance data for homeowners and motor insurance customers monitored over five years”, Mendeley Data, V1, DOI https://doi.org/10.17632/vfchtm5y7j.1
22.6 ‘R’ Package CASdatasets
The R
package CASdatasets
provides a convenient way to access many well-known insurance datasets. This package was originally created to support the book Computational Actuarial Science with R, edited by Arthur Charpentier, Charpentier (2014).
To install the package, here is a bit of R
code:
install.packages("CASdatasets", repos = "http://cas.uqam.ca/pub/", type = "source")
library(CASdatasets)
`?`(CASdatasets)
`?`(sgautonb # See the documentation of the Singapore Auto Data
)
`?`(lossalae # See the documentation of the Loss and Expense Data
)
Note that this package assumes that you have already installed a few other packages, including xts, sp, and zoo.
To illustrate,
22.7 Other Data Sources
There exists man other (non-actarial) data sources. First, data can be obtained from university-based researchers who collect primary data. Second, data can be obtained from organizations that are set up for the purpose of releasing secondary data for the general research community. Third, data can be obtained from national and regional statistical institutes that collect data. Finally, companies have corporate data that can be obtained for research purposes.
While it might be difficult to obtain data to address a specific research problem or answer a business question, it is relatively easy to obtain data to test a model or an algorithm for data analysis. In the modern era, readers can obtain datasets from the Internet. The following is a list of some websites to obtain real-world data:
- UCI Machine Learning Repository. This website (url: http://archive.ics.uci.edu/ml/index.php) maintains more than 400 datasets that can be used to test machine learning algorithms.
- Kaggle. The Kaggle website (url: https://www.kaggle.com/) include real-world datasets used for data science competitions. Readers can download data from Kaggle by registering an account.
- DrivenData. DrivenData aims at bringing cutting-edge practices in data science to solve some of the world’s biggest social challenges. In its website (url: https://www.drivendata.org/), readers can participate in data science competitions and download datasets.
- Analytics Vidhya. This website (url: https://datahack.analyticsvidhya.com/contest/all/) allows you to participate and download datasets from practice problems and hackathon problems.
- KDD Cup. KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. This website (url: http://www.kdd.org/kdd-cup) contains the datasets used in past KDD Cup competitions since 1997.
- U.S. Government’s open data. This website (url: https://www.data.gov/) contains about 200,000 datasets covering a wide range of areas including climate, education, energy, and finance.
- AWS Public Datasets. In this website (url: https://aws.amazon.com/datasets/), Amazon provides a centralized repository of public datasets, including some huge datasets.