Kaplan-Meier usage for forecasting

So I have somehow got involved in a forecasting project at work and am currently making things up as I go along (because no one else has any clue what they're doing either), but I would greatly value your input on the best approach.


Insurance policies are sold with an unknown end date. For all live policies, we want to predict a likely end date (or upper and lower bands) to forecast future earnings, based on data from previous policies that have now ended.


The data covers 15 years; once split into separate product categories, there are anywhere from 1.5k to 20k policies per category.

Current method:

(Using KNIME's loop feature to iterate through live policies individually)

1. Take live policy active days (start date vs today's date)
2. Remove policies from the overall dataset (filtered to the same category) with active days less than the live policy's
3. Calculate Median and Median Absolute Deviation of remaining policies
4. Remove policies from the overall dataset with active days greater than Median + Median Absolute Deviation
5. Calculate Median, Mean, and Count from remaining policies and attach data to live policy
6. Loop to next live policy
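For concreteness, the steps above can be sketched in plain Python (a sketch only; `predict_remaining` and its inputs are hypothetical names, not part of the KNIME workflow):

```python
import statistics

def predict_remaining(live_days, ended_days):
    """Sketch of the trimming procedure described above.

    live_days:  active days of the live policy (start date to today)
    ended_days: total active days for each ended policy in the same category
    """
    # Step 2: keep only ended policies that lasted at least as long
    longer = [d for d in ended_days if d >= live_days]
    if not longer:
        return None

    # Step 3: median and median absolute deviation of the survivors
    med = statistics.median(longer)
    mad = statistics.median(abs(d - med) for d in longer)

    # Step 4: drop policies lasting longer than median + MAD
    trimmed = [d for d in longer if d <= med + mad]

    # Step 5: summary statistics to attach back to the live policy
    return {
        "median": statistics.median(trimmed),
        "mean": statistics.mean(trimmed),
        "count": len(trimmed),
    }
```

Written this way, the asymmetry is easy to see: step 2 discards everything shorter than the live policy and step 4 trims only the long tail, so the output can only ever say "it will last at least this much longer".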


As mentioned in the intro, I made up this process based on what I thought seemed not totally unreasonable, but with no true knowledge of whether it is statistically robust.

With the end result, I'm not even sure what to do. I can obviously use the median or the mean to predict the likely end date of the policy, and use the sample size as a sort of weighting for that prediction (i.e. a small final sample relative to the overall category sample = a low-accuracy prediction), but I would really appreciate someone's input.

Many thanks
Just to add: this current model would only ever predict that policies will continue for a good amount of time. I guess I therefore shouldn't cut off the data with active days less than the live policy's, so that a prediction could also be that the policy will not last much longer; I'm just not sure how to go about setting that lower band.



TS Contributor
I recommend using a different approach. You are essentially trying to predict the probability of an event occurring at a point in the future. This is the basis of reliability analysis. I would use a survival analysis for arbitrary censored data. With the number of policies involved, you could use a nonparametric Kaplan-Meier, Turnbull or Actuarial approach depending on your data structure. You would end up with probabilities for a policy ending for any future time period with upper and lower confidence intervals.
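To make the suggestion concrete, the Kaplan-Meier product-limit estimator can be computed from just durations and an ended/still-live flag. The sketch below is a minimal hand-rolled version for illustration (in practice a library such as lifelines handles this, including confidence intervals); the function name and data are hypothetical:

```python
def kaplan_meier(durations, event_observed):
    """Minimal Kaplan-Meier estimator (illustrative sketch).

    durations:      active days for each policy
    event_observed: 1 if the policy ended, 0 if still live (right-censored)
    Returns the survival curve as a list of (time, survival_probability) steps.
    """
    data = sorted(zip(durations, event_observed))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        endings = 0   # observed policy endings at time t
        ties = 0      # everything (ended or censored) leaving the risk set at t
        while i < len(data) and data[i][0] == t:
            endings += data[i][1]
            ties += 1
            i += 1
        if endings:
            # product-limit step: S(t) shrinks by the fraction ending now
            surv *= 1 - endings / n_at_risk
            curve.append((t, surv))
        n_at_risk -= ties
    return curve
```

Censored (still-live) policies contribute to the risk set for as long as they have been observed, which is exactly what the median/MAD trimming in the original approach throws away.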


No cake for spunky
Cox proportional hazards is probably the best way to do this, but if you don't know survival analysis well, it's not worth trying unless you are a genius. That is similar to what miner is talking about.

Another much simpler approach would be to use something like exponential smoothing or ARIMA to project future closure rates based on past closure rates.
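For the simpler route, single exponential smoothing is a few lines: each smoothed value blends the newest observation with the running estimate, and the final value serves as the one-step-ahead forecast. A sketch (the function name and the alpha value are illustrative):

```python
def exponential_smoothing(series, alpha=0.3):
    """Single exponential smoothing; returns the one-step-ahead forecast.

    alpha near 1 tracks recent closure rates closely;
    alpha near 0 averages over a long history.
    """
    smoothed = series[0]
    for x in series[1:]:
        # weighted blend of the new observation and the previous estimate
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed
```

Applied to, say, monthly closure rates per category, this projects the next period's rate; ARIMA would be the step up if the rates show trend or seasonality.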
Many thanks both, really appreciate your quick responses.

So I have managed to find a Kaplan-Meier node in KNIME (how handy) and run my data through that.

The results look like this: [Kaplan-Meier survival curve plot omitted]

So as I understand it (per https://towardsdatascience.com/kaplan-meier-curves-c5768e349479), if I take out the censored data I can also create a sort of worst-case plot to go along with the overall plot, although I can't do a best case, because none of my censored data comes from observations where we don't know the outcome (i.e. we know they are still live).

If I now take a live policy, I can see, for any given future time period, the probability of it still being live. What I'm struggling with is how to use this to predict the likely time period of it ceasing (or at least a band), which I can then feed financial data into for my forecasting.
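One common way to get from the curve to a predicted end band is conditional survival: given a policy already alive at age a, P(T > t | T > a) = S(t) / S(a), and you can read off the times where that conditional probability falls to, say, 75%, 50% and 25%. A sketch under those assumptions (the function name, curve format and cut-off levels are illustrative, not an established recipe):

```python
def conditional_end_band(curve, current_age):
    """Estimate when a policy already `current_age` days old is likely to end.

    curve: Kaplan-Meier curve as (time, survival) steps, time ascending.
    Returns the first times where conditional survival S(t)/S(a) drops to
    75% (early bound), 50% (point estimate) and 25% (late bound).
    """
    # survival at the current age = value of the last step at or before it
    s_now = 1.0
    for t, s in curve:
        if t <= current_age:
            s_now = s

    band = {}
    for label, level in [("early", 0.75), ("median", 0.50), ("late", 0.25)]:
        band[label] = None  # None if the curve never drops that far
        for t, s in curve:
            if t > current_age and s / s_now <= level:
                band[label] = t
                break
    return band
```

The "median" time is then the point estimate of the end date to feed the financial forecast, with "early" and "late" as the band; a None late bound just means the observed history doesn't extend far enough, which itself is useful as a low-confidence flag.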

Many thanks


Less is more. Stay pure. Stay poor.
Are there any covariates that you need to control for? As I follow, you currently have previous policies' lengths and whether they were terminated. What other information is relevant?
There aren't any covariates that are accurately captured, as far as I am aware, so I have not included any. There are multiple different products across multiple currencies, but I am treating them separately as they have no influence on each other.

So really, I just have policy start date and end date (if applicable).



Active Member
There's historical data in the form of lengths of time, thousands of measurements of completed policy lifespans.
Are these policy lifespans distributed across their range of values in a normal bell curve?

If yes, then standard deviation and z-scores can be used on lifespan to model the trade-off between pinning down a narrow expected range with its attendant risk of error, and accepting a wider band of possible outcomes in exchange for higher confidence.
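The trade-off is easy to show numerically. This is purely illustrative and assumes the lifespans really are normal, with a hypothetical mean of 1000 days and standard deviation of 200 days:

```python
# Hypothetical normal lifespans: mean 1000 days, standard deviation 200 days.
mean, sd = 1000.0, 200.0

# Narrower bands pin the estimate down but risk missing the true value;
# wider bands trade precision for confidence.
for conf, z in [("68%", 1.0), ("95%", 1.96), ("99.7%", 3.0)]:
    lower, upper = mean - z * sd, mean + z * sd
    print(f"{conf} of lifespans expected between {lower:.0f} and {upper:.0f} days")
```

Note that policy durations are often right-skewed rather than normal, which is one reason the survival-analysis replies above avoid the normality assumption.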

I've brought the subject of confidence/risk up on several forums, but have been frequently met with hostility and confusion, so I suspect it's a poorly taught/understood principle.