Kaplan-Meier usage for forecasting

So I have somehow got involved in a forecasting project at work and am currently making things up as I go along (because no one else has any clue what they're doing either), but I would greatly value your input on the best approach.


Insurance policies are sold with an unknown end date. For all live policies, we want to predict a likely end date (or upper and lower bands) to forecast future earnings, based on data from previous policies that have now ended.


The data covers 15 years; once split into separate product categories, there are anywhere from 1.5k to 20k policies per category.

Current method:

(Using KNIME's loop feature to iterate through live policies individually)

1. Take live policy active days (start date vs today's date)
2. Remove policies from the overall dataset (filtered to the same category) with active days less than the live policy's
3. Calculate Median and Median Absolute Deviation of remaining policies
4. Remove policies from the overall dataset with active days greater than Median + Median Absolute Deviation
5. Calculate Median, Mean, and Count from remaining policies and attach data to live policy
6. Loop to next live policy
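For concreteness, the steps above can be sketched in plain Python (a sketch only; `predict_remaining` and its inputs are hypothetical names, not part of the KNIME workflow):

```python
import statistics

def predict_remaining(live_days, ended_days):
    """Sketch of the trimming procedure described above.

    live_days:  active days of the live policy (start date to today)
    ended_days: total active days for each ended policy in the same category
    """
    # Step 2: keep only ended policies that lasted at least as long
    longer = [d for d in ended_days if d >= live_days]
    if not longer:
        return None

    # Step 3: median and median absolute deviation of the survivors
    med = statistics.median(longer)
    mad = statistics.median(abs(d - med) for d in longer)

    # Step 4: drop policies lasting longer than median + MAD
    trimmed = [d for d in longer if d <= med + mad]

    # Step 5: summary statistics to attach back to the live policy
    return {
        "median": statistics.median(trimmed),
        "mean": statistics.mean(trimmed),
        "count": len(trimmed),
    }
```

Written this way, the asymmetry is easy to see: step 2 discards everything shorter than the live policy and step 4 trims only the long tail, so the output can only ever say "it will last at least this much longer".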


As mentioned in the intro, I made up this process based on what I thought seemed not totally unreasonable, but with no true knowledge of whether it is statistically robust.

With the end result, I'm not even sure what to do. I can obviously use the median or the mean to predict the likely end date of the policy, and use the sample size as a sort of weighting for that prediction (i.e. a small final sample relative to the overall category sample = a low-accuracy prediction), but I would really appreciate someone's input.

Many thanks
Just to add: this current model would only ever predict that policies will continue for a good amount of time. I guess I therefore shouldn't cut off the data with active days less than the live policy's, so that a prediction could also be that the policy will not last much longer; I'm just not sure how to go about setting that lower band.



TS Contributor
I recommend using a different approach. You are essentially trying to predict the probability of an event occurring at a point in the future. This is the basis of reliability analysis. I would use a survival analysis for arbitrary censored data. With the number of policies involved, you could use a nonparametric Kaplan-Meier, Turnbull or Actuarial approach depending on your data structure. You would end up with probabilities for a policy ending for any future time period with upper and lower confidence intervals.
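To make the suggestion concrete, the Kaplan-Meier product-limit estimator can be computed from just durations and an ended/still-live flag. The sketch below is a minimal hand-rolled version for illustration (in practice a library such as lifelines handles this, including confidence intervals); the function name and data are hypothetical:

```python
def kaplan_meier(durations, event_observed):
    """Minimal Kaplan-Meier estimator (illustrative sketch).

    durations:      active days for each policy
    event_observed: 1 if the policy ended, 0 if still live (right-censored)
    Returns the survival curve as a list of (time, survival_probability) steps.
    """
    data = sorted(zip(durations, event_observed))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        endings = 0   # observed policy endings at time t
        ties = 0      # everything (ended or censored) leaving the risk set at t
        while i < len(data) and data[i][0] == t:
            endings += data[i][1]
            ties += 1
            i += 1
        if endings:
            # product-limit step: S(t) shrinks by the fraction ending now
            surv *= 1 - endings / n_at_risk
            curve.append((t, surv))
        n_at_risk -= ties
    return curve
```

Censored (still-live) policies contribute to the risk set for as long as they have been observed, which is exactly what the median/MAD trimming in the original approach throws away.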


No cake for spunky
Cox proportional hazards is probably the best way to do this, but if you don't know survival analysis well, it's not worth trying unless you are a genius. That is similar to what miner is talking about.

Another much simpler approach would be to use something like exponential smoothing or ARIMA to project future closure rates based on past closure rates.
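For the simpler route, single exponential smoothing is a few lines: each smoothed value blends the newest observation with the running estimate, and the final value serves as the one-step-ahead forecast. A sketch (the function name and the alpha value are illustrative):

```python
def exponential_smoothing(series, alpha=0.3):
    """Single exponential smoothing; returns the one-step-ahead forecast.

    alpha near 1 tracks recent closure rates closely;
    alpha near 0 averages over a long history.
    """
    smoothed = series[0]
    for x in series[1:]:
        # weighted blend of the new observation and the previous estimate
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed
```

Applied to, say, monthly closure rates per category, this projects the next period's rate; ARIMA would be the step up if the rates show trend or seasonality.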
Many thanks both, really appreciate your quick responses.

So I have managed to find a Kaplan-Meier node in KNIME (how handy) and run my data through that.

The results look like this: [Kaplan-Meier survival curve plot omitted]

So as I understand it (per https://towardsdatascience.com/kaplan-meier-curves-c5768e349479), if I take out the censored data I can also create a sort of worst-case plot to go along with the overall plot, although I can't do a best case, because none of my censored data comes from observations where we don't know the outcome (i.e. we know they are still live).

If I now take a live policy, I can see, for any given future time period, the probability of it still being live. What I'm struggling with is how to use this to predict the likely time period of it ceasing (or at least a band), which I can then feed financial data into for my forecasting.
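One common way to get from the curve to a predicted end band is conditional survival: given a policy already alive at age a, P(T > t | T > a) = S(t) / S(a), and you can read off the times where that conditional probability falls to, say, 75%, 50% and 25%. A sketch under those assumptions (the function name, curve format and cut-off levels are illustrative, not an established recipe):

```python
def conditional_end_band(curve, current_age):
    """Estimate when a policy already `current_age` days old is likely to end.

    curve: Kaplan-Meier curve as (time, survival) steps, time ascending.
    Returns the first times where conditional survival S(t)/S(a) drops to
    75% (early bound), 50% (point estimate) and 25% (late bound).
    """
    # survival at the current age = value of the last step at or before it
    s_now = 1.0
    for t, s in curve:
        if t <= current_age:
            s_now = s

    band = {}
    for label, level in [("early", 0.75), ("median", 0.50), ("late", 0.25)]:
        band[label] = None  # None if the curve never drops that far
        for t, s in curve:
            if t > current_age and s / s_now <= level:
                band[label] = t
                break
    return band
```

The "median" time is then the point estimate of the end date to feed the financial forecast, with "early" and "late" as the band; a None late bound just means the observed history doesn't extend far enough, which itself is useful as a low-confidence flag.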

Many thanks


Less is more. Stay pure. Stay poor.
Are there any covariates that you need to control for? As I follow, you currently have previous policies' lengths and whether they were terminated. What other information is relevant?
There aren't any covariates that are accurately captured, as far as I am aware, so I have not included any. There are multiple different products across multiple currencies, but I am treating them separately as they have no influence on each other.

So really, I just have policy start date and end date (if applicable).



Active Member
There's historical data in the form of lengths of time, thousands of measurements of completed policy lifespans.
Are these policy lifespans distributed across their range of values in a normal bell curve?

If yes, then standard deviation and z-scores can be used on lifespan to model the trade-off between pinning down a narrow expected range with its attendant risk of error, and accepting a wider band of possible outcomes in exchange for higher confidence.
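The trade-off is easy to show numerically. This is purely illustrative and assumes the lifespans really are normal, with a hypothetical mean of 1000 days and standard deviation of 200 days:

```python
# Hypothetical normal lifespans: mean 1000 days, standard deviation 200 days.
mean, sd = 1000.0, 200.0

# Narrower bands pin the estimate down but risk missing the true value;
# wider bands trade precision for confidence.
for conf, z in [("68%", 1.0), ("95%", 1.96), ("99.7%", 3.0)]:
    lower, upper = mean - z * sd, mean + z * sd
    print(f"{conf} of lifespans expected between {lower:.0f} and {upper:.0f} days")
```

Note that policy durations are often right-skewed rather than normal, which is one reason the survival-analysis replies above avoid the normality assumption.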

I've brought the subject of confidence/risk up on several forums, but have been frequently met with hostility and confusion, so I suspect it's a poorly taught/understood principle.