Validate my algorithm with probability?

#1
Before I start I want to say sorry. This post might be a bit vague; I'm not really sure where I'm going with it, and it's been a while since I took a stats class.

I'm hoping someone can nudge me in the right direction here; I'm kind of lost at the moment.

We have an algorithm for correcting people's height - it's not really height, but that's easier to explain. We already have a guesstimate of a person's height, and our algorithm makes a new prediction of it. So after running the algorithm we have two numbers: the new height prediction and the guesstimate we started with. We keep the guesstimate as a backup, e.g. in case the new prediction is too different from it.

When we compare the prediction with the person's actual height, our algorithm does fairly well. Most of the time we are pretty much spot on and can scrap the guesstimate.
However, sometimes the prediction is just plain wrong. It would be nice to feel more confident in the predictions and to validate them using statistics. The idea was to use a probability with a threshold to decide whether to keep the prediction or not - that has to be better than just tossing the prediction whenever it's too different from the guesstimate, right?
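To make the threshold idea a bit more concrete, this is roughly what I had in mind. Big assumption: that the corrections are roughly normally distributed (I haven't checked), and the numbers below are invented, not our real data.

```python
import numpy as np
from scipy import stats

# Invented offsets standing in for guesstimate-to-truth corrections (mm)
# from a validation set; replace with the real measured offsets.
rng = np.random.default_rng(0)
offsets_mm = rng.normal(loc=30.0, scale=25.0, size=23)

# Assume (for this sketch) the corrections are roughly normal and fit that.
mu, sigma = stats.norm.fit(offsets_mm)

def keep_prediction(guesstimate_mm, prediction_mm, alpha=0.05):
    """Keep the new prediction unless the implied correction is very unlikely
    under the fitted correction distribution (two-sided tail prob < alpha)."""
    correction = prediction_mm - guesstimate_mm
    z = (correction - mu) / sigma
    p_at_least_this_extreme = 2 * stats.norm.sf(abs(z))
    return p_at_least_this_extreme >= alpha, p_at_least_this_extreme

print(keep_prediction(1700, 1745))   # modest correction -> probably kept
print(keep_prediction(1700, 1900))   # huge correction   -> probably rejected
```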


If we only do small corrections, the new predictions are most likely correct. With bigger corrections, the likelihood of being correct decreases, though not always by a lot.
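Maybe something like a logistic regression of correct/incorrect against correction size could put a number on that relationship? The data below are completely invented, just to show the shape of the idea, and "correct" would need some tolerance definition on our side.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented labels: 1 if the prediction turned out correct (within some
# tolerance of the truth), 0 if not, against the size of the correction (mm).
correction_mm = np.array([[5], [12], [20], [35], [50], [80], [110], [150]])
correct       = np.array([ 1,   1,    1,    1,    1,    0,     1,     0])

model = LogisticRegression().fit(correction_mm, correct)

# Estimated probability that a prediction needing a 90 mm correction is correct.
print(model.predict_proba([[90]])[0, 1])
```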

... and that's where I'm at. How can I tackle this problem? How can I "describe" this with statistics? How do I start?

Edit: What I would really like to know is whether my new prediction is correct or not, and with what probability.

I started off by making a frequency distribution for differences between the guesstimate and the actual height.
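Roughly like this, although the offsets here are just invented stand-ins for my real data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented offsets standing in for (truth - guesstimate) on my samples, in mm.
rng = np.random.default_rng(1)
offsets_mm = rng.normal(loc=30.0, scale=25.0, size=23)

plt.hist(offsets_mm, bins=10)
plt.xlabel("offset from guesstimate to truth (mm)")
plt.ylabel("count")
plt.show()
```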

As you see, I'm wandering in the dark. Any suggestions or nudges are appreciated!

Merry Christmas, everybody! :)
 

hlsmith

#2
You never really say whether you know the truth. Do you know the truth eventually, or early on? Also, it seems to be assumed that you are working with a continuous variable - is that true?
 
#3
Thank you for replying!

If I understand you correctly: I never know the truth unless I manually check the predictions one by one afterwards. The algorithm will just run by itself for all eternity. I made a small dataset with 23 people's guesstimated heights and their actual heights (the truth) so I can validate my algorithm, but also see how big each correction was (the difference between guesstimate and new prediction, and the difference between new prediction and truth).
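In case it clarifies anything, the dataset is basically laid out like this (the values here are invented, not the real 23 people):

```python
import pandas as pd

# Layout of the little validation set (23 rows in the real one).
df = pd.DataFrame({
    "guesstimate_mm": [1720, 1655, 1810],
    "prediction_mm":  [1748, 1660, 1835],
    "truth_mm":       [1750, 1672, 1830],
})

# How big a correction the algorithm made, and how far it ended up from truth.
df["correction_mm"] = df["prediction_mm"] - df["guesstimate_mm"]
df["error_mm"]      = df["prediction_mm"] - df["truth_mm"]
print(df)
```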

Yes, it outputs a continuous variable, height, but I started working with the offset (also continuous, of course) between the guesstimate and the new prediction, since people's heights come in all different sizes. It felt more natural to say "we usually do corrections within this distribution" as a starting point for the statistical problem.

Here is a histogram of the valid (true) corrections, i.e. the offset between the guesstimate and the truth. The x-axis is the size of the correction from guesstimate to truth in millimetres (mm); the y-axis is the number of samples in each bucket. As you can see, in this dataset the truth is usually higher than the guesstimate (but not always).

So normally we only have to do smaller corrections, but from time to time larger ones are needed. Right now we don't dare apply corrections bigger than 100 mm, but as you can see below, sometimes the height really does need to be corrected by more than 100 mm.
[Attached image: histogram of guesstimate-to-truth corrections, in mm]
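As a rough sanity check on that 100 mm cap, I was thinking of just counting how often the true correction exceeds it, and maybe picking the cap from a percentile instead. Something like this, again with invented numbers:

```python
import numpy as np

# Invented stand-ins for the true (guesstimate-to-truth) corrections, in mm.
rng = np.random.default_rng(2)
true_correction_mm = np.abs(rng.normal(loc=40.0, scale=35.0, size=23))

# How often does the truth actually require a correction above the 100 mm cap?
print(f"share above 100 mm: {np.mean(true_correction_mm > 100):.0%}")

# A possible data-driven alternative to the hard 100 mm cap: an upper percentile.
print("95th percentile of true corrections (mm):", np.percentile(true_correction_mm, 95))
```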