Filling in missing values

#1
I subscribed to TalkStats less than an hour ago. This isn't a homework question, so I hope that's OK. I embarrassed to say I passed a college-level stats course in 1976 and still can't figure this out. Here's my question.

Consider a list of 100 data points (they happen to be my finishing times for playing crossword puzzles) sorted from earliest to latest. Assume the general trend is upward.

Now consider that a contiguous block of 20 of the data points is in fact missing. For example, I have the values for 1 - 35 and I have the values for 56 through 100, but I do not have the values for 36 through 55.

What is the most mathematically sensible way to fill in those 20 missing values?

It occurs to me that there are two possibilities. In the simpler one, a single value is calculated and used for all 20 of the missing data points. If that's the better possibility, how do I calculate that value?

The other possibility, unlike the one above, would somehow account for trends by filling in the missing 20 values with a series of different values based on the known values before and after the gap. If this is the better possibility, how do I calculate those missing values?

(The real-world example of this problem, in case it helps, can be seen in Excel 2000 format at http://barelybad.com/xwd_times_194.xls. The gap starts at Row 173 and continues to Row 204.)

So, to repeat, what is the most defensible way of somehow filling in those 20 missing values, and what is the rationale?

I've tried to noodle it out several times over the last couple years, and now I'm turning to experts for help.

Thanks for any ideas you can offer.

--Johnny
johnegee aattt gmail ddoottt ccommm
 
#2
Crossword trend

Johnny,
You could graph time against crosswords completed.

On the assumption that your ability changes with the number you completed you could graph them.

For the first point use x=1 and y = the time for crossword 1
For the second point use x=2 and y the time for crossword 2
For the nth point use x=n and y= the time for crossword n where n is the crossword that you have completed, you would then have a gap for the ones that are missing the data.

Then try and plot a trend line to the points. Excell has a regression fit for this. If the line looks straight then a linear fit might be appropiate. Excel also has polynomila fits and exponential fits.

If a good fit trend emerges you could interpolate for the missing data. If you can not get a good fit then you need to modify your law, maybe the square root of the number of crosswords completed or the square of the time taken. In the first case point 1: x=1 , y=the time for crossword 1, point 2: x=root 2, y=the time for crossword 2, point n: x=root n, y =the time for crossword n.

Be careful if you do get a trend with your intrepation, relationship does not imply causation. For example you could plot fire damage against number of firemen in attendance and get a relationship but that does not mean that the firemen cause the damage, it might mean that bigger fires need more firemen and bigger fires cause more damage.

If you look up acceleration due to gravity experiment where they plot distance vs time squared to get a value for g, and look up Boyles law you will see real world examples that work out.

If you want real data to practice on send me a private message and I will send you some numbers from my old lab books in the attic.

PeterVincent
 
#3
JOhny,
You may be able to do use a type of imputation to account for that missing data. usually your not suppose to do that if its above 15% missing I belive.

However, if its a relationship you want to see just for your own curiosity, then I dont see the harm in using imputation on a data set of more than 15% missing values.

As peter mentioned, just treat one variable as time (x) and the variable as the scores (y) and then run a Regresion or Expected Maximization Imputation on SPSS.

Hope it helps,
Gerardo