# Using kknn regression function with NAs

#### bloynoys

##### New Member
I have been lurking here for a while and now have a question I hope to get answered!

I'm not sure whether anyone here has experience with the kknn package, but I am using it in a regression framework to predict future baseball statistics. I'm also not sure whether there is a way to trick the function into doing what I want it to do.

Note: A lag is the stat from a previous year. So lag1 is the stat in question for the previous year, lag2 is the stat from two years ago, etc.

The problem I am running into is how to deal with NAs in the train and test datasets. Let's say a guy has played for two years and that is it. If I use 3 lags, he will have NAs and cannot be predicted. So I want something where this guy can still be predicted, but for a guy with 4 seasons all four are used (with the 3rd and 4th seasons carrying a lot less weight, if possible).

Auxiliary question: Is there any way to make some inputs to the model less important than others?
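One common way to make some inputs count less in a nearest-neighbor distance is to multiply each column by a weight before computing distances. The sketch below is base R only and is not a kknn feature; the weights in `w` are made-up values for illustration.

```r
# Hypothetical lag weights: recent seasons count more (values are assumptions).
w <- c(lag1 = 1.0, lag2 = 0.6, lag3 = 0.3)

train <- data.frame(
  lag1 = c(.155, .134, .190),
  lag2 = c(.150, .170, .178),
  lag3 = c(.150, .210, .180)
)

# Multiply each column by its weight. A Manhattan distance on the scaled
# data equals sum(w * |x - y|) on the original data.
scaled <- sweep(as.matrix(train), 2, w[colnames(train)], "*")

d_weighted <- as.matrix(dist(scaled, method = "manhattan"))
d_plain    <- as.matrix(dist(train,  method = "manhattan"))

# Differences in lag3 now contribute less to the distance than before.
d_weighted[1, 2]
d_plain[1, 2]
```

If pre-scaled columns like these are passed to `kknn()`, the `scale` argument probably needs to be set to `FALSE`, since the default `scale = TRUE` standardizes each variable and would undo the weighting.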

Some things I have tried: Change all NAs to 0s. This solves the problem of some players not being predicted at all, but it creates another one. If the guy with two years is really similar to another guy's two most recent years (but that guy has played 4 years), that person's 3rd and 4th lags being non-zero will produce a large gap between them.

Like: Player 1: (10,20,0,0) Player 2: (10,20,10,10)

These won't be nearest neighbors.
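To make the gap concrete, here is a small base R sketch using the two players above and a Manhattan distance (matching `distance=1`). The variable names are only for illustration.

```r
p1 <- c(10, 20, NA, NA)   # two seasons played
p2 <- c(10, 20, 10, 10)   # four seasons played

# Zero-filling treats the missing lags as if they were real zeros.
p1_zero <- ifelse(is.na(p1), 0, p1)
d_zero <- sum(abs(p1_zero - p2))

# Comparing only the lags both players actually have.
shared <- !is.na(p1) & !is.na(p2)
d_shared <- sum(abs(p1[shared] - p2[shared]))

d_zero    # 20: all of it comes from the filled-in zeros
d_shared  # 0: identical on the two seasons they share
```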

Second thought: split by number of seasons, then combine. Basically, I would take an individual person's seasons played, let's say four. Compare him to everyone using all four seasons and get the closest 20 neighbors, then do the same using 3 seasons, then 2. So I will get a dataset of 60. Then remove duplicates, sort by closeness, and only use the top 20. There are a couple of reasons this doesn't work. I would have to do the weighted average outside of the function. Also, because more lags are being compared, the four-season people will probably come out farther away, so distances computed on different numbers of lags are probably not comparable.
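One possible way around the comparability problem is to average the distance over only the lags two players share, so that a two-lag comparison and a four-lag comparison land on the same scale. The sketch below is hand-rolled base R, not kknn; `na_dist` and `na_knn_predict` are hypothetical helper names, and the lag weights and inverse-distance weighting are assumptions chosen for illustration.

```r
# Hypothetical helper: weighted mean absolute difference over shared lags.
na_dist <- function(x, y, w) {
  shared <- !is.na(x) & !is.na(y)
  if (!any(shared)) return(Inf)          # nothing to compare on
  sum(w[shared] * abs(x[shared] - y[shared])) / sum(w[shared])
}

# Hypothetical helper: predict one player from the k closest training rows,
# using an inverse-distance weighted average of their outcomes.
na_knn_predict <- function(x_new, X, y, w, k = 2) {
  d   <- apply(X, 1, na_dist, y = x_new, w = w)
  nn  <- order(d)[seq_len(k)]
  wts <- 1 / (d[nn] + 1e-8)              # small constant avoids division by zero
  sum(wts * y[nn]) / sum(wts)
}

# Training data; NA marks a season the player did not have.
X <- cbind(
  lag1 = c(.105, .205, .155, .134, .190, .180),
  lag2 = c(.20, .22, .150, .170, .178, .160),
  lag3 = c(NA, NA, .150, .21, .18, NA)
)
cy <- c(.190, .180, .170, .190, .20, .13)
w  <- c(1.0, 0.6, 0.3)                   # assumed weights, older lags count less

pred_three_lags <- na_knn_predict(c(.2, .17, .19), X, cy, w, k = 2)
pred_two_lags   <- na_knn_predict(c(.2, .17, NA),  X, cy, w, k = 2)
```

Because the distance is divided by the sum of the weights actually used, players with fewer shared lags are not automatically closer or farther, and everything happens in one pass rather than three separate neighbor searches.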

Here is a reproducible example of why the NAs don't work:

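library(kknn)
names=c("Helton","Bonds","Bagwell","Kent","Trout","Pedroia")
cy=c(.190,.180,.170,.190,.20,.13)
lag1=c(.105,.205,.155,.134,.190,.180)
lag2=c(.20,.22,.150,.170,.178,.160)
# lag3 is NA for players with only two prior seasons
lag3=c(NA,NA,.150,.21,.18,NA)
# data.frame() keeps the stats numeric; cbind() first would coerce every column to character
a=data.frame(names,cy,lag1,lag2,lag3)
new=c("Parker","Morneau")
lag1n=c(.2,.18)
lag2n=c(.17,.15)
lag3n=c(.19,.16)
# test columns need the same names as the formula terms (lag1, lag2, lag3)
b=data.frame(new,lag1=lag1n,lag2=lag2n,lag3=lag3n)
check=kknn(cy~lag1+lag2+lag3,a,b,k=2,distance=1,kernel='optimal')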

If there are any other functions or packages that may accomplish this idea I would love to hear about them. I am not tied to this function or package. Thanks!