Logistic regression problem in R - Horse racing

#1
Hi All,

My question probably spans a couple of forum topics, but I decided to post it in the R forum as that's the program I am using.

I am currently in the process of fitting a logistic regression model to horse racing outcomes (1 = Winner, 0 = Loser) given a range of fundamental variables associated with the horse.

The closing market odds are the best predictor of a horse race outcome. I am first going to run a one-step logistic regression model, which will pool all of my fundamental variables and also include the market closing odds. I will then run a two-step logistic regression model (one for the fundamental variables, then another for the market probability, using different populations for each to prevent overfitting). The process closely resembles the following article: www.bjll.org/index.php/jpm/article/download/419/450 (all credit to the authors).
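To make the two setups concrete, here is a minimal sketch on simulated data. The column names (`speed_rating`, `days_since_run`, `market_prob`) are made up, the second stage combining the log of the fundamental probability with the log of the market probability is only one reading of the two-step approach, and plain `glm` stands in for whatever model the article actually uses.

```r
set.seed(42)
n <- 4000

# Simulated runners (placeholder data, not real races)
races <- data.frame(
  speed_rating   = rnorm(n),            # hypothetical fundamental variable
  days_since_run = rnorm(n),            # hypothetical fundamental variable
  market_prob    = runif(n, 0.02, 0.5)  # hypothetical closing market probability
)
# Simulate winners so that both the market and one fundamental carry signal
races$win <- rbinom(n, 1, plogis(qlogis(races$market_prob) + 0.3 * races$speed_rating))

# One-step: fundamentals and market information pooled in a single model
one_step <- glm(win ~ speed_rating + days_since_run + log(market_prob),
                family = binomial, data = races)

# Two-step: fit the fundamentals on one half of the data ...
idx        <- sample(n, n / 2)
step1_data <- races[idx, ]
step2_data <- races[-idx, ]
step1 <- glm(win ~ speed_rating + days_since_run,
             family = binomial, data = step1_data)

# ... then combine the fundamental probability with the market on the other half
step2_data$fund_prob <- predict(step1, newdata = step2_data, type = "response")
step2 <- glm(win ~ log(fund_prob) + log(market_prob),
             family = binomial, data = step2_data)

summary(one_step)$coefficients
summary(step2)$coefficients
```

Splitting the rows before the second stage is what keeps the fundamental probabilities out-of-sample for the model that combines them with the market.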

My question is about how to treat the market closing odds of a horse (i.e. the inverse of the probability) in R before fitting a model to them. I am finding that what I do with the market odds greatly impacts the R2 of the model.

For instance, if I use the market closing odds of the horse directly, I get an R2 of ~0.076672 and a log likelihood of -1718.3, which appears very low to me. However, if I simply convert the market closing odds to a probability (i.e. convert a $10 horse to a 0.1 winning probability) and run the logit model again, the R2 comes back as 0.171, which is what I'd expect from the market prediction. How can a simple 1/Winodds calculation affect the R2 and log likelihood so much?
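The comparison can be sketched as follows, on simulated data rather than the real database. The column names `win` and `winodds` are placeholders, and the sketch assumes the reported R2 is McFadden's pseudo-R2 (1 minus the ratio of the model log likelihood to the null log likelihood), which may not be the measure actually used.

```r
set.seed(1)
n <- 5000

# Simulated decimal odds and outcomes (placeholder data, not real races)
races <- data.frame(winodds = runif(n, 1.5, 50))
races$win <- rbinom(n, 1, 1 / races$winodds)

# Convert decimal odds to an implied probability, e.g. a $10 horse -> 0.1
races$implied_prob <- 1 / races$winodds

# McFadden pseudo-R2. For 0/1 outcomes the deviance equals -2 * log likelihood,
# so the deviance ratio is the same as the log likelihood ratio.
mcfadden_r2 <- function(model) {
  1 - model$deviance / model$null.deviance
}

fit_odds <- glm(win ~ winodds,      family = binomial, data = races)
fit_prob <- glm(win ~ implied_prob, family = binomial, data = races)

data.frame(
  predictor = c("winodds", "1 / winodds"),
  r2        = c(mcfadden_r2(fit_odds), mcfadden_r2(fit_prob)),
  loglik    = c(as.numeric(logLik(fit_odds)), as.numeric(logLik(fit_prob)))
)
```

The two models use the same information, but the logit is linear in a different function of the odds in each case, so the fitted curves and the resulting log likelihoods are not the same.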

The same thing keeps occurring when I take out the bookie margin and when I apply natural logarithms to these numbers. See the results of all options below:

Winodds (Bookmaker): R2 = 0.076672, log likelihood = -1718.3
Probability conversion of odds (Bookmaker): R2 = 0.171, log likelihood = -1542.6
Natural log (Winodds): R2 = 0.066927, log likelihood = -1736.4
Natural log (Adjusted probability, book margin removed so probabilities sum to 1): R2 = 0.0066727
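For reference, the four predictors listed above can be built roughly like this. The toy odds and the `race_id` column are made up for illustration, and dividing each implied probability by the race total is only one simple way of removing the book margin.

```r
# Hypothetical toy data: two races with decimal bookmaker odds per runner
odds <- data.frame(
  race_id = c(1, 1, 1, 2, 2, 2, 2),
  winodds = c(2.0, 3.0, 4.5, 1.7, 4.0, 7.0, 12.0)
)

# Raw implied probability: these sum to more than 1 within a race (the margin)
odds$implied_prob <- 1 / odds$winodds
overround <- tapply(odds$implied_prob, odds$race_id, sum)

# Remove the margin by rescaling so probabilities sum to 1 within each race
odds$adj_prob <- ave(odds$implied_prob, odds$race_id,
                     FUN = function(p) p / sum(p))

# Natural-log versions of the odds and of the adjusted probability
odds$log_winodds  <- log(odds$winodds)
odds$log_adj_prob <- log(odds$adj_prob)

overround
odds
```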

What is going on here? The results just don't make logical sense to me. Shouldn't the log models perform better? Plus, the orders of magnitude of improvement don't make sense.

The database is in Excel, and I copy and paste it into Notepad when I want to analyse the data.
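A minimal sketch of loading such a file, assuming the pasted text is tab-delimited with a header row (which is what a copy from Excel normally produces). The file and column names here are made up, and a temporary file stands in for the real one.

```r
# Write a small tab-delimited file to stand in for the text pasted from Excel
path <- tempfile(fileext = ".txt")
writeLines(c("win\twinodds",
             "1\t2.5",
             "0\t10",
             "0\t21"), path)

# read.delim() reads tab-separated text with a header row by default
races <- read.delim(path, header = TRUE)

# Check that the columns came through as numbers rather than text
str(races)
```

Checking the column types with `str()` before fitting is worthwhile, since a stray symbol such as `$` in the odds column would make R read it as text.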

Any help greatly appreciated

Thanks.
 