Overcoming small dataset anomalies in genetic algorithm


Hello Talk Stats, this is my first post on this forum. Interestingly, typing "modeling forum" into Google took me to a series of websites that were not what I was looking for.

I am currently building the 6th version of a model designed to predict the likelihood of a particular medical condition from multifactorial genetic markers, and I would really appreciate some advice on a method to overcome the following problem. The model has proved very successful so far, but I know there is still one thing I am not accounting for statistically.

The data I have for this medical condition is particularly limited. For some parameters I have multiple candidate genetic markers; let's call this one GM1. The likelihood of developing the condition is now being shown to be related to the number of repeats of this genetic marker. However, when modeling this, I run into issues with limited data for some numbers of repeats.

Parameter 1
For instance, for 12 repeats of this genetic marker I could have 1000 people, 100 of whom develop the condition. I feed this into my model by comparing the percentage of people with 12 repeats who had the condition against the actual number observed in a new test dataset, so in this case 10% goes into my polynomial regression, which yields an 8.7% chance of developing the condition. However, in the same polynomial regression, 35 repeats of the genetic marker is also input as 10% (1 confirmed case out of 10 people) and yields the same 8.7% chance of developing the condition.
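To make the issue concrete, here is a minimal sketch (not part of the model itself) that puts an approximate 95% Wilson score interval around each of the two 10% figures. The function name `wilson_interval` and the choice of the Wilson interval are assumptions made for illustration; other interval methods would show the same pattern.

```python
import math

def wilson_interval(cases, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= cases <= n:
        raise ValueError("need n > 0 and 0 <= cases <= n")
    p = cases / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# The two groups from the example: both have an observed rate of 10%.
lo_big, hi_big = wilson_interval(100, 1000)   # 12 repeats
lo_small, hi_small = wilson_interval(1, 10)   # 35 repeats

print(f"12 repeats: 10% with interval {lo_big:.3f} to {hi_big:.3f}")
print(f"35 repeats: 10% with interval {lo_small:.3f} to {hi_small:.3f}")
```

The interval for 1 case in 10 is many times wider than the interval for 100 cases in 1000, which is exactly the information lost when both enter the regression as a bare 10%.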

My model works by assigning different weights to over 600 parameters in order to best predict the actual multifactorial causation within the genetic profile. For instance, parameter 2 would be a different genetic marker, with a polynomial regression performed on the percentage of model-data patients against the number actually observed. As with parameter 1, a 10% chance could be input into the polynomial regression from several different numbers of repeats of this genetic marker.

What I need to know is how to adjust for the different dataset sizes, that is, for the size of the dataset behind each original % chance of developing the condition. I am open to suggestions for a method to achieve this.
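One possible kind of adjustment, sketched here purely as an illustration, is to shrink each observed rate towards an overall rate, with small groups pulled hardest. The names `shrunken_rate`, `pooled_rate` and `prior_strength`, and the values 0.05 and 20, are hypothetical; they are not taken from the model or the data, and in practice both would have to be estimated.

```python
def shrunken_rate(cases, n, pooled_rate, prior_strength):
    """Blend an observed rate with pooled_rate.

    Equivalent to adding `prior_strength` pseudo-patients who have the
    condition at `pooled_rate`; the smaller n is, the more they dominate.
    """
    if n < 0 or not 0 <= cases <= n:
        raise ValueError("need n >= 0 and 0 <= cases <= n")
    return (cases + prior_strength * pooled_rate) / (n + prior_strength)

# Hypothetical settings, chosen only to show the effect.
POOLED_RATE = 0.05      # assumed overall rate across all repeat counts
PRIOR_STRENGTH = 20     # assumed weight of the pooled rate, in pseudo-patients

# repeats -> (cases, people), taken from the example in the post
groups = {12: (100, 1000), 35: (1, 10)}

adjusted = {
    repeats: shrunken_rate(cases, n, POOLED_RATE, PRIOR_STRENGTH)
    for repeats, (cases, n) in groups.items()
}

for repeats, rate in adjusted.items():
    print(f"{repeats} repeats: raw 10.0%, adjusted {rate:.1%}")
```

With these assumed settings the 1000-person group barely moves from 10%, while the 10-person group is pulled noticeably towards the pooled rate, so the two no longer enter the regression as identical inputs. Another route would be to keep the raw rates and weight each point in the regression by its group size.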

Many thanks, JIB