Comparing groups with subcategories

#1
Hello,

I am doing a research project and am having trouble finding the correct statistical analysis for my data.

I would like to test whether gene length differs between plant species that grow in two different habitats. I chose 12 species from each habitat, sequenced almost all (17000) genes from each species, and measured the length of each gene. I was planning to fit a linear model like this: (length ~ habitat + species). I have three concerns.

First, each species is found in only 1 habitat (species A is always in habitat 1, never habitat 2), so certain combinations of the two variables cannot exist. I am worried that this violates the model's assumptions.

Second, the data are not exactly random samples. I measured all the possible genes from each species (not a random subset), while there are hundreds more species in each habitat that I did not sample at all (it was logistically impossible).

Finally, gene length is not normally distributed, neither within a species nor across the whole dataset. It is skewed, with many short genes and a tail of fewer, very long genes.
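To make the first concern concrete, here is a toy sketch of the problem as I understand it. The 4 species (A to D) and the dummy coding are hypothetical and only for illustration, not my real data. Because each species sits in exactly one habitat, the habitat column is the sum of the species columns for that habitat, so the design matrix for (length ~ habitat + species) has fewer independent columns than coefficients to estimate.

```python
from fractions import Fraction

# Hypothetical toy layout: species A and B grow in habitat 1, C and D in habitat 2.
habitat_of = {"A": 1, "B": 1, "C": 2, "D": 2}
columns = ["intercept", "habitat2", "speciesB", "speciesC", "speciesD"]


def design_row(species):
    """Dummy-coded row for one gene of `species` (species A is the reference level)."""
    return [1, int(habitat_of[species] == 2)] + [int(species == s) for s in "BCD"]


def matrix_rank(rows):
    """Rank by Gaussian elimination, using exact fractions to avoid rounding."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Two genes per species, so there are more rows than columns.
design = [design_row(s) for s in "ABCD" for _ in range(2)]

print("columns:", len(columns))
print("rank:   ", matrix_rank(design))
```

The rank comes out one lower than the number of columns, which is why I think the habitat effect cannot be separated from the species effects when both are entered as ordinary fixed terms.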

Are there any statistical tests that could handle this type of data? I was thinking of testing each of the 17000 genes one at a time to see whether its average length is greater in habitat 2 than in habitat 1, and then checking whether the proportion of genes that are longer in habitat 2 is greater than 50%. Would that be the best approach?
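Here is a minimal sketch of the gene-by-gene idea, to show what I mean. It assumes each gene can be matched across all species, and it uses a made-up `genes` dictionary (gene name mapped to the lengths in habitat 1 species and habitat 2 species) with only 3 species per habitat. The function names are my own, and the exact two-sided sign test is just one possible way to check the 50% question.

```python
from math import comb
from statistics import mean


def sign_test_two_sided(k, n):
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5."""
    # The null distribution is symmetric, so double the smaller tail and cap at 1.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def proportion_longer(genes):
    """Count genes whose mean length is greater in habitat 2, then run a sign test."""
    longer = shorter = 0
    for habitat1_lengths, habitat2_lengths in genes.values():
        diff = mean(habitat2_lengths) - mean(habitat1_lengths)
        if diff > 0:
            longer += 1
        elif diff < 0:
            shorter += 1
        # Genes with exactly equal means are dropped, as in a standard sign test.
    n = longer + shorter
    return longer, n, sign_test_two_sided(longer, n)


# Hypothetical toy data: gene -> (lengths in habitat 1 species, lengths in habitat 2 species).
genes = {
    "gene1": ([900, 1100, 1000], [1200, 1300, 1250]),
    "gene2": ([300, 350, 320], [400, 390, 410]),
    "gene3": ([2500, 2400, 2600], [2700, 2650, 2800]),
    "gene4": ([150, 160, 155], [170, 165, 180]),
    "gene5": ([5000, 5200, 5100], [5300, 5400, 5250]),
}

longer, n, p_value = proportion_longer(genes)
print(f"{longer} of {n} genes are longer in habitat 2, sign test p = {p_value}")
```

One thing I am unsure about is that this counts every gene as an independent trial, even though the same 24 species contribute to every gene.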

Thank you very much for reading.