Logit regression with categorical variables overrepresented in the sample

#1
Hello,

I am not sure the post title describes my problem appropriately, since I am quite a beginner in this field, but here it is:

My analysis is about the determinants that influence someone's chances of living in a low-income area (a dichotomous variable). The data are repeated cross sections of different borrowers over 12 years.

I have a categorical variable indicating the race of each borrower. After creating dummies with White as the reference category, including the 10 other variables, and running the model, I get negative coefficients for the race dummies. For example, I have a negative coefficient for the African American dummy (1 = African American, 0 = White). Relative to their numbers in the sample, a much larger share of African American borrowers than of White borrowers live in low-income areas, so I would have expected a positive coefficient. All the other coefficients make sense.
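
To make the raw comparison concrete, here is a minimal Python sketch. The counts are invented for illustration and are not the actual data. It computes each group's share in low-income areas and the difference in log odds, which is the coefficient a logit with an intercept and only this one dummy would return.

```python
import math

# Hypothetical counts, invented for illustration only (not the real data).
# Each entry: (borrowers in low-income areas, borrowers elsewhere)
counts = {
    "white": (30_000, 270_000),
    "african_american": (9_000, 21_000),
}

def share_low_income(low, other):
    """Proportion of a group's borrowers living in low-income areas."""
    return low / (low + other)

def log_odds(low, other):
    """Log odds of living in a low-income area within one group."""
    return math.log(low / other)

shares = {group: share_low_income(*c) for group, c in counts.items()}

# In a logit with an intercept and ONLY this dummy, the dummy's coefficient
# equals the difference in log odds between the two groups (the log odds ratio).
unadjusted_coef = log_odds(*counts["african_american"]) - log_odds(*counts["white"])

print(shares)
print(round(unadjusted_coef, 3))
```

This is the unadjusted comparison only. The coefficient in the full model is conditional on the 10 other variables, so the two need not have the same sign.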

Then I thought that this might be due to the larger number of White borrowers in the sample (is this even a problem in a logit analysis?). So I decided to rerun the regression after randomly deleting observations from each race category so that there would be the same number of White, Black, and Asian borrowers. I reran the regression and the coefficients were again negative.
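
The rebalancing step can be sketched roughly as follows. The group sizes and low-income rates are invented assumptions, and the data are simulated. The sketch only compares dummy-only log odds differences relative to White before and after balancing. It does not reproduce the full model with the 10 other variables.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical (group size, low-income rate) pairs, invented for illustration.
groups = {"white": (30_000, 0.10), "black": (3_000, 0.30), "asian": (2_000, 0.15)}

# Simulate one record per borrower: (race, lives_in_low_income_area)
sample = [(race, random.random() < rate)
          for race, (n, rate) in groups.items() for _ in range(n)]

def log_odds_by_race(records):
    """Log odds of living in a low-income area within each race category."""
    out = {}
    for race in groups:
        ys = [y for r, y in records if r == race]
        low = sum(ys)
        out[race] = math.log(low / (len(ys) - low))
    return out

# Balance the sample: randomly keep the same number of records per race.
n_keep = min(n for n, _ in groups.values())
balanced = []
for race in groups:
    rows = [rec for rec in sample if rec[0] == race]
    balanced.extend(random.sample(rows, n_keep))

full = log_odds_by_race(sample)
bal = log_odds_by_race(balanced)
for race in ("black", "asian"):
    # Dummy-only logit coefficient relative to the White reference category
    print(race, round(full[race] - full["white"], 3), round(bal[race] - bal["white"], 3))
```

Random deletion within a race category leaves that category's low-income rate unchanged in expectation, so the full-sample and balanced differences should agree up to sampling noise. The balanced sample is simply smaller.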

Now I am really confused, since I do not know which regression is more correct: both show this negative relationship, with the second based on a sample of 37,000 observations whereas the first had 385,000.

Hope this is clear.
Thank you