I want to know whether different doses yield a statistically significant difference in mean drug exposure between the two groups. Even when I "artificially" calibrate the doses to produce very similar mean exposures in both groups, every statistical test I try returns a very low p-value, despite the very small difference between the group means. I suspect this is due to the very large sample size (n = 1000 per group).
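To illustrate the sample-size effect I suspect, here is a minimal sketch. It is not my actual analysis: it assumes equal group sizes, a common standard deviation, and a normal approximation to the two-sample test. The function name `approx_two_sided_p` and the standardized difference of 0.2 are hypothetical and chosen only for illustration.

```python
import math

def approx_two_sided_p(standardized_diff, n_per_group):
    """Approximate two-sided p-value for a difference in means between two
    equal-sized groups sharing one SD (normal approximation, an assumption)."""
    # z statistic: the mean difference divided by its standard error,
    # with the difference expressed in units of the common SD.
    z = standardized_diff * math.sqrt(n_per_group / 2)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical standardized difference (mean difference / SD), not from my data.
d = 0.2
for n in (50, 1000):
    print(f"n = {n:4d} per group: approximate p = {approx_two_sided_p(d, n):.2g}")
```

Under these assumptions the same standardized difference is far from significant at n = 50 per group but highly significant at n = 1000 per group, because the standard error shrinks with the square root of the sample size while the difference itself does not change.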

However, if I reduce the sample size (to, say, 50 virtual drug exposures), the mean exposure becomes very sensitive to the sampling procedure, because the samples are drawn from a distribution whose standard deviation is high relative to its mean. Repeating the same analysis on different datasets can then give very different mean exposures.
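The instability at small sample sizes can be reproduced with a short simulation. This is only a sketch: the log-normal distribution and its parameters are assumptions standing in for my real exposure distribution (they give a standard deviation larger than the mean), and `sample_means` is a hypothetical helper name.

```python
import random
import statistics

def sample_means(n, repeats, rng):
    """Draw `repeats` datasets of size n and return the mean of each one."""
    means = []
    for _ in range(repeats):
        # Assumed exposure distribution: log-normal with SD larger than the mean.
        sample = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

rng = random.Random(0)  # fixed seed so the sketch is reproducible
spread_small = statistics.stdev(sample_means(50, 200, rng))
spread_large = statistics.stdev(sample_means(1000, 200, rng))
print(f"SD of the sample mean across datasets, n = 50:   {spread_small:.3f}")
print(f"SD of the sample mean across datasets, n = 1000: {spread_large:.3f}")
```

The spread of the sample mean across repeated datasets is much larger at n = 50 than at n = 1000, which matches what I observe when I rerun the analysis on different small datasets.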

Is this a case where I should focus on the "biological relevance" of the difference rather than on its statistical significance? Can you suggest a different approach for judging the relevance of the difference based on robust criteria?