(I wasn't going to say any more, now that the heavyweights have got involved and are partying, but I just can't resist. I know I will probably regret it.)
We don't have a sample of n = 1. We have a sample of n = 100. The data is 90 h and 10 t. In principle it is no different from a sample of 100 children's heights from a mixed distribution (boys and girls). We can use the 100 heights and maximum likelihood to estimate the different means and SDs of the individual distributions.
In our case, the likelihood of a 90h/10t split is a very complicated function of the numbers of each type of coin in our sample of 100. The likelihood of the actual combination of numbers in our sample is, in turn, a complicated function of the proportions in the population. So, the likelihood of a 90h/10t split is a very, very complicated function of the proportions of each type of coin in the population. There are many well understood methods of maximizing a function of several variables, so in principle, the likelihood function and the maximum likelihood solution could be found.
Of course, we have no realistic chance of deriving the function and doing the actual maximization, but it is not hard to accept that a 90% hh / 10% tt / 0% ht split in the population is the most likely scenario to give a 90h/10t split in our sample.
(It is almost certainly not the actual situation, but it is the most likely. Or can you think of a more likely one?)
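For what it's worth, under the simplest reading of this setup (my own sketch, not anyone's established method: each of the 100 coins drawn independently from a population with proportions p_hh, p_fair, p_tt of double-head, fair, and double-tail coins, and flipped once), the likelihood isn't so intractable, and a crude grid search can maximize it directly:

```python
# Sketch: if each coin is drawn from the population and flipped once,
# the chance a random coin shows heads is q = p_hh + 0.5 * p_fair,
# so the likelihood of a 90h/10t split is Binomial(100, q) at 90 heads.
from math import comb

def likelihood(p_hh, p_fair, heads=90, n=100):
    q = p_hh + 0.5 * p_fair          # P(heads) for one randomly drawn coin
    return comb(n, heads) * q**heads * (1 - q)**(n - heads)

# Crude grid search over the simplex (p_tt = 1 - p_hh - p_fair).
best = max(
    (likelihood(h / 100, f / 100), h / 100, f / 100)
    for h in range(101)
    for f in range(101 - h)
)
print(best)  # any maximizer satisfies p_hh + 0.5 * p_fair ≈ 0.9
```

Note that under this simplified model the likelihood depends on the proportions only through q, so (0.9, 0, 0.1) is a maximizer but not the only one.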
Actually it is interesting that you bring up mixture distributions, as it definitely has that flavor. I'm vaguely aware that mixtures of binomials are not always identifiable in the sense of the top of page 448 of "Identifiability of mixtures of power-series distributions and related characterizations" (ism.ac.jp). That is essentially the argument I was making about the population model: even if I told you the actual chance of obtaining a head, more than one set of mixing parameters (i.e. the frequencies with which double-head/fair coins are drawn from the super-population) may produce it. So if I said there is a 90% chance of obtaining a head in this experiment, you would have 0.9 = chance of a double-head + 0.5 * chance of a fair coin. That's what the paper says, anyway...
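To make that concrete (my own numbers, just illustrating the identity above): if the per-flip head probability is 0.9, then 0.9 = p_hh + 0.5 * p_fair has a whole family of solutions, all observationally equivalent on one flip per coin:

```python
# Each pair (p_hh, p_fair) below satisfies p_hh + 0.5 * p_fair = 0.9,
# so each implies the same 90% chance of seeing a head on a single flip.
solutions = [(p_hh, 2 * (0.9 - p_hh)) for p_hh in (0.9, 0.85, 0.8)]
for p_hh, p_fair in solutions:
    q = p_hh + 0.5 * p_fair
    print(f"p_hh={p_hh:.2f}, p_fair={p_fair:.2f} -> P(heads)={q:.2f}")
```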
What I'm not so sure about is: does considering the finite-sample model, as you do, alter the identifiability problem? I think that is ultimately the kernel of what I'm trying to get at here.
It's fine that it is maximum likelihood, but likelihood is only important because it produces good estimates. If it fails to do so, what is the difference between very likely nonsense and just guessing at random? It's pretty clear that asymptotically the MLE you discuss gives an estimated number of double-heads = true double-heads + 0.5 * true fair coins? If that's right then it is asymptotically biased, and, worse yet, it gets further away from the 'right' answer as the number of coins increases!
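A quick simulation makes the point (my own numbers and setup, assuming one flip per coin and the "call every head a double-head" estimator being discussed):

```python
# Draw coins from a population that is 60% double-head, 40% fair,
# flip each once, and estimate the double-head fraction as the fraction
# of observed heads. The estimate converges to 0.6 + 0.5 * 0.4 = 0.8,
# not 0.6, and the error measured in coin *counts* grows with n.
import random

random.seed(1)
true_hh, true_fair = 0.6, 0.4

for n in (100, 10_000, 1_000_000):
    heads = sum(
        1 if random.random() < true_hh else (random.random() < 0.5)
        for _ in range(n)
    )
    est_frac = heads / n                          # estimated double-head fraction
    print(n, est_frac, abs(heads - true_hh * n))  # count error grows with n
```

So the estimated fraction settles at true_hh + 0.5 * true_fair, and the gap in estimated counts of double-head coins scales linearly with the number of coins.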
It's the 4th and I've had a few drinks but ... Is anybody really that concerned that when n=1 we might not be able to do great?