Comparing Models with Different Scope

Hi All,

I'm looking for any research papers in the Statistics literature or any information, in general, about comparing the performance of two different models when one of them imposes more severe restrictions on the data. Here is my specific case:

I am trying to predict the occurrence of financial fraud using two distinct sets of regressors, each estimated via logistic regression. The 1s in my sample are cases of fraud; the 0s are cases of non-fraud. One regression requires 3 variables that are available across the entire universe of observations, while the other requires 7 different variables that are jointly available for only 2/3 of the universe. I know that listwise/casewise deletion of observations with missing values is archaic and frowned upon by statisticians but, alas, it is the standard in the (backwards) finance and accounting literature that my paper aims for. So in sum, Model A can be estimated on, say, 300,000 observations while Model B can be estimated on only 200,000.

After estimating both models, I compare their performance by taking, for each model, the ratio of the number of 1s it correctly predicts to the number of 1s in its estimation sample. I do the same for 0s, which leaves me with four ratios: 1A, 1B, 0A, and 0B. One minus these ratios gives the Type II and Type I error rates, respectively. (I know I should be comparing out-of-sample predictions, but that's a separate story.) Unsurprisingly, the limited-scope model outperforms the full-scope model. However, this methodology fails to account for the two models' differing scopes.
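To make the comparison concrete, here is a minimal sketch of the four within-scope ratios described above. The data, predictions, and the 2/3 coverage mask are all hypothetical stand-ins, not my actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical universe: y is the true label (1 = fraud, 0 = non-fraud).
n = 1000
y = rng.integers(0, 2, size=n)

# Model A covers everyone; Model B only where its 7 regressors jointly exist.
in_scope_b = rng.random(n) < 2 / 3          # roughly 2/3 of the universe
pred_a = rng.integers(0, 2, size=n)         # stand-in predictions for Model A
pred_b = rng.integers(0, 2, size=n)         # only meaningful where in_scope_b

def within_scope_rates(y, pred, scope):
    """Correct-classification rates among the observations the model can score."""
    y_s, p_s = y[scope], pred[scope]
    rate_1 = (p_s[y_s == 1] == 1).mean()    # share of in-scope 1s caught
    rate_0 = (p_s[y_s == 0] == 0).mean()    # share of in-scope 0s caught
    return rate_1, rate_0

ones_a, zeros_a = within_scope_rates(y, pred_a, np.ones(n, dtype=bool))
ones_b, zeros_b = within_scope_rates(y, pred_b, in_scope_b)

# Per the post: Type II error = 1 - rate_1, Type I error = 1 - rate_0.
type2_b, type1_b = 1 - ones_b, 1 - zeros_b
```

Note that `ones_b` and `zeros_b` never see the out-of-scope third of the universe, which is exactly the asymmetry in question.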

There's an old expression, "You miss 100% of the shots you don't take," and I wonder whether there's a statistical analog of this saying. In other words, is there any precedent for classifying omitted observations as 'misclassified' and instead reporting the following ratio: the number of 1s (or 0s) each model correctly predicts over the entire universe of 1s (or 0s)? Of course, for the full-scope model, this ratio is identical to the one in the preceding paragraph.
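The universe-denominator version of the ratio can be sketched as follows (again with hypothetical data and a hypothetical 2/3 coverage mask). It also makes explicit that this penalized rate is just the within-scope rate multiplied by the model's coverage of that class:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical universe: 1 = fraud, 0 = non-fraud.
n = 1000
y = rng.integers(0, 2, size=n)
scope = rng.random(n) < 2 / 3               # cases the restricted model can score
pred = rng.integers(0, 2, size=n)           # stand-in predictions (valid only in scope)

def universe_rate(y, pred, scope, label):
    """Correct predictions of `label` over ALL such cases; out-of-scope counts as missed."""
    correct = (pred == label) & (y == label) & scope
    return correct.sum() / (y == label).sum()

rate_1 = universe_rate(y, pred, scope, 1)
rate_0 = universe_rate(y, pred, scope, 0)

# Decomposition: universe rate = within-scope rate x class coverage.
within_1 = (pred[scope & (y == 1)] == 1).mean()
coverage_1 = (scope & (y == 1)).sum() / (y == 1).sum()
```

So the restricted model is penalized by exactly its missing coverage; with full coverage (`scope` all True) the two rates coincide, matching the observation about the full-scope model.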

More generally, how do I appropriately penalize a model for its imposed data restrictions?

Any help would be much appreciated!