VERY RELEVANT MORTGAGE-RESEARCH in this crisis

#1
Within the firm I work for, the question has arisen whether there are common characteristics among roughly 900 cases/dossiers whose applicants we prevented from actually obtaining a mortgage, because they showed that they would not be able to pay their monthly installments. If we find common characteristics, we will use them as 'red flags' in the future to find these types of (new) dossiers more easily than we do now.

Secondly, I was asked to do the same analysis on the 500 dossiers whose holders have been found to have committed fraud in their mortgage contract.

Thirdly, I was asked to find similarities between these two samples (are they samples or populations?)

I have many variables at my disposal, and many of them have more than 20 possible values. Which types of analyses should I apply in each of these studies?

Kind Regards,

Dennis
from The Netherlands
 
#2
Some more relevant information that could be useful:

It may now seem that I know nothing of statistics, but I do. It is just that in all the analyses I describe, I don't think I can use ANOVA or regression. The reason is that the dependent variable is constant: it has the same value, 'yes' (coded '1' in SPSS), for every case. For instance: all dossiers have been found to involve fraud. So this would mean that I do not even have a dependent variable, do I?

The same holds for the second study: because all cases are fraud cases, the dependent variable would be 'yes' for every one of them.

Or am I thinking about this in the wrong way?
 

CB

Super Moderator
#3
The way I see it, you're quite right: you don't seem to have any variability in the DV, so there are no immediately obvious grounds for an analysis. My first thought is that you need a control group. Go find 500 files for people who haven't committed fraud on their contracts!
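To make the control-group idea concrete: once each file carries a binary fraud indicator (1 = fraud, 0 = control), group differences on any nominal variable can be tested with a chi-square test of independence. A minimal sketch in Python; the three-level 'mortgage type' variable and all counts below are invented for illustration, not real data:

```python
# Sketch: comparing fraud vs. control files on a nominal variable.
# Rows: fraud group, control group; columns: three (made-up)
# mortgage types. Counts are fabricated illustrative numbers.
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([[200, 250, 50],    # fraud group (n = 500)
                   [150, 200, 150]])  # control group (n = 500)

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.3g}")
```

A significant result only says the distributions differ between groups; with variables of 20+ levels you would also need to watch the expected cell counts, which the chi-square test assumes are not too small.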

As for the 900 declined applicants: this doesn't strike me as a wonderfully useful data source. These aren't people who've failed to repay mortgages; they're people whose credit officers decided (on whatever basis) that they probably couldn't repay. At best you're looking at a numerical redefinition of existing decision rules. Looking at the files of people who've actually defaulted on their mortgages would make a heck of a lot more sense.

PS. If you're wanting to use the analysis to make decisions about new clients, the file groups are samples.
 
#4
Thank you very much for your reply. This has helped me a lot, because my suspicion has been confirmed.

Three more short questions though:

1: You said to compare the 500 fraud cases with 500 non-fraud cases. However, the problem with non-fraud cases is that you can never say with 100% certainty that somebody has not committed fraud. I could take 500 cases out of the database (31,500 mortgages!), but who says that some of these people will not later be found to have committed fraud as well? One never knows...

2: Which is better: comparing the 500 fraud cases with 500 non-fraud cases (if I can successfully find those, given the problem described above), or comparing them with all 31,500 mortgages (minus the 500 fraud cases)?

3: Many of my variables are in fact nominal and can take many values. For instance, 'type of mortgage' already has some 6 or 7 types... Should I make a dummy variable for each type?

Many thanks and Kind Regards,

Dennis
from The Netherlands
 

CB

Super Moderator
#5
No worries at all! Hmm, let's see...

1: You said to compare the 500 fraud cases with 500 non-fraud cases. However, the problem with non-fraud cases is that you can never say with 100% certainty that somebody has not committed fraud. I could take 500 cases out of the database (31,500 mortgages!), but who says that some of these people will not later be found to have committed fraud as well? One never knows...
Good point! This is a bit of a sticky one. However, you would expect that only a small percentage of the other cases have undetected fraud on their accounts (hopefully!). You could probably get an estimate of the base rate of fraud from the literature. You could also reduce the risk somewhat by, say, excluding from the control group any cases where there is a reasonable suspicion of fraud, e.g. ones where the fraud office at your firm has made any sort of investigation (even if fraud wasn't 'proven').

2: Which is better: comparing the 500 fraud cases with 500 non-fraud cases (if I can successfully find those, given the problem described above), or comparing them with all 31,500 mortgages (minus the 500 fraud cases)?
Technically, I think it would be better to use all 31,500 (excluding, perhaps, those with a suspicion of fraud). However, because the fraud group is only 500 in size, there would be little gain in statistical power from doing so, and if you randomly chose 500 control cases (again, perhaps excluding suspicion-of-fraud cases) you'd hope they would be reasonably representative of all the non-fraud cases. It depends to some degree on the nature of the data you have: if you're going to be sitting and manually coding every case from a paper file, 31,500 cases is a crazy idea! But if the data is all there and it's just a matter of clicking Import in SPSS, using the whole bunch might be worth it. What I would NOT suggest is choosing 500 control cases based on 'convenience' factors or any other non-random criteria: you'd run the risk that any 'differences' you observe between the control and fraud cases are due to the way they were selected, not to actual differences between fraudulent and honest customers.
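A random draw like that is a one-liner in most stats software. As a sketch, here is how it might look in Python/pandas on an invented toy table; the column names and the 'ever_investigated' flag are made up for illustration, not your real database schema:

```python
# Sketch: draw the control group at random from the non-fraud files,
# first excluding any file the fraud office has ever investigated.
# All data below is fabricated toy data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "dossier_id": np.arange(31500),
    "fraud": np.r_[np.ones(500, dtype=int), np.zeros(31000, dtype=int)],
    "ever_investigated": rng.integers(0, 2, 31500),  # toy 0/1 flag
})

# Candidate pool: no proven fraud AND never investigated.
pool = df[(df["fraud"] == 0) & (df["ever_investigated"] == 0)]

# Random draw, not a convenience selection; fixed seed for
# reproducibility.
control = pool.sample(n=500, random_state=42)
print(len(control))  # 500
```

The `random_state` argument just makes the draw repeatable so colleagues can reproduce exactly which 500 control files were used.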

3: Many of my variables are in fact nominal and can take many values. For instance, 'type of mortgage' already has some 6 or 7 types... Should I make a dummy variable for each type?
It depends on what analysis program you're going to use. In SPSS you can specify a variable as nominal and it will do the dummy coding for you in whichever analysis you choose, which saves a ton of work; other programs will require manual dummy coding, unfortunately!
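Even where manual dummy coding is needed, it's usually one function call. For what it's worth, a sketch in Python/pandas with invented mortgage-type labels; `drop_first=True` keeps k-1 dummies so one level serves as the reference category:

```python
# Sketch: dummy-coding a nominal variable outside SPSS.
# The mortgage-type labels are invented for illustration.
import pandas as pd

df = pd.DataFrame({"mortgage_type": ["annuity", "interest-only",
                                     "linear", "annuity"]})

# k levels -> k-1 dummy columns; the dropped (alphabetically first)
# level becomes the reference category.
dummies = pd.get_dummies(df["mortgage_type"],
                         prefix="type", drop_first=True)
print(list(dummies.columns))
```

Keeping k-1 rather than k dummies avoids the perfect multicollinearity you'd otherwise get in a regression with an intercept.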

Good luck with it all :)