Compare different methods for classifying two datasets

I'm not sure if this question belongs in Applied Statistics; my apologies if it doesn't.

I have different implementations of several classification models (Discriminant Analysis, SVM, Neural Networks, Decision Trees, etc.), 40 implementations in total, and I need to compare them on two different sets of variables.

I run each implementation on both sets of variables: set A consists of 20 variables, and set B is the same as A plus 5 additional variables. For each run the performance measure is accuracy.

In addition, I run 3 resampling methods for each implementation: 10-fold cross-validation, bootstrap, and leave-one-out.

Now I'm stuck trying to draw any conclusion, because for some implementations I get better results on set A or on set B depending on the resampling method. Also, within the same model, for example Decision Trees, one implementation gives better results on set A while others give better results on set B, again depending on the resampling method.

Is there any statistical test I could apply, perhaps comparing A and B for each model? For example, using the mean accuracy over all implementations of one model (e.g., for Decision Trees, the mean accuracy on sets A and B) and statistically comparing it with the mean accuracy of SVM.


Less is more. Stay pure. Stay poor.
Well, it doesn't sound like you had a hold-out set. I get that you used CV, etc., in the model building. You are going to get lots of estimates; now, if you could apply the models to a test dataset, you could see which one ended up having the best accuracy on a new set. I usually don't see a use for a third set, but given the number of approaches, you may then want a second hold-out set from which you get your final accuracy estimate for the final selected model only.

What is the purpose of this project: to find the best model, or to see whether A or B is better? It would seem that you could just run B, and the models would drop unimportant variables during the building process.
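The hold-out idea above can be sketched in base R. The data frame `dat` and its `Class` column are made-up placeholders, not the poster's actual data:

```r
set.seed(1)  # fixed seed so the split is reproducible

# Hypothetical dataset: 200 rows, binary outcome 'Class'
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                  Class = factor(sample(c("a", "b"), 200, replace = TRUE)))

# Reserve ~25% of the rows as a hold-out set, untouched during model building
n       <- nrow(dat)
holdout <- sample(n, size = round(0.25 * n))
train   <- dat[-holdout, ]  # do CV / bootstrap / LOO inside this set only
test    <- dat[holdout, ]   # score the final selected model(s) here, once

nrow(train)  # 150
nrow(test)   # 50
```

All model selection happens on `train`; `test` is only touched once, to score whichever model was finally chosen.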
Thanks for your reply. I have 2 purposes:

- Run several classifiers to find out whether sets A and B are useful sets of variables for building predictive models, compared to previous research in the state of the art. For my problem I know that state-of-the-art accuracy is at most 80% with different sets of variables, so I wanted to try sets A and B and see what accuracy I get using different models. In previous research some authors use 10-fold cross-validation, others use leave-one-out, and others bootstrap, so I used all three.

- I want to compare whether I get higher accuracy using set A or set B, because set B is somewhat different from the sets I found in previous state-of-the-art research; so if I get high accuracy using B, that would be useful.

But for some classifiers I get higher accuracy using B and for others I get higher accuracy using A, so perhaps I could use Student's t-test or another statistical test to support my results?
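One possible sketch of the t-test idea: treat the 40 classifiers as paired observations (each classifier is scored on both A and B) and test the per-classifier differences. The accuracy values below are simulated purely for illustration:

```r
set.seed(42)

# Hypothetical accuracies (%) of the same 40 classifiers on sets A and B;
# these numbers are made up, not the poster's real results
acc_A <- runif(40, min = 60, max = 80)
acc_B <- acc_A + rnorm(40, mean = 1, sd = 2)

# The classifiers are paired across A and B, so a paired t-test applies;
# wilcox.test(acc_B, acc_A, paired = TRUE) is the nonparametric alternative
# if the differences look non-normal
t.test(acc_B, acc_A, paired = TRUE)
```

One caveat, echoed later in the thread: these accuracy estimates all come from resampling the same data, so the test's independence assumptions hold only loosely.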


How big is your sample, and what percentage of the sample is in each of the outcome groups? How does your sample size compare to those published in the literature, and is there any suspicion that your sample is inherently different from previously published samples?

Different approaches are going to fit better or worse given your data. Did you use CV, BS, and LOO for each technique? If you have a small sample, 10-fold CV may be producing subsamples that are too small and sparse, etc.
My sample is very similar to those in the state of the art; it is larger than those in many studies. It is different because we have our own tools to produce the data, and they are not exactly the same as the tools in previous studies. I mean that there are a lot of common tools, but different people use different tools. In set A all the data I have is very similar to the data involved in other studies, though not exactly the same.
I used CV, BS, and LOO for each technique, on both set A and set B, so now I have a single table with all those accuracies for each technique, and I don't know which statistical analysis to do with it. Maybe Student's t-test to compare group A and group B?


Hmm, I was curious what you were thinking with the t-test! So you are thinking of comparing the AUCs between the two model groups. I still think having a hold-out set to apply predictions to, and score, would be very beneficial here. I think I may have seen the t-test used this way before, but I can't recall a specific case. In the back of my mind I keep wondering if there is an issue with it, since all the AUCs are not actually independent. I wonder if there is an engineering approach that would address this, since it is kind of a reliability problem.

Did you use the same seed values in the analyses performed on each group, to ensure comparability of the samples?

Just curious, what program and package did you use?
Yes, I'm using the same seed. I'm working with the caret R package. And yes, I'm using two sets, a training set and a test set, as shown here:

library(caret)
library(mlbench)  # provides the Sonar dataset
data(Sonar)

## Stratified 75/25 split on the outcome
inTrain <- createDataPartition(y = Sonar$Class,  ## the outcome data are needed
                               p = 0.75,         ## the percentage of data in the training set
                               list = FALSE)     ## return a matrix of indices, not a list

I'm not sure about using a t-test; I can't figure out which test to use to show differences between the two groups for each of the sampling methods.

To put it briefly: I have 3 different sampling methods (cross-validation, leave-one-out, and bootstrap). I run 40 classifiers, getting for each one a number from 0 to 100 (the accuracy), and I run each classifier with each sampling method using two different sets of variables, A and B, where A is included in B; that is, B has all the variables in A plus 5 new variables. I need to find differences between both sets of variables.
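Given that table layout, one sketch (with invented column names and simulated accuracies, just to show the mechanics) is to run the paired A-vs-B comparison separately within each resampling method:

```r
set.seed(7)

# Hypothetical long-format table: 40 classifiers x 3 resampling methods x 2 variable sets
tab <- expand.grid(classifier = paste0("clf", 1:40),
                   resampling = c("cv10", "loo", "bootstrap"),
                   varset     = c("A", "B"))
tab$accuracy <- runif(nrow(tab), min = 60, max = 85)  # fake accuracies (%)

# For each resampling method, compare A vs B, paired by classifier
for (rs in levels(tab$resampling)) {
  sub <- tab[tab$resampling == rs, ]
  a <- sub$accuracy[sub$varset == "A"]  # classifier order matches below
  b <- sub$accuracy[sub$varset == "B"]
  p <- wilcox.test(b, a, paired = TRUE)$p.value
  cat(rs, "p-value:", round(p, 3), "\n")
}
```

Running the comparison per resampling method sidesteps the problem that CV, LOO, and bootstrap estimates of the same classifier are on slightly different footings and shouldn't be pooled naively.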


Your last post cleared many things up.

Side question: I would imagine that with some of the approaches, the B set of variables may collapse down to the A set if the 5 extra variables are of little importance (e.g., random forest).


Well, I popped over to the Web and found the following; it doesn't give you an overall winner, though.

Another thing to remember when using AUC is that some models may have drastically different false positive and false negative rates but comparable accuracy values. This won't be discernible when reviewing AUC values. I came across this when playing around with entropy and the Gini index once, where two models had identical average AUC but their predictive behavior varied drastically.
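The FP/FN point can be seen with a small made-up example: two classifiers with identical accuracy but opposite error profiles (the counts below are invented):

```r
# Made-up 2x2 confusion matrices: rows = predicted (pos, neg), cols = actual (pos, neg)
# Model 1: many false positives, few false negatives
m1 <- matrix(c(45, 15,   # predicted positive: 45 TP, 15 FP
                5, 35),  # predicted negative:  5 FN, 35 TN
             nrow = 2, byrow = TRUE)

# Model 2: few false positives, many false negatives
m2 <- matrix(c(35,  5,
               15, 45),
             nrow = 2, byrow = TRUE)

accuracy    <- function(m) sum(diag(m)) / sum(m)
sensitivity <- function(m) m[1, 1] / (m[1, 1] + m[2, 1])  # TP / (TP + FN)
specificity <- function(m) m[2, 2] / (m[2, 2] + m[1, 2])  # TN / (TN + FP)

accuracy(m1); accuracy(m2)        # both 0.80
sensitivity(m1); sensitivity(m2)  # 0.90 vs 0.70
specificity(m1); specificity(m2)  # 0.70 vs 0.90
```

Both models are "80% accurate", yet one misses far more true positives than the other; a single summary number hides that trade-off.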

P.S. You could also use repeated CV, where you repeat the 10-fold CV, say, 3 times. I'm not sure much is gained there given your sample size, which you didn't report; you also didn't report the prevalence of your outcome.