I am currently working with a data set that contains about 26 IVs covering almost every scale of measurement (binary, nominal, ordinal and interval). There are strong reasons to suspect that some of the variables are highly correlated with one another, while others are not strongly related to any other IV.
I came across a useful suggestion for this problem on this site: use optimally scaled variables in the FA procedure and use the derived factor scores as the IVs. But because I am inexperienced in this field, I need some expert advice on the following issues:
How should I check whether multicollinearity really exists?
I am not sure how to check for multicollinearity with such heterogeneous data. I could calculate a heterogeneous correlation matrix (or Spearman's rank correlations) by forcing myself to treat the nominal variables as ordinal, but even if I do that, below what value of the correlation coefficient can multicollinearity be ignored? I am also not sure whether this would give any insight at all, since I would still be missing something like a VIF measure!
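To make the VIF idea concrete, here is a minimal sketch of the kind of check I have in mind. It rests on assumptions of mine: nominal variables are dummy-coded against a reference level, ordinal and binary variables are treated as numeric, and each VIF is read off the diagonal of the inverse correlation matrix of the resulting columns. The helper names (dummy_code, correlation_matrix, invert, vif) and the data are made up for illustration, and the VIF of an individual dummy depends on which reference level is chosen.

```python
from statistics import mean


def dummy_code(values):
    """One 0/1 column per level, dropping the first (reference) level."""
    levels = sorted(set(values))
    return {f"is_{lvl}": [1.0 if v == lvl else 0.0 for v in values]
            for lvl in levels[1:]}


def correlation_matrix(columns):
    """Pearson correlations between equal-length numeric columns."""
    centered = [[x - mean(col) for x in col] for col in columns]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    norms = [dot(c, c) ** 0.5 for c in centered]
    k = len(columns)
    return [[dot(centered[i], centered[j]) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear columns: VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(named_columns):
    """VIF_j = j-th diagonal element of the inverse correlation matrix."""
    names = list(named_columns)
    inv = invert(correlation_matrix([named_columns[n] for n in names]))
    return {name: inv[i][i] for i, name in enumerate(names)}


# Made-up data: two interval IVs, one binary IV, one nominal IV.
design = {
    "age": [23.0, 35.0, 41.0, 52.0, 29.0, 47.0, 38.0, 60.0],
    "income": [21.0, 34.0, 39.0, 55.0, 27.0, 45.0, 36.0, 61.0],
    "smoker": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0],
}
region = ["north", "south", "east", "south", "north", "east", "south", "north"]
design.update(dummy_code(region))

vifs = vif(design)
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.2f}")
```

Whether per-column VIFs like these are meaningful for the dummy-coded nominal variables is part of what I am asking.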
Should I include only the highly correlated variables in the FA?
Say I can find two sets of variables (one containing 8 IVs and the other containing 4 IVs) that are quite highly correlated within each set. Should I then use only those 12 variables in the FA and derive factor scores for the two factors to use as IVs? My intention is to use the other 14 variables separately as IVs alongside the two derived scores. I am unsure whether I should instead run the FA on all 26 variables. Remember that in that case the factor scores would also be weighted by the 14 unrelated variables!
Is there any problem with categorizing a proportion-type DV for an ordinal logistic regression?
I have actually found that people use logistic regression on the proportions instead. Unfortunately, I don't know the number of cases (or trials) from which each proportion was calculated, so I cannot use the number of trials as weights in the logistic regression, and in that case a logistic regression may not be accurate enough. Since I only know the proportions, would it be reasonable to categorize them by a median split or a quartile split, so that I can use the result as the DV in a logistic or an ordinal logistic regression?
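To be concrete about the categorization I mean, here is a minimal sketch using only Python's standard library. The function names (median_split, quartile_split) and the proportions are made up for illustration, and the choice of the "inclusive" quantile method is an assumption of mine; other quantile definitions would move the cut points slightly.

```python
from statistics import median, quantiles


def median_split(props):
    """Binary DV: 1 if the proportion is above the sample median, else 0."""
    cut = median(props)
    return [1 if p > cut else 0 for p in props]


def quartile_split(props):
    """Ordinal DV with levels 1 to 4, cut at the sample quartiles."""
    cuts = quantiles(props, n=4, method="inclusive")  # [Q1, Q2, Q3]
    # The level is one plus the number of cut points the proportion exceeds.
    return [sum(p > c for c in cuts) + 1 for p in props]


# Made-up proportions; the underlying numbers of trials are unknown.
proportions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
binary_dv = median_split(proportions)
ordinal_dv = quartile_split(proportions)
print(binary_dv)
print(ordinal_dv)
```

Either split throws away the ordering of the proportions within each category, which is why I am asking whether it causes problems.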
Thank you for reading this thread patiently; I am hoping for some expert advice.
Regards.