# Factor analysis with categorical variables containing missing values

#### Blain Waan

##### New Member
Hello,

I am trying to run a factor analysis with 32 categorical variables. They are ordinal and come from a questionnaire with fixed answer options. Different questions had different options, coded as 1, 2, 3, 4, 5, etc., so a given number means something different from one question to the next. In other words, this is not a Likert-type scale where, for example, 1 means good, 2 means moderate, and 3 means bad for every question. One question, for example, reads:

"How often do senior management visit the wards to talk to staff?"

rarely or never ..................... 1
around once a year................... 2
around once a month.................. 3
around once a week................... 4

For another question:

"What is the average amount of training (per person) received by a management staff?"

Less than a day ..................... 1
Less than a week .................... 2
One to two weeks .................... 3

Etc.

Moreover, I have missing values of two kinds. Some are simply due to non-response: the respondent did not select any answer option for that question. Others arise from questions of the following type:

5) Do you create formal work teams in your institution?

1="NO" 2="YES"

(Please skip questions 6 and 7 if your answer to this question is 1="NO")

6) How many members form the work team? (for example)

7) What is the criterion of selecting team members? (for example)

Now those who answered "NO" to question 5 will not answer questions 6 and 7; they resume at question 8. This is another source of missing information, or gaps, in the data set. Mainly because of this type of missing value, omitting missing cases listwise throws away a lot of information.

My actual number of observations is 212, but it reduces to only 42 when I use na.omit(data).
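
A quick way to see where the 212 rows go is to count missing values per variable and to separate skips caused by the filter question from genuine non-response. The following is a minimal base R sketch on made-up data; the column names `q5` to `q8` and all values are hypothetical stand-ins for the real questionnaire.

```r
# Toy data standing in for the real questionnaire (all values are made up).
set.seed(1)
n  <- 212
q5 <- sample(1:2, n, replace = TRUE)                        # 1 = "NO", 2 = "YES"
q6 <- ifelse(q5 == 2, sample(1:4, n, replace = TRUE), NA)   # skipped when q5 is "NO"
q7 <- ifelse(q5 == 2, sample(1:3, n, replace = TRUE), NA)   # skipped when q5 is "NO"
q8 <- sample(1:4, n, replace = TRUE)
q8[sample(n, 15)] <- NA                                     # ordinary item non-response
dat <- data.frame(q5, q6, q7, q8)

# Missing values per variable, and the number of rows listwise deletion keeps.
n_missing  <- colSums(is.na(dat))
n_complete <- sum(complete.cases(dat))   # same count as nrow(na.omit(dat))

# Split the gaps in q6 into "skipped by design" and genuine non-response.
by_design   <- sum(is.na(dat$q6) & dat$q5 == 1)
nonresponse <- sum(is.na(dat$q6) & dat$q5 == 2)

print(n_missing)
print(c(complete = n_complete, by_design = by_design, nonresponse = nonresponse))
```

If most of the loss turns out to be "by design", the skip pattern rather than respondent behaviour is driving the drop from 212 to 42.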

So, I want to ask two things:

1) What kind of correlation should I use as input for factor analysis? Some have suggested polychoric correlations. But can I really assume underlying normality for these categorical variables?

2) How do I handle the missing values for categorical variables?

Best,

Blain Waan

#### spunky

##### Can't make spagetti
> 1) What kind of correlation should I use as input for factor analysis? Some have suggested polychoric correlations. But can I really assume underlying normality for these categorical variables?

and why not? the polychoric correlation is actually (and surprisingly, at least to me) somewhat robust to moderate skewness in the latent, continuous distribution that underlies the categorical manifest variables. since you mentioned na.omit(data) i'm assuming you're using R here, where polychor(), if my memory serves me right, produces a chi-square statistic to evaluate the plausibility of the assumed bivariate normal distribution. for the case of two variables you can do (assuming you've already installed the 'polycor' package):
Code:
library(polycor)
polychor(x, y, ML = TRUE, std.err = TRUE)  # x, y: your two ordinal data vectors
and get that chi-square test of fit. now, if assuming normality seems like too much of a stretch, you can always use a distribution-free estimation method like weighted least squares.
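
For more than two variables, one option (a sketch under stated assumptions, not necessarily what was meant above) is `hetcor()` from the same 'polycor' package. It computes polychoric correlations between all pairs of ordered factors in a data frame and can work from pairwise-complete observations. The simulated data and the names `q1` to `q4` are assumptions made purely for illustration; the resulting matrix is then handed to base R's `factanal()`.

```r
library(polycor)  # assumed to be installed: install.packages("polycor")

# Simulate four ordinal items driven by one latent factor (illustrative only).
set.seed(42)
n <- 212
f <- rnorm(n)
latent <- sapply(1:4, function(j) 0.7 * f + rnorm(n, sd = sqrt(0.51)))

# Each question gets its own set of cut points, hence its own number of options.
cuts <- list(c(-0.5, 0.5), c(-1, 0, 1), c(0), c(-0.8, 0.2, 1.1))
items <- lapply(1:4, function(j)
  cut(latent[, j], breaks = c(-Inf, cuts[[j]], Inf), ordered_result = TRUE))
dat <- as.data.frame(setNames(items, paste0("q", 1:4)))
dat$q2[sample(n, 20)] <- NA          # some non-response on one item

# Bivariate normality test for a single pair, as in the call above.
pc <- polychor(dat$q1, dat$q4, ML = TRUE, std.err = TRUE)
print(c(chisq = pc$chisq, df = pc$df))

# Polychoric correlation matrix for all items, using pairwise-complete data.
het <- hetcor(dat, ML = FALSE, std.err = FALSE, use = "pairwise.complete.obs")
R <- het$correlations

# Use the matrix as input to a one-factor model.
fa <- factanal(covmat = R, factors = 1)
print(fa$loadings)
```

Pairwise deletion keeps more rows than `na.omit()`, but each correlation is then based on a different subset of respondents, so it is worth comparing against the other missing-data approaches discussed in this thread.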

> 2) How do I handle the missing values for categorical variables?

this is actually an interesting one. in my case, i'd try both full-information maximum likelihood and multiple imputation and see whether they give you a similar answer.
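
As a rough illustration of the multiple imputation route, the sketch below uses the 'mice' package, whose default imputation method for ordered factors with more than two levels is a proportional odds model (`"polr"`). The data are simulated, the names `q1` to `q4` are hypothetical, and averaging the polychoric matrices across imputations is only a crude way of pooling, assumed here for brevity. Whether items skipped by design should be imputed at all is a separate question that this sketch does not settle.

```r
library(mice)     # assumed to be installed
library(polycor)  # assumed to be installed

# Simulate four ordinal items from one latent factor (illustrative only).
set.seed(7)
n <- 212
f <- rnorm(n)
make_item <- function(breaks) {
  cut(0.7 * f + rnorm(n, sd = sqrt(0.51)),
      breaks = c(-Inf, breaks, Inf), ordered_result = TRUE)
}
dat <- data.frame(q1 = make_item(c(-0.5, 0.5)),
                  q2 = make_item(c(-1, 0, 1)),
                  q3 = make_item(c(-0.8, 0.2, 1.1)),
                  q4 = make_item(c(-0.3, 0.6)))
dat$q2[sample(n, 25)] <- NA
dat$q4[sample(n, 25)] <- NA

# Five imputed data sets; ordered factors fall back on mice's default method.
imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)
completed <- complete(imp, action = "all")   # list of 5 completed data frames

# Polychoric matrix per imputed data set, then a simple average across them.
cors <- lapply(completed, function(d)
  hetcor(d, ML = FALSE, std.err = FALSE)$correlations)
R_pooled <- Reduce(`+`, cors) / length(cors)
print(round(R_pooled, 2))
```

Comparing `R_pooled` with a pairwise-complete polychoric matrix gives a first impression of how sensitive the correlations are to the way the missing values are handled.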

#### Blain Waan

##### New Member
Thanks spunky. This was helpful. But can I ask you a related question that I noticed later?

When the responses were collected, dichotomous variables with the answer options "yes" and "no" were coded and tabulated as 1="yes" and 2="no". Likewise, for the question on whether turnover was increasing, the answer options were coded and tabulated as 1="rapidly", 2="stable", 3="shrinking" and 4="don't know".

There is actually no reason for such coding, and I suspect that polychoric correlations calculated from this tabulation will not be the correct ones. I think that for most dichotomous variables 0="no" and 1="yes" is appropriate. Moreover, for the turnover question I mentioned, 0="don't know", 1="shrinking", 2="stable" and 3="rapidly" should be used. Since I want to find factors among these variables that affect the profitability of a company, I think the answers should be scored so that the option most commensurate with profitability gets the highest value and the "don't know" type gets the lowest value.

So, what do you think, should I recode the original tabulation accordingly?

#### noetsi

##### No cake for spunky
I think the real question here is not whether you should use polychoric correlations (you should) but why you are getting missing values for some questions. Unless you can assume (and you really can't) that people are not answering these questions at random, ignoring the missing values will distort your results, because it is likely that, had they responded, the set of responses would differ from the partial set you actually got. For example (a made-up case to show my point), people who are not satisfied might not want to say so, so they just would not respond to that question. This gives the impression that the population is more satisfied than it is. And I assume it could change the factors.

There are methods like multiple imputation to address this.

#### spunky

##### Can't make spagetti
> Unless you can assume (and you really can't) that people are not answering these questions at random, ignoring the missing values will distort your results.

they most certainly aren't responding at random. if you look at the OP's statement of the problem you get a clear indication that this looks a lot like some sort of MAR mechanism, because their response to one question (in the OP's example, question #5) generates missing data on questions #6 and #7. this is actually an interesting problem and i'll certainly add it to my to-do list of Monte Carlo simulation problems. part of me thinks that this could be categorised as a 'missing by design' problem but the other part tells me that it is more of a 'missing at random' issue because it is certainly very probable that quite a few people would end up answering questions #6 and #7. there's actually an interesting approach to this kind of missingness through multi-group factor analysis but i'm not sure whether this is what the OP is aiming for here.

#### spunky

##### Can't make spagetti
> There is actually no reason for such coding, and I suspect that polychoric correlations calculated from this tabulation will not be the correct ones. I think that for most dichotomous variables 0="no" and 1="yes" is appropriate. Moreover, for the turnover question I mentioned, 0="don't know", 1="shrinking", 2="stable" and 3="rapidly" should be used. Since I want to find factors among these variables that affect the profitability of a company, I think the answers should be scored so that the option most commensurate with profitability gets the highest value and the "don't know" type gets the lowest value.
>
> So, what do you think, should I recode the original tabulation accordingly?

you're certainly right. it is important to re-code those questions so that the interpretation of the correlations is meaningful. if not, what gets estimated as your threshold $$\tau_{i}$$ will not correspond to what the question intends to measure and you'd end up with a meaningless correlation matrix.

to be very, very honest with you i'm struggling a little bit to work around the 'dont know' issue... how much data would you lose if you were to throw away the 'dont knows'?
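
A minimal base R sketch of such a recode, assuming the codes described above (1="rapidly", 2="stable", 3="shrinking", 4="don't know" and 1="yes", 2="no"). The vectors `turnover_raw` and `teams_raw` are made-up examples, and turning "don't know" into `NA` is only one of the options under discussion here, not a settled recommendation.

```r
# Hypothetical raw codes as originally tabulated.
turnover_raw <- c(1, 2, 3, 4, 2, 1, 4, 3)   # 4 = "don't know"
teams_raw    <- c(1, 2, 2, 1, 1, 2, 1, 2)   # 1 = "yes", 2 = "no"

# "Don't know" is not a point on the ordinal scale, so treat it as missing here.
turnover_raw[turnover_raw == 4] <- NA

# Reverse the order so that higher values mean faster growth.
turnover <- factor(turnover_raw, levels = c(3, 2, 1),
                   labels = c("shrinking", "stable", "rapidly"),
                   ordered = TRUE)

# Put "no" below "yes" for the dichotomous item.
teams <- factor(teams_raw, levels = c(2, 1),
                labels = c("no", "yes"), ordered = TRUE)

print(table(turnover, useNA = "ifany"))
print(as.integer(turnover))   # 3 2 1 NA 2 3 NA 1
```

Storing the items as ordered factors keeps the intended category order explicit, so functions that expect ordinal input do not have to rely on the raw numeric codes.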

#### Blain Waan

##### New Member
> they most certainly aren't responding at random. if you look at the OP's statement of the problem you get a clear indication that this looks a lot like some sort of MAR mechanism, because their response to one question (in the OP's example, question #5) generates missing data on questions #6 and #7. this is actually an interesting problem and i'll certainly add it to my to-do list of Monte Carlo simulation problems. part of me thinks that this could be categorised as a 'missing by design' problem but the other part tells me that it is more of a 'missing at random' issue because it is certainly very probable that quite a few people would end up answering questions #6 and #7. there's actually an interesting approach to this kind of missingness through multi-group factor analysis but i'm not sure whether this is what the OP is aiming for here.

Hello spunky, yes, I think you have understood my question perfectly. Question #5 generates missing data on questions #6 and #7. This can be categorized as a 'missing by design' problem. The people who were supposed to answer questions 6 and 7, given their response to question 5, actually did answer them (I am lucky enough!). But what should I do about this 'missing by design' problem? Note that question 7 is categorical. So, can I create a category 0="no work team" and fill in the missing values with 0? Is that a sound thing to do statistically? For question 6 I could take 0 as the number of members, because having no formal work team is equivalent to having 0 members. (I'm sorry for my desperate thoughts about the problem!)

And you are right, I can drop the don't knows and treat them as missing. But they were collected as an answer. So, what is your suggestion about it?

#### spunky

##### Can't make spagetti
> So, can I create a category 0="no work team" and fill in the missing values with 0? Is that a sound thing to do statistically? For question 6 I could take 0 as the number of members, because having no formal work team is equivalent to having 0 members.

uhmm... i don't think creating a 0 category would help. please keep in mind that these are categorical variables. a 0 does not mean 'absence of' or anything like that. it just means you create another category. honestly, i think the best option here is to go with what i mentioned previously: either you attack this through full-information maximum likelihood/multiple imputation or you handle it through a multi-group factor-analytic design, which i know people have done research on. as you can imagine, you'd have one group made up of those who answered YES to question 5 and another made up of those who answered NO.

> And you are right, I can drop the don't knows and treat them as missing. But they were collected as an answer. So, what is your suggestion about it?

to be honest, the more i think about this the less sure i am. 'don't know' is a completely different category from what you would have in your regular scale. this answer makes absolutely no sense within the options that you have. and you cannot give it a 0 because 'don't know' doesn't imply the lowest point of turnover allowed by your scale. honestly, i'd either hit the literature on latent variable modelling to see what people have done in these cases or just throw them out completely. every time i think i've got a way around this i end up finding problems with my attempt at a solution. i'd either consider them as missing data or throw them out completely. but, once again, this seems like a common design. surely someone has written something about it somewhere.

#### Blain Waan

##### New Member
> this seems like a common design. surely someone has written something about it somewhere.

I am searching the net and will certainly post here if I find anything, but so far I haven't found any clue. Yes, this seems like a common problem. If you find it written up somewhere, please give me a link.