R2 in SEM decreases when adding more variables

#1
In an SEM analysis in SPSS AMOS with 9 latent constructs, I am trying to establish incremental validity by comparing a basic model with 4 constructs to the fully integrated model with all 9 constructs.

Without going too much into the details here, I discovered that while most fit indices (e.g. CFI, RMSEA) are better when more variables are added, the R2 (squared multiple correlations) of the outcome variable actually sometimes gets lower when variables are added.

Now, I am aware that R2 is not a good criterion to establish incremental validity in SEM, but it strikes me as odd that it can actually drop when more predictors are included. My understanding has always been that any additional predictor can only increase R2, as is the case in linear regression.

Basically, my question is: Can R2 decrease with more variables in an SEM, and if so, why?
 

spunky

Can't make spagetti
#3
I discovered that while most fit indices (e.g. CFI, RMSEA) are better when more variables are added
this is not particularly surprising. as more and more parameters are estimated you lose more and more degrees of freedom, which means your models get closer and closer to saturation, and a saturated model will always perfectly reproduce the covariance matrix. so yes, this is to be expected. a model with 0 degrees of freedom will result in perfect fit (a CFI of 1, an RMSEA of 0, a chi-square value of 0, etc.)
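just to make this concrete, here is a minimal lavaan sketch (toy data i'm inventing on the spot, nothing to do with your models): a path model that uses up every degree of freedom reproduces the sample covariance matrix exactly, so the fit indices are perfect by construction.

Code:
library(lavaan)

# made-up data: 3 observed variables, 200 cases
set.seed(1)
dat <- data.frame(y1 = rnorm(200))
dat$y2 <- 0.4 * dat$y1 + rnorm(200)
dat$y3 <- 0.3 * dat$y1 + 0.3 * dat$y2 + rnorm(200)

# a just-identified path model: as many free parameters as unique (co)variances
fit.sat <- sem('y3 ~ y1 + y2', data = dat)
fitMeasures(fit.sat, c("df", "chisq", "cfi", "rmsea"))
# df = 0, chisq = 0, cfi = 1, rmsea = 0 -- perfect fit, regardless of the data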



the R2 (squared multiple correlations) of the outcome variable actually sometimes gets lower when variables are added.

Now, I am aware that R2 is not a good criterion to establish incremental validity in SEM, but it strikes me as odd that it can actually drop when more predictors are included. My understanding has always been that any additional predictor can only increase R2, as is the case in linear regression.

Basically, my question is: Can R2 decrease with more variables in an SEM, and if so, why?



now, before i go on a long rant i just want us to clarify some terms (i hate it when i rant about A only for the OP to come back and say "well... i actually meant B. i'm not interested in A"). when you talk about "variables", are you referring to your factors/latent variables/constructs, or to the observed variables/indicators/items? i am having trouble following you because you talk about an "outcome variable" in the singular when (as i guess you know) the indicators/items/observed variables are the outcome variableS. and when you talk about R-squared, are you referring to the R-squared obtained from each individual regression equation relating each indicator to its factor, or to the overall R-squared of the model (you know, the one you obtain by taking the ratio of the determinant of the residual covariance matrix to the determinant of the model-implied covariance matrix and subtracting that from 1)?
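(in symbols, that last one is roughly the following, where \(\hat{\Sigma}_{res}\) is the residual covariance matrix and \(\hat{\Sigma}_{model}\) is the model-implied covariance matrix:)

\[ R^2_{overall} = 1 - \frac{\det(\hat{\Sigma}_{res})}{\det(\hat{\Sigma}_{model})} \]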
 
#4
Thanks for your replies!

Now to your questions:

1. There are no missing values and all models are computed from the exact same dataset

2. I am only talking about latent constructs, not about the indicators

3. The R2 I am referring to is the amount of variance in an endogenous latent construct that is explained by its predictors. Amos prints this into its output as "squared multiple correlations" for each endogenous variable (and also for each indicator of a reflective factor measurement, but that is not what I am referring to). In my case, a bunch of latent constructs have a few direct effects specified among them, and one of them is my main dependent variable and therefore I am interested in its R2, i.e. I want to know how much of its variance is explained by the constructs that have one-headed arrows pointing at it.
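(As far as I understand it, this "squared multiple correlation" for an endogenous construct is just one minus the ratio of its estimated disturbance variance to its total model-implied variance, i.e. roughly

\[ R^2_{\eta} = 1 - \frac{\hat{\psi}_{\eta}}{\widehat{\operatorname{Var}}(\eta)}, \]

but please correct me if AMOS computes it differently.)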
 

spunky

Can't make spagetti
#5
3. The R2 I am referring to is the amount of variance in an endogenous latent construct that is explained by its predictors. Amos prints this into its output as "squared multiple correlations" for each endogenous variable (and also for each indicator of a reflective factor measurement, but that is not what I am referring to). In my case, a bunch of latent constructs have a few direct effects specified among them, and one of them is my main dependent variable and therefore I am interested in its R2, i.e. I want to know how much of its variance is explained by the constructs that have one-headed arrows pointing at it.
thank you, this does clear a few things up. now, when you say it goes down... does it go down quite a bit? or is it just a few decimal points? it could very well be that every time you run a different model (with more or fewer latent variables predicting that one latent variable you're talking about) the likelihood optimizer settles on slightly different solutions, which is nothing too crazy to worry about. it's simply a byproduct of not having an exact solution and relying on numerical methods to obtain an answer.
 
#6
The exact values for R2 are .354 (model with 7 variables) dropping to .331 (model with 9 variables).

I reproduced these results with 1000 bootstrap samples and a 90% confidence interval, and the confidence intervals are mostly overlapping (.270 - .381 for the first model vs. .291 - .404 for the second model).
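(Side note in case anyone wants to replicate this outside AMOS: in lavaan, a comparable 90% percentile bootstrap interval for a latent R2 can be pulled roughly like below; bootstrapLavaan() and lavInspect() are lavaan functions, and my.model, my.data and the construct name "f1" are placeholders rather than my actual setup.)

Code:
library(lavaan)

# fit the structural model (my.model / my.data are placeholders)
fit <- sem(my.model, data = my.data)

# bootstrap the R2 of the endogenous construct "f1" (placeholder name)
r2.boot <- bootstrapLavaan(fit, R = 1000,
                           FUN = function(x) lavInspect(x, "rsquare")["f1"])

# percentile-based 90% confidence interval
quantile(r2.boot, probs = c(0.05, 0.95), na.rm = TRUE)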

What do you think, can I use your point regarding the ML-instability in combination with the bootstrapped confidence intervals as a justification for the lower value in my thesis?
 

spunky

Can't make spagetti
#7
What do you think, can I use your point regarding the ML-instability in combination with the bootstrapped confidence intervals as a justification for the lower value in my thesis?


uhmmm... i think one of two things could be going on here (among a myriad of other possible reasons, of course). the difference in R2s is pretty minor (.023) for a model that's somewhat larger, so it could very well be that the ML solution is just bouncing around the optimum. the other potential reason is that your model is somewhat misspecified, so you're bringing in additional unreliability that's driving the R2 down.


lemme show you with an example. i know you use AMOS so i'm assuming you're more familiar with specifying models as path diagrams as opposed to structural equations. i'm also more of an R/Mplus user than an AMOS user so a lot of my code may not make sense to you, but what matters are the results.

anyway, so what i am doing is using the R package lavaan, which fits SEM models, to generate data that follows the particular covariance structure shown in the attached picture. as you can see, this is a 3-factor model: each factor is measured by 3 indicators with population loadings of 0.7, the covariances among the factors are 0.5 *AND*, most importantly, there is a direct latent regression of Factor 1 being predicted by Factor 3 (this will come back to haunt us later).

i will use lavaan (in R) to specify the population model and generate data that matches it. my sample size will be N=1000 and my factor variances are fixed to 1.

Code:
library(lavaan)   # needed for simulateData() below

three.factor <- '
    # measurement model
    f1 =~ 0.7*y1 + 0.7*y2 + 0.7*y3
    f2 =~ 0.7*y4 + 0.7*y5 + 0.7*y6
    f3 =~ 0.7*y7 + 0.7*y8 + 0.7*y9

    # factor variances
    f1 ~~ 1*f1
    f2 ~~ 1*f2
    f3 ~~ 1*f3

    # factor covariances
    f1 ~~ 0.5*f2
    f1 ~~ 0.5*f3
    f2 ~~ 0.5*f3

    # latent regression (this is the part that matters later)
    f1 ~ 0.3*f3
'

# note: no seed is set, so exact numbers will differ slightly from run to run
datum <- simulateData(three.factor, sample.nobs = 1000)
now that i have some data, notice that i will fit a model that matches the population model, so i expect the results to look pretty good:

Code:
three.factor1 <- '
    # measurement model (first loadings freed via NA*, factors standardized)
    f1 =~ NA*y1 + y2 + y3
    f2 =~ NA*y4 + y5 + y6
    f3 =~ NA*y7 + y8 + y9

    # factor variances fixed to 1
    f1 ~~ 1*f1
    f2 ~~ 1*f2
    f3 ~~ 1*f3

    # latent regression
    f1 ~ f3
'

summary(sem(three.factor1, datum), rsquare = TRUE)
indeed, i get very good measures of fit and whatnot, but i want to draw your attention to the R2 value for the latent regression of f1 predicted by f3

Code:
R-Square:
    f1                0.545   <-- latent regression R2
    y1                0.418
    y2                0.434
    y3                0.353
    y4                0.305
    y5                0.369
    y6                0.339
    y7                0.314
    y8                0.290
    y9                0.342
so the R2 for this latent regression is 0.545. now, allow me to change my syntax slightly by adding a latent regression path that is *not* in the population and that matches the description of your situation. i am going to predict f1 using both f2 and f3 (so, adding another predictor variable). i just need to change one line of code here:


Code:
three.factor1 <- '
    # measurement model (first loadings freed via NA*, factors standardized)
    f1 =~ NA*y1 + y2 + y3
    f2 =~ NA*y4 + y5 + y6
    f3 =~ NA*y7 + y8 + y9

    # factor variances fixed to 1
    f1 ~~ 1*f1
    f2 ~~ 1*f2
    f3 ~~ 1*f3

    # latent regression (the only changed line: f2 added as a predictor)
    f1 ~ f3 + f2
'

summary(sem(three.factor1, datum), rsquare = TRUE)
aaaand... let's see what happened to the R2 of f1 that is now being predicted by f3 and f2

Code:
R-Square:

    f1                0.533   <-- latent regression R2
    y1                0.418
    y2                0.435
    y3                0.353
    y4                0.317
    y5                0.359
    y6                0.336
    y7                0.328
    y8                0.320
    y9                0.365
you see? it went down from 0.545 to 0.533 when analyzing the exact same dataset "datum" under two different models: one where f1 = f3 + error and another where f1 = f2 + f3 + error. what's the problem here? well, the regression path of f2 predicting f1 is *not* in the population. it's merely adding to the unreliability of the model, increasing the error term and bringing the R2 *down* instead of *up*.

the reason here is that R2s in SEM are NOT like R2s in regression. it is somewhat misleading to use regression as an analogy for SEM because, although they share quite a bit, they are *not* the same. multiple regression explicitly assumes the error term is uncorrelated with the predictors (no endogeneity), and handling endogeneity is one of the main problems SEM was developed to tackle. because of this, some of the properties one would expect to hold in a linear model like SEM (such as R2 never decreasing when a predictor is added) do not necessarily hold.
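just to underline the contrast, here's a quick check on the same simulated data (the unit-weighted composite scores are my own ad-hoc stand-ins for the factors, not part of the models above): with plain OLS regression, adding a predictor can never lower the R2.

Code:
# ad-hoc composite scores for each factor, only for the OLS comparison
s1 <- rowMeans(datum[, c("y1", "y2", "y3")])
s2 <- rowMeans(datum[, c("y4", "y5", "y6")])
s3 <- rowMeans(datum[, c("y7", "y8", "y9")])

summary(lm(s1 ~ s3))$r.squared        # one predictor
summary(lm(s1 ~ s3 + s2))$r.squared   # two predictors: R2 can only stay the same or go up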

hope this helps clarify a few things.

it was a fun question. i got to learn quite a bit myself :)
 
#8
Thank you for the enlightening response, very nicely illustrated and explained. I have a few follow-up questions, if you don't mind:

1. When you say "not in the population", I assume the corresponding direct effect would be very weak, i.e. close to zero in the estimated model and likely non-significant, correct?

2. Is there any kind of measure in your average SEM output that would allow me to see how much unreliability is added by an additional predictor?
 

spunky

Can't make spagetti
#9
1. When you say "not in the population", I assume the corresponding direct effect would be very weak, i.e. close to zero in the estimated model and likely non-significant, correct?


not necessarily. please keep in mind that although there is no direct effect of f2 on f1, i am specifying a covariance of 0.5 between f2 and f1. if two variables covary, you can always obtain a bivariate regression coefficient from that covariance using the formula \(\beta_{y.x}=\frac{\sigma_{x,y}}{\sigma^{2}_{x}}\) (the covariance divided by the variance of the predictor).

for this direct effect to be completely absent, f2 and f1 would need to have no paths (direct or indirect) connecting them, so that they don't covary at all. that is the main gist of mediation, right? no way of getting from one variable to another using Wright's path-tracing rules.

remember: you can have covariances without regressions BUT you cannot have regressions without covariances.
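you can actually see this in the example from before: pull the parameter estimates from the second fit (the one with f1 ~ f3 + f2) and look at the f1 ~ f2 row. the estimate is not anywhere near zero, even though that path does not exist in the population model, precisely because f1 and f2 covary.

Code:
# refit the second model (f1 ~ f3 + f2) and show only the structural paths
fit2 <- sem(three.factor1, datum)
subset(parameterEstimates(fit2), op == "~")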





2. Is there any kind of measure in your average SEM output that would allow me to see how much unreliability is added by an additional predictor?

not that i'm aware of, and i would be suspicious of any that came around. reliability under an SEM model can be tricky and, to be very honest, i don't think it's very well understood outside of the 1-factor model. it suffers from the same drawback as the R2 in multiple regression: it accounts for the explained variance that all predictors jointly contribute, but it is difficult to split it up to see how much of that jointly-contributed variance can be attributed to each predictor. such a measure would also need to deal with the fact that SEM models have both uniqueness/specific variance and random measurement error. actually, the more i think about your question, the harder i find it to believe that anyone could come up with such a measure (unless it were for very, very simple models like unidimensional ones).