I'm new to multilevel modeling and am currently working on a business project whose data has a multilevel structure. I have some ideas about how to approach the problem, so I will lay out my reasoning and you can tell me whether it is right or wrong. I need someone who knows this kind of modeling well. Any suggestions would be much appreciated.
Data:
I can't show the data because it is confidential, but I will explain it with an example. The data were collected by asking clients to rate, from 1 to 10, their experience with a certain business. The columns are:
  • Client: Client id.
  • Characteristic: A characteristic, such as "Price". There are 11 characteristics. (All industries and businesses have the same characteristics.)
  • Business: The business the client belongs to. There are 49 businesses.
  • Industry: The industry the client belongs to. There are 13 industries.
  • Score: Score from 1 to 10 for the experience.
  • X1, X2, ..., Xn: Variables at the individual level, such as age, gender, etc.
(The total number of rows is approximately 10,000.)
[Attached image: Screen Shot 2018-07-05 at 5.55.39 PM.png]

Understanding data structure:
  • An Industry can have multiple Businesses.
  • A Business belongs to only one Industry.
  • A Business can have multiple Characteristics.
  • A Client can rate multiple Characteristics.
So I think the data is structured as in Fig. 1. Because there are two multiple-membership links (with equal weights, I guess?), and to make the structure easier to handle in lme4, I could assume that every Characteristic rating comes from just one Client, which gives Fig. 2. But I still have the multiple membership between Business and Characteristic. So, can I use Characteristic as a fixed effect and use the data structure in Fig. 3? (I think this may be problematic because of the number of Characteristics: too many parameters.) What are your thoughts on this?

From now on, just for the sake of the example, I will assume a strictly hierarchical structure.
(A strict hierarchical link from Client to Industry: Client -> Characteristic -> Business -> Industry.)
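To make this assumption concrete, here is how I would write the strictly hierarchical random-intercept model. The notation is my own (an assumption, not taken from any reference): $n$ indexes a single score, and $i(n)$, $b(n)$, $c(n)$ are the Industry, the Business within that Industry, and the Characteristic within that Business for score $n$.

```latex
g\big(\mathbb{E}[\mathrm{Score}_n \mid u, v, w]\big)
  = \beta_0 + u_{i(n)} + v_{b(n)} + w_{c(n)},
\qquad
u_i \sim \mathcal{N}(0, \sigma_I^2),\quad
v_b \sim \mathcal{N}(0, \sigma_B^2),\quad
w_c \sim \mathcal{N}(0, \sigma_C^2)
```

Here $g$ would be the identity link for a linear mixed model and the log link for a Poisson model.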
Questions that I want to answer:
  1. Is there a significant difference of Score by Industry?
  2. Is there a significant difference of Score by Business?
  3. Is there a significant difference of Score by Characteristic?
  4. How much do the characteristics in each industry contribute to the score?
  5. Which characteristics are rated best on average?
  6. How much do the characteristics in each business contribute to the score?
  7. Which industries can be considered equal to the country's average?
  8. Which businesses can be considered equal to their industry's average?
  9. How much does the person's gender contribute to the score, by industry?
Approach:
Because of the questions above, I want to fit a regression with Score as the response variable.

Looking at the behaviour of the response variable, I am treating it as count data, so I will use a Poisson response to fit the data better. For that reason I will use the glmer function from the lme4 package.
So, to answer the questions above, I think I should use this code (I will treat Industry, Business and Characteristic as random because of the number of parameters otherwise):

fit <- glmer(score ~ (1|Industry/Business/Characteristic), family=poisson, data=mydata)

Which is the same as (I think):

fit <- glmer(score ~ (1|Industry) + (1| Industry:Business) + (1|Industry:Business:Characteristic), family=poisson, data=mydata)
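A quick way I could check that the two formulas describe the same grouping structure is sketched below. It runs on made-up toy data (not my real data; the data frame `toy` and the helper `n_levels` are hypothetical names) and only builds the model structure with lme4's `glFormula()`, without fitting anything, then compares the number of levels of each grouping factor.

```r
library(lme4)

set.seed(1)
# Made-up toy data (NOT the real data): 3 industries, 4 businesses each,
# 5 characteristics per business, 6 scores per cell.
toy <- expand.grid(rep = 1:6,
                   Characteristic = paste0("c", 1:5),
                   Business = paste0("b", 1:4),
                   Industry = paste0("i", 1:3))
toy$score <- rpois(nrow(toy), lambda = 6)

f_nested   <- score ~ (1 | Industry/Business/Characteristic)
f_explicit <- score ~ (1 | Industry) + (1 | Industry:Business) +
  (1 | Industry:Business:Characteristic)

# glFormula() parses the formula and builds the random-effects structure;
# reTrms$flist holds one grouping factor per random-effects term.
n_levels <- function(f) {
  flist <- glFormula(f, data = toy, family = poisson)$reTrms$flist
  sort(unname(sapply(flist, nlevels)))
}

levels_nested   <- n_levels(f_nested)
levels_explicit <- n_levels(f_explicit)
print(levels_nested)
print(levels_explicit)
```

If both calls report the same level counts, the `/` shorthand and the explicit interaction terms define the same grouping factors, even though lme4 may label them differently.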

This assumes correlation among observations within the same Industry, within the same Industry and Business, and within the same Industry, Business and Characteristic. But I know that I should first check whether there is a significant grouping effect in the data (by Industry, Business and Characteristic), by fitting a simple linear regression and comparing it with the null model (after this analysis I will move to glmer and Poisson). For this I used lm and lmer:

fit <- lm(score ~ 1, data = mydata)

Checking significant group factors:
nullmodel <- lmer(score ~ (1 | Industry), data = mydata)
anova(nullmodel,fit)
###refitting model(s) with ML (instead of REML)###

Data: mydata
Models:
fit: score ~ 1
nullmodel: score ~ (1 | Industry)
          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
fit        2 293601 293619 -146798   293597
nullmodel  3 289522 289549 -144758   289516 4080.9      1  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

nullmodel <- lmer(score ~ (1 | Characteristic), data = mydata)
anova(nullmodel,fit)
###refitting model(s) with ML (instead of REML)###

Data: mydata
Models:
fit: score ~ 1
nullmodel: score ~ (1 | Characteristic)
          Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)
fit        2 293601 293619 -146798   293597
nullmodel  3 291810 291837 -145902   291804  1793      1  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

nullmodel <- lmer(score ~ (1 | Business), data = mydata)
anova(nullmodel,fit)
###refitting model(s) with ML (instead of REML)##

Data: mydata
Models:
fit: score ~ 1
nullmodel: score ~ (1 | Business)
          Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)
fit        2 293601 293619 -146798   293597
nullmodel  3 286396 286423 -143195   286390  7207      1  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

OK, from these results I can conclude that there is a significant Industry, Business and Characteristic grouping factor, so I think I have answered my first three questions.
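Besides significance, I could also look at how large each grouping effect is. The sketch below is only an illustration on made-up toy data (not my real data; `toy`, `ind_eff`, `bus_eff` and `fit_toy` are hypothetical names). It fits a nested lmer model and uses `VarCorr()` to express each variance component as a share of the total variance.

```r
library(lme4)

set.seed(2)
# Made-up toy data (NOT the real data), with a true effect at each level:
# 8 industries, 6 businesses each, 10 scores per business.
toy <- expand.grid(rep = 1:10,
                   Business = paste0("b", 1:6),
                   Industry = paste0("i", 1:8))
ind_eff <- rnorm(8, sd = 1.0)   # one effect per industry
bus_eff <- rnorm(48, sd = 0.7)  # one effect per industry-business pair
toy$score <- 6 + ind_eff[as.integer(toy$Industry)] +
  bus_eff[as.integer(interaction(toy$Industry, toy$Business))] +
  rnorm(nrow(toy), sd = 1.5)

fit_toy <- lmer(score ~ (1 | Industry/Business), data = toy)

# One row per grouping term plus one row for the residual variance
vc <- as.data.frame(VarCorr(fit_toy))
vc$share <- vc$vcov / sum(vc$vcov)
print(vc[, c("grp", "vcov", "share")])
```

The `share` column would then show what fraction of the total variance sits at each level, which seems closer to "how much does each level contribute" than a p-value alone.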
But here it gets tricky for me. So far I have only used linear models, but now I will treat the distribution of my response variable (Score) as Poisson. I then compare the linear model with the Poisson model using anova.
Results:

null.lineal <- lmer(score ~ (1 | industria), data = mydata)
null.poisson <- glmer(score ~ (1 | industria), family=poisson, data = mydata)
anova(null.lineal,null.poisson)
###refitting model(s) with ML (instead of REML)##

Data: mydata
Models:
null.poisson: score ~ (1 | industria)
null.lineal: score ~ (1 | industria)
             Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)
null.poisson  2 302317 302335 -151157   302313
null.lineal   3 289522 289549 -144758   289516 12797      1  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From this, I conclude that the linear fit is better than assuming a Poisson-distributed response. So, should I just stay with the linear fit? In most of these comparisons (linear vs. Poisson) the linear model comes out better.
So:
Is my reasoning OK? Please let me know whether I am doing something wrong, or whether it is right. One of my worries is that I don't know whether there is something else I should consider.
What would your approach be to answer the questions above? Any suggestions?