I thought the following snippet from a somewhat edited email I recently wrote in reply to a question about random slopes and what it means that an effect becomes insignificant might be helpful to some. I also took it as an opportunity to updated the procedure I described at http://hlplab.wordpress.com/2009/05/14/random-effect-structure/. As always, comments are welcome. What I am writing below are just suggestions.
[...] an insignificant effect in an (1 + factor|subj) model means that, after controlling for random by-subject variation in the slope/effect of factor, you find no (by-convention-significant) evidence for the effect. Like you suggest, this is due to the fact that there is between-subject variability in the slope that is sufficiently large to let us call into question the hypothesis that the ‘overall’ slope is significantly different from zero.
[...] So, what’s the rule of thumb here? If you run any of the standard simple designs (2×2, 2×3, 2x2x2,etc.) and you have the psychologist’s luxury of plenty of data (24+item, 24+ subject [...]), the full random effect structure is something you should entertain as your starting point. That’s in Clark’s spirit. That’s what F1 and F2 were meant for. [...] All of these approaches do not just capture random intercept differences by subject and item. They also aim to capture random slope differences.
[...] here’s what I’d recommend during tutorials now because it often saves time for psycholinguistic data. I am only writing down the random effects but, of course, I am assuming there are fixed effects, too, and that your design factors will remain in the model. Let’s look at a 2×2 design:
1) find the largest model that still converges. for normal psycholinguistic data sets, you can actually often fit the full model:
- (1 + factorA * factorB | subject) + (1 + factorA * factorB | item)
but you might have to back-off, if this doesn’t converge. If so, try both:
- (1 + factorA + factorB | subject) + (1 + factorA * factorB | item)
- (1 + factorA * factorB | subject) + (1 + factorA + factorB | item)
If neither of those works, try:
- (1 + factorA + factorB | subject) + (1 + factorA + factorB | item)
etc. This will give you what I started to call “the maximal random effect structure justified by your sample”. NB: this does not mean that you can go around and say that higher random slope terms don’t matter and that your results would hold if you included those. You’re sample does not have enough data to afford that conclusion within the mixed model implementations available to you. That’s a normal caveat, I find.
At this point, you can say: I have enough data, the random effects are theoretically motivated, so I will leave it at this. Or, e.g., b/c you have reason to suspect that there are power issues, you might want to check whether you can reduce the random effect structure further. If so, continue to 2)
2) Compare the maximal model against:
- the intercept-only model: (1 | subject) + (1 | item)
compare the deviance between the two models (e.g. the chisq of anova(model1, model2)). if it’s less than 3, there is no room for any of the slopes to matter (deviance differences are cumulative). you’re done with slope tests. if not continue at 3)
3) if the comparison of the full and the intercept-only model is significant, we need to find out which slopes matter. The size of the deviance difference between the full and the intercept-only model is very instructive as it gives us an idea about how much of a deviance difference is there to be accounted for by additional slopes.
In my experience, the homogeneous nature of psycholinguistic stimuli usually means that there is not much item variance and that most of your variance will be due to subjects. This is often also visible in the size of the variance estimates of the by-subject and by-item intercepts. So, if you want to save some time, I’d recommend first looking which of the random by-subject slopes matters the most. This is done by further model comparison (e.g. using the anova(model1, model2, ..) command; although there are more complicated tests that have been argued to be better).
Usually this will result in a clear winner model. Be aware that it’s theoretically possible that two models with different, non-nested, random effect structures are equally good in terms of their deviance. In that case, write to ling-R-lang.
What else? I would always follows R default to include random covariances between different random terms by the same group (e.g. random by-subject intercepts and by-subject slopes for factorA). You can test this assumption, too (again using model comparison), but I find that it’s usually not worth removing the random covariances.
4) Of course, you can also assess whether you need a subject or item effect at all. Simple compare the intercept-only model against, e.g.:
- (1 | subject)
- (1 | item)
For example, if anova(intercept-only.model, subject-intercept-only.model) is not significant, your sample doesn’t provide evidence that you need item effects.
5) Note that, to the best of my knowledge, it’s *not* legit to test whether you might not need any random effect by comparing e.g. (1 | subject) against an ordinary linear model. See, for example, the link provided on http://hlplab.wordpress.com/2011/05/31/two-interesting-papers-on-mixed-models/.
This whole procedure may seem cumbersome, but this is a matter of implementation. To the best of my knowledge, several folks are working on implementations that make these comparisons easier [...]
HTH,
Florian
Had a thought while reading step 1 about a hypothetical (and I’d imagine relatively unlikely) situation, and thought it would be interesting to get your opinion:
The fullest model (slopes for both subjects and items) fails to converge. The next two models (slopes for subjects in one, slopes for items in the other) do converge. How do you decide which to go forward with?
I’m maybe leaning toward AIC(c) comparisons. Although, with subjects tending to have more variance — as you point out — perhaps allowing them the fuller structure makes more sense.
A more general point about whether or not to include covariances between random terms: I’ve found from my own corpus work that excluding the covariances appear to dramatically shorten the time it takes to build each model. So for anybody who has a large dataset and lots of parameters, it may well be worth taking the — comparatively short — extra time to do the model comparisons.
Finally, has the “the maximal random effect structure justified by your sample” line appeared in print anywhere? Would be nice to be able to cite it in a draft I’m writing.
Hi Ian,
yeah, what you put your finger on is what I mention a little further down as part of step 3) “Be aware that it’s theoretically possible that two models with different, non-nested, random effect structures are equally good in terms of their deviance. In that case, write to ling-R-lang.”, which was meant a bit tongue-and-cheek
. AIC itself is not suited for comparison of non-nested mixed models. One way to distinguish between non-nested models is the Deviance Information Criterion (DIC, Spiegelhalter, D. J., N. G. Best, et al. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society. Series B 64(4): 583-639), which is a generalization of AIC and BIC, though that is not for weak stomachs (I think it’s implemented in for example, the MCMCglmm package by Jarrod Hadfield 2010. MCMC Methods for Multi-response Generalized Linear Mixed Models: The MCMCglmm R Package). See also the closely related conditional AIC (cAIC, e.g. Florin Vaida and Suzette Blanchard. 2005. Conditional Akaike Information for Mixed-Effects Models. Biometrika 92 (2), 351-370).
As for your second comment, let me add something. I agree that removing the covariance term can save time. But, especially for corpus work, you often end up making minor changes to data sets or including additional controls. So you would always have to repeat that comparison. But, for what it’s worth, for most corpus-based projects there isn’t enough per-speaker data to fit many if any random slopes anyway (but then again, that very problem reduces the severity of the violation of independence for that assessment …).
Finally, thanks for asking what to cite. I’ve actually seen the term catch on (I’ve seen it in CUNY abstracts). I am supposed to write an article for PBR soon that provides some guidelines like the ones suggested here. For now, any link to the blog is appreciated (who knows maybe that means somebody will find some useful information).
Thanks for the references, Florian. I’ve been meaning to do a little research on DIC after seeing Andrew Gelman post about it a few times recently on his blog.
AICc had been sold to me on the idea that it allowed for comparisons between non-nested models, including mixed-models, so very interested to hear that this is not in fact the case! Fortunately, I very rarely have to compare non-nested models, so have just been playing it safe with LLRT.
Looking forward to the PBR paper — any attention given to diagnosing, and avoiding, overfitting would be particularly welcome!
Well, I hadn’t heard of AICc (I looked it up before I replied, albeit in a hurry). Is it by any chance the same as the conditional AIC (it didn’t seem to be). By LLRT, I assume you mean, you’re creating a superset model that you then compare against the two nested models, hoping that exaclty one comparison will be significant? Unfortunately, the situation with regard to the random effects we’re talking about is (in a somewhat indirect way) a situation where this won’t work.
Thanks for the suggestions for the paper. The primary purpose of the paper will be to come up with something that might be suitable to provide an acceptable default for reporting mixed models over simple factorial designs, plus some covariates. That’s currently the biggest challenge for editors and reviewers because most papers in Cognitive Psychology use that type of data. I think for the type of (more exciting
data you’re working on, it will always come down to user expertise and the never ending journey of learning, hehe.
Hi Florian,
just wondering where your assertion that AIC can’t be used to compare non-nested models comes from? As the “salesman” who pointed Ian in that direction, I was under the impression that it was contentious but generally accepted to be OK (see for example Anderson & Burham, http://bit.ly/ioOPFu). I’m hoping I didn’t mislead him, but if you have a good source (other than Ripley) to show that I did I’d be interested to see it…
Hi Martin, I’ll get to this later (I hope) – it’s grant writing season. The fast answer though is that my statement is too strong. Of course, the very purpose of AIC is to compare non-nested models (if they were nested we could just do model comparison). But I’ve read several papers that talk about the inappropriateness of ordinary AIC for the comparison of mixed models (see e.g. the reference to Vaida and Blanchard I gave above).
AICc is a correction made to the AIC where the adjustment takes into account the number of observations and the number of parameters, which penalizes large numbers of parameters when the data is small. I’ve also heard it suggested that because AICc compensates for the number of parameters, it allows you to still use the >2 difference criterion for model comparison (I’m still not 100% sure I believe that’s how AIC should be used) when the difference in number of parameters in the model is greater than one — which is most obviously useful for (1|group) vs (predictor|group) comparisons.
I haven’t had a chance to read the cAIC paper, but I’d suspect you’re right in thinking they’re quite different. I’d assume that AICc is also unsuitable for non-nested mixed-models.
re: LLRT, I just meant that as non-nested models rarely feature in my work, I use LLRT to compare nested models. I’m yet to really find a strong advantage to using AICc over LLRT when all the models are nested.
Any consistency in reporting of mixed-models that your paper could bring would be equally appreciated!
One issue I’ve run up against in doing this process is that the full model often can lead to higher correlations of fixed effect factors compared to models without the by-participant & by-item random slope adjustments for those fixed effect factors.
Florian, do you agree that the two options then are (1) live with the correlations if the critical effects remain significant or (2) remove the random effect correlation terms if you think the correlations are obscuring an effect?
Philip
I’d be curious to hear other opinions, but I am inclined *not* to agree with that description. Let me make a couple of distinctions first that I find somewhat crucial.
1) Fixed effect collinearity is a problem for the modeler who wants to interpret the model, not necessarily for the model quality itself. As such, increased fixed effect correlations in models with random slopes are the researchers problem to deal with. If the slopes are justified, simply removing them does not constitute a good solution to the problem. That’d be like removing an interaction that is significant, just because it makes it harder to understand the main effects. So, regardless of whether the fixed effects remain significant, the (justified) slope must go on.
2) Of course, it’s debatable when a slope is ‘justified’. In addition to assessing the answer to this question in terms of model comparison, I think it can be helpful to distinguish between random effects that we have a priori reasons to expect to matter and those that we’re just entertaining as additional controls. For example, ‘participants’ is a rather clear case of the former. If random slopes for this grouping structure are significant, you need to include them.
3) The case won’t always be as clear. Ironically, the status of the other random effects that psycholinguistic work usually controls for (‘items’) has been debated over and over again. But to make my point clearer, what keeps us from using many random effects to capture different properties of our items, e.g. the determiner type of the subject in the sentence, the semantic class of the verb in the sentence, etc. In many experiments items do not form a completely homogeneous class (i.e. between items, they vary along more dimensions that those intended by the manipulation). This becomes even clearer in corpus studies. Say I am studying effects on the duration of a word. I might include random effects for the preceding, following, and current word’s phonological representation. At some point, this approach can become rather conservative and I risk overfitting the data just as I would by including many arbitrary fixed effects.
That being said, when there are no signs of overfitting and you have enough data, I wouldn’t even see a reason to exclude those random effects.
4) In GLMMs, random effects are fit under the assumption that the differences between levels are normally distributed. This assumption may be blatantly violated, requiring either exclusion of some levels of the grouping structure or a different way to account for these differences.
5) It can be the case that you do not have enough power to find an effect, once random slopes are included. So take the case, where random slopes seem to be justified, but once you include them you don’t have an effect anymore. You can conduct a power simulation whether you would have been able to find the effect given your data size, the estimated effect size in the model without the random slope, the estimated random slope variance (covariance, etc.) from the mode with the slope. This may tell you that you had not chance detecting the effect. Now what? You still can’t just say that the effect is real, but you can tell you readers that it is there in the simpler model and that you don’t have enough data to judge whether it’s there after additional, apparently justified, controls are included. Specifically, you even will have an estimate of the proportion of samples of the size and grouping structure of your sample that would not have yielded a significant effect even if the effect is true.
Peter Graff, Bill Croft and I submitted a commentary on a recent paper by Atkinson to Linguistic Typology that contains such a power analysis. I probably shouldn’t say this publicly, but I am not yet totally happy with the paper, but as soon as we’ve done the revisions, I will upload it to http://rochester.academia.edu/tiflo/Papers.