More on random slopes and what it means if your effect is no longer significant after the inclusion of random slopes
I thought the following snippet from a somewhat edited email, which I recently wrote in reply to a question about random slopes and what it means when an effect becomes insignificant, might be helpful to some. I also took it as an opportunity to update the procedure I described at https://hlplab.wordpress.com/2009/05/14/random-effect-structure/. As always, comments are welcome. What I am writing below are just suggestions.
[…] an insignificant effect in a (1 + factor|subj) model means that, after controlling for random by-subject variation in the slope/effect of factor, you find no (by-convention-significant) evidence for the effect. As you suggest, this is because there is between-subject variability in the slope that is sufficiently large to call into question the hypothesis that the ‘overall’ slope is significantly different from zero.
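To make this concrete, here is a minimal sketch in lme4, assuming a hypothetical data frame d with a response rt, a two-level predictor factor, and a subject identifier subj:

    library(lme4)

    ## By-subject random intercepts and random slopes for factor:
    m <- lmer(rt ~ 1 + factor + (1 + factor | subj), data = d)
    summary(m)  ## the fixed effect of factor is now evaluated against
                ## the estimated by-subject variability in that effect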
[…] So, what’s the rule of thumb here? If you run any of the standard simple designs (2×2, 2×3, 2×2×2, etc.) and you have the psychologist’s luxury of plenty of data (24+ items, 24+ subjects […]), the full random effect structure is something you should entertain as your starting point. That’s in Clark’s spirit. That’s what F1 and F2 were meant for. […] These approaches aim to capture not just random intercept differences by subject and item, but also random slope differences.
[…] here’s what I now recommend during tutorials, because it often saves time for psycholinguistic data. I am only writing down the random effects but, of course, I am assuming that there are fixed effects, too, and that your design factors will remain in the model. Let’s look at a 2×2 design:
1) Find the largest model that still converges. For normal psycholinguistic data sets, you can actually often fit the full model:
- (1 + factorA * factorB | subject) + (1 + factorA * factorB | item)
but you might have to back off if this doesn’t converge. If so, try both:
- (1 + factorA + factorB | subject) + (1 + factorA * factorB | item)
- (1 + factorA * factorB | subject) + (1 + factorA + factorB | item)
If neither of those works, try:
- (1 + factorA + factorB | subject) + (1 + factorA + factorB | item)
etc. This will give you what I started to call “the maximal random effect structure justified by your sample”. NB: this does not mean that you can go around and say that higher random slope terms don’t matter and that your results would hold if you included them. Your sample does not have enough data to afford that conclusion within the mixed model implementations available to you. That’s a normal caveat, I find.
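Here is a sketch of this back-off sequence in lme4, under the same hypothetical 2×2 data frame d (columns rt, factorA, factorB, subject, item):

    library(lme4)

    ## The full model of step 1; back off only if this does not converge:
    m.full <- lmer(rt ~ factorA * factorB +
                     (1 + factorA * factorB | subject) +
                     (1 + factorA * factorB | item),
                   data = d)

    ## One of the intermediate back-offs (drops the by-subject slope
    ## for the interaction; the by-item counterpart is analogous):
    m.red <- lmer(rt ~ factorA * factorB +
                    (1 + factorA + factorB | subject) +
                    (1 + factorA * factorB | item),
                  data = d)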
At this point, you can say: I have enough data, the random effects are theoretically motivated, so I will leave it at this. Or, because you have reason to suspect that there are power issues, you might want to check whether you can reduce the random effect structure further. If so, continue to 2).
2) Compare the maximal model against:
- the intercept-only model: (1 | subject) + (1 | item)
Compare the deviance between the two models (e.g. via the Chisq of anova(model1, model2)). If it’s less than 3, there is no room for any of the slopes to matter (deviance differences are cumulative), and you’re done with slope tests. If not, continue at 3).
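As a sketch, continuing with the hypothetical models from above:

    ## Intercept-only model for the comparison in step 2:
    m.int <- lmer(rt ~ factorA * factorB + (1 | subject) + (1 | item),
                  data = d)

    ## anova() reports the deviance (Chisq) difference between the fits:
    anova(m.int, m.full)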
3) If the comparison of the full and the intercept-only model is significant, we need to find out which slopes matter. The size of the deviance difference between the full and the intercept-only model is very instructive, as it gives us an idea of how much of a deviance difference there is to be accounted for by additional slopes.
In my experience, the homogeneous nature of psycholinguistic stimuli usually means that there is not much item variance and that most of your variance will be due to subjects. This is often also visible in the size of the variance estimates of the by-subject and by-item intercepts. So, if you want to save some time, I’d recommend first checking which of the random by-subject slopes matters most. This is done by further model comparison (e.g. using the anova(model1, model2, ...) command, although there are more complicated tests that have been argued to be better).
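For instance, one might compare models that each add a single by-subject slope to the intercept-only model, again using the hypothetical data frame d:

    ## Which by-subject slope buys the largest deviance reduction?
    m.subjA <- lmer(rt ~ factorA * factorB +
                      (1 + factorA | subject) + (1 | item), data = d)
    m.subjB <- lmer(rt ~ factorA * factorB +
                      (1 + factorB | subject) + (1 | item), data = d)
    anova(m.int, m.subjA)  ## does the by-subject slope for factorA help?
    anova(m.int, m.subjB)  ## ... and for factorB?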
Usually this will result in a clear winner model. Be aware that it’s theoretically possible that two models with different, non-nested random effect structures are equally good in terms of their deviance. In that case, write to ling-R-lang.
What else? I would always follow R’s default of including random covariances between different random terms for the same grouping factor (e.g. random by-subject intercepts and by-subject slopes for factorA). You can test this assumption, too (again using model comparison), but I find that it’s usually not worth removing the random covariances.
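A sketch of such a test, assuming factorA has been recoded as a numeric contrast (e.g. -0.5/0.5) so that the intercept and slope terms separate cleanly:

    ## With the intercept-slope covariance (the R default):
    m.cov <- lmer(rt ~ factorA + (1 + factorA | subject), data = d)

    ## Without it: intercept and slope as independent random terms
    m.nocov <- lmer(rt ~ factorA + (1 | subject) + (0 + factorA | subject),
                    data = d)

    anova(m.nocov, m.cov)  ## tests whether the covariance parameter helps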
4) Of course, you can also assess whether you need a subject or item effect at all. Simply compare the intercept-only model against, e.g.:
- (1 | subject)
- (1 | item)
For example, if anova(intercept-only.model, subject-intercept-only.model) is not significant, your sample doesn’t provide evidence that you need item effects.
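In code, sticking with the hypothetical models from above:

    m.subj.only <- lmer(rt ~ factorA * factorB + (1 | subject), data = d)
    anova(m.subj.only, m.int)  ## not significant -> no evidence for
                               ## by-item intercepts in this sample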
5) Note that, to the best of my knowledge, it’s *not* legit to test whether you need any random effect at all by comparing, e.g., (1 | subject) against an ordinary linear model. See, for example, the link provided on https://hlplab.wordpress.com/2011/05/31/two-interesting-papers-on-mixed-models/.
This whole procedure may seem cumbersome, but this is a matter of implementation. To the best of my knowledge, several folks are working on implementations that make these comparisons easier […]