More on random slopes and what it means if your effect is no longer significant after the inclusion of random slopes

I thought the following snippet from a (somewhat edited) email I recently wrote, in reply to a question about random slopes and what it means when an effect becomes insignificant, might be helpful to some. I also took it as an opportunity to update the procedure I described at https://hlplab.wordpress.com/2009/05/14/random-effect-structure/. As always, comments are welcome. What I am writing below are just suggestions.

[…] an insignificant effect in a (1 + factor|subj) model means that, after controlling for random by-subject variation in the slope/effect of factor, you find no (by-convention-significant) evidence for the effect. As you suggest, this is due to the fact that there is between-subject variability in the slope that is sufficiently large to let us call into question the hypothesis that the ‘overall’ slope is significantly different from zero.

[…] So, what’s the rule of thumb here? If you run any of the standard simple designs (2×2, 2×3, 2×2×2, etc.) and you have the psychologist’s luxury of plenty of data (24+ items, 24+ subjects […]), the full random effect structure is something you should entertain as your starting point. That’s in Clark’s spirit. That’s what F1 and F2 were meant for. […] These approaches do not just capture random intercept differences by subject and item; they also aim to capture random slope differences.

[…] here’s what I’d now recommend (e.g., in tutorials), because it often saves time for psycholinguistic data. I am only writing down the random effects but, of course, I am assuming there are fixed effects, too, and that your design factors will remain in the model. Let’s look at a 2×2 design:

1) Find the largest model that still converges. For normal psycholinguistic data sets, you can actually often fit the full model:

  • (1 + factorA * factorB | subject) + (1 + factorA * factorB | item)

but you might have to back off if this doesn’t converge. If so, try both of the following:

  • (1 + factorA + factorB | subject) + (1 + factorA * factorB | item)
  • (1 + factorA * factorB | subject) + (1 + factorA + factorB | item)

If neither of those works, try:

  • (1 + factorA + factorB | subject) + (1 + factorA + factorB | item)

etc. This will give you what I started to call “the maximal random effect structure justified by your sample”. NB: this does not mean that you can go around and say that higher random slope terms don’t matter and that your results would hold if you included those. Your sample does not have enough data to afford that conclusion within the mixed model implementations available to you. That’s a normal caveat, I find.
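In lme4 syntax, a minimal sketch of this back-off might look as follows (rt, factorA, factorB, subject, item, and the data frame d are placeholder names; the reduced models are only fit if the full one fails to converge):

  library(lme4)

  # full random effect structure by subject and item
  m.full <- lmer(rt ~ factorA * factorB +
                   (1 + factorA * factorB | subject) +
                   (1 + factorA * factorB | item),
                 data = d)

  # one possible back-off if the full model does not converge
  m.backoff <- lmer(rt ~ factorA * factorB +
                      (1 + factorA + factorB | subject) +
                      (1 + factorA * factorB | item),
                    data = d)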

At this point, you can say: I have enough data, the random effects are theoretically motivated, so I will leave it at this. Or, e.g., because you have reason to suspect that there are power issues, you might want to check whether you can reduce the random effect structure further. If so, continue with 2).

2) Compare the maximal model against:

  • the intercept-only model: (1 | subject) + (1 | item)

Compare the deviance between the two models (e.g., the chi-square of anova(model1, model2)). If it’s less than 3, there is no room for any of the slopes to matter (deviance differences are cumulative) and you’re done with slope tests; if not, continue with 3).
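In R, that comparison is a single call (a sketch, reusing the placeholder names from above; m.int is the intercept-only model):

  m.int <- lmer(rt ~ factorA * factorB + (1 | subject) + (1 | item), data = d)

  # deviance / likelihood-ratio comparison; look at the Chisq column
  anova(m.int, m.full)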

3) If the comparison of the full and the intercept-only model is significant, we need to find out which slopes matter. The size of the deviance difference between the full and the intercept-only model is very instructive, as it gives us an idea of how much of a deviance difference there is to be accounted for by additional slopes.

In my experience, the homogeneous nature of psycholinguistic stimuli usually means that there is not much item variance and that most of your variance will be due to subjects. This is often also visible in the size of the variance estimates of the by-subject and by-item intercepts. So, if you want to save some time, I’d recommend first checking which of the random by-subject slopes matters most. This is done by further model comparison (e.g., using the anova(model1, model2, …) command, although there are more sophisticated tests that have been argued to be better).
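As a sketch (again with the placeholder names from above), one way to do this is to add one by-subject slope at a time and compare each model against the intercept-only model:

  m.slopeA <- lmer(rt ~ factorA * factorB + (1 + factorA | subject) + (1 | item), data = d)
  m.slopeB <- lmer(rt ~ factorA * factorB + (1 + factorB | subject) + (1 | item), data = d)

  anova(m.int, m.slopeA)   # does the by-subject slope for factorA matter?
  anova(m.int, m.slopeB)   # does the by-subject slope for factorB matter?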

Usually this will result in a clear winner model. Be aware that it’s theoretically possible that two models with different, non-nested, random effect structures are equally good in terms of their deviance. In that case, write to ling-R-lang.

What else? I would always follow R’s default of including random covariances between different random terms for the same grouping factor (e.g., random by-subject intercepts and by-subject slopes for factorA). You can test this assumption, too (again using model comparison), but I find that it’s usually not worth removing the random covariances.
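If you do want to test it, here is a rough sketch (assuming factorA is coded as a numeric contrast, e.g. -0.5/0.5, so that the zero-covariance parameterization below is meaningful):

  # correlated by-subject intercept and slope for factorA
  m.corr   <- lmer(rt ~ factorA * factorB + (1 + factorA | subject) + (1 | item), data = d)
  # same random effects, but with the intercept-slope covariance fixed to zero
  m.nocorr <- lmer(rt ~ factorA * factorB + (1 | subject) + (0 + factorA | subject) + (1 | item), data = d)

  anova(m.nocorr, m.corr)   # tests the covariance term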

4) Of course, you can also assess whether you need a subject or item effect at all. Simply compare the intercept-only model against, e.g.:

  • (1 | subject)
  • (1 | item)

For example, if anova(intercept-only.model, subject-intercept-only.model) is not significant, your sample doesn’t provide evidence that you need item effects.
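Again as a sketch with the placeholder models from above:

  m.subj <- lmer(rt ~ factorA * factorB + (1 | subject), data = d)
  m.item <- lmer(rt ~ factorA * factorB + (1 | item),    data = d)

  anova(m.subj, m.int)   # is there evidence for by-item intercepts?
  anova(m.item, m.int)   # is there evidence for by-subject intercepts?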

5) Note that, to the best of my knowledge, it’s *not* legit to test whether you might not need any random effect by comparing e.g. (1 | subject) against an ordinary linear model. See, for example, the link provided on https://hlplab.wordpress.com/2011/05/31/two-interesting-papers-on-mixed-models/.

This whole procedure may seem cumbersome, but this is a matter of implementation. To the best of my knowledge, several folks are working on implementations that make these comparisons easier […]

HTH,
Florian

18 thoughts on “More on random slopes and what it means if your effect is no longer significant after the inclusion of random slopes”

    Ian Finlayson said:
    June 25, 2011 at 8:49 pm

    Had a thought while reading step 1 about a hypothetical (and I’d imagine relatively unlikely) situation, and thought it would be interesting to get your opinion:

    The fullest model (slopes for both subjects and items) fails to converge. The next two models (slopes for subjects in one, slopes for items in the other) do converge. How do you decide which to go forward with?

    I’m maybe leaning toward AIC(c) comparisons. Although, with subjects tending to have more variance — as you point out — perhaps allowing them the fuller structure makes more sense.

    A more general point about whether or not to include covariances between random terms: I’ve found from my own corpus work that excluding the covariances appears to dramatically shorten the time it takes to build each model. So for anybody who has a large dataset and lots of parameters, it may well be worth taking the — comparatively short — extra time to do the model comparisons.

    Finally, has the “the maximal random effect structure justified by your sample” line appeared in print anywhere? Would be nice to be able to cite it in a draft I’m writing.

      tiflo said:
      June 25, 2011 at 9:29 pm

      Hi Ian,

      yeah, what you put your finger on is what I mention a little further down as part of step 3): “Be aware that it’s theoretically possible that two models with different, non-nested, random effect structures are equally good in terms of their deviance. In that case, write to ling-R-lang.”, which was meant a bit tongue-in-cheek ;). AIC itself is not suited for comparison of non-nested mixed models. One way to distinguish between non-nested models is the Deviance Information Criterion (DIC; Spiegelhalter, D. J., N. G. Best, et al. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 64(4): 583-639), which is a generalization of AIC and BIC, though that is not for weak stomachs (I think it’s implemented in, for example, the MCMCglmm package by Jarrod Hadfield 2010. MCMC Methods for Multi-response Generalized Linear Mixed Models: The MCMCglmm R Package). See also the closely related conditional AIC (cAIC; e.g., Florin Vaida and Suzette Blanchard. 2005. Conditional Akaike Information for Mixed-Effects Models. Biometrika 92(2), 351-370).

      As for your second comment, let me add something. I agree that removing the covariance term can save time. But, especially for corpus work, you often end up making minor changes to data sets or including additional controls, so you would always have to repeat that comparison. But, for what it’s worth, for most corpus-based projects there isn’t enough per-speaker data to fit many, if any, random slopes anyway (but then again, that very problem reduces the severity of the violation of independence for that assessment …).

      Finally, thanks for asking what to cite. I’ve actually seen the term catch on (I’ve seen it in CUNY abstracts). I am supposed to write an article for PBR soon that provides some guidelines like the ones suggested here. For now, any link to the blog is appreciated (who knows, maybe that means somebody will find some useful information).

        Ian Finlayson said:
        June 25, 2011 at 10:11 pm

        Thanks for the references, Florian. I’ve been meaning to do a little research on DIC after seeing Andrew Gelman post about it a few times recently on his blog.

        AICc had been sold to me on the idea that it allowed for comparisons between non-nested models, including mixed-models, so very interested to hear that this is not in fact the case! Fortunately, I very rarely have to compare non-nested models, so have just been playing it safe with LLRT.

        Looking forward to the PBR paper — any attention given to diagnosing, and avoiding, overfitting would be particularly welcome! 🙂

          tiflo said:
          June 25, 2011 at 10:48 pm

          Well, I hadn’t heard of AICc (I looked it up before I replied, albeit in a hurry). Is it by any chance the same as the conditional AIC? (It didn’t seem to be.) By LLRT, I assume you mean that you’re creating a superset model that you then compare against the two nested models, hoping that exactly one comparison will be significant? Unfortunately, the situation with regard to the random effects we’re talking about is (in a somewhat indirect way) a situation where this won’t work.

          Thanks for the suggestions for the paper. The primary purpose of the paper will be to come up with something that might be suitable to provide an acceptable default for reporting mixed models over simple factorial designs, plus some covariates. That’s currently the biggest challenge for editors and reviewers because most papers in Cognitive Psychology use that type of data. I think for the type of (more exciting 😉) data you’re working on, it will always come down to user expertise and the never-ending journey of learning, hehe.

        Martin Corley said:
        June 28, 2011 at 10:11 am

        Hi Florian,

        just wondering where your assertion that AIC can’t be used to compare non-nested models comes from? As the “salesman” who pointed Ian in that direction, I was under the impression that it was contentious but generally accepted to be OK (see for example Anderson & Burnham, http://bit.ly/ioOPFu). I’m hoping I didn’t mislead him, but if you have a good source (other than Ripley) to show that I did, I’d be interested to see it…

          tiflo said:
          June 28, 2011 at 2:24 pm

          Hi Martin, I’ll get to this later (I hope) – it’s grant writing season. The fast answer though is that my statement is too strong. Of course, the very purpose of AIC is to compare non-nested models (if they were nested we could just do model comparison). But I’ve read several papers that talk about the inappropriateness of ordinary AIC for the comparison of mixed models (see e.g. the reference to Vaida and Blanchard I gave above).

        Elise said:
        December 23, 2013 at 12:34 pm

        dear prof. Jaeger,

        after reading this comment I was wondering: did you already write the PBR article with guidelines? If so, where do I find it?

        I’m a master’s student working on my first data analysis (categorical production data, I’m pretty sure I need to do a mixed effects logit regression analysis and I’m also pretty sure about what fixed and random effects (and slopes) I should include). I’m having a hard time figuring out, after modeling my data in R, how to interpret it / what to do with it… I *think* parametric bootstrap confidence intervals and model comparisons might be the way to go, but I base this on reading posts to mailing lists and stack overflow. My main problem is that the published literature on this (e.g. the Baayen 2008 book) seems to be a bit outdated, since in those examples the pvals function of the languageR package is used.

        best,
        Elise Hopman
        master’s student at Radboud University Nijmegen

          tiflo responded:
          December 23, 2013 at 10:29 pm

          Hi Elise,

          first the good news. For mixed logit models, the current standard is that everybody just uses the p-values provided by the model output. Pvals.fnc never worked for those models anyway. It was meant for linear mixed models, for which lmer() didn’t provide p-values. I think for a standard data analysis project, I would simply take the p-values provided by the mixed logit model output (i.e., those based on the z-statistic). For linear mixed models, you can find a comparison of different methods to obtain p-values in Barr et al. 2013, but it seems you won’t need this, since you say you’re using mixed logit models.

          Of course, you should still make sure that your model isn’t overfitting (again, rather unlikely if you have a decent amount of data — in particular, if your variables are balanced and if the outcome probability is not all too close to 0 or 1). If you want some specific numbers that give you an idea when you should be careful about overfitting your model, have a look at Jaeger (2011) and references therein:

          Jaeger, T. F. (2011). Corpus-based Research on Language Production: Information Density and Reducible Subject Relatives. In E. M. Bender & J. E. Arnold (Eds.), Language from a Cognitive Perspective: Grammar, Usage, and Processing. Studies in honor of Tom Wasow (pp. 161–197). Stanford: CSLI Publications.

          If your model contains interactions or higher-order polynomial terms, I’d make sure to center (and, optionally, scale) your continuous variables. Centering will help reduce collinearity with higher-order terms of the same variable. Assuming approximate balance, I’d use sum-coding of any factors in the model. This makes the interpretation ANOVA-like (the effect given for that factor will indeed be its “main effect”) and it will reduce collinearity with interactions.
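          For example (a sketch; d, frequency, and a two-level factor condition are placeholder names):

            # center a continuous predictor (set scale = TRUE to also standardize it)
            d$c.frequency <- scale(d$frequency, center = TRUE, scale = FALSE)

            # sum-code a two-level factor so that its coefficient reads like an ANOVA main effect
            contrasts(d$condition) <- contr.sum(2)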

          If you have inherently correlated variables (e.g., word frequency and word length) there are some additional things to consider. A good article to read about that is Wall and Friedman (2005). But I’d say let’s cross that bridge when you get there (and make sure to email ling-R-lang if you get to that stage).

          As you might guess from this, I unfortunately never got around to writing the article. Right around that time I got caught up in a rather time-consuming review of a paper that covered related topics and I instead decided to put my energy into contributing to improving that paper (with mixed success, I am afraid ;)). I hope these answers and the above references help a little.

          Florian

        Elise said:
        December 24, 2013 at 6:52 am

        hi Florian,

        thanks a lot for your fast reply. I had been struggling before to get p-values for my fixed effects out of my model, but reading from your comment that it should be possible made me look at my dataframe again. I figured out what I was doing wrong, and I’m happy to say that the analysis now works!

        merry christmas,
        Elise

          tiflo responded:
          December 24, 2013 at 9:39 am

          I’m glad it worked. A good break to you, too.

          Florian

    Ian Finlayson said:
    June 26, 2011 at 8:50 am

    AICc is a correction to AIC whose adjustment takes into account the number of observations and the number of parameters, penalizing large numbers of parameters when the sample is small. I’ve also heard it suggested that, because AICc compensates for the number of parameters, it allows you to still use the >2 difference criterion for model comparison (I’m still not 100% sure I believe that’s how AIC should be used) when the difference in the number of parameters between the models is greater than one — which is most obviously useful for (1|group) vs. (predictor|group) comparisons.
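    For concreteness, the correction can be written as a one-line function in R (a sketch; fit is any fitted model that supports logLik() and nobs()):

      # AICc = AIC + 2k(k+1) / (n - k - 1), with k estimated parameters and n observations
      AICc <- function(fit) {
        k <- attr(logLik(fit), "df")
        n <- nobs(fit)
        AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)
      }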

    I haven’t had a chance to read the cAIC paper, but I’d suspect you’re right in thinking they’re quite different. I’d assume that AICc is also unsuitable for non-nested mixed-models.

    re: LLRT, I just meant that as non-nested models rarely feature in my work, I use LLRT to compare nested models. I’m yet to really find a strong advantage to using AICc over LLRT when all the models are nested.

    Any consistency in reporting of mixed-models that your paper could bring would be equally appreciated! 🙂

    Philip said:
    July 2, 2011 at 4:32 am

    One issue I’ve run up against in doing this process is that the full model can often lead to higher correlations among fixed effect factors compared to models without the by-participant and by-item random slope adjustments for those fixed effect factors.

    Florian, do you agree that the two options then are (1) live with the correlations if the critical effects remain significant or (2) remove the random effect correlation terms if you think the correlations are obscuring an effect?

    Philip

      tiflo said:
      July 2, 2011 at 1:19 pm

      I’d be curious to hear other opinions, but I am inclined *not* to agree with that description. Let me make a couple of distinctions first that I find somewhat crucial.

      1) Fixed effect collinearity is a problem for the modeler who wants to interpret the model, not necessarily for the model quality itself. As such, increased fixed effect correlations in models with random slopes are the researcher’s problem to deal with. If the slopes are justified, simply removing them does not constitute a good solution to the problem. That’d be like removing an interaction that is significant just because it makes it harder to understand the main effects. So, regardless of whether the fixed effects remain significant, the (justified) slope must stay in.

      2) Of course, it’s debatable when a slope is ‘justified’. In addition to assessing the answer to this question in terms of model comparison, I think it can be helpful to distinguish between random effects that we have a priori reasons to expect to matter and those that we’re just entertaining as additional controls. For example, ‘participants’ is a rather clear case of the former. If random slopes for this grouping structure are significant, you need to include them.

      3) The case won’t always be as clear. Ironically, the status of the other random effects that psycholinguistic work usually controls for (‘items’) has been debated over and over again. But to make my point clearer: what keeps us from using many random effects to capture different properties of our items, e.g., the determiner type of the subject in the sentence, the semantic class of the verb in the sentence, etc.? In many experiments items do not form a completely homogeneous class (i.e., between items, they vary along more dimensions than those intended by the manipulation). This becomes even clearer in corpus studies. Say I am studying effects on the duration of a word. I might include random effects for the preceding, following, and current word’s phonological representation. At some point, this approach can become rather conservative, and I risk overfitting the data just as I would by including many arbitrary fixed effects.

      That being said, when there are no signs of overfitting and you have enough data, I wouldn’t even see a reason to exclude those random effects.

      4) In GLMMs, random effects are fit under the assumption that the differences between levels are normally distributed. This assumption may be blatantly violated, requiring either exclusion of some levels of the grouping structure or a different way to account for these differences.

      5) It can be the case that you do not have enough power to find an effect once random slopes are included. So take the case where random slopes seem to be justified, but once you include them you don’t have an effect anymore. You can conduct a power simulation to see whether you would have been able to find the effect given your data size, the estimated effect size in the model without the random slope, and the estimated random slope variance (covariance, etc.) from the model with the slope. This may tell you that you had no chance of detecting the effect. Now what? You still can’t just say that the effect is real, but you can tell your readers that it is there in the simpler model and that you don’t have enough data to judge whether it’s there after additional, apparently justified, controls are included. Specifically, you will even have an estimate of the proportion of samples of the size and grouping structure of your sample that would not have yielded a significant effect even if the effect is true.
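      (A minimal sketch of such a simulation, assuming a recent version of lme4 and two already-fitted models m.noslope and m.slope, with the critical coefficient named “factorA1”; all of these names are placeholders:)

        library(lme4)

        p.vals <- replicate(200, {
          # simulate responses from the model WITH the slope, but plug in the
          # effect size estimated in the model WITHOUT the slope
          beta <- fixef(m.slope)
          beta["factorA1"] <- fixef(m.noslope)["factorA1"]
          y.sim <- simulate(m.slope, newparams = list(beta = beta))[[1]]
          fit <- refit(m.slope, y.sim)
          # approximate two-sided p-value from the t statistic (normal approximation)
          2 * pnorm(-abs(coef(summary(fit))["factorA1", "t value"]))
        })
        mean(p.vals < .05)   # estimated power for the critical effect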

      Peter Graff, Bill Croft and I submitted a commentary on a recent paper by Atkinson to Linguistic Typology that contains such a power analysis. I probably shouldn’t say this publicly, but I am not yet totally happy with the paper; as soon as we’ve done the revisions, I will upload it to http://rochester.academia.edu/tiflo/Papers.

    torvon said:
    September 21, 2012 at 12:40 pm

    Hey there.

    I’m not interested in running mixed models, but am still interested in something like a measure of intra-individual variation over time. Basically I am looking at depressive symptoms over time. What I am really interested in is comparing week 1 vs week 12. But I am not sure whether week 1 and 12 are more or less random weeks because people vary a lot within themselves over time, or whether week 1 and 12 are pretty reliable because the symptoms are pretty stable over time.

    I thought about my mixed model background and read up again on ways to determine whether random slopes are needed. Maybe you have a suggestion for my problem, outside of mixed models.

    I plotted some individuals to see whether the trajectories look kind of stable or kind of all over the place; however, with N=4000 and an ordered outcome it’s really hard to make sense of individual trajectories with more than 10 people 😉

    I really appreciate it, thank you!
    T.

      tiflo said:
      September 25, 2012 at 9:45 pm

      Dear T. =),

      I think you’re saying that your dependent variable is an ordered multinomial outcome? That is indeed not the easiest to analyze. You could follow the mixed model approach described, for example, in Gelman and Hill 2007, which I find very accessible. An alternative would be to fit an ordinary proportional odds model and then use a bootstrap with replacement over (subject) clusters to assess the trend over time.

      Or am I misunderstanding you, and you’re not concerned about repeated measures, but rather about whether the trend is reliable over weeks 1-12? In an (ordinary or mixed) proportional odds model this could be tested by comparing a model with only a linear term for time of measurement (week 1 to 12) against one that also includes non-linear terms (e.g., quadratic and cubic).
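      (A rough sketch of the first option, assuming a data frame d with an ordered factor symptom, a numeric week, and a subject factor; all of these names are placeholders:)

        library(MASS)   # polr() fits an ordinary proportional odds model

        boot.trend <- replicate(500, {
          # resample whole subjects (clusters) with replacement
          ids <- sample(levels(d$subject), replace = TRUE)
          d.boot <- do.call(rbind, lapply(ids, function(i) d[d$subject == i, ]))
          coef(polr(symptom ~ week, data = d.boot))["week"]
        })
        quantile(boot.trend, c(.025, .975))   # bootstrap interval for the time trend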

      HTH,
      Florian

        torvon said:
        September 26, 2012 at 5:10 pm

        Florian,

        Thank you. My main question is (sorry that this was confusing) whether I can use individual time points in my dataset to predict later outcomes (e.g., in a simple regression), or whether time points are not reliable enough because there is lots of intra-individual variation over time.

        Let’s say I want to predict treatment response to an anxiety drug at week10 by baseline anxiety factors. Let’s assume there are 3 different anxiety factors, and I have 10 weeks of data (at week2 I start giving drugs). Now I want to use the factors at week 0 to predict treatment response at week10, because I speculate that loading high on one factor (having these specific problems vs. other problems) makes the drug work less.

        What I want to know now is whether there is a lot of variation between timepoints so that it doesn’t make sense to use week0 as a predictor for week10 (because it’s unreliable, because at week 1 and week2 and week3 a single person would be all over these factors) – in this case I should probably average a couple of time points (weeks 0 – 2) and use these as predictor.

        Now, I can look at plots of individual variation of time, and I can look at correlations over time, but that doesn’t really help me (it only gives clues), and I was hoping you had an idea on this (seeing that it’s kind of similar to the “random slopes vs. no random slopes” debate, although not really)

        Hope that is less confusing now
        Eiko

        (To the other points: My outcome is indeed ordered, and my experience with R thus far has been a bit disappointing because NLME and LME4 cannot handle ordered variables, and ORDINAL (clmm) doesn’t do random slopes and covariance structures (like AR1), which are both crucial for the models I am running. I haven’t checked out the approach described in Gelman and Hill 2007, how does it differ from the ORDINAL package? I’ll also have to read up on your idea of an “ordinary proportional odds model and then use bootstrap with replacements of (subject) clusters to assess the trend over time”, do you have a literature recommendation?)

    Mixed Effect Models | Social by Selection said:
    July 14, 2015 at 10:53 am

    […] finally, here are some pages that go over some of the basic questions I had during implementation: On what happens when a significant main effect goes nonsignificant when adding random effects Examples of mixed logistic regression: One, Two Wiki on GLMM How to graph logit model when you have […]
