### Some thoughts on the sensitivity of mixed models to (two types of) outliers

Outliers are a nasty problem for both ANOVA or regression analyses, though at least some folks consider them more of a problem for the latter type of analysis. So, I thought I post some of my thoughts on a recent discussion about outliers that took place at CUNY 2009. Hopefully, some of you will react and enlighten me/us (maybe there are some data, some simulations out there that may speak to the issues I mention below?). I first summarize a case where one outlier out of 400 apparently drove the result of a regression analysis, but wouldn’t have done so if the researchers had used ANO(C)OVA. After that I’ll have some simulation data for you on another type of “outlier” (I am not even sure whether outlier is the right word): the case where a few levels of a group-level predictor may be driving the result. That is, how do we make sure that our results aren’t just due to item-specific or subject-specific properties.

## One outlier that tops them all?

Some of you may know the nice 2007 CogSci paper by Very Demberg and Frank Keller entitled “Eye-tracking Evidence for Integration Cost Effects in Corpus Data“, in which they present evidence from a corpus of eye-tracking in reading that subject-extracted relative clauses (RCs) are processed with shorter reading times (on the relative clause verb) than object-extracted RCs. That’s an interesting finding supporting memory-based accounts of sentence processing (e.g. Gibson, 2000; Lewis et al., 2006). It’s a particular important finding since some alternative accounts have attributed the difference previously found in experiments to badly chosen experimental stimuli, hypothesizing that, in context, subject- and object-extracted RCs are equally easy to process. The Demberg and Keller results seem to argue against this hypothesis.

Now, at CUNY 2009, Doug Roland presented a nice poster entitled “*Relative clauses remodeled: The problem with Mixed-Effect Models“, *in which he suggests that the apparent effect found by Demberg and Keller (2007) is actually due to *one* (!) outlier labelled as object-extraction that has very high reading times. (the particular case turns out to not even be an object-extracted RC at all, but rather extraction out of an oblique PP-argument, see Roland 2009). This is out of about 400 RCs in the database.

Here, I want to focus on what some people conclude on the basis of such cases, under *the assumption that Roland’s result holds (i.e. that one single outlier created the spurious result).* As the title of Doug Roland’s poster suggests, his focus is actually not on the fact that the Demberg and Keller result may not hold (I don’t know whether this mistake was caught before the journal publication of Demberg and Keller (2009); in any case, it’s a cool paper –regardless of whether that particular effect holds or not). Rather, Doug argues that the observed sensitivity to outliers is a problem of mixed-effect models. I’ve heard this opinion elsewhere, too, and I am not sure where it comes from.

As Doug noticed on the poster, mixed effect models returned a non-significant effects (just like ANCOVA over the same data) if the necessary random slopes were included (in this case, a random slope for subject- vs object-extracted RCs). So, in this case, the problem seems to be a matter of knowing enough about mixed models to know what random effect structure would correspond to an ANCOVA approach, and -more importantly- what effect structures seem appropriate given the question one is asking. As more researchers in our field become trained in mixed models, this situation will hopefully improve. After all, it isn’t the case that everybody is born with knowledge about how to conduct AN(C)OVA either ;).

There is a bigger point though. In my experience, especially in corpus-based work, it is uncommon that researchers run models with random slopes for all fixed effects (which would correspond to the AN(C)OVA approach). I take Doug’s poster to provide a nice piece of warning for us: Maybe we *should*. Of course, such models are unlikely to yield interpretable results, because they will overfit (at least if you’re working with highly unbalanced data, e.g. because you’re one of those corpus fools ;). But, we can apply step-wise model comparison to find the *maximal justified random effect structure* (i.e. the maximal set of random effects justified by the data based on model comparison). For all those of you who shout in pain knowing how much time that would take with a model of 10-20 predictors: fear not as help is on the way. Hal Tily is working on a nice package that will allow drop1()-like behavior for mixed effect models in R. So, essentially, I am not saying anything new (as always), but I thought it would be nice to point to this ongoing discussion and to see whether anybody reacts.

## Group-level “outliers”

The second thing with regard to outliers that I’ve recently thought about –triggered by an interesting question Chuck Clifton raised. Imagine the typical psycholinguistics data here we have two levels of grouping: participants and items (or some other between-case property for that matter, e.g. verbs). Now, say we have a hypothesis about a continuous property that differs *between* one of these grouping variables. This could be a between-subject property (e.g. age if you’re sociolinguists; working memory if you’re a cognitive psychologist studying individual differences) or a between-item property (e.g. verb bias, word frequency). For the sake of discussion, let’s assume our hypothesis is about how verb bias relates to *that*-omission (yeah, it’s totally because I thought about that).

If we want to test our hypothesis, we should probably include a random intercept for the group level variable that the effect of interest varies between (verb). This is meant to ensure that potential effects we might observe for the between-group fixed effect predictor (verb bias) are not due to idiosyncratic properties associated with different levels of the group variable (verb). That, by the way, is the very reason why some researchers in sociolinguistics have recently begun to argue that sociolinguists should used mixed effect models with random speaker effects rather than ordinary regression models (see a previous blog post on this topic). But does this really help? That is, does the inclusion of random effects rule out that just a few group members (verbs) drive a spurious effect of the between-group predictor (verb bias) because they happen by chance to be correlated with the between-group predictor?

Since I didn’t know of any simulations on this questions, I started some on my own and it would be cool to hear your ideas to see whether what I did made sense and what you think of it. Also, if anybody is interested to adapt the simulation to other types of data or models, please do so and let us know what you find.

Below I assume a balanced sample, where each subject (*g1*) sees each item (*g2*) exactly once. The outcome *y *is binomially distributed. Subject and items are normally distributed, the latter with much higher variance (which corresponds to what I have seen in many corpus studies). There also is additional unexplained random variance (an attempt to create the common scenario of overdispersion; I am sure this can be done more elegantly). As by hypothesis, the continuous between-*g2* predictor *x* has no effect and the intercept is 0, too. A proportion of the levels of *g2* (verbs) is assigned to have property* g2p* (e.g. a classification in terms of verb sense), which by hypothesis has a huge effect on the outcome. I vary the number of *g1*-levels (subjects) and number of *g2*-levels (verbs), as well as the proportion of *g2*-levels (verbs) that have property *g2p*.

The hypothesis we want to test is whether *x* has a negative effect on the outcome *y* (positive, just to pick on direction).

The questions then are: (A) How often will a model that only contains the continuous between-*g2* predictor *x* return significance despite the fact that *x* does not affect *y*? (B) How often will the significant effect be in the direction (wrongly) predicted by the researcher further misleading the researcher? Put, differently, how often would a model that fails to account for the true underlying between-group difference randomly attribute the effect to *x*? And, (C) as a control, how well does a model with only the fixed effect predictor *g2p* return significance for that effect? The results are posted below the code I used for the first set of simulations (cautious, the code runs for a long time).

```
library(lme4)
myCenter <- function(x) { return( x - mean(x, na.rm=T) ) }
# assume a binomially distributed outcome y
# two groups with random effect
# group1 (e.g. subjects) has relatively low variance
# group2 (e.g. items, verbs) has relatively large variance
# additional unexplained variance (overdispersion)
#
# there is a property of group2 levels, g2p that drives the
# outcome y
#
# there is a continuous between group2 predictor x (within group1)
# that --by hypothesis-- does have NO effect.
#
# assume a balanced sample
# number of group levels
ng1= 300
ng2= 25
# proportion of group levels of g2 that have relevant property
ppg2= 0.2
model <- function(ng1, ng2, ppg2) {
# number of group levels of g2 that have relevant property
pg2= round(ng2 * ppg2, 0)
# group level predictors
g1= sort(rep(1:ng1, ng2))
g2= rep(1:ng2, ng1)
# between group2 continuous property that is NOT correlated
# with outcome y
# here probability (of something given group level)
# e.g. probability of structure given verb
x= rep(plogis(rnorm(ng2, mean=0, sd= 2.5)), ng1)
# randomly assign group levels of group2 to the group2
# group-level property that is driving outcome y
# (e.g. verb sense)
g2p= rep(sample(c(rep(1, pg2), rep(0, ng2-pg2)), ng2), ng1)
# noise (including random noise
noise= rnorm(ng1*ng2, sd=0.8)
g1noise= rnorm(ng1, sd=0.2)
g2noise= rnorm(ng2, sd=1)
# we set:
int= 0
x_coef= 0
g2p_coef= 4
p= plogis(int + x_coef * x + g2p_coef * g2p + noise + g1noise[g1] + g2noise[g2])
y= sapply(p, function(x) { rbinom(n=1,size=1,x) } )
# analysis
x <- myCenter(x)
g2p <- myCenter(g2p)
l <- lmer(y ~ 1 + (1|g1) + (1|g2), family="binomial")
l.x <- lmer(y ~ x + (1|g1) + (1|g2), family="binomial")
l.g2p <- lmer(y ~ g2p + (1|g1) + (1|g2), family="binomial")
# return coefficient of x and p-value (according to anova) of x
# return coefficient of pg2 and p-value (according to anova) of pg2
return( cbind( fixef(l.x)[2], anova(l, l.x)[[7]][2],
fixef(l.g2p)[2], anova(l, l.g2p)[[7]][2]))
}
# variables to store
d= data.frame(ng1=c(),ng2=c(),ppg2=c(),x.coef=c(),pvalue.x=c(),g2p.coef=c(),pvalue.g2p=c())
for (ng1 in c(50,100,200,400)) {
for (ng2 in seq(20,50,10)) {
for (ppg2 in seq(0.05,0.3,0.05)) {
for (r in 1:200) {
d<- rbind(d, t(c(ng1, ng2, ppg2, model(ng1,ng2,ppg2))))
}
}
}
}
save(d, file="simulation-200.RData")
```

First the answer to question (C). The real effect (i.e. the effect of *g2p*) is found almost always (99.4% of all times in one run of the above simulation) and the effect direction is always recognized correctly and within a very tight bound around the true effect. The results for question (A) and (B) are also very encouraging: The model with only the continuous between-*g2 *predictor *x* find a spurious effect in only 4.4% of all cases (close to what would be expected by chance given a significance level of p<0.05). It turns out that the coefficients are biased in the direction of the true underlying (*g2p*) effect. About 60% of the time that the model finds a spurious effect of *x*, the effect has the same direction as the effect of *g2p*. That is, depending on the whether our hypothesized effect direction happens to coincide with the unknown effect direction of *g2p*, we would be mislead into concluding the *x* had the predicted effect in about 1.7%-2.6% of all times. Figure 1 plots group sizes against the observed average p-values of *x *(each data point corresponds to 20 simulations; the plane is a linear fit; larger simulations are still running).

In sum, the simulations I have run so far suggest that mixed models with appropriate random effects are robust against a couple of group-level outliers. But more work is definitely needed here. For starters, I have only looked at balanced data, although it’s possible, if not likely, that this potential problem surfaces with unbalanced data. I just finished running some follow-up simulations to see how this compares to a model that fails to include the relevant random group effect of *g2*. Actually, **the inclusion of the random effect for the relevant group **(here *g2*) **turns out to be crucial.** If only a random intercept for *g1* is included, the spurious effect of *x * is returned in 74% of all cases (rather than 4.4%). Interestingly, the effect direction now has a 50/50 distribution. Half of the time that the effect of *x* is significant the coefficient is positive, half of the time it is negative. Thus, we would risk being mislead into concluding wrongly that *x* affects *y* in the expected direction in about 38% of all cases (compared to 1.7%-2.6% with the appropriate random effect structure). **This, by the way, should also serve as further support that sociolinguistics being interested in between speaker differences would do well in including random speaker effects in their analyses **(see Johnson, 2008, discussed in the a previous blog post)**. ** Figure 2 plots group sizes against the observed average p-values of *x *(each data point corresponds to 20 simulations; the plane is a linear fit; larger simulations are still running).

Finally, it would also be a great idea to compare these simulations to comparable data using an F2 approach, since F2-analyses have been criticized to not catch cases where a between-item effect is only driven by a few items.

Florian