Some R code to understand the difference between treatment and sum (ANOVA-style) coding


Below, I’ve posted some code that

  1. generates an artificial data set
  2. creates both treatment (a.k.a. dummy) and sum (a.k.a. contrast or ANOVA-style) coding for the data set
  3. compares the lmer output for the two coding systems
  4. suggests a way to test simple effects in a linear mixed model

Mostly, though, the code is just meant as a starting point for people who want to play with a balanced (non-Latin-square) data set in order to understand the consequences of coding predictor variables in different ways.

# number of items and subjects
ni= 32
ns= 40
# predictors
# s for subject, i for item
# x for predictor 1, z for predictor 2, xz for their interaction
s= sort(rep(1:ns, ni))
l= rep(c(rep(1,ni), rep(2,ni), rep(3,ni), rep(4,ni)), ns/4)
i= rep(1:ni, ns)
# decode the four condition codes in l into the two binary predictors
# and their interaction
# (l=1: x=0,z=0; l=2: x=0,z=1; l=3: x=1,z=0; l=4: x=1,z=1)
x= ifelse(l %in% c(3,4), 1, 0)
z= ifelse(l %in% c(2,4), 1, 0)
xz= x * z
# noise
noise= rnorm(ns*ni, sd=1)
snoise= rnorm(ns, sd=2)
inoise= rnorm(ni, sd=0.4)
# underlying group means of dependent variable y
# let's imagine the following situation
# a main effect of z of 3.75,
# no effect of x
# a mild interaction of -0.5:
#       z=0  z=1
# x=0   10   14
# x=1   10.25 13.75
# using the traditional way of describing the formula,
# where the above group means are described as follows:
#       z=0        z=1
# x=0   int        int+zcoef
# x=1   int+xcoef  int+xcoef+zcoef+xzcoef
# we set:
int= 10
xcoef= .25
zcoef= 4
xzcoef= -.5
y= int + xcoef * x + zcoef * z + xzcoef * xz + noise + snoise[s] + inoise[i]
# test: look at the group means (keep in mind that there is noise,
# so they won't match perfectly to the table given above).
lapply(split(y, paste("x=",x,",z=",z)), mean)
# dataframe
d <- data.frame(s, i, x, z, y)
d$s= as.factor(d$s)
d$i= as.factor(d$i)
d$x= as.factor(d$x)
d$z= as.factor(d$z)
# analysis
library(lme4)   # for lmer()
# treatment coding (which is also R's default for factors)
d$x.treat <- ifelse(d$x == "1", 1, 0)
d$z.treat <- ifelse(d$z == "1", 1, 0)
l.treat <- lmer(y ~ x.treat * z.treat + (1|s) + (1|i), d)
# sum coding (which gives us ANOVA-style main effects)
d$x.sum <- ifelse(d$x == "1", .5, -.5)
d$z.sum <- ifelse(d$z == "1", .5, -.5)
l.sum <- lmer(y ~ x.sum * z.sum + (1|s) + (1|i), d)

For treatment coding, the coefficients correspond to simple effects. For example, the coefficient for x.treat is the estimated effect of x at z=0. For sum coding, on the other hand, the coefficients correspond to main effects. For example, the coefficient for x.sum is the estimated effect of x marginalizing over z.
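The same codings can also be obtained through R's built-in contrast functions instead of hand-built ifelse() columns; a minimal standalone sketch (the factor x below is a toy stand-in, not the data frame from the code above):

```r
# a two-level factor, as in the example above
x <- factor(c("0", "1", "0", "1"))
# treatment (dummy) coding: baseline level coded 0, other level coded 1
contr.treatment(2)
# sum coding: first level coded 1, second level coded -1
contr.sum(2)
# rescale to the -0.5/0.5 coding used above, so that the coefficient
# equals the full difference between the levels (level "1" minus level "0")
contrasts(x) <- contr.sum(2) / -2
contrasts(x)
```

With the contrasts attached to the factor, the original factor can be entered into the model formula directly (e.g. y ~ x * z), and lmer will build the coded columns itself.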

So how would one test simple effects in such a framework? It would be nice if one could just assess the coefficients in the treatment-coded model, but most of the time that won't work, since treatment coding inevitably leads to collinearity between the simple predictors and their interactions (see for yourself by running the above code). I think the simplest way is model comparison, comparing models with and without the simple effects.

l.treat.woX <- lmer(y ~ z.treat + x.treat : z.treat + (1|s) + (1|i), d)
l.treat.woZ <- lmer(y ~ x.treat + x.treat : z.treat + (1|s) + (1|i), d)
l.treat.woXZ <- lmer(y ~ x.treat + z.treat + (1|s) + (1|i), d)
anova(l.treat, l.treat.woX)
anova(l.treat, l.treat.woZ)
anova(l.treat, l.treat.woXZ)

Comments? What do you think? I have tested this a bit, since I wasn't (and am not) entirely sure, and it seems to work.


8 thoughts on “Some R code to understand the difference between treatment and sum (ANOVA-style) coding”

    Chris said:
    December 24, 2009 at 11:24 pm

    Just wondering, I’m thinking of investing in Stefan Gries’s “Statistics for Linguistics With R: A Practical Introduction.” Are you familiar with this book? Can you recommend it, or a better book? I’m looking to improve my understanding of statistical analysis of corpora research as well as my ability to perform the analysis myself.


      tiflo said:
      December 25, 2009 at 3:29 am

      Hi Chris, I am actually not too familiar with Stefan’s book. I can, though, definitely recommend Gelman and Hill (2007) as a great introduction to some of these models from a Bayesian-inspired perspective. Even if you’re not a Bayesian, it’s a wonderful book. I’ve been using it for some workshops. It comes with its own packages, it’s clearly written, and it’s sophisticated but accessible, though it’s not directed at language researchers.

      I also think that Harald Baayen‘s book “Analyzing Linguistic Data in R” is great in terms of providing a ton of tricks and detail along with R-code particularly relevant to language researchers. He also discusses linear and logit mixed models, which are particularly useful to corpus-based research. Austin Frank wrote a really nice review of Baayen’s book for Lingua. It should be available via our lab webpage right after the holidays (Frank & Jaeger, submitted).

      For a couple of introductory articles to mixed models (though not directly intended for corpus-based work), I recommend the special issue in JML, where Baayen, Davidson, and Bates (2008), Barr (2008), Dixon (2008), and I (Jaeger, 2008) all talk about mixed models (see also the tutorial slides from WOMM and the LSA regression workshop, or some of the HLP mini classes). The Baayen et al. article is really great. Another recent article that discusses these models for sociolinguists in a wonderfully accessible way is Johnson (2009). If you want to see a mixed logit model applied to corpus-data and don’t mind reading a lot ;), then I could send you two submitted papers where I apply multilevel logit models to investigate syntactic reduction (that-omission and whiz-deletion) in spontaneous speech (5-25 variables controlling for a variety of processing effects, plus random effects; unbalanced sample). A similar approach can be found in Bresnan et al. (2007), who investigate the dative alternation.

      I hope some of this is useful and: I enjoyed reading your blog =).


    alan said:
    March 10, 2010 at 2:05 pm

    Related to this, how would one graph the results of an analysis that has categorical factors that are sum coded? For example, if I want to visualize graphically the effect of gender on response rates, and gender is sum-coded, wouldn’t I be unable to get a two-column graph, because the sum-coded version of gender is a scalar? I’m concerned about this because I am writing a paper that has a model with many categorical variables. I had to sum-code these variables to avoid collinearity, but that raises questions about how to report the results, either in prose or graphically. Would you recommend graphing the interaction using a non-centered/treatment-coded model, but presenting the coefficients using the centered/sum-coded one? Any recommendations would be much appreciated!


      tiflo said:
      March 10, 2010 at 3:36 pm

      Hi Alan, I am not sure I understand what you mean. Let me start with some basics that I think you didn’t ask about (but just to be sure that we’re on the same page).

      Regardless of whether your factor is sum coded or not, the interpretation of the coefficient depends on the distance in units between the values assigned to level1 vs. level2 of the factor. For example, if x=1 for male and x=-1 for female, then the coefficient corresponds to half of the difference between the two levels (male vs. female). If you have coded x=0.5 for male and x=-0.5 for female, then the coefficient corresponds to the difference between the two levels. If you treatment coded, but used x=5 for male and x=0 for female, then the predicted difference between the groups corresponds to 5 times the coefficient value of gender (all of this, for now, ignores interactions, collinearity, etc.).
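      This scaling behavior is easy to verify numerically; a minimal sketch using ordinary lm() and a made-up gender variable (all names and numbers below are hypothetical, not from the model in the post):

```r
set.seed(1)
gender <- rep(c("male", "female"), each = 50)
# true male-female difference of 2, plus a little noise
y <- ifelse(gender == "male", 2, 0) + rnorm(100, sd = 0.1)
# three codings of the same two-level factor
g1 <- ifelse(gender == "male",  1, -1)   # coefficient = half the difference
g2 <- ifelse(gender == "male", .5, -.5)  # coefficient = the difference itself
g3 <- ifelse(gender == "male",  5,  0)   # coefficient = the difference / 5
b1 <- coef(lm(y ~ g1))["g1"]
b2 <- coef(lm(y ~ g2))["g2"]
b3 <- coef(lm(y ~ g3))["g3"]
c(b1, b2, b3)  # roughly 1, 2, 0.4: same difference, different units
```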

      Now you talk about graphing the results and two-column graphs. I assume you’re thinking about how to visualize an interaction between two binary (sum-coded) effects? Well, the predicted cell means (corresponding to e.g. bar heights in a plot) should not depend on the coding. It’s only the standard error estimates that are inflated by collinearity. The individual coefficients of highly correlated predictors can also be hard to interpret, but their sum is not affected: for each data point, the prediction of the model is the same under all exhaustive orthogonal coding schemes. That is, if you look at where case 723 in your data falls, which happens to come from a male participant, the predicted response rate is independent of whether you used sum coding or treatment coding.

      How you get to the cell means (of e.g. MALE & YOUNG vs. FEMALE & YOUNG vs. MALE & OLD vs. FEMALE & OLD) does depend on the coding scheme. That is, what you have to add to or subtract from the intercept to get the predicted values for each of these cells depends on the coding scheme. Is that what you mean? Sorry not to be more helpful.
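      That the predictions themselves do not depend on the coding can be checked directly; a minimal sketch with ordinary lm() and hypothetical data (not the model from the post):

```r
set.seed(2)
male  <- rep(c(1, 0), each = 20)
young <- rep(c(1, 0), times = 20)
y <- 1 + 2 * male - 1 * young + 0.5 * male * young + rnorm(40)
# the same full-factorial model under treatment coding (0/1)
# and under sum coding (-0.5/0.5)
m.treat <- lm(y ~ male * young)
m.sum   <- lm(y ~ I(male - .5) * I(young - .5))
# the fitted values, and hence the predicted cell means, are identical
max(abs(fitted(m.treat) - fitted(m.sum)))
```

The individual coefficients differ between the two fits, but both models span the same predictor space, so every predicted value comes out the same.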


    lriss said:
    May 16, 2010 at 5:06 pm

    I’m currently working on writing up a mixed logit model analysis where we have 2 fixed effects and get high collinearity between the effects and their interactions when we use treatment coding. This goes away when we do the sum coding, but then we want to interpret the simple effects. What do you think of this solution: let’s say for fixed effect A we have levels 1,2,3, where level 1 is a conceptual “baseline” level. The sum coding tells us there is a main effect of A, and in order to test the simple effects we split the data in half and fit two separate models: the first model only includes the data for levels 1 and 2, and the second model contains the data for levels 1 and 3. Now you can evaluate the effect of each level of A against the baseline.


      tiflo said:
      May 16, 2010 at 9:36 pm


      Unfortunately, I am about to leave for Mexico for fieldwork in a few hours, so I have to keep this very short (you can email R-lang for more feedback).

      I am not sure I fully get your point. I would need to know more about the coding. Let’s see whether I get this right. Let’s assume you have condition A by B in a 3 x 2 design. So condition A has levels 1,2,3 or say “a1”, “a2”, “a3”. If you sum-code that using contr.sum(3), that’ll give you the following contrasts:

         [,1] [,2]
      a1    1    0
      a2    0    1
      a3   -1   -1

      So, you should see two coefficients associated with the condition A input variable — just like for treatment coding. Assuming that the collinearity is indeed negligible (or non-existent in a balanced data set), you can safely interpret a significant coefficient for contrast [,1] as meaning that conditions a1 and a3 differ. A significant coefficient for contrast [,2] means that conditions a2 and a3 differ. So, there is no post-hoc test necessary (assuming you code things like that from the beginning, which is what I would recommend).

      You might find Maureen Gillespie’s tutorial on coding useful that she gave as part of our workshop at McGill recently. Does that help?
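      For reference, the contrast matrix above is exactly what contr.sum(3) produces, and it can be attached to the factor directly; a small sketch with a hypothetical three-level factor A:

```r
# a three-level condition factor, as in the 3 x 2 design discussed above
A <- factor(rep(c("a1", "a2", "a3"), times = 4))
contrasts(A) <- contr.sum(3)
contrasts(A)
#    [,1] [,2]
# a1    1    0
# a2    0    1
# a3   -1   -1
```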


    lriss said:
    May 18, 2010 at 10:55 am

    Thanks — the Gillespie tutorial was useful. Yes, I am using the contrasts you suggested, repeated here:
       [,1] [,2]
    a1    1    0
    a2    0    1
    a3   -1   -1
    My interpretation of the effects coding is that a significant coefficient for contrast [,1] would tell you that condition a1 is different from the grand mean, not that a1 is different from a3. Can you clarify this? I’ll email R-lang as well.


      tiflo said:
      May 21, 2010 at 7:00 pm

      Hey, did you email to R-lang already? I am in Mexico now (fieldwork) and thus a little distracted.

