R code

Some thoughts on the sensitivity of mixed models to (two types of) outliers

Outliers are a nasty problem for both ANOVA and regression analyses, though at least some folks consider them more of a problem for the latter type of analysis. So I thought I'd post some of my thoughts on a recent discussion about outliers that took place at CUNY 2009. Hopefully, some of you will react and enlighten me/us (maybe there are some data or simulations out there that speak to the issues I mention below?). I first summarize a case where one outlier out of 400 apparently drove the result of a regression analysis, but wouldn’t have done so if the researchers had used AN(C)OVA. After that I’ll have some simulation data for you on another type of “outlier” (I am not even sure whether outlier is the right word): the case where a few levels of a group-level predictor may be driving the result. That is, how do we make sure that our results aren’t just due to item-specific or subject-specific properties?
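
To get a feel for how a single observation out of 400 can drive a regression, here is a minimal simulated sketch in base R. The data, the position of the outlier, and all variable names are made up for illustration; this is not the data set discussed at CUNY 2009.

```r
set.seed(1)

# 399 observations with no true relationship between x and y
n <- 399
x <- rnorm(n)
y <- rnorm(n)

# Add one extreme, high-leverage observation (hypothetical values)
d <- data.frame(x = c(x, 10), y = c(y, 10))

fit_all   <- lm(y ~ x, data = d)                # all 400 observations
fit_clean <- lm(y ~ x, data = d[-nrow(d), ])    # outlier removed

# Compare the slope estimates with and without the outlier
slopes <- c(with_outlier    = unname(coef(fit_all)["x"]),
            without_outlier = unname(coef(fit_clean)["x"]))
print(slopes)

# Cook's distance identifies the observation with the largest influence
most_influential <- unname(which.max(cooks.distance(fit_all)))
print(most_influential)
```

Comparing the two slopes, and checking which row has the largest Cook's distance, is one simple way to see whether a result hinges on a single data point.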

Some R code to understand the difference between treatment and sum (ANOVA-style) coding

Below, I’ve posted some code that

  1. generates an artificial data set
  2. creates both treatment (a.k.a. dummy) and sum (a.k.a. contrast or ANOVA-style) coding for the data set
  3. compares the lmer output for the two coding systems
  4. suggests a way to test simple effects in a linear mixed model

Mostly though the code is just meant as a starting point for people who want to play with a balanced (non-Latin square design) data set to understand the consequences of coding your predictor variables in different ways.
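
As a rough illustration of the two coding schemes, here is a minimal sketch on simulated balanced data. It uses base R's lm() rather than lmer() so that it runs without extra packages; the effect sizes and variable names are assumptions made up for this example, and this is not the code from the full post.

```r
set.seed(42)

# Balanced 2 x 2 design with 50 observations per cell (made-up effect sizes)
d <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"), rep = 1:50)
d$A <- factor(d$A)
d$B <- factor(d$B)
d$y <- 10 + 2 * (d$A == "a2") + 1 * (d$B == "b2") +
  3 * (d$A == "a2" & d$B == "b2") + rnorm(nrow(d))

# Treatment (dummy) coding: R's default for unordered factors
d_trt <- d
contrasts(d_trt$A) <- contr.treatment(2)
contrasts(d_trt$B) <- contr.treatment(2)
m_trt <- lm(y ~ A * B, data = d_trt)

# Sum (ANOVA-style) coding: the two levels are coded +1 and -1
d_sum <- d
contrasts(d_sum$A) <- contr.sum(2)
contrasts(d_sum$B) <- contr.sum(2)
m_sum <- lm(y ~ A * B, data = d_sum)

# Under treatment coding the intercept is the mean of the reference cell
# (a1, b1); under sum coding it is the mean of the four cell means
cell_means <- tapply(d$y, list(d$A, d$B), mean)
print(coef(m_trt))
print(coef(m_sum))
print(cell_means)
```

Both models make identical predictions; only the interpretation of the individual coefficients changes. Under treatment coding, for instance, the coefficient for A is the simple effect of A at the reference level of B.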

Nagelkerke and CoxSnell Pseudo R2 for Mixed Logit Models

What to do when you need an intuitive measure of model quality for your logit (logistic) model? The problem is that logit models don’t have a nice measure such as R-square for linear models, which has a super intuitive interpretation. However, several pseudo R-square measures have been suggested, and some are more commonly used than others (e.g. Nagelkerke R2). In R, some model-fitting procedures for ordinary logistic regression provide the Nagelkerke R-square as part of the standard output (e.g. lrm in Harrell’s Design package). However, no such measure is provided for the most widely used mixed logit model-fitting procedure (lmer in Bates’ lme4 library). Below I provide some code that provides Nagelkerke and CoxSnell pseudo R-squares for mixed logit models.
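
Both measures can be computed from the log-likelihoods of the fitted model and a null model. The following is a minimal sketch of the standard formulas, not the code from the full post: the function name pseudo_r2 is hypothetical, and the example uses an ordinary glm() on simulated data so that it runs in base R. For a mixed logit model, which null model to compare against (e.g. intercept only, with or without the random effects) is a choice you have to make yourself.

```r
# Cox & Snell and Nagelkerke pseudo R-squared from two nested fitted models.
# `fit` and `null` are model objects with a logLik() method;
# `n` is the number of observations.
pseudo_r2 <- function(fit, null, n) {
  ll_fit  <- as.numeric(logLik(fit))
  ll_null <- as.numeric(logLik(null))
  cox_snell <- 1 - exp(-2 / n * (ll_fit - ll_null))
  max_cs    <- 1 - exp(2 / n * ll_null)   # upper bound of Cox & Snell
  c(CoxSnell = cox_snell, Nagelkerke = cox_snell / max_cs)
}

# Simulated binary data (made-up coefficients)
set.seed(7)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit  <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)

r2 <- pseudo_r2(fit, null, n)
print(r2)
```

Nagelkerke's measure simply rescales Cox & Snell's by its maximum attainable value, so that it can reach 1.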

little function to clean up factor variables after subset-ing

I am always annoyed that one has to remind R to reduce the number of levels of a factor after a subset (of the original data set) has been created. In addition to screwing up tables (b/c they will contain all the zero rows/columns, too), this also can affect comparison of factor values (“Factors do not have the same number of levels”), and it makes RData files much bigger than they need to be. In our lab, we often work with large data files (up to 800,000 rows and 100-350 variables are relatively common), so that an RData file containing just that data.frame can easily be 100MB+. Say you select 5,000 rows out of 800,000: that may still leave you at an RData file size of 50MB+ because R remembers all original levels for all factors still in the data.frame. The little script I attach below takes either a factor or a data.frame as input and returns the factor or the data.frame in such a way that only levels still in the data are considered. In the files I recently worked with, that reduced the file size by 90%+, which in turn leads to a considerable speed-up in analyzing the data (I mean, on a small laptop, you will definitely feel the difference …). Anyway, nothing big, but maybe some of you may find it useful:
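
The original script is not part of this excerpt. As a minimal sketch of the idea, assuming nothing beyond base R: re-applying factor() to a factor keeps only the levels that actually occur. The function name clean_levels is hypothetical; base R's droplevels() does the same job for both factors and data frames.

```r
# Drop unused factor levels from a factor or from every factor in a data.frame.
# `clean_levels` is a hypothetical name used for this sketch.
clean_levels <- function(x) {
  if (is.factor(x)) {
    return(factor(x))            # re-factoring keeps only observed levels
  }
  if (is.data.frame(x)) {
    is_fac <- vapply(x, is.factor, logical(1))
    if (any(is_fac)) x[is_fac] <- lapply(x[is_fac], factor)
    return(x)
  }
  stop("x must be a factor or a data.frame")
}

d <- data.frame(subj = factor(c("s1", "s2", "s3", "s4")),
                rt   = c(350, 420, 390, 510))
sub <- d[d$rt < 400, ]

nlevels(sub$subj)                 # still 4: the subset remembers all levels
nlevels(clean_levels(sub)$subj)   # 2: only levels present in the subset
```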

Random effect: Should I stay or should I go?

One of the more common questions I get about mixed models is whether there are any standards regarding the removal of random effects from the model. When should a random effect be included in the model? This was also one of the questions we had hoped to answer for our field (psycholinguistics) in the pre-CUNY Workshop on Ordinary and Multilevel Models (WOMM), but I don’t think we got anywhere close to a “standard” (see Harald Baayen’s presentation on understanding random effect correlations, though, for a very insightful discussion).

That being said, I find most of us would probably agree on a set of rules of thumb, at least for factorial analyses of balanced data:

Multinomial random effects models in R

This post is partly a response to this message. The author of that question is working on ordered categorical data. For that specific case, there are several packages in R that might work, none of which I’ve tried. The most promising is the function DPolmm() from DPpackage. It’s worth noting, though, that in that package you are committed to a Dirichlet Process prior for the random effects (instead of the more standard Gaussian). A different package, mprobit, allows one clustering factor. This could be suitable, depending on the data set. MNP, mlogit, multinomRob, vbmp, nnet, and msm all offer some capability of modeling ordered categorical data, and it’s possible that one of them allows for random effects (though I haven’t discovered any yet). MCMCpack may also be useful, as it provides MCMC implementations for a large class of regression models. lrm() from the Design package handles ordered categorical data, and clustered bootstrap sampling can be used for a single cluster effect.

I’ve recently had some success using MCMCglmm for the analysis of unordered multinomial data, and want to post a quick annotated example here. It should be noted that the tutorial on the CRAN page is extremely useful, and I encourage anyone using the package to work through it.

I’m going to cheat a bit in my choice of data sets, in that I won’t be using data from a real experiment with a multinomial (or polychotomous) outcome. Instead, I want to use a publicly available data set with some relevance to language research. I also need a categorical dependent variable with more than two levels for this demo to be interesting. Looking through the data sets provided in the languageR package, I noticed that the dative data set has a column SemanticClass which has five levels. We’ll use this as our dependent variable for this example. We’ll investigate whether the semantic class of a ditransitive event is influenced by the modality in which it is produced (spoken or written).

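library(MCMCglmm)

data("dative", package = "languageR")

# Number of outcome categories; the model has k - 1 latent traits
k <- length(levels(dative$SemanticClass))
I <- diag(k - 1)                             # identity matrix
J <- matrix(1, nrow = k - 1, ncol = k - 1)   # matrix of ones

# Note: `n` in the priors below is the degree-of-belief parameter as named in
# older MCMCglmm releases; current documentation calls it `nu`.
m <- MCMCglmm(SemanticClass ~ -1 + trait + Modality,
              random = ~ us(trait):Verb + us(Modality):Verb,
              rcov = ~ us(trait):units,
              prior = list(
                R = list(fix=1, V=0.5 * (I + J), n = 4),
                G = list(
                  G1 = list(V = diag(4), n = 4),
                  G2 = list(V = diag(2), n = 2))),
              burnin = 15000,
              nitt = 40000,
              family = "categorical",
              data = dative)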

Read on for an explanation of this model specification, along with some functions for evaluating the model fit.

Centering several variables

One of the most common issues in regression analyses of even balanced experimental data is collinearity between main effects and interactions. To avoid this problem, a simple first step is to center all predictors. In my experience folks often fail to do that simply because it’s a bit more work and we’re all lazy. So here’s an attempt at a simple R function that takes single variables as well as entire dataframes.
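
The function itself is not included in this excerpt. A minimal sketch of what such a helper could look like, in base R only; the name center_vars and the example data are hypothetical:

```r
# Center a numeric vector, or every numeric column of a data.frame.
# `center_vars` is a hypothetical name used for this sketch.
center_vars <- function(x) {
  center_one <- function(v) v - mean(v, na.rm = TRUE)
  if (is.data.frame(x)) {
    is_num <- vapply(x, is.numeric, logical(1))
    if (any(is_num)) x[is_num] <- lapply(x[is_num], center_one)
    return(x)
  }
  if (is.numeric(x)) return(center_one(x))
  stop("x must be numeric or a data.frame")
}

# Numeric columns are centered; the factor column is left untouched
d <- data.frame(length = c(3, 5, 7, 9),
                freq   = c(1.2, 2.4, 0.6, 3.8),
                cond   = factor(c("a", "b", "a", "b")))
dc <- center_vars(d)
print(colMeans(dc[c("length", "freq")]))   # both (numerically) zero

# Centering reduces the correlation between a predictor and its interaction
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
cor_raw      <- cor(x1, x1 * x2)
cor_centered <- cor(center_vars(x1), center_vars(x1) * center_vars(x2))
print(c(raw = cor_raw, centered = cor_centered))
```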

Jaeger (2008), J Memory Language, 59, 434-446 (ANOVA)

Since I get asked for the R code I promised in my 2008 JML paper on mixed logit models every now and then, I have posted it here. If you find this code useful, please consider citing the Jaeger (2008) paper:

Please note, however, that the data analyzed in that paper is not mine and you need to acquire it from Inbal Arnon, who conducted the study. With Inbal’s permission, here’s the data file I used:

  • Data from the comprehension component of Study 2 from Arnon, Inbal. “Rethinking child difficulty: The effect of NP type on children’s processing of relative clauses in Hebrew.” Journal of Child Language 37.01 (2010): 27-57.

If you try to work your way through my paper, you may also find the following wiki pages from our lab with readings and more code helpful:


As a quick intro, you may find it useful to look at the talks from a recent workshop on the conceptual background, common issues, and solutions for ordinary and multilevel regression models that some colleagues (Dale Barr, Roger Levy, Harald Baayen, Victor Kuperman, Austin Frank) and I gave at the CUNY sentence processing conference 2009. The talk slides are all linked to the schedule on that page. You’ll find detailed walk-throughs, R code, and conceptual overviews.

I’d appreciate it if you left a comment here in case this was useful. It helps to see what we should be posting. Cheers.

Visualizing the quality of a glmer(family="binomial") model

Ah, while I am at it, I may as well put this plot up, too. The code needs to be updated, but let me know if you think this could be useful. It’s very similar to the calibrate() plots from Harrell’s Design library, just that it works for lmer() models from Doug Bates’ lme4 library.

The plot below is from a model of complementizer that-mentioning (a type of syntactic reduction as in I believe (that) it is time to go to bed). The model uses 26 parameters to predict speakers’ choice between complement clauses with and without that. This includes predictors modeling the accessibility, fluency, etc. at the complement clause onset, overall domain complexity, the potential for ambiguity avoidance, predictability of the complement clause, syntactic persistence effects, social effects, individual speaker differences, etc.

Visualization of model fit: predicted probability vs. observed proportions of complementizer "that"

Mean predicted probabilities vs. observed proportions of that. The data is divided into 20 bins based on 0.05 intervals of predicted values from 0 to 1. The number of observed data points in each bin is expressed as multiples of the minimum bin size. The data rug at the top of the plot visualizes the distribution of the predicted values. See Jaeger (almost-submitted, Figure 2).
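
The binning described in the caption can be sketched in base R as follows. This is not the original plotting code: the predicted probabilities and outcomes are simulated stand-ins (in practice they would come from the fitted model), and all variable names are assumptions.

```r
set.seed(3)

# Simulated stand-ins for a model's predicted probabilities and the
# observed binary outcomes
n <- 2000
pred <- plogis(rnorm(n, mean = 0, sd = 1.5))
obs  <- rbinom(n, size = 1, prob = pred)

# 20 bins of width 0.05 on the predicted-probability scale
bins <- cut(pred, breaks = seq(0, 1, by = 0.05), include.lowest = TRUE)

calib <- data.frame(
  bin       = levels(bins),
  mean_pred = as.vector(tapply(pred, bins, mean)),  # mean prediction per bin
  prop_obs  = as.vector(tapply(obs, bins, mean)),   # observed proportion per bin
  n         = as.vector(table(bins))                # data points per bin
)
calib <- calib[calib$n > 0, ]               # drop empty bins
calib$rel_size <- calib$n / min(calib$n)    # multiples of the smallest bin

plot(calib$mean_pred, calib$prop_obs, cex = sqrt(calib$rel_size),
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)   # line of perfect calibration
rug(pred, side = 3)     # distribution of predicted values along the top
```

Points close to the dashed diagonal indicate bins in which the model's mean predicted probability matches the observed proportion.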

Modeling self-paced reading data: Effects of word length, word position, spill-over, etc.

I’ve been using a two-step approach, where in the first step I use all data (including fillers, but not practice items) of an experiment to fit a model of log-transformed raw reading times with:

R-code for reading time data preparation

Some time ago I posted some R-code on how to create spill-over data from a linger reading time file (for spill-over analysis of self-paced reading time data). Here are the steps that need to be done prior to that: importing from a linger file, data preparation, outlier check, etc.

Categorical Data Analysis

If you’re interested in getting an intro to ordinary logistic regression and mixed logit models (logit models with random subject and item modeling) and why not to blindly trust ANOVA over proportions, even when they are based on (arcsine) transformed proportions, check out my paper on categorical data analysis (accepted for publication in JML; version prior to proofs). Feedback is welcome, but at this point I can’t really change that much.

Cheers to all you folks who helped me with this!

R-code for visual model summaries: linear mixed models

Here is some code that summarizes the coefficients of a linear mixed model and produces nice graphs like the following one (well, the curved arrows were added in PowerPoint):

An example slide of a linear mixed model summary

But first some background about the example model:

Spill-over effects in self-paced reading

I’ve been working on some R-code for spill-over analysis for self-paced reading experiments. I’ll be posting the actual analysis later. Here’s some code that adds the spill-over from previous words to each word:
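
The code from the full post is not included in this excerpt. As a minimal sketch of the idea, assuming a long-format data frame with one row per word: the function name add_spillover and the column names subj, item, wordpos, and rt are assumptions for this example, not the column names of a linger output file.

```r
# Add the reading time of the word `lag` positions back (within the same
# subject and item) as a spill-over predictor.
add_spillover <- function(d, lag = 1) {
  d <- d[order(d$subj, d$item, d$wordpos), ]
  # Shift a vector down by `lag`, padding the start with NA
  shift <- function(v) c(rep(NA, lag), v)[seq_along(v)]
  d[[paste0("rt_prev", lag)]] <- ave(d$rt, d$subj, d$item, FUN = shift)
  d
}

# Toy data: two subjects reading the same four-word item
d <- data.frame(
  subj    = rep(c("s1", "s2"), each = 4),
  item    = 1,
  wordpos = rep(1:4, times = 2),
  rt      = c(300, 350, 500, 320, 280, 400, 390, 310)
)
d <- add_spillover(d)
d$rt_prev1   # NA 300 350 500 NA 280 400 390
```

The first word of each sentence has no predecessor, so its spill-over value is NA; calling the function again with lag = 2 would add the reading time from two words back.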