Outliers are a nasty problem for both ANOVA and regression analyses, though at least some folks consider them more of a problem for the latter type of analysis. So, I thought I’d post some of my thoughts on a recent discussion about outliers that took place at CUNY 2009. Hopefully, some of you will react and enlighten me/us (maybe there are some data, some simulations out there that may speak to the issues I mention below?). I first summarize a case where one outlier out of 400 apparently drove the result of a regression analysis, but wouldn’t have done so if the researchers had used AN(C)OVA. After that I’ll have some simulation data for you on another type of “outlier” (I am not even sure whether outlier is the right word): the case where a few levels of a group-level predictor may be driving the result. That is, how do we make sure that our results aren’t just due to item-specific or subject-specific properties? Read the rest of this entry »
Below, I’ve posted some code that
- generates an artificial data set
- creates both treatment (a.k.a. dummy) and sum (a.k.a. contrast or ANOVA-style) coding for the data set
- compares the lmer output for the two coding systems
- suggests a way to test simple effects in a linear mixed model
Mostly though the code is just meant as a starting point for people who want to play with a balanced (non-Latin square design) data set to understand the consequences of coding your predictor variables in different ways.
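For readers who just want to see the two coding schemes side by side, here is a minimal sketch. It is not the code from the post: the data are simulated, the variable names (A, B, y) are made up, and I use plain lm() instead of lmer() so that it runs in base R. The contrast setup carries over unchanged to a mixed model.

```r
set.seed(1)

# Balanced 2 x 2 design with 50 observations per cell (all values made up)
d <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"), rep = 1:50)
d$A <- factor(d$A)
d$B <- factor(d$B)
d$y <- 10 + 2 * (d$A == "a2") + 1 * (d$B == "b2") +
  3 * (d$A == "a2" & d$B == "b2") + rnorm(nrow(d))

# Treatment (dummy) coding: reference levels are a1 and b1
d_trt <- d
contrasts(d_trt$A) <- contr.treatment(2)
contrasts(d_trt$B) <- contr.treatment(2)
m_trt <- lm(y ~ A * B, data = d_trt)

# Sum (ANOVA-style) coding: levels are coded +1 / -1
d_sum <- d
contrasts(d_sum$A) <- contr.sum(2)
contrasts(d_sum$B) <- contr.sum(2)
m_sum <- lm(y ~ A * B, data = d_sum)

# Cell means, for comparison with the coefficients
cell_means <- tapply(d$y, list(d$A, d$B), mean)

coef(m_trt)
coef(m_sum)
```

Both models make identical predictions; only the meaning of the coefficients changes. Under treatment coding the intercept is the mean of the a1/b1 cell and the coefficient for A is the simple effect of A at b1. Under sum coding (with balanced data) the intercept is the grand mean and the coefficient for A is half the main-effect difference.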
What to do when you need an intuitive measure of model quality for your logit (logistic) model? The problem is that logit models don’t have a nice measure such as R-square for linear models, which has a super intuitive interpretation. However, several pseudo R-square measures have been suggested, and some are more commonly used than others (e.g. Nagelkerke R2). In R, some model-fitting procedures for ordinary logistic regression provide the Nagelkerke R-square as part of the standard output (e.g. lrm in Harrell’s Design package). However, no such measure is provided for the most widely used mixed logit model-fitting procedure (lmer in Bates’ lme4 library). Below I provide some code that computes Nagelkerke and Cox-Snell pseudo R-squares for mixed logit models. Read the rest of this entry »
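To make the two measures concrete, here is a minimal sketch of how they are computed from the log-likelihoods of a fitted model and a null model. It is not the code from the post: pseudo_r2() is a made-up helper, and the example uses an ordinary glm() on simulated data so that it runs in base R. For a mixed logit model you also have to decide what the null model should be (for instance, intercept-only with or without the random effects); that choice is left open here.

```r
# Cox-Snell and Nagelkerke pseudo R-squares from two fitted models.
# `model` and `null_model` must support logLik() and nobs().
pseudo_r2 <- function(model, null_model) {
  n   <- nobs(model)
  ll1 <- as.numeric(logLik(model))
  ll0 <- as.numeric(logLik(null_model))
  cox_snell <- 1 - exp(2 * (ll0 - ll1) / n)
  max_cs    <- 1 - exp(2 * ll0 / n)   # upper bound of Cox-Snell
  c(CoxSnell = cox_snell, Nagelkerke = cox_snell / max_cs)
}

# Simulated example (all values made up)
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

m1 <- glm(y ~ x, family = binomial)
m0 <- glm(y ~ 1, family = binomial)

r2 <- pseudo_r2(m1, m0)
r2
```

Nagelkerke is simply Cox-Snell rescaled by its maximum attainable value, which is why it is always at least as large.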
I am always annoyed that one has to remind R to reduce the number of levels of a factor after a subset (of the original data set) has been created. In addition to screwing up tables (b/c they will contain all the zero rows/columns, too), this also can affect comparison of factor values (“Factors do not have the same number of levels”), and it makes RData files much bigger than they need to be. In our lab, we often work with large data files (up to 800,000 rows and 100-350 variables are relatively common), so that an RData file containing just that data.frame can easily be 100MB+. Say you select 5,000 rows out of 800,000: that may still leave you at an RData file size of 50MB+ because R remembers all original levels for all factors still in the data.frame. The little script I attach below takes either a factor or a data.frame as input and returns the factor or the data.frame in such a way that only levels still in the data are considered. In the files I recently worked with, that reduced the file size by 90%+, which in turn leads to a considerable speed-up in analyzing the data (I mean, on a small laptop, you will definitely feel the difference …). Anyway, nothing big, but maybe some of you will find it useful: Read the rest of this entry »
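A minimal sketch of such a helper, assuming nothing beyond base R. It is not the script from the post: drop_unused_levels() is a made-up name, and the toy data are invented. More recent versions of R also ship droplevels(), which does essentially the same thing.

```r
# Drop unused factor levels from a factor or from every factor in a data.frame.
drop_unused_levels <- function(x) {
  if (is.factor(x)) {
    # Re-calling factor() keeps only the levels that actually occur
    return(factor(x))
  }
  if (is.data.frame(x)) {
    is_fac <- vapply(x, is.factor, logical(1))
    for (nm in names(x)[is_fac]) {
      x[[nm]] <- factor(x[[nm]])
    }
    return(x)
  }
  stop("x must be a factor or a data.frame")
}

# Toy example (made-up data)
d <- data.frame(
  subj = factor(c("s1", "s2", "s3", "s4")),
  rt   = c(350, 420, 390, 510)
)
sub_raw   <- d[d$rt > 400, ]          # still remembers all four levels
sub_clean <- drop_unused_levels(sub_raw)

nlevels(sub_raw$subj)
nlevels(sub_clean$subj)
```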
One of the more common questions I get about mixed models is whether there are any standards regarding the removal of random effects from the model. When should a random effect be included in the model? This was also one of the questions we had hoped to answer for our field (psycholinguistics) in the pre-CUNY Workshop on Ordinary and Multilevel Models (WOMM), but I don’t think we got anywhere close to a “standard” (see Harald Baayen’s presentation on understanding random effect correlations though for a very insightful discussion).
That being said, I find most of us would probably agree on a set of rules of thumb, at least for factorial analyses of balanced data: Read the rest of this entry »
This post is partly a response to this message. The author of that question is working on ordered categorical data. For that specific case, there are several packages in R that might work, none of which I’ve tried. The most promising is the function DPolmm() from DPpackage. It’s worth noting, though, that in that package you are committed to a Dirichlet Process prior for the random effects (instead of the more standard Gaussian). A different package, mprobit, allows one clustering factor. This could be suitable, depending on the data set. MNP, mlogit, multinomRob, vbmp, nnet, and msm all offer some capability of modeling ordered categorical data, and it’s possible that one of them allows for random effects (though I haven’t discovered any yet). MCMCpack may also be useful, as it provides MCMC implementations for a large class of regression models.
lrm() from the Design package handles ordered categorical data, and clustered bootstrap sampling can be used for a single cluster effect.
I’ve recently had some success using MCMCglmm for the analysis of unordered multinomial data, and want to post a quick annotated example here. It should be noted that the tutorial on the CRAN page is extremely useful, and I encourage anyone using the package to work through it.
I’m going to cheat a bit in my choice of data sets, in that I won’t be using data from a real experiment with a multinomial (or polychotomous) outcome. Instead, I want to use a publicly available data set with some relevance to language research. I also need a categorical dependent variable with more than two levels for this demo to be interesting. Looking through the data sets provided in the languageR package, I noticed that the dative data set has a column SemanticClass which has five levels. We’ll use this as our dependent variable for this example. We’ll investigate whether the semantic class of a ditransitive event is influenced by the modality in which it is produced (spoken or written).
library(MCMCglmm)
data("dative", package = "languageR")

k <- length(levels(dative$SemanticClass))
I <- diag(k-1)
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1))

m <- MCMCglmm(SemanticClass ~ -1 + trait + Modality,
              random = ~ us(trait):Verb + us(Modality):Verb,
              rcov = ~ us(trait):units,
              prior = list(
                R = list(fix=1, V=0.5 * (I + J), n = 4),
                G = list(
                  G1 = list(V = diag(4), n = 4),
                  G2 = list(V = diag(2), n = 2))),
              burnin = 15000,
              nitt = 40000,
              family = "categorical",
              data = dative)
Read on for an explanation of this model specification, along with some functions for evaluating the model fit.
One of the most common issues in regression analyses of even balanced experimental data is collinearity between main effects and interactions. To avoid this problem, a simple first step is to center all predictors. In my experience folks often fail to do that simply because it’s a bit more work and we’re all lazy. So here’s an attempt at a simple R function that takes single variables as well as entire dataframes. Read the rest of this entry »
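To illustrate the idea, here is a minimal sketch of a centering helper. It is not the function from the post: center() as written here only handles numeric vectors and the numeric columns of a data.frame (factors and other columns are left alone), and the data are simulated.

```r
# Center a numeric vector, or every numeric column of a data.frame.
center <- function(x) {
  if (is.data.frame(x)) {
    is_num <- vapply(x, is.numeric, logical(1))
    for (nm in names(x)[is_num]) {
      x[[nm]] <- center(x[[nm]])
    }
    return(x)
  }
  if (is.numeric(x)) {
    return(x - mean(x, na.rm = TRUE))
  }
  stop("x must be numeric or a data.frame")
}

# Simulated predictors with means far from zero (made-up values)
set.seed(11)
d <- data.frame(x1 = rnorm(1000, mean = 10), x2 = rnorm(1000, mean = 10))
dc <- center(d)

# Correlation between a main effect and the interaction term,
# before and after centering
r_before <- cor(d$x1, d$x1 * d$x2)
r_after  <- cor(dc$x1, dc$x1 * dc$x2)
c(before = r_before, after = r_after)
```

With uncentered predictors the product term is strongly correlated with each of its components; after centering, that correlation mostly disappears.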
Since I get asked for the R code I promised in my 2008 JML paper on mixed logit models every now and then, I have posted it here. If you find this code useful, please consider citing the Jaeger (2008) paper:
- Jaeger, T. Florian (2008). Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. Journal of Memory and Language 59, 434-446.
Please note, however, that the data analyzed in that paper is not mine and you need to acquire it from Inbal Arnon, who conducted the study. With Inbal’s permission, here’s the data file I used:
- Data from the comprehension component of Study 2 from Arnon, Inbal. “Rethinking child difficulty: The effect of NP type on children’s processing of relative clauses in Hebrew.” Journal of Child Language 37.01 (2010): 27-57.
If you try to work your way through my paper, you may also find the following wiki pages from our lab with readings and more code helpful:
As a quick intro, you may find useful the talks from a recent workshop on the conceptual background, common issues and solutions for ordinary and multilevel regression models that some colleagues (Dale Barr, Roger Levy, Harald Baayen, Victor Kuperman, Austin Frank) and I gave at the CUNY sentence processing conference 2009. The talk slides are all linked to the schedule on that page. You’ll find detailed walk-throughs, R code, and conceptual overviews.
I’d appreciate it if you left a comment here in case this was useful. It helps to see what we should be posting. Cheers.
Ah, while I am at it, I may as well put this plot up, too. The code needs to be updated, but let me know if you think this could be useful. It’s very similar to the calibrate() plots from Harrell’s Design library, just that it works for lmer() models from Doug Bates’ lme4 library.
The plot below is from a model of complementizer that-mentioning (a type of syntactic reduction as in I believe (that) it is time to go to bed). The model uses 26 parameters to predict speakers’ choice between complement clauses with and without that. This includes predictors modeling the accessibility, fluency, etc. at the complement clause onset, overall domain complexity, the potential for ambiguity avoidance, predictability of the complement clause, syntactic persistence effects, social effects, individual speaker differences, etc.
Mean predicted probabilities vs. observed proportions of that. The data is divided into 20 bins based on 0.05 intervals of predicted values from 0 to 1. The number of observed data points in each bin is expressed as multiples of the minimum bin size. The data rug at the top of the plot visualizes the distribution of the predicted values. See Jaeger (almost-submitted, Figure 2).
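The binning described in the caption can be sketched as follows. This is not the plotting code from the post: the predicted probabilities and outcomes are simulated (in practice p_hat would come from the fitted model), and all names are made up.

```r
set.seed(7)
n     <- 5000
p_hat <- rbeta(n, 2, 2)        # stand-in for the model's predicted probabilities
y     <- rbinom(n, 1, p_hat)   # stand-in for the observed binary outcome

# 20 bins of width 0.05 over the predicted values
breaks <- seq(0, 1, by = 0.05)
bin    <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

calib <- data.frame(
  mean_predicted = as.vector(tapply(p_hat, bin, mean)),
  observed       = as.vector(tapply(y, bin, mean)),
  n              = as.vector(table(bin))
)
calib <- calib[calib$n > 0, ]                  # drop empty bins
calib$rel_size <- calib$n / min(calib$n)       # multiples of the smallest bin

# Mean predicted probability vs. observed proportion, point size by bin size
plot(calib$mean_predicted, calib$observed,
     cex = sqrt(calib$rel_size), xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)     # perfect calibration
rug(p_hat, side = 3)      # distribution of predicted values at the top
```

Points close to the dashed diagonal indicate that the predicted probabilities match the observed proportions in that bin.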
I’ve been using a two-step approach, where in the first step I use all data (including fillers, but not practice items) of an experiment to fit a model of log-transformed raw reading times with:
- word length (Wlen)
- position of word in stimulus (Wpos)
- position of stimulus in list (Lpos) Read the rest of this entry »
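A rough sketch of what that first step could look like, under several assumptions: the data are simulated, RT is a made-up column name for the raw reading times, and I use lm() where a mixed model may well be the better choice. Keeping the residuals as the input to the second step is also my assumption about where this is headed.

```r
set.seed(3)
n <- 2000

# Simulated stand-in for the full data set (fillers included); values made up
d <- data.frame(
  Wlen = sample(1:12, n, replace = TRUE),   # word length
  Wpos = sample(1:15, n, replace = TRUE),   # position of word in stimulus
  Lpos = sample(1:60, n, replace = TRUE)    # position of stimulus in list
)
d$RT <- exp(5.6 + 0.02 * d$Wlen - 0.005 * d$Wpos - 0.003 * d$Lpos +
            rnorm(n, sd = 0.3))

# Step 1: model log-transformed raw reading times with the three predictors
m_step1 <- lm(log(RT) ~ Wlen + Wpos + Lpos, data = d)

# Assumed hand-off to step 2: residual log reading times
d$resid_logRT <- resid(m_step1)

summary(m_step1)$coefficients
```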
Some time ago I posted some R-code on how to create spill-over data from a linger reading time file (for spill-over analysis of self-paced reading time data). Here are the steps that need to be done prior to that, importing from a linger file, data preparation, outlier check, etc.
If you’re interested in getting an intro to ordinary logistic regression and mixed logit models (logit models with random subject and item modeling) and why not to blindly trust ANOVA over proportions, even when they are based on (arcsine) transformed proportions, check out my paper on categorical data analysis (accepted for publication in JML; version prior to proofs). Feedback is welcome, but at this point I can’t really change that much.
Cheers to all you folks who helped me with this!
Here is some code to summarize the coefficients of a linear mixed model that produces nice graphs like the following one (well, the curved arrows were added in powerpoint): [click to see a larger version]
But first some background about the example model: Read the rest of this entry »
I’ve been working on some R-code for spill-over analysis for self-paced reading experiments. I’ll be posting the actual analysis later. Here’s some code that adds the spill-over from previous words to each word: Read the rest of this entry »
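The basic operation can be sketched like this. It is not the code from the post: add_spillover() and the column names (subj, item, wpos, rt) are made up, and the toy data are invented. For each word, the function adds the reading times of the preceding word(s) within the same subject and item.

```r
# Add reading times of the previous n_back words (within subject and item)
# as new columns rt_prev1, rt_prev2, ...
add_spillover <- function(d, n_back = 1) {
  d   <- d[order(d$subj, d$item, d$wpos), ]
  grp <- interaction(d$subj, d$item, drop = TRUE)
  for (k in seq_len(n_back)) {
    d[[paste0("rt_prev", k)]] <- ave(d$rt, grp, FUN = function(v) {
      # Shift by k positions; the first k words have no predecessor
      c(rep(NA, min(k, length(v))), head(v, -k))
    })
  }
  d
}

# Toy data: two subjects reading the same four-word item (made-up values)
d <- data.frame(
  subj = rep(c("s1", "s2"), each = 4),
  item = rep(1, 8),
  wpos = rep(1:4, times = 2),
  rt   = c(300, 320, 410, 350, 280, 305, 390, 330)
)
d2 <- add_spillover(d, n_back = 2)
d2
```

The first word of each trial gets NA for rt_prev1, and the first two words get NA for rt_prev2, since there is nothing to spill over from.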