….. during the LSA Summer Institute at Berkeley. Come and join us. More information and registration can be found here.
I am always annoyed that one has to remind R to reduce the number of levels of a factor after a subset of the original data set has been created. In addition to screwing up tables (because they will contain all the zero rows and columns, too), this can also affect comparisons of factor values (“Factors do not have the same number of levels”), and it makes RData files much bigger than they need to be. In our lab, we often work with large data files (up to 800,000 rows and 100-350 variables are relatively common), so an RData file containing just that data.frame can easily be 100MB+. Say you select 5,000 rows out of 800,000: that may still leave you with an RData file of 50MB+, because R remembers all original levels for all factors still in the data.frame. The little script I attach below takes either a factor or a data.frame as input and returns it with only those levels that still occur in the data. In the files I recently worked with, that reduced the file size by 90%+, which in turn leads to a considerable speed-up in analyzing the data (I mean, on a small laptop, you will definitely feel the difference …). Anyway, nothing big, but maybe some of you will find it useful: Continue reading ‘little function to clean up factor variables after subset-ing’
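A minimal sketch of the idea, not the script from the full post: the name `drop_unused_levels` is hypothetical, and the sketch relies only on base R's `factor()`, which rebuilds a factor from the values actually present. Recent versions of R also provide `droplevels()`, which does the same job.

```r
# Hypothetical helper (the name is an assumption, not the original script).
drop_unused_levels <- function(x) {
  if (is.factor(x)) {
    # Re-creating the factor keeps only the levels that still occur.
    return(factor(x))
  }
  if (is.data.frame(x)) {
    # Clean up every factor column; leave all other columns untouched.
    is_fac <- vapply(x, is.factor, logical(1))
    if (any(is_fac)) x[is_fac] <- lapply(x[is_fac], factor)
    return(x)
  }
  stop("x must be a factor or a data.frame")
}

# Toy data, made up for illustration.
d <- data.frame(verb = factor(c("give", "send", "hand", "give")),
                rt = c(410, 385, 402, 398))
s <- d[d$verb == "give", ]
nlevels(s$verb)                      # 3: the subset still carries all levels
nlevels(drop_unused_levels(s)$verb)  # 1: only "give" is left
```

Calling the helper once on the subsetted data.frame, right before saving it, should be enough to get rid of the unused levels.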
hlplab adirondacks retreat
Here are some pictures from the HLPlab’s Adirondacks retreat 2009.
We spent a fun & productive paper-writing week at the beautiful Upper Saranac lake…

The view from work.

Peter and Paper.
The CCC, NSF, and CRA have announced a joint funding program for Computing Innovation Fellows. It provides one to two years of postdoc funding, and you can take that funding to any lab registered on their webpage. You can also join several labs during that postdoc period. Check out our CI fellows page.
One of the more common questions I get about mixed models is whether there are any standards regarding the removal of random effects from the model. When should a random effect be included in the model? This was also one of the questions we had hoped to answer for our field (psycholinguistics) in the pre-CUNY Workshop on Ordinary and Multilevel Models (WOMM), but I don’t think we got anywhere close to a “standard” (see Harald Baayen’s presentation on understanding random effect correlations, though, for a very insightful discussion).
That being said, I think most of us would probably agree on a set of rules of thumb, at least for factorial analyses of balanced data: Continue reading ‘Random effect: Should I stay or should I go?’
This post is partly a response to this message. The author of that question is working on ordered categorical data. For that specific case, there are several packages in R that might work, none of which I’ve tried. The most promising is the function DPolmm() from DPpackage. It’s worth noting, though, that in that package you are committed to a Dirichlet Process prior for the random effects (instead of the more standard Gaussian). A different package, mprobit, allows one clustering factor. This could be suitable, depending on the data set. MNP, mlogit, multinomRob, vbmp, nnet, and msm all offer some capability of modeling ordered categorical data, and it’s possible that one of them allows for random effects (though I haven’t discovered any yet). MCMCpack may also be useful, as it provides MCMC implementations for a large class of regression models. lrm() from the Design package handles ordered categorical data, and clustered bootstrap sampling can be used for a single cluster effect.
I’ve recently had some success using MCMCglmm for the analysis of unordered multinomial data, and want to post a quick annotated example here. It should be noted that the tutorial on the CRAN page is extremely useful, and I encourage anyone using the package to work through it.
I’m going to cheat a bit in my choice of data sets, in that I won’t be using data from a real experiment with a multinomial (or polychotomous) outcome. Instead, I want to use a publicly available data set with some relevance to language research. I also need a categorical dependent variable with more than two levels for this demo to be interesting. Looking through the data sets provided in the languageR package, I noticed that the dative data set has a column SemanticClass which has five levels. We’ll use this as our dependent variable for this example. We’ll investigate whether the semantic class of a ditransitive event is influenced by the modality in which it is produced (spoken or written).
library(MCMCglmm)
data("dative", package = "languageR")

# Number of outcome categories; the model has k - 1 latent traits
k <- length(levels(dative$SemanticClass))
I <- diag(k - 1)               # identity matrix
J <- matrix(1, k - 1, k - 1)   # matrix of ones

m <- MCMCglmm(SemanticClass ~ -1 + trait + Modality,
              random = ~ us(trait):Verb + us(Modality):Verb,
              rcov = ~ us(trait):units,
              # Note: newer MCMCglmm versions may expect `nu` rather than `n`
              # in the prior lists below.
              prior = list(
                R = list(fix = 1, V = 0.5 * (I + J), n = 4),
                G = list(
                  G1 = list(V = diag(4), n = 4),
                  G2 = list(V = diag(2), n = 2))),
              burnin = 15000,
              nitt = 40000,
              family = "categorical",
              data = dative)
Read on for an explanation of this model specification, along with some functions for evaluating the model fit.
Austin Frank and I just gave a 2×3-hour workshop on multilevel models at Haskins Lab (thanks to Tine Mooshammer for organizing!). We had a great audience with pretty diverse backgrounds (from longitudinal studies on nutrition, speech research, and clinical studies to psycholinguistics and fMRI research), which made for lots of interesting conversations on topics I don’t usually get to think about. Thanks to everyone attending =). We had a great time.
We may post the recordings once we receive them, if it turns out they may be useful. But for now, here are many of the slides we used, a substantial subset of which were created by Roger Levy (UC San Diego) and/or in collaboration with Victor Kuperman (Stanford University) for WOMM’09 at the CUNY Sentence Processing Conference, as indicated on the slides. No guarantees for the R-code and please do not distribute (rather: refer to this page) and ask before citing.
- Conceptual intro to Generalized Linear Models and Generalized Linear Mixed (a.k.a multilevel) models (based on WOMM’09 slides by Roger Levy with minimal changes by Florian Jaeger)
- Common issues in ordinary and multilevel regression modeling (based on Florian Jaeger & Victor Kuperman’s WOMM’09 slides plus some additional materials).
- Additional issues: multiple post-hoc comparisons and random effect ex/in-clusion (Austin Frank)
Questions and comments welcome, preferably using the comment box at the bottom of this page. R-related questions should be sent to the very friendly email support list for language researchers using R (see R-lang link in the navigation bar to the right).
Centering several variables
One of the most common issues in regression analyses of even balanced experimental data is collinearity between main effects and interactions. To avoid this problem, a simple first step is to center all predictors. In my experience folks often fail to do that simply because it’s a bit more work and we’re all lazy. So here’s an attempt at a simple R function that takes single variables as well as entire dataframes. Continue reading ‘Centering several variables’
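As a rough sketch of what such a function might look like (this is not the code from the full post; `center_vars` is a hypothetical name, and leaving non-numeric columns untouched is an assumption):

```r
# Hypothetical helper: subtract the mean from a numeric vector,
# or from every numeric column of a data.frame.
center_vars <- function(x) {
  if (is.data.frame(x)) {
    # Center numeric columns only; factors and strings are left as they are.
    is_num <- vapply(x, is.numeric, logical(1))
    if (any(is_num)) x[is_num] <- lapply(x[is_num], center_vars)
    return(x)
  }
  if (!is.numeric(x)) stop("x must be numeric or a data.frame")
  x - mean(x, na.rm = TRUE)
}

# Toy data, made up for illustration.
d <- data.frame(len = c(2, 4, 6, 8),
                freq = c(1, 3, 2, 6),
                cond = factor(c("a", "b", "a", "b")))
dc <- center_vars(d)
colMeans(dc[c("len", "freq")])  # both means are now 0
```

Centering only shifts each predictor by its mean. The scale of the variables is unchanged, so slopes keep their original units.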
Alex Fine pointed out this summary of an article on a common statistical flaw in fMRI research. Absolutely worth reading, especially for all those who consider statistics an unnecessary evil.
I recently parsed the British National Corpus (BNC) using the latest version of the parser by Charniak’s group @ Brown. In running the results through ‘tgrep2 -p’ (i.e., building a corpus file), I ran into some trouble that I thought I’d put up here in case it saves someone a bit of grief.