This post is partly a response to this message. The author of that question is working on ordered categorical data. For that specific case, there are several packages in R that might work, none of which I’ve tried. The most promising is the function DPolmm() from DPpackage. It’s worth noting, though, that in that package you are committed to a Dirichlet Process prior for the random effects (instead of the more standard Gaussian). A different package, mprobit, allows one clustering factor, which could be suitable depending on the data set. MNP, mlogit, multinomRob, vbmp, nnet, and msm all offer some capability for modeling ordered categorical data, and it’s possible that one of them allows for random effects (though I haven’t discovered any yet). MCMCpack may also be useful, as it provides MCMC implementations for a large class of regression models.
lrm() from the Design package handles ordered categorical data, and clustered bootstrap sampling can be used for a single cluster effect.
I’ve recently had some success using MCMCglmm for the analysis of unordered multinomial data, and want to post a quick annotated example here. It should be noted that the tutorial on the CRAN page is extremely useful, and I encourage anyone using the package to work through it.
I’m going to cheat a bit in my choice of data sets, in that I won’t be using data from a real experiment with a multinomial (or polychotomous) outcome. Instead, I want to use a publicly available data set with some relevance to language research. I also need a categorical dependent variable with more than two levels for this demo to be interesting. Looking through the data sets provided in the languageR package, I noticed that the dative data set has a column SemanticClass which has five levels. We’ll use this as our dependent variable for this example. We’ll investigate whether the semantic class of a ditransitive event is influenced by the modality in which it is produced (spoken or written).
library(MCMCglmm)
data("dative", package = "languageR")

k <- length(levels(dative$SemanticClass))
I <- diag(k - 1)
J <- matrix(1, k - 1, k - 1)

m <- MCMCglmm(SemanticClass ~ -1 + trait + Modality,
              random = ~ us(trait):Verb + us(Modality):Verb,
              rcov = ~ us(trait):units,
              prior = list(
                R = list(fix = 1, V = 0.5 * (I + J), n = 4),
                G = list(
                  G1 = list(V = diag(4), n = 4),
                  G2 = list(V = diag(2), n = 2))),
              burnin = 15000,
              nitt = 40000,
              family = "categorical",
              data = dative)
Read on for an explanation of this model specification, along with some functions for evaluating the model fit.
The first two lines load the MCMCglmm package and give us access to the dative data set without loading all of the languageR package.
The next three lines set up some useful constants for us. k is the number of categories in our outcome variable. I and J are matrices that we’ll use to set up constraints on the covariance structure of the residuals.
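With five outcome levels, these constants produce the 4 x 4 prior matrix used for the residuals below. A quick sanity check in base R (no fitted model needed):

```r
# Reproduce the fixed residual prior V = 0.5 * (I + J) for k = 5:
# ones on the diagonal (variances), 0.5 off the diagonal (covariances)
k <- 5
I <- diag(k - 1)
J <- matrix(1, k - 1, k - 1)
V <- 0.5 * (I + J)
V  # 4 x 4 matrix: 1 on the diagonal, 0.5 everywhere else
```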
The remainder of the code specifies the multinomial model, and we’ll break it down into smaller chunks.
SemanticClass ~ -1 + trait + Modality,
This first line specifies the dependent variable, SemanticClass, and the fixed effects, -1 + trait + Modality. The term trait is not part of the original data set; rather, it’s a reserved term in the MCMCglmm package. The name comes from phylogenetics, where the dependent variable is some observed trait in an organism. For our purposes, including the trait term tells the model to estimate an intercept for each of the observed outcomes. We follow the advice from the tutorial and enter this as -1 + trait. This guarantees an intercept for each non-baseline level of the outcome, rather than an overall intercept term with an offset for each remaining level.
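The effect of the -1 can be seen with an ordinary factor. (trait itself is constructed internally by MCMCglmm, so a stand-in factor f is used here purely for illustration.)

```r
# With the default coding, model.matrix() gives an overall intercept
# plus offsets for the non-baseline levels; with -1, one column per level
f <- factor(c("a", "b", "c", "a"))
colnames(model.matrix(~ f))       # "(Intercept)" "fb" "fc"
colnames(model.matrix(~ -1 + f))  # "fa" "fb" "fc"
```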
random = ~ us(trait):Verb + us(Modality):Verb,
The next line specifies our random effects. To the right of the colon is the variable in our data that will be used as a grouping variable, in this case Verb. The content on the left side of the colon specifies which fixed effects will also have random offsets. These fixed effects are wrapped in a function call which defines the structure of the variance-covariance matrix for that random effect. In this case, we’re allowing an unstructured variance-covariance matrix for the levels of the outcome and for the Modality of the observation. The other relevant possibility would be to use idh(trait):Verb, which constrains the random effects to be uncorrelated. For more information on these specifications, see page 12 of the tutorial.
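One way to see what’s at stake in the us() vs. idh() choice is to count the (co)variance parameters each estimates for a d-dimensional random effect (d = 4 here, one latent trait per non-baseline outcome level):

```r
# us() fits the full symmetric matrix; idh() fits only the diagonal
d <- 4
us_params  <- d * (d + 1) / 2  # 4 variances + 6 covariances = 10
idh_params <- d                # 4 variances, covariances fixed at 0
c(us = us_params, idh = idh_params)
```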
rcov = ~ us(trait):units,
This line specifies the structure of the residuals. units is another reserved word in MCMCglmm, in this case referring to each individual observation. This specification of the variance-covariance matrix of the residuals allows arbitrary correlations in the errors. Again, see page 12 of the tutorial for the precise terms being fit.
prior = list(
R = list(fix=1, V=0.5 * (I + J), n = 4),
Now we define the priors for the model. This is done using some structures defined by MCMCglmm. The R-structure is the variance-covariance matrix for the residuals. The G-structure is the variance-covariance matrix for the random effects (and is itself a list of sub-specifications, one per random effect). These structures are defined on pages 16-19 of the tutorial.
The R-structure in this case is set to have a fixed form (fix = 1). The reasoning behind this decision is given on page 9 of the tutorial. Basically, for data such as we have here, where each observation is a single sample from a distribution over k categorical outcomes, we can’t actually estimate the residual variance because it depends entirely on the mean (think of the binomial distribution in the simplest case). We follow the prescriptions from the package authors for working with data from this type of distribution, fixing the variance to 1 for all of the diagonal terms (variances) and 0.5 for all of the off-diagonal terms (covariances). This advice can be found in the table on page 23 of the tutorial, with further information on page 21.
G = list(
G1 = list(V = diag(4), n = 4),
G2 = list(V = diag(2), n = 2))),
Here are the priors on the random effects. When priors are not fixed, they are inverse-Wishart distributed. V is a hyperprior on the variance-covariance matrix (the inverse scale matrix), and n can be thought of as a degree-of-belief parameter. The tutorial suggests that n >= k to aid in convergence. Here we choose the lowest n that results in a proper prior: the least-informative proper prior (but not an uninformative prior).
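To get a feel for what an inverse-Wishart draw looks like, one can invert a Wishart draw using base R’s rWishart(). Note that the exact correspondence between (V, n) here and MCMCglmm’s internal parameterization is an assumption of this sketch; MCMCglmm ships its own sampler, rIW(), for the real thing.

```r
# One draw from an inverse-Wishart prior on a 4 x 4 covariance matrix,
# loosely mirroring G1 = list(V = diag(4), n = 4); the
# Wishart-on-the-inverse construction is the textbook definition,
# not necessarily MCMCglmm's exact code path
set.seed(1)
nu <- 4
V  <- diag(4)
W  <- rWishart(1, df = nu, Sigma = solve(V))[, , 1]
G  <- solve(W)  # a covariance matrix sampled from the prior
dim(G)          # 4 x 4, symmetric and positive definite
```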
burnin = 15000,
nitt = 40000,
Here we specify how many samples to generate. The goal is to converge on a set of stable estimates for the model parameters. We know that this won’t happen instantaneously, so we allow the model some time to settle on a good set of parameters before we start collecting samples. These initial throwaway samples comprise the burn-in period, and the number of such samples is set by the burnin argument. The nitt argument gives the total number of iterations, so the samples we keep come from the nitt - burnin iterations that follow the burn-in period. There’s no guarantee that the model will have settled on stable estimates by the end of burn-in, though, so it will be important to test whether the model has converged in the samples we end up using. It is also possible to discard samples as they’re collected using the thin parameter; this is usually done to minimize dependence between successive samples.
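Given MCMCglmm’s default thinning interval of 10 (thin was not set explicitly in the call above), these settings determine how many samples are actually stored:

```r
# Stored samples = post-burn-in iterations divided by the thinning
# interval (thin = 10 is MCMCglmm's default)
nitt   <- 40000  # total iterations, including burn-in
burnin <- 15000
thin   <- 10
(nitt - burnin) / thin  # 2500
```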
family = "categorical",
data = dative)
We use the "categorical" family, appropriate for data that consist of one line per observation. If we had multinomial count data, we would use the "multinomialK" family, where K is the number of levels in the dependent variable. If our data were restructured as multinomial count data, we would use family = "multinomial5". To restructure the data, we could use something like
library(reshape)
dative.melt <- melt(dative,
                    id.vars = c("Modality", "Verb"),
                    measure.vars = c("SemanticClass"))
dative.wide <- cast(dative.melt, Verb ~ value)
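If the reshape package isn’t available, base R’s xtabs() can build the same verb-by-class count table. Here it’s shown on a small hypothetical data frame with the same column names (dative itself would work identically):

```r
# Cross-tabulate counts of SemanticClass within Verb using base R
toy <- data.frame(
  Verb          = c("give", "give", "send", "send"),
  Modality      = c("spoken", "spoken", "written", "written"),
  SemanticClass = c("a", "t", "a", "a")
)
tab <- xtabs(~ Verb + SemanticClass, data = toy)
tab  # one row per Verb, one column per SemanticClass, cells are counts
```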
But in this case, it works just as well to use the original data and the family = "categorical" argument to MCMCglmm().
When we run the model, we’ll get output like
MCMC iteration = 0
E[MH acceptance ratio] = 0.000168
MCMC iteration = 1000
E[MH acceptance ratio] = 0.263537
MCMC iteration = 2000
E[MH acceptance ratio] = 0.261401
This tells us that the model is being fit using the Metropolis-Hastings algorithm (MH) rather than Gibbs sampling, and it reports the expected proportion of proposals accepted in each set of one thousand iterations. An acceptance rate that’s too high means the proposal steps are small and the parameter space will be explored very slowly; one that’s too low means most proposals are wasted. Anecdotally, an acceptance rate between .25 and .5 seems to indicate good performance.
After fitting the model, we can start by examining the structure of the result object, m.
> str(m)
List of 8
 $ Sol     : mcmc [1:2500, 1:5] -0.833 -2.191 -1.027 -0.978 -2.456 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "traitSemanticClass.c" "traitSemanticClass.f" "traitSemanticClass.p" "traitSemanticClass.t" ...
  ..- attr(*, "mcpar")= num [1:3] 1 2500 1
 $ VCV     : mcmc [1:2500, 1:36] 9.45 11.18 7.02 7.78 7.33 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:36] "Verb.trait.SemanticClass.c.SemanticClass.c" "Verb.trait.SemanticClass.f.SemanticClass.c" "Verb.trait.SemanticClass.p.SemanticClass.c" "Verb.trait.SemanticClass.t.SemanticClass.c" ...
  ..- attr(*, "mcpar")= num [1:3] 1 2500 1
 $ Liab    : NULL
 $ Fixed   :Class 'formula' length 3 SemanticClass ~ -1 + trait + Modality
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
 $ Random  :Class 'formula' length 2 ~us(trait):Verb + us(Modality):Verb
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
 $ Residual:Class 'formula' length 2 ~us(trait):units
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
 $ Deviance:Class 'mcmc' atomic [1:2500] 3079 3063 3101 3062 3127 ...
  .. ..- attr(*, "mcpar")= num [1:3] 1 2500 1
 $ DIC     : num 3572
Of immediate interest are the m$Sol and m$VCV members of the list. The former holds samples of the fixed effects parameters, and the latter holds samples of the variance-covariance matrices associated with the residuals and the random effects. These each have the class mcmc, meaning they represent samples from an MCMC fit; in this case, we have all of the samples generated by the model after the burn-in period. We can start our examination of the fixed effects (stored in m$Sol) using some visualizations provided by the coda package, e.g. the trace and density plots produced by plot(m$Sol).
Next we can look at the numeric estimates of the fixed effects.
> summary(m$Sol)

Iterations = 1:2500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 2500

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                        Mean     SD Naive SE Time-series SE
traitSemanticClass.c -3.4160 1.0983  0.02197        0.11372
traitSemanticClass.f -2.5098 0.7304  0.01461        0.06545
traitSemanticClass.p -5.3633 1.2252  0.02450        0.12210
traitSemanticClass.t  0.7708 0.8786  0.01757        0.07661
Modalitywritten      -2.2994 0.7708  0.01542        0.07250

2. Quantiles for each variable:

                        2.5%     25%     50%    75%   97.5%
traitSemanticClass.c -5.6219 -4.1561 -3.4069 -2.694 -1.2674
traitSemanticClass.f -3.8757 -3.0037 -2.5280 -2.033 -1.0036
traitSemanticClass.p -7.7261 -6.2194 -5.2755 -4.504 -3.1063
traitSemanticClass.t -0.9578  0.1640  0.7919  1.363  2.4537
Modalitywritten      -3.8277 -2.7963 -2.2668 -1.778 -0.8377
Finally, we can check out the Highest Posterior Density intervals for the fixed effects.
> HPDinterval(m$Sol)
                          lower      upper
traitSemanticClass.c -5.4157540 -1.1619380
traitSemanticClass.f -3.9443586 -1.0921360
traitSemanticClass.p -7.7055507 -3.0929750
traitSemanticClass.t -0.9460623  2.4557778
Modalitywritten      -3.7766821 -0.7944914
attr(,"Probability")
[1] 0.95
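For intuition, the HPD interval is just the shortest interval containing the target probability mass. A bare-bones empirical version might look like the following sketch (coda’s HPDinterval() is the real implementation, applied per column of an mcmc object):

```r
# Shortest interval containing prob of the sorted draws
empirical_hpd <- function(x, prob = 0.95) {
  x <- sort(x)
  n <- length(x)
  m <- floor(prob * n)                   # gap between interval endpoints
  widths <- x[(m + 1):n] - x[1:(n - m)]  # widths of all candidate windows
  i <- which.min(widths)                 # shortest window wins
  c(lower = x[i], upper = x[i + m])
}
set.seed(2)
empirical_hpd(rnorm(1e5))  # close to c(-1.96, 1.96) for a standard normal
```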
If I were dealing with a real data set, I would also run the exact same model a few more times, saving each set of results separately. This allows us to test for model convergence by comparing results across runs of the model (see gelman.diag in the coda package). Another convergence metric that would be important to check for any real data is raftery.diag. I find the Raftery test relatively intuitive to interpret, but it can require generating a large number of samples.
Questions can be left here in the comments, or can be sent to the R-lang list.