### Multinomial random effects models in R

This post is partly a response to this message. The author of that question is working on ordered categorical data. For that specific case, there are several packages in R that might work, none of which I’ve tried. The most promising is the function `DPolmm()`

from DPpackage. It’s worth noting, though, that in that package you are committed to a Dirichlet Process prior for the random effects (instead of the more standard Gaussian). A different package, mprobit allows one clustering factor. This could be suitable, depending on the data set. MNP, mlogit, multinomRob, vbmp, `nnet`

, and msm all offer some capability of modeling ordered categorical data, and it’s possible that one of them allows for random effects (though I haven’t discovered any yet). MCMCpack may also be useful, as it provides MCMC implementations for a large class of regression models. `lrm()`

from the `Design`

package handles ordered categorical data, and clustered bootstrap sampling can be used for a single cluster effect.

I’ve recently had some success using MCMCglmm for the analysis of unordered multinomial data, and want to post a quick annotated example here. It should be noted that the tutorial on the CRAN page is extremely useful, and I encourage anyone using the package to work through it.

I’m going to cheat a bit in my choice of data sets, in that I won’t be using data from a real experiment with a multinomial (or polychotomous) outcome. Instead, I want to use a publicly available data set with some relevance to language research. I also need a categorical dependent variable with more than two levels for this demo to be interesting. Looking through the data sets provided in the `languageR`

package, I noticed that the `dative`

data set has a column `SemanticClass`

which has five levels. We’ll use this as our dependent variable for this example. We’ll investigate whether the semantic class of a ditransitive event is influenced by the modality in which it is produced (spoken or written).

library(MCMCglmm) data("dative", package = "languageR") k <- length(levels(dative$SemanticClass)) I <- diag(k-1) J <- matrix(rep(1, (k-1)^2), c(k-1, k-1)) m <- MCMCglmm(SemanticClass ~ -1 + trait + Modality, random = ~ us(trait):Verb + us(Modality):Verb, rcov = ~ us(trait):units, prior = list( R = list(fix=1, V=0.5 * (I + J), n = 4), G = list( G1 = list(V = diag(4), n = 4), G2 = list(V = diag(2), n = 2))), burnin = 15000, nitt = 40000, family = "categorical", data = dative)

Read on for an explanation of this model specification, along with some functions for evaluating the model fit.

The first two lines load the `MCMCglmm`

package and give us access to the `dative`

data set without loading all of the `languageR`

package.

The next three lines set up some useful constants for us. `k`

is the number of categories in our outcome variable. `I`

and `J`

are matrices that we’ll use to set up constraints on the covariance structure of the residuals.

The remainder of the code specifies the multinomial model, and we’ll break it down into smaller chunks.

`SemanticClass ~ -1 + trait + Modality, `

This first line specifies the dependent variable, `SemanticClass`

, and the fixed effects, `-1 + trait + Modality`

. The term `trait`

is not part of the original data set– rather, it’s a reserved term in the `MCMCglmm`

package. `trait`

comes from phylogenetics, where the dependent variable is some observed trait in an organism. For our purposes, including the `trait`

term tells the model to estimate an intercept for each of the observed outcomes. We follow the advice from the tutorial and enter this as `-1 + trait`

. This guarantees an intercept for each non-baseline level of the outcome, rather than an overall intercept term with an offset for `k-2`

levels.

random = ~ us(trait):Verb + us(Modality):Verb,

The next line specifies our random effects. To the right of the colon is the variable in our data that will be used as a grouping variable, in this case `Verb`

. The content on the left side of the colon specifies which fixed effects will also have random offsets. These fixed effects are wrapped in a function call which defines the structure of the variance-covariance matrix for that random effect. In this case, we’re allowing an unstructured variance-covariance matrix for the levels of the outcome and the Modality of the observation. The other relevant possibility would be to use `idh(trait):Verb`

, which constrains the random effects to be uncorrelated. For more information on these specifications, see page 12 of the tutorial.

rcov = ~ us(trait):units,

This line specifies the structure of the residuals. `units`

is another reserved word in `MCMCglmm`

, in this case referring to each individual observation. This specification of the variance-covariance matrix of the residuals allows arbitrary correlations in the errors. Again, see page 12 of the tutorial for the precise terms being fit.

prior = list(

R = list(fix=1, V=0.5 * (I + J), n = 4),

Now we define the priors for the model. This is done using some structures defined by `MCMCglmm`

. The R-structure is the variance-covariance matrix for the residuals. The G-structure is the variance-covariance matrix for the random effects (which is itself a list of sub-specifications, one per random effect). These structures are defined on pages 16-19 of the tutorial.

The R-structure in this case is set to have a fixed form (`fix = 1`

). The reasoning behind this decision is given on page 9 of the tutorial. Basically, for data such as we have here where each observation is a single sample from a distribution over `k`

categorical outcomes, we can’t actually estimate the residual variance because it depends entirely on the mean (think of the binomial distribution in the simplest case). We follow the prescriptions from the package authors for working with data from this type of distribution, fixing the variance to be 1 for all of the diagonal terms (variances) and .5 for all of the off-diagonal terms (covariances). This advice can be found in the table on page 23 of the tutorial, with further information on page 21.

G = list(

G1 = list(V = diag(4), n = 4),

G2 = list(V = diag(2), n = 2))),

Here are the priors on the random effects. When priors are not fixed, they are inverse-Wishart distributed. `V`

is a hyperprior on the variance-covariance matrix (the inverse scale matrix), and `n`

can be thought of as a degree-of-belief parameter. The tutorial suggests that `n >= k`

to aid in convergence. Here we choose the lowest `n`

that results in a proper prior, or the least-informative proper prior (but not an uninformative prior).

burnin = 15000,

nitt = 40000,

Here we specify how many samples to generate. The burnin argument sets the number of samples generated in the burn in period. The goal is to converge on a set of stable estimates for the model parameters. We know that this won’t happen instantaneously, and so we allow the model some time to try to settle on a good set of parameters before we start collecting samples. These initial throwaway samples comprise the burn in period, and the number of such samples to be generated is set by the `burnin`

argument. After the burn in period, we sample each parameter from the model a certain number of times, controlled by the `nitt`

argument. There’s no guarantee that the model will have settled on stable estimates by the end of burn in, though, so it will be important to test whether the model has converged in the samples we end up using. It should also be noted that it is possible to discard samples as they’re collected using the `thin`

parameter. This is usually done to try to minimize dependence between samples.

family = "categorical",

data = dative)

We use the `"categorical"`

family, appropriate for data that consists of one line per observation. If we had multinomial count data, we would use the `"multinomialK"`

family, where `K`

is the number of levels in the dependent variable. In this case, if the data were restructured as multinomial count data, we would use `family = "multinomial5"`

. To restructure the data, we could use something like

library(reshape)

dative.melt <- melt(dative,

id.vars = c("Modality", "Verb"),

measure.vars = c("SemanticClass"))

dative.wide <- cast(dative.melt, Verb ~ value)

But in this case, it works just as well to use the original data and the `family = "categorical"`

argument to `MCMCglmm`

.

When we run the model, we’ll get output like

MCMC iteration = 0 E[MH acceptance ratio] = 0.000168 MCMC iteration = 1000 E[MH acceptance ratio] = 0.263537 MCMC iteration = 2000 E[MH acceptance ratio] = 0.261401

This tells us that the model is being fit using the Metropolis-Hastings algorithm (MH), not Gibbs sampling. It also tells us the proportion of samples in each set of one thousand that is accepted. An acceptance rate that’s too high means that the space will explored very slowly. Anecdotally, an acceptance rate between .25 and .5 seems to indicate good performance.

After fitting the model, we can start by examining the result object, `m`

.

> str(m) List of 8 $ Sol : mcmc [1:2500, 1:5] -0.833 -2.191 -1.027 -0.978 -2.456 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:5] "traitSemanticClass.c" "traitSemanticClass.f" "traitSemanticClass.p" "traitSemanticClass.t" ... ..- attr(*, "mcpar")= num [1:3] 1 2500 1 $ VCV : mcmc [1:2500, 1:36] 9.45 11.18 7.02 7.78 7.33 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:36] "Verb.trait.SemanticClass.c.SemanticClass.c" "Verb.trait.SemanticClass.f.SemanticClass.c" "Verb.trait.SemanticClass.p.SemanticClass.c" "Verb.trait.SemanticClass.t.SemanticClass.c" ... ..- attr(*, "mcpar")= num [1:3] 1 2500 1 $ Liab : NULL $ Fixed :Class 'formula' length 3 SemanticClass ~ -1 + trait + Modality .. ..- attr(*, ".Environment")= $ Random :Class 'formula' length 2 ~us(trait):Verb + us(Modality):Verb .. ..- attr(*, ".Environment")= $ Residual:Class 'formula' length 2 ~us(trait):units .. ..- attr(*, ".Environment")= $ Deviance:Class 'mcmc' atomic [1:2500] 3079 3063 3101 3062 3127 ... .. ..- attr(*, "mcpar")= num [1:3] 1 2500 1 $ DIC : num 3572

Of immediate interest are the `m$Sol`

and `m$VCV`

members of the list. The former are for the fixed effects parameters, and the latter are the variance-covariance matrices associated with the residuals and the random effects. These each have the class `mcmc`

, meaning they represent samples from an MCMC fit. In this case, we have all of the samples generated by the model after the burn-in period. We can start our examination of the fixed effects (stored in m$Sol) using some visualizations provided by the `coda`

and `lattice`

packages.

Next we can look at the numeric estimates of the fixed effects.

> summary(m$Sol) Iterations = 1:2500 Thinning interval = 1 Number of chains = 1 Sample size per chain = 2500 1. Empirical mean and standard deviation for each variable, plus standard error of the mean: Mean SD Naive SE Time-series SE traitSemanticClass.c -3.4160 1.0983 0.02197 0.11372 traitSemanticClass.f -2.5098 0.7304 0.01461 0.06545 traitSemanticClass.p -5.3633 1.2252 0.02450 0.12210 traitSemanticClass.t 0.7708 0.8786 0.01757 0.07661 Modalitywritten -2.2994 0.7708 0.01542 0.07250 2. Quantiles for each variable: 2.5% 25% 50% 75% 97.5% traitSemanticClass.c -5.6219 -4.1561 -3.4069 -2.694 -1.2674 traitSemanticClass.f -3.8757 -3.0037 -2.5280 -2.033 -1.0036 traitSemanticClass.p -7.7261 -6.2194 -5.2755 -4.504 -3.1063 traitSemanticClass.t -0.9578 0.1640 0.7919 1.363 2.4537 Modalitywritten -3.8277 -2.7963 -2.2668 -1.778 -0.8377

Finally, we can check out the Highest Posterior Density intervals for the fixed effects.

> HPDinterval(m$Sol) lower upper traitSemanticClass.c -5.4157540 -1.1619380 traitSemanticClass.f -3.9443586 -1.0921360 traitSemanticClass.p -7.7055507 -3.0929750 traitSemanticClass.t -0.9460623 2.4557778 Modalitywritten -3.7766821 -0.7944914 attr(,"Probability") [1] 0.95

If I were dealing with a real data set, I would also run the exact same model a few more times, saving each set of results separately. This allows us to test for model convergence by comparing results across runs of the model (see `gelman.diag`

). Other convergence metrics that would be important to check for any real data would be `heidel.diag`

, `geweke.diag`

, and `raftery.diag`

. I find the Raftery test to be relatively intuitive to interpret, but it can require generating a large number of samples.

Questions can be left here in the comments, or can be sent to the R-lang list.

May 14, 2009 at 2:58 pm

Cool, very nice introduction. One question I have, though, is how to incorporate alternative-specific variables in this framework. The example you have here has just one item-specific variable, Modality, but some of the data I’m working with seems like it would best be modeled by a combination of item-specific and alternative-specific predictors. I’ve been playing around recently with the mlogit package, which allows you to specify such models very nicely (“dependent ~ altSpec1 + … + altSpecN | itemSpec1 + … + itemSpecM”), but it doesn’t support random effects.

LikeLike this

November 22, 2009 at 11:38 am

An interesting introduction. Has this advanced any further to hierarchical random effect models for multinomial responses?

LikeLike this

January 30, 2010 at 4:25 pm

Cool introduction, thanks.

LikeLike this

January 3, 2011 at 9:48 am

[...] Multinomial random effects models in R May 2009 3 comments 4 [...]

LikeLike this

September 21, 2011 at 1:15 am

hi

in first place, thank very much for you contribution.

i write you for make you a little question. I have how response variable a discrete variable with 4 states, (my independent variable is continuous) , so I use the variance structure of the fixed effects and random-as you propose in your mini-tutorial. the problems to be admitted to the program gives me an error: object ‘J’ not found. any idea why this happens?

I hope your answer please.

regards

LikeLike this

September 21, 2011 at 2:13 am

Dear Oscar,

J is part of the specification of the prior. It’s best you read the course notes and the vignette for the MCMCglmm package by Hadfield. They contain a lot of useful information (I started working with this package myself recently)

LikeLike this

June 7, 2012 at 10:57 am

I’ve started learning how to use MCMCglmm to fit linear mixed-effects models recently. I wonder how one can conduct multiple comparisons after one obtains a significant interaction. With lmer(), we can use glht() in the multcomp package, but I haven’t been able to find a comparable method for MCMCglmm. Any suggestion would be much appreciated!

Best,

Xiao

LikeLike this

June 7, 2012 at 11:09 am

Dear Xiao,

I almost always code interactions a priori based on theoretical considerations, so I haven’t been using glht() and, unfortunately, know of no equivalent for MCMCglmm, but perhaps someone else reading this has an answer?

Florian

LikeLike this

June 7, 2012 at 11:18 am

Thanks for the reply! I wonder if you can suggest a couple of articles/textbooks/tutorials on coding interactions a priori. I would be very interested in learning it. Thank you again

LikeLike this

June 7, 2012 at 11:55 am

Hi Xiao,

Maureen Gillespie (Illinois) put together a coding tutorial, which is available here: https://hlplab.wordpress.com/2010/05/10/mini-womm/. It also contains references.

HTH,

Florian

LikeLike this

May 17, 2013 at 8:17 am

Thank you so much for this summary! I spent the whole day trying to get the hang of the MCMCglmms, as I am totally new to it, but this summary finally helped me to get a model. However, now I am not entirely sure whether I used the right prior settings (despite reading through the descriptions over and over again the mathematics behind it seem to be beyond me).

k <- length(levels(part$Level))

I <- diag(k-1)

J <- matrix(rep(1, (k-1)^2), c(k-1, k-1))

m <- MCMCglmm(Level ~ -1+trait +Fixed1 +Fixed2+Fixed3 + Fixed4 + Fixed5 +

Fixed6 +Fixed7 +Fixed8,

random = ~ us(trait):IndividualID +us(trait):GroupDyad,

rcov = ~ us(trait):units,

prior = list(

R = list(fix=1, V=0.5 * (I + J), n = 5),

G = list(

G1 = list(V = diag(4), n = 5),

G2 = list(V = diag(4), n = 5))),

burnin = 15000,

nitt = 40000,

family = "categorical",

data = part)

So this is basically what I did, I have my levels in one collumn, which I hope works out? And the Level is a categorical variable with the classes (0,1,2,3,4), from what I got so far with the model I have it compares it all to 0, is that right? the prior settings for R I basically just took from what you wrote, and the model seems to run, but I am not sure whether its right, cause my fixed variables have different distributions (some are poisson, some categorical and one even binomial). Do you need to take that into account when formulating the R?

Thanks a lot for your help!!

Anna

LikeLike this

June 26, 2013 at 12:18 am

It is a very nice tutorial. However I have a question about how to explain the result?

In multinomial logistic regression model, there should be 4 coefficients for Modalitywritten:

Modalitywritten class c (compare to the baseline)

Modalitywritten class f

Modalitywritten class p

Modalitywritten class t

but here in this example, we only got one. How should we explain the influence of modality (written) to each class?

Thanks very much!

Renee

LikeLike this

June 26, 2013 at 6:29 pm

Dear Renee,

The coefficient for modality here describes the effect of modality on

alllevels of SemanticClass (c,f,p,t). That is, the model described here assumes that the effect of modality is identical at all levels of SemanticClass (i.e., additive to the class intercepts).I haven’t used MCMCglmm much, but I assume to relax that assumption, you would add the interaction of modality and trait to the fixed effect structure of the model, changing

`SemanticClass ~ -1 + trait + Modality`

to

`SemanticClass ~ -1 + trait * Modality`

I tried that and it seems to yield the desired result. I used a much shorter burnin and collected fewer samples, so the estimates given below are not to be trusted, but they illustrate my point:

Iterations = 1001:3991

Thinning interval = 10

Number of chains = 1

Sample size per chain = 300

1. Empirical mean and standard deviation for each variable,

plus standard error of the mean:

Mean SD Naive SE Time-series SE

traitSemanticClass.c -2.12514 0.7814 0.04511 0.12300

traitSemanticClass.f -1.39007 0.6487 0.03745 0.08351

traitSemanticClass.p -2.66922 0.8521 0.04919 0.13989

traitSemanticClass.t 0.49969 0.7906 0.04564 0.10948

Modalitywritten -1.68676 0.7375 0.04258 0.12506

traitSemanticClass.f:Modalitywritten -0.73914 0.2808 0.01621 0.12319

traitSemanticClass.p:Modalitywritten -0.02762 0.2256 0.01303 0.10308

traitSemanticClass.t:Modalitywritten -0.75859 0.2458 0.01419 0.08884

2. Quantiles for each variable:

`2.5% 25% 50% 75% 97.5%`

traitSemanticClass.c -3.7766 -2.6081 -2.08930 -1.5495 -0.7722

traitSemanticClass.f -2.7074 -1.8566 -1.39134 -0.9053 -0.2231

traitSemanticClass.p -4.3854 -3.2386 -2.70065 -2.1575 -0.9850

traitSemanticClass.t -0.9317 -0.1557 0.52490 1.0610 2.0400

Modalitywritten -2.9807 -2.2319 -1.72275 -1.2190 -0.1443

traitSemanticClass.f:Modalitywritten -1.1657 -0.9783 -0.77251 -0.5448 -0.1313

traitSemanticClass.p:Modalitywritten -0.4315 -0.2071 -0.02458 0.1360 0.3805

traitSemanticClass.t:Modalitywritten -1.2067 -0.9253 -0.74746 -0.6360 -0.1994

HTH,

Florian

LikeLike this

August 27, 2013 at 7:09 pm

Thanks so much for this. Seems to be working, I got a running model but I have no idea how to interpret the output. The tutorial you reference is somewhat helpful, but the questions are so different than mine.

I am attempting an analyses as follows:

Background: We are attempting to describe the potential bias that arises from the use of different resolutions of imagery when trying to determine ‘on-the-ground’ vegetation at a given point

Question: does spatial resolution influence the probability that a technician will describe the vegetation at a point as a tree, shrub, or herbaceous vegetation, etc…

Objective: model the effect of imagery resolution (i.e. pixel size) in determining the ‘call’ made by a technician

Data: We have selected 109 points clustered within 33 ‘plots’, at each point there is a ‘call’ of tree, shrub, herb, bare, or other. This was done 4 times for each point, at image resolutions of 1m, 50cm, 10cm, and using aerial imagery from the National Agricultural Inventory Program (NAIP). Plots are spaced at least 6km apart along a grid

So the data looks like this:

Plot Point Call Resolution

41-69-6644 1 Tree 1m

41-69-6644 1 Tree 10cm

41-69-6644 1 Tree 50cm

41-69-6644 1 Tree naip

41-69-6644 2 Herbaceous 1m

41-69-6644 2 Shrub 10cm

41-69-6644 2 Shrub 50cm

41-69-6644 2 Tree naip

and my code looked like this:

model = MCMCglmm(Call~ -1 + trait + Res,

random = ~us(trait):Plot,

rcov = ~ us(trait):units,

prior = list(

R = list(fix=1, V=0.5 * (I + J), n = 4),

G = list(

G1 = list(V = diag(4), n = 4))),

burnin = 1000,

nitt = 6000,

family = "categorical",

data = res.data)

I only did a few iterations just for exploratory purposes.

Here is some output:

Location effects: Call ~ -1 + trait + Res

`post.mean l-95% CI u-95% CI eff.samp pMCMC`

traitCall.Herbaceous 3.8063 3.0089 4.5007 6.954 <0.002 **

traitCall.Other -0.7597 -1.2783 -0.2033 3.520 <0.002 **

traitCall.Shrub 3.8189 3.0808 4.5369 5.561 <0.002 **

traitCall.Tree 2.1910 1.2333 3.2048 24.191 <0.002 **

Res1m 1.0131 0.3714 1.3977 2.434 <0.002 **

Res50cm 0.0239 -0.1985 0.2294 6.899 0.752

Resnaip 0.7726 0.2696 1.0761 3.292 <0.002 **

---

Am I right in thinking that the last three lines are what I am interested in? Are these saying that relative to the 10cm resolution, the calls made at 50cm were not significantly different, but the other two were different? Further, is it possible to further examine what might be causing the differences of the 1m and NAIP scale (i.e. when the call at 10cm was 'shrub' more misclassification occurred).

Cheers my friend, you were of great help.

Jesse Poulos

LikeLike this

August 29, 2013 at 4:43 am

Dear Jesse,

let me see whether I can help. (btw, I took the liberty to remove what I took to be a typo from your comment — I think you meant 1m, 50cm, amd 10cm rather than 10cm, 50cm, and 10cm resolution?)

In the output you present, the interpretation of the coefficients would depend on the factor coding you employed. I will assume that you didn’t specify any coding explicitly, in which case R will employ treatment coding. Since treatment coding can lead to collinearity between factor levels (especially, if you don’t have data that is balanced with respect to the factor levels), you might want to consider sum coding (aka ANOVA coding). I recommend Maureen Gillespie’s tutorial on factor coding in R, linked to this blog. E.g., you could do:

`res.data$Res = factor(res.data$Res, levels = c("10cm", "50cm", "1m", "naip"))`

contrasts(res.data$Res) = contr.sum(4)

contrasts(res.data$Res)

The first four lines describe the predicted mean log-odds of the four responses when the resolution is “10cm”. “10cm” was chosen as base line by default as its alphanumerically the first of the four levels of Res, but it also make sense (since it’s presumably the highest resolution?).

The next three lines describe the treatment effects of Resolution. Minus any problems with collinearity (which, for example, could cause insignificance), your interpretation of these effects is correct.

In order to investigate whether classification of different categories (shrub, tree, etc.) was affected

differentlyby the resolution, you would need to add the interaction of trait and resolution to your model (i.e.,`trait * Res`

, rather than`trait + Res`

).Note also that your model currently only contains random intercepts. You did not specify random slopes for the resolution effect (or the interaction of trait and the resolution). The absence of random slopes of an effect (e.g., the absence of random slopes for the resolution effect) makes the evaluation of that effect anti-conservative. Perhaps it’s the best you can do with the data you have, but I wanted to mention it.

Florian

LikeLike this

August 29, 2013 at 11:56 am

Tiflo,

Thanks for the hasty response, and I hope I can tug your eyes and brain a little longer.

Ahh, yes, you are right in your correction. it should for 10cm, 50cm, and 1m (and NAIP).

Also, last night I did include the treat*Res term for interactions since this is precisely the information we are seeking. This morning when I checked, I found that it only performed a test for 3/5 of my response classes (Tree, shrub and other), leaving out bare ground, and herbaceous. Do you know if there are any good reasons this might be happening? I assume that it has to do with having a baseline reference class, but I would expect to see at least 4/5 comparisons if that were the case… That said, when I restructure my data as a column for each class and use class multinomialk, the issue disappears, but the results differ; might this multinomialk approach still be valid or perhaps better?

I did suspect that I would need to do some factor recoding. It changed the output dramatically when I did so. I wonder though, when running contrasts(res.data$Res) why is the last row -1, -1, -1, -1, instead of 0, 0, 0, 0. Is this more or less arbitrary dummy coding?

How exactly do I specify the random slopes anyway? Instead of random = ~us(trait):Plot, use random = ~us(1+trait):Plot?

Lastly, what exactly are the estimates in the summary telling me. Say the estimate was 3 for the when comparing tree calls at the 10cm scale vs the 1m scale. Does this mean the at the 1m scale I am 3 times more likely to pick trees?

Thanks again, you are making me look like a pro at work. I’m surprised they are consulting an ecologist as a statistician; I have no one to consult besides the interwebs, so much thanks.

JP

LikeLike this

August 29, 2013 at 3:29 pm

Hi JP,

since I had a couple of classes of wine and beer, let me focus on the basics. (otherwise, we have drunk linguist advising an economist advising ….).

As for only getting only three coefficients, that makes sense. First, consider the case without an interaction. You specified

`-1 + trait`

, which should give you four parameters for the five outcomes. The probability of the fifth outcome is 1 – the sum of the probabilities of all four other outcomes. Hence, no fifth parameter is required. That’s what you saw in the output.Now consider the case with an interaction. For simplicity’s sake, let’s assume that

`res`

is still treatment coded. The effect of`res`

has three parameters and so does its interaction with`trait`

. In this model, the ‘main’ effect of`res`

predict the treatment effect of`res`

for whatever level of`trait`

is not given an interaction (i.e., herbaceous). The treatment effects of`res`

for tree, shrub, and other given by the interaction terms. Finally, the predictions for the fifth level of the outcome (i.e., bare ground) are derived just the same way as in the first model: the predicted probability of the fifth outcome is 1 – the sum of the probabilities of all four other outcomes.Essentially, the ‘main’ effects of

`trait`

are all intercepts. The effect of`res`

and its interaction with`trait`

togetherdescribe the effect of resolution at the different levels of`trait`

.As for the coding advice I gave, I recommended

sum-coding, not dummy/treatment-coding. The code I provided did that. The contrasts are the columns and by default the last column will be the one that everything is compared against (hence the -1). But for more detail I recommend Maureen’s tutorial. It has exercises that are good to work through to get a more solid understanding of the different coding options.Finally, you asked how to read the estimates. The estimates are the coefficients in

log-odds. Let’s go through an example once we have ascertained that the above code works? (let me know).Florian

LikeLike this

August 29, 2013 at 4:31 pm

Flo,

For some reason I can’t place my reply below your latest post, so I appologize if this post is out of place.

Just to be clear, my dependent response (Call) has 5 classes (Tree, shrub, herb, bare, other) while my resolution predictor (Res) has 4 classes (10cm, 50cm, 100cm, and NAIP).

I am somewhat confused now, as I thought your examples response also had 5 levels, yet it appears as if you used k=4 in your prior…

Here is what my newest code-block is looking like, note, I’ve reorganized your recommended sum coding so 10cm is the reference level:

#convert treatment coding to sum coding

res.data.s = res.data

res.data.s$Res = factor(res.data.s$Res, levels = c(“50cm”, “1m”, “naip”, “10cm”))

contrasts(res.data.s$Res) = contr.sum(4)

contrasts(res.data.s$Res)

k <- length(levels(res.data.s$Call)) #to double check this ended up being 5 as it should

I <- diag(k-1)

J <- matrix(rep(1, (k-1)^2), c(k-1, k-1))

model = MCMCglmm(Call~ -1 + trait * Res,

random = ~us(trait):Plot,

rcov = ~ us(trait):units,

prior = list(

R = list(fix=1, V=0.5 * (I + J), n = 5),

G = list(

G1 = list(V = diag(5), n = 5))),

burnin = 100,

nitt = 500,

family = "categorical",

data = res.data.s)

This produces the following error:

Error in priorformat(if (NOpriorG) { :

V is the wrong dimension for some prior$G/prior$R elements

Perhaps this is arising from some confusion about my response having 5 levels and my predictor having 4 levels? Am I missing something here? It only works if my prior is:

R = list(fix=1, V=0.5 * (I + J), n = 4), #this can also run if it is set as 5, but only if G1 has 5's as 4's

G = list(

G1 = list(V = diag(4), n = 4))),

Thanks again, and I do hope you are enjoying your beer and wine

JP

LikeLike this

August 29, 2013 at 6:06 pm

Oh, I really shouldn’t drink and write. You’re absolutely right. The example was for a 5-level response variable and you had done everything alright. In order not to confuse others, I revised my original answer to your question above. I hope it helps. Can you post the output of the interaction model as you

originallyran it?LikeLike this

August 29, 2013 at 7:15 pm

Flo,

Absolutely. This model is obviously not accurate due to the short burn in etc… but I just wanted to get a quick example output.

model5frs = MCMCglmm(Call~ -1 + trait * Res,

random = ~us(trait):Plot,

rcov = ~ us(trait):units,

prior = list(

R = list(fix=1, V=0.5 * (I + J), n = 4),

G = list(

G1 = list(V = diag(4), n = 4))),

burnin = 100,

nitt = 500,

#thin = 100,

family = “categorical”,

data = res.data.s)

OUTPUT BELOW:

Iterations = 101:491

Thinning interval = 10

Sample size = 40

DIC: 26325.25

G-structure: ~us(trait):Plot

post.mean l-95% CI u-95% CI eff.samp

Call.Herbaceous:Call.Herbaceous.Plot 1.3104 0.77742 1.95209 40.00

Call.Other:Call.Herbaceous.Plot 0.1647 -0.10811 0.36195 80.87

Call.Shrub:Call.Herbaceous.Plot 0.2836 -0.15647 0.69475 36.40

Call.Tree:Call.Herbaceous.Plot -1.2054 -2.61764 -0.47034 40.00

Call.Herbaceous:Call.Other.Plot 0.1647 -0.10811 0.36195 80.87

Call.Other:Call.Other.Plot 0.2514 0.14707 0.41632 40.00

Call.Shrub:Call.Other.Plot 0.1941 -0.02443 0.47858 40.00

Call.Tree:Call.Other.Plot -0.1844 -0.65000 0.06387 40.00

Call.Herbaceous:Call.Shrub.Plot 0.2836 -0.15647 0.69475 36.40

Call.Other:Call.Shrub.Plot 0.1941 -0.02443 0.47858 40.00

Call.Shrub:Call.Shrub.Plot 1.0291 0.55959 1.61985 59.94

Call.Tree:Call.Shrub.Plot -0.8194 -1.60903 -0.28891 29.60

Call.Herbaceous:Call.Tree.Plot -1.2054 -2.61764 -0.47034 40.00

Call.Other:Call.Tree.Plot -0.1844 -0.65000 0.06387 40.00

Call.Shrub:Call.Tree.Plot -0.8194 -1.60903 -0.28891 29.60

Call.Tree:Call.Tree.Plot 2.2644 1.09706 4.38641 22.29

R-structure: ~us(trait):units

post.mean l-95% CI u-95% CI eff.samp

Call.Herbaceous:Call.Herbaceous.units 1.0 1.0 1.0 0

Call.Other:Call.Herbaceous.units 0.5 0.5 0.5 0

Call.Shrub:Call.Herbaceous.units 0.5 0.5 0.5 0

Call.Tree:Call.Herbaceous.units 0.5 0.5 0.5 0

Call.Herbaceous:Call.Other.units 0.5 0.5 0.5 0

Call.Other:Call.Other.units 1.0 1.0 1.0 0

Call.Shrub:Call.Other.units 0.5 0.5 0.5 0

Call.Tree:Call.Other.units 0.5 0.5 0.5 0

Call.Herbaceous:Call.Shrub.units 0.5 0.5 0.5 0

Call.Other:Call.Shrub.units 0.5 0.5 0.5 0

Call.Shrub:Call.Shrub.units 1.0 1.0 1.0 0

Call.Tree:Call.Shrub.units 0.5 0.5 0.5 0

Call.Herbaceous:Call.Tree.units 0.5 0.5 0.5 0

Call.Other:Call.Tree.units 0.5 0.5 0.5 0

Call.Shrub:Call.Tree.units 0.5 0.5 0.5 0

Call.Tree:Call.Tree.units 1.0 1.0 1.0 0

Location effects: Call ~ -1 + trait * Res

post.mean l-95% CI u-95% CI eff.samp pMCMC

traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *

traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55

traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *

traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *

Res1m 0.251940 0.141353 0.350335 5.852 <0.03 *

Resnaip -0.124915 -0.286551 0.043194 5.987 0.10 .

Res10cm 0.007258 -0.138939 0.135456 3.257 1.00

traitCall.Other:Res1m -0.160500 -0.257337 -0.062635 4.673 <0.03 *

traitCall.Shrub:Res1m -0.382345 -0.469192 -0.284058 11.096 <0.03 *

traitCall.Tree:Res1m -0.048957 -0.165548 0.064457 10.073 0.35

traitCall.Other:Resnaip 0.068419 -0.112842 0.193998 5.869 0.40

traitCall.Shrub:Resnaip 0.290083 0.213862 0.402369 11.026 <0.03 *

traitCall.Tree:Resnaip 0.492460 0.355131 0.631961 7.151 <0.03 *

traitCall.Other:Res10cm -0.066982 -0.166105 -0.001008 12.644 0.05 *

traitCall.Shrub:Res10cm 0.209294 0.114260 0.332900 9.099 <0.03 *

traitCall.Tree:Res10cm -0.018507 -0.134081 0.101332 5.141 0.75

Note that right now 50cm is the reference resolution. I will have to fix that later. But anyway, to remind you of my questions. I don't quite see why there are only three within Call (ie. tree) across resolution comparisons for the interaction and I also don't know quite how these would be interpreted in a useful context (ie, we know they are significant, but how do I translate the magnitude to something the lay-person can understand).

Also,

How exactly do I specify the random slopes anyway? Instead of random = ~us(trait):Plot, use random = ~us(1+trait):Plot?

Cheer mate!

JP

LikeLike this

August 30, 2013 at 2:59 am

Hey JP,

thanks for posting the model. I tried to explain why you only get three `interaction’ terms in the earlier post that I revised based on your comments. Part of the trick in understanding this is to see that this isn’t really an ordinary interaction (since

`trait`

is not like an ordinary factor). First consider the following model with only an ordinary intercept:`Model 1: Call~ +1`

It’s hard to see what Model 1 would mean ( the intercept would describe the log-odds of any of the four non-baseline outcomes over the baseline). Since we’re now trying to predict the probability of all 5 outcomes with

oneconstant. Really, the model that corresponds to an intercept only model for our five-way multinomial model is the following:`Model 2: Call~ -1 + trait`

In Model 2, there will be four predictors. The fixed effect output of the model gives you the predicted log-odds (=mean log-odds expected by MCMC fitted posterior distribution of those parameters) for four of the five outcomes,

relative to the fifth outcome. The predicted log-odds of the fifth outcome are implicit since they are redundant: they can be derived from knowing the log-odds of the other four outcomes since the probability of all outcomes has to sum to 1 (that’s an assumption made by the model, sometimes called exhaustivity). The output of Model 2 will look something like this (the numbers are made up/copied from the output of your larger model):`traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *`

traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55

traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *

traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *

So the predicted log-odds of a

Herbaceouscall over aBare groundcall are 2.017779 or about 88.3%`=plogis(2.017779)`

. The predicted log-odds of aTreecall over aBare groundcall are .863295 or about 70.3%. etc. The course notes by Jarrod Hadfield, chapter 5, contain code to translate these predicted log-odds, which are expressed against the baseline, into probabilities for all possible outcomes.Now, let’s turn to the model you first ran, without an `interaction’.

`Model 3: Call~ -1 + trait + Res`

This model will have only three additional parameters for the four levels of

`res`

. Something like:traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *

traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55

traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *

traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *

Res1m 0.251940 0.141353 0.350335 5.852 <0.03 *

Resnaip -0.124915 -0.286551 0.043194 5.987 0.10 .

Res10cm 0.007258 -0.138939 0.135456 3.257 1.00

The interpretation of the

`Res`

predictors is, however, not what you might first think. Those parameters describe the effect that`Res`

has on the log-odds ofanyof the four non-baseline outcomes over the baseline outcome. I.e., we’re predicting the effect of resolution onHerbeceous,Other,Shrub, andTreecompared toBare groundcalls.The model with the `interaction’ adds only three parameters because the fourth the `main’ effects of resolution already capture one of the four contrasts that we need to describe the overall effect of resolution. Actually, this becomes easier to see, if instead of:

`Model 4a: Call~ -1 + trait * Res`

we write:

`Model 4b: Call~ -1 - Res + trait * Res`

The output should look structurally something like this (again, ignore the numbers):

post.mean l-95% CI u-95% CI eff.samp pMCMC

traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *

traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55

traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *

traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *

traitCall.Herbaceous:Res1m 0.251940 0.141353 0.350335 5.852 <0.03 *

traitCall.Other:Res1m -0.160500 -0.257337 -0.062635 4.673 <0.03 *

traitCall.Shrub:Res1m -0.382345 -0.469192 -0.284058 11.096 <0.03 *

traitCall.Tree:Res1m -0.048957 -0.165548 0.064457 10.073 0.35

traitCall.Herbaceous:Resnaip -0.124915 -0.286551 0.043194 5.987 0.10 .

traitCall.Other:Resnaip 0.068419 -0.112842 0.193998 5.869 0.40

traitCall.Shrub:Resnaip 0.290083 0.213862 0.402369 11.026 <0.03 *

traitCall.Tree:Resnaip 0.492460 0.355131 0.631961 7.151 <0.03 *

traitCall.Herbaceous:Res10cm 0.007258 -0.138939 0.135456 3.257 1.00

traitCall.Other:Res10cm -0.066982 -0.166105 -0.001008 12.644 0.05 *

traitCall.Shrub:Res10cm 0.209294 0.114260 0.332900 9.099 <0.03 *

traitCall.Tree:Res10cm -0.018507 -0.134081 0.101332 5.141 0.75

So, now the predicted log-odds of, for example, a

Shrubcall over aBare groundcall are predicted to decrease if the resolution is 1m rather than 50cm (the coefficient is -0.382345). For a resolution of 50cm, the predicted log-odds of aShrubcall over aBare groundcall are 1.985699. For a resolution of 1m, the predicted log-odds of aShrubcall over aBare groundcall are 1.985699 + -0.382345 = 1.603354. So the predicted probability of aShrubcall over aBare groundcall goes from about 87.9% for a resolution of 50cm down to 83.2% for a resolution of 1m. Since we often don’t care about the specific probabilities, it might be sufficient to say that going from a resolution of 50cm to a resolution of 1m decreases the probability ofShrubrelative to aBare groundcall.Btw, the bottom of the following page contains a nice and simple way to visualize your results (once you have extracted the fitted probabilities from the model).

As for your question about random slopes, unfortunately, I haven’t tried that myself. I couldn’t find direct pointers to this question either (and I have to switch to preparing for a conference). It seems though that what you had in mind is the right thing. Here’s an example for a linear mixed effect model in MCMCglmm with random slopes, which to me suggests that the syntax you suggested should work. Also, have you tried to see what MCMCglmm defaults to when you do not specify the

`random`

argument for your model?LikeLike this

April 7, 2014 at 4:42 pm

You provided a link to http://www.ats.ucla.edu/stat/r/dae/mlogit.htm as a simple way to visualize your results. However, the predict function currently does not work with MCMCglmm… any suggestions on how to visualize data. Can you alter the suggested code so that you are not predicting new data?

LikeLike this

April 7, 2014 at 6:12 pm

Hi Kyle, we’re working on a related project that might result in the type of code you’re looking for but I don’t think we’re quite there yet. You could contact Klinton Bicknell, now at Northwestern, to ask him about this (he’s taking the lead in that project). Sorry, to not be more helpful.

Florian

LikeLike this

April 8, 2014 at 2:12 pm

Thanks. I did find a wrapper for the MCMCglmm package called mcmcglmm that enables prediction of new data using the PredictNew command, but I haven’t really tried using it yet.

LikeLike this

August 30, 2013 at 3:07 pm

Ok, so I noticed that your recommendation of using the formula Call~ -1 -Res + trait*Res didn’t work, instead, it has to be Call~ -1 + trait*Res – Res in order for the 4th interaction to show up. So I got the following output when doing this:

post.mean l-95% CI u-95% CI eff.samp pMCMC

traitCall.Herb 1.95153 1.11930 2.63090 2.241 <0.03 *

traitCall.Other -0.18640 -0.42445 -0.02673 62.460 0.05 *

traitCall.Shrub 2.04203 1.14134 2.69823 2.473 <0.03 *

traitCall.Tree 0.81595 0.33647 1.24168 24.597 <0.03 *

traitCall.Herb:Res1m 0.30195 0.17190 0.43011 2.139 <0.03 *

traitCall.Other:Res1m 0.20085 0.09327 0.29599 21.610 <0.03 *

traitCall.Shrub:Res1m -0.27070 -0.43171 -0.14375 4.281 <0.03 *

traitCall.Tree:Res1m 0.37537 0.26161 0.48194 23.174 <0.03 *

traitCall.Herb:Resnaip -0.06315 -0.15464 0.05195 9.795 0.25

traitCall.Other:Resnaip 0.08218 0.01528 0.20684 13.112 0.05 .

traitCall.Shrub:Resnaip 0.04392 -0.04036 0.14581 12.443 0.45

traitCall.Tree:Resnaip 0.51446 0.39610 0.62859 7.038 <0.03 *

traitCall.Herb:Res50cm 0.13053 0.03043 0.19995 18.936 <0.03 *

traitCall.Other:Res50cm 0.15048 0.07094 0.21995 18.512 <0.03 *

traitCall.Shrub:Res50cm -0.04734 -0.20706 0.07930 9.224 0.50

traitCall.Tree:Res50cm 0.21715 0.11581 0.31486 10.364 <0.03 *

Suppose now that rather than wanting to know the probability that a technician will classify a point as a tree relative to bare ground at the 50cm scale vs the 10cm scale, I just want to know the absolute probability that a technician will classify a pixel as a tree at the 50cm scale vs 10cm scale (basically, I don't care about the relative probabilities between classes, but instead just between resolutions), or say, probability of classifying as a shrub at 50cm vs 10cm. This is pretty tricky to interpret, isn't it.

I can't seem to wrap my heard around it…

LikeLike this

August 30, 2013 at 3:43 pm

Hey JP,

great — looks like we’re making progress =)! And thanks for posting the correction (I was surprised that the order matters, but I hadn’t test run the code and it’s perfectly plausible that it matters for the way MCMCglmm parses the formula object). For the interpretation that you’re looking for, I recommend you read the course notes I referred to (http://cran.r-project.org/web/packages/MCMCglmm/vignettes/CourseNotes.pdf, Chapter 5). It actually provides a few lines of code to translate the output of your model into probabilities for each cell of your design and for each level of the response variable. If you get stuck let me know?

Florian

LikeLike this

April 7, 2014 at 9:08 pm

I ended up contacting Jarrod Hadfield to resolve this question of testing one group against the next. I framed it using his example from Pg. 101 of the course notes:

ME:

“I was wondering if you could provide some insight on how to find p-values for the example on p101 of your course notes. I am specifically interested in how I might go about testing whether there is a significant difference between the probability of observing polled males vs polled females. I understand how to use the HPDinterval function to obtain credible intervals, but is there a similar function that can obtain p-values for those credible intervals?”

J. Hadfield:

“The p-values in summary.MCMCglmm are obtained using:

2 * pmax(0.5/length(x), pmin(sum(x > 0)/length(x), 1 – sum(x > 0)/length(x)))

where in your case x would be the difference in probabilities. Its not ideal, but probably gives a fairly reliable answer.”

So it’s been a while, and I don’t quite remember what’s going on here, but essentially, ‘x’ is a vector containing the differences in probabilities for each iteration. Beyond that, it’s not clear to me what is going on. It seemed to work out though and gave me the test I was looking for.

LikeLike this

April 8, 2014 at 2:02 pm

Thanks for posting this. Much appreciated. I’m swamped right now and won’t get to play around with this for a while, but I’m sure others will be happy to see this solution posted.

LikeLike this

August 29, 2013 at 12:41 pm

Also, there is more documentation and information in the MCMCglmm authors course:

http://cran.r-project.org/web/packages/MCMCglmm/vignettes/CourseNotes.pdf

LikeLike this

January 25, 2014 at 4:40 am

I can’t figure out how to specify heteroskedasticity on level 1 in MCMCglmm. Basically, I’ve got individuals within countries, and my research question is how does variance for individuals (after control for the country-level) changes with my predictor, which is time. I have estimated that model in MLwiN, and I can reproduce it almost completely in MCMCglmm, except for specifying the predictor for variance on the inividual level. Can it be done? Thank you so much in advance.

LikeLike this

January 29, 2014 at 10:58 am

Hi Maxim,

sorry, I’m swamped at the moment and don’t know from the top of my head the answer to your question. I would recommend that you write to the developer of the package (Jarrod Hadfield) or to the r-sig-mixed-models email list (he’s one of the people replying). If you have the time, it’d be great if you could post a summary of the reply here? In the meantime, I approved this post, so that readers of this blog can jump in and throw in their answer.

Sorry, to not be of more help!

LikeLike this