Multinomial random effects models in R



This post is partly a response to this message. The author of that question is working with ordered categorical data. For that specific case, there are several packages in R that might work, none of which I've tried. The most promising is the function DPolmm() from DPpackage. It's worth noting, though, that with that package you are committed to a Dirichlet Process prior for the random effects (instead of the more standard Gaussian). A different package, mprobit, allows one clustering factor, which could be suitable depending on the data set. MNP, mlogit, multinomRob, vbmp, nnet, and msm all offer some capability for modeling ordered categorical data, and it's possible that one of them allows for random effects (though I haven't discovered any yet). MCMCpack may also be useful, as it provides MCMC implementations for a large class of regression models. lrm() from the Design package handles ordered categorical data, and clustered bootstrap sampling can be used for a single cluster effect.

I’ve recently had some success using MCMCglmm for the analysis of unordered multinomial data, and want to post a quick annotated example here. It should be noted that the tutorial on the CRAN page is extremely useful, and I encourage anyone using the package to work through it.

I’m going to cheat a bit in my choice of data sets, in that I won’t be using data from a real experiment with a multinomial (or polychotomous) outcome. Instead, I want to use a publicly available data set with some relevance to language research. I also need a categorical dependent variable with more than two levels for this demo to be interesting. Looking through the data sets provided in the languageR package, I noticed that the dative data set has a column SemanticClass which has five levels. We’ll use this as our dependent variable for this example. We’ll investigate whether the semantic class of a ditransitive event is influenced by the modality in which it is produced (spoken or written).

library(MCMCglmm)
data("dative", package = "languageR")

k <- length(levels(dative$SemanticClass))  # number of outcome categories (5)
I <- diag(k - 1)                           # (k-1) x (k-1) identity matrix
J <- matrix(1, k - 1, k - 1)               # (k-1) x (k-1) matrix of ones

m <- MCMCglmm(SemanticClass ~ -1 + trait + Modality,
              random = ~ us(trait):Verb + us(Modality):Verb,
              rcov = ~ us(trait):units,
              prior = list(
                R = list(fix=1, V=0.5 * (I + J), n = 4),
                G = list(
                  G1 = list(V = diag(4), n = 4),
                  G2 = list(V = diag(2), n = 2))),
              burnin = 15000,
              nitt = 40000,
              family = "categorical",
              data = dative)

Read on for an explanation of this model specification, along with some functions for evaluating the model fit.

The first two lines load the MCMCglmm package and give us access to the dative data set without loading all of the languageR package.

The next three lines set up some useful constants. k is the number of categories in our outcome variable; I is the (k-1) x (k-1) identity matrix, and J is a (k-1) x (k-1) matrix of ones. We'll use these to set up constraints on the covariance structure of the residuals.
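Just to see what these constants build, here is what the fixed residual prior used below evaluates to when k = 5:

> 0.5 * (I + J)
     [,1] [,2] [,3] [,4]
[1,]  1.0  0.5  0.5  0.5
[2,]  0.5  1.0  0.5  0.5
[3,]  0.5  0.5  1.0  0.5
[4,]  0.5  0.5  0.5  1.0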

The remainder of the code specifies the multinomial model, and we’ll break it down into smaller chunks.

SemanticClass ~ -1 + trait + Modality,

This first line specifies the dependent variable, SemanticClass, and the fixed effects, -1 + trait + Modality. The term trait is not part of the original data set; rather, it's a reserved term in the MCMCglmm package. trait comes from phylogenetics, where the dependent variable is some observed trait in an organism. For our purposes, including the trait term tells the model to estimate an intercept for each of the observed outcomes. We follow the advice from the tutorial and enter this as -1 + trait. This gives us an intercept for each non-baseline level of the outcome, rather than an overall intercept term with an offset for k-2 levels.
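To make this concrete, here are the levels of the outcome variable; the first level, "a", serves as the reference category, so the model estimates intercepts for the other four (which is why the output below shows traitSemanticClass.c, .f, .p, and .t):

> levels(dative$SemanticClass)
[1] "a" "c" "f" "p" "t"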


random = ~ us(trait):Verb + us(Modality):Verb,

The next line specifies our random effects. To the right of the colon is the variable in our data that will be used as a grouping variable, in this case Verb. The content on the left side of the colon specifies which fixed effects will also have random offsets. These fixed effects are wrapped in a function call which defines the structure of the variance-covariance matrix for that random effect. In this case, we’re allowing an unstructured variance-covariance matrix for the levels of the outcome and the Modality of the observation. The other relevant possibility would be to use idh(trait):Verb, which constrains the random effects to be uncorrelated. For more information on these specifications, see page 12 of the tutorial.
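For comparison, here is a sketch of the same call with the uncorrelated variant (I haven't run this version here; only the random formula changes, and the priors keep the same dimensions):

m.idh <- MCMCglmm(SemanticClass ~ -1 + trait + Modality,
                  random = ~ idh(trait):Verb + us(Modality):Verb,
                  rcov = ~ us(trait):units,
                  prior = list(
                    R = list(fix=1, V=0.5 * (I + J), n = 4),
                    G = list(
                      G1 = list(V = diag(4), n = 4),
                      G2 = list(V = diag(2), n = 2))),
                  burnin = 15000,
                  nitt = 40000,
                  family = "categorical",
                  data = dative)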


rcov = ~ us(trait):units,

This line specifies the structure of the residuals. units is another reserved word in MCMCglmm, in this case referring to each individual observation. This specification of the variance-covariance matrix of the residuals allows arbitrary correlations in the errors. Again, see page 12 of the tutorial for the precise terms being fit.


prior = list(
R = list(fix=1, V=0.5 * (I + J), n = 4),

Now we define the priors for the model. This is done using some structures defined by MCMCglmm. The R-structure is the variance-covariance matrix for the residuals. The G-structure is the variance-covariance matrix for the random effects (which is itself a list of sub-specifications, one per random effect). These structures are defined on pages 16-19 of the tutorial.

The R-structure in this case is set to have a fixed form (fix = 1). The reasoning behind this decision is given on page 9 of the tutorial. Basically, for data such as we have here where each observation is a single sample from a distribution over k categorical outcomes, we can’t actually estimate the residual variance because it depends entirely on the mean (think of the binomial distribution in the simplest case). We follow the prescriptions from the package authors for working with data from this type of distribution, fixing the variance to be 1 for all of the diagonal terms (variances) and .5 for all of the off-diagonal terms (covariances). This advice can be found in the table on page 23 of the tutorial, with further information on page 21.


G = list(
G1 = list(V = diag(4), n = 4),
G2 = list(V = diag(2), n = 2))),

Here are the priors on the random effects. When priors are not fixed, they are inverse-Wishart distributed. V is a hyperparameter giving the prior scale of the variance-covariance matrix, and n can be thought of as a degree-of-belief parameter. The tutorial suggests that n >= k to aid in convergence. Here we choose the lowest n that results in a proper prior, or the least-informative proper prior (but not an uninformative prior).


burnin = 15000,
nitt = 40000,

Here we specify how long to run the sampler. The burnin argument sets the number of initial iterations to discard. The goal is to converge on a set of stable estimates for the model parameters. We know this won't happen instantaneously, so we give the model some time to settle into a good region of parameter space before we start keeping samples; those initial throwaway iterations make up the burn-in period. The nitt argument gives the total number of iterations, including the burn-in, so here we keep samples from the final 25,000 iterations. There's no guarantee that the model will have settled on stable estimates by the end of the burn-in, though, so it will be important to test whether the model has converged in the samples we end up using. It should also be noted that only every nth sample is kept, as controlled by the thin argument (the default is 10); this thinning is usually done to reduce dependence between successive samples.
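As a quick sanity check on those numbers, with the default thinning interval of 10 we should end up with

> (40000 - 15000) / 10
[1] 2500

stored samples per parameter, which matches the 2500 rows we will see in m$Sol below.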


family = "categorical",
data = dative)

We use the "categorical" family, appropriate for data that consists of one line per observation. If we had multinomial count data, we would use the "multinomialK" family, where K is the number of levels in the dependent variable. In this case, if the data were restructured as multinomial count data, we would use family = "multinomial5". To restructure the data, we could use something like


library(reshape)
dative.melt <- melt(dative,
                    id.vars = c("Modality", "Verb"),
                    measure.vars = c("SemanticClass"))
# one row per Modality x Verb combination, with a count column for each semantic class
dative.wide <- cast(dative.melt, Modality + Verb ~ value, fun.aggregate = length)

But in this case, it works just as well to use the original data and the family = "categorical" argument to MCMCglmm.

When we run the model, we’ll get output like

MCMC iteration = 0

E[MH acceptance ratio] = 0.000168

MCMC iteration = 1000

E[MH acceptance ratio] = 0.263537

MCMC iteration = 2000

E[MH acceptance ratio] = 0.261401

This tells us that the model is being fit using the Metropolis-Hastings (MH) algorithm rather than pure Gibbs sampling. It also reports the proportion of proposals accepted in each set of one thousand iterations. An acceptance rate that's too high usually means the sampler is taking very small steps, so the space will be explored very slowly (and one that's too low means most proposals are being rejected). Anecdotally, an acceptance rate between .25 and .5 seems to indicate good performance.

After fitting the model, we can start by examining the result object, m.

> str(m)
List of 8
 $ Sol     : mcmc [1:2500, 1:5] -0.833 -2.191 -1.027 -0.978 -2.456 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "traitSemanticClass.c" "traitSemanticClass.f" "traitSemanticClass.p" "traitSemanticClass.t" ...
  ..- attr(*, "mcpar")= num [1:3] 1 2500 1
 $ VCV     : mcmc [1:2500, 1:36] 9.45 11.18 7.02 7.78 7.33 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:36] "Verb.trait.SemanticClass.c.SemanticClass.c" "Verb.trait.SemanticClass.f.SemanticClass.c" "Verb.trait.SemanticClass.p.SemanticClass.c" "Verb.trait.SemanticClass.t.SemanticClass.c" ...
  ..- attr(*, "mcpar")= num [1:3] 1 2500 1
 $ Liab    : NULL
 $ Fixed   :Class 'formula' length 3 SemanticClass ~ -1 + trait + Modality
  .. ..- attr(*, ".Environment")=
 $ Random  :Class 'formula' length 2 ~us(trait):Verb + us(Modality):Verb
  .. ..- attr(*, ".Environment")=
 $ Residual:Class 'formula' length 2 ~us(trait):units
  .. ..- attr(*, ".Environment")=
 $ Deviance:Class 'mcmc'  atomic [1:2500] 3079 3063 3101 3062 3127 ...
  .. ..- attr(*, "mcpar")= num [1:3] 1 2500 1
 $ DIC     : num 3572

Of immediate interest are the m$Sol and m$VCV members of the list. The former holds the samples for the fixed-effects parameters, and the latter holds the variance-covariance parameters associated with the residuals and the random effects. These each have the class mcmc, meaning they represent samples from an MCMC fit; here we have every stored sample generated by the model after the burn-in period. We can start our examination of the fixed effects (stored in m$Sol) using some visualizations provided by the coda and lattice packages.
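For example (a sketch; plot(), xyplot(), and densityplot() here are the standard methods that coda and lattice provide for mcmc objects):

library(coda)
library(lattice)
plot(m$Sol)          # trace plots and marginal densities for the fixed effects
xyplot(m$Sol)        # lattice trace plots
densityplot(m$VCV)   # posterior densities of the (co)variance components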

Next we can look at the numeric estimates of the fixed effects.

> summary(m$Sol)

Iterations = 1:2500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 2500 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                        Mean     SD Naive SE Time-series SE
traitSemanticClass.c -3.4160 1.0983  0.02197        0.11372
traitSemanticClass.f -2.5098 0.7304  0.01461        0.06545
traitSemanticClass.p -5.3633 1.2252  0.02450        0.12210
traitSemanticClass.t  0.7708 0.8786  0.01757        0.07661
Modalitywritten      -2.2994 0.7708  0.01542        0.07250

2. Quantiles for each variable:

                        2.5%     25%     50%    75%   97.5%
traitSemanticClass.c -5.6219 -4.1561 -3.4069 -2.694 -1.2674
traitSemanticClass.f -3.8757 -3.0037 -2.5280 -2.033 -1.0036
traitSemanticClass.p -7.7261 -6.2194 -5.2755 -4.504 -3.1063
traitSemanticClass.t -0.9578  0.1640  0.7919  1.363  2.4537
Modalitywritten      -3.8277 -2.7963 -2.2668 -1.778 -0.8377

Finally, we can check out the Highest Posterior Density intervals for the fixed effects.

> HPDinterval(m$Sol)
                          lower      upper
traitSemanticClass.c -5.4157540 -1.1619380
traitSemanticClass.f -3.9443586 -1.0921360
traitSemanticClass.p -7.7055507 -3.0929750
traitSemanticClass.t -0.9460623  2.4557778
Modalitywritten      -3.7766821 -0.7944914
attr(,"Probability")
[1] 0.95

If I were dealing with a real data set, I would also run the exact same model a few more times, saving each set of results separately. This allows us to test for model convergence by comparing results across runs of the model (see gelman.diag). Other convergence metrics that would be important to check for any real data would be heidel.diag, geweke.diag, and raftery.diag. I find the Raftery test to be relatively intuitive to interpret, but it can require generating a large number of samples.
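As a sketch of what that workflow might look like (the fit_model() wrapper is just for illustration; it repeats the call from above, with verbose = FALSE to suppress the iteration messages):

fit_model <- function() {
  MCMCglmm(SemanticClass ~ -1 + trait + Modality,
           random = ~ us(trait):Verb + us(Modality):Verb,
           rcov = ~ us(trait):units,
           prior = list(
             R = list(fix=1, V=0.5 * (I + J), n = 4),
             G = list(
               G1 = list(V = diag(4), n = 4),
               G2 = list(V = diag(2), n = 2))),
           burnin = 15000,
           nitt = 40000,
           family = "categorical",
           data = dative,
           verbose = FALSE)
}

chains <- lapply(1:3, function(i) fit_model())
fixed.chains <- do.call(mcmc.list, lapply(chains, function(x) x$Sol))

gelman.diag(fixed.chains)      # potential scale reduction factors; values near 1 are good
heidel.diag(chains[[1]]$Sol)
geweke.diag(chains[[1]]$Sol)
raftery.diag(chains[[1]]$Sol)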

Questions can be left here in the comments, or can be sent to the R-lang list.


31 thoughts on “Multinomial random effects models in R”

    Max Bane said:
    May 14, 2009 at 2:58 pm

    Cool, very nice introduction. One question I have, though, is how to incorporate alternative-specific variables in this framework. The example you have here has just one item-specific variable, Modality, but some of the data I’m working with seems like it would best be modeled by a combination of item-specific and alternative-specific predictors. I’ve been playing around recently with the mlogit package, which allows you to specify such models very nicely (“dependent ~ altSpec1 + … + altSpecN | itemSpec1 + … + itemSpecM”), but it doesn’t support random effects.


    Graham Leask said:
    November 22, 2009 at 11:38 am

    An interesting introduction. Has this advanced any further to hierarchical random effect models for multinomial responses?


    Shige said:
    January 30, 2010 at 4:25 pm

    Cool introduction, thanks.


    2010 in review « HLP/Jaeger lab blog said:
    January 3, 2011 at 9:48 am

    [...] Multinomial random effects models in R May 2009 3 comments 4 [...]


    Oscar Inostroza said:
    September 21, 2011 at 1:15 am

    hi

    First of all, thank you very much for your contribution. I'm writing with a small question. My response variable is discrete with 4 states (my independent variable is continuous), so I used the variance structure for the fixed and random effects as you propose in your mini-tutorial. The problem is that when I submit the model, the program gives me an error: object 'J' not found. Any idea why this happens?

    I hope for your answer, please.

    regards


      tiflo said:
      September 21, 2011 at 2:13 am

      Dear Oscar,

      J is part of the specification of the prior. It's best if you read the course notes and the vignette for the MCMCglmm package by Hadfield; they contain a lot of useful information. (I started working with this package myself only recently.)


    Xiao He said:
    June 7, 2012 at 10:57 am

    I’ve started learning how to use MCMCglmm to fit linear mixed-effects models recently. I wonder how one can conduct multiple comparisons after one obtains a significant interaction. With lmer(), we can use glht() in the multcomp package, but I haven’t been able to find a comparable method for MCMCglmm. Any suggestion would be much appreciated!

    Best,
    Xiao


      tiflo said:
      June 7, 2012 at 11:09 am

      Dear Xiao,

      I almost always code interactions a priori based on theoretical considerations, so I haven’t been using glht() and, unfortunately, know of no equivalent for MCMCglmm, but perhaps someone else reading this has an answer?

      Florian


        Xiao He said:
        June 7, 2012 at 11:18 am

        Thanks for the reply! I wonder if you can suggest a couple of articles/textbooks/tutorials on coding interactions a priori. I would be very interested in learning it. Thank you again


    Anna said:
    May 17, 2013 at 8:17 am

    Thank you so much for this summary! I spent the whole day trying to get the hang of MCMCglmm, as I am totally new to it, but this summary finally helped me to get a model. However, now I am not entirely sure whether I used the right prior settings (despite reading through the descriptions over and over again, the mathematics behind it seems to be beyond me).

    k <- length(levels(part$Level))
    I <- diag(k-1)
    J <- matrix(rep(1, (k-1)^2), c(k-1, k-1))

    m <- MCMCglmm(Level ~ -1 + trait + Fixed1 + Fixed2 + Fixed3 + Fixed4 + Fixed5 +
                    Fixed6 + Fixed7 + Fixed8,
                  random = ~ us(trait):IndividualID + us(trait):GroupDyad,
                  rcov = ~ us(trait):units,
                  prior = list(
                    R = list(fix=1, V=0.5 * (I + J), n = 5),
                    G = list(
                      G1 = list(V = diag(4), n = 5),
                      G2 = list(V = diag(4), n = 5))),
                  burnin = 15000,
                  nitt = 40000,
                  family = "categorical",
                  data = part)

    So this is basically what I did. I have my levels in one column, which I hope works out? And the Level is a categorical variable with the classes (0,1,2,3,4); from what I got so far, the model compares them all to 0, is that right? The prior settings for R I basically just took from what you wrote, and the model seems to run, but I am not sure whether it's right, because my fixed variables have different distributions (some are Poisson, some categorical and one even binomial). Do you need to take that into account when formulating the R?
    Thanks a lot for your help!!
    Anna


    Renee said:
    June 26, 2013 at 12:18 am

    It is a very nice tutorial. However, I have a question about how to interpret the result.
    In a multinomial logistic regression model, there should be 4 coefficients for Modalitywritten:
    Modalitywritten class c (compare to the baseline)
    Modalitywritten class f
    Modalitywritten class p
    Modalitywritten class t

    but here in this example, we only got one. How should we explain the influence of modality (written) on each class?
    Thanks very much!

    Renee


      tiflo said:
      June 26, 2013 at 6:29 pm

      Dear Renee,

      The coefficient for modality here describes the effect of modality on all levels of SemanticClass (c,f,p,t). That is, the model described here assumes that the effect of modality is identical at all levels of SemanticClass (i.e., additive to the class intercepts).

      I haven’t used MCMCglmm much, but I assume to relax that assumption, you would add the interaction of modality and trait to the fixed effect structure of the model, changing

      SemanticClass ~ -1 + trait + Modality

      to

      SemanticClass ~ -1 + trait * Modality

      I tried that and it seems to yield the desired result. I used a much shorter burnin and collected fewer samples, so the estimates given below are not to be trusted, but they illustrate my point:


      Iterations = 1001:3991
      Thinning interval = 10
      Number of chains = 1
      Sample size per chain = 300

      1. Empirical mean and standard deviation for each variable,
      plus standard error of the mean:

      Mean SD Naive SE Time-series SE
      traitSemanticClass.c -2.12514 0.7814 0.04511 0.12300
      traitSemanticClass.f -1.39007 0.6487 0.03745 0.08351
      traitSemanticClass.p -2.66922 0.8521 0.04919 0.13989
      traitSemanticClass.t 0.49969 0.7906 0.04564 0.10948
      Modalitywritten -1.68676 0.7375 0.04258 0.12506
      traitSemanticClass.f:Modalitywritten -0.73914 0.2808 0.01621 0.12319
      traitSemanticClass.p:Modalitywritten -0.02762 0.2256 0.01303 0.10308
      traitSemanticClass.t:Modalitywritten -0.75859 0.2458 0.01419 0.08884

      2. Quantiles for each variable:

      2.5% 25% 50% 75% 97.5%
      traitSemanticClass.c -3.7766 -2.6081 -2.08930 -1.5495 -0.7722
      traitSemanticClass.f -2.7074 -1.8566 -1.39134 -0.9053 -0.2231
      traitSemanticClass.p -4.3854 -3.2386 -2.70065 -2.1575 -0.9850
      traitSemanticClass.t -0.9317 -0.1557 0.52490 1.0610 2.0400
      Modalitywritten -2.9807 -2.2319 -1.72275 -1.2190 -0.1443
      traitSemanticClass.f:Modalitywritten -1.1657 -0.9783 -0.77251 -0.5448 -0.1313
      traitSemanticClass.p:Modalitywritten -0.4315 -0.2071 -0.02458 0.1360 0.3805
      traitSemanticClass.t:Modalitywritten -1.2067 -0.9253 -0.74746 -0.6360 -0.1994

      HTH,
      Florian


    Jesse Poulos said:
    August 27, 2013 at 7:09 pm

    Thanks so much for this. Seems to be working, I got a running model but I have no idea how to interpret the output. The tutorial you reference is somewhat helpful, but the questions are so different than mine.

    I am attempting an analysis as follows:

    Background: We are attempting to describe the potential bias that arises from the use of different resolutions of imagery when trying to determine ‘on-the-ground’ vegetation at a given point

    Question: does spatial resolution influence the probability that a technician will describe the vegetation at a point as a tree, shrub, or herbaceous vegetation, etc…

    Objective: model the effect of imagery resolution (i.e. pixel size) in determining the ‘call’ made by a technician

    Data: We have selected 109 points clustered within 33 'plots'; at each point there is a 'call' of tree, shrub, herb, bare, or other. This was done 4 times for each point, at image resolutions of 1m, 50cm, 10cm, and using aerial imagery from the National Agriculture Imagery Program (NAIP). Plots are spaced at least 6km apart along a grid.

    So the data looks like this:


    Plot Point Call Resolution
    41-69-6644 1 Tree 1m
    41-69-6644 1 Tree 10cm
    41-69-6644 1 Tree 50cm
    41-69-6644 1 Tree naip
    41-69-6644 2 Herbaceous 1m
    41-69-6644 2 Shrub 10cm
    41-69-6644 2 Shrub 50cm
    41-69-6644 2 Tree naip

    and my code looked like this:


    model = MCMCglmm(Call ~ -1 + trait + Res,
                     random = ~ us(trait):Plot,
                     rcov = ~ us(trait):units,
                     prior = list(
                       R = list(fix=1, V=0.5 * (I + J), n = 4),
                       G = list(
                         G1 = list(V = diag(4), n = 4))),
                     burnin = 1000,
                     nitt = 6000,
                     family = "categorical",
                     data = res.data)

    I only did a few iterations just for exploratory purposes.

    Here is some output:

    Location effects: Call ~ -1 + trait + Res

    post.mean l-95% CI u-95% CI eff.samp pMCMC
    traitCall.Herbaceous 3.8063 3.0089 4.5007 6.954 <0.002 **
    traitCall.Other -0.7597 -1.2783 -0.2033 3.520 <0.002 **
    traitCall.Shrub 3.8189 3.0808 4.5369 5.561 <0.002 **
    traitCall.Tree 2.1910 1.2333 3.2048 24.191 <0.002 **
    Res1m 1.0131 0.3714 1.3977 2.434 <0.002 **
    Res50cm 0.0239 -0.1985 0.2294 6.899 0.752
    Resnaip 0.7726 0.2696 1.0761 3.292 <0.002 **
    ---

    Am I right in thinking that the last three lines are what I am interested in? Are these saying that relative to the 10cm resolution, the calls made at 50cm were not significantly different, but the other two were different? Further, is it possible to further examine what might be causing the differences of the 1m and NAIP scale (i.e. when the call at 10cm was 'shrub' more misclassification occurred).

    Cheers my friend, you were of great help.

    Jesse Poulos


      tiflo said:
      August 29, 2013 at 4:43 am

      Dear Jesse,

      let me see whether I can help. (Btw, I took the liberty of removing what I took to be a typo from your comment; I think you meant 1m, 50cm, and 10cm rather than 10cm, 50cm, and 10cm resolution?)

      In the output you present, the interpretation of the coefficients depends on the factor coding you employed. I will assume that you didn't specify any coding explicitly, in which case R will use treatment coding. Since treatment coding can lead to collinearity between factor levels (especially if you don't have data that is balanced with respect to the factor levels), you might want to consider sum coding (aka ANOVA coding). I recommend Maureen Gillespie's tutorial on factor coding in R, linked on this blog. E.g., you could do:

      res.data$Res = factor(res.data$Res, levels = c("10cm", "50cm", "1m", "naip"))
      contrasts(res.data$Res) = contr.sum(4)
      contrasts(res.data$Res)

      The first four lines describe the predicted mean log-odds of the four responses when the resolution is "10cm". "10cm" was chosen as the baseline by default since it's alphanumerically the first of the four levels of Res, but it also makes sense (since it's presumably the highest resolution?).

      The next three lines describe the treatment effects of Resolution. Minus any problems with collinearity (which, for example, could cause insignificance), your interpretation of these effects is correct.

      In order to investigate whether classification of different categories (shrub, tree, etc.) was affected differently by the resolution, you would need to add the interaction of trait and resolution to your model (i.e., trait * Res, rather than trait + Res).

      Note also that your model currently only contains random intercepts. You did not specify random slopes for the resolution effect (or the interaction of trait and the resolution). The absence of random slopes of an effect (e.g., the absence of random slopes for the resolution effect) makes the evaluation of that effect anti-conservative. Perhaps it’s the best you can do with the data you have, but I wanted to mention it.

      Florian


        Jesse Poulos said:
        August 29, 2013 at 11:56 am

        Tiflo,

        Thanks for the hasty response, and I hope I can tug your eyes and brain a little longer.

        Ahh, yes, you are right in your correction. It should be 10cm, 50cm, and 1m (and NAIP).

        Also, last night I did include the trait*Res term for the interactions since this is precisely the information we are seeking. This morning when I checked, I found that it only performed a test for 3/5 of my response classes (Tree, shrub and other), leaving out bare ground and herbaceous. Do you know if there are any good reasons this might be happening? I assume that it has to do with having a baseline reference class, but I would expect to see at least 4/5 comparisons if that were the case… That said, when I restructure my data as a column for each class and use family multinomialk, the issue disappears, but the results differ; might this multinomialk approach still be valid or perhaps better?

        I did suspect that I would need to do some factor recoding. It changed the output dramatically when I did so. I wonder though, when running contrasts(res.data$Res) why is the last row -1, -1, -1, -1, instead of 0, 0, 0, 0. Is this more or less arbitrary dummy coding?

        How exactly do I specify the random slopes anyway? Instead of random = ~us(trait):Plot, use random = ~us(1+trait):Plot?

        Lastly, what exactly are the estimates in the summary telling me? Say the estimate was 3 when comparing tree calls at the 10cm scale vs the 1m scale. Does this mean that at the 1m scale I am 3 times more likely to pick trees?

        Thanks again, you are making me look like a pro at work. I’m surprised they are consulting an ecologist as a statistician; I have no one to consult besides the interwebs, so much thanks.

        JP


          tiflo said:
          August 29, 2013 at 3:29 pm

          Hi JP,

          since I've had a couple of glasses of wine and beer, let me focus on the basics (otherwise, we have a drunk linguist advising an ecologist advising ….).

          As for getting only three coefficients, that makes sense. First, consider the case without an interaction. You specified -1 + trait, which should give you four parameters for the five outcomes. The probability of the fifth outcome is 1 minus the sum of the probabilities of the four other outcomes, so no fifth parameter is required. That's what you saw in the output.

          Now consider the case with an interaction. For simplicity's sake, let's assume that res is still treatment coded. The effect of res has three parameters and so does its interaction with trait. In this model, the 'main' effect of res predicts the treatment effect of res for whatever level of trait is not given an interaction (i.e., herbaceous). The treatment effects of res for tree, shrub, and other are given by the interaction terms. Finally, the predictions for the fifth level of the outcome (i.e., bare ground) are derived just the same way as in the first model: the predicted probability of the fifth outcome is 1 minus the sum of the probabilities of the four other outcomes.

          Essentially, the ‘main’ effects of trait are all intercepts. The effect of res and its interaction with trait together describe the effect of resolution at the different levels of trait.

          As for the coding advice I gave, I recommended sum coding, not dummy/treatment coding. The code I provided did that. The contrasts are the columns, and by default the last level is the one that everything is compared against (hence the -1s in its row). But for more detail I recommend Maureen's tutorial. It has exercises that are good to work through to get a more solid understanding of the different coding options.
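          For reference, this is what the default sum-coding contrast matrix for a four-level factor looks like (rows are levels, columns are contrasts, and the last level gets -1 in every column):

          contr.sum(4)
          #   [,1] [,2] [,3]
          # 1    1    0    0
          # 2    0    1    0
          # 3    0    0    1
          # 4   -1   -1   -1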

          Finally, you asked how to read the estimates. The estimates are the coefficients in log-odds. Let’s go through an example once we have ascertained that the above code works? (let me know).

          Florian


        Jesse Poulos said:
        August 29, 2013 at 4:31 pm

        Flo,

        For some reason I can't place my reply below your latest post, so I apologize if this post is out of place.

        Just to be clear, my dependent response (Call) has 5 classes (Tree, shrub, herb, bare, other) while my resolution predictor (Res) has 4 classes (10cm, 50cm, 100cm, and NAIP).

        I am somewhat confused now, as I thought your example's response also had 5 levels, yet it appears as if you used k=4 in your prior…

        Here is what my newest code-block is looking like, note, I’ve reorganized your recommended sum coding so 10cm is the reference level:

        # convert treatment coding to sum coding
        res.data.s = res.data
        res.data.s$Res = factor(res.data.s$Res, levels = c("50cm", "1m", "naip", "10cm"))
        contrasts(res.data.s$Res) = contr.sum(4)
        contrasts(res.data.s$Res)

        k <- length(levels(res.data.s$Call)) # to double check; this ended up being 5 as it should
        I <- diag(k-1)
        J <- matrix(rep(1, (k-1)^2), c(k-1, k-1))

        model = MCMCglmm(Call ~ -1 + trait * Res,
                         random = ~ us(trait):Plot,
                         rcov = ~ us(trait):units,
                         prior = list(
                           R = list(fix=1, V=0.5 * (I + J), n = 5),
                           G = list(
                             G1 = list(V = diag(5), n = 5))),
                         burnin = 100,
                         nitt = 500,
                         family = "categorical",
                         data = res.data.s)

        This produces the following error:

        Error in priorformat(if (NOpriorG) { :
        V is the wrong dimension for some prior$G/prior$R elements

        Perhaps this is arising from some confusion about my response having 5 levels and my predictor having 4 levels? Am I missing something here? It only works if my prior is:

        R = list(fix=1, V=0.5 * (I + J), n = 4), #this can also run if it is set as 5, but only if G1 has 5's as 4's
        G = list(
        G1 = list(V = diag(4), n = 4))),

        Thanks again, and I do hope you are enjoying your beer and wine

        JP


          tiflo said:
          August 29, 2013 at 6:06 pm

          Oh, I really shouldn’t drink and write. You’re absolutely right. The example was for a 5-level response variable and you had done everything alright. In order not to confuse others, I revised my original answer to your question above. I hope it helps. Can you post the output of the interaction model as you originally ran it?


      Jesse Poulos said:
      August 29, 2013 at 7:15 pm

      Flo,

      Absolutely. This model is obviously not accurate due to the short burn in etc… but I just wanted to get a quick example output.

      model5frs = MCMCglmm(Call ~ -1 + trait * Res,
                           random = ~ us(trait):Plot,
                           rcov = ~ us(trait):units,
                           prior = list(
                             R = list(fix=1, V=0.5 * (I + J), n = 4),
                             G = list(
                               G1 = list(V = diag(4), n = 4))),
                           burnin = 100,
                           nitt = 500,
                           # thin = 100,
                           family = "categorical",
                           data = res.data.s)

      OUTPUT BELOW:

      Iterations = 101:491
      Thinning interval = 10
      Sample size = 40

      DIC: 26325.25

      G-structure: ~us(trait):Plot

      post.mean l-95% CI u-95% CI eff.samp
      Call.Herbaceous:Call.Herbaceous.Plot 1.3104 0.77742 1.95209 40.00
      Call.Other:Call.Herbaceous.Plot 0.1647 -0.10811 0.36195 80.87
      Call.Shrub:Call.Herbaceous.Plot 0.2836 -0.15647 0.69475 36.40
      Call.Tree:Call.Herbaceous.Plot -1.2054 -2.61764 -0.47034 40.00
      Call.Herbaceous:Call.Other.Plot 0.1647 -0.10811 0.36195 80.87
      Call.Other:Call.Other.Plot 0.2514 0.14707 0.41632 40.00
      Call.Shrub:Call.Other.Plot 0.1941 -0.02443 0.47858 40.00
      Call.Tree:Call.Other.Plot -0.1844 -0.65000 0.06387 40.00
      Call.Herbaceous:Call.Shrub.Plot 0.2836 -0.15647 0.69475 36.40
      Call.Other:Call.Shrub.Plot 0.1941 -0.02443 0.47858 40.00
      Call.Shrub:Call.Shrub.Plot 1.0291 0.55959 1.61985 59.94
      Call.Tree:Call.Shrub.Plot -0.8194 -1.60903 -0.28891 29.60
      Call.Herbaceous:Call.Tree.Plot -1.2054 -2.61764 -0.47034 40.00
      Call.Other:Call.Tree.Plot -0.1844 -0.65000 0.06387 40.00
      Call.Shrub:Call.Tree.Plot -0.8194 -1.60903 -0.28891 29.60
      Call.Tree:Call.Tree.Plot 2.2644 1.09706 4.38641 22.29

      R-structure: ~us(trait):units

      post.mean l-95% CI u-95% CI eff.samp
      Call.Herbaceous:Call.Herbaceous.units 1.0 1.0 1.0 0
      Call.Other:Call.Herbaceous.units 0.5 0.5 0.5 0
      Call.Shrub:Call.Herbaceous.units 0.5 0.5 0.5 0
      Call.Tree:Call.Herbaceous.units 0.5 0.5 0.5 0
      Call.Herbaceous:Call.Other.units 0.5 0.5 0.5 0
      Call.Other:Call.Other.units 1.0 1.0 1.0 0
      Call.Shrub:Call.Other.units 0.5 0.5 0.5 0
      Call.Tree:Call.Other.units 0.5 0.5 0.5 0
      Call.Herbaceous:Call.Shrub.units 0.5 0.5 0.5 0
      Call.Other:Call.Shrub.units 0.5 0.5 0.5 0
      Call.Shrub:Call.Shrub.units 1.0 1.0 1.0 0
      Call.Tree:Call.Shrub.units 0.5 0.5 0.5 0
      Call.Herbaceous:Call.Tree.units 0.5 0.5 0.5 0
      Call.Other:Call.Tree.units 0.5 0.5 0.5 0
      Call.Shrub:Call.Tree.units 0.5 0.5 0.5 0
      Call.Tree:Call.Tree.units 1.0 1.0 1.0 0

      Location effects: Call ~ -1 + trait * Res

      post.mean l-95% CI u-95% CI eff.samp pMCMC
      traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *
      traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55
      traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *
      traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *
      Res1m 0.251940 0.141353 0.350335 5.852 <0.03 *
      Resnaip -0.124915 -0.286551 0.043194 5.987 0.10 .
      Res10cm 0.007258 -0.138939 0.135456 3.257 1.00
      traitCall.Other:Res1m -0.160500 -0.257337 -0.062635 4.673 <0.03 *
      traitCall.Shrub:Res1m -0.382345 -0.469192 -0.284058 11.096 <0.03 *
      traitCall.Tree:Res1m -0.048957 -0.165548 0.064457 10.073 0.35
      traitCall.Other:Resnaip 0.068419 -0.112842 0.193998 5.869 0.40
      traitCall.Shrub:Resnaip 0.290083 0.213862 0.402369 11.026 <0.03 *
      traitCall.Tree:Resnaip 0.492460 0.355131 0.631961 7.151 <0.03 *
      traitCall.Other:Res10cm -0.066982 -0.166105 -0.001008 12.644 0.05 *
      traitCall.Shrub:Res10cm 0.209294 0.114260 0.332900 9.099 <0.03 *
      traitCall.Tree:Res10cm -0.018507 -0.134081 0.101332 5.141 0.75

      Note that right now 50cm is the reference resolution. I will have to fix that later. But anyway, to remind you of my questions. I don't quite see why there are only three within Call (ie. tree) across resolution comparisons for the interaction and I also don't know quite how these would be interpreted in a useful context (ie, we know they are significant, but how do I translate the magnitude to something the lay-person can understand).

      Also,

      How exactly do I specify the random slopes anyway? Instead of random = ~us(trait):Plot, use random = ~us(1+trait):Plot?

      Cheer mate!

      JP


        tiflo said:
        August 30, 2013 at 2:59 am

        Hey JP,

        thanks for posting the model. I tried to explain why you only get three `interaction’ terms in the earlier post that I revised based on your comments. Part of the trick in understanding this is to see that this isn’t really an ordinary interaction (since trait is not like an ordinary factor). First consider the following model with only an ordinary intercept:

        Model 1: Call~ +1

        It's hard to see what Model 1 would mean (the intercept would describe the log-odds of any of the four non-baseline outcomes over the baseline), since we're now trying to predict the probability of all 5 outcomes with one constant. Really, the model that corresponds to an intercept-only model for our five-way multinomial outcome is the following:

        Model 2: Call~ -1 + trait

        In Model 2, there will be four predictors. The fixed-effect output of the model gives you the predicted log-odds (the mean log-odds under the MCMC-fitted posterior distribution of those parameters) for four of the five outcomes, relative to the fifth outcome. The predicted log-odds of the fifth outcome are implicit since they are redundant: they can be derived from knowing the log-odds of the other four outcomes, since the probabilities of all outcomes have to sum to 1 (that's an assumption made by the model, sometimes called exhaustivity). The output of Model 2 will look something like this (the numbers are made up/copied from the output of your larger model):

        traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *
        traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55
        traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *
        traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *

        So the predicted log-odds of a Herbaceous call over a Bare ground call are 2.017779, which corresponds to a probability of about 88.3% (= plogis(2.017779)). The predicted log-odds of a Tree call over a Bare ground call are .863295, or about 70.3%, etc. The course notes by Jarrod Hadfield, chapter 5, contain code to translate these predicted log-odds, which are expressed against the baseline, into probabilities for all possible outcomes.

        Now, let’s turn to the model you first ran, without an `interaction’.

        Model 3: Call~ -1 + trait + Res

        This model will have only three additional parameters for the four levels of res. Something like:


        traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *
        traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55
        traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *
        traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *
        Res1m 0.251940 0.141353 0.350335 5.852 <0.03 *
        Resnaip -0.124915 -0.286551 0.043194 5.987 0.10 .
        Res10cm 0.007258 -0.138939 0.135456 3.257 1.00

        The interpretation of the Res predictors is, however, not what you might first think. Those parameters describe the effect that Res has on the log-odds of any of the four non-baseline outcomes over the baseline outcome. I.e., we're predicting the effect of resolution on Herbaceous, Other, Shrub, and Tree compared to Bare ground calls.

        The model with the `interaction' adds only three parameters because the `main' effects of resolution already capture one of the four contrasts that we need to describe the overall effect of resolution. Actually, this becomes easier to see if, instead of:

        Model 4a: Call~ -1 + trait * Res

        we write:

        Model 4b: Call~ -1 - Res + trait * Res

        The output should look structurally something like this (again, ignore the numbers):


        post.mean l-95% CI u-95% CI eff.samp pMCMC
        traitCall.Herbaceous 2.017779 1.319766 2.631103 3.728 <0.03 *
        traitCall.Other -0.059957 -0.277582 0.167994 40.000 0.55
        traitCall.Shrub 1.985699 1.219209 2.716203 2.025 <0.03 *
        traitCall.Tree 0.863295 0.381788 1.431208 4.532 <0.03 *
        traitCall.Herbaceous:Res1m 0.251940 0.141353 0.350335 5.852 <0.03 *
        traitCall.Other:Res1m -0.160500 -0.257337 -0.062635 4.673 <0.03 *
        traitCall.Shrub:Res1m -0.382345 -0.469192 -0.284058 11.096 <0.03 *
        traitCall.Tree:Res1m -0.048957 -0.165548 0.064457 10.073 0.35
        traitCall.Herbaceous:Resnaip -0.124915 -0.286551 0.043194 5.987 0.10 .
        traitCall.Other:Resnaip 0.068419 -0.112842 0.193998 5.869 0.40
        traitCall.Shrub:Resnaip 0.290083 0.213862 0.402369 11.026 <0.03 *
        traitCall.Tree:Resnaip 0.492460 0.355131 0.631961 7.151 <0.03 *
        traitCall.Herbaceous:Res10cm 0.007258 -0.138939 0.135456 3.257 1.00
        traitCall.Other:Res10cm -0.066982 -0.166105 -0.001008 12.644 0.05 *
        traitCall.Shrub:Res10cm 0.209294 0.114260 0.332900 9.099 <0.03 *
        traitCall.Tree:Res10cm -0.018507 -0.134081 0.101332 5.141 0.75

        So, now the predicted log-odds of, for example, a Shrub call over a Bare ground call are predicted to decrease if the resolution is 1m rather than 50cm (the coefficient is -0.382345). For a resolution of 50cm, the predicted log-odds of a Shrub call over a Bare ground call are 1.985699. For a resolution of 1m, the predicted log-odds of a Shrub call over a Bare ground call are 1.985699 + -0.382345 = 1.603354. So the predicted probability of a Shrub call over a Bare ground call goes from about 87.9% for a resolution of 50cm down to 83.2% for a resolution of 1m. Since we often don’t care about the specific probabilities, it might be sufficient to say that going from a resolution of 50cm to a resolution of 1m decreases the probability of Shrub relative to a Bare ground call.
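        In R, the arithmetic above is just:

        plogis(1.985699)             # Shrub vs. Bare ground at 50cm: ~0.879
        plogis(1.985699 - 0.382345)  # Shrub vs. Bare ground at 1m: ~0.832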

        Btw, the bottom of the following page contains a nice and simple way to visualize your results (once you have extracted the fitted probabilities from the model).

        As for your question about random slopes, unfortunately, I haven’t tried that myself. I couldn’t find direct pointers to this question either (and I have to switch to preparing for a conference). It seems though that what you had in mind is the right thing. Here’s an example for a linear mixed effect model in MCMCglmm with random slopes, which to me suggests that the syntax you suggested should work. Also, have you tried to see what MCMCglmm defaults to when you do not specify the random argument for your model?


          Kyle said:
          April 7, 2014 at 4:42 pm

          You provided a link to http://www.ats.ucla.edu/stat/r/dae/mlogit.htm as a simple way to visualize your results. However, the predict function currently does not work with MCMCglmm… any suggestions on how to visualize data. Can you alter the suggested code so that you are not predicting new data?


          tiflo said:
          April 7, 2014 at 6:12 pm

          Hi Kyle, we're working on a related project that might result in the type of code you're looking for, but I don't think we're quite there yet. You could contact Klinton Bicknell, now at Northwestern, to ask him about this (he's taking the lead on that project). Sorry not to be more helpful.

          Florian


          Kyle said:
          April 8, 2014 at 2:12 pm

          Thanks. I did find a wrapper for the MCMCglmm package called mcmcglmm that enables prediction of new data using the PredictNew command, but I haven’t really tried using it yet.


        Jesse Poulos said:
        August 30, 2013 at 3:07 pm

        Ok, so I noticed that your recommendation of using the formula Call ~ -1 - Res + trait*Res didn't work; instead, it has to be Call ~ -1 + trait*Res - Res in order for the 4th interaction to show up. So I got the following output when doing this:

        post.mean l-95% CI u-95% CI eff.samp pMCMC
        traitCall.Herb 1.95153 1.11930 2.63090 2.241 <0.03 *
        traitCall.Other -0.18640 -0.42445 -0.02673 62.460 0.05 *
        traitCall.Shrub 2.04203 1.14134 2.69823 2.473 <0.03 *
        traitCall.Tree 0.81595 0.33647 1.24168 24.597 <0.03 *
        traitCall.Herb:Res1m 0.30195 0.17190 0.43011 2.139 <0.03 *
        traitCall.Other:Res1m 0.20085 0.09327 0.29599 21.610 <0.03 *
        traitCall.Shrub:Res1m -0.27070 -0.43171 -0.14375 4.281 <0.03 *
        traitCall.Tree:Res1m 0.37537 0.26161 0.48194 23.174 <0.03 *
        traitCall.Herb:Resnaip -0.06315 -0.15464 0.05195 9.795 0.25
        traitCall.Other:Resnaip 0.08218 0.01528 0.20684 13.112 0.05 .
        traitCall.Shrub:Resnaip 0.04392 -0.04036 0.14581 12.443 0.45
        traitCall.Tree:Resnaip 0.51446 0.39610 0.62859 7.038 <0.03 *
        traitCall.Herb:Res50cm 0.13053 0.03043 0.19995 18.936 <0.03 *
        traitCall.Other:Res50cm 0.15048 0.07094 0.21995 18.512 <0.03 *
        traitCall.Shrub:Res50cm -0.04734 -0.20706 0.07930 9.224 0.50
        traitCall.Tree:Res50cm 0.21715 0.11581 0.31486 10.364 <0.03 *

        Suppose now that rather than wanting to know the probability that a technician will classify a point as a tree relative to bare ground at the 50cm scale vs the 10cm scale, I just want to know the absolute probability that a technician will classify a pixel as a tree at the 50cm scale vs 10cm scale (basically, I don't care about the relative probabilities between classes, but instead just between resolutions), or say, probability of classifying as a shrub at 50cm vs 10cm. This is pretty tricky to interpret, isn't it.

        I can't seem to wrap my head around it…


          tiflo said:
          August 30, 2013 at 3:43 pm

          Hey JP,

          great — looks like we’re making progress =)! And thanks for posting the correction (I was surprised that the order matters, but I hadn’t test run the code and it’s perfectly plausible that it matters for the way MCMCglmm parses the formula object). For the interpretation that you’re looking for, I recommend you read the course notes I referred to (http://cran.r-project.org/web/packages/MCMCglmm/vignettes/CourseNotes.pdf, Chapter 5). It actually provides a few lines of code to translate the output of your model into probabilities for each cell of your design and for each level of the response variable. If you get stuck let me know?

          Florian


          kaufnizer said:
          April 7, 2014 at 9:08 pm

          I ended up contacting Jarrod Hadfield to resolve this question of testing one group against the next. I framed it using his example from Pg. 101 of the course notes:

          ME:
          “I was wondering if you could provide some insight on how to find p-values for the example on p101 of your course notes. I am specifically interested in how I might go about testing whether there is a significant difference between the probability of observing polled males vs polled females. I understand how to use the HPDinterval function to obtain credible intervals, but is there a similar function that can obtain p-values for those credible intervals?”

          J. Hadfield:
          “The p-values in summary.MCMCglmm are obtained using:

          2 * pmax(0.5/length(x), pmin(sum(x > 0)/length(x), 1 - sum(x > 0)/length(x)))

          where in your case x would be the difference in probabilities. Its not ideal, but probably gives a fairly reliable answer.”

          So it’s been a while, and I don’t quite remember what’s going on here, but essentially, ‘x’ is a vector containing the differences in probabilities for each iteration. Beyond that, it’s not clear to me what is going on. It seemed to work out though and gave me the test I was looking for.
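          For anyone who wants to try this, here is a minimal sketch of how the quoted formula could be wrapped up (mcmc_pvalue is just an illustrative name; x would be the vector of per-iteration differences in predicted probabilities):

          mcmc_pvalue <- function(x) {
            2 * pmax(0.5 / length(x),
                     pmin(sum(x > 0) / length(x), 1 - sum(x > 0) / length(x)))
          }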


          tiflo said:
          April 8, 2014 at 2:02 pm

          Thanks for posting this. Much appreciated. I’m swamped right now and won’t get to play around with this for a while, but I’m sure others will be happy to see this solution posted.


    Jesse Poulos said:
    August 29, 2013 at 12:41 pm

    Also, there is more documentation and information in the MCMCglmm author's course notes:
    http://cran.r-project.org/web/packages/MCMCglmm/vignettes/CourseNotes.pdf


    Maxim K (@kovla123) said:
    January 25, 2014 at 4:40 am

    I can't figure out how to specify heteroskedasticity at level 1 in MCMCglmm. Basically, I've got individuals within countries, and my research question is how the variance for individuals (after controlling for the country level) changes with my predictor, which is time. I have estimated that model in MLwiN, and I can reproduce it almost completely in MCMCglmm, except for specifying the predictor for the variance at the individual level. Can it be done? Thank you so much in advance.


      tiflo said:
      January 29, 2014 at 10:58 am

      Hi Maxim,

      sorry, I'm swamped at the moment and don't know the answer to your question off the top of my head. I would recommend that you write to the developer of the package (Jarrod Hadfield) or to the r-sig-mixed-models email list (he's one of the people replying there). If you have the time, it'd be great if you could post a summary of the reply here. In the meantime, I approved this post, so that readers of this blog can jump in and throw in their answers.

      Sorry not to be of more help!

