### Is my analysis problematic? A simulation-based example

This post is in reply to a recent question on ling-R-lang by Meredith Tamminga. Meredith was wondering whether an analysis she had in mind for her project was circular, i.e., whether the analysis itself might cause the pattern of results predicted by the hypothesis she was interested in testing. I felt her question (described below in more detail) was an interesting example that might best be answered with some simulations. Reasoning through an analysis can, of course, help a lot in understanding (or better, as in Meredith’s case, anticipating) problems with the interpretation of the results. Not infrequently, however, I find that intuition fails or isn’t sufficiently conclusive. In those cases, **simulations can be a powerful tool in understanding your analysis**. So, I decided to give it a go and use this as an example of how one might approach this type of question.

**Some background**

Meredith’s question is about morpho-phonological priming, specifically about priming of the choice between the two phonological variants of the English gerund morpheme, /in/ and /ing/. Her data set is a relatively large corpus of conversational speech, from which she extracted all examples of -ing and annotated them for their phonological realization (/in/ vs. /ing/). Her question on ling-R-lang was about how to test for effects of a prime’s surprisal on the strength of priming in production (cf. Jaeger & Snider, 2013). In Jaeger & Snider (2013) we found that syntactic primes that were more surprising seem to lead to larger shifts in syntactic preferences in subsequent production (i.e., larger priming effects). The surprisal of a prime is the logarithm of the reciprocal of its probability, i.e., surprisal(prime) = -log p(prime). This, of course, raises the question: surprisal *given what?* In Jaeger and Snider we investigated syntactic priming in the ditransitive constructions (DO: *Tom handed Mary the book* vs. PO: *Tom handed the book to Mary*). We estimated surprisal in two ways:

- based on the verb’s subcategorization bias across speakers, p(structure | verb) independent of speaker, and
- based on the distribution of DO and PO structures in the experiment prior to the prime trial, p(structure | distribution of structures recently experienced by the speaker).
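To make the surprisal formula concrete, here is a quick numerical illustration (the probabilities are made up; I use log base 2, so surprisal is in bits):

```r
# surprisal (in bits) of a prime structure with probability p
surprisal <- function(p) -log2(p)

surprisal(0.5)   # structure as likely as not: 1 bit
surprisal(0.1)   # a rare, surprising structure: ~3.32 bits
surprisal(0.9)   # a highly expected structure: ~0.15 bits
```

The less probable the prime structure given the conditioning context, the higher its surprisal.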

**The question**

Meredith’s question was whether the surprisal that speakers experience while processing a prime is actually conditioned on *their own production preferences*. For this, one could estimate a speaker’s production preference (e.g., overall or depending on the lexical context) for /in/ and /ing/ and then calculate the surprisal for each prime based on these preferences, i.e., -log p(prime structure | structural bias of speaker), where structure is /in/ or /ing/. Let’s call this the *speaker-specific prime surprisal*.

**The problem**

The potential problem with this approach is that both the speaker’s responses in target trials and the speaker-specific prime surprisal depend on the speaker’s structural bias. For example, speakers who produce /in/ and /ing/ with about equal probability will experience higher average surprisal than speakers who produce one of the two structures more often than the other (average surprisal is entropy, and entropy is highest for the uniform distribution). At the same time, it might be easiest to detect priming effects if a speaker’s (overall) structural bias is close to 50/50 (since the effect needs to be estimated from finite data, priming effects will be harder to detect if the structural bias is really close to 0 or 1). It’s this type of dependence that Meredith was worried about.
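The entropy point can be illustrated with two (made-up) speaker biases — a 50/50 speaker experiences the maximal 1 bit of average surprisal per production, while a strongly biased speaker experiences much less:

```r
# average surprisal (= entropy, in bits) of the binary /in/~/ing/ choice,
# as a function of a speaker's bias p towards /in/
entropy <- function(p) -p * log2(p) - (1 - p) * log2(1 - p)

entropy(0.5)   # 1 bit: the uniform bias maximizes average surprisal
entropy(0.9)   # ~0.47 bits: a strongly biased speaker
```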

**Using simulation to address the problem**

One way to address this question is to construct simulated data sets in which the true effect of speaker-specific prime surprisal is 0, i.e. where there is no effect of speaker-specific prime surprisal. I’ll first demonstrate this for a typical balanced experiment and then try to extend it to a scenario that more closely resembles corpus-based analysis. I’ll be using code that I adapted from Dave Kleinschmidt’s plyr tutorial (as part of our class at the 2013 LSA Summer Institute).

Let’s start by writing a function that generates a function that generates simulated data sets. For simplicity’s sake we’ll assume that there’s no data loss and that we only have to care about repeated measures over subjects, not items.

```
library(plyr)
library(mvtnorm)
library(lme4)

## ---------------------------------------------------------
#
# Functions
#
## ---------------------------------------------------------
myC = function(x) { x - mean(x, na.rm=T) }

make.balanced.data.generator <- function(true.effects=c(1,1),
                                         ranef.var=diag(c(1,1)),
                                         n.subj=24,
                                         n.obs=48) {
  # this data generator assumes prime trials that are balanced and hence
  # independent of the speaker's own production bias.
  # create the design matrix for our made-up experiment
  data.str <- data.frame(nprime = c(rep(1, n.obs/2), rep(0, n.obs/2)),
                         prime = factor(c(rep('in', n.obs/2), rep('ing', n.obs/2)),
                                        levels = c("in", "ing")))
  data.str$cnprime = myC(data.str$nprime)
  model.mat <- model.matrix(~ 1 + cnprime, data.str)

  generate.balanced.data <- function() {
    # sample one data set: draw by-subject coefficients, then binomial responses
    simulated.data <- rdply(n.subj, {
      beta <- t(rmvnorm(n=1, sigma=ranef.var)) + true.effects
      logodds <- model.mat %*% beta
      data.frame(data.str,
                 target = sapply(X = plogis(logodds),
                                 FUN = function(x) { rbinom(1, 1, x) }))
    })
    names(simulated.data)[1] <- 'subject'
    simulated.data
  }
  # return the generator function
  generate.balanced.data
}
```

Note that this data generator assumes that the structure of prime trials (i.e., whether primes are /in/ or /ing/) is independently controlled by the experimenter (e.g., by forcing speakers to say the word one way or the other or by exposing speakers to those primes). That is, here the speaker’s structural bias does *not* determine the distribution of prime trials. We can use this data generator to create lots of instances of our simulated experiment. Here’s a call to the function to create 16 instances of the experiment:

```
# the first effect is the intercept; the second effect is the (sum-coded)
# effect of priming (i.e., this is half the actual effect). i.e. we're
# assuming a positive priming effect of 1 log-odds, but no effect of
# speaker-specific prime surprisal.
true.effects = c(.25, 1)
# this is the variance-covariance matrix of the random effects. Here we're
# only considering random effects by-subject. The first value is the variance
# of the random by-subject intercepts and the second is the variance of the
# by-subject slopes of the priming effect. That is, we're assuming no covariance
# between these two random effects.
ranef.var = diag(c(1, .5))
# the number of subjects and the number of observations per subject
n.subj = 24
n.obs = 48

## ---------------------------------------------------------
#
# balanced data
#
## ---------------------------------------------------------
gen.dat <- make.balanced.data.generator(true.effects=true.effects,
                                        ranef.var=ranef.var,
                                        n.subj=n.subj,
                                        n.obs=n.obs)
# this creates 16 instances of the same experiment with 24 subjects
# each, who each contribute 48 data points. The data is stored in a
# data.frame
d = rdply(.n=16,
          gen.dat(),
          .progress='text')
```

We can plot the results of these 16 experiments, showing the estimated by-subject average proportions of /in/ and /ing/ for each of the two prime conditions (points). The result is Figure 1 above. The bars show the averages and the error bars show bootstrapped confidence intervals. The point of this graph is that our experiments robustly reveal the underlying prime effect (which was built into the simulation).
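For completeness, here is one way such a plot could be produced (this is a sketch, not the exact code behind Figure 1; it assumes the data frame `d` from above and uses `ggplot2`, with `mean_cl_boot` — which requires the `Hmisc` package — for the bootstrapped intervals):

```r
library(ggplot2)

# by-subject proportion of target = 1 per experiment (.n) and prime condition
props <- ddply(d, .(.n, subject, prime), summarise, p.target = mean(target))

ggplot(props, aes(x = prime, y = p.target)) +
  stat_summary(fun.y = mean, geom = "bar", alpha = .5) +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = .2) +
  geom_point(position = position_jitter(width = .1), alpha = .3) +
  facet_wrap(~ .n)
```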

Now let’s see what happens when we analyze this data, *which was generated with (a) a very small intercept effect, (b) a decently sized positive priming effect, and (c) without an effect of speaker-specific prime surprisal*. One secondary goal is to recover the underlying prime effect and, less importantly, the underlying intercept effect. To the extent that we fail to do so, those are Type II errors. Our primary goal, however, is to obtain **no** significance for speaker-specific prime surprisal or its interaction with the prime structure. As in Jaeger and Snider (2013), we code surprisal as the surprisal of the prime structure, i.e., if the prime is an /in/, its surprisal is going to be -log p(/in/ | speaker’s structural bias), where we estimate p(/in/ | speaker’s structural bias) as the (empirical-logit smoothed) proportion of /in/ responses by the speaker throughout the experiment. That is, our analysis is a logit model with the following formula:

- TargetStructure ~ 1 + PrimeStructure * SpeakerSpecificPrimeSurprisal

While we’re at it, I’m using this opportunity to compare three possible models:

- An ordinary logistic regression
- A mixed logit model with only a random by-speaker intercept
- A mixed logit model with random by-speaker intercept and slope for the effect of prime structure.

The third model has the random effect structure recommended by Barr et al. (2013). A fourth model, with truly maximal random effect structure, would probably perform even better (it’s more conservative), but I omit it here since it will often not converge and because it would considerably slow down the simulations. **The first two models are definitely not recommended and are only included to make that point.**

```
fit.priming_models <- function(simulated.data) {
  # add speaker-specific prime surprisal to the data frame
  simulated.data = ddply(.data = simulated.data,
                         .(subject),
                         transform,
                         mSpeakerSurprisal = mean(ifelse(prime == "in",
                                                         -log2(mean(target)),
                                                         -log2(1 - mean(target)))))
  # center
  simulated.data$cmSpeakerSurprisal = myC(simulated.data$mSpeakerSurprisal)
  # the models
  glm.coefs <- coefficients(summary(
    glm(target ~ 1 + cnprime*cmSpeakerSurprisal,
        simulated.data, family=binomial)))[, 1:4]
  rand.int.coefs <- coef(summary(
    glmer(target ~ 1 + cnprime*cmSpeakerSurprisal + (1|subject),
          simulated.data, family=binomial)))
  rand.slope.coefs <- coef(summary(
    glmer(target ~ 1 + cnprime*cmSpeakerSurprisal + (1+prime|subject),
          simulated.data, family=binomial)))
  # format output all pretty
  rbind(data.frame(model='glm', predictor=rownames(glm.coefs), glm.coefs),
        data.frame(model='rand.int', predictor=rownames(rand.int.coefs), rand.int.coefs),
        data.frame(model='rand.slope', predictor=rownames(rand.slope.coefs), rand.slope.coefs))
}
```

We can call this analysis function on randomly generated data sets (like those above), do that, e.g., 1,000 times, and then look at the distribution of z-values. **Note that this will take a long time to run.** Below I show how to run 100 iterations of the simulations, which took about 1-2 minutes. The vertical dashed lines indicate abs(z) = 1.96. Significant effects fall outside of those boundaries. Each row shows one of the four parameters of the model (see formula above). I discuss the four findings below the plot.
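Concretely, the simulation loop amounts to something like the following (assuming `gen.dat` and `fit.priming_models` from above; note that `data.frame` converts the "z value" column of the coefficient tables to `z.value`):

```r
# run the analysis on 100 freshly generated data sets (this is the slow part)
sim.results <- rdply(.n = 100,
                     fit.priming_models(gen.dat()),
                     .progress = 'text')

# plot the distribution of z-values by predictor (rows) and model (columns)
library(ggplot2)
ggplot(sim.results, aes(x = z.value)) +
  geom_histogram() +
  geom_vline(xintercept = c(-1.96, 1.96), linetype = 'dashed') +
  facet_grid(predictor ~ model)
```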

As we can see, the models recover the priming effect (third row). Unsurprisingly, the ordinary glm and the mixed logit regression with only random by-speaker intercepts find this effect to be on average more significant than the mixed logit regression with both random by-speaker intercepts and slopes (as we know from other simulations, that’s because the former two models are anti-conservative with regard to the priming effect).

Second, we see that the mixed models struggle somewhat to detect the fact that the intercept is slightly above zero (the experiment was set up with a very weak overall bias towards the /in/ structure). Note that the glm does better only because it’s anti-conservative with regard to the intercept effect.

Third, the models unfortunately also detect a positive main effect of speaker-specific prime surprisal (worst for the glm, but present in all models). This is presumably due to the circularity of predicting speakers’ responses from a transformation of their average responses. The speaker-specific prime surprisal is neither a linear nor a monotonic function of speakers’ responses, but it is related to speakers’ responses. As soon as the /in/ vs. /ing/ distribution isn’t perfectly balanced in the target responses, these two quantities will be correlated, causing the increased error rate observed above. (It would be misleading to call this a Type I error, since speaker-specific prime surprisal *is* correlated with the speaker’s responses; nor is this error a shortcoming of mixed models: it’s a confound due to the way we estimated the speaker-specific prime surprisal. So I’ll refer to it as ‘spurious significance’ for lack of a better term.)

Fourth, we also see a mild tendency towards a significant interaction of speaker-specific prime surprisal and prime structure. This interaction was significant in 12% of 100 simulations in the mixed logit model with random by-speaker intercepts and slopes. **This is the spurious significance that Meredith was concerned about.** Note that its precise strength will likely depend on the across-speaker mean of speakers’ structural biases.
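For the record, rates like the 12% figure can be computed directly from the stacked coefficient tables. Assuming the tables from many simulation runs are collected in a data frame `sim.results` (one row per run × model × predictor, with a `z.value` column, as produced by `fit.priming_models`):

```r
# proportion of simulation runs in which each predictor reached |z| > 1.96,
# broken down by model
ddply(sim.results, .(model, predictor), summarise,
      prop.significant = mean(abs(z.value) > 1.96))
```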

**Solutions?**

One way forward would be to estimate the *expected* rate of spurious effects to assess how probable the actually observed effect is. We can use the above distribution of z-values to conduct such a test (see Jaeger et al., 2012 for details).

It is also possible that, for the specific constellation of data we have, the rate of spurious effects is actually smaller than estimated here (also keep in mind that I only used 100 iterations of the simulation, where something on the order of 1,000 strikes me as the minimum necessary for a robust simulation of this type). We could also further improve the simulations by estimating the intercepts and prime effects from our own data.

To get anyone who’s interested in this started, here’s a data generator that creates more corpus-like ‘experiments’. This generator draws the prime and target trials from the same underlying distribution, based on the speaker’s structural bias. The simulation still assumes, however, that the ‘prime’ trials aren’t themselves primed by preceding instances. In that sense, this simulation is most comparable to a scenario where we have randomly extracted prime-target pairs out of a larger corpus of a speaker’s utterances, and where the prime-target pairs are sufficiently far away from each other to assume independence between the primes of different prime-target pairs, conditional on the speaker.

```
make.corpus.data.generator <- function(true.effects=c(1,1),
                                       ranef.var=diag(c(1,1)),
                                       n.subj=24,
                                       n.obs=48) {
  # this generator samples the prime trials from the same distribution as the target
  # trials (but it still assumes that prime-target trials are independently sampled,
  # conditional on the speaker; i.e., the prime trials are assumed to be themselves
  # *not* primed by previous trials).
  generate.corpus.data <- function() {
    # sample data with a random number of subjects drawn from a Poisson
    # distribution with approximately the mean specified above (~n.subj);
    # the +1 avoids data sets with zero subjects
    simulated.data <- rdply(rpois(1, n.subj - 1) + 1, {
      beta <- t(rmvnorm(n=1, sigma=ranef.var)) + true.effects
      data.str <- data.frame(nprime = rbinom(n.obs, 1, plogis(beta[1,])))
      data.str$prime = factor(ifelse(data.str$nprime == 1, "in", "ing"),
                              levels = c("in", "ing"))
      data.str$cnprime = myC(data.str$nprime)
      model.mat <- model.matrix(~ 1 + cnprime, data.str)
      logodds <- model.mat %*% beta
      data.frame(data.str,
                 target = sapply(X = plogis(logodds),
                                 FUN = function(x) { rbinom(1, 1, x) }))
    })
    names(simulated.data)[1] <- 'subject'
    simulated.data
  }
  # return the generator function
  generate.corpus.data
}
```

For data like these, we could estimate a speaker’s structural bias (and hence the speaker-specific prime surprisal) from independent parts of the corpus. This might further lower the dependence between the estimate of a speaker’s structural bias and the effect of speaker-specific prime surprisal during priming. For example, we could divide the data into two halves, estimating the speaker’s structural bias from one half and fitting the priming effect on the other. The above data generator can be used to test the consequences of such strategies.
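Here is a minimal sketch of the split-half idea, using the column names from the generators above. The function name `split.half.fit` and the odd/even split are my own ad-hoc choices, and in real data one would want to smooth the bias estimate (e.g., with an empirical-logit correction) to avoid infinite surprisal when one half contains only one variant:

```r
split.half.fit <- function(simulated.data) {
  # number the trials within each speaker
  simulated.data <- ddply(simulated.data, .(subject), transform,
                          trial = seq_along(target))
  # estimate each speaker's structural bias from the odd-numbered trials
  bias <- ddply(subset(simulated.data, trial %% 2 == 1), .(subject),
                summarise, p.in = mean(target))
  # compute speaker-specific prime surprisal for the held-out (even) trials
  test.half <- merge(subset(simulated.data, trial %% 2 == 0), bias)
  test.half$SpeakerSurprisal <- with(test.half,
      ifelse(prime == "in", -log2(p.in), -log2(1 - p.in)))
  test.half$cSpeakerSurprisal <- myC(test.half$SpeakerSurprisal)
  # fit the priming model on the held-out half only
  glmer(target ~ 1 + cnprime * cSpeakerSurprisal + (1 + prime | subject),
        test.half, family = binomial)
}
```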

Anyway, I hope some of this code is useful in developing your own simulations. Questions, comments, and corrections are welcome. Follow-up simulations using this or other code are particularly welcome. And sorry for the hacky way of posting R code; I should have used knitr.

**References**

- Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. *Journal of Memory and Language*, 68(3), 255–278. doi:10.1016/j.jml.2012.11.001
- Jaeger, T. F., Pontillo, D., & Graff, P. (2012). Comment on “Phonemic diversity supports a serial founder effect model of language expansion from Africa”. *Science*, 335(6072), 1042. doi:10.1126/science.1215107
- Jaeger, T. F., & Snider, N. E. (2013). Alignment as a consequence of expectation adaptation: Syntactic priming is affected by the prime’s prediction error given both prior and recent experience. *Cognition*, 127(1), 57–83. doi:10.1016/j.cognition.2012.10.013