Mixed model’s and Simpson’s paradox
For a paper I am currently working on, I started to think about Simpson’s paradox, which wikipedia succinctly defines as
“a paradox in which a correlation (trend) present in different groups is reversed when the groups are combined. This result is often encountered in social-science […]”
The wikipedia page also gives a nice visual illustration. Here’s my own version of it. The plot shows 15 groups, each with 20 data points. The groups happen to order along the x-axis (“Pseudo distance from origin”) in a way that suggests a negative trend of the Pseudo distance from origin against the outcome (“Pseudo normalized phonological diversity”). However, this trend does not hold within groups. As a matter of fact, in this particular sample, most groups show the opposite of the global trend (10 out of 15 within-group slopes are clearly positive). If this data set is analyzed by an ordinary linear regression (which does not have access to the grouping structure), the result will be a significant negative slope for the Pseudo distance from origin. So, I got curious: what about linear mixed models?
Here, I don’t want to ask how likely the above case is to be an actual instance of Simpson’s paradox (for that, we would need to know that the order of the groups on the x-axis is indeed co-incidental rather than itself being due to a causal effect of Pseudo distance from origin).
So, as a first step, here’s a function that creates data, in which the within-group slope is either the global slope or its negative (plus, in either case, a bit of normal noise).
init = function(ngroup = 10, nitem = 10) { x = 1:27000 spread_x = (max(x) - min(x)) * .2 group_x_start = runif(ngroup, min(x), max(x)) group_x_end = runif(ngroup, apply(cbind(group_x_start + spread_x/2, max(x)), MARGIN = 1, FUN = min), apply(cbind(group_x_start + spread_x, max(x)), MARGIN = 1, FUN = min)) # variances group_sigma = .1 group_x_sigma = .00001 indiv_sigma = .1 # betas alpha = .6 group_alpha = rnorm(ngroup, 0, group_sigma) beta_x = -.00005 # for normal differences # beta_group_x = rnorm(ngroup, 0, group_x_sigma) beta_group_x = 2 * rbinom(ngroup, 1, 0.5) * -beta_x + rnorm(ngroup, 0, group_x_sigma) ## ----------------- # create data set ## ----------------- d = data.frame(Group = rep(1:ngroup, nitem), Item = sort(rep(1:nitem, ngroup))) d$group_x_start = group_x_start[d$Group] d$group_x_end = group_x_end[d$Group] d$x = runif(ngroup * nitem, d$group_x_start, d$group_x_end) # y under assumption linear mixed model d$y = (alpha + group_alpha[d$Group]) + (beta_x + beta_group_x[d$Group]) * d$x + rnorm(ngroup * nitem, 0, indiv_sigma) # y under simpson's paradox d$y_simpson = (alpha + group_alpha[d$Group]) + (beta_x) * d$x + (beta_group_x[d$Group]) * (d$x - d$group_x_start) + rnorm(ngroup * nitem, 0, indiv_sigma) # convery group and item to factor (only AFTER all the above has happened) d$Group = factor(d$Group) d$Item = factor(d$Item) return(d) }
With this function, (a random version of) the above graph is created as follows:
library(ggplot2) dd = init(15, 20) ggplot(dd, aes(y = y_simpson, x = x, color = Group)) + geom_point(alpha = .4, size = 3) + geom_smooth(method = "lm", formula = y ~ x, se = F, size = 1.2 ) + geom_smooth(aes(y = y_simpson, x = x), method = "lm", formula = y ~ x, size = 2, color = "black", linetype = 2 ) + scale_y_continuous("Pseudo normalized phonological diversity") + scale_x_continuous("Pseudo distance from origin") + scale_color_discrete("Pseudo\nlanguage\nfamily") + coord_cartesian(ylim = c(-1.6, 1.2)) + opts(panel.background = theme_blank())
Next, I ran a 2000 simulations each for calls to init(g, i), with all combinations of g=10,20,40,80,160 groups and i=10,20,40,80,160 cases per group. For each simulation run, three models were fit:
- An ordinary linear regression model with only the predictor Pseudo distance from origin
- A linear mixed effect model with the predictor Pseudo distance from origin and a random by-group intercept
- A linear mixed effect model with the predictor Pseudo distance from origin and both a random by-group intercept and a random by-group slope for the predictor Pseudo distance from origin