Mixed models and Simpson’s paradox

For a paper I am currently working on, I started to think about Simpson’s paradox, which Wikipedia succinctly defines as

“a paradox in which a correlation (trend) present in different groups is reversed when the groups are combined. This result is often encountered in social-science […]”

The Wikipedia page also gives a nice visual illustration. Here’s my own version of it. The plot shows 15 groups, each with 20 data points. The groups happen to be ordered along the x-axis (“Pseudo distance from origin”) in a way that suggests a negative trend of the Pseudo distance from origin on the outcome (“Pseudo normalized phonological diversity”). However, this trend does not hold within groups. As a matter of fact, in this particular sample, most groups show the opposite of the global trend (10 out of 15 within-group slopes are clearly positive). If this data set is analyzed with an ordinary linear regression (which does not have access to the grouping structure), the result is a significant negative slope for the Pseudo distance from origin. So I got curious: what about linear mixed models?

Here, I don’t want to ask how likely the above case is to be an actual instance of Simpson’s paradox (for that, we would need to know whether the order of the groups on the x-axis is indeed coincidental rather than itself being due to a causal effect of the Pseudo distance from origin). Instead, I want to know how linear mixed models behave when fit to data like this.

So, as a first step, here’s a function that creates data in which the within-group slope is either the global slope or its negative (plus, in either case, a bit of normal noise).

init = function(ngroup = 10, nitem = 10) {
	x = 1:27000
	spread_x = (max(x) - min(x)) * .2
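	# each group occupies a random window on the x-axis, at most 20% of the full range wide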
	group_x_start = runif(ngroup, min(x), max(x))
	group_x_end = runif(ngroup, 
		pmin(group_x_start + spread_x/2, max(x)),
		pmin(group_x_start + spread_x, max(x)))

	# variances
	group_sigma = .1
	group_x_sigma = .00001
	indiv_sigma = .1

	# betas
	alpha = .6
	group_alpha = rnorm(ngroup, 0, group_sigma)
	beta_x = -.00005
	# for normal differences
	# beta_group_x = rnorm(ngroup, 0, group_x_sigma)
	beta_group_x = 2 * rbinom(ngroup, 1, 0.5) * -beta_x + rnorm(ngroup, 0, group_x_sigma)
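	# => each group's total slope (beta_x + beta_group_x) is approximately beta_x or -beta_x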

	## -----------------
	# create data set
	## -----------------
	d = data.frame(Group = rep(1:ngroup, nitem), 
		Item = rep(1:nitem, each = ngroup))
	d$group_x_start = group_x_start[d$Group]
	d$group_x_end = group_x_end[d$Group]
	d$x = runif(ngroup * nitem, d$group_x_start, d$group_x_end)
	# y under assumption linear mixed model
	d$y = (alpha + group_alpha[d$Group]) + 
		(beta_x + beta_group_x[d$Group]) * d$x  + 
		rnorm(ngroup * nitem, 0, indiv_sigma) 
	# y under simpson's paradox
	d$y_simpson = (alpha + group_alpha[d$Group]) + 
		(beta_x) * d$x  + 
		(beta_group_x[d$Group]) * (d$x - d$group_x_start) +
		rnorm(ngroup * nitem, 0, indiv_sigma) 

	# convert Group and Item to factor (only AFTER all the indexing above has happened)
	d$Group = factor(d$Group)
	d$Item = factor(d$Item)

	return(d)
}

With this function, (a random version of) the above graph is created as follows:

library(ggplot2)
dd = init(15, 20)
ggplot(dd, aes(y = y_simpson, x = x, color = Group)) +
	geom_point(alpha = .4, size = 3) +
	geom_smooth(method = "lm", 
		formula = y ~ x, 
		se = FALSE, 
		size = 1.2
	) +
	geom_smooth(aes(group = 1),  # one global fit across all groups
		method = "lm", 
		formula = y ~ x, 
		size = 2, 
		color = "black", 
		linetype = 2
	) +
	scale_y_continuous("Pseudo normalized phonological diversity") +
	scale_x_continuous("Pseudo distance from origin") +	
	scale_color_discrete("Pseudo\nlanguage\nfamily") +
	coord_cartesian(ylim = c(-1.6, 1.2)) +
	theme(panel.background = element_blank())
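
To check the global trend numerically, the plain regression mentioned above can be fit directly (the exact numbers vary from draw to draw):

# global fit ignoring the grouping: the slope estimate for x should be
# clearly negative with a large |t| value (exact values vary per random draw)
coef(summary(lm(y_simpson ~ x, data = dd)))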

Next, I ran 2000 simulations for each combination of g = 10, 20, 40, 80, 160 groups and i = 10, 20, 40, 80, 160 cases per group, i.e., for each call to init(g, i). For each simulation run, three models were fit (sketched in code after the list):

  1. An ordinary linear regression model with only the predictor Pseudo distance from origin
  2. A linear mixed effect model with the predictor Pseudo distance from origin and a random by-group intercept
  3. A linear mixed effect model with the predictor Pseudo distance from origin and both a random by-group intercept and a random by-group slope for the predictor Pseudo distance from origin
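
In R, these three models can be fit as follows (a minimal sketch; I’m assuming lme4’s lmer for the mixed models):

library(lme4)

d = init(ngroup = 40, nitem = 40)

# 1. ordinary linear regression, blind to the grouping structure
m1 = lm(y_simpson ~ x, data = d)

# 2. mixed model with a random by-group intercept
m2 = lmer(y_simpson ~ x + (1 | Group), data = d)

# 3. mixed model with a random by-group intercept and a random by-group slope
m3 = lmer(y_simpson ~ x + (1 + x | Group), data = d)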
The first model indeed always returned significance for the predictor Pseudo distance from origin. The linear mixed model with only a random intercept is considerably less likely to return significance, but still does so in 60-98% of all cases. The last model, which also includes a random slope term, behaves differently, returning significance in only 3-7% of the 2000 runs for many combinations of number of groups and group size. However, once the number of groups becomes much larger than the number of cases per group, this model, too, finds significance in more than 70% of all runs.

The left figure below shows the simulation results for model 2, the right figure shows the results for model 3. Each colored point represents one run of the simulation and shows the t-value for the predictor (the pseudo distance effect in Figure 2) in the respective model. The non-solid shapes indicate the mean t-value across all 2000 mixed models fit for each combination of number of groups and number of individual data points per group. The dashed line indicates t = 1.96, above which a t-test would be significant.
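
For reference, here is a condensed sketch of one cell of this simulation grid, using a hypothetical helper sim_signif (nsim is reduced for speed; the actual runs used 2000 repetitions, and I take |t| > 1.96 as the two-sided significance criterion):

library(lme4)

# share of runs in which the x effect comes out "significant" (|t| > 1.96)
# for one combination of group count and group size; with x on the 1:27000
# scale, occasional convergence warnings are to be expected
sim_signif = function(ngroup, nitem, nsim = 200) {
	t_vals = replicate(nsim, {
		d = init(ngroup, nitem)
		m = lmer(y_simpson ~ x + (1 + x | Group), data = d)
		coef(summary(m))["x", "t value"]
	})
	mean(abs(t_vals) > 1.96)
}

sim_signif(ngroup = 160, nitem = 10)  # many groups, few cases per group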
I’d be curious to hear what other people think about this.

Questions? Thoughts?