R code for Jaeger, Graff, Croft and Pontillo (2011): Mixed effect models for genetic and areal dependencies in linguistic typology: Commentary on Atkinson

Posted on Updated on


Below I am sharing the R code for our paper on the serial founder effect:
This paper is a commentary on Atkinson’s 2011 Science article on the serial founder model (see also this interview with ScienceNews, in which parts of our comment in Linguistic Typology and follow-up work are summarized). In the commentary, we provide an introduction to linear mixed effect models for typological research. We discuss how to fit and to evaluate these models, using Atkinson’s data as an example.We illustrate the use of crossed random effects to control for genetic and areal relations between languages. We also introduce a (novel?) way to model areal dependencies based on an exponential decay function over migration distances between languages.
Finally, we discuss limits to the statistical analysis due to data sparseness. In particular, we show that the data available to Atkinson did not contain enough language families with sufficiently many languages to test whether the observed effect holds once random by-family slopes (for the effect) are included in the model. We also present simulations that show that the Type I error rate (false rejections) of the approach taken in Atkinson is many times higher than conventionally accepted (i.e. above .2 when .05 is the conventionally accepted rate of Type errors).
The scripts presented below are not intended to allow full replication of our analyses (they lack annotation and we are not allowed to share the WALS data employed by Atkinson on this site anyway). However, there are many plots and tests in the paper that might be useful for typologists or other users of mixed models. For that reason, I am for now posting the raw code. Please comment below if you have questions and we will try to provide additional annotation for the scripts as needed and as time permits. If you find (parts of the) script(s) useful, please consider citing our article in Linguistic Typology.
# assumes that we are in the right working directory and that there 
# is a subdirectory called /figures.
source("functions.R")
## -------------------------------------------------------------
# load data
## -------------------------------------------------------------
load("EnrichedAtkinsonCorrected.RData")
str(d)
summary(d)

## -------------------------------------------------------------
# Atkinson's ordinary model
## -------------------------------------------------------------
l = lm(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000,
 prepareVars(d))
summary(l)

l = lm(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 + 
 getWeightedArealPhonemeDiversity(d, 40),
 prepareVars(d))
summary(l)

## -------------------------------------------------------------
# Atkinson's lmer model
## -------------------------------------------------------------
# mixed model that Atkinson ran
library(lme4)

l.atkinson = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.atkinson
round(cor(fitted(l.atkinson), l.atkinson@y)^2,3)
pvals.fnc(l.atkinson, nsim= 20000)

## -------------------------------------------------------------
# outlier check
## -------------------------------------------------------------
hist(scale(d$lEstimatedSpeakerPopSize))
hist(scale(d$DistanceFromBestFitOrigin1000))
hist(scale(d$TotalNormalizedPhonemeDiversity))

# one potential outlier for population size
d$slEstimatedSpeakerPopSize = as.numeric(scale(d$lEstimatedSpeakerPopSize))
d[abs(scale(d$lEstimatedSpeakerPopSize)) > 2.5,]$LanguageName
d[abs(scale(d$lEstimatedSpeakerPopSize)) > 2.5,]$slEstimatedSpeakerPopSize

# no outliers for distance
d[abs(scale(d$DistanceFromBestFitOrigin1000)) > 2.5,]$LanguageName

## -------------------------------------------------------------
# examinging residuals
## -------------------------------------------------------------
# outliers in terms of residuals
d[abs(scale(d$TotalNormalizedPhonemeDiversity - fitted(l.atkinson))) > 2.5,]$LanguageName
ggplot(d, aes(x = TotalNormalizedPhonemeDiversity - fitted(l.atkinson))) +
 geom_histogram(aes(y=..density..)) +
 geom_density(size = 2, color = "blue") +
 scale_x_continuous("Residuals") +
 opts(panel.background = theme_blank(), legend.position = "none")
ggsave(file = "figures/histogram-residuals.png", width = 3.5, height = 3.5)

library(Design)
ggplot(d, aes(x = lEstimatedSpeakerPopSize, y = as.numeric(scale(TotalNormalizedPhonemeDiversity - fitted(l.atkinson))))) +
 geom_point() +
 geom_abline(intercept = -2.5, slope=0, linetype=2, color="gray25", size=1.2) +
 geom_abline(intercept = 2.5, slope=0, linetype=2, color="gray25", size=1.2) +
 geom_smooth(method = "lm") +
 geom_smooth(method = "lm", formula = y ~ pol(x, 2), color = "red") +
 geom_smooth(method = "lm", formula = y ~ pol(x, 3), color = "green") +
 scale_x_continuous("(log-transformed) population size") +
 scale_y_continuous("Standardized residuals") +
 opts(panel.background = theme_blank(), legend.position = "none")
ggsave(file = "figures/populationSize-residuals.png", width = 3.5, height = 3.5)
ggplot(d, aes(x = DistanceFromBestFitOrigin, y = as.numeric(scale(TotalNormalizedPhonemeDiversity - fitted(l.atkinson))))) +
 geom_point() +
 geom_abline(intercept = -2.5, slope=0, linetype=2, color="gray25", size=1.2) +
 geom_abline(intercept = 2.5, slope=0, linetype=2, color="gray25", size=1.2) +
 geom_smooth(method = "lm") +
 geom_smooth(method = "lm", formula = y ~ pol(x, 2), color = "red") +
 geom_smooth(method = "lm", formula = y ~ pol(x, 3), color = "green") +
 scale_x_continuous("Distance from origin") +
 scale_y_continuous("Standardized residuals") +
 opts(panel.background = theme_blank(), legend.position = "none")
ggsave(file = "figures/distance-residuals.png", width = 3.5, height = 3.5)

## -------------------------------------------------------------
# examining residuals by group (assessing normality and homoscedasticity)
## -------------------------------------------------------------
library(ggplot2)
dd = prepareVars(d[table(d$Family)[as.character(d$Family)] >= 4,])
l.atkinson.r = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 dd)
dd = dd[order(paste(dd$Continent, dd$Family)),]
ggplot(dd, aes(x = paste(Continent, Family, sep = " : "), y = resid(l.atkinson.r), fill = Continent)) +
 geom_boxplot(size = .5) +
# geom_abline(intercept = 0, slope = 0, color = "blue") +
 scale_x_discrete("Family") +
 scale_y_continuous("Standardized residuals") +
 coord_flip() + 
 opts(panel.background = theme_blank())
ggsave(file = "figures/by-family-residuals.png", width = 5.5, height = 8)
ggplot(d, aes(x = Continent, y = resid(l.atkinson), fill = Continent)) +
 geom_boxplot(size = .5) +
 scale_x_discrete("Continent") +
 scale_y_continuous("Standardized residuals") +
 coord_flip() + 
 opts(panel.background = theme_blank())
ggsave(file = "figures/by-continent-residuals.png", width = 5.5, height = 3)

## -------------------------------------------------------------
# examining random effects
## -------------------------------------------------------------
dotplot(ranef(l.atkinson, postVar=T))
qqmath(ranef(l.atkinson, postVar=T)

ranef(l.atkinson)

## -------------------------------------------------------------
# rerunning analysis without suspicious data points
## -------------------------------------------------------------
# results holds
lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(subset(d, d$LanguageName != "Mandarin" &
 abs(scale(TotalNormalizedPhonemeDiversity - fitted(l.atkinson))) < 2.5 ))
)

## -------------------------------------------------------------
# Illustrate shrinkage
## -------------------------------------------------------------
library(ggplot2)
d$AtkinsonPrediction = fitted(l.atkinson)
d$FamilyMembers = table(d$Family)[as.character(d$Family)]
dd = aggregate(d[c('FamilyMembers','AtkinsonPrediction','DistanceFromBestFitOrigin','EstimatedSpeakerPopSize','TotalNormalizedPhonemeDiversity')], by= list(Family = d$Family), mean)
ggplot(dd, 
 aes(x = DistanceFromBestFitOrigin)) +
 geom_point(aes(y= TotalNormalizedPhonemeDiversity, 
 size = I(FamilyMembers)), 
 alpha=.5, 
 color = "black"
 ) +
 geom_point(aes(y= AtkinsonPrediction), color = "blue", shape = 8) +
 geom_segment(aes(
 xend = DistanceFromBestFitOrigin, 
 y= TotalNormalizedPhonemeDiversity, 
 yend= AtkinsonPrediction), 
 color = "blue"
 ) +
 scale_color_discrete("Family") +
 scale_x_continuous("Distance from origin") +
 scale_y_continuous("Normalized phonological diversity") + 
 scale_size_continuous("Languages\nin family") +
 geom_smooth(aes(x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity), 
 color = "black", linetype = 1, size = 1.5, 
 method = "lm", se=F) +
 geom_smooth(aes(x = DistanceFromBestFitOrigin, 
 y = AtkinsonPrediction), 
 color = "blue", linetype = 1, size = 1.5, 
 method = "lm", se=F) +
 coord_cartesian(ylim = c(-1.3,1.3)) +
 opts(panel.background = theme_blank())
ggsave(file = "figures/shrinkage.png", width = 8, height = 4.5)

## -------------------------------------------------------------
# What about random slopes?
## -------------------------------------------------------------
# full mixed effect model does not converge
# backing off to mixed effect model with reduced random effect structure
library(lme4

# limiting ourselves to language families with at least kmin languages
kmin = 10
dd = prepareVars(d[table(d$Family)[as.character(d$Family)] >= kmin,])
nlevels(as.factor(as.character(dd$Family)))
nlevels(as.factor(as.character(dd$LanguageName)))
l.base = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 dd
)
l.distance = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 + cDistanceFromBestFitOrigin | Family),
 dd, 
 control = list(msVerbose=F, maxIter = 1000, maxFN = 1200)
)
l.population = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 + clEstimatedSpeakerPopSize | Family),
 dd, 
 control = list(msVerbose=F, maxIter = 1000, maxFN = 1200)
)
l.mainOnly = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 + clEstimatedSpeakerPopSize + cDistanceFromBestFitOrigin | Family),
 dd, 
 control = list(msVerbose=F, maxIter = 1000, maxFN = 1200)
)
l.full = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 + clEstimatedSpeakerPopSize * cDistanceFromBestFitOrigin | Family),
 dd, 
 control = list(msVerbose=F, maxIter = 1000, maxFN = 1200)
)
anova(l.base, l.distance, l.mainOnly, l.full)
anova(l.base, l.population, l.mainOnly, l.full)

library(languageR)
pvals.fnc(l.base, nsim= 10000)

kmin = 7
dd = prepareVars(d[table(d$Family)[as.character(d$Family)] >= kmin,])
nlevels(as.factor(as.character(dd$Family)))
nlevels(as.factor(as.character(dd$LanguageName)))
ll.base = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Family),
 dd
)
l.distance = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 + cDistanceFromBestFitOrigin | Family),
 dd
)
l.population = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 + clEstimatedSpeakerPopSize | Family),
 dd
)
l.mainOnly = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 + clEstimatedSpeakerPopSize + cDistanceFromBestFitOrigin | Family),
 dd
)
l.full = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 + clEstimatedSpeakerPopSize * cDistanceFromBestFitOrigin | Family),
 dd
)
anova(l.base, l.distance, l.mainOnly, l.full)
anova(l.base, l.population, l.mainOnly, l.full)

#################################
# assessing partial contributions
#################################
l = lmer(TotalNormalizedPhonemeDiversity ~ 1 + (1 | Genus) + (1 | Subfamily) + (1 | Family), d)
round(cor(fitted(l), l@y)^2,3)

# assessing partial contributions
l = lmer(TotalNormalizedPhonemeDiversity ~ 1 + (1 | Subfamily) + (1 | Family), d)
round(cor(fitted(l), l@y)^2,3)

l = lmer(TotalNormalizedPhonemeDiversity ~ 1 + (1 | Family), d)
round(cor(fitted(l), l@y)^2,3)

l = lmer(NormalizedVowelDiversity ~ 1 + (1 | Genus) + (1 | Subfamily) + (1 | Family), d)
round(cor(fitted(l), l@y)^2,3)

l = lmer(NormalizedConsonantDiversity ~ 1 + (1 | Genus) + (1 | Subfamily) + (1 | Family), d)
round(cor(fitted(l), l@y)^2,3)

l = lmer(NormalizedToneDiversity ~ 1 + (1 | Genus) + (1 | Subfamily) + (1 | Family), d)
round(cor(fitted(l), l@y)^2,3)

l.atkinson = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
round(cor(fitted(l.atkinson), l.atkinson@y)^2,3)

l.atkinson = lmer(NormalizedConsonantDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
round(cor(fitted(l.atkinson), l.atkinson@y)^2,3)

l.atkinson = lmer(NormalizedVowelDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
round(cor(fitted(l.atkinson), l.atkinson@y)^2,3)

l.atkinson = lmer(NormalizedToneDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
round(cor(fitted(l.atkinson), l.atkinson@y)^2,3)

l.country.continent.noGenus = lmer(NormalizedToneDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Continent) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))

## -------------------------------------------------------------
# are all genealogical grouping factors justified?
## -------------------------------------------------------------
library(lme4)

l.atkinson = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.atkinson, l.noGenus)
l.noSubfamily = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin +
 (1 | Family),
 prepareVars(d))
anova(l.atkinson, l.noGenus, l.noSubfamily)

## -------------------------------------------------------------
# assessing areal effects - random effects
## -------------------------------------------------------------
library(lme4)
library(languageR)

l.country.continent = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Continent) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.continent = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Continent) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.country = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.country.continent, l.country)
anova(l.country.continent, l.continent)
anova(l.atkinson, l.continent)
l.noCountry = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.country.continent, l.country, l.noCountry)
l.country.noSubfamily = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Family),
 prepareVars(d))
anova(l.country.continent, l.country, l.country.noSubfamily)

l.country.continent.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Continent) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.continent.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Continent) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.country.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.country.continent.noGenus, l.country.noGenus)
anova(l.country.continent.noGenus, l.continent.noGenus)
anova(l.country.continent, l.country, l.country.noGenus)
anova(l.atkinson, l.country)
anova(l.country.noGenus, l.country.noSubfamily)
anova(l.country.noGenus, l.country.continent)

l.noCountry.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.country.noGenus, l.noCountry.noGenus)

#################################################
# let's look at random slopes for the best models
#################################################
l.slp.country.continent.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 + cDistanceFromBestFitOrigin1000 | Country) +
 (1 + cDistanceFromBestFitOrigin1000 | Continent) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.slp.continent.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 + cDistanceFromBestFitOrigin1000 | Continent) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.slp.country.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 + cDistanceFromBestFitOrigin1000 | Country) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.slp.country.continent.noGenus, l.slp.country.noGenus)
anova(l.slp.country.continent.noGenus, l.slp.continent.noGenus)
anova(l.country.noGenus, l.slp.country.noGenus)
anova(l.continent.noGenus, l.slp.continent.noGenus)

#################################################
# get p-values
#################################################
pvals.fnc(l.country.noGenus, nsim=20000)
pvals.fnc(l.continent.noGenus, nsim=20000)

## -------------------------------------------------------------
# assessing areal effects - spill over
## -------------------------------------------------------------
# without interaction with population size
dev2 = c()
ss2 = c()
f2 = c()
for(s in c(seq(100,600,50),seq(600,660,10),seq(670,720,2.5),seq(730,800,10),seq(800,2500,50),seq(2500,15000,500), seq(15000,20000,1000))) {
#for(s in c(seq(20,100,20),seq(100,180,10),seq(180,250,5),seq(250,400,10),seq(400,2500,50),seq(2500,15000,500), seq(15000,20000,1000))) {
 l = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,s) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d)
 )
 ss2 = append(ss2, s)
 dev2 = as.numeric(append(dev2, summary(l)@deviance["ML"]))
 f2 = append(f2, fixef(l)[4])
}
plot(ss2[1:20], dev2[1:20])
plot(ss2[1:70], dev2[1:70])
plot(ss2, dev2)
plot(ss2, dev2*(f2/abs(f2)))

dd = data.frame(x = ss2[1:85], y = dev2[1:85])
dd[dd$y == min(dd$y),]$x
ggplot(dd,
 aes(x=x, y=y)) +
 geom_line() +
 geom_point() +
 geom_point(aes(x = dd[dd$y == min(dd$y),]$x, 
 y = dd[dd$y == min(dd$y),]$y
 ),
 size = 6,
 color = "red", 
 shape = 1,
 alpha = 1,
 solid = T
 ) +
 scale_x_continuous("Standard deviation of weight function", trans = "log10") +
 scale_y_continuous("Deviance of model") +
 opts(panel.background = theme_blank(), legend.position = "none")
ggsave(file = "figures/areal-fits-overall-normalization.png", width = 4.5, height = 4.5)

d$ArealPhonemeDiversity = getMigrationDistanceWeightedArealPhonemeDiversity(d,240)
d$ContinentWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Continent), mean))[as.character(d$Continent)]
d$CountryWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Country), mean))[as.character(d$Country)]
d$FamilyWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Family), mean))[as.character(d$Family)]
d$SubfamilyWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Subfamily), mean))[as.character(d$Subfamily)]
d$GenusWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Genus), mean))[as.character(d$Genus)]
my.pairs(d[,grep("FamilyWisePhonemeDiversity|SubfamilyWisePhonemeDiversity|GenusWisePhonemeDiversity|ContinentWisePhonemeDiversity|CountryWisePhonemeDiversity|ArealPhonemeDiversity", names(d))],
 labels = c("Family","Subfamily","Genus","Continent","Country",
 "Areal\n(s=685)")
)

# with interaction with population size
dev = c()
ss = c()
for(s in c(seq(10,100,5),seq(100,200,10),seq(200,2500,50),seq(2500,15000,500), seq(15000,20000,1000))) {
l = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 (cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,s)) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
ss = append(ss, s)
dev = append(dev, deviance(l))
}
plot(ss[1:20], dev[1:20])
plot(ss[1:30], dev[1:30])
plot(ss, dev)

## best fitting model
dd = prepareVars(d)
dd$ArealPhonemeDiversityBest = getMigrationDistanceWeightedArealPhonemeDiversity(dd,685)
l.areal = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 dd$ArealPhonemeDiversityBest +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 dd
)
pvals.fnc(l.areal, nsim = 20000)

l.areal = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 dd$ArealPhonemeDiversityBest +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 dd
)
pvals.fnc(l.areal, nsim = 20000)

#############################################
# some additional explorations and plots
#############################################
d$ArealPhonemeDiversity50 = getMigrationDistanceWeightedArealPhonemeDiversity(d,50)
d$ArealPhonemeDiversity100 = getMigrationDistanceWeightedArealPhonemeDiversity(d,100)
d$ArealPhonemeDiversity250 = getMigrationDistanceWeightedArealPhonemeDiversity(d,250)
d$ArealPhonemeDiversity500 = getMigrationDistanceWeightedArealPhonemeDiversity(d,500)
d$ArealPhonemeDiversity1000 = getMigrationDistanceWeightedArealPhonemeDiversity(d,1000)
d$ArealPhonemeDiversity2500 = getMigrationDistanceWeightedArealPhonemeDiversity(d,2500)
d$ArealPhonemeDiversity5000 = getMigrationDistanceWeightedArealPhonemeDiversity(d,5000)
d$ContinentWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Continent), mean))[as.character(d$Continent)]
d$CountryWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Country), mean))[as.character(d$Country)]
d$FamilyWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Family), mean))[as.character(d$Family)]
d$SubfamilyWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Subfamily), mean))[as.character(d$Subfamily)]
d$GenusWisePhonemeDiversity = unlist(lapply(split(d$TotalNormalizedPhonemeDiversity, d$Genus), mean))[as.character(d$Genus)]
my.pairs(d[,grep("FamilyWisePhonemeDiversity|SubfamilyWisePhonemeDiversity|GenusWisePhonemeDiversity|ContinentWisePhonemeDiversity|CountryWisePhonemeDiversity|ArealPhonemeDiversity", names(d))],
 labels = c("Family","Subfamily","Genus","Continent","Country",
 "Areal\n(s=50)", "Areal\n(s=100)", "Areal\n(s=250)", "Areal\n(s=500)", 
 "Areal\n(s=1000)", "Areal\n(s=2500)", "Areal\n(s=5000)"
 )
)

round(cor(d$ArealPhonemeDiversity50, d$TotalNormalizedPhonemeDiversity)^2, 3)
round(cor(d$ArealPhonemeDiversity100, d$TotalNormalizedPhonemeDiversity)^2, 3)
round(cor(d$ArealPhonemeDiversity250, d$TotalNormalizedPhonemeDiversity)^2, 3)
round(cor(d$ArealPhonemeDiversity500, d$TotalNormalizedPhonemeDiversity)^2, 3)
round(cor(d$ArealPhonemeDiversity1000, d$TotalNormalizedPhonemeDiversity)^2, 3)
round(cor(d$ArealPhonemeDiversity2500, d$TotalNormalizedPhonemeDiversity)^2, 3)
round(cor(d$ArealPhonemeDiversity5000, d$TotalNormalizedPhonemeDiversity)^2, 3)

ddd = melt(d, 
 measure.var = grep("ArealPhonemeDiversity", names(d))
)
ggplot(ddd, aes(y = TotalNormalizedPhonemeDiversity, x= value, color= variable)) +
 geom_point(alpha = .4) +
 geom_smooth(size = 1.3, method = "lm") + 
 scale_y_continuous("Normalized phonological diversity") +
 scale_x_continuous("Weighted areal normalized phonological diversity") +
 scale_color_manual("Standard deviation of\ndistance-based weight decay",
 values = c("red","orange","yellow","green","blue","purple","black"),
 breaks = levels(ddd$variable),
 labels = c(50,100,250,500,1000,2500,5000)
 ) +
 opts(panel.background = theme_blank()) 
ggsave(file = "figures/areal-effects.png", width = 8, height = 4.5)

## -------------------------------------------------------------
# assessing areal effects - spill over vs. random effect
## -------------------------------------------------------------
best_s = 685

l.country.continent = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Continent) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.country.continent.areal = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,best_s) +
 (1 | Country) +
 (1 | Continent) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.country.continent.areal, l.country.continent)

l.country = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.country.areal = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,best_s) +
 (1 | Country) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.areal = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,best_s) +
 (1 | Genus) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
l.country.areal.noGenus = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,best_s) +
 (1 | Country) +
 (1 | Subfamily) +
 (1 | Family),
 prepareVars(d))
anova(l.country.areal, l.country)
anova(l.country.areal, l.areal)
anova(l.country.areal, l.country.areal.noGenus)

l.country.areal.noGenus.noSubfamily = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,best_s) +
 (1 | Country) +
 (1 | Family),
 prepareVars(d))
anova(l.country.areal, l.country.areal.noGenus, l.country.areal.noGenus.noSubfamily)

l.country.areal.noGenus.noSubfamily.noFamily = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,best_s) +
 (1 | Country),
 prepareVars(d))
l.country.areal.noGenus.noSubfamily.noCountry = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 getMigrationDistanceWeightedArealPhonemeDiversity(d,best_s) +
 (1 | Family),
 prepareVars(d))
anova(l.country.areal, l.country.areal.noGenus, l.country.areal.noGenus.noSubfamily,l.country.areal.noGenus.noSubfamily.noCountry)
anova(l.country.areal, l.country.areal.noGenus, l.country.areal.noGenus.noSubfamily,l.country.areal.noGenus.noSubfamily.noFamily)

l.country.noAreal.noGenus.noSubfamily = lmer(TotalNormalizedPhonemeDiversity ~ 
 clEstimatedSpeakerPopSize *
 cDistanceFromBestFitOrigin1000 +
 (1 | Country) +
 (1 | Family),
 prepareVars(d))
anova(l.country.areal.noGenus.noSubfamily,l.country.noAreal.noGenus.noSubfamily)

pvals.fnc(l.country.areal.noGenus.noSubfamily, nsim = 20000)

s_best = 685
distanceBasedWeight(100, s_best) 
distanceBasedWeight(500, s_best) / distanceBasedWeight(100, s_best) 
distanceBasedWeight(1000, s_best) /distanceBasedWeight(100, s_best) 
distanceBasedWeight(2500, s_best) /distanceBasedWeight(100, s_best) 

## -------------------------------------------------------------
# illustrate areal fit for example language
## -------------------------------------------------------------
s_best = 685

chosenlang = "Hindi"
langs = as.character(d$ourWALScode)
dd = data.frame(t(d[d$Language == chosenlang,langs]))
names(dd) = c("distance")
dd$weight = distanceBasedWeight(dd$distance, s_best)
dd = subset(dd, weight > 0)
str(dd)

ddd = d
row.names(ddd) = ddd$ourWALScode
dd$AdjustedLongitude = (ddd[row.names(dd),]$Longitude - d[d$Language == chosenlang,]$Longitude) * (dd$distance / max(dd$distance))
dd$AdjustedLatitude = (ddd[row.names(dd),]$Latitude - d[d$Language == chosenlang,]$Latitude) * (dd$distance / max(dd$distance))
dd$AbsoluteContribution = abs(ddd[row.names(dd),]$TotalNormalizedPhonemeDiversity * dd$weight)
dd$Language = ddd[row.names(dd),]$Language
dd$ourWALScode = ddd[row.names(dd),]$ourWALScode
dd = subset(dd, Language != chosenlang)

#size = AbsoluteContribution
ggplot(dd, aes(x=AdjustedLongitude, y=AdjustedLatitude, size = weight )) +
 geom_text(aes(label = Language), alpha = .8) +
 geom_point(aes(x=0,y=0),shape=8,size=6,color="red") +
 scale_x_continuous("Longitudinal migration distance") +
 scale_y_continuous("Latitudinal migration distance") +
 scale_size_continuous("Weight based on\nmigration distance") +
 coord_cartesian(xlim=c(-1,1), ylim=c(-1,1)) +
 opts(panel.background = theme_blank())
ggsave(file = "figures/hindi-weights.png", width = 4.5, height = 4.5)

chosenlang = "Albanian"
langs = as.character(d$ourWALScode)
dd = data.frame(t(d[d$Language == chosenlang,langs]))
names(dd) = c("distance")
dd$weight = distanceBasedWeight(dd$distance, s_best)
dd = subset(dd, weight > 0)
str(dd)

ddd = d
row.names(ddd) = ddd$ourWALScode
dd$AdjustedLongitude = (ddd[row.names(dd),]$Longitude - d[d$Language == chosenlang,]$Longitude) * (dd$distance / max(dd$distance))
dd$AdjustedLatitude = (ddd[row.names(dd),]$Latitude - d[d$Language == chosenlang,]$Latitude) * (dd$distance / max(dd$distance))
dd$AbsoluteContribution = abs(ddd[row.names(dd),]$TotalNormalizedPhonemeDiversity * dd$weight)
dd$Language = ddd[row.names(dd),]$Language
dd$ourWALScode = ddd[row.names(dd),]$ourWALScode
dd = subset(dd, Language != chosenlang)

#size = AbsoluteContribution
ggplot(dd, aes(x=AdjustedLongitude, y=AdjustedLatitude, size = weight )) +
 geom_text(aes(label = Language), alpha = .8) +
 geom_point(aes(x=0,y=0),shape=8,size=6,color="red") +
 scale_x_continuous("Longitudinal migration distance") +
 scale_y_continuous("Latitudinal migration distance") +
 scale_size_continuous("Weight based on\nmigration distance") +
 coord_cartesian(xlim=c(-1,1), ylim=c(-1,1)) +
 opts(panel.background = theme_blank(), legend.position = "none")
ggsave(file = "figures/albanian-weights.png", width = 3.75, height = 4.5)

## -------------------------------------------------------------
# assessing non-linearity
## -------------------------------------------------------------
library(ggplot2)
library(gam)

hist(table(d$Family), breaks = 100)
ggplot(d, 
 aes(
 x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity
 )) +
 geom_point(aes(size = EstimatedSpeakerPopSize), alpha = .5) +
 scale_color_discrete("Family") +
 scale_x_continuous("Distance from origin") +
 scale_y_continuous("Normalized phonological diversity") + 
 scale_size_continuous("Population Size", trans="log10") +
 geom_smooth(aes(x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity), 
 color = "black", linetype = 2, size = 1.5, 
 method = "gam", formula = y ~ ns(x,5)) +
 coord_cartesian(ylim = c(-1.5,2)) +
 opts(panel.background = theme_blank(), legend.position = "none")
ggsave(file = "figures/regression-linear.png", width = 8, height = 4.5)

ggplot(d, 
 aes(
 x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity
 )) +
 geom_point(aes(size = EstimatedSpeakerPopSize), alpha = .5) +
 scale_color_discrete("Family") +
 scale_x_continuous("Distance from origin") +
 scale_y_continuous("Normalized phonological diversity") + 
 scale_size_continuous("Population Size", trans="log10") +
 geom_smooth(aes(x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity), 
 color = "red", linetype = 2, size = 1.3, 
 method = "loess") +
 geom_smooth(aes(x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity), 
 color = "black", linetype = 1, size = 1.3, 
 method = "lm") +
 coord_cartesian(ylim = c(-1.5,2)) +
 opts(panel.background = theme_blank(), legend.position = "none")
ggsave(file = "figures/regression-loess.png", width = 8, height = 4.5)

## -------------------------------------------------------------
# assessing local trends
## -------------------------------------------------------------
ggplot(subset(d, table(d$Family)[as.character(d$Family)] >= 16), 
 aes(
 x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity, 
 color = as.factor(as.character(Family))
 )) +
 geom_point(aes(size = EstimatedSpeakerPopSize), alpha = .5) +
 scale_color_brewer("Family", palette = "Set1") +
 scale_x_continuous("Distance from origin") +
 scale_y_continuous("Normalized phonological diversity") + 
 scale_size_continuous("Population Size", trans="log10") +
 geom_smooth(aes(x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity, 
 color = as.factor(as.character(Family))),
 method = "lm", formula = y ~ x, size=1.5) +
 coord_cartesian(ylim = c(-1.5,2)) +
 opts(panel.background = theme_blank())
ggsave(file = "figures/regression-linear-less.png", width = 8, height = 4.5)

ggplot(subset(d, table(d$Family)[as.character(d$Family)] >= 16), 
 aes(
 x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity, 
 size = EstimatedSpeakerPopSize, 
 color = as.factor(as.character(Family))
 )) +
 geom_point(alpha = .75) +
 scale_color_brewer("Family", palette = "Set1") +
 scale_x_continuous("Distance from origin") +
 scale_y_continuous("Normalized phonological diversity") + 
 scale_size_continuous("Population Size", trans="log10") +
 geom_smooth(size=1.5) +
 geom_smooth(aes(x = DistanceFromBestFitOrigin, 
 y = TotalNormalizedPhonemeDiversity), color = "black", linetype = 2) +
 coord_cartesian(ylim = c(-1.5,2))
ggsave(file = "figures/regression-non-linear-less.png", width = 8, height = 4.5)

## -------------------------------------------------------------
# plotting distribution on map
## -------------------------------------------------------------
library(maps)
ggplot(subset(d, table(d$Family)[as.character(d$Family)] >= 10), 
 aes(
 x = Longitude, 
 y = Latitude
 )) +
 borders("world") +
 geom_point(aes(
 size = EstimatedSpeakerPopSize, 
 color = as.factor(as.character(Family))
 ), 
 alpha = .8) +
 scale_color_brewer("Family", palette = "Set3") +
 scale_size_continuous("Population Size", trans="log10") +
 opts(panel.grid.major = theme_blank(), panel.background = theme_blank(), panel.grid.minor = theme_blank())
ggsave(file = "figures/world.png", width = 8, height = 4.5)

ggplot(subset(d, table(d$Family)[as.character(d$Family)] >= 16), 
 aes(
 x = Longitude, 
 y = Latitude
 )) +
 borders("world") +
 geom_point(aes(
 size = EstimatedSpeakerPopSize, 
 color = as.factor(as.character(Family))
 ), 
 alpha = .75) +
 scale_color_brewer("Family", palette = "Set1") +
 scale_size_continuous("Population Size", trans="log10") +
 opts(panel.grid.major = theme_blank(), panel.background = theme_blank(), panel.grid.minor = theme_blank())
ggsave(file = "figures/world-less.png", width = 8, height = 4.5)

ggplot(subset(d, table(d$Family)[as.character(d$Family)] >= 30), 
 aes(
 x = Longitude, 
 y = Latitude
 )) +
 borders("world") +
 geom_point(aes(
 size = EstimatedSpeakerPopSize, 
 shape = as.factor(as.character(Family)),
 color = TotalNormalizedPhonemeDiversity
 ), 
 alpha = .75) +
 scale_color_gradient("Phonological Complexity") +
 scale_size_continuous("Population Size", trans="log10") +
 scale_shape_discrete("Family")

## -------------------------------------------------------------
# simulation for simpson's paradox
## -------------------------------------------------------------

## -----------------
# set parameters
## -----------------
init = function(ngroup = 10, nitem = 10) {
 x = 1:27000
 spread_x = (max(x) - min(x)) * .2
 group_x_start = runif(ngroup, min(x), max(x))
 group_x_end = runif(ngroup, 
 apply(cbind(group_x_start + spread_x/2, max(x)), MARGIN = 1, FUN = min),
 apply(cbind(group_x_start + spread_x, max(x)), MARGIN = 1, FUN = min))

 # variances
 group_sigma = .1
 group_x_sigma = .00001
 indiv_sigma = .1

 # betas
 alpha = .6
 group_alpha = rnorm(ngroup, 0, group_sigma)
 beta_x = -.00005
 # for normal differences
 # beta_group_x = rnorm(ngroup, 0, group_x_sigma)
 beta_group_x = 2 * rbinom(ngroup, 1, 0.5) * -beta_x + rnorm(ngroup, 0, group_x_sigma)

 ## -----------------
 # create data set
 ## -----------------
 d = data.frame(Group = rep(1:ngroup, nitem), 
 Item = sort(rep(1:nitem, ngroup)))
 d$group_x_start = group_x_start[d$Group]
 d$group_x_end = group_x_end[d$Group]
 d$x = runif(ngroup * nitem, d$group_x_start, d$group_x_end)
 # y under assumption linear mixed model
 d$y = (alpha + group_alpha[d$Group]) + 
 (beta_x + beta_group_x[d$Group]) * d$x + 
 rnorm(ngroup * nitem, 0, indiv_sigma) 
 # y under simpson's paradox
 d$y_simpson = (alpha + group_alpha[d$Group]) + 
 (beta_x) * d$x + 
 (beta_group_x[d$Group]) * (d$x - d$group_x_start) +
 rnorm(ngroup * nitem, 0, indiv_sigma) 

 # convery group and item to factor (only AFTER all the above has happened)
 d$Group = factor(d$Group)
 d$Item = factor(d$Item)

 return(d)
}

## -----------------
# plot
## -----------------
library(ggplot2)
dd = init(15, 20)
ggplot(dd, aes(y = y_simpson, x = x, color = Group)) +
 geom_point(alpha = .4, size = 3) +
 geom_smooth(method = "lm", 
 formula = y ~ x, 
 se = F, 
 size = 1.2
 ) +
 geom_smooth(aes(y = y_simpson, x = x),
 method = "lm", 
 formula = y ~ x, 
 size = 2, 
 color = "black", 
 linetype = 2
 ) +
 scale_y_continuous("Pseudo normalized phonological diversity") +
 scale_x_continuous("Pseudo distance from origin") + 
 scale_color_discrete("Pseudo\nlanguage\nfamily") +
 coord_cartesian(ylim = c(-1.6, 1.2)) +
 opts(panel.background = theme_blank()) 
ggsave(file = "figures/simpsons-paradox.png", width = 8, height = 4.5)

# vectors to store coefficient
l = c()
l.est = c()
l.m.int = c()
l.m.int.est = c()
l.m.slp = c()
l.m.slp.est = c()
k = 2000
steps = c(10,20,40,80,160)
s = data.frame(cbind(1:(k*length(steps)^2), l, l.est, l.m.int, l.m.int.est, l.m.slp, l.m.slp.est))

library(lme4)
j = 0
for (i in 1:k) {
 for (ngroup in steps) { 
 for (nitem in steps) {
 j = j + 1
 dd = init(ngroup, nitem)
 s$ngroup[j] = nlevels(dd$Group)
 s$nitem[j] = nlevels(dd$Item)

 l.1 = lm(y_simpson ~ 1 + x, dd)
 s$l[j] = summary(l.1)$coefficients[2,"Pr(>|t|)"]
 s$l.est[j] = summary(l.1)$coefficients[2,"Estimate"]

 l.2 = lmer(y_simpson ~ 1 + x + (1 | Group), dd)
 s$l.m.int[j] = abs(summary(l.2)@coefs[2,3])
 s$l.m.int.est[j] = as.numeric(fixef(l.2)[2])

 l.3 = lmer(y_simpson ~ 1 + x + (1 + x | Group), dd)
 s$l.m.slp[j] = abs(summary(l.3)@coefs[2,3])
 s$l.m.slp.est[j] =as.numeric(fixef(l.3)[2])
 }
 }
}
summary(s)
save(s, file = "simulation-simpson-paradox.RData", compress=T)

lapply(split(s$l.m.int, paste(s$nitem, s$ngroup)), FUN = function(x) { sum(ifelse(x > 1.96,1,0)) / length(x) } ) 
lapply(split(s$l.m.slp, paste(s$nitem, s$ngroup)), FUN = function(x) { sum(ifelse(x > 1.96,1,0)) / length(x) } ) 

load("simulation-simpson-paradox.RData")
ggplot(s, aes(x = as.factor(ngroup), 
 y = l.m.slp, 
 color = as.factor(nitem)
 )
 ) +
 geom_point(size = 4, 
 alpha = .4, 
 position = position_jitter(width = .2)
 ) +
 stat_summary(fun.y = "mean", 
 geom = "point",
 aes(shape = as.factor(nitem)), 
 color = "black",
 size = 4
 ) + 
 geom_abline(intercept = 1.96, 
 slope = 0, 
 size = 1.2, 
 color = "gray25", 
 linetype = 2
 ) +
 scale_x_discrete("Number of groups") +
 scale_y_continuous("absolute t-value") +
 scale_color_brewer("Number of individual\ndata points per group", palette = "Set1") +
 scale_shape_discrete("Mean by individual\ndata points per group",
 solid = F
 ) +
 opts(panel.background = theme_blank())
ggsave(file = "figures/simpsons-paradox-simulation-lmer-with-slope.png", 
 width = 4, height = 4.5)
last_plot() + aes(y = l.m.int) + opts(legend.position = "none")
ggsave(file = "figures/simpsons-paradox-simulation-lmer-no-slope.png", 
 width = 2.5, height = 4.5)

## -------------------------------------------------------------
# re-analysis of single origin determination
## -------------------------------------------------------------
load("BestOriginFitByModelType.RData")

# plot model fit for each hypothetical origin
library(maps)
library(ggplot2)
dd = subset(all.long, model == "l.dist.pop.int.fam.sub.count")
dd[dd$devDifference == max(dd$devDifference),]
aggregate(dd[,"devDifference"],
 by = list(Continent = dd$Continent), quantile)
ggplot(dd) +
 borders("world") +
 geom_point(aes(
 x = Longitude, 
 y = Latitude,
 color = devDifference
 ),
 alpha = 0.5,
 size = 2.5
 ) +
 geom_point(aes(x = dd[dd$devDifference == max(dd$devDifference),]$Longitude, 
 y = dd[dd$devDifference == max(dd$devDifference),]$Latitude
 ),
 size = 6,
 color = "black", 
 shape = 8,
 alpha = 1,
 solid = T
 ) +
 geom_point(aes(x = c(31,29,105,-170,-77.5), 
 y = c(30,41,11.5,66,8)
 ),
 size = 4,
 color = "black", 
 alpha = 1,
 solid = T
 ) +
 scale_color_gradient("Improvement in\nmodel quality\n(change in deviance)") +
 opts(panel.grid.major = theme_blank(), panel.background = theme_blank(), panel.grid.minor = theme_blank())
ggsave(file = "figures/best-orgin-l.dist.pop.int.fam.sub.count.png", width = 8, height = 4.5)

dd = subset(all.long, model == "l.dist.pop.int.fam.sub.gen")
last_plot() 
ggsave(file = "figures/best-orgin-l.dist.pop.int.fam.sub.gen.png", width = 8, height = 4.5)

dd = subset(all.long, model == "l.dist.pop.int.fam.sub")
last_plot() 
ggsave(file = "figures/best-orgin-l.dist.pop.int.fam.sub.png", width = 8, height = 4.5)

dd = subset(all.long, model == "l.dist")
last_plot() 
ggsave(file = "figures/best-orgin-l.dist.png", width = 8, height = 4.5)

dd = subset(all.long, model == "l.dist.pop")
last_plot() 
ggsave(file = "figures/best-orgin-l.dist.pop.png", width = 8, height = 4.5)

dd = subset(all.long, model == "l.dist.pop.int")
last_plot() 
ggsave(file = "figures/best-orgin-l.dist.pop.int.png", width = 8, height = 4.5)

The above scripts reference additional functions, loaded from functions.R. These are included below.

source("C:\\Users\\tiflo\\Documents\\R\\My R playground\\General Functions\\functions.R")

degreeToRadians = function(x) {
 return(x * pi / 180)
}

getDistance = function(latitude, longitude, d) {
 # calculates distances for all cases in d from the given longitude and latitude
 # d is a dataframe with variables Longitude and Latitude
 # formula for distance on a sphere: 
 # http://www.movable-type.co.uk/scripts/latlong.html

 R = 6371
 deltaLat = d$radiansLatitude - latitude
 deltaLon = d$radiansLongitude - longitude 
 a = sin(deltaLat/2)^2 + cos(latitude) * cos(d$radiansLatitude) * sin(deltaLon/2)^2
 c = 2 * atan2(sqrt(a), sqrt(1-a))

 return(R * c)
}

distanceBasedWeight = function(distance, s) {
# maxdistance = 1000
# w = ifelse(distance <= maxdistance, 1, 0)
 w = 1 / exp(distance^2 / (2*s^2))

 return(w)
}

getWeightedArealPhonemeDiversity = function(d, s = 500) {
 n = nrow(d) - 1

 weightedarealphonemediversity = c()
 sumofcontributinglgsweights = c()
 mindistance = c()
 for (i in 1:nrow(d)) {
 dd = d[-i,]
 distance = getDistance(
 d[i,]$radiansLatitude, 
 d[i,]$radiansLongitude, 
 dd
 )
 # paste(dd$Language, distance)[order(distance)]
 w = distanceBasedWeight(
 distance, 
 s
 )
 # paste(dd$Language, distance, w / sum(w))[order(w)]
 # barplot((w * dd$TotalNormalizedPhonemeDiversity)[order(distance)])
 weightedarealphonemediversity = append(weightedarealphonemediversity, 
 if(sum(w) == 0) { mean(dd$TotalNormalizedPhonemeDiversity) 
 } else { sum(dd$TotalNormalizedPhonemeDiversity * w) / sum(w) }
 ) 
 sumofcontributinglgsweights = append(sum(w), sumofcontributinglgsweights)
 mindistance = append(min(distance), mindistance)
 }

 cat("Sum of weights of contributing lgs assumed to enter areal effects:\n")
 print(summary(sumofcontributinglgsweights))
 # hist(sumofcontributinglgsweights)
 cat("Min distance:\n")
 print(summary(mindistance))

 return(myCenter(weightedarealphonemediversity))
}


getMigrationDistanceWeightedArealPhonemeDiversity = function(d, s = 500) {
 # this function returns a aggegrate measure of the phonological diversity 
 # of the languages surrounding the target language. Each surrounding 
 # language's contribution
 # to the aggregate measure is determine by its distance to the target lg.

 langs = as.character(d$ourWALScode)
 phons = as.numeric(d$TotalNormalizedPhonemeDiversity)

 # we're using distances divided by 1000
 s = s / 1000

 # calculate distance weighted sum of phonological diversity
 # the second set of rows removes the contribution of the 
 # respective language's diversity from it's own weighted 
 # areal diversity measure.
 # the last line normalizes by the sum of the weights.

 w = myCenter(
 (
 apply(d[,langs], FUN = function(x) { sum(distanceBasedWeight(x / 1000, s) * phons) } , MARGIN = 1) -
 (distanceBasedWeight(
 apply(d[,as.character(d$ourWALScode)], FUN = min, MARGIN = 1), 
 s
 ) * d$TotalNormalizedPhonemeDiversity)
 ) / 
# using the first line normalizes by row. that means that different distances means different things for 
# different countries (depending on what their closest neighbors are). Using the second line normalizes 
# by the average sum of weights,
# apply(d[,langs], FUN = function(x) { sum(distanceBasedWeight(x / 1000, s)) } , MARGIN = 1)
 (sum(distanceBasedWeight(d[,langs] / 1000, s)) / nrow(d))
 )

 return(w)
}

prepareVars = function(d) {
 d$clEstimatedSpeakerPopSize = myCenter(d$lEstimatedSpeakerPopSize)
 d$cDistanceFromBestFitOrigin = myCenter(d$DistanceFromBestFitOrigin)
 d$cDistanceFromBestFitOrigin1000 = myCenter(d$DistanceFromBestFitOrigin1000)

 d$Family = factor(as.character(d$Family))
 return(d)
}
Advertisements

Questions? Thoughts?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s