# Workshops on regression at the LSA 2009 Summer Institute

In case there is interest, we will hold two statistical tutorials on logistic regression and mixed models during the LSA 2009 Summer Institute at Berkeley, CA. Please read about the workshops below and, if you’re interested, follow the registration links. Space is limited, so please register only if you actually plan to attend. That said, please spread the word.

The meetings will be held in Dwinelle 370. Note that only the front doors of Dwinelle will be open after 5:30pm.

## Introduction to Logistic Regression in R (with case studies on the phonological organization of the mental lexicon)

Slides – Part 1

T. Florian Jaeger (Brain and Cognitive Sciences, Computer Science, University of Rochester)
Peter Graff (Linguistics, MIT)
Prerequisites: Familiarity with basic concepts of statistics (probabilities, statistical inference, significance testing) and at least a very basic understanding of linear regression (e.g. because you have read Ch. 6.1-6.2 of Baayen, 2008, or another introductory chapter on regression). We will assume very little background. If you are not sure whether it makes sense for you to attend, please contact Peter Graff at graff@mit.edu.

Where: Dwinelle 370

When: Wednesday, 7/8/09, 7-9pm (with break)

Goal: Logistic regression fitted by maximum likelihood provides a statistical framework for the theory-driven investigation of multiple effects on a binary outcome (such as whether a subject answers a question correctly; whether a speaker chooses a passive or an active realization of a transitive event; or whether a theoretically possible phonological word exists in the actual lexicon of a given language). If handled appropriately, logistic regression can simultaneously assess the partial effects of multiple input variables (aka independent variables, predictors) on this outcome. Model comparison between different logistic regression models provides a systematic way to compare hypotheses (theories) given their data coverage and given the theories’ complexity (cf. Occam’s razor). Logistic regression also provides ways to compare effect sizes to assess the relative contributions of multiple effects to the overall data pattern.
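In symbols (the notation here is chosen for illustration and is not taken from the workshop slides), a logistic regression with $k$ predictors models the log-odds of the outcome as a linear function of the predictors, and its coefficients are chosen to maximize the log-likelihood $\ell$:

$$
\log\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad p_i = P(y_i = 1 \mid x_{i1}, \dots, x_{ik})
$$

$$
\ell(\beta) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
$$

Two nested models with maximized log-likelihoods $\ell_0$ (simpler) and $\ell_1$ (more complex) can then be compared with the likelihood-ratio statistic $G^2$, which is asymptotically $\chi^2$-distributed under the simpler model, with degrees of freedom equal to the number of additional parameters. AIC is one common alternative that penalizes the number of estimated parameters $m$ directly:

$$
G^2 = 2\,(\ell_1 - \ell_0), \qquad \mathrm{AIC} = 2m - 2\ell
$$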

We start with a quick conceptual introduction to logistic regression (multilevel or mixed models will not be covered in this lecture, though many of the steps involved in fitting ordinary and multilevel logistic models are identical). We then illustrate how to import data into the statistical software R and how to analyze it using logistic regression. The data we use as a case study consist of several annotated theoretical lexica (including Dutch, Javanese, and Aymara). We assume that there is a set of theoretically possible words determined by the consonant inventory of a language. We use logistic regression to assign a probability to a possible word’s attestation in the language. In particular, we ask whether words that contain multiple similar consonants are dispreferred across these languages, as expected if similarity avoidance (OCP, Leben, 1973) shapes the mental lexicon. We use the properties of logistic regression to ask whether the use of phonological features to model such effects is justified (despite the fact that they increase the complexity of the theory), whether features differ in the strength of the associated OCP effect (e.g. for perceptual reasons related to word processing), whether any regularities emerge across languages, and whether similarity avoidance should be considered a cumulative constraint (as expected if the effect is due to interference during processing). We provide the necessary linguistic background to appreciate these questions, but the focus will lie on the use of logistic regression to compare different models in a non-arbitrary fashion.
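The kind of analysis described above can be sketched in a few lines. The workshop itself uses R; the following is only a minimal Python illustration under stated assumptions. The counts are made up (they are not the workshop lexica), the single predictor (number of similar consonant pairs in a possible word) is a hypothetical stand-in for the annotated features, and the fitting routine is a plain Newton-Raphson loop rather than any particular library's implementation.

```python
import math

# Made-up toy data (NOT the workshop lexica): for each hypothetical possible
# word, x = number of similar consonant pairs, y = 1 if the word is attested.
counts = {0: (8, 10), 1: (5, 10), 2: (2, 10)}  # x -> (attested, total)
xs, ys = [], []
for x, (attested, total) in counts.items():
    xs += [x] * total
    ys += [1] * attested + [0] * (total - attested)


def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the model logit(p) = b0 + b1 * x."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll


def fit_logistic(xs, ys, n_iter=25):
    """Maximum-likelihood fit of intercept b0 and slope b1 by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. b0
            g1 += (y - p) * x      # gradient w.r.t. b1
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(xs, ys)
ll_full = log_likelihood(b0, b1, xs, ys)

# Intercept-only (null) model: its maximum-likelihood intercept is the
# log-odds of the overall attestation rate.
rate = sum(ys) / len(ys)
ll_null = log_likelihood(math.log(rate / (1.0 - rate)), 0.0, xs, ys)

# Likelihood-ratio test of the slope (1 degree of freedom).
lr = 2.0 * (ll_full - ll_null)
p_value = math.erfc(math.sqrt(lr / 2.0))  # upper tail of chi-square, 1 df

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"likelihood-ratio statistic = {lr:.3f}, p = {p_value:.4f}")
```

A negative slope `b1` in this toy setting would correspond to possible words with more similar consonant pairs being less likely to be attested. The likelihood-ratio test asks whether adding that predictor improves the fit enough to justify the extra parameter.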

## Introduction to Mixed Linear Models in R (with case studies on phonetic reduction in spontaneous speech)

Slides – Part 1

Slides – Part 2

T. Florian Jaeger (Brain and Cognitive Sciences, Computer Science, University of Rochester)

Klinton Bicknell (Linguistics, UCSD)

Prerequisites: Familiarity with linear regression; it’s best if you have had some experience with using simple linear regression models.

Where: Dwinelle 370

When: Wednesday, 7/15/09, 7-9pm

Goal: Multilevel (or mixed) linear models provide efficient ways to model or analyze continuous data (durations, reaction times, gradient acceptability, looking times, formant ratios, etc.) while controlling for random effects (e.g. subjects, speakers, items). If appropriately applied, they can even be used to analyze highly unbalanced data, such as corpus data, which are typical in linguistic analyses. These models (along with multilevel logit models for categorical outcomes) are also of interest to sociolinguists (failure to account for random subject effects when investigating between-speaker [e.g. social] differences is problematic; see also Johnson, 2009).
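As a point of reference, the simplest model of this kind adds a by-speaker random intercept to an ordinary linear regression. The notation below is chosen for illustration and is not taken from the workshop slides; $y_{ij}$ is the $i$-th observation from speaker $j$ and $x_{ij}$ is a single predictor:

$$
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $\beta_0$ and $\beta_1$ are the fixed effects, $u_j$ is the adjustment for speaker $j$, and only the variance $\sigma_u^2$ of those adjustments is estimated, rather than one separate coefficient per speaker.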

I will use a large database of word duration measures from spontaneous speech to show how linear mixed models can be used to assess the partial effects of multiple variables on a continuous outcome (word duration) while dealing with common challenges to regression models (collinearity, outliers, overfitting), and while accounting for individual speaker differences via random effect modeling (different speakers speak at different rates). In this case, I try to address whether speakers lengthen words because they have trouble with upcoming words (strategic lengthening, e.g. Fox Tree & Clark, 1997; cf. availability-based production, Ferreira & Dell, 2000) or because the word they are currently pronouncing is more or less redundant (Aylett & Turk, 2004; Bell et al., 2003, 2009). I show how these questions can be addressed at the same time (unlike in previous work), while also controlling for various other factors, such as speech rate, the phonological environment, and social differences between speakers. The primary goal is to show what can be done with mixed models and how to do it (e.g. residualization, effect comparison, modeling of non-linearities) and what it means to make certain decisions during the modeling. The workshop will also serve to bring together interested researchers working with mixed linear models and to provide a forum for Q&A. A more detailed schedule is upcoming.
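Residualization, one of the techniques mentioned above for dealing with collinearity, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the workshop's R code: the two predictors (`log_frequency` and `predictability`) and their values are hypothetical, and the regression is a simple one-predictor least-squares fit. The idea is to replace one of two correlated predictors with the part of it that the other predictor cannot linearly account for.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def residualize(target, control):
    """Return the residuals of a least-squares regression of target on control."""
    mt, mc = mean(target), mean(control)
    sxy = sum((c - mc) * (t - mt) for c, t in zip(control, target))
    sxx = sum((c - mc) ** 2 for c in control)
    slope = sxy / sxx
    intercept = mt - slope * mc
    return [t - (intercept + slope * c) for c, t in zip(control, target)]


# Hypothetical, made-up predictor values for six words.
log_frequency = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predictability = [0.9, 2.1, 2.8, 4.2, 5.1, 5.8]

# The raw predictors are strongly correlated (collinear) ...
raw_r = pearson(predictability, log_frequency)

# ... but the residualized predictor is uncorrelated with the control.
resid_predictability = residualize(predictability, log_frequency)
resid_r = pearson(resid_predictability, log_frequency)

print(f"correlation before residualization: {raw_r:.3f}")
print(f"correlation after residualization:  {resid_r:.3f}")
```

In a regression that contains both `log_frequency` and the residualized predictor, the coefficient of the residualized predictor reflects only the variation in predictability that is not shared with frequency. This is one way to interpret effects of collinear variables, though it changes what the other coefficient means.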

### References

• Aylett, M. P. and Turk, A. (2004). The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47(1):31–56.
• Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., and Gildea, D. (2003). Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. Journal of the Acoustical Society of America, 113(2):1001–1024.
• Bell, A., Brenier, J., Gregory, M., Girand, C., and Jurafsky, D. (2009). Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language, 60(1):92–111.
• Ferreira, V. S. and Dell, G. S. (2000). The effect of ambiguity and lexical availability on syntactic and lexical production. Cognitive Psychology, 40:296–340.
• Fox Tree, J. E. and Clark, H. H. (1997). Pronouncing “the” as “thee” to signal problems in speaking. Cognition, 62:151–167.
• Johnson, D. E. (2009). Getting off the GoldVarb Standard: Introducing Rbrul for Mixed-Effects Variable Rule Analysis. Language and Linguistics Compass, 3(1):359–383. doi:10.1111/j.1749-818X.2008.00108.x