A few days ago, I presented at the Gradience in Grammar workshop organized by Joan Bresnan, Dan Lassiter, and Annie Zaenen at Stanford’s CSLI (1/17-18). The discussion and audience reactions (incl. the lack of reaction in some parts of the audience) prompted a few thoughts/questions about Gradience, Grammar, and the extent to which the meaning of generative has survived in modern-day generative grammar. I decided to break this up into two posts. This one summarizes the workshop – thanks to Annie, Dan, and Joan for putting it together!
The stated goal of the workshop was (quoting from the website):
For most linguists it is now clear that most, if not all, grammaticality judgments are graded. This insight is leading to a renewed interest in implicit knowledge of “soft” grammatical constraints and generalizations from statistical learning and in probabilistic or variable models of grammar, such as probabilistic or exemplar-based grammars. This workshop aims to stimulate discussion of the empirical techniques and linguistic models that gradience in grammar calls for, by bringing internationally known speakers representing various perspectives on the cognitive science of grammar from linguistics, psychology, and computation.
Apologies in advance for butchering the presenters’ points with my highly subjective summary; feel free to comment. Two of the talks demonstrated the power of bottom-up, item-specific approaches (when done right) and showed that they are quite capable of generalization:
- Harald Baayen talked about naive discriminative learning (NDL) and what it can and cannot do (it’s really quite striking how well summary measures of the networks derived from NDL do in predicting human behavior!),
- Antal van den Bosch discussed memory-based exemplar models in which only surface forms are stored and generalization is obtained via analogical reasoning across contexts.
Neat, though I think the challenge for these approaches remains to show that they work even when the input is a speech stream, rather than words or letter sequences (which still represent linguistic structure). Still, it’s encouraging to see how much information even letter (or phoneme) sequences contain. One particularly nice feature of both Harald’s and Antal’s approaches is that they scale well to large amounts of data.
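To make the exemplar idea concrete, here is a minimal sketch of a memory-based model: raw surface forms are stored with their labels, and unseen forms are classified purely by analogy to the most similar stored exemplar. The task and toy data below are made up for illustration; this is not the actual model Antal presented (which is far richer).

```python
def overlap(a, b):
    """Positional character overlap between two right-aligned strings
    (right-aligned because word endings carry the signal in this toy task)."""
    return sum(x == y for x, y in zip(reversed(a), reversed(b)))

def classify(word, memory):
    """Label a new word after its most similar stored exemplar."""
    exemplar, label = max(memory, key=lambda ex: overlap(word, ex[0]))
    return label

# Stored exemplars: does the English past-tense -ed sound like /t/ or /d/?
memory = [("walked", "t"), ("kissed", "t"), ("hoped", "t"),
          ("grabbed", "d"), ("played", "d"), ("buzzed", "d")]

# The model generalizes to unseen forms with no abstract rule stored anywhere:
print(classify("talked", memory))   # analogizes to "walked"
print(classify("robbed", memory))   # analogizes to "grabbed"
```

Nothing in memory is more abstract than a surface string, yet the analogical step yields rule-like behavior on novel items – the basic point of this family of models.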
Most directly related was probably Roger Levy’s talk on item-specific vs. abstract grammatical knowledge. Roger presented a series of experiments (sorry, I don’t remember all the collaborators’ names – some of the work was done at UCSD and some at Stanford with Mike Frank) aimed at distinguishing between these sources of knowledge; both seem to be consulted during language processing. Some aspects of the talk made for a nice link with Tim O’Donnell’s work on morphological acquisition, generalization, etc. and, more generally, with approaches that fall on the continuum between fully episodic and fully abstractionist theories of grammar (see, e.g., Post and Gildea, 2013 in Wiechmann, Kerz, Snider & Jaeger’s special issue in Language and Speech).
Amy Perfors focused on one of the classic problems in acquisition. She talked about the informational value of negative feedback and negative evidence (also showing nicely why the two are not the same), compared to positive evidence. Through a series of clever simulations she demonstrated why positive evidence carries a lot of information when searching a vast hypothesis space for a hypothesis that covers only a tiny part of that space (like language).
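The core of Amy’s argument about positive evidence can be sketched with a toy Bayesian calculation (the “size principle”): if examples are sampled from the true hypothesis, each one is far more probable under a small hypothesis than under a vast one, so a few positive examples rapidly favor the small hypothesis. The numbers below are made up for illustration; this is not her actual simulation.

```python
def posterior(hyp_sizes, priors, n_examples):
    """Posterior over hypotheses after n positive examples, assuming every
    example is consistent with all hypotheses considered here. The likelihood
    of one sampled example under a hypothesis of size s is 1/s."""
    likes = [p * (1.0 / s) ** n_examples for s, p in zip(hyp_sizes, priors)]
    z = sum(likes)
    return [l / z for l in likes]

# A small "language" (100 strings) vs. a broad one (10,000 strings),
# with a prior that initially favors the broad hypothesis.
sizes = [100, 10_000]
priors = [0.1, 0.9]

for n in [0, 1, 2, 3]:
    p_small, p_big = posterior(sizes, priors, n)
    print(f"after {n} examples: P(small language) = {p_small:.3f}")
```

Even with a prior stacked 9-to-1 against it, the small hypothesis dominates after a single positive example – which is exactly why positive evidence is so informative when the target covers only a tiny part of the hypothesis space.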
Steve Piantadosi (this time representing himself) reviewed various approaches to deriving one of the oldest and most striking linguistic generalizations: Zipf’s law. Drawing on a very large cross-linguistic data set (Wikipedia), we saw that languages actually are near-Zipfean rather than Zipfean (even once Mandelbrot’s correction is applied). Steve went through various models of this semi-law, many of which (e.g., the monkey on a typewriter) just don’t match any other property of language. My personal impression was that explanations referring to the re-occurrence of events (those that we’d like to talk about) are among the more valid derivations, but one of the take-home points was that any simple one-argument explanation of Zipf’s law is unlikely to capture the complex but systematic violations observed in his studies while also capturing the consistent near-Zipfean nature of language.
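The monkey-on-a-typewriter baseline is easy to reproduce, and it illustrates why it’s such a seductive (and unsatisfying) derivation: random typing with a space key yields a roughly power-law rank-frequency curve even though the resulting “words” share no other property with language. The alphabet size and text length below are arbitrary choices for illustration.

```python
import math
import random
from collections import Counter

random.seed(1)
alphabet = "abcd "   # four letters plus a space as the word boundary
text = "".join(random.choice(alphabet) for _ in range(200_000))
words = [w for w in text.split(" ") if w]

# Rank-frequency distribution of the monkey's "vocabulary"
freqs = sorted(Counter(words).values(), reverse=True)

# Estimate the slope of log(frequency) against log(rank) over the top ranks;
# Zipf's law predicts a straight line with slope near -1.
ranks = range(1, 51)
xs = [math.log(r) for r in ranks]
ys = [math.log(freqs[r - 1]) for r in ranks]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"log-log slope over top 50 ranks: {slope:.2f}")
```

That a process this dumb gets into the right ballpark is precisely Steve’s point: matching the (near-)Zipfean curve alone is far too weak a test of a model of language.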
Finally, there were three talks that focused explicitly on the nature of, or cause for, distributional linguistic knowledge:
- Ted Gibson (incarnated by Steve Piantadosi, since Ted couldn’t make it) talked about the noisy channel model and what it can explain about sentence processing and the shape of lexica & grammars cross-linguistically. Of course, an assumption here is that robust/efficient inference over noisy input requires prediction, which in turn requires some way in which linguistic distributions are stored.
- Maryellen MacDonald discussed how the pressures inherent to linguistic encoding (interference in memory, the linearization of not necessarily sequentially organized conceptual messages into a stream of words, phonemes, etc.) affect speakers’ production preferences, thereby creating the linguistic distributions that shape comprehenders’ predictions/expectations.
- I presented work on speech adaptation by Dave Kleinschmidt (4th-year grad at Rochester) and on syntactic adaptation by Alex Fine (now a post-doc in Psychology at Illinois) and Thomas Farmer (now faculty in Psychology, Iowa). The two papers that summarize much of our perspective can be found on HLP Lab’s academia.edu page. Dave’s paper presents a framework for speech perception, talker-specificity, generalization, and adaptation (under revision for Psych Review, to be updated soon). The paper by Alex Fine, Thomas Farmer, and Ting Qian shows that similar incremental distributional learning processes seem to be at work during sentence processing. The point of this work is that we not only seem to have distributional linguistic knowledge but also seem to have such knowledge for different language contexts (e.g., specific talkers, but also specific dialects). At least, there’s enough intriguing evidence out there suggesting that this might be the case and, if so, it would have a rather considerable impact on how linguistic ‘knowledge’ is organized (and how we study it).
A model of how this might work is presented in the Kleinschmidt and Jaeger paper, but it also owes in large part to Ting Qian’s work (e.g., I highly recommend his review article in Frontiers on learning across different contexts, which came straight out of his quals; Dick and I just helped shape the presentation of the argument). Finally, we have a perspective piece in the works in which we connect this line of thinking to research on multilingualism and second language acquisition (co-authored by Bozena Pajak, Alex Fine, Dave Kleinschmidt, and me). Let us know if you might be interested in that one.
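The flavor of incremental distributional learning at issue here can be sketched with a toy belief-update loop: a listener tracks a talker-specific statistic (below, the probability that this talker uses some structure A) and nudges it after every exposure. This is only a hedged Beta-Binomial illustration with made-up numbers, not the actual Kleinschmidt and Jaeger model (which works over continuous phonetic cues, among other things).

```python
def update(alpha, beta, observation):
    """Conjugate Beta update after one binary observation (1 = structure A)."""
    return alpha + observation, beta + (1 - observation)

alpha, beta = 2.0, 8.0   # prior belief: structure A is rare (expected P = 0.2)
exposures = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]   # but this "talker" uses A often

for obs in exposures:
    alpha, beta = update(alpha, beta, obs)

print(f"expected P(A) for this talker after exposure: {alpha / (alpha + beta):.2f}")
```

Ten observations are enough to pull the expectation from 0.2 toward this talker’s actual rate – the qualitative signature of the adaptation effects described above, where comprehenders’ expectations shift toward the statistics of the current context.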
All of these talks seemed to argue that distributional (statistical) linguistic knowledge needs to be part of the explanandum of linguistic research. Roger Levy gave a great introduction to the workshop, laying out this idea and how it relates to, and might conflict with, other beliefs about what constitutes the explanandum of linguistic research. That’s actually what prompted some of the thoughts that I’ll continue to ramble on about in the next post.