Speech recognition: Recognizing the familiar, generalizing to the similar, and adapting to the novel
At long last, we have finished a substantial revision of Dave Kleinschmidt’s opus “Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel”. It’s still under review, but we’re excited about it and wanted to share what we have right now.
The paper builds on a large body of research in speech perception and adaptation, as well as on distributional learning in other domains, to develop a normative framework of how we manage to understand each other despite the infamous lack of invariance. At the core of the proposal stands the (old, but often under-appreciated) idea that variability in the speech signal is often structured (i.e., conditioned on other variables in the world) and that an ideal observer should take advantage of that structure. This makes speech perception a problem of inference under uncertainty at multiple levels (e.g., uncertainty about the ‘perceived’ category, uncertainty about the identity or group membership of the talker, uncertainty about the appropriate generative model, or mixture of models, for the current talker, etc.). For some of these inference processes (e.g., category recognition or similarity between percepts), this has been recognized by previous work (see, e.g., Meghan Clayards, Naomi Feldman, Dennis Norris and James McQueen, Morgan Sonderegger). We expand this idea to deal with the lack-of-invariance problem: talkers differ from each other in how they map categories to the acoustic cue dimensions.
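To make the ideal-observer idea concrete, here is a minimal Python sketch. It is not the model from the paper. It assumes a single cue (voice onset time, VOT), Gaussian cue distributions for each category, and two hypothetical talkers whose category means are made up purely for illustration. It shows that the same acoustic value is best categorized differently depending on the talker, and that uncertainty about the talker carries over into uncertainty about the category.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical talker-specific generative models: category -> (mean VOT in ms, sd).
# The numbers are illustrative assumptions, not estimates from any data set.
TALKERS = {
    "typical": {"b": (0.0, 10.0), "p": (50.0, 10.0)},
    "shifted": {"b": (20.0, 10.0), "p": (70.0, 10.0)},
}

def posterior_category(vot, talker_model):
    """P(category | cue, talker) via Bayes' rule with a uniform category prior."""
    prior = 1.0 / len(talker_model)
    joint = {c: prior * gaussian_pdf(vot, m, sd) for c, (m, sd) in talker_model.items()}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

def posterior_category_unknown_talker(vot, talker_prior):
    """P(category | cue) when the talker is uncertain: marginalize over talkers."""
    joint = {}
    for talker, p_talker in talker_prior.items():
        model = TALKERS[talker]
        prior = 1.0 / len(model)
        for c, (m, sd) in model.items():
            joint[c] = joint.get(c, 0.0) + p_talker * prior * gaussian_pdf(vot, m, sd)
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

if __name__ == "__main__":
    vot = 35.0
    print(posterior_category(vot, TALKERS["typical"]))   # favors "p"
    print(posterior_category(vot, TALKERS["shifted"]))   # favors "b"
    print(posterior_category_unknown_talker(vot, {"typical": 0.5, "shifted": 0.5}))
```

With these made-up numbers, a VOT of 35 ms is almost certainly a /p/ for the first talker and almost certainly a /b/ for the second. If the listener cannot tell which talker it is, the categories come out equally likely.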
In the first part of the paper, we aim to provide readers unfamiliar with this perspective with intuitions about how this framework guides our thinking about adaptation to unexpected ‘accents’. We present a series of modeling studies that interpret speech adaptation in terms of statistical learning (following work on language acquisition by, e.g., Bob McMurray, Joe Toscano, Jay McClelland, etc.). This approach explains several otherwise puzzling results from the literature on speech adaptation. In the second part of the paper, we extend the framework from laboratory experiments to speech perception in everyday life. We discuss what is known about talker-specificity, the recognition of accented talkers, generalization between talkers as well as the lack thereof, etc. We discuss to what extent these findings are accounted for by the proposed perspective and what questions this proposal raises for future work. In the final part of the paper, we relate our account to other models of speech perception, such as exemplar-based, episodic, and abstractionist approaches. We also discuss how our proposal relates to work on inference and learning in other perceptual and motor domains.
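The view of adaptation as statistical learning, described for the first part of the paper, can be illustrated with a deliberately simplified Python sketch. It is not the modeling approach used in the paper. It assumes that the listener only has to learn a category's mean cue value, that the within-category variance is known, and that prior beliefs about the mean are Gaussian. Under those assumptions the update is the standard conjugate normal-normal rule. The function name and all numbers are hypothetical.

```python
def update_category_mean(prior_mean, prior_var, observations, noise_var):
    """Posterior over a category's mean cue value after hearing some tokens.

    Assumes a Gaussian prior on the mean and a known within-category
    variance (noise_var), i.e. the textbook conjugate normal-normal update.
    Returns (posterior_mean, posterior_var).
    """
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var
    posterior_precision = 1.0 / prior_var + n / noise_var
    posterior_var = 1.0 / posterior_precision
    posterior_mean = posterior_var * (
        prior_mean / prior_var + sum(observations) / noise_var
    )
    return posterior_mean, posterior_var

if __name__ == "__main__":
    # Hypothetical scenario: the listener expects /p/ at about 50 ms VOT,
    # but the new talker produces /p/ at about 70 ms.
    mean, var = 50.0, 100.0
    for token in [70.0, 70.0, 70.0, 70.0]:
        mean, var = update_category_mean(mean, var, [token], noise_var=100.0)
        print(round(mean, 2), round(var, 2))
```

Each additional token pulls the believed category mean further toward the talker's productions and shrinks the uncertainty about it. The prior variance controls how quickly this happens: a listener with strong prior expectations adapts more slowly.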
The paper is long and might be cut, but for now we wanted to lay out our ideas with sufficient space to make them clear. Feedback is welcome. We have tried our best to do justice to relevant work, but if you spot important things that we missed or misrepresented, please feel free to comment here (or send us a note).