Speech recognition: Recognizing the familiar, generalizing to the similar, and adapting to the novel


At long last, we have finished a substantial revision of Dave Kleinschmidt’s opus “Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel”. It’s still under review, but we’re excited about it and wanted to share what we have right now.

Phonetic recalibration via belief updating
Figure 1: Modeling changes in phonetic classification as belief updating. After repeatedly hearing a VOT that is ambiguous between /b/ and /p/ but occurs in a word where it can only be a /b/, listeners change their classification of that sound, calling it a /b/ much more often. We model this as a belief updating process, where listeners track the underlying distribution of cues associated with the /b/ and /p/ categories (or the generative model), and then use their beliefs about those distributions to guide classification behavior later on.
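
As a rough illustration of the belief-updating idea in Figure 1 (a minimal sketch, not the model from the paper), the snippet below treats /b/ and /p/ as Gaussian distributions over VOT and updates the listener's belief about the /b/ mean after labeled exposure to an ambiguous VOT. All numbers (category means, standard deviation, prior pseudo-count, exposure size) and the function names are hypothetical, and the sketch simplifies by keeping the variance fixed and classifying with the updated point estimate of the mean rather than a full posterior predictive.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def update_mean(prior_mean, prior_n, observations):
    """Conjugate update of a category mean with known variance.

    prior_n is a pseudo-count: how many observations the prior is 'worth'.
    """
    n = len(observations)
    return (prior_n * prior_mean + sum(observations)) / (prior_n + n)

def p_b(vot, b_mean, p_mean, sd=10.0):
    """Posterior probability of /b/ given a VOT, assuming equal category priors."""
    like_b = normal_pdf(vot, b_mean, sd)
    like_p = normal_pdf(vot, p_mean, sd)
    return like_b / (like_b + like_p)

# Hypothetical prior beliefs about category means (VOT in ms).
b_mean, p_mean = 0.0, 50.0
ambiguous_vot = 25.0  # halfway between the two category means

# Before exposure the ambiguous token is a coin flip.
before = p_b(ambiguous_vot, b_mean, p_mean)

# Exposure: 20 ambiguous tokens, each disambiguated as /b/ by the word it occurs in.
exposure = [ambiguous_vot] * 20
b_mean_after = update_mean(b_mean, prior_n=20, observations=exposure)

# After exposure the same token is classified as /b/ much more often.
after = p_b(ambiguous_vot, b_mean_after, p_mean)

print(f"P(/b/ | VOT=25) before: {before:.2f}, after: {after:.2f}")
```

With these made-up values the /b/ mean shifts halfway toward the ambiguous VOT, so the classification of the ambiguous token moves from 0.5 to about 0.91 in favor of /b/. A larger prior pseudo-count would make the listener's beliefs change more slowly.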

The paper builds on a large body of research on speech perception and adaptation, as well as on distributional learning in other domains, to develop a normative framework of how we manage to understand each other despite the infamous lack of invariance. At the core of the proposal stands the (old, but often under-appreciated) idea that variability in the speech signal is often structured (i.e., conditioned on other variables in the world) and that an ideal observer should take advantage of that structure. This makes speech perception a problem of inference under uncertainty at multiple levels (e.g., uncertainty about the ‘perceived’ category, uncertainty about the identity or group membership of the talker, uncertainty about the appropriate generative model, or mixture of models, for the current talker, etc.). For some of these inference processes (e.g., category recognition or similarity between percepts), this has been recognized by previous work (see, e.g., Meghan Clayards, Naomi Feldman, Dennis Norris and James McQueen, Morgan Sonderegger). We extend this idea to the lack of invariance problem: talkers differ from each other in how they map categories onto acoustic cue dimensions.

Figure 2: Depending on the talker, the boundary between two phonetic categories (here /s/ and /sh/) can be quite different (left), due to differences between talkers in the distributions of cues they produce for each category (center). Each talker’s cue statistics can be characterized by probability distributions, and in this paper we explore how listeners can adapt to different situations by inferring the appropriate generative model parameters based on their prior beliefs (from experience in other situations) and the speech they hear in the current situation.
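
To make the idea of inferring the appropriate generative model more concrete, here is a small sketch under strong simplifying assumptions (it is not the paper's implementation). The listener is assumed to entertain just two candidate talker models, each with its own /s/ and /sh/ cue means, and to infer which one fits the unlabeled cues heard so far. The talker names, the cue values (loosely thought of as a spectral cue in Hz), and the shared standard deviation are all hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Hypothetical talker-specific generative models: category means on one cue dimension.
TALKERS = {
    "talker_A": {"s": 6000.0, "sh": 4000.0},
    "talker_B": {"s": 5000.0, "sh": 3000.0},
}
SD = 400.0  # assumed shared within-category standard deviation

def cue_likelihood(cue, model):
    """Likelihood of an unlabeled cue, marginalizing over /s/ and /sh/ (equal priors)."""
    return 0.5 * normal_pdf(cue, model["s"], SD) + 0.5 * normal_pdf(cue, model["sh"], SD)

def talker_posterior(cues, prior):
    """Posterior over candidate talker models given observed cues (computed in log space)."""
    log_post = {
        t: math.log(prior[t]) + sum(math.log(cue_likelihood(c, m)) for c in cues)
        for t, m in TALKERS.items()
    }
    max_lp = max(log_post.values())
    unnorm = {t: math.exp(lp - max_lp) for t, lp in log_post.items()}
    total = sum(unnorm.values())
    return {t: u / total for t, u in unnorm.items()}

def p_s(cue, talker_beliefs):
    """Probability of /s/ for a cue, averaging over beliefs about the talker."""
    result = 0.0
    for t, weight in talker_beliefs.items():
        m = TALKERS[t]
        like_s = normal_pdf(cue, m["s"], SD)
        like_sh = normal_pdf(cue, m["sh"], SD)
        result += weight * like_s / (like_s + like_sh)
    return result

prior = {"talker_A": 0.5, "talker_B": 0.5}
heard = [3000.0, 5100.0, 2900.0, 4900.0]  # made-up cues from the current talker
posterior = talker_posterior(heard, prior)

cue = 5000.0  # ambiguous under talker_A's model, a clear /s/ under talker_B's
print(posterior)
print(p_s(cue, prior), p_s(cue, posterior))
```

In this toy setup, a handful of cues is enough to shift belief strongly toward talker_B, which in turn moves the classification of the 5000 Hz cue toward /s/. A fuller treatment would infer continuous model parameters rather than choose between a fixed set of talkers.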

In the first part of the paper, we aim to provide readers unfamiliar with this perspective with intuitions about how this framework guides our thinking about adaptation to unexpected ‘accents’. We present a series of modeling studies that interpret speech adaptation in terms of statistical learning (following work on language acquisition by, e.g., Bob McMurray, Joe Toscano, Jay McClelland, etc.). This approach explains several otherwise puzzling results from the literature on speech adaptation. In the second part of the paper, we extend the framework from laboratory experiments to speech perception in everyday life. We discuss what is known about talker-specificity, the recognition of accented talkers, generalization between talkers as well as the lack thereof, etc. We discuss to what extent these findings are accounted for by the proposed perspective and what questions this proposal raises for future work. In the final part of the paper, we relate our account to other models of speech perception, such as exemplar-based, episodic, and abstractionist approaches. We also discuss how our proposal relates to work on inference and learning in other perceptual and motor domains.

The paper is long and might be cut, but for now we wanted to lay out our ideas with sufficient space to make them clear. Feedback is welcome. We have tried our best to do justice to relevant work, but if you spot important things that we missed or misrepresented, please feel free to comment here (or send us a note).

5 thoughts on “Speech recognition: Recognizing the familiar, generalizing to the similar, and adapting to the novel”

    Fred Hasselman said:
    July 10, 2014 at 3:39 am

    Interesting model to explain metastability.
    How would it account for hysteresis and enhanced contrast?

    See e.g.

    http://dx.doi.org/10.6084/m9.figshare.1080795

    Which was inspired by the work described here:

    http://www.nsf.gov/sbe/bcs/pac/nmbs/chap8.pdf

      tiflo responded:
      July 10, 2014 at 6:44 am

      Hi Fred,

      thanks for the links. In particular Betty’s work (which I am ashamed to admit I hadn’t been aware of before, though Dave might have already known it) is relevant to some aspects of our paper. Order effects like the ones reported in that paper are not expected by fully Bayesian models (unboundedly rational models). They are one of the reasons why we describe the ideal adapter as a framework (see, e.g., Part III of the paper, where we discuss other order effects).

      The first thing to note about potential adaptation effects in Betty’s paper is that the statistical learning is essentially unsupervised in this case (i.e., in this way comparable to, e.g., Cheyenne Munson’s thesis work or Clayards et al. 2008). One way to account for such order effects would be to say that listeners cannot maintain the full perceptual information for arbitrarily long (a pretty uncontroversial assumption, though ongoing work by Klinton Bicknell with Mike Tanenhaus and me suggests that listeners can maintain uncertainty about the percept for longer than previously assumed). This forces listeners to categorize or otherwise compress information about the percepts, which would mean they lose information about percepts over time. This predicts order effects, but doesn’t yet predict when one would get contrast or hysteresis. I have to read Betty’s paper in full to see whether the different results were obtained for the same or different items. If it’s the latter, the explanation might lie in word frequency effects or perceptual magnet effects.

      In any case, thanks for posting this. It was also interesting to see that you applied these ideas to literacy. We have done some work on changes to reading behavior based on recent exposure (motivated by the same ideas we’re pursuing here for speech perception). Perhaps Fine, Jaeger, Farmer, and Qian (2013, PLoS ONE) would be of interest to you?

      http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0077661

        Fred Hasselman said:
        July 11, 2014 at 5:24 am

        Thank you, I will certainly have a look at those references.

        In Tuller et al. (1994), category switches were induced by approaching the ‘bi-stable state’ sequentially, but never ‘crossing the boundary’. At different points along the continuum the very same stimulus was repeated a number of times. This caused category switches after 2-6 presentations, depending on the stopping point along the continuum. (I note in the discussion of the linked chapter that these phenomena could have consequences for the interpretation of results from Mismatch / Error Related Negativity paradigms.)

        best,
        Fred

        Tuller, B., Case, P., Ding, M., & Kelso, J. (1994). The nonlinear dynamics of speech categorization. Journal of Experimental Psychology: Human Perception and Performance, 20(1), 3–16.

          Dave Kleinschmidt said:
          July 11, 2014 at 7:16 pm

          I’m a bit embarrassed to admit that I haven’t read much of this literature. Bodo Winter has some interesting work in a similar vein (a dynamical systems model of phonetic categorization that has slower habituation and Hebbian perceptual learning dynamics) that might also account for some (but not all) of the data we discuss in this paper. I agree with Florian that one of the things from our work that I’m particularly excited about is that it provides a framework for thinking about how people ought to hold on to what they’ve learned in one situation and apply it in others. This is a bit of an underexplored frontier at the moment, and I think that having a formal way of thinking about it is a real advantage of this approach over previous work.

          Equally exciting, though, is making connections with other approaches to medium-scale temporal effects in speech perception. Like Florian said, the normative, Bayesian approach isn’t normally thought to be very well-suited to this kind of stuff, because it typically focuses on asymptotic characterizations of behavior in different statistical environments. But there’s a lot of exciting stuff being done that starts to bridge this gap, both from the angle of thinking about how resource limitations and approximations of truly optimal Bayesian inference affect performance, and from explicitly incorporating order effects into the underlying statistical model of the world that determines what is optimal at all.

          So I think there’s a lot of potential in bringing insights from the literature on dynamical systems approaches into Bayesian modeling efforts, and (hopefully!) vice-versa. This has definitely bumped that up on my to-do list 🙂

          Fred Hasselman said:
          July 12, 2014 at 1:23 pm

          Indeed, I think it is worthwhile trying to figure out the common ground in these approaches. A downside to the potential models is that parameter fitting of the “participant” parameter is not very well specified. At first glance this is similar to what you seem to model. It’s on my list too!
