An Auditory Scene Analysis Approach to Speech Segregation

DeLiang Wang
Ohio State University

Segregation of speech from interfering sounds, or cocktail-party
processing, has proven to be very challenging. We describe an auditory
scene analysis model for the task. The model starts with a simulated
auditory periphery. A subsequent stage computes mid-level auditory
representations, including correlogram and cross-channel correlation.
The core of the model performs segmentation and grouping in a
two-dimensional time-frequency representation that encodes proximity in
frequency and time, periodicity, and amplitude modulation (AM).
Motivated by psychoacoustic observations, our system employs different
mechanisms for handling resolved and unresolved harmonics: for the
latter, it generates segments based on common AM in addition to
temporal continuity, and groups them according to AM repetition rates.
yields substantially better performance than previous systems. We also
discuss oscillatory correlation as a potential neural mechanism
underlying auditory scene analysis.
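
To make the mid-level stage concrete, the following is a minimal sketch
of a correlogram (a running autocorrelation in each frequency channel)
and of cross-channel correlation, assuming the periphery has already
produced per-channel filterbank outputs. The function names and frame
parameters are illustrative, not the author's implementation.

import numpy as np

def correlogram(channels, start, frame_len, max_lag):
    """Running autocorrelation of each channel over one time frame.

    channels: (num_channels, num_samples) array of filterbank outputs,
    e.g. from a gammatone filterbank (assumed front end).
    Returns a (num_channels, max_lag) array: the time-frequency
    representation on which segmentation and grouping operate.
    """
    acg = np.zeros((channels.shape[0], max_lag))
    for c in range(channels.shape[0]):
        seg = channels[c, start:start + frame_len + max_lag]
        for lag in range(max_lag):
            acg[c, lag] = np.dot(seg[:frame_len], seg[lag:lag + frame_len])
    return acg

def cross_channel_correlation(acg):
    """Normalized correlation between the autocorrelations of adjacent
    channels; high values suggest both channels are driven by the same
    source, a cue for forming segments."""
    z = acg - acg.mean(axis=1, keepdims=True)
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-12
    return np.sum(z[:-1] * z[1:], axis=1)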
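For channels dominated by unresolved harmonics, grouping by AM
repetition rate can be illustrated by estimating the rate from the
autocorrelation of a channel's amplitude envelope. This is a hedged
sketch of one plausible estimator, not the exact mechanism of the model;
the Hilbert-envelope approach and the f0 search range are assumptions.

import numpy as np
from scipy.signal import hilbert

def am_rate(x, fs, min_f0=80.0, max_f0=400.0):
    """Estimate the AM repetition rate (Hz) of one channel's response
    by peak-picking the autocorrelation of its envelope."""
    env = np.abs(hilbert(x))              # amplitude envelope
    env -= env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lo, hi = int(fs / max_f0), int(fs / min_f0)
    lag = lo + int(np.argmax(ac[lo:hi]))  # best period in plausible range
    return fs / lag                       # repetition rate in Hz

Segments whose estimated rates agree within a tolerance would then be
grouped as belonging to the same voiced source.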
