Cochlear modeling and its role in human speech recognition

Jont Allen
University of Illinois at Urbana-Champaign

In the last 20 years our understanding of the cochlea has dramatically
improved. In the mid-1980s, and certainly before 1970, little was
understood of inner and outer hair cell function. Furthermore, the
role of the cochlea was poorly understood when it came to the most
important problem: speech processing by the auditory system. Today this
has all changed. We now know that the outer hair cells play a key role
in controlling the dynamic range of hearing, and are the source of wide
dynamic range compression in the cochlea. We know that the narrow-band
tuning of the cochlear filters, arising from dispersive wave propagation
within the cochlea, is responsible for the detection of signals such
as tones, music, and the most important signal, speech, in high levels of noise.

In fact, the problem of the robustness of human speech recognition may
be traced back to the cochlea. The question of the role of the cochlea
in speech processing was extensively explored at Bell Labs by Harvey
Fletcher, and then by many others there. These studies were quantified
in terms of a measure called the articulation index (AI). There are
many forms of the articulation index available, but none work as well
as the first two: the first due to Fletcher (Fletcher, 1921; Fletcher
and Galt, 1951) and the follow-on due to French and Steinberg (1947).
The AI was developed years before information theory, yet surprisingly
there seem to be some interesting connections. The AI measure (a sum,
over cochlear frequency bands, of the log of 1 plus the signal-to-noise
ratio) is very similar to Shannon's channel capacity for a Gaussian
channel (an integral, over frequency, of the log of 1 plus the signal-to-noise
ratio). Fletcher showed that the probability correct for nonsense
{C,V} materials is given by

P_c(AI) = 1 - e_min^AI,

where e_min is the minimum recognition error, the error that remains
when AI = 1.

This formula was based on Fletcher's observation, and his subsequent model
of speech recognition, that speech features are independently detected
over cochlear frequency bands. Fletcher's concept stands tall today.

I shall show that Fletcher's independent channel formulation holds when
one represents a human listener as a Shannon channel, and characterizes
human speech recognition performance in terms of the channel's confusion
matrix, over nonsense speech sounds (consonants and vowels).
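Treating the listener as a Shannon channel means the confusion matrix (spoken sound s in, heard sound h out) carries the channel's information rate. A minimal sketch of that computation, estimating the mutual information I(S;H) in bits from a matrix of confusion counts (the example matrices are hypothetical, not Miller-Nicely data):

```python
import numpy as np

def mutual_information_bits(counts):
    """Mutual information I(S;H) in bits from a confusion matrix of
    counts: rows index the spoken sound s, columns the heard sound h."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                 # joint distribution p(s, h)
    ps = p.sum(axis=1, keepdims=True)         # marginal p(s)
    ph = p.sum(axis=0, keepdims=True)         # marginal p(h)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0.0, p * np.log2(p / (ps * ph)), 0.0)
    return terms.sum()

# A perfect listener (diagonal confusions) transmits log2(4) = 2 bits
# over 4 equiprobable sounds; pure guessing transmits 0 bits.
perfect = np.eye(4) * 25.0
guessing = np.ones((4, 4))
```

As noise increases, off-diagonal confusion mass grows and I(S;H) falls, which is the sense in which the confusion matrix and the AI are both information-theoretic measures of the same channel.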

I will show how the AI measure, which is based on human critical bands
and the French and Steinberg theory (Allen, 1996) can be used to model
the nonsense consonant confusion matrix data of Miller and Nicely
(1955). I will present a quantitative relation between these two
information-theoretic measures, the confusion matrix C_{s|h} and the AI
(a form of channel capacity), for the case of human nonsense speech
recognition, with high amounts of noise (-18 to +12 dB SNR wideband).
This key relationship gives deep insight into the robust recognition of
nonsense speech sounds by human listeners, in high levels of noise.

Back to Mathematics of the Ear and Sound Signal Processing