Utterances in English and other languages can be analyzed into a sequence of abstract units called phonemes like those in the table to the left, which includes a complete set of phonemes for American English. Subsets of these phonemes can be grouped into categories according to the type of phonation involved, that is, what the speaker is doing with his or her vocal organs to create the speech sound in question.
There are many ways in which speech sounds can be grouped. Our primary-school categories of vowel and consonant capture the most basic contrast in speech sounds: can a phoneme serve as a syllable nucleus or not? Another fundamental criterion for separating speech sounds into categories concerns the mode of phonation, in which case we have four categories:
According to this classification scheme, as a linguist once pointed out, human language is a sequence of buzzes (voicing), hisses (frication), and pops (plosion), but of course we humans do not hear it that way.
For the purposes of spectrogram reading, I prefer to divide speech sounds into nine categories, set forth below. Each has a distinctive signature in the spectrogram, which is a representation of the phonation types, and ultimately of the phonemes, present in the utterance. We speak by using sound to produce complex three-dimensional patterns: energy distributed over the two-dimensional plane of time and frequency. The patterns are so complex, in fact, that it takes a little time and practice to learn how to read them. But it is well worth the effort, because only in this way can one appreciate something of the complexity and beauty of the patterns that we code and decode so easily with our ears and the neuronal structures "attached" to them, including some of the highest cortical processing areas.
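As a rough illustration of what a spectrogram computes, the following Python sketch slices a signal into overlapping frames, applies a Hann window to each, and takes the magnitude of a discrete Fourier transform. The names `spectrogram` and `bin_to_hz`, the frame length, the hop size, and the test tone are all assumptions chosen for illustration; they are not the settings used to produce the figures on this page, and the direct DFT is written for readability rather than speed.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return one magnitude spectrum per frame (rows = time, columns = frequency bins)."""
    # Hann window: tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        # Bins 0..frame_len/2 cover 0 Hz up to half the sampling rate.
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def bin_to_hz(k, sample_rate, frame_len):
    """Centre frequency in Hz of DFT bin k."""
    return k * sample_rate / frame_len


# Hypothetical test signal: a steady 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]
frames = spectrogram(tone)

# The strongest bin in each frame; for a steady tone it is the same in every frame.
peak_bins = [row.index(max(row)) for row in frames]
```

Each row of `frames` is one vertical slice of the picture, and the magnitude in each cell is the third dimension that a spectrogram display renders as colour or darkness.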
Below are the nine categories, with a description of the spectral characteristics and example spectrograms for each category. The categories are listed in the same order as the phonemes in the panel to the left; to see spectrograms for each American English phoneme in the phoneme list, click on the appropriate symbol or word.
Monophthong vowels These are characterized by strong stable
voicing, as represented in Figure 1 below. The formants in the
vowel, visible as a grouping of three components: (1) a red band of increasing energy, (2)
a maximum in green and yellow, and (3) decreasing energy in blue, are stable in time.
Geometrically this means that they are horizontal, showing no movement along the y-axis of
frequency. In Figure 1, we are seeing F1 and F3, since F2 has been absorbed into F1.
Figure 1 - Monophthong /A/ from the utterance "ah."
Diphthong vowels The diphthongs have strong moving voicing, as
represented in Figure 2 below. The formants are not horizontal
throughout the life of the vowel as they were in the monophthong vowels, but move from a
beginning configuration to a target configuration.
Figure 2 - Diphthong /aI/ from the utterance "eye" with /a/ passing into /I/.
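The contrast between the horizontal formants of a monophthong and the moving formants of a diphthong can be put in numbers once a formant has been tracked frame by frame. The sketch below is a minimal illustration under stated assumptions: the function names, the 200 Hz cutoff, and the two F2 tracks are all hypothetical and are not measured from Figures 1 and 2.

```python
def formant_excursion(track_hz):
    """Total range (max minus min) of a formant track, in Hz."""
    return max(track_hz) - min(track_hz)


def is_moving(track_hz, threshold_hz=200.0):
    """True if the track moves more than threshold_hz (an assumed cutoff)."""
    return formant_excursion(track_hz) > threshold_hz


# Hypothetical F2 tracks, one value per analysis frame.
steady_f2 = [1100, 1110, 1095, 1105, 1100]    # flat, monophthong-like
gliding_f2 = [1200, 1400, 1700, 1950, 2100]   # rising, diphthong-like

steady_moves = is_moving(steady_f2)
gliding_moves = is_moving(gliding_f2)
```

A single range measure ignores the direction and timing of the movement; it is only meant to show that "horizontal" versus "moving" is a property one can test on the track itself.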
Approximants The liquids (/9r/ and /l/) and glides (/j/ and /w/)
have formants which are less pronounced than those of vowels,
because of a slight obstruction placed somewhere along the vocal tract which creates a
unique signature for each approximant, as represented by Figure 3 below.
Figure 3 - Two liquids and a vowel from the utterance "real", with /9r/ symbolized by F3 lower than 2000 Hz at the beginning, the stable vowel /i:/ in the center, and /l/ symbolized by the wide jaw-like opening of a gap between F2 and F3 at the end of the spectrogram.
Nasals The nasals have much less energy than any of the previous
phonation categories. This is because the oral tract is completely blocked, and sound
waves radiate principally from the nose. There is a characteristic nasal "zero"
or region of extremely low energy.
Figure 4 - Two nasals and a vowel from the utterance "mean", with /m/ symbolized by the rapid rise of F2 from 900 Hz at the beginning, the stable vowel /i:/ in the center, and /n/ symbolized by a less dramatic fall toward 1800 Hz at the end.
Fricatives The fricatives do not necessarily involve any voicing,
although the voiced fricatives may have a very low voice bar as in the /v/ in Figure 5
below. The signature of fricatives is in their high-frequency regions, which are more
random in their energy distribution than voicing.
Figure 5 - Two fricatives and a vowel from the utterance "save", with /s/ symbolized by the opening frication rectangle, the diphthong vowel /ei/ in the center, and /v/ symbolized by a drop in voicing and the high-energy plume of frication at the end.
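The randomness of frication, as opposed to the regularity of voicing, also shows up in the waveform itself. The sketch below uses the zero-crossing rate, a common time-domain proxy that is swapped in here for illustration; it is not the spectrogram-based reading described above. The 120 Hz sine standing in for voicing and the white noise standing in for frication are hypothetical signals, and the function name is an assumption.

```python
import math
import random


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)


sample_rate = 8000
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins: a 120 Hz sine for voicing, white noise for frication.
buzz = [math.sin(2 * math.pi * 120 * n / sample_rate) for n in range(2000)]
hiss = [rng.uniform(-1.0, 1.0) for _ in range(2000)]

buzz_zcr = zero_crossing_rate(buzz)
hiss_zcr = zero_crossing_rate(hiss)
```

The noise changes sign far more often than the low-frequency sine, which is the time-domain counterpart of energy sitting in the high-frequency region of the spectrogram.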
Plosives The plosives involve an explosive burst of acoustic
energy following a short period of silence; because of the silence during which the vocal
tract is completely blocked, these phonemes are also called stops. The signature of
plosives is an almost instantaneous passage from little or no acoustic energy to a short
burst of high energy across a wide frequency band. The plosives, like the fricatives, may be
accompanied by voicing.
Figure 6 - Two plosives and a vowel from the utterance "tide", with the burst of /t/ followed by aspiration, the diphthong vowel /aI/ in the center, and voicing continuing through the closure of /d/, the release of which is at the right.
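The plosive signature described above, near-silence followed almost instantly by a burst, can be looked for with a simple frame-energy comparison. This is a minimal sketch under stated assumptions: the function names, the 10 ms frame length, both energy thresholds, and the synthetic signal (a tone, a silent closure, a noise burst, and a tone again) are all hypothetical. It covers only voiceless closures; a voiced closure such as the /d/ in Figure 6 is not silent and would need a different test.

```python
import math
import random


def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def find_bursts(energies, silence_max, burst_min):
    """Indices of frames at or above burst_min whose previous frame is at or below silence_max."""
    return [i for i in range(1, len(energies))
            if energies[i - 1] <= silence_max and energies[i] >= burst_min]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
sample_rate = 8000
frame_len = 80  # 10 ms at 8000 Hz

# Hypothetical signal: vowel-like tone, closure (silence), noise burst, tone again.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
closure = [0.0] * 400
burst = [rng.uniform(-1.0, 1.0) for _ in range(80)]
signal = tone + closure + burst + tone

energies = frame_energies(signal, frame_len)
burst_frames = find_bursts(energies, silence_max=1e-4, burst_min=0.1)
```

Requiring a silent frame immediately before the loud one is what separates a burst from an ordinary loud vowel frame.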
Flaps Flaps are abbreviated forms of the alveolar plosives /t/
and /d/ and the alveolar nasal /n/. In a normal alveolar plosive closure, the vocal tract
is blocked for some 50 ms, but in the flap, produced by one rapid tap of the tongue
against the alveolar ridge, the duration is very short, on the order of 10-20 ms. The flap
is very common in American English.
Figure 7 - The word "rider" with the initial /9r/, the diphthong /aI/, the central flap /d_(/, and the final r-flavored reduced vowel /&r/.
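Since the flap differs from the full alveolar plosive mainly in how long the vocal tract stays blocked, a measured closure duration is enough for a first-pass label. The sketch below assumes the closure boundaries have already been marked by hand; the function names, the sample indices, and the 30 ms cutoff (chosen to sit between the 10-20 ms and 50 ms figures given above) are hypothetical.

```python
def closure_duration_ms(start_sample, end_sample, sample_rate):
    """Duration of a closure, given its first and one-past-last sample indices."""
    return 1000.0 * (end_sample - start_sample) / sample_rate


def label_closure(duration_ms, flap_max_ms=30.0):
    """Label a closure 'flap' or 'plosive' using an assumed cutoff."""
    return "flap" if duration_ms <= flap_max_ms else "plosive"


# Hypothetical closures marked by hand in a 16000 Hz recording.
short_ms = closure_duration_ms(4000, 4240, 16000)   # 240 samples
long_ms = closure_duration_ms(9000, 9800, 16000)    # 800 samples

short_label = label_closure(short_ms)
long_label = label_closure(long_ms)
```

Closure durations vary with speaker and speaking rate, so any fixed cutoff like this one should be treated as a starting point rather than a rule.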
Affricates The affricates /tS/ and /dZ/, as their Worldbet
symbols show, are compounds of a plosive and a fricative. The plosive is much reduced from
the full /t/ or /d/, usually showing as one or more thin bars to the left of the large
rectangle of frication.
Figure 8 - Two examples of the same affricate and a vowel from the utterance "church". The affricates are /tS/, while the central vowel is /3r/.
Syllabics When liquids or nasals occur in an unstressed syllable,
the vowel is often merged into the liquid or nasal, which becomes syllabic in that it
bears the weight of the syllable. The spectral appearance of the syllabics is midway
between that of a vowel and that of a liquid or nasal.
Figure 9 - The word "button" with the initial weak plosive /b/, the back vowel /^/, the flap /th_(/, and the final syllabic nasal /n_=/.