Spectrogram Reading
What are waveforms?

Speech sounds are created by vibratory activity in the human vocal tract. Speech is normally transmitted to a listener's ears or to a microphone through the air, where speech and other sounds take the form of radiating waves of variation in air pressure around an average resting value, which at sea level is about 100,000 pascals (Pa).

Figure 1 below is a visual representation of vibrations typical of those in human speech - a speech waveform. Resting atmospheric pressure is represented by the straight horizontal line in the center of the image. A waveform tracks excess pressure as a function of time for a given point in space. In a waveform, any reading above the zero line means that the pressure is greater than resting atmospheric pressure at that time and place. Similarly, readings below the line signify pressure values lower than the resting value of about 100,000 Pa. Human ears are sensitive to variations as small as 0.00002 Pa.
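As a rough illustration of what a waveform value means, the sketch below subtracts the resting pressure from an absolute pressure reading. The function names, the 200 Hz tone, and its 0.02 Pa amplitude are assumptions made up for this example; they are not measurements from Figure 1.

```python
import math

RESTING_PRESSURE_PA = 100_000.0  # approximate resting value quoted in the text
HEARING_THRESHOLD_PA = 0.00002   # smallest variation the text says ears detect


def absolute_pressure_of_tone(t, amplitude_pa=0.02, freq_hz=200.0):
    """Absolute air pressure (Pa) at time t for a hypothetical pure tone."""
    return RESTING_PRESSURE_PA + amplitude_pa * math.sin(2 * math.pi * freq_hz * t)


def excess_pressure(absolute_pa):
    """Waveform value: pressure relative to the resting level (the zero line)."""
    return absolute_pa - RESTING_PRESSURE_PA


# A quarter of the way through one 200 Hz cycle the tone is at its peak.
peak = excess_pressure(absolute_pressure_of_tone(0.00125))
# Three quarters of the way through, it is at its trough (below the zero line).
trough = excess_pressure(absolute_pressure_of_tone(0.00375))
```

Here `peak` is positive (above the zero line) and `trough` is negative (below it), even though the absolute pressure never moves far from 100,000 Pa.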


Figure 1 - Speech waveform for part of the word "compute" pronounced by Tim Carmell.


When we speak into a microphone, these changes in pressure are converted to proportional variations in electrical voltage. Computers equipped with the proper hardware can convert the analog voltage variations into digital sound waveforms by a process called analog-to-digital conversion (ADC), which involves two separate components:

1. Sampling - Even though a waveform is usually depicted as a continuous function of time, Figure 2 below, a more detailed rendition of a few milliseconds from the waveform of Figure 1 above, shows that the stored function is in fact discrete. Sampling means taking a fixed number of pressure value readings at equal time intervals from the continuously varying speech signal. For example, clean speech such as that depicted on this page is sampled 16000 times per second; its sampling frequency is 16 kHz. If you count carefully, you will find a total of sixteen dots per millisecond in Figure 2. Telephone speech is sampled at half that rate - 8 kHz. On the other hand, compact disc recordings have a sampling rate of 44.1 kHz. The higher the sampling rate, the better the sound quality, but the more bits are required to store each second of sound.
2. Quantization - Each sampled pressure value is rounded or quantized to the nearest value which is expressible in a given number of bits. There is a direct relationship between the accuracy of quantization and the number of bits required. Clean speech such as that depicted here often uses 16 bits, for a total of 65536 possible quantization levels, while telephone speech is accommodated in 8 bits for a total of 256 quantization levels.
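
The two steps in the list above can be sketched in Python. This is a minimal illustration, not the conversion performed by real ADC hardware: the names `sample`, `quantize`, `bits_per_second`, and the 200 Hz test `tone` are invented for the example, and the signal is assumed to be scaled to the range -1.0 to 1.0 and rounded to uniformly spaced levels.

```python
import math


def sample(signal, sampling_rate_hz, duration_s):
    """Read a continuous signal at equally spaced instants."""
    n = round(sampling_rate_hz * duration_s)
    return [signal(i / sampling_rate_hz) for i in range(n)]


def quantize(value, bits):
    """Round a value in [-1.0, 1.0] to the nearest signed integer level."""
    max_level = 2 ** (bits - 1) - 1          # 32767 for 16 bits, 127 for 8 bits
    clipped = max(-1.0, min(1.0, value))     # keep out-of-range input in bounds
    return round(clipped * max_level)


def bits_per_second(sampling_rate_hz, bits_per_sample):
    """Storage cost of one second of single-channel sound."""
    return sampling_rate_hz * bits_per_sample


def tone(t):
    """Hypothetical stand-in for the continuous signal: a 200 Hz sine."""
    return 0.5 * math.sin(2 * math.pi * 200.0 * t)


# One millisecond of signal at the two sampling rates mentioned in the text.
clean = [quantize(v, 16) for v in sample(tone, 16_000, 0.001)]
telephone = [quantize(v, 8) for v in sample(tone, 8_000, 0.001)]
```

With these settings `clean` holds 16 readings for one millisecond, matching the sixteen dots per millisecond in Figure 2, and `telephone` holds 8. `bits_per_second(16_000, 16)` gives 256000 and `bits_per_second(8_000, 8)` gives 64000, which shows how both sampling rate and bit depth drive storage cost.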


Figure 2 - A little more than 4 milliseconds from the waveform also depicted in Figure 1, showing the discrete nature of computer waveform files.


Sampled quantized speech can be stored as a permanent disk file, in which case it is called a waveform file. There are many different standards for storing speech and other sounds; at OGI we commonly use the NIST Sphere standard.
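
To make the idea of a waveform file concrete, the sketch below stores sampled, quantized values on disk and reads them back. It does not attempt the NIST Sphere format; as a stand-in it uses the WAV format through Python's standard `wave` module. The function names, the file name, and the synthetic 200 Hz signal are assumptions for illustration only.

```python
import math
import os
import struct
import tempfile
import wave


def write_waveform_file(path, samples, sampling_rate_hz=16_000):
    """Store 16-bit integer samples as a single-channel WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)                 # mono
        wf.setsampwidth(2)                 # 2 bytes = 16 bits per sample
        wf.setframerate(sampling_rate_hz)
        wf.writeframes(struct.pack("<%dh" % len(samples), *samples))


def read_waveform_file(path):
    """Return (sampling rate, list of 16-bit samples) from a mono WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        n = wf.getnframes()
        raw = wf.readframes(n)
    return rate, list(struct.unpack("<%dh" % n, raw))


# 10 ms of a synthetic tone, already sampled at 16 kHz and quantized to 16 bits.
samples = [round(10_000 * math.sin(2 * math.pi * 200 * i / 16_000))
           for i in range(160)]
path = os.path.join(tempfile.gettempdir(), "sketch_waveform.wav")  # hypothetical name
write_waveform_file(path, samples)
rate, restored = read_waveform_file(path)
```

The file header records the sampling rate and sample width, so a reader can recover the original sample values without being told them separately.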

The sound waveform presented at the top of the page in Figure 1, and again below in Figure 3, shows air pressure variations over 0.175 seconds of speech extracted from the middle of the word "compute." Starting from near silence before the 'p', the waveform evolves through the large, irregular swings in pressure toward the center which constitute the 'p', then through the smaller, more regular variations occupying the right of the diagram which are perceived as the vowel 'u' flavored by an initial 'y' sound.


Figure 3 - The waveform for the "pu" portion of the word "compute," showing the absence of marked vibrations representing near silence at the left, slow high amplitude swings corresponding to 'p' in the middle, and quasi-periodic cycles representing the voicing of the 'yu' vowel toward the right.


Although we can learn quite a lot by visual inspection of a speech waveform, it is impossible to detect individual speech sounds from waveforms because of the variability of human speech between individuals, and even between two pronunciations of a given word by the same person. This brings us to spectrograms, which represent speech in a manner that is much less sensitive to individual differences than the waveform representation. Next: What are spectrograms?


Tim Carmell, Last modified: 19-MAR-97