Figure 1 below shows a speech waveform, a visual representation of the pressure vibrations typical of human speech. Resting atmospheric pressure is represented by the straight horizontal line in the center of the image. A waveform tracks excess pressure, that is, the difference from resting atmospheric pressure, as a function of time for a given point in space. Any reading above the zero line means that the pressure is greater than resting atmospheric pressure at that time and place. Similarly, readings below the line signify pressure lower than resting atmospheric pressure, which is roughly 100,000 Pa. Human ears are sensitive to variations as small as 0.00002 Pa.
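To make the idea of excess pressure concrete, here is a minimal sketch that turns absolute pressure readings into waveform values by subtracting the resting pressure. The readings are invented for illustration, and the function name `excess_pressure` is hypothetical; only the two constants come from the text above.

```python
RESTING_PRESSURE_PA = 100_000.0   # approximate resting atmospheric pressure (from the text)
HEARING_THRESHOLD_PA = 0.00002    # smallest variation the ear can detect (from the text)


def excess_pressure(absolute_pa, resting_pa=RESTING_PRESSURE_PA):
    """Return waveform values: absolute pressure minus resting pressure."""
    return [p - resting_pa for p in absolute_pa]


# Hypothetical absolute pressure readings at one point in space, in pascals.
readings = [100_000.0, 100_000.5, 99_999.5, 100_000.0]

waveform = excess_pressure(readings)
print(waveform)  # [0.0, 0.5, -0.5, 0.0]

# Positive values lie above the zero line, negative values below it.
above_zero_line = [p > 0 for p in waveform]

# How small the audible variations are compared with the resting pressure.
ratio = RESTING_PRESSURE_PA / HEARING_THRESHOLD_PA
```

With these figures, resting pressure is roughly five billion times larger than the smallest variation the ear can detect, which is why waveforms plot the excess rather than the absolute pressure.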
Figure 1 - Speech waveform for part of the word "compute" pronounced by Tim Carmell.
When we speak into a microphone, these changes in pressure are converted to proportional variations in electrical voltage. Computers equipped with the proper hardware can convert the analog voltage variations into digital sound waveforms by a process called analog-to-digital conversion (ADC), which involves two separate components: sampling and quantization.
Sampled, quantized speech can be stored as a permanent disk file, in which case it is called a waveform file. There are many different standards for storing speech and other sounds; at OGI we commonly use the NIST Sphere standard.
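The two steps that produce sampled, quantized speech can be sketched in a few lines. This is only an illustration under stated assumptions: the 100 Hz sine standing in for the microphone voltage, the 8000 Hz sampling rate, the 16-bit depth, and the function names `sample` and `quantize` are all hypothetical choices, not values taken from the hardware described here.

```python
import math


def sample(signal, duration_s, rate_hz):
    """Sampling: read the continuous signal at evenly spaced instants."""
    n = int(round(duration_s * rate_hz))
    return [signal(i / rate_hz) for i in range(n)]


def quantize(samples, bits=16, full_scale=1.0):
    """Quantization: round each sample to the nearest signed integer level."""
    max_level = 2 ** (bits - 1) - 1
    min_level = -(2 ** (bits - 1))
    levels = []
    for s in samples:
        level = int(round(s / full_scale * max_level))
        # Clip values that exceed the representable range.
        levels.append(max(min_level, min(max_level, level)))
    return levels


def voltage(t):
    """Stand-in for the microphone voltage: a 100 Hz sine (assumed)."""
    return 0.5 * math.sin(2 * math.pi * 100.0 * t)


RATE_HZ = 8000  # assumed sampling rate
samples = sample(voltage, 0.01, RATE_HZ)   # 10 ms of signal
digital = quantize(samples, bits=16)       # integers ready to be stored
print(len(digital))  # 80
```

A higher sampling rate captures faster pressure changes, and more bits give finer amplitude steps; both increase the size of the resulting waveform file.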
The sound waveform presented at the top of the page in Figure 1, and again below in Figure 3, shows air pressure variations over 0.175 seconds of speech extracted from the middle of the word "compute." Starting from near silence before the 'p', the waveform evolves through the large, irregular swings in pressure toward the center that constitute the 'p', then through the smaller, more regular variations on the right of the diagram, which are perceived as the vowel 'u' flavored by an initial 'y' sound.
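Extracting a stretch of speech like this from a longer recording comes down to converting times into sample positions. The sketch below assumes an 8000 Hz sampling rate, a segment starting 0.4 seconds into the recording, and one second of silence as placeholder data; none of these values come from the recording in the figure, and the helper names are hypothetical.

```python
def seconds_to_samples(seconds, rate_hz):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * rate_hz))


def extract_segment(samples, start_s, duration_s, rate_hz):
    """Return the samples covering duration_s seconds starting at start_s."""
    start = seconds_to_samples(start_s, rate_hz)
    length = seconds_to_samples(duration_s, rate_hz)
    return samples[start:start + length]


RATE_HZ = 8000            # assumed sampling rate
word = [0] * RATE_HZ      # placeholder data: one second of silence

# A 0.175 second segment starting at an assumed offset of 0.4 seconds.
segment = extract_segment(word, 0.4, 0.175, RATE_HZ)
print(len(segment))  # 1400
```

At this assumed rate, the 0.175 seconds shown in the figure would correspond to 1400 individual pressure readings.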
Although we can learn quite a lot from visual inspection of a speech waveform, it is impossible to reliably identify individual speech sounds from waveforms alone, because human speech varies so much between individuals and even between two pronunciations of the same word by the same person. This brings us to spectrograms, which represent speech in a manner that is much more invariant to individual differences than the waveform representation. What are spectrograms?