National Cellular v2.3




Overview
The Cellular Corpus consists of cellular telephone speech from 2336 callers from locations throughout the United States. The data collection protocol contains requests for fixed vocabulary and continuous speech utterances. A total of about one minute of speech from each caller is collected.

Recording Conditions
The data were collected with the CSLU T1 digital data collection system. The sampling rate was 8 kHz, and the files were stored in 8-bit mu-law format on a UNIX file system.

File Name Conventions
A call is composed of the series of files recorded during a single recording session. Every call is identified by a unique call number, and each file in the call is further identified by an utterance type.

The filename identifies the call number and the utterance type, for example:
 NC000041.WAV 


The first two capitalized letters, "NC", indicate the corpus, National Cellular.

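The next five digits are the call number. The final character before the extension indicates the utterance type and may be a letter or a digit. In the example above, the call number is 00004 and the utterance type is 1 (yes or no). The utterance types are shown in this table: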

Code  Utterance type
A     background noise
B     brand
C     date
D     date of birth
E     digital or analog
F     familiar license plate number
G     familiar phone number
H     where did you growup
I     handset or microphone (not in vehicle)
J     lastname
K     location
L     male or female
M     native language
N     phone2
O     spell lastname
P     story1
Q     story2
R     story3
S     story4
T     story5
U     story6
V     story7
W     story8
X     story9
Y     thanks
Z     time
0     week
1     yes or no
2     describe your environment
3     describe the traffic
4     how fast are you going
5     handset or microphone


The extension "WAV" indicates that this is a speech file.
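
The naming convention described above can be sketched in Python as follows. This is a minimal illustration, not a tool shipped with the corpus: the function name parse_filename and the regular expression are assumptions based only on the single example NC000041.WAV, and the code table is copied from the utterance-type table above.

```python
import re

# Utterance-type codes, copied from the table in this document.
UTTERANCE_TYPES = {
    "A": "background noise", "B": "brand", "C": "date", "D": "date of birth",
    "E": "digital or analog", "F": "familiar license plate number",
    "G": "familiar phone number", "H": "where did you growup",
    "I": "handset or microphone (not in vehicle)", "J": "lastname",
    "K": "location", "L": "male or female", "M": "native language",
    "N": "phone2", "O": "spell lastname", "P": "story1", "Q": "story2",
    "R": "story3", "S": "story4", "T": "story5", "U": "story6",
    "V": "story7", "W": "story8", "X": "story9", "Y": "thanks", "Z": "time",
    "0": "week", "1": "yes or no", "2": "describe your environment",
    "3": "describe the traffic", "4": "how fast are you going",
    "5": "handset or microphone",
}

# Assumed pattern: "NC", five digits (call number), one code character, ".WAV".
_NAME_RE = re.compile(r"^(NC)(\d{5})([A-Z0-9])\.WAV$", re.IGNORECASE)


def parse_filename(name):
    """Split a speech file name into (corpus, call number, code, utterance type)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a National Cellular file name: {name!r}")
    corpus = match.group(1).upper()
    call_number = int(match.group(2))
    code = match.group(3).upper()
    if code not in UTTERANCE_TYPES:
        raise ValueError(f"unknown utterance type code: {code!r}")
    return corpus, call_number, code, UTTERANCE_TYPES[code]


print(parse_filename("NC000041.WAV"))
```

For the example file name, the sketch reports corpus NC, call number 4, and utterance type code 1 (yes or no).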

Speech File Formats
The speech files in this distribution are stored as RIFF WAV files with 8 kHz sampling and 16-bit linear coding.
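
As a rough sanity check, the stated format can be verified with Python's standard wave module. The helper name check_speech_file is hypothetical, and the sketch checks only the two properties given here (8 kHz sampling, 16-bit samples); the number of channels is not stated in this document, so it is not checked.

```python
import wave


def check_speech_file(path):
    """Verify 8 kHz / 16-bit linear coding and return the duration in seconds."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()    # samples per second
        width = wav.getsampwidth()   # bytes per sample
        frames = wav.getnframes()    # number of sample frames
    if rate != 8000:
        raise ValueError(f"expected 8000 Hz sampling, found {rate} Hz")
    if width != 2:
        raise ValueError(f"expected 16-bit samples, found {8 * width}-bit")
    return frames / rate
```

Calling check_speech_file on a file from the speech directory should return its length in seconds, or raise ValueError if the header does not match the documented format.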

Distribution directory structure
At the top level of the distribution there are two directories: speech and trans. Immediately below each of these directories there are several numbered subdirectories (0, 1, 2, etc.). These numbered subdirectories hold the files, split by call number div 10 (integer division by 10). That is, subdirectory 0 holds the files for calls 0-9, subdirectory 1 holds the files for calls 10-19, and so on.
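
The layout rule can be sketched as a small Python helper. The function name speech_path and the root argument are hypothetical; the sketch assumes subdirectory names are not zero-padded (as the examples 0, 1, 2 suggest) and that file names follow the NC000041.WAV pattern. Transcription file names are not specified in this document, so only the speech path is built.

```python
from pathlib import PurePosixPath


def speech_path(call_number, utterance_code, root="corpus"):
    """Build the expected location of a speech file under the distribution root."""
    subdir = call_number // 10  # "call number div 10": calls 10-19 go in subdirectory 1
    filename = f"NC{call_number:05d}{utterance_code}.WAV"
    return PurePosixPath(root) / "speech" / str(subdir) / filename


print(speech_path(4, "1"))    # call 4 falls in subdirectory 0
print(speech_path(17, "A"))   # call 17 falls in subdirectory 1
```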

Transcription
Each utterance in the National Cellular corpus has an orthographic transcription. The transcriptions are in the trans directory.