Using Audio Quality to Predict Word Error Rate in an Automatic Speech Recognition System
Faced with a backlog of audio recordings, users of automatic speech recognition (ASR) systems would benefit from the ability to predict which files will yield useful output transcripts, so that processing resources can be prioritized. ASR systems used outside research environments typically run in "real time": one hour of speech requires one hour of processing. The transcripts these systems produce have widely varying word error rates (WER), depending on both the words actually spoken and the quality of the recording. Known correlations between WER and performance on downstream tasks such as information retrieval and machine translation could be exploited if WER could be predicted before an audio file is processed. We describe here a method for estimating the quality of an ASR output transcript by predicting the portion of its total WER that is attributable to the quality of the audio recording.
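To make the idea concrete, the sketch below illustrates one way such a predictor could be structured: derive a simple audio-quality feature (a crude signal-to-noise ratio estimate) from the raw samples, then map it to a predicted WER with a linear model. The SNR feature, the linear model form, and the training pairs are all illustrative assumptions for this sketch, not the method or data described in the paper.

```python
import numpy as np

def estimate_snr_db(samples, frame_len=400):
    """Crude SNR estimate: ratio of loudest-frame energy (speech proxy)
    to quietest-frame energy (noise-floor proxy), in dB."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sort((frames ** 2).mean(axis=1))
    k = max(1, n_frames // 10)
    noise = energy[:k].mean()    # quietest 10% of frames ~ noise floor
    speech = energy[-k:].mean()  # loudest 10% of frames ~ speech
    return 10.0 * np.log10(speech / max(noise, 1e-12))

# Fit WER ~ a * SNR + b on synthetic (SNR, WER) pairs.
# These numbers are invented for illustration, not measured results.
train_snr = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
train_wer = np.array([0.60, 0.45, 0.32, 0.22, 0.15])
a, b = np.polyfit(train_snr, train_wer, 1)

def predict_wer(samples):
    """Predict the audio-quality-driven portion of WER, clipped to [0, 1]."""
    return float(np.clip(a * estimate_snr_db(samples) + b, 0.0, 1.0))

# Usage: a synthetic "recording" -- tone bursts plus background noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.where((t % 1.0) < 0.5, np.sin(2 * np.pi * 220 * t), 0.0)
noisy = clean + 0.05 * rng.standard_normal(t.size)
print(f"predicted WER: {predict_wer(noisy):.2f}")
```

A predictor of this shape would let a user rank a backlog of recordings by expected transcript quality before spending real-time ASR processing on any of them.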