Normalization for Automated Metrics: English and Arabic Speech Translation
The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program has experimented with applying automated metrics to speech translation dialogues. For translation into English, BLEU, TER, and METEOR scores correlate well with human judgments, but scores for translation into Arabic correlate less strongly with human judgments. This paper provides evidence for the hypothesis that automated scores for Arabic are depressed by the orthographic variation and inflection found in Arabic: normalization operations improve the correlation between BLEU scores and Likert-type judgments of semantic adequacy, as well as between BLEU scores and human judgments of whether the meaning of individual English content words was successfully transferred into Arabic.
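
The abstract does not enumerate the specific normalization operations applied; the Python sketch below is a minimal illustration, under that caveat, of the kinds of orthographic normalizations commonly applied to Arabic hypotheses and references before computing BLEU (diacritic removal, unification of alef variants, taa marbuta, and alef maqsura). The exact operations used in the paper may differ.

import re

# Tashkeel (short-vowel diacritics) plus the superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character

def normalize_arabic(text: str) -> str:
    """Reduce common orthographic variation in Arabic text."""
    text = DIACRITICS.sub("", text)                        # strip diacritics
    text = text.replace(TATWEEL, "")                       # drop tatweel
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                # taa marbuta -> haa
    text = text.replace("\u0649", "\u064A")                # alef maqsura -> yaa
    return text

# Normalizing both hypothesis and reference before scoring means purely
# orthographic mismatches no longer count as n-gram errors.
hyp = normalize_arabic("مَدْرَسَةٌ")  # "a school", fully vocalized
ref = normalize_arabic("مدرسة")       # same word, unvocalized
assert hyp == ref                     # both normalize to the same form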