University of Southern California

Automatic Parallel Fragment Extraction From Noisy Data (NAACL HLT Practice Talk)

When:
Friday, May 18, 2012, 3:00 pm - 3:30 pm
Where:
11th Floor Conf. Room (#1135)
Speaker:
NL Seminar - Jason Riesa
Description:

Abstract:

We present a novel method to detect parallel fragments within noisy parallel corpora. Isolating these fragments from the surrounding noisy data frees us from noisy alignments and stray links that can severely constrain translation-rule extraction. We do this with existing machinery, reusing an existing word alignment model for the task. We evaluate the quality and utility of the extracted data on large-scale Chinese-English and Arabic-English translation tasks and show significant improvements over a state-of-the-art baseline.
