Decipherment for Universal Language Tools: a case study for Unsupervised Part of Speech Induction

Friday, August 17, 2018, 3:00 pm - 4:00 pm PST
11th Floor Large Conference Room [1135]
This event is open to the public.
NL Seminar
Ronald Cardenas (USC/ISI)

Abstract: Unsupervised Part-of-Speech (POS) induction can be viewed as a two-step task. The first step infers a sequence of states, while the second step maps this sequence to an actual POS sequence at training or testing time. This second step therefore requires reference tagged data, a luxury that low-resource target languages might not have. In this talk, we present an alternative approach to the second step, modeling it as a decipherment problem in which the ciphertext is the sequence of states and the original text we want to recover is the POS sequence. This approach requires no reference data in the target language and allows us to leverage POS sequences from languages with much richer resources. Our experiments show that our approach benefits the most from simple strategies for inferring state sequences, such as Brown clustering. This allows our method to obtain reasonable performance in low-resource and limited-time scenarios.

Bio: Ronald Cardenas is a Master's student in the Language and Communication Technologies programme at Charles University in Prague. His research interests span morphological analysis and parsing of low-resource languages. At ISI, he works with Jonathan May on developing universal language tools.
