Abstract: We investigate paradigmatic representations of word context in the domain of unsupervised part-of-speech induction. Paradigmatic representations of word context are based on the potential substitutes of a word, in contrast to syntagmatic representations, which are based on properties of neighboring words. We demonstrate paradigmatic representations within two frameworks: (1) context clustering and (2) co-occurrence modeling. In context clustering, we cluster word contexts based on their potential substitutes, which reveals a grouping that largely matches the traditional part-of-speech boundaries. In co-occurrence modeling, we construct a Euclidean embedding that models the co-occurrence of word types and their contexts. Clustering the points that correspond to word types in the Euclidean embedding gives state-of-the-art results in unsupervised part-of-speech induction, including 80% many-to-one accuracy on the Penn Treebank and improvements on 16 out of 19 corpora in 15 languages.
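For readers unfamiliar with the many-to-one accuracy figure quoted above, the following is a minimal sketch of how that metric is commonly computed: each induced cluster is mapped to the gold tag it most often co-occurs with, and accuracy is the fraction of tokens whose cluster maps to their gold tag. The function name, the toy cluster ids, and the toy tags are illustrative assumptions and are not taken from the talk.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(clusters, gold_tags):
    """Many-to-one accuracy of induced clusters against gold tags.

    Hypothetical helper for illustration: `clusters` and `gold_tags`
    are parallel sequences with one entry per token.
    """
    if len(clusters) != len(gold_tags):
        raise ValueError("clusters and gold_tags must have the same length")
    if not clusters:
        return 0.0

    # Count how often each gold tag occurs inside each induced cluster.
    tag_counts = defaultdict(Counter)
    for cluster, tag in zip(clusters, gold_tags):
        tag_counts[cluster][tag] += 1

    # Map every cluster to its most frequent gold tag; several clusters
    # may map to the same tag, hence "many-to-one".
    correct = sum(counts.most_common(1)[0][1] for counts in tag_counts.values())
    return correct / len(clusters)


# Toy data (assumed, not from the talk): six tokens in three clusters.
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_gold = ["NOUN", "NOUN", "VERB", "VERB", "VERB", "DET"]
toy_score = many_to_one_accuracy(toy_clusters, toy_gold)
```

In the toy data, cluster 0 maps to NOUN (2 of its 3 tokens), cluster 1 to VERB, and cluster 2 to DET, so 5 of the 6 tokens count as correct. Because the mapping is chosen after the fact, the score tends to rise as the number of clusters grows, so comparisons are usually made at a fixed number of clusters.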
Bio: Mehmet Ali Yatbaz is a PhD candidate in Deniz Yuret's AI Lab at Koç University, Turkey. His research is on unsupervised word sense disambiguation, unsupervised morphological disambiguation, and part-of-speech induction. He is also a member of the Bologna Translation Service European Union project, where he is responsible for collecting, extracting, and cleaning parallel text corpora from publicly available web sites as part of the Turkish-English machine translation system.
Home Page: