Sujith Ravi

Research Scientist
Yahoo! Research
4301 Great America Parkway,
Santa Clara, CA 95054, USA
 Email: sujithr AT yahoo DASH inc DOT com

Home

Education

Research Interests

Publications

Teaching


 
2011

[1] Sujith Ravi and Kevin Knight. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011. [ bib | .pdf ]
In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from non-parallel text.

[2] Sujith Ravi and Kevin Knight. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011. [ bib | .pdf ]
We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.

[3] Stephen Boxwell, Chris Brew, Jason Baldridge, Dennis Mehay, and Sujith Ravi. Semantic Role Labeling for CCG without treebanks. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP)., 2011. [ bib ]
[4] Zornitsa Kozareva and Sujith Ravi. Unsupervised name ambiguity resolution using a generative model. In Proceedings of the EMNLP Workshop on Unsupervised Learning in NLP, 2011. [ bib ]
2010

[5] Sujith Ravi, Ashish Vaswani, Kevin Knight, and David Chiang. Fast, greedy model minimization for unsupervised tagging. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 940-948, 2010. [ bib | .pdf ]
Model minimization has been shown to work well for the task of unsupervised part-of-speech tagging with a dictionary. In Ravi and Knight (2009), the authors invoke an integer programming (IP) solver to do model minimization. However, solving this problem exactly using an integer programming formulation is intractable for practical purposes. We propose a novel two-stage greedy approximation scheme to replace the IP. Our method runs fast, while yielding highly accurate tagging results. We also compare our method against standard EM training, and show that we consistently obtain better tagging accuracies on test data of varying sizes for English and Italian.

[6] Suijth Ravi, Jason Baldridge, and Kevin Knight. Minimized models and grammar-informed initialization for supertagging with highly ambiguous lexicons. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 495-503, 2010. [ bib | .pdf ]
We combine two complementary ideas for learning supertaggers from highly ambiguous lexicons: grammar-informed tag transitions and models minimized via integer programming. Each strategy on its own greatly improves performance over basic expectation-maximization training with a bitag Hidden Markov Model, which we show on the CCGbank and CCG-TUT corpora. The strategies provide further error reductions when combined. We describe a new two-stage integer programming strategy that efficiently deals with the high degree of ambiguity on these datasets while obtaining the full effect of model minimization.

[7] Sujith Ravi and Kevin Knight. Does GIZA++ make search errors? Computational Linguistics, 36(3):295-302, 2010. [ bib | .pdf ]
Word alignment is a critical procedure within statistical machine translation (SMT). Brown et al. (1993) have provided the most popular word alignment algorithm to date, one that has been implemented in GIZA (Al-Onaizan et al. 1999) and GIZA++ (Och and Ney 2003) software and adopted by nearly every SMT project. In this paper, we investigate whether this algorithm makes search errors when it computes Viterbi alignments, i.e., whether it returns alignments that are sub-optimal according to a trained model.

[8] Jihie Kim, Erin Shaw, and Sujith Ravi. Mining student discussions to profile participation and scaffold learning. In Cristobal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan Baker, editors, The Handbook of Educational Data Mining, pages 299-310. CRC Press, 2010. [ bib ]
[9] David Chiang, Jonathan Graehl, Kevin Knight, Adam Pauls, and Sujith Ravi. Bayesian inference for finite-state transducers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL/HLT), pages 447-455, 2010. [ bib | .pdf ]
We describe a Bayesian inference algorithm that can be used to train any cascade of weighted finite-state transducers on end-toend data. We also investigate the problem of automatically selecting from among multiple training runs. Our experiments on four di erent tasks demonstrate the genericity of this framework, and, where applicable, large improvements in performance over EM. We also show, for unsupervised part-of-speech tagging, that automatic run selection gives a large improvement over previous Bayesian approaches.

[10] Sujith Ravi, Andrei Z. Broder, Evgeniy Gabrilovich, Vanja Josifovski, Sandeep Pandey, and Bo Pang. Automatic generation of bid phrases for online advertising. In Proceedings of the International Conference on Web Search and Data Mining (WSDM), pages 341-350, 2010. [ bib | .pdf ]
One of the most prevalent online advertising methods is textual advertising. To produce a textual ad, an advertiser must craft a short creative (the text of the ad) linking to a landing page, which describes the product or service being promoted. Furthermore, the advertiser must associate the creative to a set of manually chosen bid phrases representing those Web search queries that should trigger the ad. For efficiency, given a landing page, the bid phrases are often chosen first, and then for each bid phrase the creative is produced using a template. Nevertheless, an ad campaign (e.g., for a large retailer) might involve thousands of landing pages and tens or hundreds of thousands of bid phrases, hence the entire process is very laborious.

Our study aims towards the automatic construction of online ad campaigns: given a landing page, we propose several algorithmic methods to generate bid phrases suitable for the given input. Such phrases must be both relevant (that is, reflect the content of the page) and well-formed (that is, likely to be used as queries to a Web search engine). To this end, we use a two phase approach. First, candidate bid phrases are generated by a number of methods, including a (monolingual) translation model capable of generating phrases not contained within the text of the input as well as previously “unseen” phrases. Second, the candidates are ranked in a probabilistic framework using both the translation model, which favors relevant phrases, as well as a bid phrase language model, which favors well-formed phrases.

Empirical evaluation based on a real-life corpus of advertisercreated landing pages and associated bid phrases confirms the value of our approach, which successfully re-generates many of the human-crafted bid phrases and performs significantly better than a pure text extraction method.

2009

[11] Sujith Ravi and Kevin Knight. Minimized models for unsupervised part-of-speech tagging. In Proceedings of the Joint Conferenceof the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP), pages 504-512, 2009. Nominated for the Best Paper Award. [ bib | .pdf ]
We describe a novel method for the task of unsupervised POS tagging with a dictionary, one that uses integer programming to explicitly search for the smallest model that explains the data, and then uses EM to set parameter values. We evaluate our method on a standard test corpus using different standard tagsets (a 45-tagset as well as a smaller 17-tagset), and show that our approach performs better than existing state-of-the-art systems in both settings.

[12] Tugba Bodrumlu, Kevin Knight, and Sujith Ravi. A new objective function for word alignment. In Proceedings of the NAACL/HLT Workshop on Integer Programming for Natural Language Processing, pages 28-35, 2009. [ bib | .pdf ]
We develop a new objective function for word alignment that measures the size of the bilingual dictionary induced by an alignment. A word alignment that results in a small dictionary is preferred over one that results in a large dictionary. In order to search for the alignment that minimizes this objective, we cast the problemas an integer linear program. We then extend our objective function to align corpora at the sub-word level, which we demonstrate on a small Turkish-English corpus.

[13] Sujith Ravi and Kevin Knight. Learning phoneme mappings for transliteration without parallel data. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL/HLT), pages 37-45, 2009. [ bib | .pdf ]
We present a method for performing machine transliteration without any parallel resources. We frame the transliteration task as a decipherment problem and show that it is possible to learn cross-language phoneme mapping tables using only monolingual resources. We compare various methods and evaluate their accuracies on a standard name transliteration task.

[14] Sujith Ravi and Kevin Knight. Probabilistic methods for a japanese syllable cipher. In Proceedings of the 22nd International Conference on the Computer Processing of Oriental Languages (ICCPOL), pages 270-281, 2009. [ bib | .pdf ]
This paper attacks a Japanese syllable-substitution cipher. We use a probabilistic, noisy-channel framework, exploiting various Japanese language models to drive the decipherment. We describe several innova- tions, including a new objective function for searching for the highest- scoring decipherment. We include empirical studies of the relevant phenomena, and we give improved decipherment accuracy rates.

[15] Sujith Ravi and Kevin Knight. Attacking letter substitution ciphers with integer programming. Cryptologia, 33(4):321-334, 2009. [ bib | http ]
We introduce a method for solving substitution ciphers using low-order letter n-gram models. This method enforces global constraints using integer programming, and it guarantees that no decipherment key is overlooked. We carry out extensive empirical experiments showing how decipherment accuracy varies as a function of cipher length and n-gram order. We also make an empirical investigation of Shannon's (1949) theory of uncertainty in decipherment.

2008

[16] Sujith Ravi and Kevin Knight. Attacking decipherment problems optimally with low-order n-gram models. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 812-819, 2008. [ bib | .pdf ]
We introduce a method for solving substitution ciphers using low-order letter n-gram models. This method enforces global constraints using integer programming, and it guarantees that no decipherment key is overlooked. We carry out extensive empirical experiments showing how decipherment accuracy varies as a function of cipher length and n-gram order. We also make an empirical investigation of ShannonÕs (1949) theory of uncertainty in decipherment.

[17] Sujith Ravi, Kevin Knight, and Radu Soricut. Automatic prediction of parser accuracy. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 887-896, 2008. [ bib | .pdf ]
Statistical parsers have become increasingly accurate, to the point where they are useful in many natural language applications. However, estimating parsing accuracy on a wide variety of domains and genres is still a challenge in the absence of gold-standard parse trees. In this paper, we propose a technique that automatically takes into account certain characteristics of the domains of interest, and accurately predicts parser performance on data from these new domains. As a result, we have a cheap (no annotation involved) and effective recipe for measuring the performance of a statistical parser on any given domain.

[18] Sujith Ravi and Marius Pasca. Using structured text for large-scale attribute extraction. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), pages 1183-1192, 2008. [ bib | http ]
We propose a weakly-supervised approach for extracting class attributes from structured text available within Web documents. The overall precision of the extracted attributes is around 30% higher than with previous methods operating on Web documents. In addition to attribute extraction, this approach also automatically identifies values for a subset of the extracted class attributes.

[19] Jihie Kim, Erin Shaw, Sujith Ravi, Erin Tavano, Aniwat Arromratana, and Pankaj Sarda. Scaffolding on-line discussions with past discussions: An analysis and pilot study of pedabot. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems Conference (ITS), pages 343-352, 2008. [ bib | .pdf ]
PedaBot is a new discussion scaffolding application designed to aid student knowledge acquisition, promote reflection about course topics and encourage student participation in discussions. It dynamically processes student discussions and presents related discussions from a knowledge base of past discussions. This paper describes the system and presents a comparative analysis of the information retrieval techniques used to respond to free-form student discussions, a combination of topic profiling, term frequency-inverse document frequency, and latent semantic analysis. Responses are presented as annotated links that students can follow and rate. We report a pilot study of PedaBot based on student viewings, student ratings, and a small survey. Initial results indicate that there is a high level of student interest in the feature and that its responses are moderately relevant to student discussions.

2007

[20] Sujith Ravi and Jihie Kim. Profiling student interactions in threaded discussions with speech act classifiers. In Proceedings of the 13th International Conference on Artificial Intelligence in Education (AIED), pages 357-364, 2007. [ bib | .pdf ]
On-line discussion is a popular form of web-based computer-mediated communication and is an important medium for distance education. Automatic tools for analyzing online discussions are highly desirable for better information management and assistance. This paper presents an approach for automatically profiling student interactions in on-line discussions. Using N-gram features and linear SVM, we developed “speech act” classifiers that identify the roles that individual messages play. The classifiers were used in finding messages that contain questions or answers. We then applied a set of thread analysis rules for identifying threads that may have unanswered questions and need instructor attention. We evaluated the results with three human annotators, and 70-75% of the predictions from the system were consistent with human answers.

[21] Sujith Ravi, Jihie Kim, and Erin Shaw. Mining on-line discussions: Assessing technical quality for student scaffolding and classifying messages for participation profiling. In Proceedings of the Educational Data Mining Workshop in the 13th International Conference on Artificial Intelligence in Education (AIED), 2007. [ bib ]
On-line collaborative discussions play an important role in distance education and web-enhanced courses. Automatic tools for assessing student activities and promoting collaborative problem solving can provide a better learning experience for students and also offer useful assistance to teachers. This paper presents two novel instructional tools that apply data mining and information retrieval techniques. First, we describe an approach that could be used to scaffold undergraduate student discussions by retrieving useful information from past student discussions. The tool exploits both the discussions from the same undergraduate course and the ones from a graduate-level course. The second part of the paper presents an instructional tool that profiles student contributions with respect to student genders and the roles that students play in discussion. We apply speech act classifiers that automatically identify whether the given message contains questions and/or answers, and use the classification results in profiling male and female student contributions. Our initial evaluation of the scaffolding tool shows that discussions from the same course contain more number of similar concepts than the ones from the graduate-level course. However, technical quality of graduate-level discussions is higher. The results from the profiling tool indicate that female participation in undergraduate-level discussions is lower than that in graduate-level discussions, and graduate female students post more questions and answers compared to undergraduate female students.


** The easiest way to reach me is via email.