Sujith Ravi

Research Scientist
Yahoo! Research
4301 Great America Parkway,
Santa Clara, CA 95054, USA
 Email: sujithr AT yahoo DASH inc DOT com

Research Interests

My main research interests span problems and theory in Natural Language Processing (NLP) and Machine Learning. I am specifically interested in unsupervised and semi-supervised methods and their applications to NLP problems such as Machine Translation, Name Transliteration, Question Answering, Natural Language Parsing, and Information Retrieval in Discourse.


My dissertation research (with advisor Prof. Kevin Knight) focused on decipherment techniques and algorithms applicable to different NLP problems. A typical decipherment problem is breaking a cipher without any knowledge of the encryption mechanism used to construct it.

For example, the encrypted message "ABCD AEBA" stands for the plain message "TAKE THAT" in English.

Is it possible to decrypt the cipher ("ABCD AEBA") into the correct plain message ("TAKE THAT") when we have no information about how the cipher was constructed, but we do know that the original message was written in English? And how much information about the source language (English, here) is needed to decrypt the cipher optimally?

Some of these questions were explored by Shannon as early as the 1940s using Information Theory. It turns out that some of the unsupervised algorithms and models (e.g., n-gram language models) commonly used in NLP can be applied to the same task.
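
To make the "ABCD AEBA" example concrete, here is a minimal sketch in Python. It is not the method from my dissertation: it assumes the cipher is a one-to-one letter substitution and stands in for a real n-gram language model with a tiny, hypothetical English word list (VOCAB). The function names (extend_key, decipher) are illustrative only.

```python
def extend_key(cipher_word, plain_word, key):
    """Extend a one-to-one substitution key so that cipher_word maps to
    plain_word. Returns the extended key, or None if the mapping conflicts."""
    if len(cipher_word) != len(plain_word):
        return None
    new_key = dict(key)
    used = set(new_key.values())
    for c, p in zip(cipher_word, plain_word):
        if c in new_key:
            if new_key[c] != p:      # cipher letter already maps elsewhere
                return None
        elif p in used:              # plain letter already taken by another cipher letter
            return None
        else:
            new_key[c] = p
            used.add(p)
    return new_key


def decipher(cipher_words, vocabulary, key=None):
    """Yield every word sequence from vocabulary that is consistent with
    a single one-to-one substitution key (simple backtracking search)."""
    key = {} if key is None else key
    if not cipher_words:
        yield []
        return
    first, rest = cipher_words[0], cipher_words[1:]
    for word in vocabulary:
        extended = extend_key(first, word, key)
        if extended is not None:
            for tail in decipher(rest, vocabulary, extended):
                yield [word] + tail


# Hypothetical toy vocabulary; a real system would use a language model
# trained on large amounts of English text.
VOCAB = ["TAKE", "THAT", "THIS", "MAKE", "SOME", "TIME", "HERE", "NOON"]

candidates = [" ".join(ws) for ws in decipher("ABCD AEBA".split(), VOCAB)]
print(candidates)
```

With this toy vocabulary, the repeated-letter pattern of the cipher alone rules out every reading except "TAKE THAT". With a realistic vocabulary many readings survive, which is where a language model becomes useful for ranking them.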

The same idea can be extended to existing NLP problems like Machine Translation to translate human languages automatically. We can treat the foreign language as an encrypted form of English, and it might be possible to use decipherment techniques to learn translation tables from large quantities of non-parallel linguistic data.


I have also worked on problems related to extracting meaningful information from discourse (joint work with Prof. Jihie Kim). Online discussions on the web and in classroom forums have proven to be an important medium for collaborative problem solving among multiple participants. But as discussion boards have grown more popular, the amount of information exchanged among users has increased manyfold. This makes it difficult for a particular user to find only the information that is relevant to their concern. We use NLP and Information Retrieval techniques to build tools that (1) aid the user by extracting only relevant information from the discussions, and (2) classify messages on the discussion board into categories called Speech Acts, which help identify the roles played by individual messages on the discussion board (e.g., question, answer, elaboration, proposition).


Another area that I have worked on is semi-supervised or weakly supervised learning approaches for large-scale automatic Information Extraction from the Web. This work was done in collaboration with Marius Pasca (Research Scientist, Google Inc.) while I was at Google Research in the summer of 2007.

 

** The easiest way to reach me is via email.