Sujith Ravi and Kevin Knight.
Deciphering foreign language.
In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies (ACL-HLT), 2011.
[ bib |
.pdf ]
In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from non-parallel text.
Sujith Ravi and Kevin Knight.
Bayesian inference for Zodiac and other homophonic ciphers.
In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies (ACL-HLT), 2011.
[ bib |
.pdf ]
We introduce a novel Bayesian approach for deciphering complex substitution ciphers. Our method uses a decipherment model which combines information from letter n-gram language models as well as word dictionaries. Bayesian inference is performed on our model using an efficient sampling technique. We evaluate the quality of the Bayesian decipherment output on simple and homophonic letter substitution ciphers and show that unlike a previous approach, our method consistently produces almost 100% accurate decipherments. The new method can be applied on more complex substitution ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully automated fashion, which has never been done before.
Stephen Boxwell, Chris Brew, Jason Baldridge, Dennis Mehay, and Sujith Ravi.
Semantic Role Labeling for CCG without treebanks.
In Proceedings of the International Joint Conference on Natural
Language Processing (IJCNLP)., 2011.
[ bib ]
Zornitsa Kozareva and Sujith Ravi.
Unsupervised name ambiguity resolution using a generative model.
In Proceedings of the EMNLP Workshop on Unsupervised Learning in
NLP, 2011.
[ bib ]
Sujith Ravi, Ashish Vaswani, Kevin Knight, and David Chiang.
Fast, greedy model minimization for unsupervised tagging.
In Proceedings of the 23rd International Conference on
Computational Linguistics (COLING), pages 940-948, 2010.
[ bib |
.pdf ]
Model minimization has been shown to work well for the task of unsupervised part-of-speech tagging with a dictionary. In Ravi and Knight (2009), the authors invoke an integer programming (IP) solver to do model minimization. However, solving this problem exactly using an integer programming formulation is intractable for practical purposes. We propose a novel two-stage greedy approximation scheme to replace the IP. Our method runs fast, while yielding highly accurate tagging results. We also compare our method against standard EM training, and show that we consistently obtain better tagging accuracies on test data of varying sizes for English and Italian.
Suijth Ravi, Jason Baldridge, and Kevin Knight.
Minimized models and grammar-informed initialization for supertagging
with highly ambiguous lexicons.
In Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics (ACL), pages 495-503, 2010.
[ bib |
.pdf ]
We combine two complementary ideas
for learning supertaggers from highly ambiguous
lexicons: grammar-informed tag
transitions and models minimized via integer
programming. Each strategy on its
own greatly improves performance over
basic expectation-maximization training
with a bitag Hidden Markov Model, which
we show on the CCGbank and CCG-TUT
corpora. The strategies provide further error
reductions when combined. We describe
a new two-stage integer programming
strategy that efficiently deals with
the high degree of ambiguity on these
datasets while obtaining the full effect of
model minimization.
Sujith Ravi and Kevin Knight.
Does GIZA++ make search errors?
Computational Linguistics, 36(3):295-302, 2010.
[ bib |
.pdf ]
Word alignment is a critical procedure within statistical machine translation (SMT). Brown et
al. (1993) have provided the most popular word alignment algorithm to date, one that has been
implemented in GIZA (Al-Onaizan et al. 1999) and GIZA++ (Och and Ney 2003) software and
adopted by nearly every SMT project. In this paper, we investigate whether this algorithm makes
search errors when it computes Viterbi alignments, i.e., whether it returns alignments that are
sub-optimal according to a trained model.
Jihie Kim, Erin Shaw, and Sujith Ravi.
Mining student discussions to profile participation and scaffold
learning.
In Cristobal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan
Baker, editors, The Handbook of Educational Data Mining, pages
299-310. CRC Press, 2010.
[ bib ]
David Chiang, Jonathan Graehl, Kevin Knight, Adam Pauls, and Sujith Ravi.
Bayesian inference for finite-state transducers.
In Proceedings of the Conference of the North American Chapter
of the Association for Computational Linguistics - Human Language
Technologies (NAACL/HLT), pages 447-455, 2010.
[ bib |
.pdf ]
We describe a Bayesian inference algorithm
that can be used to train any cascade of
weighted finite-state transducers on end-toend
data. We also investigate the problem
of automatically selecting from among multiple
training runs. Our experiments on four
dierent tasks demonstrate the genericity of
this framework, and, where applicable, large
improvements in performance over EM. We
also show, for unsupervised part-of-speech
tagging, that automatic run selection gives a
large improvement over previous Bayesian approaches.
Sujith Ravi, Andrei Z. Broder, Evgeniy Gabrilovich, Vanja Josifovski, Sandeep
Pandey, and Bo Pang.
Automatic generation of bid phrases for online advertising.
In Proceedings of the International Conference on Web Search and
Data Mining (WSDM), pages 341-350, 2010.
[ bib |
.pdf ]
One of the most prevalent online advertising methods is textual
advertising. To produce a textual ad, an advertiser
must craft a short creative (the text of the ad) linking to a
landing page, which describes the product or service being
promoted. Furthermore, the advertiser must associate the
creative to a set of manually chosen bid phrases representing
those Web search queries that should trigger the ad. For
efficiency, given a landing page, the bid phrases are often
chosen first, and then for each bid phrase the creative is
produced using a template. Nevertheless, an ad campaign
(e.g., for a large retailer) might involve thousands of landing
pages and tens or hundreds of thousands of bid phrases,
hence the entire process is very laborious.
Our study aims towards the automatic construction of online
ad campaigns: given a landing page, we propose several
algorithmic methods to generate bid phrases suitable for the
given input. Such phrases must be both relevant (that is, reflect
the content of the page) and well-formed (that is, likely
to be used as queries to a Web search engine). To this end,
we use a two phase approach. First, candidate bid phrases
are generated by a number of methods, including a (monolingual)
translation model capable of generating phrases not
contained within the text of the input as well as previously
“unseen” phrases. Second, the candidates are ranked in a
probabilistic framework using both the translation model,
which favors relevant phrases, as well as a bid phrase language
model, which favors well-formed phrases.
Empirical evaluation based on a real-life corpus of advertisercreated
landing pages and associated bid phrases confirms
the value of our approach, which successfully re-generates
many of the human-crafted bid phrases and performs significantly
better than a pure text extraction method.
Sujith Ravi and Kevin Knight.
Minimized models for unsupervised part-of-speech tagging.
In Proceedings of the Joint Conferenceof the 47th Annual Meeting
of the Association for Computational Linguistics and the 4th International
Joint Conference on Natural Language Processing of the Asian Federation of
Natural Language Processing (ACL-IJCNLP), pages 504-512, 2009.
Nominated for the Best Paper Award.
[ bib |
.pdf ]
We describe a novel method for the task
of unsupervised POS tagging with a dictionary,
one that uses integer programming
to explicitly search for the smallest model
that explains the data, and then uses EM
to set parameter values. We evaluate our
method on a standard test corpus using
different standard tagsets (a 45-tagset as
well as a smaller 17-tagset), and show that
our approach performs better than existing
state-of-the-art systems in both settings.
Tugba Bodrumlu, Kevin Knight, and Sujith Ravi.
A new objective function for word alignment.
In Proceedings of the NAACL/HLT Workshop on Integer Programming
for Natural Language Processing, pages 28-35, 2009.
[ bib |
.pdf ]
We develop a new objective function for word
alignment that measures the size of the bilingual
dictionary induced by an alignment. A
word alignment that results in a small dictionary
is preferred over one that results in a large
dictionary. In order to search for the alignment
that minimizes this objective, we cast the
problemas an integer linear program. We then
extend our objective function to align corpora
at the sub-word level, which we demonstrate
on a small Turkish-English corpus.
Sujith Ravi and Kevin Knight.
Learning phoneme mappings for transliteration without parallel data.
In Proceedings of Conference of the North American Chapter of
the Association for Computational Linguistics - Human Language Technologies
(NAACL/HLT), pages 37-45, 2009.
[ bib |
.pdf ]
We present a method for performing machine
transliteration without any parallel resources.
We frame the transliteration task as a decipherment
problem and show that it is possible
to learn cross-language phoneme mapping
tables using only monolingual resources. We
compare various methods and evaluate their
accuracies on a standard name transliteration
task.
Sujith Ravi and Kevin Knight.
Probabilistic methods for a japanese syllable cipher.
In Proceedings of the 22nd International Conference on the
Computer Processing of Oriental Languages (ICCPOL), pages 270-281, 2009.
[ bib |
.pdf ]
This paper attacks a Japanese syllable-substitution cipher.
We use a probabilistic, noisy-channel framework, exploiting various Japanese
language models to drive the decipherment. We describe several innova-
tions, including a new objective function for searching for the highest-
scoring decipherment. We include empirical studies of the relevant phenomena,
and we give improved decipherment accuracy rates.
Sujith Ravi and Kevin Knight.
Attacking letter substitution ciphers with integer programming.
Cryptologia, 33(4):321-334, 2009.
[ bib |
http ]
We introduce a method for solving substitution ciphers using low-order letter n-gram models. This method enforces global constraints using integer programming, and it guarantees that no decipherment key is overlooked. We carry out extensive empirical experiments showing how decipherment accuracy varies as a function of cipher length and n-gram order. We also make an empirical investigation of Shannon's (1949) theory of uncertainty in decipherment.
Sujith Ravi and Kevin Knight.
Attacking decipherment problems optimally with low-order n-gram
models.
In Proceedings of Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 812-819, 2008.
[ bib |
.pdf ]
We introduce a method for solving substitution
ciphers using low-order letter n-gram
models. This method enforces global constraints
using integer programming, and it
guarantees that no decipherment key is overlooked.
We carry out extensive empirical experiments
showing how decipherment accuracy
varies as a function of cipher length and
n-gram order. We also make an empirical investigation
of ShannonÕs (1949) theory of uncertainty
in decipherment.
Sujith Ravi, Kevin Knight, and Radu Soricut.
Automatic prediction of parser accuracy.
In Proceedings of Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 887-896, 2008.
[ bib |
.pdf ]
Statistical parsers have become increasingly
accurate, to the point where they are useful in
many natural language applications. However,
estimating parsing accuracy on a wide variety
of domains and genres is still a challenge in
the absence of gold-standard parse trees.
In this paper, we propose a technique that automatically
takes into account certain characteristics
of the domains of interest, and accurately
predicts parser performance on data
from these new domains. As a result, we have
a cheap (no annotation involved) and effective
recipe for measuring the performance of a statistical
parser on any given domain.
Sujith Ravi and Marius Pasca.
Using structured text for large-scale attribute extraction.
In Proceedings of the 17th ACM Conference on Information and
Knowledge Management (CIKM), pages 1183-1192, 2008.
[ bib |
http ]
We propose a weakly-supervised approach for extracting
class attributes from structured text available within Web
documents. The overall precision of the extracted attributes
is around 30% higher than with previous methods operating
on Web documents. In addition to attribute extraction, this
approach also automatically identifies values for a subset of
the extracted class attributes.
Jihie Kim, Erin Shaw, Sujith Ravi, Erin Tavano, Aniwat Arromratana, and Pankaj
Sarda.
Scaffolding on-line discussions with past discussions: An analysis
and pilot study of pedabot.
In Proceedings of the 9th International Conference on
Intelligent Tutoring Systems Conference (ITS), pages 343-352, 2008.
[ bib |
.pdf ]
PedaBot is a new discussion scaffolding application designed to aid
student knowledge acquisition, promote reflection about course topics and
encourage student participation in discussions. It dynamically processes student
discussions and presents related discussions from a knowledge base of past
discussions. This paper describes the system and presents a comparative
analysis of the information retrieval techniques used to respond to free-form
student discussions, a combination of topic profiling, term frequency-inverse
document frequency, and latent semantic analysis. Responses are presented as
annotated links that students can follow and rate. We report a pilot study of
PedaBot based on student viewings, student ratings, and a small survey. Initial
results indicate that there is a high level of student interest in the feature and
that its responses are moderately relevant to student discussions.
Sujith Ravi and Jihie Kim.
Profiling student interactions in threaded discussions with speech
act classifiers.
In Proceedings of the 13th International Conference on
Artificial Intelligence in Education (AIED), pages 357-364, 2007.
[ bib |
.pdf ]
On-line discussion is a popular form of web-based computer-mediated
communication and is an important medium for distance education. Automatic
tools for analyzing online discussions are highly desirable for better information
management and assistance. This paper presents an approach for automatically
profiling student interactions in on-line discussions. Using N-gram features and
linear SVM, we developed “speech act” classifiers that identify the roles that
individual messages play. The classifiers were used in finding messages that
contain questions or answers. We then applied a set of thread analysis rules for
identifying threads that may have unanswered questions and need instructor
attention. We evaluated the results with three human annotators, and 70-75% of
the predictions from the system were consistent with human answers.
Sujith Ravi, Jihie Kim, and Erin Shaw.
Mining on-line discussions: Assessing technical quality for student
scaffolding and classifying messages for participation profiling.
In Proceedings of the Educational Data Mining Workshop in the
13th International Conference on Artificial Intelligence in Education
(AIED), 2007.
[ bib ]
On-line collaborative discussions play an important role in distance
education and web-enhanced courses. Automatic tools for assessing student
activities and promoting collaborative problem solving can provide a better
learning experience for students and also offer useful assistance to teachers. This
paper presents two novel instructional tools that apply data mining and
information retrieval techniques. First, we describe an approach that could be
used to scaffold undergraduate student discussions by retrieving useful
information from past student discussions. The tool exploits both the discussions
from the same undergraduate course and the ones from a graduate-level course.
The second part of the paper presents an instructional tool that profiles student
contributions with respect to student genders and the roles that students play in
discussion. We apply speech act classifiers that automatically identify whether the
given message contains questions and/or answers, and use the classification
results in profiling male and female student contributions. Our initial evaluation
of the scaffolding tool shows that discussions from the same course contain more
number of similar concepts than the ones from the graduate-level course. However,
technical quality of graduate-level discussions is higher. The results from the
profiling tool indicate that female participation in undergraduate-level discussions
is lower than that in graduate-level discussions, and graduate female students post
more questions and answers compared to undergraduate female students.