The NL Seminar is a weekly meeting of the Natural Language Group here at ISI. The seminars usually take place on Fridays from 3:00pm until 4:00pm, though exceptions are made for visitors, etc. Contact Sujith Ravi to schedule a talk. People outside the Natural Language Group may receive seminar announcements by subscribing to the nlg-seminar list here.
An iCal feed is now available at http://www.isi.edu/natural-language/nl-seminar/nl.ics
Click on the titles to view the abstracts.
Note: Outside visitors should go to the tenth floor lobby where they will be met and escorted to the appropriate location immediately before the talk.
| Date | Speaker | Title |
| 19 Sep 08 | Fei Sha (USC) |
TBA
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: title and abstract coming soon |
| 10 Oct 08 | Sujith Ravi |
EMNLP practice talk
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: title and abstract coming soon |
| 10 Oct 08 | Steve DeNeefe |
AMTA practice talk
Time: 3:45 pm - 4:15 pm Location: 11 Large Abstract: title and abstract coming soon |
| 14 Oct 08 | Victoria Fossum |
AMTA practice talk
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: title and abstract coming soon |
| 14 Oct 08 | David Chiang |
EMNLP practice talk
Time: 3:45 pm - 4:15 pm Location: 11 Large Abstract: title and abstract coming soon |
| Date | Speaker | Title |
| 22 Aug 08 | Catalin Tirnauca (Univ. Rovira i Virgili) |
Intern Final Talk: On the Consistency of Probabilistic Context-Free Grammars
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Probabilistic context-free grammars can describe probability distributions over strings, i.e., the sum of probabilities of all generated strings is 1. This condition is often called consistency. It has applications in fields of natural language processing such as probabilistic parsing (disambiguate by picking the parse with the highest score), or speech recognition (rank hypotheses returned by a speech recognizer). The talk is a survey of some of the previous results. We investigate how we can determine if a probabilistic context-free grammar is consistent, and if such a test can always be done. Also, we study a method, namely normalization, which guarantees consistent probabilistic context-free grammars. Moreover, we mention briefly some techniques that train probabilistic context-free grammars and guarantee consistency. |
| 22 Aug 08 | Amittai Axelrod (UW) |
Intern Final Talk: Structural constraints for efficient decoding.
Time: 3:45 pm - 4:15 pm Location: 11 Large Abstract: String-to-tree machine translation decoders are effective but very slow, especially compared to other decoding approaches. We explore various methods to identify constraints on the search space, with the aim of improving the efficiency of the syntax-based decoder. |
| 20 Aug 08 | John DeNero (Berkeley) |
Intern Final Talk: Minimum Risk Decoding over Forests
Time: 3:45 pm - 4:15 pm (NOTE different day and location!) Location: 11 Small Abstract: Minimum Bayes risk (MBR) decoding improves the output of machine translation systems by selecting a translation that matches a large proportion of the k-best hypotheses of a system. We extend this idea to apply to packed forests by selecting an output sentence that matches a large proportion of all hypotheses in the pruned forest of derivations from a syntax-based translation system. |
| 20 Aug 08 | Kyle Gorman (Penn) |
Intern Final Talk: The Entropy of English given French
Time: 3:00 pm - 3:30 pm (NOTE different day and location!) Location: 11 Small Abstract: The fundamental task in statistical machine translation (SMT) is to characterize the probability of a target sentence given its source sentence; for translating French into English, P(e | f). By applying Bayes' rule, we derive the fundamental theorem of SMT: choose the e maximizing P(e) P(f | e). Advances in SMT come from improving estimations of these two terms, or from more efficient ways of searching for optimal solutions (Brown et al. 1993). In the case of language modeling, Shannon (1949) and Brown et al. (1992) identified upper and lower bounds for the per-character entropy of English, H(e), for humans and machines, respectively. We ask the same question for SMT, H(e | f), comparing the results for human translators and a simple machine baseline based on IBM Model 1. These numbers are the upper and lower bounds for SMT systems trained on parallel data. |
| 18 Jul 08 | Sujith Ravi |
Deciphering Ciphers Optimally Using Only Minimal Knowledge of the Source Language
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will be talking about deciphering letter-substitution ciphers *optimally* using only minimal knowledge (bigrams, trigrams, etc.) of the source language, instead of relying on large look-up dictionaries. We also plan to show how our empirical results compare with Shannon's predictions on the equivocation curves and unicity distance measure. |
| 11 Jul 08 | Jon May |
Thesis Proposal Practice Talk: A Weighted Tree Transducer Toolkit for Syntactic Natural Language Processing Models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Solutions for many natural language processing problems such as speech recognition, transliteration, and translation have been described as weighted finite-state transducer cascades. The transducer formalism is very useful for researchers, not only for its ability to expose the deep similarities between seemingly disparate models, but also because expressing models in this formalism allows for rapid implementation of real, data-driven systems. Finite-state toolkits can interpret and process transducer chains using generic algorithms and many real-world systems have been built using these toolkits. Current research in NLP makes use of syntax-rich models that are poorly suited to extant transducer toolkits, which process linear input and output. Tree transducers can handle these models, and a weighted tree transducer toolkit with appropriate generic algorithms will lead to the sort of gains in syntax-based modeling that were achieved with string transducer toolkits. In this thesis proposal practice talk I will briefly trace the history of finite-state transducers and automata as they relate to natural language processing and the evolution of formalisms and the toolkits that support them, leading up to motivation for the design and creation of Tiburon, the toolkit referenced in this talk's title. I will describe previous, current, and future work on Tiburon's algorithms and the effectiveness of both algorithms and software at cleanly representing syntax-based NLP models from the literature and at constructing and evaluating novel models. |
| 13 Jun 08 | Ellen Riloff |
Effective Information Extraction with Relevant Regions and Semantic Affinity Patterns
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will briefly overview the landscape of event-oriented information extraction (IE) systems and explain why it is especially challenging to learn IE systems without annotated training data. Then I will describe one attempt to do so by decoupling the tasks of finding relevant text regions and applying extraction patterns. First, a self-trained relevant sentence classifier identifies relevant regions in documents. Second, a "semantic affinity" measure identifies domain-relevant extraction patterns. We further distinguish between "primary" patterns and "secondary" patterns and apply the patterns selectively in the relevant regions. This approach is weakly supervised, requiring only a few seed patterns plus relevant and irrelevant (but unannotated) documents for training. The resulting IE system achieves reasonably good performance, despite the fact that the relevant region classifier leaves a lot to be desired. |
| 06 Jun 08 | Tom Murray (USC) |
Knowledge as a Constraint on Uncertainty for Unsupervised Classification
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This talk investigates the use of domain knowledge to constrain and improve the unsupervised learning of a classifier, by placing limits or biases on the possible hypotheses for each input. Theoretically, we view the contribution of the knowledge source as a reduction in the uncertainty of the model's decisions, quantified by the resulting conditional entropy of the label distribution given the input corpus. Evaluating on the simple case of an unsupervised HMM tagger, we find surprising levels of improvement from little knowledge, with more stable and efficient training convergence and label assignment, and a high degree of correlation between classification entropy and model performance. We conclude that, while we should always seek better generic models and techniques, for applications in an unsupervised setting, knowledge may still be key. |
| 30 May 08 | Steve DeNeefe |
BLEU Sway Issues: one way to get statistical significance, two ways to get a better score, and three ways to thwart them
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: BLEU is the de facto standard for evaluation and development of statistical machine translation systems. We describe three real-world situations involving comparisons between different versions of the same systems where one can obtain improvements in BLEU scores that are questionable or even absurd. We propose a very conservative modification to BLEU that addresses these issues while improving correlation with human judgements, then explore some deeper modifications that alleviate the problems further. |
| 16 May 08 | David Newman (UCI) |
Theory and Applications of Topic Modeling
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Topic models, a class of Bayesian probabilistic models for discrete data, have recently gained popularity in applications ranging from document modeling to computer vision. Since the introduction of Latent Dirichlet Allocation (LDA) in 2003, there have been numerous extensions to this archetype. I will review the theory behind LDA, and discuss subsequent models, including (some of): Correlated Topic Model, Dynamic Topic Model, Hierarchical Topic Model, Special Words Topic Model, Hierarchical Dirichlet Process Model, Pachinko Allocation Machine, Topics and Syntax Model, Bi-LDA, Author-Topic Model, Supervised Topic Model, Spatial LDA, etc. |
| 09 May 08 | John DeNero (Berkeley) |
Inference in phrase alignment models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Models that align phrases instead of words offer an appealing alternative to the standard relative frequency estimates of phrase translation probabilities. But, while some effective word alignment models (Model 1, Model 2 & HMM) can be estimated tractably with EM, phrase alignment models cannot. I'll talk about how to show that estimation and inference under these models is intractable. Then, I'll present two useful approximation techniques. First, I'll talk about how to cast phrase alignment search as an integer linear programming (ILP) problem and find the optimal alignment reliably and quickly with off-the-shelf ILP software. Some applications of this technique include training phrase alignment models and interpreting the output of word alignment models. Second, we'll look at how to estimate translation probabilities under a phrase alignment model using a Gibbs sampling procedure. The sampler has some nice asymptotic convergence properties and also seems to produce good results in practice. I'll walk through the different models we've trained and how they performed. Time permitting, I'll also talk about some of the ways in which we could potentially extend this work to syntactic MT. |
| 02 May 08 | Zornitsa Kozareva |
Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a novel approach to weakly supervised semantic class learning from the web, using a single powerful hyponym pattern combined with graph structures, which capture two properties associated with pattern-based extractions: popularity and productivity. Intuitively, a candidate is popular if it was discovered many times by other instances in the hyponym pattern. A candidate is productive if it frequently leads to the discovery of other instances. Together, these two measures capture not only frequency of occurrence, but also cross-checking that the candidate occurs both near the class name and near other class members. We developed two algorithms that begin with just a class name and one seed instance and then automatically generate a ranked list of new class instances. We conducted experiments on four semantic classes and consistently achieved high accuracies. |
| 25 Apr 08 | David Chiang |
Tutorial: Randomized data structures for large statistical NLP models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Randomized algorithms are those which use randomness to achieve efficient performance with a bounded probability of error; typically, the bound is adjustable and the performance depends on the bound. Randomized data structures, likewise, use randomness to achieve efficient storage with a bounded probability of error. I will give an overview of the use of such data structures, namely, Bloom filters and "Bloomier" filters, for storing very large n-gram language models, and will discuss possibilities for using randomized data structures for other purposes as well. |
| 18 Apr 08 | Rahul Bhagat |
Learning Paraphrases from Text
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Paraphrases are textual expressions that convey the same meaning using different words. They capture variability, which is a common phenomenon in language. Given this, paraphrases have been shown to be useful in many natural language applications like Question-Answering, Machine Translation, Summarization and Information Retrieval. In this talk, I'll discuss the phenomenon of paraphrasing and focus on methods for automatically acquiring paraphrases from text. |
| 11 Apr 08 | Jon May |
Syntactic Re-Alignment Models for Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a method for improving word alignment for statistical syntax-based machine translation that employs a syntactically informed alignment model closer to the translation model than commonly-used word alignment models. This leads to extraction of more useful linguistic patterns and improved BLEU scores on translation experiments in Chinese and Arabic. |
| 04 Apr 08 | Ulf Hermjakob |
Name Translation in Statistical Machine Translation: Learning When to Transliterate
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a method to transliterate names in the framework of end-to-end statistical machine translation. The system is trained to learn when to transliterate. For Arabic to English MT, we developed and trained a transliterator on a bitext of 7 million sentences and Google's English terabyte ngrams and achieved better name translation accuracy than 3 out of 4 professional translators. The talk also includes a discussion of challenges in name translation evaluation. |
| 25 Mar 08 | Jason Riesa |
Tutorial on Arabic Orthography
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: This tutorial is intended to provide attendees with working knowledge of the Arabic writing system. No previous experience with Arabic is required. At the end of this tutorial you should be able to read and segment individual Arabic characters, read common ligatures, identify possible affixes on stems, and understand the various lexical normalizations used in Arabic text preprocessing. The focus will be on the formal writing system in printed text for Modern Standard Arabic, although handwriting will be briefly discussed. |
| 18 Jan 08 | Victoria Fossum |
Using Syntax to Improve Word Alignment Precision for Syntactic Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Automatically word-aligning a parallel bitext in the source and target languages constitutes the first stage of most statistical machine translation pipelines. Automatic word alignment is error-prone, and produces many incorrect links. Incorrect links that violate syntactic correspondences interfere with the extraction of string-to-tree transducer rules for syntactic machine translation. We present an algorithm for identifying and deleting incorrect word alignment links, using features of the extracted rules. We obtain gains in both alignment quality and translation quality in Chinese-English and Arabic-English translation experiments, relative to a GIZA++ union baseline. |
| 11 Jan 08 | Kevin Knight |
How to Make EM Do What You Want
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I'll talk about some unsupervised learning experiments -- how I was satisfied with the initial results, how I became very dissatisfied, and how I became (somewhat) satisfied again. |
| 14 Dec 07 | Marieke van Erp |
MITCH: Mining for Information in Texts from the Cultural Heritage
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Naturalis, the Dutch National Museum of Natural History, harbours one of the largest treasures of the world: the key specimens of millions of animals found throughout the world through centuries of biological expeditions. While the depot where the animals are stored is a technical marvel, Noah's ark of the 21st century, it is hard to search through it. Research in taxonomy, the evolution of life and biodiversity revolves around the specimens in the depot. The main keys to accessing the depot are the (mostly) handwritten expedition logs and registration books, which are currently being photographed and keyed in to be stored in searchable digital archives. Such digital logs already enable a kind of "Biogoogle" search, but actual research questions are more complicated ("how did this kind of frog develop over the last century in the Amazon rainforests?"), and demand more intelligent handling. This is where the MITCH project comes in. The goal of MITCH is to turn the field logs and registration books into a populated semantic network, in which concepts such as animal specimens are related to all other concepts that define them: where, when, under which circumstances and by whom were they found, who described them first in the academic literature, who prepared them for storage in the Naturalis depot, which registration number was assigned to them, etc. This means that all textual descriptions of a specimen need to be parsed into exactly these concepts and their relations. All of this needs to be done at a scale that goes far beyond the human capacity, as tens of thousands of digitized but unanalysed textual records are waiting for semantic analysis. This necessitates the use of state-of-the-art machine learning methods that learn from examples automatically. The project addresses its goals on three levels. The basic level is the development and application of automatic data cleaning and markup tools.
On top of this, semi-structured textual material, such as fieldbook logs and scientific papers, is semi-automatically converted to a searchable knowledge base. Search results are visualised by displaying maps and specimen photos. The conversion phase assumes the active intervention of domain experts, such as collection managers, to correct and steer the automatic extraction procedure. At the top level, information resources are cross-linked using a domain ontology, populating a semantic network that can be hooked up to any other standardised cultural heritage knowledge base or to a search engine. |
| 02 Nov 07 | Bill Rounds (Michigan and Stanford) |
Constructions, Constraints, Transducers, and TAGs: A unifying view through Feature Logic
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The value of mathematical formalisms for speech recognition, language generation, and machine translation has long been recognized. Not so much work, though, has been spent reconciling these formalisms with linguistic theories. In this talk I'll propose a theoretical descriptive mechanism based on feature logic, which is central to construction and constraint-based linguistic theories like construction grammar and HPSG, and which can be used to view tree transducers and tree-adjoining grammars as giving rise to a construction-based framework. |
| 19 Oct 07 | Slav Petrov (Berkeley) |
Learning and Inference for Hierarchically Split PCFGs
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: Treebank parsing can be seen as the search for an optimally refined grammar consistent with a coarse training treebank. We describe a method in which a minimal grammar is hierarchically refined using EM to give accurate, compact grammars. The resulting grammars are extremely compact compared to other high-performance parsers, yet the parser gives the best published accuracies on several languages, as well as the best generative parsing numbers in English. In addition, we give an associated coarse-to-fine inference scheme which vastly improves inference time with no loss in test set accuracy. |
| 17 Oct 07 | Jon Patrick (Univ. of Sydney) |
Enhancement Technologies for ICU Information Systems
Time: 3:30 pm - 4:30 pm Location: 11 Large Abstract: The School of Information Technologies at the University of Sydney has had a 3-year partnership with the Intensive Care Unit at the Royal Prince Alfred Hospital, Sydney. In that time they have managed 8 joint projects aimed at producing software solutions that enhance productivity in the Unit and in some cases enable entirely new functionalities in their information systems. The principal motivation for the research is the processing of the narratives in clinical notes, but concomitant problems in information systems have also been tackled, and the combination of the two disciplines has led to the two related processing systems to be described in this presentation. - Ward Rounds Information Systems (WRIS) & Handovers - The WRIS is designed to support the work of all clinical staff in their ward rounds activities. The system, when activated, automatically populates from the resident clinical database a pro forma report with the most recent relevant data about the patient, such as vital signs, pathology reports, and other diagnostic measurements, presented as a web page. The clinical staff then write their progress notes into the web page, which converts the text to SNOMED CT codes and other relevant concepts and entities. The clinician is given the opportunity to change any analyses done by the processor. This clinician-approved data is loaded to the patient record. The essential elements of this system, that is, computing an extract of the patient record, accepting narrative input, and analysing the text for coding, are a productivity gain in themselves but, more importantly, also constitute the beginning of a hospital-wide Handovers System for use throughout each step in the patient journey. This system is being tested at the RPAH ICU in readiness for ward usage. The impact of this system in improving the quality and safety of handovers has the potential to be very significant.
- Clinical Data Analytics Language (CDAL) - General purpose access to data from clinical information systems, beyond retrieval for point of care work, is needed for many aspects of the hospital's work particularly for clinical research, logistics & operational planning, and auditing patient safety. Most current clinical systems only provide access to data identified in standard reports with no flexibility to make ad hoc enquiries or to pursue new directions of enquiry. The clinical data analytics language developed enables the expression of any question that can be answered from the data in the database in a restricted natural language. A prototype of the language has been developed for the CareVue information system used in the ICU at the Royal Prince Alfred Hospital. It provides for the use of local medical dialects, SNOMED CT terminology including all forms of collective expressions in SNOMED (e.g. infectious diseases), specification of patient groups, a variety of statistical functions, and constraints over any medical variable, Time, and Location. CDAL is general in that it can be bolted on to any clinical information system and is applicable to any clinical specialisation.
|
| 12 Oct 07 | David Talbot (Edinburgh) |
Scalable Language Modeling: Breaking the Curse of Dimensionality
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Randomized data structures can help us scale discrete models encountered in NLP. This talk will describe their use in language modeling and present some more general related results. N-gram language models are fundamental to speech recognition and machine translation. Unfortunately, the n-gram parameter space grows exponentially with the dimension of the feature vector. I will describe how randomization can be used to remove the space-dependency of such models on the a priori parameter space. The novel extensions of the Bloom filter that I will present are able to take advantage of the entropy of the distribution of values assigned to feature vectors to save space in a discrete statistical model. I will review some results applying these models to language modeling in machine translation and relate their space-requirements to a novel lower bound on the general problem of querying a map of key/value pairs. No prior knowledge of randomized data structures will be assumed.
|
| 05 Oct 07 | Sujith Ravi |
Will this parser work with my data? - Predicting Parser Accuracy without Gold-Standard information
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: There are many tools available to the NLP community for Natural Language Parsing (i.e., converting a raw sentence into a parse tree). NLP researchers usually use some "off-the-shelf" parser which has been trained on the Wall Street Journal (WSJ) corpora and then apply the WSJ-trained parser to their data. This works in many cases, especially for systems which use data from WSJ or similar corpora. However, in real-life applications, the data may be compiled from many different sources and span different genres, and may not be similar to the WSJ corpora in terms of sentence structure, etc. A particular parser might parse well on some corpora and not so well on others. Choosing the right parser for your data may have an impact on the performance of the NLP system as a whole. But in order to measure the accuracy of any parser for a given corpus, we require a set of gold-standard parse trees corresponding to the sentences within the corpus. Generating a gold-standard set takes a lot of manual work, and in many real-life applications it is not feasible to generate gold-standard parses for large corpora. We attempted to build a system which can predict the accuracy (in terms of f-measure value) of the Charniak parser (a popular parsing tool) on any given sentence corpus. Without using any additional information (i.e., gold-standard parses), our system predicts "how accurately the Charniak parser could parse the given corpus". In order to evaluate our system's predictions on a particular corpus, we compute the "Correlation" measure between the "actual accuracies (using Gold-standard)" vs. "predicted accuracies (from our system)" for the given corpus. We tested our system on different corpora and using different methods and will present these results. |
| 29 Aug 07 | Carmen Heger (Dresden) and Michael Bloodgood (Delaware) |
Summer Intern Presentations: Composition of Tree Transducers AND Using the Perceptron Algorithm to Tune Large Numbers of Feature Weights for Syntax-Based Statistical Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Composition of Tree Transducers: Since finite state (string) transducers are not expressive enough for many NLP applications, computational linguists started to investigate tree transducers for tasks such as machine translation. Quite some successful work has been done on generalizing results from string transducers to tree transducers. But when it comes to composition, results are not satisfying because tree transducers are generally not closed under composition. Still, we think that most of the tree transducers used in NLP are composable, and that is why we defined the problem of composition for two individual transducers instead of the whole class. During the summer we started with linear nondeleting tree transducers with epsilon rules and approached an algorithm to decide for two such transducers whether their composition is again in the same class. Using the Perceptron Algorithm to Tune Large Numbers of Feature Weights for Syntax-Based Statistical Machine Translation: Current state-of-the-art syntax-based statistical machine translation systems produce many candidate translations out of which the output translation is selected by taking the argmax over all candidates i of <w,f_i> where w is a weight vector and f_i is a vector of the feature values for candidate i. The features used by the system and their corresponding weights have a major impact on a system's performance. Currently, Minimum Error Rate Training (MERT) is used to tune the weights of the features. A drawback of this is that it isn't tractable to tune large numbers of feature weights. I will discuss using the perceptron algorithm to tune feature weights for statistical machine translation. If I get interesting results before my talk, I may also discuss new classes of features (potentially very large numbers of features) that can be used for improving MT performance. |
| 24 Aug 07 | Wei Ho (Princeton) and Jennifer Gillenwater (Rice) |
Summer Intern Presentations: Noisy Language Models AND Context for Syntax-Based Translation Rules
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: Noisy Language Models: The language models used in statistical machine translation are often quite large, requiring significant memory and sometimes pre-processing in order to be utilized effectively. It would be desirable to have more compact representations of language models while minimizing the impact on translation quality. Various quantization methods and lossy storage of language models will be presented. Context for Syntax-Based Translation Rules: The rules that a translation system employs should be applicable in many contexts. This ensures that a rich language is expressible with a minimum number of rules. However, when rules that are applicable in too many contexts are combined, they result in nonsensical translations. How can we keep rules general but constrain the context of their use? This summer we explored the approach of constraining the context by conditioning on various neighboring elements of each rule.
|
| 16 Aug 07 | Anoop Sarkar (Simon Fraser) |
Extensions of Regular Tree Grammars and their relation to Tree Adjoining Grammars
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: There is a hierarchy of generative devices that generate trees: starting with regular tree languages (RTLs), which are contained within context-free tree languages (CFTLs), and so on. The string yield of the RTLs is exactly the set of Context-Free Languages, while the yield of the CFTLs is exactly the set of Indexed Languages. In this talk we introduce Adjoining Tree Languages (ATLs) which sit in between RTLs and CFTLs. The yield of ATGs is exactly the set of Tree-Adjoining Languages. Just like RTGs are stronger than CFGs, ATGs are stronger than TAGs. In addition we will show that the ATG notation simplifies many of the foundational proofs for TAGs including proofs of the closure properties. In particular, ATLs do not use adjunction constraints, and thus are much easier to understand than TAGs. We compare ATGs with previously proposed simplifications of CFTGs, called monadic simple CFTGs, which also have been shown to be weakly equivalent to TAG (i.e. they generate the same set of string languages). We consider the question of whether these two weakly equivalent formalisms are strongly equivalent (i.e. generate exactly the same set of tree languages). Finally, we will show that the standard definition used for probabilistic TAG is (surprisingly) very different from the natural definition of probabilistic ATL. Using an example of PP-attachment ambiguity we show that the two probabilistic models are different from each other. About the speaker: Anoop Sarkar is an assistant professor in the Department of Computing Science at Simon Fraser University. He received his PhD in 2002 from the Department of Computer and Information Science at the University of Pennsylvania, with Prof. Aravind Joshi as his advisor. His research work is on machine learning, especially semi-supervised learning, applied to the processing of natural language and stochastic formal grammars. Anoop Sarkar's web-page: http://www.cs.sfu.ca/~anoop |
| 15 Jun 07 | Donghui Feng |
Extracting Data Records from Unstructured Biomedical Full Text
Time: 11:00 am - 11:30 am Location: 11 Large Abstract: In this paper, we address the problem of extracting data records and their attributes from unstructured biomedical full text. There has been little effort reported on this in the research community. We argue that semantics is important for record extraction or finer-grained language processing tasks. We derive a data record template including semantic language models from unstructured text and represent them with a discourse-level Conditional Random Fields (CRF) model. We evaluate the approach from the perspective of Information Extraction and achieve significant improvements on system performance compared with other baseline systems. |
| 15 Jun 07 | Alex Fraser |
Getting the structure right for word alignment: LEAF
Time: 10:30 am - 11:00 am Location: 11 Large Abstract: Automatic word alignment is the problem of automatically annotating parallel text with translational correspondence. Previous generative word alignment models have made structural assumptions such as the 1-to-1, 1-to-N, or phrase-based consecutive word assumptions, while previous discriminative models have either made one of these assumptions directly or used features derived from a generative model using one of these assumptions. We present a new generative alignment model which avoids these structural limitations, and show that it is effective when trained using both unsupervised and semi-supervised training methods. Experiments show strong improvements in word alignment accuracy, and using the generated alignments in hierarchical and phrasal SMT systems improves the BLEU score. |
| 08 Jun 07 | Liang-Chih Yu (Cheng Kung U) |
Topic Analysis for Psychiatric Document Retrieval (Practice Talk for ACL)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Psychiatric document retrieval attempts to help people to efficiently and effectively locate the consultation documents relevant to their depressive problems. Individuals can understand how to alleviate their symptoms according to recommendations in the relevant documents. This work proposes the use of high-level topic information extracted from consultation documents to improve the precision of retrieval results. The topic information adopted herein includes negative life events, depressive symptoms and semantic relations between symptoms, which are beneficial for better understanding of users' queries. Experimental results show that the proposed approach achieves higher precision than the word-based retrieval models, namely the vector space model (VSM) and Okapi model, adopting word-level information alone. About the speaker: Liang-Chih Yu (http://www.isi.edu/~liangchi) is now a visiting student in the Information Sciences Institute (ISI) of University of Southern California (USC). My host advisor is Dr. Eduard Hovy. I am also a PhD candidate in the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. My advisor is Dr. Chung-Hsien Wu. My research interests include natural language processing, text mining, information retrieval, ontology construction, and spoken dialogue systems.
|
| 08 Jun 07 | Jonathan May |
Bisimulation Minimisation for Weighted Tree Automata
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: We describe existing forward and backward bisimulation minimisation algorithms for nondeterministic automata and extend these algorithms to weighted tree automata. The extended algorithms, which work for all semirings, retain the time complexity of their counterparts for unweighted tree automata for additively cancellative semirings, and are only slightly higher (linear instead of logarithmic in the number of states) on other semirings. We describe the effectiveness of an implementation of these algorithms on a typical task in natural language processing. This is joint work with Johanna Högberg, Umeå University and Andreas Maletti, Technische Universität Dresden. |
| 01 Jun 07 | Jingbo Zhu |
Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: In this paper, we analyze the effect of resampling techniques, including under-sampling and over-sampling used in active learning for word sense disambiguation (WSD). Experimental results show that under-sampling causes negative effects on active learning, but over-sampling is a relatively good choice. To alleviate the within-class imbalance problem of over-sampling, we propose a bootstrap-based over-sampling (BootOS) method that works better than ordinary over-sampling in active learning for WSD. Finally, we investigate when to stop active learning, and adopt two strategies, max-confidence and min-error, as stopping conditions for active learning. According to experimental results, we suggest a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for stopping conditions. |
| 01 Jun 07 | Andrew S. Gordon |
Generalizing Semantic Role Annotations Across Syntactically Similar Verbs
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: Large corpora of parsed sentences with semantic role labels (e.g. PropBank) provide training data for use in the creation of high-performance automatic semantic role labeling systems. Despite the size of these corpora, individual verbs (or rolesets) often have only a handful of instances in these corpora, and only a fraction of English verbs have even a single annotation. In this paper, we describe an approach for dealing with this sparse data problem, enabling accurate semantic role labeling for novel verbs (rolesets) with only a single training example. Our approach involves the identification of syntactically similar verbs found in PropBank, the alignment of arguments in their corresponding rolesets, and the use of their corresponding annotations in PropBank as surrogate training data. |
| 25 May 07 | Wei Wang (Language Weaver) |
Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: We show that phrase structures in Penn Treebank style parses are not optimal for syntax-based machine translation. We exploit a series of binarization methods to restructure the Penn Treebank style trees such that syntactified phrases smaller than Penn Treebank constituents can be acquired and exploited in translation. We find that employing the EM algorithm to determine the binarization of a parse tree among a set of alternative binarizations gives the best translation result. |
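As a rough illustration of what "alternative binarizations" of a parse tree can look like, the following Python sketch produces a left-binarized and a right-binarized version of the same flat constituent. It is only a sketch under stated assumptions: the `Tree` class, the function names, and the `-BAR` label for intermediate nodes are hypothetical and not taken from the talk, and the EM-based selection among alternatives described in the abstract is not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tree:
    """A minimal labeled tree; leaves have no children (hypothetical helper)."""
    label: str
    children: List["Tree"] = field(default_factory=list)

    def __str__(self) -> str:
        if not self.children:
            return self.label
        inner = " ".join(str(child) for child in self.children)
        return f"({self.label} {inner})"


def binarize_left(tree: Tree) -> Tree:
    """Left-binarize: repeatedly group the two leftmost children."""
    kids = [binarize_left(child) for child in tree.children]
    while len(kids) > 2:
        # The "-BAR" suffix for virtual nodes is an assumed naming convention.
        merged = Tree(tree.label + "-BAR", kids[:2])
        kids = [merged] + kids[2:]
    return Tree(tree.label, kids)


def binarize_right(tree: Tree) -> Tree:
    """Right-binarize: repeatedly group the two rightmost children."""
    kids = [binarize_right(child) for child in tree.children]
    while len(kids) > 2:
        merged = Tree(tree.label + "-BAR", kids[-2:])
        kids = kids[:-2] + [merged]
    return Tree(tree.label, kids)


flat_np = Tree("NP", [Tree("DT"), Tree("JJ"), Tree("NN")])
print(binarize_left(flat_np))   # (NP (NP-BAR DT JJ) NN)
print(binarize_right(flat_np))  # (NP DT (NP-BAR JJ NN))
```

The two outputs expose different sub-constituents (`DT JJ` versus `JJ NN`), which is the kind of smaller phrase a flat Treebank node would otherwise hide.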
| 18 May 07 | Feng Pan |
Computing Semantic Similarity between Skill Statements for Approximate Matching
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: (This will be an extended version of the talk for NAACL-HLT 2007. It's based on my summer internship work at IBM T.J. Watson Research Center last year.) The project aimed to address the problems encountered when trying to match available employees to open job positions, based on skill matches. Currently, job search applications, like IBM's Professional Marketplace, only find exact matches. A skill affinity computation is desired to allow searches to be expanded to related/similar skills, and return more potential matches. In this talk, I will explore the problem of computing text similarity between verb phrases describing skilled human behavior for the purpose of finding approximate matches. Four parsers (Charniak's parser, Stanford's parser, IBM XSG slot grammar parser, and Lin's MINIPAR) are evaluated on a corpus of skill statements extracted from an enterprise-wide expertise taxonomy. A similarity measure utilizing common semantic role features extracted from parse trees was found superior to an information-theoretic measure of similarity and comparable to the level of human agreement.
|
| 11 May 07 | Steve DeNeefe |
What Can Syntax-based MT Learn from Phrase-based MT?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We compare and contrast the strengths and weaknesses of a syntax-based machine translation model with a phrase-based machine translation model on several levels. We briefly describe each model, highlighting points where they differ. We include a quantitative comparison of the phrase pairs that each model has to work with, as well as the reasons why some phrase pairs are not learned by the syntax-based model. We then propose improvements to the syntax-based extraction techniques to capture more phrases. We also compare the translation accuracy for all variations. |
| 04 May 07 | Sheelagh Carpendale (Calgary) |
Information Visualization and Collaboration
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Consider Donald Norman's quote, "The power of the unaided mind is highly overrated. Without external aids, memory, thought, and reasoning are all constrained. But human intelligence is highly flexible and adaptive, superb at inventing procedures and objects that overcome its own limits. The real powers come from devising external aids that enhance cognitive abilities." (Norman, 1993) Common methods for externalization include making sketches on whatever happens to be handy -- paper napkins, program margins, etc. -- and/or finding a colleague or two to discuss the problem with. It would seem then, that visualization and collaboration are natural possibilities for creating positive cognitive aids. I will discuss our approach to developing interactive information visualizations both to support individuals and small groups of collaborators and briefly describe some of our recent results. About the speaker: Sheelagh Carpendale holds a Canada Research Chair in Information Visualization at the University of Calgary. Her research focuses on the visualization, exploration and manipulation of information; visualizing such topics as ecological dynamics, uncertainty in information, social and communication information and investigating the development of information visualization environments that support collaboration. Dr. Carpendale's research in information visualization and interaction design draws on her dual background in Computer Science (BSc. and Ph.D. Simon Fraser University) and Visual Arts (Sheridan College, School of Design and Emily Carr, College of Art). |
| 20 Apr 07 | Christopher Collins (Toronto) |
Information Visualization to Support Computational Linguistics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We present a survey of recent research into using information visualization to reveal new insights about linguistic data. Our recent work includes using WordNet hyponymy as a basis for document visualization and visualizing the uncertainty in machine translation in an instant messaging chat context. We will present our preliminary findings and prototype visualization for machine translation data resulting from a week of collaboration with ISI researchers. About the speaker: Christopher Collins is a PhD candidate in information visualization and computational linguistics at the University of Toronto. He works with Prof. Gerald Penn and Prof. Sheelagh Carpendale (University of Calgary).
|
| 30 Mar 07 | Ido Dagan (Bar-Ilan U) |
Textual entailment as a framework for applied semantics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We have recently proposed Recognizing Textual Entailment (RTE) as a generic task that captures major semantic inferences across different natural language processing applications. The talk will first review the motivation and definition of the textual entailment task and the PASCAL RTE-1,2&3 Challenges benchmarks. Then we will demonstrate directions for building textual entailment systems, based on knowledge acquisition and inference, and for utilizing them within concrete applications. Furthermore, we suggest that textual entailment modeling may become a comprehensive framework for applied semantics research. Such a framework introduces useful variants of known semantic problems and highlights important tasks which have hardly been investigated so far at an applied computational level. The semantic modeling perspective will be illustrated in more detail by a case study for an entailment-based variant of word sense disambiguation. About the speaker: Ido Dagan is a Senior Lecturer at the Department of Computer Science at Bar Ilan University, Israel. His areas of interest are largely within empirical NLP, particularly empirical approaches for applied semantic processing. In the last few years Ido and his colleagues introduced textual entailment as a generic framework for applied semantic inference and have organized the first three rounds of the PASCAL Recognizing Textual Entailment Challenges. Ido received his Ph.D. from the Technion. He has been a research fellow at the IBM Haifa Scientific Center and a Member of Technical Staff at AT&T Bell Laboratories. During 1998-2003 he was co-founder and CTO of FocusEngine and VP of Technology of LingoMotors. |
| 23 Mar 07 | Hermann Helbig (U at Hagen, Germany) |
Multilayered Extended Semantic Networks as a Knowledge Representation Paradigm and Interlingua for Meaning Representation
Time: 3:00 pm - 4:30 pm Location: 4 CR Abstract: The talk gives an overview of Multilayered Extended Semantic Networks (abbreviated MultiNet), which is one of the most comprehensively described knowledge representation paradigms used as a semantic interlingua in large-scale NLP applications and for linguistic investigations into the semantics and pragmatics of natural language. As with other semantic networks, concepts are represented in MultiNet by nodes, and relations between concepts are represented as arcs between these nodes. In addition, every node is classified according to a predefined conceptual ontology forming a hierarchy of sorts, and the nodes are embedded in a multidimensional space of layer attributes and their values. MultiNet provides a set of about 150 standardized relations and functions which are described in a very concise way including an axiomatic apparatus, where the axioms are classified according to predefined types. The representational means of MultiNet claim to fulfill the criteria of universality, homogeneity, and cognitive adequacy. The talk also shows how MultiNet can be used for the semantic representation of different semantic phenomena. To overcome the quantitative barrier in building large knowledge bases and semantically oriented computational lexica, MultiNet is associated with a set of tools including a semantic interpreter NatLink for automatically translating natural language expressions into MultiNet networks, a workbench LIA for the computer lexicographer, and a workbench MWR for the knowledge engineer for managing and graphically manipulating semantic networks. The applications of MultiNet as a semantic interlingua range from natural language interfaces to the Internet and to dedicated databases, over question-answering systems, to systems for automatic knowledge acquisition. About the speaker: Prof. Helbig is head of the chair Intelligent Information and Communication Systems at the University of Hagen, Germany. His main research areas are Knowledge Representation, Semantic Natural Language Processing, and Question-Answering. A CV can be found here. |
| 09 Mar 07 | Kevin Knight |
The Voynich Manuscript
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The medieval Voynich Manuscript has been called "the most mysterious document in the world". Its pages contain bizarre drawings of strange plants and astrological diagrams, as well as an undeciphered script of 20,000 running words, written in a character set that has never been seen elsewhere. Its origin is also controversial, with many theories abounding. I will describe the document, show samples, explain where it may have come from, and present some properties of the text. This will be more of a history/mystery talk than a computer science talk. |
| 26 Jan 07 | Gerald Penn (Toronto) |
The Quantitative Study of Writing Systems
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: If you understood all of the world's languages, you would still not be able to read many of the texts that you find on the world wide web, because they are written in non-Roman scripts -- often ones that have been arbitrarily encoded for electronic transmission in the absence of an accepted standard. This very modern nuisance reflects a dilemma as ancient as writing itself: the association between a language as it is spoken and its written form has a sort of internal logic to it that we can comprehend, but the conventions are different in every individual case --- even among languages that use the same script, or between scripts used by the same language. This conventional association between language and script, called a writing system, is indeed reminiscent of the Saussurean conception of language itself, a conventional association of meaning and sound, upon which modern linguistic theory is based. Despite linguists' reliance upon writing to present and preserve linguistic data, however, writing systems were a largely forgotten corner of linguistics until the 1960s, when Gelb presented their first classification. This talk will describe recent work that aims to place the study of writing systems upon a sound computational and statistical foundation. While archaeological decipherment may eternally remain the holy grail of this area of research, it also has applications to speech synthesis, machine translation, and multilingual document retrieval. |
| 12 Jan 07 | Kevin Knight |
Capturing Natural Language Transformations
Time: 2:00 pm - 3:30 pm Location: 11 Large Abstract: Knowledge representation is hard. As natural language scientists and engineers, we'd like something that (1) is expressive enough to capture how natural language works, (2) permits tractable inference, (3) admits learning algorithms for automatic knowledge acquisition, and (4) leads to modular system construction. This talk will look at knowledge representation for capturing natural language transformations. A lot of what we do falls into this category. Examples of transformations include language translation (French to English), question answering (Question to Answer), transliteration (foreign script to Roman alphabet), summarization (long text to short text), parsing (string to tree), language generation (meaning to string), etc. I'll show various knowledge formats (starting with simple finite-state transducers) and show how they stack up on the 4 criteria above, using theorems and examples. We'll see that different types of tree and string automata lead to good behavior on various subsets of the 4 criteria, but getting 4 out of 4 is still elusive. This is a Krazy Theory talk -- since this kind of talk should not go on and on, I promise to finish within 50 minutes. |
| 05 Jan 07 | Beata Klebanov (Hebrew U) |
Experimental and Computational Investigation of Lexical Cohesion in Texts
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Lexical cohesion refers to structure created in a text by use of words with related meanings. Apart from its importance in theoretical and applied linguistics, lexical cohesion detection is used in NLP tasks like topic segmentation, extractive summarization, spelling correction, etc. However, the intuitive potential of lexical cohesion for such tasks is often not realized in practice, possibly due to shortcomings of detection algorithms. I will briefly describe an experiment with readers aimed at providing reliable data for a computational investigation of lexical cohesion. We then discuss a number of informative features for cohesion detection, drawing on sources like WordNet, distributional information, free associations, and the structure of information in the text itself. Finally, I report experiments with supervised learning of lexical cohesion. About the speaker: Beata Beigman Klebanov is a PhD candidate at the Hebrew University of Jerusalem, Israel, currently a visiting scholar at Northwestern University. Beata's interests are in experimental, computational and applied research in text pragmatics. |
| 15 Dec 06 | Jerry Hobbs |
When Will Computers Understand Shakespeare?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In this talk I will examine problems encountered in coming to some kind of understanding of one sonnet by Shakespeare (his 64th), ask what it would take to solve these problems computationally, and suggest routes to the solution. The general conclusion is that we are closer to this goal than one might think. Or are we? Bio: Jerry Hobbs is famous primarily for having an office next to Kevin Knight's and a parking space next to Ed Hovy's. He has read everything of Shakespeare's that survives, including his will and plays of dubious authorship. But that was all a long time ago. |
| 14 Dec 06 | Liang Huang (Penn) |
Faster Decoding with Synchronous Grammars and n-gram Language Models
Time: 1:30 pm - 3:00 pm Location: 11 Large Abstract: A major obstacle in syntax-based machine translation is the prohibitively large search space for decoding with an integrated language model. We develop faster approaches for this problem based on lazy algorithms for k-best parsing. When comparing against Chiang's technique of cube pruning, our method runs up to twice as fast without making more search errors or decreasing translation accuracy as measured by BLEU. We demonstrate the effectiveness of the algorithm on a large-scale translation system. Interestingly, these techniques can be applied to speed up bilexical parsing as well, where the (bi-)lexical probabilities can be viewed as n-gram probabilities that cause non-monotonicity. This method fits naturally into the coarse-to-fine grained multi-pass parsing schemes. To push this direction even further, we can generalize cube and lazy cube pruning as generic tools for reducing complicated search spaces, as alternatives to the well-known A* and annealing techniques. This is joint work with David Chiang (ISI). |
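The idea of lazily enumerating only the best few combinations, rather than scoring every combination, can be sketched with a small heap-based frontier search over two sorted cost lists. This is a generic illustration, not the speakers' algorithm: the function name `k_best_sums` and the purely additive cost model are assumptions. With additive costs the enumeration is exact; as the abstract notes, language model scores break this monotonicity, which is why the same search becomes approximate in real decoding.

```python
import heapq
from typing import List, Tuple


def k_best_sums(a: List[float], b: List[float], k: int) -> List[Tuple[float, int, int]]:
    """Return the k smallest a[i] + b[j] as (cost, i, j), exploring lazily.

    Both lists must be sorted in ascending order. Only cells adjacent to
    already-popped cells are ever scored, so far fewer than len(a) * len(b)
    combinations are touched when k is small.
    """
    if not a or not b or k <= 0:
        return []
    heap = [(a[0] + b[0], 0, 0)]   # start from the best corner of the grid
    seen = {(0, 0)}
    results = []
    while heap and len(results) < k:
        cost, i, j = heapq.heappop(heap)
        results.append((cost, i, j))
        # Push the two neighbours of the popped cell, if not yet queued.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (a[ni] + b[nj], ni, nj))
    return results


best = k_best_sums([1, 4, 6], [2, 3, 9], k=4)
print([cost for cost, _, _ in best])  # [3, 4, 6, 7]
```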
| 27 Nov 06 | Mark Hopkins (Potsdam) |
Towards the Effective Exploitation of Syntax in Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We discuss preliminary work on a possible approach to exploiting syntax in an effective way for machine translation. The driving guideline is to devise a machine translation system that can perform effectively, given a very limited quantity of parsed training data. |
| 17 Nov 06 | David DeVault (Rutgers) |
Scorekeeping in an Uncertain Language Game
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Practical dialogue systems must exploit context to interpret user utterances correctly. Received views of context and coordination in pragmatic theory equate utterance context with the occurrent subjective states of interlocutors using notions like common knowledge or mutual belief. We argue that these views are not well suited for practical modeling due to the uncertainty and robustness of context dependence in human-human dialogue. We present an alternative characterization of utterance context as objective and normative. On this view, an interlocutor's representation of context reflects private uncertainty about the true objective context as determined by prior speaker meanings. As conversation moves forward, new utterances provide interlocutors with retrospective insight about each other's prior meanings and therefore about what the true context really is. This view reconciles the need for uncertainty with received intuitions about coordination, and can directly inform computational approaches to dialogue. Joint work with Matthew Stone, Rutgers and Rich Thomason, Michigan. About the Speaker: David DeVault is a Ph.D. candidate in the Department of Computer Science at Rutgers University. He holds a B.S. in Engineering and Applied Science from the California Institute of Technology and an M.A. in Philosophy from Rutgers University. David's research aims to develop techniques to allow computational agents to participate in flexible task-oriented conversations with human beings. His recent work has drawn on design challenges encountered in building such an agent to try to articulate practical, learnable, and theoretically satisfying representations of context, utterance meaning, and speaker intention for implemented conversational systems. |
| 03 Nov 06 | Jens-Soenke Voeckler |
perl part 2 - advanced magick
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: Since part 1 of the Perl tutorial didn't cover the juicy bits (like a unique function in Perl), based on feedback from participants, I am offering a part 2 "Perl - Advanced Magick" covering: (1) the slides from roughly page 40, including the Schwartzian Transform and dissecting a program; (2) what to do if you do need popen or backticks; (3) OO Perl - a start; (4) C embedding - definitely only a "start here"; (5) useful recipes, e.g. interpolating variables in configuration scripts from Perl values. If there is something you are especially interested in seeing, please send me an email. |
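For readers unfamiliar with the Schwartzian Transform named in this abstract, it is the decorate-sort-undecorate idiom: compute an expensive sort key once per item, sort on the cached key, then discard the keys. The sketch below shows the idiom in Python rather than Perl, purely for illustration; the function names are hypothetical, and the order-preserving `unique` helper is one common reading of the "unique function" the abstract mentions, not necessarily what the tutorial presented.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def schwartzian_sort(items: Iterable[T], key: Callable[[T], object]) -> List[T]:
    """Sort items by key(item), calling key exactly once per item."""
    decorated = [(key(item), item) for item in items]   # decorate
    decorated.sort(key=lambda pair: pair[0])            # sort on cached key
    return [item for _, item in decorated]              # undecorate


def unique(items: Iterable[Hashable]) -> List[Hashable]:
    """Drop duplicates while keeping the first occurrence of each item."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(schwartzian_sort(["pear", "fig", "banana"], key=len))  # ['fig', 'pear', 'banana']
print(unique([3, 1, 3, 2, 1]))                               # [3, 1, 2]
```

In everyday Python the built-in `sorted(items, key=...)` already calls the key function once per element, so the explicit three-step form is mainly useful for seeing what the idiom does.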
| 23 Oct 06 | Jens-Soenke Voeckler |
perl - how to use it, not abuse it
Time: 12:00 pm - 1:30 pm Location: 11 Large Abstract: If you speak a little perl, are an occasional perl-scripter, and would like to know more about how to use it as a (p)ortable, (e)fficient, and (r)eadable (l)anguage, you may be interested in my brown bag (read: bring your own) lunch seminar: I will talk about using Perl in a portable fashion, the environment it is run in, and how to avoid common mistakes and misconceptions. Perl offers more than a thousand ways to solve a problem, but some are more portable or more efficient than others. If time permits, simple hands-on examples can be tried out during the talk, so power for laptops will be provided. |
| 29 Sep 06 | Ashish Venugopal (CMU) |
Delayed LM Intersection and Left-to-Right N-Best Extraction for Syntax-Based MT
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We begin by describing a set of pruning constraints that are applied in the literature to effectively restrict the search space of synchronous PCFGs intersected with target language model contexts. We apply these constraints to non-binarized grammars with a large number of non-terminals and demonstrate effective parsing within the framework of Wu, 97. We then present a novel parsing approach that avoids language model context intersection during parsing in favor of language model driven n-best list extraction. The parsing step produces a sentence-spanning parse forest which is explored in left-to-right target order by the N-Best extraction method. This method avoids lossy pruning during the parsing process, searching a much larger effective parse space than practically possible in the full intersection scenario, and has the important benefit of allowing integration of a high-order language model within the N-Best search process, rather than only in parse re-scoring. We demonstrate the impact of this parsing approach using the SPCFG approach described in Zollmann, Venugopal, Vogel 06, which is similar to Galley et al., 04 and compare performance against full intersection. This is joint work with Andreas Zollmann. About the Speaker: Ashish Venugopal is a Ph.D. candidate at the Language Technologies Institute at Carnegie Mellon University, and holds B.S. (SCS, Univ. Honors) and M.S. degrees from the same institution. He is a Seibel Scholar and has received the annual Graduate Student Teaching Award at Carnegie Mellon. His research focus is on syntax augmented machine translation.
|
| 22 Sep 06 | Eduard Hovy |
Toward a 'Science' of Annotation: Experiences from OntoNotes
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: As machine learning algorithms and their application for NLP become better understood, attention turns toward the production of annotated corpora to which they can be applied. Numerous phenomena present themselves for annotation, including aspects in lexical semantics, discourse, pragmatics, and dialogue. But several questions immediately must be answered: 1. How does one obtain a balanced corpus to annotate? What is a balanced corpus? 2. How does one decide which aspects to annotate? How does one adequately express the theory behind the phenomena in simple annotation steps? 3. Which annotators does one hire? How does one ensure that they are adequately trained? 4. How does one establish a simple, fast, and trustworthy annotation procedure? What interfaces does one build? How does one ensure that the interfaces do not affect the annotation results? 5. How does one evaluate the results? What are the appropriate agreement measures? At which cutoff points should one re-do the annotations? How does one ensure improvement? 6. How should one formulate and store the results? How does one ensure compatibility with other existing resources? How does one make results available for best impact? 7. How does one report the annotation effort and results? How does one actually get a paper on this work published at an important conference? What should the paper contain? Despite their being so basic, there is almost no established procedure or standard set of answers to these questions today. In this talk I discuss some of these aspects, pointing to the lessons learned in the ongoing OntoNotes project (joint with BBN, the University of Colorado (PropBank), the University of Pennsylvania (Treebank), and ISI). |
| 25 Aug 06 | Victoria Fossum (Michigan) |
Improving Precision of Word Alignments Using GHKM Syntax-Based Rule Extraction
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Noisy word alignments negatively affect the quality of the translation rules extracted by the ISI syntax-based MT system. In the literature, alignment is typically treated as a separate process from subsequent stages in the MT pipeline. By contrast, we allow rule extraction to guide the alignment process. We present an unsupervised algorithm for identifying and removing "bad" links using GHKM syntax-based rule extraction. We show that we can improve upon the precision of GIZA union (measured against a gold standard set of manually aligned Chinese-English sentence pairs), while only decreasing recall slightly. (Note: This is part of the Summer Intern Series) |
| 25 Aug 06 | Jason Riesa |
Minimally Supervised Morphological Segmentation with Applications to Machine Translation
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: Inflected languages in a low-resource setting present a data sparsity problem for statistical machine translation. In this work, we present a minimally supervised algorithm for morpheme segmentation on Arabic dialects which reduces unknown words at translation time by over 50%, total vocabulary size by over 40%, and yields a significant increase in BLEU score over a previous state-of-the-art phrase-based statistical MT system. |
| 23 Aug 06 | Joseph Turian (NYU) |
Speeding-up Syntax-based Decoding
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: TBA (Note: This is part of the Summer Intern Series) |
| 23 Aug 06 | Oana-Diana Postolache |
Towards combining Searn and Syntax-Based Machine Translation (SBMT)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: This talk is about modeling the Syntax-Based Machine Translation (SBMT) problem within the Searn (Search & Learn) framework developed by Hal Daume in his PhD thesis. I will present the way we define the states, actions and the search space and how to implement the cost function. (Note: This is part of the Summer Intern Series) |
| 18 Aug 06 | Chenhai Xi |
Name Entity Transliteration Discovery from Large Bilingual Comparable Corpora
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: In this summer project, we investigate a scalable method to extract Chinese-English name transliterations from large comparable corpora, which consist of texts in two languages discussing the same or similar topics. We show that the bigram Jaccard coefficient is a good similarity measure for comparing English and Chinese names at the Chinese pronunciation (Pinyin) level. Based on this phonetic similarity score, an efficient randomized algorithm is then used to find name pair candidates from English and Chinese lists. Finally, context information, such as dates, frequency, places and titles, is combined with the phonetic similarity to improve the accuracy of the name pairs list. (Note: This is part of the Summer Intern Series) |
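The bigram Jaccard coefficient mentioned in this abstract can be sketched as the size of the intersection of two names' bigram sets divided by the size of their union. The Python below is a minimal illustration under assumptions: it uses character bigrams over lowercase Latin letters with the Pinyin written without spaces or tone marks, and the function names are hypothetical. The abstract does not say exactly how the project tokenized names, so treat the details as placeholders.

```python
from typing import Set


def char_bigrams(text: str) -> Set[str]:
    """Return the set of adjacent character pairs in a lowercased string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |A & B| / |A | B| over character bigram sets."""
    set_a, set_b = char_bigrams(a), char_bigrams(b)
    if not set_a and not set_b:
        return 0.0  # assumed convention when neither string has a bigram
    return len(set_a & set_b) / len(set_a | set_b)


# English name against an unsegmented Pinyin rendering (illustrative input).
score = bigram_jaccard("Clinton", "kelindun")
print(round(score, 3))  # 0.182 (shared bigrams "li" and "in" out of 11 total)
```

A score like this would only rank candidates; the abstract describes combining it with context information such as dates and titles to filter the final list.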
| 11 Aug 06 | Idan Szpektor (Bar-Ilan U) |
Textual Entailment: Framework, Learning and Applications
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Textual Entailment has been proposed recently as a generic framework for modeling semantic variability in many Natural Language Processing applications, such as Question Answering, Information Extraction, Information Retrieval and Document Summarization. The Textual Entailment relationship holds between two text fragments, termed text and hypothesis, if the truth of the hypothesis can be inferred from the text. In this talk, the Textual Entailment framework will be introduced. I'll then present an algorithm for large-scale Web-based acquisition of entailment rules, a type of knowledge needed for robust inference. Finally, I will present an unsupervised Relation Extraction approach based on the Textual Entailment framework. About the speaker: Idan Szpektor is a PhD student under the supervision of Dr. Ido Dagan at Bar Ilan University, Israel. His current research activity is in acquisition of knowledge for textual entailment.
|
| 04 Aug 06 | Shou-de Lin |
Ph.D. defense practice talk
Time: 3:30 pm - 4:30 pm Location: 11 Large Abstract: This is a practice talk for my Ph.D. defense, which will be held on Aug 24th 3-5pm, SAL 322. An important problem in the area of homeland security and fraud detection is to identify abnormal entities in large datasets. Although there are methods from knowledge discovery and data mining focusing on finding anomalies in numerical datasets, there has been little work aimed at discovering abnormal or suspicious instances in large and complex semantic graphs whose nodes are richly connected with many different types of links. In this talk, I will describe a novel, domain-independent and unsupervised framework to identify such instances. Besides discovering suspicious instances, we believe that to complete the discovery process and to deal with the "curse of false positives", a system has to convince the users by providing explanations for its findings. Therefore, in the second part of the talk I will describe an explanation mechanism to automatically generate human-understandable explanations for the discovered results. Experimental results show that our discovery system outperforms state-of-the-art unsupervised network algorithms used to analyze the 9/11 terrorist network by a large margin. Additionally, a human study we conducted demonstrates that our explanation system, which provides natural language explanations for its findings, allowed human subjects to perform complex data analysis in a much more efficient and accurate manner.
|
| 28 Jul 06 | Qin Iris Wang (Alberta) |
Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk is about an improved approach for learning dependency parsers from treebank data. Our technique is based on two ideas for improving large margin training in the context of dependency parsing. First, we incorporate local constraints that enforce the correctness of each individual link, rather than just scoring the global parse tree. Second, to cope with sparse data, we smooth the lexical parameters according to their underlying word similarities using Laplacian Regularization. To demonstrate the benefits of our approach, we consider the problem of parsing Chinese treebank data using only lexical features, that is, without part-of-speech tags or grammatical categories. We achieve state of the art performance, improving upon current large margin approaches. Here is the link for the paper: http://www.cs.ualberta.ca/~wqin/papers/depar_margin_conll06.pdf About the speaker: Qin Iris Wang is a Ph.D. student from the University of Alberta, working with Dekang Lin and Dale Schuurmans. Her research interests are in natural language processing and machine learning. Specifically, she has been working on dependency parsing using both generative and discriminative methods. |
| 11 Jul 06 | Dragos Munteanu + Joseph Turian |
Practice Talks for ACL
Time: 2:30 pm - 4:00 pm Location: 11 Large Abstract: Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora Dragos Munteanu We present a novel method for extracting parallel sub-sentential fragments from comparable bilingual corpora. Currently, the state of the art in comparable corpus mining is only able to extract full sentence pairs which are judged to be parallel. We advance the state of the art by showing how to obtain useful data even from not-fully-parallel sentences. By analyzing sentence pairs using a signal-processing-inspired approach, we detect which segments of the source sentence are translated into segments of the target sentence, and which are not. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art machine translation system. Advances in Discriminative Parsing Joseph Turian The present work advances the accuracy and training speed of discriminative parsing. Our discriminative parsing method has no generative component, yet surpasses a generative baseline on constituent parsing, and does so with minimal linguistic cleverness. Our model can incorporate arbitrary features of the input and parse state, and performs feature selection incrementally over an exponential feature space during training. We demonstrate the flexibility of our approach by testing it with several parsing strategies and various feature sets. |
| 30 Jun 06 | David Chiang and Kevin Knight |
Synchronous Grammars and Tree Transducers
Time: 2:00 pm - 5:00 pm Location: 11 Large Abstract: (Practice tutorial for ACL/COLING 2006) Once upon a time, synchronous grammars and tree transducers were esoteric topics in formal language theory, far removed from the practice of building real, large-scale natural language systems. However, these tools are now rapidly becoming essential for modeling machine translation and other complex language transformations. It has therefore become practical and important to understand the basic properties of tree transformation systems, which we cover in this tutorial.
|
| 23 Jun 06 | Joseph Turian (NYU) |
Discriminative Training for Large-Scale NLP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Parsing and translating natural languages can be viewed as structured-prediction problems. We outline the crucial design decisions that must be made to build a machine to solve structured prediction problems, and explain our particular choices for these two large-scale NLP problems. Our approach uses a purely discriminative learning method that scales up well to problems of this size. Unlike currently popular methods, this one does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Accuracy on constituent parsing was at least as good as other comparable methods. To our knowledge, it is the first purely discriminative learning algorithm for translation with tree-structured models. Experiments demonstrate the method's versatility, accuracy, and efficiency.
|
| 26 May 06 | Radu Soricut and Hal Daume III |
Defense Practice Talks: Generation and Learning
Time: 3:00 pm - 5:00 pm Location: 11 Large Abstract: These are two practice talks for our upcoming thesis defenses. The titles and abstracts are: -------------------------------------------------------------------------- NATURAL LANGUAGE GENERATION FOR TEXT-TO-TEXT APPLICATIONS USING AN INFORMATION-SLIM REPRESENTATION Radu Soricut In this talk, I describe a new natural language generation paradigm, based on direct transformation of textual information into well-formed textual output. I support this language generation paradigm with theoretical contributions in the field of formal languages, new algorithms, empirical results, and software implementations. At the core of this work is a novel representation formalism for probability distributions over finite languages. Due to its convenient representation and computational properties, this formalism supports a wide range of language generation needs, from sentence realization to text planning. Based on this formalism, I describe, implement, and analyze theoretically a family of algorithms that perform language generation using direct transformations of text. These algorithms use stochastic models of language to drive the generation process. I perform extensive empirical evaluations using my implementation of these algorithms. These evaluations show state-of-the-art performance in automatic translation, and significant improvements in state-of-the-art performance in abstractive headline generation and coherent discourse generation. -------------------------------------------------------------------------- PRACTICAL STRUCTURED LEARNING FOR NATURAL LANGUAGE PROCESSING Hal Daume III Natural language processing is replete with problems whose outputs are highly complex and structured. The current state-of-the-art in machine learning is not yet sufficiently general to be applied to general problems in NLP. 
In this thesis, I present Searn (for "search" + "learn"), an approach to learning for structured outputs that is applicable to the wide variety of problems encountered in natural language. Searn operates by transforming structured prediction problems into a collection of classification problems, to which any standard binary classifier may be applied. From a theoretical perspective, Searn satisfies a strong fundamental performance guarantee: given a good classification algorithm, Searn yields a good structured prediction algorithm. To demonstrate Searn's general applicability, I present applications in such diverse areas as automatic document summarization and entity detection and tracking. In these applications, Searn is empirically shown to achieve state-of-the-art performance. |
| 24 May 06 | Hal Daume III |
Beyond EM: Bayesian Techniques for Human Language Technology Researchers
Time: 9:00 am - 12:00 pm Location: 4th Floor Abstract: This is a practice tutorial for one I am giving at HLT/NAACL one week later. Comments/feedback are very welcome. ---------------------------------------------------------------------- Expectation Maximization (EM) has proved to be a great and useful technique for unsupervised learning problems in speech and language processing. Unfortunately, its range of applications is limited either by intractable E- or M-steps, or by its reliance on the maximum likelihood estimator. The natural language processing community typically resorts to ad-hoc approximation methods to get (some reduced form of) EM to apply to NLP tasks. However, many of the problems that plague EM can be solved with Bayesian methods, which are theoretically more well justified. In this tutorial, I discuss Bayesian methods as they can be used in natural language processing. The two primary foci of this tutorial are specifying prior distributions and performing the necessary computations to perform inference in Bayesian models. I focus on unsupervised techniques (for which EM is the obvious choice), but discuss supervised and discriminative techniques at the conclusion with pointers to relevant literature. Depending on one's inference technique of choice, the math required to build Bayesian learning models can be difficult. Compounding this problem is the fact that current written tutorials on Bayesian techniques tend to focus on continuous-valued problems, a poor match for the high-dimension discrete world of text. This combination makes the cost of entrance to the Bayesian learning literature often too high. The goal of this tutorial is to provide sufficient motivation, intuition and vocabulary mapping so that one can easily understand recent papers in Bayesian learning that are published at conferences like NIPS, and increasingly at ACL. 
In addition to the standard tutorial materials (slides), this tutorial is accompanied by a technical report that spells out all the mathematical derivations in great detail, for those who wish to start research projects in this field. This tutorial should be accessible to anyone with a basic understanding of statistics. I use a query-focused summarization task as a motivating running example for the tutorial, which should be of interest to researchers in natural language processing and in information retrieval. Additionally, though the tutorial does not focus on speech problems, those attendees interested in graphical modeling techniques for automatic speech recognition might also find the tutorial of interest. |
| 19 May 06 | Patrick Pantel |
Espresso: Making Use of Generic Patterns for Mining Relations from Small and Large Corpora
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In the past decade, researchers have explored many approaches to automatically extract large collections of knowledge from text. In this talk, we present Espresso, a weakly-supervised, general-purpose, and broad-coverage algorithm for harvesting binary semantic relations. The main contributions are: i) a method for exploiting generic patterns by filtering incorrect instances using the Web; and ii) a principled measure of pattern and instance reliability enabling the filtering algorithm. We present an empirical comparison of Espresso with various state of the art systems, on different size and genre corpora, on extracting various general and specific relations. Experimental results show that our exploitation of generic patterns substantially increases system recall with small effect on overall precision.
|
| 12 May 06 | Nick Mote and Donghui Feng |
Pedagogical Contextualization of Language Learner Speech Errors AND Learning to Detect Conversation Focus of Threaded Discussions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: These are two practice talks. ----------------------------------------------------------------------------- FIRST TALK: The traditional approach to diagnosing learner speech errors in Computer Aided Language Learning is to create a linguistic profile of the learner/user. We, however, propose that work must also be done to model the linguistic profile of a typical native listener. Not all errors in second language learner speech are created equal. Different errors sound more "severe" or "harsh" to native speaker ears and should therefore be treated with more emphasis in pedagogical interaction. The Tactical Language Training System (TLTS) is a speech-enabled virtual-reality based computer learning environment designed to teach Arabic spoken communication to American English speakers. This talk addresses the ways the TLTS contextualizes non-native speech errors, and how this contextualization fits in the corrective exchanges between a non-native learner and a pedagogical agent built to model a native listener. The pedagogical system used in TLTS includes: * Automatic Speech Recognition (ASR) models which are built on a combination of both annotated and unannotated non-native speech with native speech data. * A stochastic generative model for errors in learner speech that creates mispronunciation grammars for the ASR * Reweighting of system-perceived mispronunciation severity based on aggregate native speaker judgements of quality pronunciation and intelligibility. * Contextualization of feedback based on lexical and phonetic inventories of the native and non-native languages. ----------------------------------------------------------------------------- SECOND TALK: We present a novel feature-enriched approach that learns to detect the conversation focus of threaded discussions by combining NLP analysis and IR techniques. 
Using the graph-based algorithm HITS, we integrate different features such as lexical similarity, poster trustworthiness, and speech act analysis of human conversations with feature-oriented link generation functions. It is the first quantitative study to analyze human conversation focus in the context of online discussions that takes into account heterogeneous sources of evidence. Experimental results using a threaded discussion corpus from an undergraduate class show that it achieves significant performance improvements compared with the baseline system.
|
| 05 May 06 | Namhee Kwon |
Recognizing Argument Structures in Texts
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I present our approach to identifying an argument structure, defined as a simple hierarchical structure of claim and reasons. The claim is also classified as "in favor of" or "against" the topic. The experiment is performed on the comments from the general public sent to government officials in response to proposed regulations.
|
| 28 Apr 06 | Feng Pan |
Learning Event Durations from Event Descriptions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The research of extracting event duration information from texts is potentially very important in applications in which the time course of events is to be extracted from news. For example, whether two events overlap or are in sequence often depends very much on their durations. If a war started yesterday, we can be pretty sure it is still going on today. If a hurricane started last year, we can be sure it is over by now. In the talk, I will first present our work on constructing an annotated corpus for extracting information about the typical durations of events from texts, including the annotation guidelines, the event classes we categorized, the way we use normal distributions to model such vague and implicit temporal information, and how we evaluate inter-annotator agreement. I will then show that machine learning techniques applied to this data yield coarse-grained event duration information, considerably outperforming a baseline and approaching human performance. At the beginning of the talk, I will also give a brief overview of the time ontology (OWL-Time, formerly DAML-Time) we have developed, which is represented in both first-order logic and the OWL web ontology language.
|
| 21 Apr 06 | Soo-Min Kim |
Identifying and Analyzing Judgment Opinions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In this talk, we introduce a methodology for analyzing judgment opinions. We define a judgment opinion as consisting of a valence, a holder, and a topic. We decompose the task of opinion analysis into four parts: 1) recognizing the opinion; 2) identifying the valence; 3) identifying the holder; and 4) identifying the topic. We evaluate our methodology using both intrinsic and extrinsic measures. |
| 14 Apr 06 | Radu Soricut |
Natural Language Generation for Text-to-Text Applications using an Information-Slim Representation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Although a considerable number of generic Natural Language Generation (NLG) systems has been produced over the years, none of them is usually employed in end-to-end, text-to-text applications such as Machine Translation, Summarization, Question Answering, etc. In this talk, we identify the likely reasons for this state of affairs, and propose WIDL-expressions as a flexible formalism that facilitates the integration of a generic NLG engine within end-to-end language processing applications. WIDL-expressions represent compactly probability distributions over finite sets of candidate realizations, and have optimal algorithms for text realization via interpolation with language model probability distributions. We show the effectiveness of our WIDL-based NLG engine for both sentence realization and document realization tasks. By employing language models that capture sentence-level properties, we perform Machine Translation and Headline Generation at state-of-the-art levels or better. By employing language models that capture document-level properties such as text coherence, we synthesize output for Multi-document Summarization that displays both high content selection performance and increased coherence.
|
| 24 Mar 06 | Dragos Munteanu |
Automatic creation of parallel corpora
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Parallel texts -- texts that are translations of each other -- are an important resource in many cross-lingual NLP applications, such as lexical acquisition, cross-language IR, and annotation projection. However, their importance is paramount for Statistical Machine Translation (SMT), as they provide the training data from which all the translation knowledge is learned. The state of the art in SMT is advanced enough that, given sufficient parallel data (i.e. a few million words) for any language pair in a given domain, a generic SMT system trained on it will achieve a reasonable translation performance in that domain. The main reason why SMT systems exist only for a handful of languages is that, for most language pairs, parallel training data is simply not available. One way to alleviate this lack of parallel data is to exploit a much richer and more diverse resource: comparable corpora, texts which are not strictly parallel but related. The prototypical example of comparable texts are two news articles in different languages which report on the same event. I will present methods for automatic extraction of parallel data from such corpora. I will show how to detect parallel data at various levels of granularity: parallel documents, parallel sentences, and even parallel sub-sentence fragments. The parallel corpora obtained using these methods help improve translation performance for both resource-scarce language pairs (such as Romanian-English) and resource-rich ones (such as Arabic-English).
|
| 17 Mar 06 | Jon May |
Tiburon: A Finite State Tree Automata Toolkit
Time: 3:00 pm - 4:30 pm Location: 4th Floor Abstract: In the 1990s, researchers applied their new developments in transducer theory using widely available easy-to-use toolkits for string transducers, and made well-known advances in parsing, machine translation, and other areas. Rapid prototyping via software such as the AT&T toolkit and carmel was useful for proofs of concept and in many cases led to unforeseen developments in novel areas. In the current NLP research environment, tree-based strategies and new models have shown promising results in advancing the state of the art, and recent developments in weighted tree automata theory are enriching the bedrock created 40 years ago, but as yet there is no toolkit available with the necessary capabilities to turn promise into solution. Tiburon is the first probabilistic tree transducer toolkit. Similar in form and function to the string-based toolkits of yesteryear, it is designed to be easy to use, with simple but expressive definitions of tree automata and a concise set of vital operations that can be used to construct many useful tree-based NLP projects. Although a work in progress, Tiburon is already a usable tool with active users between the ages of 6 and 41. I will describe the current status of the system, demonstrate ease of use and potential power, and discuss the challenges ahead. |
| 10 Mar 06 | Mark Hopkins |
Exploring the Potential of Intractable Parsers
Time: 3:00 pm - 4:30 pm Location: 10th Floor Abstract: We revisit the idea of history-based parsing, and present a history-based parsing framework that strives to be simple, general, and flexible. We also provide a decoder for this probability model that is linear-space, optimal, and anytime. A parser based on this framework, when evaluated on Section 23 of the Penn Treebank, compares favorably with other state-of-the-art approaches, in terms of both accuracy and speed.
|
| 03 Mar 06 | Liang Huang (Penn) |
Syntax-Directed Translation with Extended Domain of Locality
Time: 3:00 pm - 4:30 pm Location: 11th Floor (Large) Abstract: (note: this is a very tentative title -- comments welcome!) We present a novel extension of syntax-directed translation for statistical MT. Formally speaking, our model is based on tree-to-string transducers that recursively convert a parse-tree in the source-language into a string in the target-language. These transduction rules have multi-level trees on the source-side, giving this system more transformational power due to the extended domain of locality. We also present efficient algorithms for decoding based on dynamic programming. Initial experiments on English-to-Chinese translation show promising results in both speed and the translation quality. Joint work with Kevin Knight and Aravind Joshi. Bio: Liang Huang is a 3rd-year PhD student from the University of Pennsylvania. He is mainly interested in algorithms and formalisms for parsing and syntax-based machine translation. His recent work has been on k-best parsing algorithms (with David Chiang) and synchronous binarization for MT (with Hao Zhang, Dan Gildea, and Kevin Knight). |
| 24 Feb 06 | Hal Daume III |
Search-based Structured Prediction
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I present an algorithm, Searn (for "search-learn"), that is designed to solve structured prediction problems: problems whose goal is to learn to predict complex objects such as parts-of-speech, parse trees, translations, etc. Searn functions by "breaking apart" structured prediction problems into classification problems in the process of search. I analyze Searn in the framework of learning reductions and show that good performance on the underlying classification problems implies good search performance. Moreover, Searn is computationally efficient in a superset of the settings where previous algorithms are efficient and is not limited by conditional independence assumptions (as in CRFs). This excessively simple and general algorithm turns out to have excellent state-of-the-art performance. This is joint work with John Langford (TTI-C) and Daniel Marcu; and, to a lesser extent, with Drew Bagnell (CMU) and Bianca Zadrozny (IBM TJ Watson). |
| 10 Feb 06 | David Chiang |
Parsing Arabic Dialects
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The Arabic language exhibits diglossia, i.e., the coexistence of two forms of language, a variety with standard orthography and sociopolitical clout which is not natively spoken by anyone (Modern Standard Arabic, MSA) and varieties that are primarily spoken and lack writing standards (Arabic dialects). There are important resources currently available for MSA with much on-going NLP work; for example, there is an Arabic Treebank and several syntactic parsers for MSA. However, Arabic dialect resources and NLP research are still at an infancy stage. I will present work done at the Johns Hopkins CLSP Summer Workshop on parsing of Arabic dialects, in particular, Levantine Arabic. We have experimented with three approaches to leveraging MSA resources to create a parser for Levantine Arabic, as well as methods for induction of MSA-Levantine translation lexicons and a Levantine part-of-speech tagger. Using these methods we obtain error reductions of up to 15% compared with applying an MSA parser directly to Levantine text. Rambow et al. Parsing Arabic Dialects: Final Report. Johns Hopkins University Center for Language and Speech Processing Workshop 2005. http://www.clsp.jhu.edu/ws2005/groups/arabic/documents/finalreport.pdf Chiang et al. Parsing Arabic Dialects. To appear in Proc. EACL 2006. This is joint work with O. Rambow, M. Diab, N. Habash, R. Hwa, K. Sima'an, V. Lacey, R. Levy, C. Nichols and S. Shareef. |
| 03 Feb 06 | Alex Fraser |
Measuring Word Alignment Quality for Statistical Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Automatic word alignment plays a critical role in statistical machine translation. Unfortunately the relationship between alignment quality and statistical machine translation performance has not been well understood. In the recent literature the alignment task has frequently been decoupled from the translation task, and assumptions have been made about measuring alignment quality for machine translation which, it turns out, are not justified. In particular, none of the tens of papers published over the last five years has shown that significant decreases in Alignment Error Rate (AER) result in significant increases in translation quality. I will explain this state of affairs and present steps towards measuring alignment quality in a way which is predictive of statistical machine translation quality. I will also provide a brief overview of some of my other work on training and search for word alignment.
|
| 27 Jan |