Non-members of the Natural Language Group may receive seminar announcements by subscribing to the nlg-seminar list.
An iCal feed is available at http://nlg.isi.edu/nl-seminar/nl.ics
| Date | Speaker | Title |
| 20 Nov 09 | Marco Pennacchiotti (Yahoo! Research) |
TBA
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: TBA |
| 04 Dec 09 | Don Metzler (Yahoo! Research) |
TBA
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: TBA |
| 11 Dec 09 | Anselmo Peñas (UNED, Spain) |
TBA
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: TBA |
| Date | Speaker | Title |
| 23 Oct 09 | Steve DeNeefe |
Tree Adjoining Machine Translation (thesis proposal practice talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Tree Adjoining Grammars have well-known advantages but are typically considered too difficult for practical systems. We propose that, when done right, adjoining improves translation quality without becoming computationally intractable. Using adjoining to model optionality allows general translation patterns to be learned without the clutter of endless variations of optional material. The appropriate modifiers can later be spliced in as needed to translate details. In this proposal, we describe challenges encountered by phrase-based and syntax-based machine translation (MT) systems today, and present an in-depth, quantitative comparison of both models. Then, we describe a novel model for statistical MT which addresses these challenges using a Synchronous Tree Adjoining Grammar. We introduce a method of converting these grammars to a weakly equivalent tree transducer for decoding. And we present a method for learning the rules and associated probabilities of this grammar from aligned tree/string training data. Finally, our initial results show that adjoining already delivers an end-to-end improvement of +0.8 BLEU over a baseline statistical syntax-based MT model on a medium-scale Arabic/English MT task. Furthermore, we demonstrate it is a competitive entry in the Urdu-English track of the 2009 NIST MT evaluation. We then propose improvements to the model, decoding, and extraction that promise to allow this new, linguistically-motivated MT model to surpass its syntax-based and phrase-based cousins in a wide range of scenarios and language pairs.
|
| 21 Oct 09 | Douglas W. Oard (Maryland) |
Who 'Dat? Identity resolution in large email collections
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Automated techniques that can support the human activities of search and sense-making in large email collections are of increasing importance for a broad range of uses, including historical scholarship, law enforcement and intelligence applications, and lawyers involved in "e-discovery" incident to civil litigation. In this talk, I'll briefly describe some of the work to date on searching large email collections, and then for most of the talk I will focus on the more challenging task of support for sense-making. Specifically, I'll describe joint work with Tamer Elsayed to automatically resolve the identity of people who are mentioned ambiguously (e.g., just by first name) in a collection of email from a failed corporation (Enron). Our results indicate that for people who are well represented in the collection we can use a generative model to guess the right identity about 80% of the time, and for others we are right about half the time. I'll conclude the talk with a few remarks on our next directions for techniques, evaluation, and additional types of collections to which similar ideas might be applied. About the Speaker: Douglas Oard is an Associate Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies; he is on sabbatical at Berkeley's iSchool for the Fall 2009 semester. Dr. Oard earned his Ph.D. in Electrical Engineering from the University of Maryland, and his research interests center around the use of emerging technologies to support information seeking by end users. His recent work has focused on interactive techniques for cross-language information retrieval and techniques for search and sense-making in conversational media. Additional information is available at http://www.glue.umd.edu/~oard/. |
| 09 Oct 09 | Nandakishore Kambhatla (IBM India) |
Extracting Social Networks and Biographical Facts from Conversational Speech Transcripts
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We present a general framework for automatically extracting social networks and biographical facts from conversational speech. Our approach relies on fusing the output produced by multiple information extraction modules, including entity recognition and detection, relation detection, and event detection modules. We describe the specific features and algorithmic refinements effective for conversational speech. These cumulatively increase the performance of social network extraction from 0.06 to 0.30 for the development set, and from 0.06 to 0.28 for the test set, as measured by F-measure on the ties within a network. The same framework can be applied to other genres of text; we have built an automatic biography generation system for general domain text using the same approach. Brief Bio: Nanda Kambhatla has nearly 17 years of research experience in the areas of Natural Language Processing (NLP), text mining, information extraction, dialog systems, and machine learning. He holds 6 U.S. patents and has authored over 30 publications in books, journals, and conferences in these areas. Nanda holds a B.Tech in Computer Science and Engineering from the Institute of Technology, Benaras Hindu University, India, and a Ph.D. in Computer Science and Engineering from the Oregon Graduate Institute of Science & Technology, Oregon, USA. Currently, Nanda is the manager of the Data Analytics Group at IBM's India Research Lab (IRL), Bangalore. The group is focused on research on machine translation, Natural Language Processing, text analysis and machine learning techniques for developing analytics solutions to help IBM's services divisions. Most recently, Nanda was the manager of the Statistical Text Analytics Group at IBM's T.J. Watson Research Center, the Watson co-chair of the Natural Language Processing PIC, and the task PI for the Language Exploitation Environment (LEE) subtask for the DARPA GALE project. He has been leading the development of information extraction tools/products and his team has achieved top tier results in successive Automatic Content Extraction (ACE) evaluations conducted by NIST for extracting entities, events and relations from text from multiple sources, in multiple languages and genres. Earlier in his career, Nanda worked on natural language web-based and spoken dialog systems at IBM. Before joining IBM, he worked on information retrieval and filtering algorithms as a senior research scientist at WiseWire Corporation, Pittsburgh, and on image compression algorithms as a postdoctoral fellow under Prof. Simon Haykin at McMaster University, Canada. Nanda's research interests are focused on NLP and technology solutions for creating, storing, searching, and processing large volumes of unstructured data (text, audio, video, etc.) and specifically on applications of statistical learning algorithms to these tasks. |
| 11 Sep 09 | David Chiang |
Tutorial on HPC
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This tutorial will be a short introduction to using the Linux cluster at USC's High-Performance Computing (HPC) facility. Topics will include: (1) basics of starting jobs on the cluster using Torque/PBS, (2) dealing with common problems like jobs not starting or spontaneously dying, (3) maximizing the performance of your jobs (both yours and other people's), e.g., using the correct filesystem and tuning it for better speed, (4) embarrassingly parallel processing and poor-man's workflows. It will NOT cover Hadoop, MPI, or real workflow management tools like Condor.
|
| 28 Aug 09 |
Adam Pauls (UC Berkeley) Michael Auli (Edinburgh) |
Intern Final Talks
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Tree-to-String Alignment Models. Machine translation systems typically rely on some form of alignment as a preprocessing step. Typically, these alignments take the form of word-to-word alignments. In this talk, we will introduce several models aimed at aligning foreign words to either English words or nodes in the English parse tree. Such word-to-node alignments offer several potential advantages over traditional word-to-word alignments. Firstly, since the extraction process for some syntactic systems explicitly considers the English trees, we expect that also considering the trees at alignment time will produce alignments that will better suit the extraction process. Secondly, aligning foreign function words to English tree nodes can admit highly desirable syntactic transfer rules which cannot be directly expressed as word-to-word alignments. Finally, word-to-node alignments can effectively model many-to-one alignments. We present four models of increasing complexity and show preliminary results for each model.
|
| 27 Aug 09 |
Erica Greene (Haverford) Paramveer Dhillon (Penn) |
Intern Final Talks
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: TALK 1: Erica Greene Title: A Statistical Foray into Poetry Abstract: Although the analysis and generation of poetry is often considered an exclusively human task, we have taken some initial steps to automate the process. We build a series of finite state transducers to analyze poetic meter and train them on a handmade corpus of poetry. We then use these trained transducers to generate poetry. Specifically, we focus on generating sonnets and limericks. ------------------------------------------ TALK 2: Paramveer Dhillon Title: Learning to simplify target language for MT + Unsupervised log-linear models for Word Alignment Abstract: We consider the Machine Translation task for the Chinese-English language pair, where English is the target language. There are many redundancies in the English language: it has capitalization, i.e., the first word of each sentence is capitalized, and it has different morphology, i.e., noun and verb endings, neither of which is present in Chinese. In a way, due to these redundancies, we are learning that a single Chinese word "tamen" translates to "They" and "they" and another Chinese word translates to "run", "runs" and "running". We present generative models which learn to "cluster" the target language vocabulary by removing the above redundancies, namely capitalization and different morphology. We show results on how this "clustering" affects the translation quality in end-to-end MT experiments. In the last part of the talk, I will talk about using unsupervised log-linear (discriminative) models for improving word alignments. There are very few precedents of using discriminative models for word alignment in totally unsupervised settings. (Taskar et al. '05) and (Lacoste-Julien et al. '06) used maximum weight bipartite matching in a "nearly" unsupervised setting and (Blunsom et al. '06) used CRFs for supervised word alignment. We use log-linear models in totally unsupervised settings to do word alignments. Specifically, we use Contrastive Estimation (Smith et al. '05) to shift the probability mass to the correct set of alignments from a well-chosen "neighborhood" of those alignments. In the end I will show some preliminary word alignment results using our approach. |
| 26 Aug 09 | Sujith Ravi |
Natural Language Decipherment: Solving Problems in Natural Language Processing without Labeled Data (Thesis Proposal practice talk)
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: A wide variety of problems in NLP require parallel data to train supervised models to perform different tasks. For example, in machine translation (where the task is to translate between two languages automatically) parallel data containing source/target language sentence pairs is required to train various models which can then be used to translate new sentences or documents. The dependency on parallel data for many of these NLP tasks limits their applications to specific domains, or language pairs for which a lot of training data is readily available. On the other hand, collecting parallel data for new domains, language pairs, etc. is a costly as well as time-intensive operation. For such tasks, the development of novel unsupervised approaches which require only non-parallel data for training can enable their application to new domains and potentially broaden the impact and benefits of NLP research to wider areas. A similar problem has been tackled by cryptographers and archaeologists in a different context, for "decipherment" purposes. During the 1940's and 1950's, mathematicians and scientists worked on code-breaking operations, which spurred the development of many research ideas for modern computer science. For such problems, it is unrealistic to assume the availability of parallel data relating the ciphertext and plaintext, yet cryptographers and archaeologists have attempted to solve such tasks using various decipherment techniques along with other non-parallel sources of information. In this thesis proposal practice talk, I will show how we combine the two ideas (decipherment and unsupervised learning for NLP problems) and present a unified decipherment-based approach for modeling a wide range of problems in NLP. 
Instead of relying on parallel data, I propose to use alternate sources of linguistic knowledge and large quantities of readily available monolingual data to induce strong bilingual connections in problems such as machine transliteration and translation. The talk will describe how various NLP problems such as unsupervised part-of-speech tagging, word alignment, transliteration, and machine translation can be formulated as decipherment tasks. I will present decipherment algorithms for tackling many of these problems and show that it is possible to achieve good results for many problems of interest in NLP without using any parallel data at all. |
| 21 Aug 09 | Liang Huang |
Bilingually-Constrained (Monolingual) Shift-Reduce Parsing
Time: 3:00 pm - 4:15 pm Location: 4th Floor Conference Room Abstract: Jointly parsing two languages has been shown to improve accuracies on either or both sides. However, its search space is much bigger than the monolingual case, forcing existing approaches to employ complicated modeling and crude approximations. Here we propose a much simpler alternative, bilingually-constrained monolingual parsing, where a source-language parser learns to exploit reorderings as additional observations, without bothering to build the target-side tree as well. We show specifically how to enhance a shift-reduce dependency parser to use alignment features to resolve shift-reduce conflicts. Experiments on the bilingual portion of Chinese Treebank show that, with just 3 bilingual features, we can improve parsing accuracies by 0.6% for both English and Chinese, with negligible (~6%) efficiency overhead, thus much faster than biparsing. http://www.cis.upenn.edu/~lhuang3/biparsing.pdf |
| 24 Jul 09 |
Adam Pauls (UC Berkeley) Ulf Hermjakob |
Practice talks for EMNLP
Time: 3:00 pm - 4:15 pm Location: 11 Large Abstract: K-Best A* Parsing (Adam Pauls) A* parsing makes 1-best search efficient by suppressing unlikely 1-best items. Existing k-best extraction methods can efficiently search for top derivations, but only after an exhaustive 1-best pass. We present a unified algorithm for k-best A* parsing which preserves the efficiency of k-best extraction while giving the speed-ups of A* methods. Our algorithm produces optimal k-best parses under the same conditions required for optimality in a 1-best A* parser. Empirically, optimal k-best lists can be extracted significantly faster than with other approaches, over a range of grammar types. ------------------------------------------ Improved Word Alignment with Statistics and Linguistic Heuristics (Ulf Hermjakob) We present a method to align words in a bitext that combines elements of a traditional statistical approach with linguistic knowledge. We demonstrate this approach for Arabic-English, using an alignment lexicon produced by a statistical word aligner, as well as linguistic resources ranging from an English parser to heuristic alignment rules for function words. These linguistic heuristics have been generalized from a development corpus of 100 parallel sentences. Our aligner, UALIGN, outperforms both the commonly used GIZA++ aligner and the state-of-the-art LEAF aligner on F-measure and produces superior scores in end-to-end statistical machine translation, +1.3 BLEU points over GIZA++, and +0.7 over LEAF.
|
| 23 Jul 09 | Mark Hopkins (Language Weaver) |
Cube Pruning as Heuristic Search (Practice talk for EMNLP)
Time: 3:00 pm - 3:45 pm Location: 11 Large Abstract: Cube pruning is a fast inexact method for generating the items of a beam decoder. Here we show that cube pruning is essentially equivalent to A* search on a specific search space with specific heuristics. We use this insight to develop faster and exact variants of cube pruning.
|
| 17 Jul 09 | Paramveer Dhillon (Penn) |
Transfer Learning for WSD & Non-local constraints for Named Entity Recognition
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This talk will be divided into two parts. In the first part I will talk about using Transfer Learning techniques to improve the task of Word Sense Disambiguation (WSD). Usually in supervised WSD, we suffer from a paucity of labeled data, as there are some words that occur less frequently in the data and it is very difficult to get enough labeled data for these words. In such cases it is very difficult to build high accuracy supervised learning models for these words. So, we propose an approach called TransFeat (based on the MDL principle) which "transfers information" from similar words in the form of a feature relevance prior to get improved accuracies on these rare words. Besides this, our experiments show that we also get a decent improvement in accuracy for words that have more labeled data available. TransFeat gives accuracies that are in the worst case comparable to state-of-the-art on ONTONOTES and SENSEVAL-2 datasets. In the second part of the talk I will talk about incorporating non-local constraints in Named Entity Recognition (NER) systems. The main idea is that some linguistic constraints (e.g., every occurrence of the word "Einstein" in the data should have the tag PER, i.e., person) are very useful and can give improved performance, but they are non-local and hence are intractable and cannot be efficiently modeled using state-of-the-art sequence modeling methods like CRFs. Though people have used Skip-chain CRFs (with Loopy BP) (Sutton and McCallum '04) and Gibbs Sampling (Finkel and Manning '05) to enforce these non-local constraints, these turn out to be really inefficient and custom-tailored to one particular kind of constraint (say, consistency constraints of the type mentioned above). We propose a constrained version of EM in which a general set of constraints (not limited to consistency constraints!) can be incorporated into the model. 
In the end I will show some results of this approach on CoNLL 03 English and CoNLL 02 Spanish NER shared tasks. |
| 16 Jul 09 | Yang Liu (ICT China) |
Weighted Alignment Matrices for Statistical Machine Translation
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: Current statistical machine translation systems usually extract rules from bilingual corpora annotated with 1-best alignments. They are prone to learn noisy rules due to alignment mistakes. We propose a new structure called weighted alignment matrix to encode all possible alignments for a parallel text compactly. The key idea is to assign a probability to each word pair to indicate how well they are aligned. We design new algorithms for extracting phrase pairs from weighted alignment matrices and estimating their probabilities. Our experiments on multiple language pairs show that using weighted matrices achieves consistent improvements over using n-best lists in significantly less extraction time. About the speaker: Yang Liu is an Assistant Researcher at the Institute of Computing Technology (ICT), Chinese Academy of Sciences. He received his PhD degree in Computer Science from ICT in 2007. His major research interests include statistical machine translation and Chinese information processing. He has been working on syntax-based modeling, word alignment, and system combination. His paper on tree-to-string translation won the Meritorious Asian NLP Paper Award of COLING/ACL 2006. He has served as a reviewer for TALIP, TSLP, JNLE, ACL, EMNLP, AMTA, and SSST.
|
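The idea of assigning a probability to each word pair can be illustrated with a small sketch. It assumes, purely for illustration, that the matrix is filled from a list of candidate alignments with scores, each link receiving the normalized score mass of the candidates that contain it; the function name `weighted_alignment_matrix` and the input format are hypothetical and are not taken from the talk.

```python
def weighted_alignment_matrix(src_len, tgt_len, candidates):
    """Build a src_len x tgt_len matrix of link probabilities.

    candidates: list of (links, score) pairs, where links is a set of
    (source_index, target_index) tuples and score is a positive weight.
    This input format is an assumption made for the sketch.
    """
    total = sum(score for _, score in candidates)
    matrix = [[0.0] * tgt_len for _ in range(src_len)]
    for links, score in candidates:
        for i, j in links:
            # Each link collects the normalized mass of every
            # candidate alignment that contains it.
            matrix[i][j] += score / total
    return matrix


# Two candidate alignments for a 2-word source and 2-word target.
candidates = [
    ({(0, 0), (1, 1)}, 0.6),
    ({(0, 0), (1, 0)}, 0.2),
]
matrix = weighted_alignment_matrix(2, 2, candidates)
```

In this toy case the link (0, 0) appears in both candidates and so receives all of the mass, while the two competing links for source word 1 share it in proportion to their scores.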
| 15 Jul 09 | Yang Liu (ICT China) |
An Overview of Tree-to-String Translation Models
Time: 4:00 pm - 5:00 pm Location: 11 Large Abstract: Recent research on statistical machine translation has led to the rapid development of syntax-based translation models, which exploit syntactic information to direct translation. In this talk, I will give an overview of tree-to-string translation models, one of the state-of-the-art syntax-based models. In a tree-to-string model, the source side is a phrase structure parse tree and the target side is a string. This talk includes the following topics: (1) tree-based tree-to-string model, (2) tree-sequence based tree-to-string model, (3) forest-based tree-to-string model, and (4) context-aware tree-to-string model. Experimental results show that the forest-based tree-to-string system outperforms Hiero significantly on Chinese-to-English translation. About the speaker: Yang Liu is an Assistant Researcher at the Institute of Computing Technology (ICT), Chinese Academy of Sciences. He received his PhD degree in Computer Science from ICT in 2007. His major research interests include statistical machine translation and Chinese information processing. He has been working on syntax-based modeling, word alignment, and system combination. His paper on tree-to-string translation won the Meritorious Asian NLP Paper Award of COLING/ACL 2006. He has served as a reviewer for TALIP, TSLP, JNLE, ACL, EMNLP, AMTA, and SSST.
|
| 10 Jul 09 | Kevin Knight |
Excerpts from ACL-09 Tutorial on "Topics in Machine Translation"
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Philipp Koehn and I will do a machine translation tutorial at ACL. Instead of an introductory tutorial, we'll do short 15-minute segments on various hot topics in MT research. For the ISI NL seminar, I'll present 3 or 4 of those topics, determined by audience vote. |
| 26 Jun 09 | Steve DeNeefe |
Synchronous Tree Adjoining Machine Translation (Practice talk for EMNLP)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Tree Adjoining Grammars have well-known advantages, but are typically considered too difficult for practical systems. We demonstrate that, when done right, adjoining improves translation quality without becoming computationally intractable. Using adjoining to model optionality allows general translation patterns to be learned without the clutter of endless variations of optional material, with extra information spliced in as needed. In this paper, we describe a novel method for learning a type of Synchronous Tree Adjoining Grammar and associated probabilities from aligned tree/string training data. We introduce a method of converting these grammars to a weakly equivalent tree transducer for efficient decoding. Finally, we show that adjoining results in an end-to-end improvement of +0.8 BLEU over a baseline statistical syntax-based MT model on a large-scale Arabic/English MT task. |
| 19 Jun 09 | Adam Pauls (UC Berkeley) |
Hierarchical Search for Parsing (and Machine Translation)
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Both coarse-to-fine and A* parsing use simple grammars to guide search in complex ones. We compare the two approaches in a common, agenda-based framework, demonstrating the tradeoffs and relative strengths of each method. Overall, coarse-to-fine is much faster for moderate levels of search errors, but below a certain threshold A* is superior. In addition, we present the first experiments on hierarchical A* parsing, in which computation of heuristics is itself guided by meta-heuristics. Multi-level hierarchies are helpful in both approaches, but are more effective in the coarse-to-fine case because of accumulated slack in A* heuristics. |
| 29 May 09 | Marta Recasens Potau (Universitat de Barcelona) |
Learning-based Coreference Resolution for Spanish and Catalan
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The task of coreference resolution identifies those expressions in a text that point to the same discourse entity. Natural language applications such as information extraction, question answering and machine translation can greatly benefit from its output (the different pieces of information in connection with the same entity are linked, pronouns are disambiguated, etc.). The task is extremely complex since a number of knowledge sources come into play, from morphology to discourse structure and world knowledge. In this talk I present the results of my PhD research up to now, including the development of two 400k-word corpora for Spanish and Catalan (AnCora) annotated at various levels (morphology, syntax, semantics, pragmatics), a 100k-word corpus for English, and a series of experiments towards building a learning-based coreference resolution system. More specifically, I'll discuss issues concerning the definition of the annotation scheme, the selection of features for machine learning, the effect of sample selection, and I'll introduce CISTELL, the new learning-approach we propose for coreference resolution. |
| 22 May 09 |
Victoria Fossum Dirk Hovy |
Practice talks for NAACL HLT
Time: 3:00 pm - 4:00 pm Location: 11th flr CR Abstract: Combining Constituent Parsers (Victoria Fossum: 3:00pm -- 3:30pm) Combining the 1-best output of multiple parsers via parse selection or parse hybridization improves f-score over the best individual parser (Henderson and Brill, 1999; Sagae and Lavie, 2006). We propose three ways to improve upon existing methods for parser combination. --------------------------------------------------------- Disambiguation of Preposition Sense Using Linguistically Motivated Features (Dirk Hovy: 3:30pm -- 4:00pm) Classifying polysemous words into their proper sense classes is potentially useful to any NLP application that needs to extract information from text or build a semantic representation of the textual information. Like instances of other word classes, many prepositions are ambiguous, carrying different semantic meanings (including notions of instrumental, accompaniment, location, etc.) In this paper, we present a supervised classification approach for disambiguation of preposition senses. We use the SemEval 2007 Preposition Sense Disambiguation datasets to evaluate our system and compare its results to those of the systems participating in the workshop. We derived linguistically motivated features from both sides of the preposition. Instead of restricting these to a fixed window size, we utilized the phrase structure. Testing with five different classifiers, we can report an increased accuracy (76.4%) that outperforms the best system in the SemEval task. |
| 15 May 09 | David Chiang |
Practice talks for NAACL HLT
Time: 3:00 pm - 4:00 pm Location: 4th flr CR Abstract: 11,001 New Features for Statistical Machine Translation (David Chiang) - Winner of Best Paper Award at NAACL/HLT 2009 We use the Margin Infused Relaxed Algorithm of Crammer et al. to add a large number of new features to two machine translation systems: the Hiero hierarchical phrase based translation system and our syntax-based translation system. On a large-scale Chinese-English translation task, we obtain statistically significant improvements of +1.5 BLEU and +1.1 BLEU, respectively. We analyze the impact of the new features and the performance of the learning algorithm. |
| 14 May 09 | Sujith Ravi |
Practice talks for NAACL HLT
Time: 3:00 pm - 4:00 pm Location: 4th flr CR Abstract: Talk-1: Learning Phoneme Mappings for Transliteration without Parallel Data We present a method for performing machine transliteration without any parallel resources. We frame the transliteration task as a decipherment problem and show that it is possible to learn cross-language phoneme mapping tables using only monolingual resources. We compare various methods and evaluate their accuracies on a standard name transliteration task. This is joint work with Kevin Knight. ---------------------------------------------------- Talk-2: A New Objective Function for Word Alignment We develop a new objective function for word alignment that measures the size of the bilingual dictionary induced by an alignment. A word alignment that results in a small dictionary is preferred over one that results in a large dictionary. In order to search for the alignment that minimizes this objective, we cast the problem as one of integer linear programming. We then extend our objective function to align corpora at the sub-word level, which we demonstrate on a small Turkish-English corpus. This is joint work with Tugba Bodrumlu and Kevin Knight.
|
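The dictionary-size objective from Talk-2 above can be sketched in a few lines. This is only an evaluation of the objective for a given alignment, under the assumption that an alignment is a set of (source index, target index) links per sentence pair; it does not reproduce the integer linear programming search described in the abstract, and the name `dictionary_size` is hypothetical.

```python
def dictionary_size(corpus, alignments):
    """Count the distinct (source word, target word) pairs induced by an alignment.

    corpus: list of (source_words, target_words) sentence pairs.
    alignments: one set of (source_index, target_index) links per pair.
    """
    dictionary = set()
    for (src, tgt), links in zip(corpus, alignments):
        for i, j in links:
            dictionary.add((src[i], tgt[j]))
    return len(dictionary)


corpus = [
    (["la", "casa"], ["the", "house"]),
    (["la", "mesa"], ["the", "table"]),
]
# Consistent alignment: "la" is linked to "the" in both sentences.
consistent = [{(0, 0), (1, 1)}, {(0, 0), (1, 1)}]
# Inconsistent alignment: the second sentence is aligned crosswise.
crossed = [{(0, 0), (1, 1)}, {(0, 1), (1, 0)}]

small = dictionary_size(corpus, consistent)
large = dictionary_size(corpus, crossed)
```

The consistent alignment reuses the pair ("la", "the") and therefore induces the smaller dictionary, which is the alignment the objective prefers.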
| 08 May 09 | Andrew Kehler (UCSD) |
Coherence and the (Psycho-) Linguistics of Pronoun Interpretation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: More than three decades of research has sought to uncover the principles that determine how hearers interpret pronouns in context. This work has focused predominantly on identifying so-called 'preferences' or 'heuristics' that hearers utilize based on linguistic properties of antecedent expressions. This focus is a departure from the type of approach outlined in Hobbs (1979), which argues that the mechanisms that drive pronoun interpretation are driven predominantly by semantics, world knowledge, and inference, with particular reference to how these are used to establish the coherence of discourses. In this talk, I report on new experimental evidence in support of a coherence-driven analysis, and describe how the analysis can accommodate a range of previous findings suggestive of conflicting preferences and biases. Case studies of four commonly-cited preferences are described, specifically (i) the parallel grammatical role preference (e.g., Smyth 1994), (ii) thematic role preferences (e.g., Stevenson et al. 1994), (iii) implicit causality biases (e.g., Caramazza et al. 1977), and (iv) the subject assignment strategy (e.g., Crawley et al. 1990). In each case, the experimental results offer an explanation of what the underlying source of the bias is, and predict in what contexts evidence for it will surface. These results suggest that pronoun interpretation is incrementally influenced in part by the probabilistic expectations that hearers have about how the discourse will be coherently continued. They are also argued to leave various myths by the roadside, e.g., that pronoun interpretation can be profitably thought of as a 'search and match' procedure, and that coherence relations need not be controlled for in experimental stimuli. This talk includes joint work with Laura Kertz, Hannah Rohde, and Jeffrey Elman.
|
| 17 Apr 09 | Rahul Bhagat |
Learning Paraphrases from Text (Ph.D. Defense practice talk)
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Paraphrases are textual expressions that convey the same meaning using different surface forms. Capturing the variability of language, they play an important role in many natural language applications including question answering, machine translation, and multi-document summarization. In linguistics, paraphrases are characterized by approximate conceptual equivalence. Since no automated semantic interpretation systems available today can identify conceptual equivalence, paraphrases are difficult to acquire without human effort. The aim of this thesis is to develop methods for automatically acquiring and filtering phrase-level paraphrases using a monolingual corpus. Noting that the real world uses far more quasi-paraphrases than the logically equivalent ones, we first present a general typology of quasi-paraphrases together with their relative frequencies, which is, to our knowledge, the first such typology. We then present a method for automatically learning the contexts in which quasi-paraphrases obtained from a corpus are mutually replaceable. Knowing that quasi-paraphrases are often inexact because they contain semantic implications which can be directional, we present an algorithm called LEDIR to learn the directionality of quasi-paraphrases. Since semantic classes play a crucial role in our work, we also investigate the use of a semi-supervised clustering algorithm for learning semantic classes. We next investigate the task of learning surface paraphrases, i.e., paraphrases that do not require the use of any syntactic interpretation. Since one would need a very large corpus to find enough surface variations, we use a really large but unprocessed corpus of 150GB (25 billion words) obtained from Google News to do this learning. We show that these paraphrases can be used to learn surface patterns for relation extraction. Finally, we use paraphrases to learn patterns for domain-specific information extraction. 
Thus, in this thesis we define quasi-paraphrases, present methods to learn them from a corpus, and show that quasi-paraphrases are useful for information extraction.
|
| 27 Mar 09 | David Chiang |
Tutorial on Hadoop
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Hadoop is an open-source implementation of the Map/Reduce framework introduced by Google Research. It is a simple abstraction for describing parallelizable algorithms that admits very efficient execution: in one case, one of my (poorly implemented) algorithms was improved from a typical runtime of 72 hours to 3 hours. I will give a short introduction to Hadoop that is highly colored by my experiences with it and the likely experiences of other natural language processing researchers at ISI. I will show how to run Hadoop on HPC, how to use Hadoop Streaming (which allows implementation in any language you choose), and how to define Map/Reduce algorithms for a few incarnations of a typical NLP task, relative-frequency estimation of a large probability distribution. Input from others who are more experienced with Hadoop than I am is welcome! |
| 19 Mar 09 | Rutu Mulkar |
Discovering Causal and Temporal Relations in Biomedical Texts (practice talk for AAAI Spring Symposium)
Time: 2:00 pm - 2:30 pm Location: 4th floor CR Abstract: In previous work on "Learning by Reading" we successfully extracted entities, states and events from technical natural language descriptions of processes. The research described here is aimed at the automatic discovery of causal and temporal ordering relations among states and events, specifically, among molecular and other events in biomedical articles. We have annotated causal and temporal relations in articles on the cell cycle, and we discuss our annotation guidelines and the issue of inter-annotator agreement. We then describe the natural language parsing and the inference system we use to extract these relations. We have created axioms manually for this system, focusing on temporal, causal and aspectual information and we have used semi-automatic means to augment these axioms. We have evaluated the performance of this system, and the results are promising. |
| 06 Mar 09 | Andreas Maletti |
Minimizing Deterministic Weighted Tree Automata
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Weighted tree automata are equivalent to weighted tree grammars, which can be used, for example, to easily model weighted context-free grammars. In contrast to context-free grammars, tree automata work directly on a tree representation and not on strings. We will introduce weighted tree automata and review the important results on minimization of them. For example, it is known that deterministic devices over commutative semifields (commutative semirings with multiplicative inverses) can be effectively minimized. In the main part of the talk, we present the first efficient algorithm for this minimization. If the operations can be performed in constant time, then our algorithm constructs an equivalent minimal (with respect to the number of states) deterministic automaton in time linear in the maximal rank of the input symbols, the number of (useful) transitions, and the number of states of the input automaton.
|
| 27 Feb 09 | Carlos Busso (USC) |
Multimodal Processing of Human Behavior in Intelligent Instrumented Spaces: A Focus on Expressive Human Communication
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Advances in technologies to capture and process multimedia signals are enabling new opportunities for understanding and modeling human behavior, and designing new human-centered applications. Intelligent environments equipped with a range of audio-visual sensors provide suitable means for automatically monitoring and tracking the behavior, strategies and engagement of the participants in multiperson interactions such as meetings, at various levels of interest. We describe a case study of a "Smartroom" being developed at USC in which high-level features are calculated from active speaker segmentations, automatically annotated by our system, to infer the interaction dynamics between the participants. The results show that it is possible to accurately estimate in real-time not only the flow of the interaction, but also how dominant and engaged each participant was during the discussion. Additionally, we describe analysis of human expressive behavior that can be afforded by such audio-visual data. We describe an analysis of the interrelation between facial gestures and speech using a multimodal approach. Using a controlled setting, motion capture technology was used to simultaneously acquire speech and detailed facial information. Our results indicate that the verbal and non-verbal channels of human communication are internally and intricately connected. The interplay is observed across the different communication channels such as various aspects of speech, facial expressions, and movements of the hands, head and body, and is greatly affected by the linguistic and emotional content of the message being communicated. As a result of the analysis, applications in automatic emotion recognition and synthesis of expressive communication are presented. [This research was supported in part by funds from the NSF, NIH, and the Department of the Army]
|
| 13 Feb 09 | Joseph Tepperman (Signal Analysis and Interpretation Laboratory, USC) |
Estimating Subjective Judgments of Speech on Multiple Levels
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: People make explicit subjective judgments of speech when doing things like tutoring students in a foreign language, or testing a child's reading skills. On what do we base these judgments, and how can they be made automatically? The "quality" of speech does not exist on any one scale alone, and is not limited strictly to pronunciation - it is manifested through a multiplicity of simultaneous and interacting cues of various sizes. In this talk I'll discuss modeling strategies for categorical pronunciation on several scales, cognitive models for estimating student knowledge demonstrated through speech, and applications in the fields of education and speech synthesis.
|
| 30 Jan 09 | Kevin Knight |
Sixty Years of Statistical Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This high-level survey will describe the results of statistical machine translation (SMT) research since 1948. Part of the survey will cover the explosion of work in the past few years that has resulted from intense interest on the part of scientists, funders, and industry. We will also examine the roots of SMT in World War II decipherment activities. Some of the concepts from that era have become core to the field, while others still remain to be picked up. |
| 23 Jan 09 | Roger Levy (UCSD) |
Noise and memory in rational human language comprehension
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Considering the adversity of the conditions under which linguistic communication takes place in everyday life---ambiguity of the signal, environmental competition for our attention, speaker error, our limited memory, and so forth---it is perhaps remarkable that we are as successful at it as we are. Perhaps the leading explanation of this success is that (a) the linguistic signal is redundant, (b) diverse information sources are generally available that can help us infer the intended message (or something close enough) when comprehending an utterance, and (c) we use these diverse information sources very quickly and to the fullest extent possible. This explanation can be thought of as treating language comprehension as a rational, evidential process. Nevertheless, there are a number of prominent phenomena reported in the sentence processing literature that remain clear puzzles for the rational approach. In this talk I address three such phenomena, whose common underlying thread is an apparent failure to use information available in a sentence appropriately in global or incremental inferences about the correct interpretation of a sentence. I argue that the apparent puzzle posed by these phenomena for models of rational sentence comprehension may derive from the failure of existing models to appropriately account for the environmental and cognitive constraints---namely, noisy input and limited memory---under which comprehension takes place. I present two new probabilistic models of language comprehension under noisy input and limited memory, and show that these models lead to solutions to the above puzzles. More generally, these models suggest how appropriately accounting for environmental and cognitive constraints can lead to a more nuanced and ultimately more satisfactory picture of key aspects of human cognition. |
| 17 Dec 08 | Liang Huang (UPenn => Google Research) |
Tree-based and Forest-based Translation
Time: 3:00 pm - 4:00 pm Location: 4th Floor CR Abstract: What is in common, and what is different, between translating from English to Chinese and compiling C++ into machine code? In this talk I will first introduce a tree-based (aka syntax-directed) paradigm for machine translation, inspired by both human translators and compilers. In this paradigm, a source language sentence is first parsed into a syntactic tree, which is then recursively converted into a target language sentence via tree-to-string transformation rules. Since the translation process is driven by the syntax, this approach resembles the classical "syntax-directed translation" method in compiling theory. However, natural languages are crucially different from programming languages in that they are fundamentally ambiguous. So we don't (and will probably never) have perfect parsers, and parsing errors adversely affect translation quality. To alleviate this problem, an obvious idea is to use the top-k parses, rather than a single 1-best, but this only helps a little bit due to the limited scope of the k-best list. We instead propose a "forest-based approach", which translates a packed forest encoding *exponentially* many parses in a compact (polynomial) space by sharing common subtrees. Large-scale experiments showed very significant improvements (over the 1-best baseline) in terms of translation quality, which outperforms the best reported systems to date. More interestingly, translating a forest of millions of trees is even faster than translating on top-30 individual trees thanks to dynamic programming. This talk includes joint work with Kevin Knight and Aravind Joshi (first part), and with Haitao Mi and Qun Liu (second/third parts). Short Bio: Liang Huang recently completed his PhD study at the University of Pennsylvania, co-supervised by Aravind Joshi and Kevin Knight (USC/ISI). 
He is mainly interested in the theoretical aspects of computational linguistics, in particular, efficient algorithms in parsing and machine translation, generic dynamic programming, and formal properties of synchronous grammars. His thesis develops a set of "forest-based methods" that have been applied to many problems in NLP including k-best parsing, forest rescoring and reranking, and forest-based translation. His awards include an Outstanding Paper Award at ACL 2008, and a University Teaching Award at Penn in 2005. http://www.cis.upenn.edu/~lhuang3/ |
| 07 Nov 08 | Daniel Marcu |
The best/worst Speech Recognition, Language Modeling, and Machine Translation ideas
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: A group of 60 researchers have been asked to comment on what they perceive to be - the most important contributions in the fields of speech recognition, language modeling, and machine translation; - past ideas that failed to lead to substantial improvements; - and contributions that are most likely to have a material impact in the future. This talk summarizes the perceptions and trends identified in the collection of answers provided by the researchers. |
| 17 Oct 08 | Jens Voeckler |
Parsing XRS with(out) regular expressions
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: If you ever needed to extract information, e.g. LHS, RHS words, features, etc., from XRS rules, this talk is for you. Over the years, a variety of regular expressions have been used to obtain data from XRS rules. However, in light of recent pipeline efforts, the copy-n-paste culture led to expressions that were sometimes too complex for the task at hand, unnecessarily slowing down processing steps, or too trivial to work correctly on boundary cases. A unified effort by Steve, David, Wei, Michael and Jens culminated in the NLPRules module for Perl. While the talk centers on the Perl module, and some surprising benchmark results, any language supporting libpcre (Perl-compatible regular expressions) will benefit from the insights, and from knowing the right regular expression for the task at hand.
|
| 14 Oct 08 | Victoria Fossum + David Chiang |
Practice talks for AMTA/EMNLP
Time: 3:00 pm - 4:15 pm Location: 11 Large Abstract: Using Bilingual Chinese-English Word Alignments to Resolve PP-Attachment Ambiguity in English (practice talk for AMTA) Errors in English parse trees impact the quality of syntax-based MT systems trained using those parses. Frequent sources of error for English parsers include PP-attachment ambiguity, NP-bracketing ambiguity, and coordination ambiguity. Not all ambiguities are preserved across languages. We examine a common type of ambiguity in English that is not preserved in Chinese: given a sequence "VP NP PP", should the PP be attached to the main verb, or to the object noun phrase? We present a discriminative method for exploiting bilingual Chinese-English word alignments to resolve this ambiguity in English. On a heldout test set of Chinese-English parallel sentences, our method achieves 86.3% accuracy on this PP-attachment disambiguation task, an improvement of 4% over the accuracy of the baseline Collins parser (82.3%). Online Large-Margin Training of Syntactic and Structural Translation Features (practice talk for EMNLP) Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik's soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. 
Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 Bleu on a subset of the NIST 2006 Arabic-English evaluation data. (Joint work with Yuval Marton and Philip Resnik)
|
| 10 Oct 08 | Sujith Ravi + Steve DeNeefe |
Practice talks for AMTA/EMNLP
Time: 3:00 pm - 4:15 pm Location: 11 Large Abstract: Automatic Prediction of Parser Accuracy (practice talk for EMNLP) Statistical parsers have become increasingly accurate, to the point where they are useful in many natural language applications. However, estimating parsing accuracy on a wide variety of domains and genres is still a challenge in the absence of gold-standard parse trees. We propose a technique that automatically takes into account certain characteristics of the domains of interest, and accurately predicts parser performance on data from these new domains. As a result, we have a cheap (no annotation involved) and effective recipe for measuring the performance of a statistical parser on any given domain. (Joint work with Kevin Knight and Radu Soricut)
Overcoming Vocabulary Sparsity in MT Using Lattices (practice talk for AMTA) Source languages with complex word formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation. (Joint work with Ulf Hermjakob and Kevin Knight)
|
| 26 Sep 08 | Eugene Charniak (Brown University) |
EM Works for Pronoun-Anaphora Resolution
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: EM (the Expectation Maximization Algorithm) is a well known technique for unsupervised learning (where one does not have any hand labeled solutions available, but instead one must learn from the raw text). Unfortunately EM is known to fail to find good solutions in many (most?) applications on which it is tried. In this talk we present some recent work on using EM to learn how to resolve pronoun-anaphora: determining that "the dog" is the antecedent of "he" and "his" in "When Sally fed the dog he wagged his tail". For this application EM works strikingly well, determining tens of thousands of parameters and resulting in a program that probably produces state of the art results, although because this is preliminary work, and pronoun-anaphora has no standard evaluation metrics, this is just a guess. About the Speaker: Eugene Charniak is Professor of Computer Science and Cognitive Science at Brown University. He received an A.B. degree in Physics from University of Chicago and a Ph.D. from M.I.T. in Computer Science. He has published four books: Computational Semantics, with Yorick Wilks (1976); Artificial Intelligence Programming (now in a second edition) with Chris Riesbeck, Drew McDermott, and James Meehan (1980, 1987); Introduction to Artificial Intelligence with Drew McDermott (1985); and Statistical Language Learning (1993). He is a Fellow of the American Association of Artificial Intelligence and was previously a Councilor of the organization. His research has always been in the area of language understanding or technologies which relate to it, such as knowledge representation, reasoning under uncertainty, and learning. Over the last few years he has been interested in statistical techniques for language understanding. 
His research in this area has included work in the subareas of part-of-speech tagging, probabilistic context-free grammar induction, and, more recently, syntactic disambiguation through word statistics, efficient syntactic parsing, and lexical resource acquisition through statistical means.
|
| 19 Sep 08 | Fei Sha (USC) |
Large margin based parameter estimation for hidden Markov models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: In many application domains, we face the task of characterizing the distribution of continuous random variables. For instance, in automatic speech recognition (ASR), these variables are acoustic properties of speech signals. For such tasks, Gaussian mixture models (GMMs) are widely used as a very effective density estimator. Particularly, in the context of ASR, they are embedded in continuous-density hidden Markov models (CD-HMMs) to yield emission probabilities, i.e., the likelihoods of acoustic observations conditioned on hidden states such as phonemes. Meanwhile, the transition probabilities in CD-HMMs attempt to capture temporal properties of speech signals. Similar modeling choices arise in other applications, for instance, in activity recognition. Various techniques have been developed to estimate the parameters of CD-HMMs. In particular, discriminative techniques such as conditional maximum likelihood and minimum classification error have attracted significant research attention. When carefully and skillfully implemented, they often lead to lower error rates (in speech recognition) than traditional techniques of maximum likelihood estimation. In this talk, I will describe a new discriminative technique that is based on the principle of large margin, a key framework in many machine learning algorithms including support vector machines and boosting. The new technique differs from previous discriminative methods for ASR in the goal of margin maximization. In particular, in our large margin training of CD-HMMs, model parameters are optimized to maximize the gap (or the margin) between correct and incorrect classifications. I will present an extensive empirical evaluation of our approach on two benchmark problems in speech recognition: phonetic classification and recognition on the TIMIT speech database. 
In both tasks, large margin systems obtain significantly better performance than systems trained by maximum likelihood estimation or competing discriminative frameworks. An in-depth analysis also reveals some interesting features of our approach, which contribute to the superior performance. Towards the end of the talk, I will discuss briefly the connection of our work to the structured prediction problems in the machine learning community. I will also discuss the future direction of this line of work and other application potentials.
|
| 22 Aug 08 | Catalin Tirnauca (Univ. Rovira i Virgili) |
Intern Final Talk: On the Consistency of Probabilistic Context-Free Grammars
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Probabilistic context-free grammars can describe probability distributions over strings, i.e., the sum of probabilities of all generated strings is 1. This condition is often called consistency. It has applications in fields of natural language processing such as probabilistic parsing (disambiguate by picking the parse with the highest score), or speech recognition (rank hypotheses returned by a speech recognizer). The talk is a survey of some of the previous results. We investigate how we can determine if a probabilistic context-free grammar is consistent, and if such a test can always be done. Also, we study a method, namely normalization, which guarantees consistent probabilistic context-free grammars. Moreover, we mention briefly some techniques that train probabilistic context-free grammars and guarantee consistency. |
| 22 Aug 08 | Amittai Axelrod (UW) |
Intern Final Talk: Structural constraints for efficient decoding.
Time: 3:45 pm - 4:15 pm Location: 11 Large Abstract: String-to-tree machine translation decoders are effective but very slow, especially compared to other decoding approaches. We explore various methods to identify constraints on the search space, with the aim of improving the efficiency of the syntax-based decoder. |
| 20 Aug 08 | John DeNero (Berkeley) |
Intern Final Talk: Minimum Risk Decoding over Forests
Time: 3:45 pm - 4:15 pm (NOTE different day and location!) Location: 11 Small Abstract: Minimum Bayes risk (MBR) decoding improves the output of machine translation systems by selecting a translation that matches a large proportion of the k-best hypotheses of a system. We extend this idea to apply to packed forests by selecting an output sentence that matches a large proportion of all hypotheses in the pruned forest of derivations from a syntax-based translation system. |
| 20 Aug 08 | Kyle Gorman (Penn) |
Intern Final Talk: The Entropy of English given French
Time: 3:00 pm - 3:30 pm (NOTE different day and location!) Location: 11 Small Abstract: The fundamental task in statistical machine translation (SMT) is to characterize the probability of a target sentence given its source translation; for translating French as English, P(e | f). By applying Bayes' rule, we derive the fundamental theorem of SMT: e maximizing P(e) P(f | e). Advances in SMT come from improving estimations of these two terms, or from more efficient ways of searching for optimal solutions (Brown et al. 1993). In the case of language modeling, Shannon (1949) and Brown et al. (1992) identified upper and lower bounds for the per-character entropy of English, H(e), for humans and machines, respectively. We ask the same question for SMT, H(e | f), comparing the results for human translators and a simple machine baseline based on IBM Model 1. These numbers are the upper and lower bounds for SMT systems trained on parallel data. |
| 18 Jul 08 | Sujith Ravi |
Deciphering Ciphers Optimally Using Only Minimal Knowledge of the Source Language
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will be talking about deciphering letter-substitution ciphers *optimally* using only minimal knowledge (bigrams, trigrams, etc.) of the source language, instead of relying on large look-up dictionaries. We also plan to show how our empirical results compare with Shannon's predictions on the equivocation curves and unicity distance measure. |
| 11 Jul 08 | Jon May |
Thesis Proposal Practice Talk: A Weighted Tree Transducer Toolkit for Syntactic Natural Language Processing Models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Solutions for many natural language processing problems such as speech recognition, transliteration, and translation have been described as weighted finite-state transducer cascades. The transducer formalism is very useful for researchers, not only for its ability to expose the deep similarities between seemingly disparate models, but also because expressing models in this formalism allows for rapid implementation of real, data-driven systems. Finite-state toolkits can interpret and process transducer chains using generic algorithms and many real-world systems have been built using these toolkits. Current research in NLP makes use of syntax-rich models that are poorly suited to extant transducer toolkits, which process linear input and output. Tree transducers can handle these models, and a weighted tree transducer toolkit with appropriate generic algorithms will lead to the sort of gains in syntax-based modeling that were achieved with string transducer toolkits. In this thesis proposal practice talk I will briefly trace the history of finite-state transducers and automata as they relate to natural language processing and the evolution of formalisms and the toolkits that support them, leading up to motivation for the design and creation of Tiburon, the toolkit referenced in this talk's title. I will describe previous, current, and future work on Tiburon's algorithms and the effectiveness of both algorithms and software at cleanly representing syntax-based NLP models from the literature and at constructing and evaluating novel models. |
| 13 Jun 08 | Ellen Riloff |
Effective Information Extraction with Relevant Regions and Semantic Affinity Patterns
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will briefly overview the landscape of event-oriented information extraction (IE) systems and explain why it is especially challenging to learn IE systems without annotated training data. Then I will describe one attempt to do so by decoupling the tasks of finding relevant text regions and applying extraction patterns. First, a self-trained relevant sentence classifier identifies relevant regions in documents. Second, a "semantic affinity" measure identifies domain-relevant extraction patterns. We further distinguish between "primary" patterns and "secondary" patterns and apply the patterns selectively in the relevant regions. This approach is weakly supervised, requiring only a few seed patterns plus relevant and irrelevant (but unannotated) documents for training. The resulting IE system achieves reasonably good performance, despite the fact that the relevant region classifier leaves a lot to be desired. |
| 06 Jun 08 | Tom Murray (USC) |
Knowledge as a Constraint on Uncertainty for Unsupervised Classification
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This talk investigates the use of domain knowledge to constrain and improve the unsupervised learning of a classifier, by placing limits or biases on the possible hypotheses for each input. Theoretically, we view the contribution of the knowledge source as a reduction in the uncertainty of the model's decisions, quantified by the resulting conditional entropy of the label distribution given the input corpus. Evaluating on the simple case of an unsupervised HMM tagger, we find surprising levels of improvement from little knowledge, with more stable and efficient training convergence and label assignment, and a high degree of correlation between classification entropy and model performance. We conclude that, while we should always seek better generic models and techniques, for applications in an unsupervised setting, knowledge may still be key. |
| 30 May 08 | Steve DeNeefe |
BLEU Sway Issues: one way to get statistical significance, two ways to get a better score, and three ways to thwart them
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: BLEU is the de facto standard for evaluation and development of statistical machine translation systems. We describe three real-world situations involving comparisons between different versions of the same systems where one can obtain improvements in BLEU scores that are questionable or even absurd. We propose a very conservative modification to BLEU that addresses these issues while improving correlation with human judgements, then explore some deeper modifications that alleviate the problems further. |
| 16 May 08 | David Newman (UCI) |
Theory and Applications of Topic Modeling
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Topic models, a class of Bayesian probabilistic models for discrete data, have recently gained popularity in applications ranging from document modeling to computer vision. Since the introduction of Latent Dirichlet Allocation (LDA) in 2003, there have been numerous extensions to this archetype. I will review the theory behind LDA, and discuss subsequent models, including (some of): Correlated Topic Model, Dynamic Topic Model, Hierarchical Topic Model, Special Words Topic Model, Hierarchical Dirichlet Process Model, Pachinko Allocation Machine, Topics and Syntax Model, Bi-LDA, Author-Topic Model, Supervised Topic Model, Spatial LDA, etc. |
| 09 May 08 | John DeNero (Berkeley) |
Inference in phrase alignment models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Models that align phrases instead of words offer an appealing alternative to the standard relative frequency estimates of phrase translation probabilities. But, while some effective word alignment models (Model 1, Model 2 & HMM) can be estimated tractably with EM, phrase alignment models cannot. I'll talk about how to show that estimation and inference under these models is intractable. Then, I'll present two useful approximation techniques. First, I'll talk about how to cast phrase alignment search as an integer linear programming (ILP) problem and find the optimal alignment reliably and quickly with off-the-shelf ILP software. Some applications of this technique include training phrase alignment models and interpreting the output of word alignment models. Second, we'll look at how to estimate translation probabilities under a phrase alignment model using a Gibbs sampling procedure. The sampler has some nice asymptotic convergence properties and also seems to produce good results in practice. I'll walk through the different models we've trained and how they performed. Time permitting, I'll also talk about some of the ways in which we could potentially extend this work to syntactic MT. |
| 02 May 08 | Zornitsa Kozareva |
Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a novel approach to weakly supervised semantic class learning from the web, using a single powerful hyponym pattern combined with graph structures, which capture two properties associated with pattern-based extractions: popularity and productivity. Intuitively, a candidate is popular if it was discovered many times by other instances in the hyponym pattern. A candidate is productive if it frequently leads to the discovery of other instances. Together, these two measures capture not only frequency of occurrence, but also cross-checking that the candidate occurs both near the class name and near other class members. We developed two algorithms that begin with just a class name and one seed instance and then automatically generate a ranked list of new class instances. We conducted experiments on four semantic classes and consistently achieved high accuracies. |
| 25 Apr 08 | David Chiang |
Tutorial: Randomized data structures for large statistical NLP models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Randomized algorithms are those which use randomness to achieve efficient performance with a bounded probability of error; typically, the bound is adjustable and the performance depends on the bound. Randomized data structures, likewise, use randomness to achieve efficient storage with a bounded probability of error. I will give an overview of the use of such data structures, namely, Bloom filters and "Bloomier" filters, for storing very large n-gram language models, and will discuss possibilities for using randomized data structures for other purposes as well. |
| 18 Apr 08 | Rahul Bhagat |
Learning Paraphrases from Text
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Paraphrases are textual expressions that convey the same meaning using different words. They capture variability, which is a common phenomenon in language. Given this, paraphrases have been shown to be useful in many natural language applications like Question-Answering, Machine Translation, Summarization and Information Retrieval. In this talk, I'll discuss the phenomenon of paraphrasing and focus on methods for automatically acquiring paraphrases from text. |
| 11 Apr 08 | Jon May |
Syntactic Re-Alignment Models for Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a method for improving word alignment for statistical syntax-based machine translation that employs a syntactically informed alignment model closer to the translation model than commonly-used word alignment models. This leads to extraction of more useful linguistic patterns and improved BLEU scores on translation experiments in Chinese and Arabic. |
| 04 Apr 08 | Ulf Hermjakob |
Name Translation in Statistical Machine Translation: Learning When to Transliterate
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a method to transliterate names in the framework of end-to-end statistical machine translation. The system is trained to learn when to transliterate. For Arabic to English MT, we developed and trained a transliterator on a bitext of 7 million sentences and Google's English terabyte ngrams and achieved better name translation accuracy than 3 out of 4 professional translators. The talk also includes a discussion of challenges in name translation evaluation. |
| 25 Mar 08 | Jason Riesa |
Tutorial on Arabic Orthography
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: This tutorial is intended to provide attendees with working knowledge of the Arabic writing system. No previous experience with Arabic is required. At the end of this tutorial you should be able to read and segment individual Arabic characters, read common ligatures, identify possible affixes on stems, and understand the various lexical normalizations used in Arabic text preprocessing. The focus will be on the formal writing system in printed text for Modern Standard Arabic, although handwriting will be briefly discussed. |
| 18 Jan 08 | Victoria Fossum |
Using Syntax to Improve Word Alignment Precision for Syntactic Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Automatically word-aligning a parallel bitext in the source and target languages constitutes the first stage of most statistical machine translation pipelines. Automatic word alignment is error-prone, and produces many incorrect links. Incorrect links that violate syntactic correspondences interfere with the extraction of string-to-tree transducer rules for syntactic machine translation. We present an algorithm for identifying and deleting incorrect word alignment links, using features of the extracted rules. We obtain gains in both alignment quality and translation quality in Chinese-English and Arabic-English translation experiments, relative to a GIZA++ union baseline. |
| 11 Jan 08 | Kevin Knight |
How to Make EM Do What You Want
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I'll talk about some unsupervised learning experiments -- how I was satisfied with the initial results, how I became very dissatisfied, and how I became (somewhat) satisfied again. |
| 14 Dec 07 | Marieke van Erp |
MITCH: Mining for Information in Texts from the Cultural Heritage
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Naturalis, the Dutch National Museum of Natural History, harbours one of the largest treasures of the world: the key specimens of millions of animals found throughout the world through centuries of biological expeditions. While the depot where the animals are stored is a technical marvel, Noah's ark of the 21st century, it is hard to search through it. Research in taxonomy, the evolution of life and biodiversity revolves around the specimens in the depot. The main keys to accessing the depot are (mostly) handwritten expedition logs and registration books, which are currently being photographed and keyed in to be stored in searchable digital archives. Such digital logs already enable a kind of "Biogoogle" search, but actual research questions are more complicated ("how did this kind of frog develop over the last century in the Amazon rainforests?"), and demand more intelligent handling. This is where the MITCH project comes in. The goal of MITCH is to turn the field logs and registration books into a populated semantic network, in which concepts such as animal specimens are related to all other concepts that define them: where, when, under which circumstances and by whom were they found, who described them first in the academic literature, who prepared them for storage in the Naturalis depot, which registration number was assigned to them, etc. This means that all textual descriptions of a specimen need to be parsed into exactly these concepts and their relations. All of this needs to be done at a scale that goes far beyond human capacity, as tens of thousands of digitized but unanalysed textual records are waiting for semantic analysis. This necessitates the use of state-of-the-art machine learning methods that learn from examples automatically. The project addresses its goals on three levels. The basic level is the development and application of automatic data cleaning and markup tools. 
On top of this, semi-structured textual material, such as fieldbook logs and scientific papers, is semi-automatically converted to a searchable knowledge base. Search results are visualised by displaying maps and specimen photos. The conversion phase assumes the active intervention of domain experts, such as collection managers, to correct and steer the automatic extraction procedure. At the top level, information resources are cross-linked using a domain ontology, populating a semantic network that can be hooked up to any other standardised cultural heritage knowledge base or to a search engine. |
| 02 Nov 07 | Bill Rounds (Michigan and Stanford) |
Constructions, Constraints, Transducers, and TAGs: A unifying view through Feature Logic
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The value of mathematical formalisms for speech recognition, language generation, and machine translation has long been recognized. Not so much work, though, has been spent reconciling these formalisms with linguistic theories. In this talk I'll propose a theoretical descriptive mechanism based on feature logic, which is central to construction and constraint-based linguistic theories like construction grammar and HPSG, and which can be used to view tree transducers and tree-adjoining grammars as giving rise to a construction-based framework. |
| 19 Oct 07 | Slav Petrov (Berkeley) |
Learning and Inference for Hierarchically Split PCFGs
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: Treebank parsing can be seen as the search for an optimally refined grammar consistent with a coarse training treebank. We describe a method in which a minimal grammar is hierarchically refined using EM to give accurate, compact grammars. The resulting grammars are extremely compact compared to other high-performance parsers, yet the parser gives the best published accuracies on several languages, as well as the best generative parsing numbers in English. In addition, we give an associated coarse-to-fine inference scheme which vastly improves inference time with no loss in test set accuracy. |
| 17 Oct 07 | Jon Patrick (Univ. of Sydney) |
Enhancement Technologies for ICU Information Systems
Time: 3:30 pm - 4:30 pm Location: 11 Large Abstract: The School of Information Technologies at the University of Sydney has had a 3 year partnership with the Intensive Care Unit at the Royal Prince Alfred Hospital, Sydney. In that time they have managed 8 joint projects aimed at producing software solutions that enhance productivity in the Unit and in some cases enable entirely new functionalities in their information systems. The principal motivation for the research is the processing of the narratives in clinical notes, but concomitant problems in information systems have also been tackled, and the combination of the two disciplines has led to the two related processing systems to be described in this presentation. - Ward Rounds Information Systems (WRIS) & Handovers - The WRIS is designed to support the work of all clinical staff in their ward rounds activities. The system, when activated, automatically populates from the resident clinical database a pro forma report with the most recent relevant data about the patient, such as vital signs, pathology reports, and other diagnostic measurements, presented as a web page. The clinical staff then write their progress notes into the web page, which converts the text to SNOMED CT codes and other relevant concepts and entities. The clinician is given the opportunity to change any analyses done by the processor. This clinician-approved data is loaded to the patient record. The essential elements of this system, that is, computing an extract of the patient record, accepting narrative input, and analysing the text for coding, are a productivity gain in themselves, but more importantly, also constitute the beginning of a hospital wide Handovers System for use throughout each step in the patient journey. This system is being tested at the RPAH ICU in readiness for ward usage. The impact of this system in improving the quality and safety of handovers has the potential to be very significant. 
- Clinical Data Analytics Language (CDAL) - General purpose access to data from clinical information systems, beyond retrieval for point of care work, is needed for many aspects of the hospital's work particularly for clinical research, logistics & operational planning, and auditing patient safety. Most current clinical systems only provide access to data identified in standard reports with no flexibility to make ad hoc enquiries or to pursue new directions of enquiry. The clinical data analytics language developed enables the expression of any question that can be answered from the data in the database in a restricted natural language. A prototype of the language has been developed for the CareVue information system used in the ICU at the Royal Prince Alfred Hospital. It provides for the use of local medical dialects, SNOMED CT terminology including all forms of collective expressions in SNOMED (e.g. infectious diseases), specification of patient groups, a variety of statistical functions, and constraints over any medical variable, Time, and Location. CDAL is general in that it can be bolted on to any clinical information system and is applicable to any clinical specialisation.
|
| 12 Oct 07 | David Talbot (Edinburgh) |
Scalable Language Modeling: Breaking the Curse of Dimensionality
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Randomized data structures can help us scale discrete models encountered in NLP. This talk will describe their use in language modeling and present some more general related results. N-gram language models are fundamental to speech recognition and machine translation. Unfortunately, the n-gram parameter space grows exponentially with the dimension of the feature vector. I will describe how randomization can be used to remove the space-dependency of such models on the a priori parameter space. The novel extensions of the Bloom filter that I will present are able to take advantage of the entropy of the distribution of values assigned to feature vectors to save space in a discrete statistical model. I will review some results applying these models to language modeling in machine translation and relate their space-requirements to a novel lower bound on the general problem of querying a map of key/value pairs. No prior knowledge of randomized data structures will be assumed.
|
| 05 Oct 07 | Sujith Ravi |
Will this parser work with my data? - Predicting Parser Accuracy without Gold-Standard information
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: There are many tools available to the NLP community for Natural Language Parsing (i.e., converting a raw sentence into a parse tree). NLP researchers usually use some "off-the-shelf" parser which has been trained on the Wall Street Journal (WSJ) corpora and then apply the WSJ-trained parser to their data. This works in many cases, especially for systems which use data from WSJ or similar corpora. However, in real life applications, the data may be compiled from many different sources and span different genres, and may not be similar to the WSJ corpora in terms of sentence structure, etc. A particular parser might parse well on some corpora and not so well on others. Choosing the right parser for your data may have an impact on the performance of the NLP system as a whole. But in order to measure the accuracy of any parser for a given corpus, we require a set of gold-standard parse trees corresponding to the sentences within the corpus. Generating a gold-standard set takes a lot of manual work, and in many real-life applications it is not feasible to generate gold-standard parses for large corpora. We attempted to build a system which can predict the accuracy (in terms of f-measure value) of the Charniak parser (a popular parsing tool) on any given sentence corpus. Without using any additional information (i.e., gold-standard parses), our system predicts "how accurately the Charniak parser could parse the given corpus". In order to evaluate our system's predictions on a particular corpus, we compute the "Correlation" measure between the "actual accuracies (using Gold-standard)" vs. "predicted accuracies (from our system)" for the given corpus. We tested our system on different corpora and using different methods and will present these results. |
| 29 Aug 07 |
Carmen Heger (Dresden) Michael Bloodgood (Delaware) |
Summer Intern Presentations: Composition of Tree Transducers AND Using the Perceptron Algorithm to Tune Large Numbers of Feature Weights for Syntax-Based Statistical Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Composition of Tree Transducers: Since finite state (string) transducers are not expressive enough for many NLP applications, computational linguistics started to investigate tree transducers for tasks such as machine translation. Quite some successful work has been done on generalizing results from string transducers to tree transducers. But when it comes to composition, results are not satisfying because generally tree transducers are not closed under composition. Still, we think that most of the tree transducers used in NLP are composable, and that is why we defined the problem of composition for two individual transducers instead of the whole class. During the summer we started with linear nondeleting tree transducers with epsilon rules and approached an algorithm to decide for two such transducers whether their composition is again in the same class. Using the Perceptron Algorithm to Tune Large Numbers of Feature Weights for Syntax-Based Statistical Machine Translation: Current state-of-the-art syntax-based statistical machine translation systems produce many candidate translations out of which the output translation is selected by taking the argmax over all candidates i of <w,f_i> where w is a weight vector and f_i is a vector of the feature values for candidate i. The features used by the system and their corresponding weights have a major impact on a system's performance. Currently, Minimum Error Rate Training (MERT) is used to tune the weights of the features. A drawback of this is that it isn't tractable to tune large numbers of feature weights. I will discuss using the perceptron algorithm to tune feature weights for statistical machine translation. If I get interesting results before my talk, I may also discuss new classes of features (potentially very large numbers of features) that can be used for improving MT performance. |
| 24 Aug 07 |
Wei Ho (Princeton) Jennifer Gillenwater (Rice) |
Summer Intern Presentations: Noisy Language Models AND Context for Syntax-Based Translation Rules
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: Noisy Language Models: The language models used in statistical machine translation are often quite large, requiring significant memory and sometimes pre-processing in order to be utilized effectively. It would be desirable to have more compact representations of language models while minimizing the impact on translation quality. Various quantization methods and lossy storage of language models will be presented. Context for Syntax-Based Translation Rules: The rules that a translation system employs should be applicable in many contexts. This ensures that a rich language is expressible with a minimum number of rules. However, when rules that are applicable in too many contexts are combined, they result in nonsensical translations. How can we keep rules general but constrain the context of their use? This summer we explored the approach of constraining the context by conditioning on various neighboring elements of each rule.
|
| 16 Aug 07 | Anoop Sarkar (Simon Fraser) |
Extensions of Regular Tree Grammars and their relation to Tree Adjoining Grammars
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: There is a hierarchy of generative devices that generate trees: starting with regular tree languages (RTLs), which are contained within context-free tree languages (CFTLs), and so on. The string yield of the RTLs is exactly the set of Context-Free Languages, while the yield of the CFTLs is exactly the set of Indexed Languages. In this talk we introduce Adjoining Tree Languages (ATLs) which sit in between RTLs and CFTLs. The yield of ATGs is exactly the set of Tree-Adjoining Languages. Just like RTGs are stronger than CFGs, ATGs are stronger than TAGs. In addition we will show that the ATG notation simplifies many of the foundational proofs for TAGs including proofs of the closure properties. In particular, ATLs do not use adjunction constraints, and thus are much easier to understand than TAGs. We compare ATGs with previously proposed simplifications of CFTGs, called monadic simple CFTGs, which also have been shown to be weakly equivalent to TAG (i.e. they generate the same set of string languages). We consider the question of whether these two weakly equivalent formalisms are strongly equivalent (i.e. generate exactly the same set of tree languages). Finally, we will show that the standard definition used for probabilistic TAG is (surprisingly) very different from the natural definition of probabilistic ATL. Using an example of PP-attachment ambiguity we show that the two probabilistic models are different from each other. About the speaker: Anoop Sarkar is an assistant professor in the Department of Computing Science at Simon Fraser University. He received his PhD in 2002 from the Department of Computer and Information Science at the University of Pennsylvania, with Prof. Aravind Joshi as his advisor. His research work is on machine learning, especially semi-supervised learning, applied to the processing of natural language and stochastic formal grammars. Anoop Sarkar's web-page: http://www.cs.sfu.ca/~anoop |
| 15 Jun 07 | Donghui Feng |
Extracting Data Records from Unstructured Biomedical Full Text
Time: 11:00 am - 11:30 am Location: 11 Large Abstract: In this paper, we address the problem of extracting data records and their attributes from unstructured biomedical full text. There has been little effort reported on this in the research community. We argue that semantics is important for record extraction or finer-grained language processing tasks. We derive a data record template including semantic language models from unstructured text and represent them with a discourse level Conditional Random Fields (CRF) model. We evaluate the approach from the perspective of Information Extraction and achieve significant improvements on system performance compared with other baseline systems. |
| 15 Jun 07 | Alex Fraser |
Getting the structure right for word alignment: LEAF
Time: 10:30 am - 11:00 am Location: 11 Large Abstract: Automatic word alignment is the problem of automatically annotating parallel text with translational correspondence. Previous generative word alignment models have made structural assumptions such as the 1-to-1, 1-to-N, or phrase-based consecutive word assumptions, while previous discriminative models have either made one of these assumptions directly or used features derived from a generative model using one of these assumptions. We present a new generative alignment model which avoids these structural limitations, and show that it is effective when trained using both unsupervised and semi-supervised training methods. Experiments show strong improvements in word alignment accuracy, and using the generated alignments in hierarchical and phrasal SMT systems improves the BLEU score. |
| 08 Jun 07 | Liang-Chih Yu (Cheng Kung U) |
Topic Analysis for Psychiatric Document Retrieval (Practice Talk for ACL)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Psychiatric document retrieval attempts to help people to efficiently and effectively locate the consultation documents relevant to their depressive problems. Individuals can understand how to alleviate their symptoms according to recommendations in the relevant documents. This work proposes the use of high-level topic information extracted from consultation documents to improve the precision of retrieval results. The topic information adopted herein includes negative life events, depressive symptoms and semantic relations between symptoms, which are beneficial for better understanding of users' queries. Experimental results show that the proposed approach achieves higher precision than the word-based retrieval models, namely the vector space model (VSM) and Okapi model, adopting word-level information alone. About the speaker: Liang-Chih Yu (http://www.isi.edu/~liangchi) is now a visiting student in the Information Sciences Institute (ISI) of University of Southern California (USC). My host advisor is Dr. Eduard Hovy. I am also a PhD candidate in the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. My advisor is Dr. Chung-Hsien Wu. My research interests include natural language processing, text mining, information retrieval, ontology construction, spoken dialogue system.
|
| 08 Jun 07 | Jonathan May |
Bisimulation Minimisation for Weighted Tree Automata
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: We describe existing forward and backward bisimulation minimisation algorithms for nondeterministic automata and extend these algorithms to weighted tree automata. The extended algorithms, which work for all semirings, retain the time complexity of their counterparts for unweighted tree automata for additively cancellative semirings, and are only slightly higher (linear instead of logarithmic in the number of states) on other semirings. We describe the effectiveness of an implementation of these algorithms on a typical task in natural language processing. This is joint work with Johanna Högberg, Umeå University and Andreas Maletti, Technische Universität Dresden. |
| 01 Jun 07 | Jingbo Zhu |
Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: In this paper, we analyze the effect of resampling techniques, including under-sampling and over-sampling used in active learning for word sense disambiguation (WSD). Experimental results show that under-sampling causes negative effects on active learning, but over-sampling is a relatively good choice. To alleviate the within-class imbalance problem of over-sampling, we propose a bootstrap-based over-sampling (BootOS) method that works better than ordinary over-sampling in active learning for WSD. Finally, we investigate when to stop active learning, and adopt two strategies, max-confidence and min-error, as stopping conditions for active learning. According to experimental results, we suggest a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for stopping conditions. |
| 01 Jun 07 | Andrew S. Gordon |
Generalizing Semantic Role Annotations Across Syntactically Similar Verbs
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: Large corpora of parsed sentences with semantic role labels (e.g. PropBank) provide training data for use in the creation of high-performance automatic semantic role labeling systems. Despite the size of these corpora, individual verbs (or rolesets) often have only a handful of instances in these corpora, and only a fraction of English verbs have even a single annotation. In this paper, we describe an approach for dealing with this sparse data problem, enabling accurate semantic role labeling for novel verbs (rolesets) with only a single training example. Our approach involves the identification of syntactically similar verbs found in PropBank, the alignment of arguments in their corresponding rolesets, and the use of their corresponding annotations in PropBank as surrogate training data. |
| 25 May 07 | Wei Wang (Language Weaver) |
Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: We show that phrase structures in Penn Treebank style parses are not optimal for syntax-based machine translation. We exploit a series of binarization methods to restructure the Penn Treebank style trees such that syntactified phrases smaller than Penn Treebank constituents can be acquired and exploited in translation. We find that employing the EM algorithm to determine the binarization of a parse tree among a set of alternative binarizations gives us the best translation result. |
| 18 May 07 | Feng Pan |
Computing Semantic Similarity between Skill Statements for Approximate Matching
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: (This will be an extended version of the talk for NAACL-HLT 2007. It's based on my summer internship work at IBM T.J. Watson Research Center last year.) The project aimed to address the problems encountered when trying to match available employees to open job positions, based on skill matches. Currently, job search applications, like IBM's Professional Marketplace, only find exact matches. A skill affinity computation is desired to allow searches to be expanded to related/similar skills, and return more potential matches. In this talk, I will explore the problem of computing text similarity between verb phrases describing skilled human behavior for the purpose of finding approximate matches. Four parsers (Charniak's parser, Stanford's parser, IBM XSG slot grammar parser, and Lin's MINIPAR) are evaluated on a corpus of skill statements extracted from an enterprise-wide expertise taxonomy. A similarity measure utilizing common semantic role features extracted from parse trees was found superior to an information-theoretic measure of similarity and comparable to the level of human agreement.
|
| 11 May 07 | Steve DeNeefe |
What Can Syntax-based MT Learn from Phrase-based MT?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We compare and contrast the strengths and weaknesses of a syntax-based machine translation model with a phrase-based machine translation model on several levels. We briefly describe each model, highlighting points where they differ. We include a quantitative comparison of the phrase pairs that each model has to work with, as well as the reasons why some phrase pairs are not learned by the syntax-based model. We then propose improvements to the syntax-based extraction techniques to capture more phrases. We also compare the translation accuracy for all variations. |
| 04 May 07 | Sheelagh Carpendale (Calgary) |
Information Visualization and Collaboration
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Consider Donald Norman's quote, "The power of the unaided mind is highly overrated. Without external aids, memory, thought, and reasoning are all constrained. But human intelligence is highly flexible and adaptive, superb at inventing procedures and objects that overcome its own limits. The real powers come from devising external aids that enhance cognitive abilities." (Norman, 1993) Common methods for externalization include making sketches on whatever happens to be handy -- paper napkins, program margins, etc. -- and/or finding a colleague or two to discuss the problem with. It would seem then, that visualization and collaboration are natural possibilities for creating positive cognitive aids. I will discuss our approach to developing interactive information visualizations both to support individuals and small groups of collaborators and briefly describe some of our recent results. About the speaker: Sheelagh Carpendale holds a Canada Research Chair in Information Visualization at the University of Calgary. Her research focuses on the visualization, exploration and manipulation of information; visualizing such topics as ecological dynamics, uncertainty in information, social and communication information and investigating the development of information visualization environments that support collaboration. Dr. Carpendale's research in information visualization and interaction design draws on her dual background in Computer Science (BSc. and Ph.D. Simon Fraser University) and Visual Arts (Sheridan College, School of Design and Emily Carr, College of Art). |
| 20 Apr 07 | Christopher Collins (Toronto) |
Information Visualization to Support Computational Linguistics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We present a survey of recent research into using information visualization to reveal new insights about linguistic data. Our recent work includes using WordNet hyponymy as a basis for document visualization and visualizing the uncertainty in machine translation in an instant messaging chat context. We will present our preliminary findings and prototype visualization for machine translation data resulting from a week of collaboration with ISI researchers. About the speaker: Christopher Collins is a PhD candidate in information visualization and computational linguistics at the University of Toronto. He works with Prof. Gerald Penn and Prof. Sheelagh Carpendale (University of Calgary).
|
| 30 Mar 07 | Ido Dagan (Bar-Ilan U) |
Textual entailment as a framework for applied semantics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We have recently proposed Recognizing Textual Entailment (RTE) as a generic task that captures major semantic inferences across different natural language processing applications. The talk will first review the motivation and definition of the textual entailment task and the PASCAL RTE-1,2&3 Challenges benchmarks. Then we will demonstrate directions for building textual entailment systems, based on knowledge acquisition and inference, and for utilizing them within concrete applications. Furthermore, we suggest that textual entailment modeling may become a comprehensive framework for applied semantics research. Such a framework introduces useful variants of known semantic problems and highlights important tasks which have hardly been investigated so far at an applied computational level. The semantic modeling perspective will be illustrated in more detail by a case study for an entailment-based variant of word sense disambiguation. About the speaker: Ido Dagan is a Senior Lecturer at the Department of Computer Science at Bar Ilan University, Israel. His areas of interest are largely within empirical NLP, particularly empirical approaches for applied semantic processing. In the last few years Ido and his colleagues introduced textual entailment as a generic framework for applied semantic inference and have organized the first three rounds of the PASCAL Recognizing Textual Entailment Challenges. Ido received his Ph.D. from the Technion. He has been a research fellow at the IBM Haifa Scientific Center and a Member of Technical Staff at AT&T Bell Laboratories. During 1998-2003 he was co-founder and CTO of FocusEngine and VP of Technology of LingoMotors. |
| 23 Mar 07 | Hermann Helbig (U at Hagen, Germany) |
Multilayered Extended Semantic Networks as a Knowledge Representation Paradigm and Interlingua for Meaning Representation
Time: 3:00 pm - 4:30 pm Location: 4 CR Abstract: The talk gives an overview of Multilayered Extended Semantic Networks (abbreviated MultiNet), which is one of the most comprehensively described knowledge representation paradigms used as a semantic interlingua in large-scale NLP applications and for linguistic investigations into the semantics and pragmatics of natural language. As with other semantic networks, concepts are represented in MultiNet by nodes, and relations between concepts are represented as arcs between these nodes. In addition, every node is classified according to a predefined conceptual ontology forming a hierarchy of sorts, and the nodes are embedded in a multidimensional space of layer attributes and their values. MultiNet provides a set of about 150 standardized relations and functions which are described in a very concise way including an axiomatic apparatus, where the axioms are classified according to predefined types. The representational means of MultiNet claim to fulfill the criteria of universality, homogeneity, and cognitive adequacy. The talk also shows how MultiNet can be used for the semantic representation of different semantic phenomena. To overcome the quantitative barrier in building large knowledge bases and semantically oriented computational lexica, MultiNet is associated with a set of tools including a semantic interpreter NatLink for automatically translating natural language expressions into MultiNet networks, a workbench LIA for the computer lexicographer, and a workbench MWR for the knowledge engineer for managing and graphically manipulating semantic networks. The applications of MultiNet as a semantic interlingua range from natural language interfaces to the Internet and to dedicated databases, over question-answering systems, to systems for automatic knowledge acquisition. About the speaker: Prof. Helbig is head of the chair Intelligent Information and Communication Systems at the University of Hagen, Germany. His main research areas are Knowledge Representation, Semantic Natural Language Processing, and Question-Answering. A CV can be found here. |
| 09 Mar 07 | Kevin Knight |
The Voynich Manuscript
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The medieval Voynich Manuscript has been called "the most mysterious document in the world". Its pages contain bizarre drawings of strange plants and astrological diagrams, as well as an undeciphered script of 20,000 running words, written in a character set that has never been seen elsewhere. Its origin is also controversial, with many theories abounding. I will describe the document, show samples, explain where it may have come from, and present some properties of the text. This will be more of a history/mystery talk than a computer science talk. |
| 26 Jan 07 | Gerald Penn (Toronto) |
The Quantitative Study of Writing Systems
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: If you understood all of the world's languages, you would still not be able to read many of the texts that you find on the world wide web, because they are written in non-Roman scripts -- often ones that have been arbitrarily encoded for electronic transmission in the absence of an accepted standard. This very modern nuisance reflects a dilemma as ancient as writing itself: the association between a language as it is spoken and its written form has a sort of internal logic to it that we can comprehend, but the conventions are different in every individual case --- even among languages that use the same script, or between scripts used by the same language. This conventional association between language and script, called a writing system, is indeed reminiscent of the Saussurean conception of language itself, a conventional association of meaning and sound, upon which modern linguistic theory is based. Despite linguists' reliance upon writing to present and preserve linguistic data, however, writing systems were a largely forgotten corner of linguistics until the 1960s, when Gelb presented their first classification. This talk will describe recent work that aims to place the study of writing systems upon a sound computational and statistical foundation. While archaeological decipherment may eternally remain the holy grail of this area of research, it also has applications to speech synthesis, machine translation, and multilingual document retrieval. |
| 12 Jan 07 | Kevin Knight |
Capturing Natural Language Transformations
Time: 2:00 pm - 3:30 pm Location: 11 Large Abstract: Knowledge representation is hard. As natural language scientists and engineers, we'd like something that (1) is expressive enough to capture how natural language works, (2) permits tractable inference, (3) admits learning algorithms for automatic knowledge acquisition, and (4) leads to modular system construction. This talk will look at knowledge representation for capturing natural language transformations. A lot of what we do falls into this category. Examples of transformations include language translation (French to English), question answering (Question to Answer), transliteration (foreign script to Roman alphabet), summarization (long text to short text), parsing (string to tree), language generation (meaning to string), etc. I'll show various knowledge formats (starting with simple finite-state transducers) and show how they stack up on the 4 criteria above, using theorems and examples. We'll see that different types of tree and string automata lead to good behavior on various subsets of the 4 criteria, but getting 4 out of 4 is still elusive. This is a Krazy Theory talk -- since this kind of talk should not go on and on, I promise to finish within 50 minutes. |
| 05 Jan 07 | Beata Klebanov (Hebrew U) |
Experimental and Computational Investigation of Lexical Cohesion in Texts
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Lexical cohesion refers to structure created in a text by use of words with related meanings. Apart from its importance in theoretical and applied linguistics, lexical cohesion detection is used in NLP tasks like topic segmentation, extractive summarization, spelling correction, etc. However, the intuitive potential of lexical cohesion for such tasks is often not realized in practice, possibly due to shortcomings of detection algorithms. I will briefly describe an experiment with readers aimed at providing reliable data for a computational investigation of lexical cohesion. We then discuss a number of informative features for cohesion detection, drawing on sources like WordNet, distributional information, free associations, and the structure of information in the text itself. Finally, I report experiments with supervised learning of lexical cohesion. About the speaker: Beata Beigman Klebanov is a PhD candidate at the Hebrew University of Jerusalem, Israel, currently a visiting scholar at Northwestern University. Beata's interests are in experimental, computational and applied research in text pragmatics. |
| 15 Dec 06 | Jerry Hobbs |
When Will Computers Understand Shakespeare?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In this talk I will examine problems encountered in coming to some kind of understanding of one sonnet by Shakespeare (his 64th), ask what it would take to solve these problems computationally, and suggest routes to the solution. The general conclusion is that we are closer to this goal than one might think. Or are we? Bio: Jerry Hobbs is famous primarily for having an office next to Kevin Knight's and a parking space next to Ed Hovy's. He has read everything of Shakespeare's that survives, including his will and plays of dubious authorship. But that was all a long time ago. |
| 14 Dec 06 | Liang Huang (Penn) |
Faster Decoding with Synchronous Grammars and n-gram Language Models
Time: 1:30 pm - 3:00 pm Location: 11 Large Abstract: A major obstacle in syntax-based machine translation is the prohibitively large search space for decoding with an integrated language model. We develop faster approaches for this problem based on lazy algorithms for k-best parsing. When comparing against Chiang's technique of cube pruning, our method runs up to twice as fast without making more search errors or decreasing translation accuracy as measured by BLEU. We demonstrate the effectiveness of the algorithm on a large-scale translation system. Interestingly, these techniques can be applied to speed up bilexical parsing as well, where the (bi-)lexical probabilities can be viewed as n-gram probabilities that cause non-monotonicity. This method fits naturally into coarse-to-fine multi-pass parsing schemes. To push this direction even further, we can generalize cube and lazy cube pruning as generic tools for reducing complicated search spaces, as alternatives to the well-known A* and annealing techniques. This is joint work with David Chiang (ISI). |
| 27 Nov 06 | Mark Hopkins (Potsdam) |
Towards the Effective Exploitation of Syntax in Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We discuss preliminary work on a possible approach to exploiting syntax in an effective way for machine translation. The driving guideline is to devise a machine translation system that can perform effectively, given a very limited quantity of parsed training data. |
| 17 Nov 06 | David DeVault (Rutgers) |
Scorekeeping in an Uncertain Language Game
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Practical dialogue systems must exploit context to interpret user utterances correctly. Received views of context and coordination in pragmatic theory equate utterance context with the occurrent subjective states of interlocutors using notions like common knowledge or mutual belief. We argue that these views are not well suited for practical modeling due to the uncertainty and robustness of context dependence in human-human dialogue. We present an alternative characterization of utterance context as objective and normative. On this view, an interlocutor's representation of context reflects private uncertainty about the true objective context as determined by prior speaker meanings. As conversation moves forward, new utterances provide interlocutors with retrospective insight about each other's prior meanings and therefore about what the true context really is. This view reconciles the need for uncertainty with received intuitions about coordination, and can directly inform computational approaches to dialogue. Joint work with Matthew Stone (Rutgers) and Rich Thomason (Michigan). About the Speaker: David DeVault is a Ph.D. candidate in the Department of Computer Science at Rutgers University. He holds a B.S. in Engineering and Applied Science from the California Institute of Technology and an M.A. in Philosophy from Rutgers University. David's research aims to develop techniques to allow computational agents to participate in flexible task-oriented conversations with human beings. His recent work has drawn on design challenges encountered in building such an agent to try to articulate practical, learnable, and theoretically satisfying representations of context, utterance meaning, and speaker intention for implemented conversational systems. |
| 03 Nov 06 | Jens-Soenke Voeckler |
perl part 2 - advanced magick
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: Since part 1 of the Perl tutorial didn't cover the juicy bits (like a unique function in Perl), based on feedback from participants, I am offering a part 2 "Perl - Advanced Magick" covering: (1) the slides from roughly page 40 (the Schwartzian Transform; dissecting a program); (2) what to do if you do need popen or backticks; (3) OO Perl, a start; (4) C embedding, definitely only a "start here"; (5) useful recipes, e.g. interpolating variables in configuration scripts from Perl values. If there is something you are especially interested in seeing, please send me an email. |
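The Schwartzian Transform named in this abstract is usually written in Perl as a map-sort-map pipeline: compute each item's sort key once, sort on the cached keys, then strip the keys off again. The sketch below shows the same decorate-sort-undecorate idea in Python rather than Perl; it is not taken from the talk's slides, and the function name `schwartzian_sort` is a hypothetical one chosen for illustration.

```python
def schwartzian_sort(items, key):
    """Sort items by key(item), calling key exactly once per item.

    Illustrative decorate-sort-undecorate sketch; the name is hypothetical.
    """
    # Decorate: pair each item with its precomputed key. The index breaks
    # ties, so the items themselves are never compared and the sort is stable.
    decorated = [(key(item), index, item) for index, item in enumerate(items)]
    # Sort on the cached (key, index) pairs.
    decorated.sort()
    # Undecorate: drop the keys and indices, keeping only the items.
    return [item for _, _, item in decorated]


words = ["pear", "fig", "banana"]
by_length = schwartzian_sort(words, len)
```

The payoff is largest when the key is expensive to compute (for example, a file stat or a parse), since a plain comparison sort would otherwise recompute it on every comparison.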
| 23 Oct 06 | Jens-Soenke Voeckler |
perl - how to use it, not abuse it
Time: 12:00 pm - 1:30 pm Location: 11 Large Abstract: If you speak a little perl, are an occasional perl-scripter, and would like to know more about how to use it as a (p)ortable, (e)fficient, and (r)eadable (l)anguage, you may be interested in my brown bag (read: bring your own) lunch seminar: I will talk about using Perl in a portable fashion, the environment it is run in, and how to avoid common mistakes and misconceptions. Perl offers more than a thousand ways to solve a problem, but some are more portable or more efficient than others. If time permits, simple hands-on examples can be tried out during the talk, so power for laptops will be provided. |
| 29 Sep 06 | Ashish Venugopal (CMU) |
Delayed LM Intersection and Left-to-Right N-Best Extraction for Syntax-Based MT
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We begin by describing a set of pruning constraints that are applied in the literature to effectively restrict the search space of synchronous PCFGs intersected with target language model contexts. We apply these constraints to non-binarized grammars with a large number of non-terminals and demonstrate effective parsing within the framework of Wu, 97. We then present a novel parsing approach that avoids language model context intersection during parsing in favor of language model driven n-best list extraction. The parsing step produces a sentence-spanning parse forest which is explored in left-to-right target order by the N-Best extraction method. This method avoids lossy pruning during the parsing process, searching a much larger effective parse space than practically possible in the full intersection scenario, and has the important benefit of allowing integration of a high-order language model within the N-Best search process, rather than only in parse re-scoring. We demonstrate the impact of this parsing approach using the SPCFG approach described in Zollmann, Venugopal, Vogel 06, which is similar to Galley et al., 04, and compare performance against full intersection. This is joint work with Andreas Zollmann. About the Speaker: Ashish Venugopal is a Ph.D. candidate at the Language Technologies Institute at Carnegie Mellon University, and holds B.S. (SCS, Univ. Honors) and M.S. degrees from the same institution. He is a Siebel Scholar and has received the annual Graduate Student Teaching Award at Carnegie Mellon. His research focus is on syntax augmented machine translation.
|
| 22 Sep 06 | Eduard Hovy |
Toward a 'Science' of Annotation: Experiences from OntoNotes
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: As machine learning algorithms and their application for NLP become better understood, attention turns toward the production of annotated corpora to which they can be applied. Numerous phenomena present themselves for annotation, including aspects in lexical semantics, discourse, pragmatics, and dialogue. But several questions immediately must be answered: 1. How does one obtain a balanced corpus to annotate? What is a balanced corpus? 2. How does one decide which aspects to annotate? How does one adequately express the theory behind the phenomena in simple annotation steps? 3. Which annotators does one hire? How does one ensure that they are adequately trained? 4. How does one establish a simple, fast, and trustworthy annotation procedure? What interfaces does one build? How does one ensure that the interfaces do not affect the annotation results? 5. How does one evaluate the results? What are the appropriate agreement measures? At which cutoff points should one re-do the annotations? How does one ensure improvement? 6. How should one formulate and store the results? How does one ensure compatibility with other existing resources? How does one make results available for best impact? 7. How does one report the annotation effort and results? How does one actually get a paper on this work published at an important conference? What should the paper contain? Despite their being so basic, there is almost no established procedure or standard set of answers to these questions today. In this talk I discuss some of these aspects, pointing to the lessons learned in the ongoing OntoNotes project (joint with BBN, the University of Colorado (PropBank), the University of Pennsylvania (Treebank), and ISI). |
| 25 Aug 06 | Victoria Fossum (Michigan) |
Improving Precision of Word Alignments Using GHKM Syntax-Based Rule Extraction
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Noisy word alignments negatively affect the quality of the translation rules extracted by the ISI syntax-based MT system. In the literature, alignment is typically treated as a separate process from subsequent stages in the MT pipeline. By contrast, we allow rule extraction to guide the alignment process. We present an unsupervised algorithm for identifying and removing "bad" links using GHKM syntax-based rule extraction. We show that we can improve upon the precision of GIZA union (measured against a gold standard set of manually aligned Chinese-English sentence pairs), while only decreasing recall slightly. (Note: This is part of the Summer Intern Series) |
| 25 Aug 06 | Jason Riesa |
Minimally Supervised Morphological Segmentation with Applications to Machine Translation
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: Inflected languages in a low-resource setting present a data sparsity problem for statistical machine translation. In this work, we present a minimally supervised algorithm for morpheme segmentation on Arabic dialects which reduces unknown words at translation time by over 50%, total vocabulary size by over 40%, and yields a significant increase in BLEU score over a previous state-of-the-art phrase-based statistical MT system. |
| 23 Aug 06 | Joseph Turian (NYU) |
Speeding-up Syntax-based Decoding
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: TBA (Note: This is part of the Summer Intern Series) |
| 23 Aug 06 | Oana-Diana Postolache |
Towards combining Searn and Syntax-Based Machine Translation (SBMT)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: This talk is about modeling the Syntax-Based Machine Translation (SBMT) problem within the Searn (Search & Learn) framework developed by Hal Daume in his PhD thesis. I will present the way we define the states, actions and the search space and how to implement the cost function. (Note: This is part of the Summer Intern Series) |
| 18 Aug 06 | Chenhai Xi |
Name Entity Transliteration Discovery from Large Bilingual Comparable Corpora
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: In this summer project, we investigate a scalable method to extract Chinese-English name transliterations from large comparable corpora, which consist of texts in two languages discussing the same or similar topics. We show that the bigram Jaccard coefficient is a good similarity measure for comparing English and Chinese names at the Chinese pronunciation (Pinyin) level. Based on this phonetic similarity score, an efficient randomized algorithm is then used to find name pair candidates from English and Chinese lists. Finally, context information, such as dates, frequency, place and titles, is combined with the phonetic similarity to improve the accuracy of the name pairs list. (Note: This is part of the Summer Intern Series) |
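As a rough illustration of the bigram Jaccard coefficient mentioned in this abstract, the sketch below compares two romanized names by the overlap of their character-bigram sets. It is a minimal example under stated assumptions, not the project's implementation: the function names are hypothetical, names are assumed to be already romanized (e.g. Pinyin without tone marks), and lowercasing is the only normalization applied.

```python
def char_bigrams(name):
    """Return the set of adjacent character pairs in a lowercased name."""
    text = name.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_jaccard(name_a, name_b):
    """Jaccard coefficient |A & B| / |A | B| over character-bigram sets.

    Hypothetical helper for illustration; returns 0.0 when neither name
    has any bigrams (both shorter than two characters).
    """
    bigrams_a = char_bigrams(name_a)
    bigrams_b = char_bigrams(name_b)
    union = bigrams_a | bigrams_b
    if not union:
        return 0.0
    return len(bigrams_a & bigrams_b) / len(union)


# "abcd" and "bcde" share the bigrams "bc" and "cd" out of four distinct ones.
score = bigram_jaccard("abcd", "bcde")
```

A score of 1.0 means identical bigram sets and 0.0 means no shared bigrams; a candidate-pair threshold would have to be tuned on real data.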
| 11 Aug 06 | Idan Szpektor (Bar-Ilan U) |
Textual Entailment: Framework, Learning and Applications
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Textual Entailment has been proposed recently as a generic framework for modeling semantic variability in many Natural Language Processing applications, such as Question Answering, Information Extraction, Information Retrieval and Document Summarization. The Textual Entailment relationship holds between two text fragments, termed text and hypothesis, if the truth of the hypothesis can be inferred from the text. In this talk, the Textual Entailment framework will be introduced. I'll then present an algorithm for large-scale Web-based acquisition of entailment rules, a type of knowledge needed for robust inference. Finally, I will present an unsupervised Relation Extraction approach based on the Textual Entailment framework. About the speaker: Idan Szpektor is a PhD student under the supervision of Dr. Ido Dagan at Bar Ilan University, Israel. His current research activity is in acquisition of knowledge for textual entailment.
|
| 04 Aug 06 | Shou-de Lin |
Ph.D. defense practice talk
Time: 3:30 pm - 4:30 pm Location: 11 Large Abstract: This is a practice talk for my Ph.D. defense, which will be held on Aug 24th 3-5pm, SAL 322. An important problem in the area of homeland security and fraud detection is to identify abnormal entities in large datasets. Although there are methods from knowledge discovery and data mining focusing on finding anomalies in numerical datasets, there has been little work aimed at discovering abnormal or suspicious instances in large and complex semantic graphs whose nodes are richly connected with many different types of links. In this talk, I will describe a novel, domain-independent and unsupervised framework to identify such instances. Besides discovering suspicious instances, we believe that to complete the discovery process and to deal with the "curse of false positives", a system has to convince the users by providing explanations for its findings. Therefore, in the second part of the talk I will describe an explanation mechanism to automatically generate human-understandable explanations for the discovered results. Experimental results show that our discovery system outperforms state-of-the-art unsupervised network algorithms used to analyze the 9/11 terrorist network by a large margin. Additionally, a human study we conducted demonstrates that our explanation system, which provides natural language explanations for its findings, allowed human subjects to perform complex data analysis in a much more efficient and accurate manner.
|
| 28 Jul 06 | Qin Iris Wang (Alberta) |
Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk is about an improved approach for learning dependency parsers from treebank data. Our technique is based on two ideas for improving large margin training in the context of dependency parsing. First, we incorporate local constraints that enforce the correctness of each individual link, rather than just scoring the global parse tree. Second, to cope with sparse data, we smooth the lexical parameters according to their underlying word similarities using Laplacian Regularization. To demonstrate the benefits of our approach, we consider the problem of parsing Chinese treebank data using only lexical features, that is, without part-of-speech tags or grammatical categories. We achieve state of the art performance, improving upon current large margin approaches. Here is the link for the paper: http://www.cs.ualberta.ca/~wqin/papers/depar_margin_conll06.pdf About the speaker: Qin Iris Wang is a Ph.D. student from the University of Alberta, working with Dekang Lin and Dale Schuurmans. Her research interests are in natural language processing and machine learning. Specifically, she has been working on dependency parsing using both generative and discriminative methods. |
| 11 Jul 06 | Dragos Munteanu + Joseph Turian |
Practice Talks for ACL
Time: 2:30 pm - 4:00 pm Location: 11 Large Abstract: FIRST TALK: Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora (Dragos Munteanu). We present a novel method for extracting parallel sub-sentential fragments from comparable bilingual corpora. Currently, the state of the art in comparable corpus mining is only able to extract full sentence pairs which are judged to be parallel. We advance the state of the art by showing how to obtain useful data even from not-fully-parallel sentences. By analyzing sentence pairs using a signal-processing-inspired approach, we detect which segments of the source sentence are translated into segments of the target sentence, and which are not. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art machine translation system. SECOND TALK: Advances in Discriminative Parsing (Joseph Turian). The present work advances the accuracy and training speed of discriminative parsing. Our discriminative parsing method has no generative component, yet surpasses a generative baseline on constituent parsing, and does so with minimal linguistic cleverness. Our model can incorporate arbitrary features of the input and parse state, and performs feature selection incrementally over an exponential feature space during training. We demonstrate the flexibility of our approach by testing it with several parsing strategies and various feature sets. |
| 30 Jun 06 | David Chiang and Kevin Knight |
Synchronous Grammars and Tree Transducers
Time: 2:00 pm - 5:00 pm Location: 11 Large Abstract: (Practice tutorial for ACL/COLING 2006) Once upon a time, synchronous grammars and tree transducers were esoteric topics in formal language theory, far removed from the practice of building real, large-scale natural language systems. However, these tools are now rapidly becoming essential for modeling machine translation and other complex language transformations. It has therefore become practical and important to understand the basic properties of tree transformation systems, which we cover in this tutorial.
|
| 23 Jun 06 | Joseph Turian (NYU) |
Discriminative Training for Large-Scale NLP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Parsing and translating natural languages can be viewed as structured-prediction problems. We outline the crucial design decisions that must be made to build a machine to solve structured prediction problems, and explain our particular choices for these two large-scale NLP problems. Our approach uses a purely discriminative learning method that scales up well to problems of this size. Unlike currently popular methods, this one does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Accuracy on constituent parsing was at least as good as other comparable methods. To our knowledge, it is the first purely discriminative learning algorithm for translation with tree-structured models. Experiments demonstrate the method's versatility, accuracy, and efficiency.
|
| 26 May 06 | Radu Soricut and Hal Daume III |
Defense Practice Talks: Generation and Learning
Time: 3:00 pm - 5:00 pm Location: 11 Large Abstract: These are two practice talks for our upcoming thesis defenses. The titles and abstracts are: -------------------------------------------------------------------------- NATURAL LANGUAGE GENERATION FOR TEXT-TO-TEXT APPLICATIONS USING AN INFORMATION-SLIM REPRESENTATION Radu Soricut In this talk, I describe a new natural language generation paradigm, based on direct transformation of textual information into well-formed textual output. I support this language generation paradigm with theoretical contributions in the field of formal languages, new algorithms, empirical results, and software implementations. At the core of this work is a novel representation formalism for probability distributions over finite languages. Due to its convenient representation and computational properties, this formalism supports a wide range of language generation needs, from sentence realization to text planning. Based on this formalism, I describe, implement, and analyze theoretically a family of algorithms that perform language generation using direct transformations of text. These algorithms use stochastic models of language to drive the generation process. I perform extensive empirical evaluations using my implementation of these algorithms. These evaluations show state-of-the-art performance in automatic translation, and significant improvements in state-of-the-art performance in abstractive headline generation and coherent discourse generation. -------------------------------------------------------------------------- PRACTICAL STRUCTURED LEARNING FOR NATURAL LANGUAGE PROCESSING Hal Daume III Natural language processing is replete with problems whose outputs are highly complex and structured. The current state-of-the-art in machine learning is not yet sufficiently general to be applied to general problems in NLP. 
In this thesis, I present Searn (for "search" + "learn"), an approach to learning for structured outputs that is applicable to the wide variety of problems encountered in natural language. Searn operates by transforming structured prediction problems into a collection of classification problems, to which any standard binary classifier may be applied. From a theoretical perspective, Searn satisfies a strong fundamental performance guarantee: given a good classification algorithm, Searn yields a good structured prediction algorithm. To demonstrate Searn's general applicability, I present applications in such diverse areas as automatic document summarization and entity detection and tracking. In these applications, Searn is empirically shown to achieve state-of-the-art performance. |
| 24 May 06 | Hal Daume III |
Beyond EM: Bayesian Techniques for Human Language Technology Researchers
Time: 9:00 am - 12:00 pm Location: 4th Floor Abstract: This is a practice tutorial for one I am giving at HLT/NAACL one week later. Comments/feedback are very welcome. ---------------------------------------------------------------------- Expectation Maximization (EM) has proved to be a great and useful technique for unsupervised learning problems in speech and language processing. Unfortunately, its range of applications is limited either by intractable E- or M-steps, or by its reliance on the maximum likelihood estimator. The natural language processing community typically resorts to ad-hoc approximation methods to get (some reduced form of) EM to apply to NLP tasks. However, many of the problems that plague EM can be solved with Bayesian methods, which are theoretically more well justified. In this tutorial, I discuss Bayesian methods as they can be used in natural language processing. The two primary foci of this tutorial are specifying prior distributions and performing the necessary computations to perform inference in Bayesian models. I focus on unsupervised techniques (for which EM is the obvious choice), but discuss supervised and discriminative techniques at the conclusion with pointers to relevant literature. Depending on one's inference technique of choice, the math required to build Bayesian learning models can be difficult. Compounding this problem is the fact that current written tutorials on Bayesian techniques tend to focus on continuous-valued problems, a poor match for the high-dimension discrete world of text. This combination makes the cost of entrance to the Bayesian learning literature often too high. The goal of this tutorial is to provide sufficient motivation, intuition and vocabulary mapping so that one can easily understand recent papers in Bayesian learning that are published at conferences like NIPS, and increasingly at ACL. 
In addition to the standard tutorial materials (slides), this tutorial is accompanied by a technical report that spells out all the mathematical derivations in great detail, for those who wish to start research projects in this field. This tutorial should be accessible to anyone with a basic understanding of statistics. I use a query-focused summarization task as a motivating running example for the tutorial, which should be of interest to researchers in natural language processing and in information retrieval. Additionally, though the tutorial does not focus on speech problems, those attendees interested in graphical modeling techniques for automatic speech recognition might also find the tutorial of interest. |
| 19 May 06 | Patrick Pantel |
Espresso: Making Use of Generic Patterns for Mining Relations from Small and Large Corpora
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In the past decade, researchers have explored many approaches to automatically extract large collections of knowledge from text. In this talk, we present Espresso, a weakly-supervised, general-purpose, and broad-coverage algorithm for harvesting binary semantic relations. The main contributions are: i) a method for exploiting generic patterns by filtering incorrect instances using the Web; and ii) a principled measure of pattern and instance reliability enabling the filtering algorithm. We present an empirical comparison of Espresso with various state of the art systems, on different size and genre corpora, on extracting various general and specific relations. Experimental results show that our exploitation of generic patterns substantially increases system recall with small effect on overall precision.
|
| 12 May 06 | Nick Mote and Donghui Feng |
Pedagogical Contextualization of Language Learner Speech Errors AND Learning to Detect Conversation Focus of Threaded Discussions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: These are two practice talks. ----------------------------------------------------------------------------- FIRST TALK: The traditional approach to diagnosing learner speech errors in Computer Aided Language Learning is to create a linguistic profile of the learner/user. We, however, propose that work must also be done to model the linguistic profile of a typical native listener. Not all errors in second language learner speech are created equal. Different errors sound more "severe" or "harsh" to native speaker ears and should therefore be treated with more emphasis in pedagogical interaction. The Tactical Language Training System (TLTS) is a speech-enabled virtual-reality based computer learning environment designed to teach Arabic spoken communication to American English speakers. This talk addresses the ways the TLTS contextualizes non-native speech errors, and how this contextualization fits in the corrective exchanges between a non-native learner and a pedagogical agent built to model a native listener. The pedagogical system used in TLTS includes: * Automatic Speech Recognition (ASR) models which are built on a combination of both annotated and unannotated non-native speech with native speech data. * A stochastic generative model for errors in learner speech that creates mispronunciation grammars for the ASR * Reweighting of system-perceived mispronunciation severity based on aggregate native speaker judgements of quality pronunciation and intelligibility. * Contextualization of feedback based on lexical and phonetic inventories of the native and non-native languages. ----------------------------------------------------------------------------- SECOND TALK: We present a novel feature-enriched approach that learns to detect the conversation focus of threaded discussions by combining NLP analysis and IR techniques. 
Using the graph-based algorithm HITS, we integrate different features such as lexical similarity, poster trustworthiness, and speech act analysis of human conversations with feature-oriented link generation functions. It is the first quantitative study to analyze human conversation focus in the context of online discussions that takes into account heterogeneous sources of evidence. Experimental results using a threaded discussion corpus from an undergraduate class show that it achieves significant performance improvements compared with the baseline system.
|
| 05 May 06 | Namhee Kwon |
Recognizing Argument Structures in Texts
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I present our approach to identifying an argument structure, defined as a simple hierarchical structure of a claim and its reasons. The claim is also classified as "in favor of" or "against" the topic. The experiment is performed on comments from the general public sent to government officials in response to proposed regulations.
|
| 28 Apr 06 | Feng Pan |
Learning Event Durations from Event Descriptions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The research of extracting event duration information from texts is potentially very important in applications in which the time course of events is to be extracted from news. For example, whether two events overlap or are in sequence often depends very much on their durations. If a war started yesterday, we can be pretty sure it is still going on today. If a hurricane started last year, we can be sure it is over by now. In the talk, I will first present our work on constructing an annotated corpus for extracting information about the typical durations of events from texts, including the annotation guidelines, the event classes we categorized, the way we use normal distributions to model such vague and implicit temporal information, and how we evaluate inter-annotator agreement. I will then show that machine learning techniques applied to this data yield coarse-grained event duration information, considerably outperforming a baseline and approaching human performance. At the beginning of the talk, I will also give a brief overview of the time ontology (OWL-Time, formerly DAML-Time) we have developed, which is represented in both first-order logic and the OWL web ontology language.
|
| 21 Apr 06 | Soo-Min Kim |
Identifying and Analyzing Judgment Opinions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In this talk, we introduce a methodology for analyzing judgment opinions. We define a judgment opinion as consisting of a valence, a holder, and a topic. We decompose the task of opinion analysis into four parts: 1) recognizing the opinion; 2) identifying the valence; 3) identifying the holder; and 4) identifying the topic. We evaluate our methodology using both intrinsic and extrinsic measures. |
| 14 Apr 06 | Radu Soricut |
Natural Language Generation for Text-to-Text Applications using an Information-Slim Representation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Although a considerable number of generic Natural Language Generation (NLG) systems has been produced over the years, none of them is usually employed in end-to-end, text-to-text applications such as Machine Translation, Summarization, Question Answering, etc. In this talk, we identify the likely reasons for this state of affairs, and propose WIDL-expressions as a flexible formalism that facilitates the integration of a generic NLG engine within end-to-end language processing applications. WIDL-expressions represent compactly probability distributions over finite sets of candidate realizations, and have optimal algorithms for text realization via interpolation with language model probability distributions. We show the effectiveness of our WIDL-based NLG engine for both sentence realization and document realization tasks. By employing language models that capture sentence-level properties, we perform Machine Translation and Headline Generation at state-of-the-art levels or better. By employing language models that capture document-level properties such as text coherence, we synthesize output for Multi-document Summarization that displays both high content selection performance and increased coherence.
|
| 24 Mar 06 | Dragos Munteanu |
Automatic creation of parallel corpora
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Parallel texts -- texts that are translations of each other -- are an important resource in many cross-lingual NLP applications, such as lexical acquisition, cross-language IR, and annotation projection. However, their importance is paramount for Statistical Machine Translation (SMT), as they provide the training data from which all the translation knowledge is learned. The state of the art in SMT is advanced enough that, given sufficient parallel data (i.e. a few million words) for any language pair in a given domain, a generic SMT system trained on it will achieve a reasonable translation performance in that domain. The main reason why SMT systems exist only for a handful of languages is that, for most language pairs, parallel training data is simply not available. One way to alleviate this lack of parallel data is to exploit a much richer and more diverse resource: comparable corpora, texts which are not strictly parallel but related. The prototypical example of comparable texts are two news articles in different languages which report on the same event. I will present methods for automatic extraction of parallel data from such corpora. I will show how to detect parallel data at various levels of granularity: parallel documents, parallel sentences, and even parallel sub-sentence fragments. The parallel corpora obtained using these methods help improve translation performance for both resource-scarce language pairs (such as Romanian-English) and resource-rich ones (such as Arabic-English).
|
| 17 Mar 06 | Jon May |
Tiburon: A Finite State Tree Automata Toolkit
Time: 3:00 pm - 4:30 pm Location: 4th Floor Abstract: In the 1990s, researchers applied their new developments in transducer theory using widely available easy-to-use toolkits for string transducers, and made well-known advances in parsing, machine translation, and other areas. Rapid prototyping via software such as the AT&T toolkit and carmel was useful for proofs of concept and in many cases led to unforeseen developments in novel areas. In the current NLP research environment, tree-based strategies and new models have shown promising results in advancing the state of the art, and recent developments in weighted tree automata theory are enriching the bedrock created 40 years ago, but as yet there is no toolkit available with the necessary capabilities to turn promise into solution. Tiburon is the first probabilistic tree transducer toolkit. Similar in form and function to the string-based toolkits of yesteryear, it is designed to be easy to use, with simple but expressive definitions of tree automata and a concise set of vital operations that can be used to construct many useful tree-based NLP projects. Although a work in progress, Tiburon is already a usable tool with active users between the ages of 6 and 41. I will describe the current status of the system, demonstrate ease of use and potential power, and discuss the challenges ahead. |
| 10 Mar 06 | Mark Hopkins |
Exploring the Potential of Intractable Parsers
Time: 3:00 pm - 4:30 pm Location: 10th Floor Abstract: We revisit the idea of history-based parsing, and present a history-based parsing framework that strives to be simple, general, and flexible. We also provide a decoder for this probability model that is linear-space, optimal, and anytime. A parser based on this framework, when evaluated on Section 23 of the Penn Treebank, compares favorably with other state-of-the-art approaches, in terms of both accuracy and speed.
|
| 03 Mar 06 | Liang Huang (Penn) |
Syntax-Directed Translation with Extended Domain of Locality
Time: 3:00 pm - 4:30 pm Location: 11th Floor (Large) Abstract: (note: this is a very tentative title -- comments welcome!) We present a novel extension of syntax-directed translation for statistical MT. Formally speaking, our model is based on tree-to-string transducers that recursively convert a parse-tree in the source-language into a string in the target-language. These transduction rules have multi-level trees on the source-side, giving this system more transformational power due to the extended domain of locality. We also present efficient algorithms for decoding based on dynamic programming. Initial experiments on English-to-Chinese translation show promising results in both speed and translation quality. Joint work with Kevin Knight and Aravind Joshi. Bio: Liang Huang is a 3rd-year PhD student from the University of Pennsylvania. He is mainly interested in algorithms and formalisms for parsing and syntax-based machine translation. His recent work has been on k-best parsing algorithms (with David Chiang) and synchronous binarization for MT (with Hao Zhang, Dan Gildea, and Kevin Knight). |
| 24 Feb 06 | Hal Daume III |
Search-based Structured Prediction
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I present an algorithm, Searn (for "search-learn"), that is designed to solve structured prediction problems: problems whose goal is to learn to predict complex objects such as part-of-speech sequences, parse trees, translations, etc. Searn functions by "breaking apart" structured prediction problems into classification problems in the process of search. I analyze Searn in the framework of learning reductions and show that good performance on the underlying classification problems implies good search performance. Moreover, Searn is computationally efficient in a superset of the settings where previous algorithms are efficient and is not limited by conditional independence assumptions (as in CRFs). This excessively simple and general algorithm turns out to have excellent state-of-the-art performance. This is joint work with John Langford (TTI-C) and Daniel Marcu; and, to a lesser extent, with Drew Bagnell (CMU) and Bianca Zadrozny (IBM TJ Watson). |
| 10 Feb 06 | David Chiang |
Parsing Arabic Dialects
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The Arabic language exhibits diglossia, i.e., the coexistence of two forms of language, a variety with standard orthography and sociopolitical clout which is not natively spoken by anyone (Modern Standard Arabic, MSA) and varieties that are primarily spoken and lack writing standards (Arabic dialects). There are important resources currently available for MSA with much on-going NLP work; for example, there is an Arabic Treebank and several syntactic parsers for MSA. However, Arabic dialect resources and NLP research are still at an infancy stage. I will present work done at the Johns Hopkins CLSP Summer Workshop on parsing of Arabic dialects, in particular, Levantine Arabic. We have experimented with three approaches to leveraging MSA resources to create a parser for Levantine Arabic, as well as methods for induction of MSA-Levantine translation lexicons and a Levantine part-of-speech tagger. Using these methods we obtain error reductions of up to 15% compared with applying an MSA parser directly to Levantine text. Rambow et al. Parsing Arabic Dialects: Final Report. Johns Hopkins University Center for Language and Speech Processing Workshop 2005. http://www.clsp.jhu.edu/ws2005/groups/arabic/documents/finalreport.pdf Chiang et al. Parsing Arabic Dialects. To appear in Proc. EACL 2006. This is joint work with O. Rambow, M. Diab, N. Habash, R. Hwa, K. Sima'an, V. Lacey, R. Levy, C. Nichols and S. Shareef. |
| 03 Feb 06 | Alex Fraser |
Measuring Word Alignment Quality for Statistical Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Automatic word alignment plays a critical role in statistical machine translation. Unfortunately the relationship between alignment quality and statistical machine translation performance has not been well understood. In the recent literature the alignment task has frequently been decoupled from the translation task, and assumptions have been made about measuring alignment quality for machine translation which, it turns out, are not justified. In particular, none of the tens of papers published over the last five years has shown that significant decreases in Alignment Error Rate (AER) result in significant increases in translation quality. I will explain this state of affairs and present steps towards measuring alignment quality in a way which is predictive of statistical machine translation quality. I will also provide a brief overview of some of my other work on training and search for word alignment.
|
| 27 Jan 06 | John Conroy |
Multi-Document Summary Space: What do People Agree is Important?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: A multi-document summary gives the "gist" of what is contained in a collection of related documents. But how can we define a "gist?" We explore this question by analyzing human written summaries for clusters of document sets. In particular, we estimate the probability that a word will be chosen by a human to be included in a summary. We demonstrate that if this probability model were given by an oracle, then a simple automatic method of summarization could produce extract summaries which are statistically indistinguishable from the human summaries. About the Speaker: John M. Conroy received a B.S. in Mathematics from Saint Joseph's University in 1980 and a Ph.D. in Applied Mathematics from the University of Maryland in 1986. Since then he has been a research staff member for the IDA Center for Computing Sciences in Bowie, MD. His research interest is applications of numerical linear algebra and statistics. He is a member of the Society for Industrial and Applied Mathematics, Institute of Electrical and Electronics Engineers (IEEE), and the Association for Computational Linguistics.
|
| 26 Jan 06 | Tim Chklovski |
GrainPile: Deriving Quantitative Overviews of Free Text Assessments on the Web
Time: 1:00 pm - 2:00 pm Location: 4th floor Abstract: Many research efforts are addressing the problem of enabling automatic summarization of opinions and assessments stated on the web in product reviews, discussion forums, and blogs. One key difficulty is that relevant assessments scattered throughout web pages are obscured by variations in natural language. In this paper, we focus on a novel aspect of enabling aggregations of assessments of the degree to which a given property holds for a given entity (for instance, how touristy is Boston). We present GrainPile, a user interface for extracting from the web, aggregating and quantifying degree assessments of unconstrained topics. The interface provides a variety of functions: a) identification of dimensions of comparison (properties) relevant to a particular entity or set of entities, b) comparisons of like entities on user-specified properties (for example, which university is more prestigious, Yale or Cornell), c) tracing the derived opinions back to their sources (so that the reasons for the opinions can be found). A central contribution in GrainPile is the evaluated demonstration of the feasibility of mapping the recognized expressions (such as fairly, very, extremely, and so on) to a common scale of numerical values and aggregating across all the extracted assessments to derive an overall assessment of degree. GrainPile’s novel assessment and aggregation of degree expressions is shown to strongly outperform an interpretation-free, co-occurrence based method. Full paper: http://www.isi.edu/~timc/papers/IUI06-grainpile-chkl.pdf
|
| 16 Dec 05 | Jonathan May |
A Better N-Best List - Practical Determinization of Weighted Finite Tree Automata
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Ranked lists of output trees from syntactic statistical NLP applications frequently contain multiple repeated entries. This redundancy leads to misrepresentation of tree weight and reduced information for debugging and tuning purposes. It is chiefly due to nondeterminism in the weighted automata that produce the results. I will introduce an algorithm that determinizes such automata while preserving proper weights, returning the sum of the weight of all multiply derived trees. I will also report results of the application of the algorithm to machine translation and Data Oriented Parsing.
|
| 30 Sep 05 | David Chiang |
Some Computational Complexity Results for Synchronous Context-Free Grammars
Time: 3:00 pm - 4:30 pm Location: 4 Large Abstract: (This is a practice talk for a paper by Giorgio Satta and Enoch Peserico) This paper investigates some computational problems associated with probabilistic translation models that have recently been adopted in the literature on machine translation. These models can be viewed as pairs of probabilistic context-free grammars working in a `synchronous' way. Two hardness results for the class NP are reported, along with an exponential time lower-bound for certain classes of algorithms that are currently used in the literature.
|
| 29 Sep 05 | Tim Chklovski |
Previews of my talks for K-CAP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The topics & approximate start times: (3:00 sharp) My 7-10 min bit for panel discussion on "Manual vs. Automated Knowledge Acquisition" Will touch on web extraction vs. learning from volunteers -- strengths and weaknesses, new thoughts on synergies (3:15) Designing Intelligent Acquisition Interfaces for Collecting World Knowledge from Web Contributors (paper by Timothy Chklovski, Yolanda Gil) (3:55) Collecting Paraphrase Corpora from Volunteer Contributors (paper by Timothy Chklovski) |
| 26 Aug 05 | Fossum, Huang and Zhang |
Summer Student Presentations
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: 3:00pm Victoria Fossum (Michigan) Exploring the Continuum between Phrase-based and Syntax-based Machine Translation State-of-the-art statistical machine translation systems use lexical phrases as the basic unit of translation. Phrase-based systems can capture those aspects of translation that are sensitive to local context. Syntax-based systems, on the other hand, make use of linguistically motivated syntactic structure, can capture long-distance dependencies and reorderings, and offer greater generalization in translation rules. However, their performance lags that of phrase-based systems. Hierarchical phrase-based translation, introduced by [Chiang 05], provides an elegant framework for exploring the continuum between phrase-based and syntax-based translation. This system combines the "formal machinery" of syntax-based systems without any "linguistic commitment" to a particular syntactic structure [Chiang 05]. I will present results from my re-implementation of Chiang's hierarchical phrase-based system, and (if time permits) compare those results with the following systems on Chinese-English translation: ISI's phrase-based system, and ISI's syntax-based system. Between now and December 2005, I plan to incrementally explore the space between phrase-based and syntax-based systems by augmenting these hierarchical phrase-based rules with richer syntactic annotation. 3:30pm Liang Huang (Penn) and Hao Zhang (Rochester) Efficient Integration of n-gram Language Models with Syntax-based Decoding We first give an overview of the ISI syntax-based MT system which is based on tree-to-string (xRs) translation rules. The biggest problem at this stage is the inefficiency of the integration of n-gram models. Without n-gram models, the xRs translation rules can be easily binarized with respect to the foreign language to ensure cubic-time decoding. 
With n-gram models, however, binarization without considering both languages will lead to exponential complexity. Inspired by Inversion Transduction Grammar (ITG) (Wu, 97), we will focus on the so-called ITG binarizable rules, which account for over 99% of the whole rule set. A simple linear-time algorithm will be presented to do the binarization. Decoding with ITG-like rules is of low polynomial complexity in both time and space. We will discuss experimental results on both efficiency and accuracy of decoding with the new binarization. If time permits, we will also present the "hook trick" (inspired by (Eisner and Satta, 99)) to even further reduce the polynomial complexity of the decoding process. |
| 24 Aug 05 | Hopkins, Riesa, and Nakov |
Summer Student Presentations
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: 3:30pm Mark Hopkins (UCLA) Tree Sequence Automata: A Unifying Framework for Tree Relation Formalisms There exist a wide variety of competing formalisms for representing a language of ordered tree pairs. These include (bottom-up and top-down) tree transducers, synchronous tree-substitution grammars (STSGs), synchronous tree-adjoining grammars (STAGs), and inversion transduction grammars (ITGs). Since these formalisms have all developed independently of one another, it is difficult to compare their respective representational power. This work seeks to make this task simpler by viewing these formalisms as instances of a general unifying formalism, which we call tree sequence automata (TSA). By casting these different formalisms in a single framework, we can compare them directly by studying the specific subclass of TSA that they fall into. 4:00pm Jason Riesa (Johns Hopkins) A case study in building a cost-effective speech-to-speech machine translation system with sparse resources: English - Iraqi Arabic The Arabic spoken dialect of Iraq is a language deprived of the vast resources that researchers enjoy when working with its written counterpart, Modern Standard Arabic (MSA). The Iraqi Arabic lexicon and grammar are also sufficiently distinct that the use of existing tools or corpora for MSA yields little or no positive effect on machine translation output quality. One can see that building a machine translation system normally dependent on a large parallel corpus is a particularly difficult task when given just a 37,000 line translated parallel text based on transcribed speech. This talk will explore the constraints involved in working with this type of data, how we endeavored to mitigate such problems as a non-standard orthography and a highly inflected grammar, and propose a cost-effective way of dealing with such projects in the future. 
4:30pm Preslav Nakov (UC Berkeley) Multilingual Word Alignment Recently there has been a growing number of available multilingual parallel texts. One such source is the European Union, which publishes its official documents in the official languages of all member states (sometimes also in the languages of the candidates). Another source is the United Nations. These corpora are a great source of training data for machine translation between new language pairs. But they also offer the opportunity to obtain better pairwise word alignments by looking at multiple languages in parallel. In this talk I will present my research as a summer intern at ISI on getting better French (Fr) to English (En) word alignments using an additional language (Xx). First, I will introduce two heuristics which start with pairwise alignments between Fr-Xx, En-Xx and Fr-En and then combine them probabilistically (in a linear model) or graph-theoretically (by looking at in- and out-degrees for each word). Then I will present two Model 1-inspired alignment models: (a) from "Fr and Xx" to En; and (b) from Fr to "En and Xx". |
| 05 Aug 05 | Doug Oard (Maryland) |
The CLEF Cross-Language Speech Retrieval Test Collection
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Test collections for information retrieval tasks have traditionally assumed that what we are searching for are documents (e.g., Web pages, news stories, or academic documents). Most information that is generated is, however, not originally generated as part of a document, but rather as what we might refer to as "conversational media" (e.g., email, speech, or instant messaging). In this talk, I'll describe the creation of two test collections for conversational media, an email collection being created in the TREC Enterprise Search track and a spoken word test collection for the Cross-Language Evaluation Forum (CLEF). I'll spend most of the talk describing the details of the CLEF test collection, illustrating the issues with some of the results that we have obtained from our experiments with that collection. I'll conclude with a few remarks about the implications of what we are learning for DARPA's new GALE program. This is joint work with Charles University, the IBM TJ Watson Research Center, the Johns Hopkins University, the Survivors of the Shoah Visual History Foundation, and the University of West Bohemia. About the speaker: Douglas Oard is an Associate Professor at the University of Maryland, College Park, with a joint appointment in the College of Information Studies and the Institute for Advanced Computer Studies. He holds a Ph.D. in Electrical Engineering from the University of Maryland, and his research interests center around the use of emerging technologies to support information seeking by end users. In 2002 and 2003, Doug spent a year in paradise here at USC-ISI. His recent work has focused on interactive techniques for cross-language information retrieval and on searching conversational text and speech. Additional information is available at http://www.glue.umd.edu/~oard/. |
| 05 Aug 05 | Jan Hajic (Charles U) |
The Family of Prague Dependency Treebanks
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: The Prague Dependency Treebank project is aimed at a linguistically complex, multi-tier annotation of relatively large amounts of naturally occurring sentences of natural language. There are four tiers at present: the basic token tier (level 0), and the morphological, surface-syntactic, and semantic (called "tectogrammatics") tiers. The syntactic and tectogrammatic tiers are based on a richly labelled dependency representation principle. So far, the project has produced three corpora: the Czech-language-only Prague Dependency Treebank, the Prague Czech-English Dependency Treebank and the Prague Arabic Dependency Treebank. In the talk, the principles of the Prague Dependency Treebank linguistic annotation scheme will be presented. Some technical details will also be discussed, as well as some of the tools developed both for the manual annotation itself and for corpus-based NLP of Czech, English and Arabic.
|
| 15 Jul 05 | Victoria Li Fossum (Michigan) |
Inducing POS Taggers by Projecting from Multiple Source Languages
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: (Yarowsky et al., 2001) present an algorithm for bootstrapping a POS tagger for an arbitrary target language, using an existing POS tagger for a source language and a parallel corpus in the source and target languages. The source text is annotated with the POS tagger; the parallel corpus is word-aligned; the POS tags are "projected" from source to target language; and finally smoothing is performed before training a POS tagger for the target language on the projected annotations. I will talk about my work (jointly with my advisor, Steve Abney, at U. of Michigan) in which we extend this algorithm by projecting from multiple source languages onto a target language, then combining the outputs to compute a consensus POS tagger. Our hypothesis is that systematic transfer errors from different source-target pairs can be reduced by using multiple source languages. I will present experimental results for three different source languages (English, German, and Spanish), and two different target languages (French and Czech). Our results indicate that using multiple source languages improves performance. |
| 07 Jul 05 | Radu Soricut |
Natural Language Generation for Text-to-Text Applications Using an Information-Slim Representation
Time: 3:00 pm - 4:30 pm Location: 11 Small Abstract: Text-to-text applications -- Machine Translation, Summarization, Question Answering -- do not usually involve generic Natural Language Generation (NLG) systems in their generation components, but rather use application-specific algorithms. The main reason for this state of affairs is that virtually all the formalisms used by current generic NLG systems require information that cannot be reliably extracted from unrestricted text. This thesis proposal is about meeting the demand for natural language generation in the context of text-to-text applications. I introduce a new representation formalism (WIDL-expressions), propose generation algorithms that operate on representations specific to this formalism, and discuss a generic sentence realization framework for text-to-text applications. The generation mechanism is based on algorithms for intersecting WIDL-expressions with probabilistic language models. I present both theoretical and empirical results concerning the correctness and efficiency of these algorithms. I also discuss the practical aspects arising from implementing this generation mechanism. In a concrete application of the proposed generation mechanisms, I present an end-to-end Machine Translation application. I also discuss another possible application for Automated Summarization, namely automated headline generation.
|
| 06 Jul 05 | Alessandro Moschitti (Rome) |
Kernel Methods for Semantic Role Labeling
Time: 2:00 pm - 3:30 pm Location: 11 Large Abstract: Automatic Natural Language applications often require the processing of structured data. Traditional machine learning approaches attempt to represent structured syntactic/semantic objects by means of flat feature representations, i.e. attribute-value vectors. This raises two problems: 1. There is no well-defined theoretical motivation for such a feature model. Structural properties may not fit in any flat feature representation. 2. To define effective flat features, a deep knowledge of the linguistic phenomenon is required. Kernel methods for Natural Language Processing aim to solve both of the above problems, as kernel functions can be used to define similarities between linguistic objects without explicitly defining the target feature space. In this way, a linguistic phenomenon can be modeled at a more abstract level where the modeling is easier. Such a property is extremely useful when the representation of linguistic phenomena is still not well understood. For example, the feature design of semantic role labeling appears to be quite complex, since several non-definitive feature sets have been proposed. As a viable alternative to manual feature design, kernel methods propose two steps: (1) they generate all substructures of the target syntactic/semantic structures and (2) they let the learning algorithm (e.g. Support Vector Machines) select the most relevant substructures. In this talk, we (1) introduce the PropBank and FrameNet predicate argument structures, (2) present the standard approaches to the automatic labeling of semantic roles and (3) show advanced semantic role labeling models based on kernel methods. About the speaker: Alessandro Moschitti is a researcher at the Computer Science Department of the University of Rome "Tor Vergata". In 1998 he took his master's degree in Computer Science at the University of Rome "La Sapienza".
In 2003 he finished his PhD in Computer Science at "Tor Vergata" University. Between 2002 and 2004 he worked as an associate researcher at the University of Texas at Dallas. His research interests concern machine learning approaches for Natural Language Processing and Information Retrieval. His deep expertise relates to automated text categorization and semantic role labeling. Recently, he has devised new kernels which enable Support Vector and other kernel-based machines to carry out advanced semantic processing.
|
| 23 Jun 05 | Michael Fleischman (MIT) |
Intentional Context in Situated Language Learning
Time: 10:30 am - 12:00 pm Location: 11 Small Abstract: Natural language interfaces designed for agents that interact with users in shared environments (e.g. training simulators, videogames) must incorporate knowledge about the users' context in order to address the many ambiguities of situated language use. We introduce a model of situated language acquisition that operates in two phases. First, intentional context is represented and inferred from user actions using probabilistic context free grammars. Then, utterances are mapped onto this representation in a noisy channel framework. The acquisition model is trained on unconstrained speech collected from subjects playing an interactive game, and tested using an understanding task. Discussion of results focuses both on the implications for theoretical models of cognition and on natural language applications in shared environments.
|
| 22 Jun 05 | Mitsunori Matsushita |
Lumisight Table: A Face-to-face Collaboration Support System That Optimizes Direction of Projected Information to Each Stakeholder
Time: 11:00 am - 12:00 pm (Wednesday the 22nd!) Location: 11 Large Abstract: (This talk occurs in the morning on the same day as the Bayesian tutorial.) The goal of our research is to support cooperative work performed by stakeholders sitting around a table. To support such cooperation, various table-based systems with a shared electronic display on the tabletop have been developed. These systems, however, share a common problem: stakeholders cannot all read shared information such as text and images equally well, because its orientation is unfavorable from some viewing angles. To solve this problem, we propose the Lumisight Table. This is a system capable of simultaneously displaying personalized information in each required direction on one horizontal screen by multiplexing it, and of capturing stakeholders' gestures to manipulate the information. About the Speaker: Mitsunori Matsushita is a research scientist at NTT Communication Science Labs., Nippon Telegraph and Telephone Corporation (NTT). He received B.E., M.E., and Dr.E. degrees from Osaka University, in 1993, 1995 and 2003 respectively. In 1995, he joined NTT, and has been engaged in research on natural language understanding, information visualization, and interaction design.
|
| 22 Jun 05 | Hal Daume III |
Beyond EM: Bayesian Techniques for NLP Researchers
Time: 1:00 pm - 4:30 pm (Wednesday and long!) Location: 11 Large Abstract: EM has proved to be a great and useful technique for unsupervised learning problems in natural language. Unfortunately, it cannot solve every problem out there, either because the E-step is intractable, the M-step is intractable or both. Typically our community resorts to a Viterbi approximation in this case, which really isn't very justified and can easily diverge from our expectations (no pun intended). Moreover, EM -- like all maximum likelihood methods -- suffers from a need for ad-hoc and undesirable smoothing. All of these problems -- intractable E- or M-steps, the Viterbi approximation, and the annoyance of smoothing -- are solved by using Bayesian methods. Moreover, from a theoretical point of view, the Bayesian paradigm is much more foundationally well justified than the frequentist use of estimators (such as the maximum likelihood estimator), at some cost in computation (though not as much as you might believe). In this tutorial, I will discuss Bayesian methods as they can be used in natural language processing. The first half will be background (some of which you probably won't have seen, some of which you probably will have seen, but which will probably be presented in a different way than you're used to) including graphical models, EM, priors and pro- (and con-) Bayesian arguments. The second half of the tutorial will focus on solving complex inference problems, essentially building on what we've seen from EM. I'll cover MAP (*not* Bayesian -- if you can't tell me why, then you should come to the tutorial!), summing, Monte Carlo, MCMC, Laplace, variational and expectation propagation. Time permitting, I will briefly discuss Bayesian discriminative models (basically what a Bayesian uses instead of SVMs), non-parametric (infinite) models and Bayesian decision theory, all of which make use of the inference techniques we will have already covered.
This tutorial is intended to be largely self contained, though I will expect that you know what probabilities are, what distributions are and the standard manipulations of conditional/joint distributions. Familiarity with EM would be helpful, but I'll cover this topic in some depth since it will be important for understanding the rest of the tutorial. I hope -- though this never really seems to come to fruition -- that this will be a semi-interactive talk and I will attempt to adjust according to what people are interested in and what is putting people to sleep. (see http://www.isi.edu/~hdaume/bayesnlp/ for more information)
|
| 20 Jun 05 | Birte Loenneker (Hamburg) |
Between Story Generation and Natural Language Generation
Time: 10:00 am - 11:30 am Location: 11 Small Abstract: Narratology analyzes the discursive structure of narratives as finalized products of human invention, such as novels, short-stories, or fairy-tales. Those narratives are rendered in a given surface form; Narratology focuses on narratives in natural language. Narratologists assume that each narrative surface representation is associated with a neutral, abstract event sequence, the "Story" (histoire, sjuzhet). The abstractness of Story is illustrated by the fact that the same Story can be realized in different surface texts. By discursive structure or "Discourse" (discours, fabula), narratologists mean the relation between an abstract Story and its concrete expression in a sequential text. For example, if the chronological order of the Story is not respected in its textual recount, we are dealing with the Discourse parameter of order. Other Discourse parameters include the frequency with which Story events are evoked, the point of view from which they are narrated (perceived, evaluated,...), or framed narratives with several narrative levels. The Story Generator Algorithms project at the University of Hamburg evaluated several existing Story Generators with respect to their discursive abilities. It became obvious that most Story Generators concentrate on creating a coherent and chronological abstract Story, which is directly mapped onto natural language. This results in a predominance of 1:1 relations between Story and surface, and in most cases corresponds to a default or zero instantiation of Discourse parameters. As a consequence, Story Generator outputs tend to be very explicit and straightforward, and are likely to be perceived as uniform and boring. Narratological expert knowledge might be useful to future enhanced Story Generators and to Natural Language Generation systems dealing with narrative. One of the aims of Computational Narratology is to model that expert knowledge.
Ideally, narratological knowledge will be integrated into a Narratological Structurer, as a processing component of an advanced system that creates narratives. In such a system, the Narratological Structurer will be the interface between a Story Generator and subsequent Natural Language Generation modules. The talk also presents examples of the knowledge that is being modelled. About the Speaker: Birte Lönneker graduated from the University of Hamburg, Germany, with a degree in French with Finno-Ugristics (Finnish) and Business Administration. Since then, her main fields of publication have been Cognitive Linguistics and electronic resources for Natural Language Processing, with special focus on frames and metaphors, as well as electronic dictionaries, corpora, and recently part-of-speech tagging. Her PhD on Concept Frames and Relations, also published as a book in 2003, was co-supervised at the Institute for Romance Languages and at the Department of Informatics in Hamburg. For her Slovenian-German online dictionary, Birte Lönneker was twice awarded the EURALEX Laurence Urdang Award. From 2002 to 2004, she received various research grants for Slovenia, where she was working in the Corpus Laboratory of the Institute of Slovenian Language. Since 2004, Birte Lönneker has carried out research on Story Generator Algorithms within the Narratology Research Group Hamburg. She is also a board member of the German Cognitive Linguistics Association.
|
| 17 Jun 05 | Gully Burns |
The neuroscience laboratory as a knowledge factory: challenges, approaches and tools
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: As a discipline of biology, the field of neuroscience suffers greatly from information overload, non-standardization and complexity. In the absence of a mathematical theoretical structure for the subject, scientists use their own ad-hoc methods of collating and synthesizing information from both the primary literature and their own data. In order to eventually formalize and accelerate the development of theoretical approaches in the subject, we are combining an Electronic Laboratory Notebook (ELN) with asset management of the primary research literature to construct a knowledge engineering framework based around the organizational unit of a neuroscience laboratory. This project, called "NeuroScholar" (http://www.neuroscholar.org/), is open-source, and is being tested and used in the laboratories of Prof. Larry Swanson and Prof. Alan Watts at USC. In each laboratory, the system will operate on top of a "laboratory corpus" of knowledge resources (data files, full-text PDF files, etc.) that summarizes the relevant knowledge for that laboratory. Not only will this collection provide a valuable resource for the members of the laboratory, it provides a platform for natural language processing and knowledge engineering to answer formally-defined research questions. The Society for Neuroscience's annual meeting attracts over 30,000 attendees, who collectively form the potential user base of this software. I will talk about the ideas underlying the project, the current implementation of NeuroScholar, developments from collaboration with the natural language group at ISI and possible collaborations for the future.
|
| 13 Jun 05 | Hal Daume III |
Search, Learning and Features (my thesis proposal proposal)
Time: 10:30 am - 12:00 pm (MONDAY!!!) Location: 11 Small Abstract: I'm going to talk about what I've been working on recently. My thesis proposal is something having to do with the interaction of search, learning and features in supervised natural language problems. I will be focusing on the task of coreference, since it is a well-studied problem, yet nevertheless not really solved and quite difficult. It is also a great pedagogical example for why we should care about something *other* than standard Markov random fields for structured prediction, since, for the coreference problem (and pretty much every other "real" natural language problem) inference in such models is intractable. The contents of this talk will be roughly 40% from a paper I have at ICML this year on efficient, accurate supervised learning techniques for structured prediction (and why I feel inclined to make the very controversial statement that supervised learning for NLP problems is solved); it will be roughly 40% about an application of this technique to the coreference resolution problem and an exploration of the feature space for solving this problem (submitted to HLT); and it will be roughly 20% about looking forward to what I want to accomplish in the remainder of my thesis, not covered by the first 80%. |
| 10 Jun 05 | Liang Huang (Penn) |
Better k-best Parsing, Hypergraphs and Dynamic Programming
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We discuss the relevance of k-best parsing to recent applications in natural language parsing, and develop algorithms that substantially improve on previously-used algorithms with respect to efficiency, scalability, and accuracy. We demonstrate these algorithms in experiments on Bikel's implementation of Collins' lexicalized PCFG model, and on a synchronous CFG based decoder for statistical machine translation. We show in particular how the improved output of our algorithms has the potential to improve results from parse reranking systems and other applications. In this talk, I will demonstrate the convergence of several popular parsing formalisms (weighted deduction, shared forest, semiring) under the powerful hypergraph formalism. If time permits, I will also show how generic Dynamic Programming can be formalised as hypergraph searching. Joint work with David Chiang (University of Maryland)
|
| 08 Jun 05 | Hao Zhang (Rochester) |
Lexicalization and A* Searching for Inversion Transduction Grammar
Time: 3:00 pm - 4:30 pm Location: 4th floor Abstract: The Inversion Transduction Grammar (ITG) of Dekai Wu generates a synchronous parse tree for a given pair of sentences in two languages. By allowing inversion of the order of children at any level of the synchronous parse tree, ITG can do recursive, systematic word reordering. We made a version of ITG where the nonterminals are lexicalized by word pairs and the inversions are dependent on the so-lexicalized nonterminals. We found that after lexicalization, the Alignment Error Rate (AER) against the gold standard is reduced for short sentences. The complexity of ITG parsing is a high-degree polynomial. We proposed a pruning technique that utilizes IBM Model 1 to estimate the inside and outside probability of a bitext cell. Taking a step further, we applied A* parsing, previously used for monolingual parsing, to ITG. I will talk about the heuristic estimates we used in A* parsing for Viterbi alignment selection and decoding.
|
| 27 May 05 | Radu Soricut |
Towards Developing Generation Algorithms for Text-to-Text
Time: 3:00 pm - 4:30 pm Location: 11 Small Abstract: We describe a new sentence realization framework for text-to-text applications. This framework uses IDL-expressions as a representation formalism, and a generation mechanism based on algorithms for intersecting IDL-expressions with probabilistic language models. We present both theoretical and empirical results concerning the correctness and efficiency of these algorithms.
|
| 13 May 05 | Ed Stabler (UCLA) |
Natural Logic
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I will describe some recent work on "natural logics", logics for languages that are more similar to human languages than traditional first order predicate logic, giving particular attention to questions about what the syntax encodes about semantic relations among sentences. On everyone's view, some but not all entailments are syntactically encoded (in a sense that I will define precisely), but, beyond this starting point, controversy starts almost immediately. Considering some particular examples, I will sketch methods for addressing some of the basic questions.
|
| 22 Apr 05 | Deepak Ravichandran |
Working with Large Corpus, High speed clustering and its applications
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I am going to talk about what I have been working on over the past 6-9 months. This includes randomized algorithms and their application to two NLP problems: noun clustering and noun-pair clustering. I will also comment on my experience of working with very large amounts of real natural language text. (This includes processing and working with data available from the web. This corpus is not the standard newspaper text that we are so used to in the NLP community.) This talk will also cover a large part of my thesis work. |
| 08 Apr 05 | Jamie Callan (CMU) |
Search Engines for HLT Applications
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
|
| 25 Mar 05 | Dagen Wang |
Metalinguistic feature study for spontaneous speech in human computer interaction
Time: 3:00 pm - 4:30 pm Location: 11 Large (THIS HAS CHANGED!!!) Abstract: Speech is a crucial component in human computer interaction. While tremendous progress has been made in automatic speech recognition, speech transcription -- which is the output of automatic speech recognition -- is far from providing all the information that one could retrieve from speech. For example, prominence, pause, rhythm, and rate of speech all carry important information in speech and are crucial in speech perception. Inclusion of such information can facilitate better machine recognition and understanding of speech. In this talk, we will introduce our research efforts and results in speech rate, prominence, disfluency and utterance boundary detection. We will also show some interesting applications utilizing these features in natural language understanding and dialog management. |
| 18 Mar 05 | Ed Hovy |
Methodologies of ontology content construction
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk is the second in three tutorial lectures on ontologies. It first shows some details of various Upper Ontologies: ResearchCYC, SUMO, DOLCE, and the Penman Upper Model. It then discusses the problem of creating content for the 'Middle Model' zone of ontologies, and outlines a methodology for moving from words to word senses to concepts. It concludes by describing ISI's Omega ontology and showing how Omega has been used in annotation projects to support semantic labeling of texts. Please bring a pen or pencil and some paper; there is a small exercise!
|
| 18 Feb 05 | Inderjeet Mani (Georgetown) |
TBA
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
|
| 14 Feb 05 | Tim Chklovski |
Collecting Broad-Coverage Knowledge Bases from Volunteers
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: (Note that this is a MONDAY!)
|
| 11 Feb 05 | Hae-Chang Rim |
Unsupervised Word Sense Disambiguation Using Wordnet Relatives
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: |
| 28 Jan 05 | Yutaka Sasaki (ATR) |
Research Activities in Speech Translation at ATR/QA as Question-Biased Term Extraction
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk has two parts. In the first part, I will introduce research activities in Speech-to-Speech Translation at ATR, including on-going research on statistical machine translation. In the second part, I will present a new approach to QA named Question-Biased Term Extraction (QBTE). The QBTE directly extracts answers as terms biased by the question. To confirm the feasibility of our QBTE approach, we conducted experiments on the CRL QA Data based on 10-fold cross validation, using Maximum Entropy Models as an ML technique. Experimental results showed that the trained system achieved approximately 0.35 in MRR and 50% in TOP5 accuracy. This part is an English version of my presentation given at IPSJ SIGNL-163 in 2004 in Japanese. If time allows, I would like to introduce the NTCIR-5 (2004/2005) Cross-Lingual QA task (CLQA) that I am going to organize. About the speaker: Yutaka Sasaki received his Ph.D. in Engineering from the University of Tsukuba, Japan in 2000 for his work on generating Information Extraction rules with hierarchically sorted Inductive Logic Programming. He joined NTT Laboratories in 1988. Since then, he has been involved in research in rule-based CAI, inductive logic programming, Information Extraction, and Question Answering. From 1995 to 1996, he spent one year at Simon Fraser University, Canada as a visiting researcher. From 1999, he led a subgroup to develop the first practical Japanese Question Answering System SAIQA. Then, he applied SVMs to automatically construct the QA system SAIQA-II from QA and NE data. In June 2004, he moved to ATR Spoken Language Translation Research Laboratories. Currently, he is the head of the Department of Natural Language Processing. He is also an organizer of the NTCIR 5 Cross-Lingual Question Answering Task.
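For readers unfamiliar with the two scores quoted in the abstract, the following is a minimal sketch of how Mean Reciprocal Rank (MRR) and top-5 accuracy are commonly computed. It assumes each question comes with a ranked list of candidate answers and a set of acceptable gold answers; the function names and the toy data are hypothetical and are not taken from the QBTE system or the CRL QA Data.

```python
def reciprocal_rank(ranked, gold):
    """Return 1/rank of the first correct candidate, or 0.0 if none is correct."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def mrr_and_top_k(runs, k=5):
    """Compute (MRR, top-k accuracy) over a list of (ranked, gold) pairs.

    Assumption: the reciprocal rank is taken over the whole ranked list that
    is passed in; truncate the lists beforehand if only the top k should count.
    """
    n = len(runs)
    mrr = sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / n
    hits = sum(any(c in gold for c in ranked[:k]) for ranked, gold in runs)
    return mrr, hits / n


# Toy, made-up example: three questions with ranked candidates and gold answers.
example_runs = [
    (["Kyoto", "Tokyo"], {"Tokyo"}),   # first correct answer at rank 2
    (["1998"], {"2000"}),              # no correct answer
    (["ATR"], {"ATR"}),                # first correct answer at rank 1
]
example_mrr, example_top5 = mrr_and_top_k(example_runs)
```

On this toy input the reciprocal ranks are 1/2, 0 and 1, so the MRR is 0.5 and two of the three questions have a correct answer in the top five.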
|
| 17 Dec 04 | Nicola Ueffing |
Word-Level Confidence Measures for SMT
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk will address the problem of assessing the correctness of MT output on the word level. I will give an overview of word confidence measures for SMT. Different variants of word posterior probabilities that can be directly used as confidence measures will be presented. Their connection with the Bayes decision rule and the underlying error measure will be shown. An experimental comparison of different word confidence measures will be presented on a translation task consisting of technical manuals. Additionally, I will show how word confidence measures can be applied in an interactive SMT system. This system predicts translations, taking parts of the sentence into account that have already been accepted or typed by the user. Through the use of confidence measures, the performance of the prediction engine can be improved. About the Speaker: Nicola Ueffing is a graduate research assistant at the group for "Human Language Technology and Pattern Recognition" (Lehrstuhl fuer Informatik VI) at RWTH Aachen University. She received her diploma in mathematics from RWTH Aachen University in 2000. Her research topic is statistical machine translation, focusing on confidence measures for SMT. In 2003, she was a member of the team working on "Confidence Estimation for SMT" at the CLSP workshop at JHU.
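As a rough illustration of the idea of a word posterior probability, the sketch below estimates one simple variant from an N-best list: the posterior of a word at a given target position is the normalized probability mass of all hypotheses that contain that word at that position. This is only an assumed, simplified variant for illustration; the abstract does not specify which variants the talk covers, and the function name and toy N-best list are hypothetical.

```python
import math
from collections import defaultdict


def word_posteriors(nbest):
    """Position-based word posteriors from an N-best list.

    nbest: list of (tokens, log_score) pairs, where log_score is an
    unnormalized log-probability assigned by the translation system.
    Returns a dict mapping (position, word) to a posterior in [0, 1].
    """
    # Subtract the maximum log score before exponentiating for numerical stability.
    max_log = max(score for _, score in nbest)
    weights = [math.exp(score - max_log) for _, score in nbest]
    total = sum(weights)

    mass = defaultdict(float)
    for (tokens, _), weight in zip(nbest, weights):
        for position, word in enumerate(tokens):
            mass[(position, word)] += weight / total
    return dict(mass)


# Toy, made-up N-best list with two hypotheses.
example_nbest = [
    (["the", "house"], math.log(0.6)),
    (["the", "home"], math.log(0.4)),
]
example_posteriors = word_posteriors(example_nbest)
```

In this toy case "the" appears at position 0 in every hypothesis and receives posterior 1.0, while "house" and "home" split the mass at position 1. A word could then be tagged as likely correct when its posterior exceeds a tuned threshold.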
|
| 10 Dec 04 | Nick Mote |
Developing a Language Model for Second Language Learner Speech
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: ISI's Tactical Language Project is a system designed to teach Americans how to speak Arabic through a video game environment. We've taken an FPS engine (Unreal 2003) and redid the graphics so it looks like you're in a typical Lebanese village. We took away the guns, added speech recognition, and set the player in the middle of it all. The theory is that if you learn well in a classroom, you'll perform well in a classroom, but if you learn well in a pseudo-naturalistic environment, you'll perform better in real life. In a pedagogical context, speech recognition is hard: we're trying to recover signal from noisy language-learner speech, with all of its mispronunciations, disfluencies, and grammatical errors. Language understanding is hopeless unless you have a good approximation of what kinds of mistakes learners make, and you can build a system to anticipate them. Suppose an English language learner says "Water". Is he asking you for water? Is he telling you there's a puddle in front of you? Is he saying his name is "Walter", but with horrible pronunciation? There's a lot of ambiguity involved. In order to disambiguate, we need to look at the speech signal itself, the utterance's context, the learner's past language performance, details about the learner's mother language as it relates to English, and so on. Only then can we hope to guess what the learner is actually trying to say. And then, of course, once we've made a good guess at the learner's speech intentions, what do we do about it? How do we correct him? How do we balance the consideration of inherent qualities of learner motivation, language errors, learning objectives, and possibly low-confidence speech recognition, as we generate good pedagogical feedback? This is NLP (primarily statistical) with a bit of pedagogy theory and linguistic (SLA and phonology) theory sprinkled in.
|
| 19 Nov 04 | Chin-Yew Lin |
After TIDES, What's Left? - Finding Basic Elements
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: As DARPA's TIDES (Translingual Information Detection, Extraction, and Summarization) program comes to an end, I will give a summary of what we have learned from TIDES in summarization and a brief overview of our current effort in developing automatic evaluation methods that go beyond surface n-gram matching. Topics to be covered: (1) Summary of DUCs 2001 - 2004 (2) Automatic Evaluations in Summarization and MT (3) Basic Elements - New Efforts in Summarization at ISI |
| 15 Nov 04 | Thiago Pardo |
Unsupervised learning of verb argument structures
Time: 3:00 pm - 4:30 pm (note the strange date!) Location: 8th floor multipurpose room (#849) -- NOT the conference room Abstract: In this talk, I'll present the investigation I have been carrying out at ISI lately under Daniel Marcu's supervision. Following the noisy-channel framework, we propose a statistical model for learning the argument structures of verbs automatically. We show that we are able to learn both lexicalized and generalized structures and achieve good results, relying only on basic NLP tools like a POS tagger and a named-entity recognizer. We also present a comparison of the structures we learn with the predicted ones in PropBank.
|
| 12 Nov 04 | Dragomir Radev |
Words, links, and patterns: novel representations for Web-scale text mining
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Textual data is everywhere, in email and scientific papers, in online newspapers and e-commerce sites. The Web contains more than 200 terabytes of text not even counting the contents of dynamic textual databases. This enormous source of knowledge is seriously underexploited. Textual documents on the Web are very hard to model computationally: they are mostly unstructured, time-dependent, collectively authored, multilingual, and of uneven importance. Traditional grammar-based techniques don't scale up to address such problems. Novel representations and analytical tools are needed. I will discuss several current projects at Michigan related to text mining from a variety of genres. Depending on the amount of time, I will talk about (a) lexical centrality for multidocument summarization, (b) syntax-based sentence alignment, (c) graph-based classification, (d) lexical models of Web growth, and (e) mining protein interactions from scientific papers. As it turns out, the right representations, when complemented with traditional NLP and IR techniques, turn many of these into instances of better studied problems in areas such as social networks, statistical mechanics, sequence analysis, and computational phylogenetics.
About the Speaker: Dragomir R. Radev is Assistant Professor of Information, Electrical Engineering and Computer Science, and Linguistics at the University of Michigan, Ann Arbor. He leads the CLAIR (Computational Linguistics And Information Retrieval) group which currently includes 12 undergraduate and graduate students. Dragomir holds a Ph.D. in Computer Science from Columbia University. Before joining Michigan, he was a Research Staff Member at IBM's TJ Watson Research Center in Hawthorne, NY. He is the author of more than 45 papers on information retrieval, text summarization, graph models of the Web, question answering, machine translation, text generation, and information extraction. Dr. Radev's current research on probabilistic and link-based methods for exploiting very large textual repositories, representing and acquiring knowledge of genome regulation, and semantic entity and relation extraction from Web-scale text document collections is supported by NSF and NIH. Dragomir serves on the HLT-NAACL advisory committee, was recently reelected as treasurer of NAACL, is a member of the editorial boards of JAIR and Information Retrieval, and is a four-time finalist at the ACM international programming finals (as contestant in 1993 and as coach in 1995-1997). Dragomir received a graduate teaching award at Columbia and recently, the U. of Michigan award for Outstanding Research Mentorship (UROP).
|
| 05 Nov 04 | Mary Wood (Manchester) |
A Human-Computer Collaborative Approach to Computer Aided Assessment
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The ABC (Assess by Computer) system has been developed and used in the School of Computer Science at the University of Manchester for formative and (principally) summative assessment at undergraduate and postgraduate level. We believe that fully automatic marking of constructed answers - especially free text answers - is not a sensible aim. Instead - drawing on parallels in the history of machine translation - we take a "human-computer collaborative" approach, in which the system does what it can to support the efficiency and consistency of the human marker, who keeps the final judgement. Our current work focuses on what are generally referred to as "short text answers" as contrasted to "essays". However we prefer to contrast "factual" with "discursive" answers, and speculate that the former may be amenable to simple statistical techniques, while the latter require more sophisticated natural language analysis. I will show some examples of real exam data and the techniques we are using and developing to handle them.
|
| 22 Oct 04 | Jerry Hobbs |
Like Now: Two Explorations in Deep Lexical Semantics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: As part of an effort to encode the commonsense knowledge we need in natural language understanding, I have been looking at several very common words and their uses in diverse corpora, and asking what we have to know to understand this word in this context. In this talk, I will describe the investigations of the uses of two words -- the adverb "now" and the preposition "like". One might think that "now" simply expresses a temporal property of an event. But in fact in almost every instance, it is used to point up a contrast -- "This is true now. Something else was true then." It is thus more of a relation than a property. I will describe several categories of such relations. Another question of interest about "now" is "How long a period is the word "now" describing in its various uses?": "I'm typing an abstract now" vs. "We travel by automobile now." I suggest some categories of knowledge that need to be encoded to answer this question. When we successfully understand "A is like B", we have figured out some property that A and B have in common. How can we find that property computationally? In the data I looked at, in 80% of the instances, the property is explicit in the nearby text, and I will talk about how we can identify it. For the remainder I examine the knowledge we would need in order to infer the common property.
|
| 24 Sep 04 | Hal Daume III |
Domain Adaptation in Maximum Entropy Models
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I will present some preliminary results on the problem of domain adaptation in maximum entropy models, specifically in the case when there is a large amount of "out of domain" data, and only a very small amount of "in domain" data. The model and algorithms I present are based on the technique of conditional Expectation Maximization (CEM) and allow for relatively fast optimization of these models. Preliminary results on some tasks are quite promising.
|
| 17 Sep 04 | Various |
About Syntax Fest 2004 (Part II)
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This summer we held a three-month workshop on syntax-driven machine translation, in which we learned syntactic transformations automatically from Chinese/English translated corpora and applied them to translate new text. We'll give a progress report!
|
| 10 Sep 04 | Various |
About Syntax Fest 2004 (Part I)
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This summer we held a three-month workshop on syntax-driven machine translation, in which we learned syntactic transformations automatically from Chinese/English translated corpora and applied them to translate new text. We'll give a progress report!
|
| 16 Aug 04 | Patrick Pantel & Tim Chklovski |
VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations
Time: 2:00 pm - 3:30 pm (note the strange time) Location: 11 Large Abstract: Broad-coverage repositories of semantic relations between verbs could benefit many NLP tasks. We present a semi-automatic method for extracting fine-grained semantic relations between verbs. We detect similarity, strength, antonymy, enablement, and temporal happens-before relations between pairs of strongly associated verbs using lexico-syntactic patterns over the Web. On a set of 29,165 strongly associated verb pairs, our extraction algorithm yielded 65.5% accuracy. We provide the resource, called VerbOcean, for download at http://semantics.isi.edu/ocean/. We will also discuss current work on disambiguating the verbs in the network as well as refining the semantic relations using path analysis.
|
| 13 Aug 04 | Deepak Ravichandran |
Randomized algorithms and its application to NLP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The last decade has seen a plethora of papers in NLP devoted to Machine Learning algorithms. However, most of these papers have devoted their effort exclusively to improving the system performance on the accuracy axis. Most of the sophisticated NLP algorithms are extremely slow and do not scale up easily when applied to large amounts of data. I will talk about the importance of randomized algorithms and their potential in speeding up some NLP algorithms. This talk will be a survey of some recent advances in Theoretical Computer Science/Math seen from an NLP point of view. I am not going to present any results. But I am hoping that this talk will clarify my thinking process, get feedback from people, and help me collaborate with others.
|
| 09 Aug 04 | Justin Busch, Hai Huang, Jens Stephan & Chen-kang Yang |
CL Student Presentations
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Justin Busch: Weight and Semantic Class Issues in Japanese Noun Phrase Ordering Many current designs for automatic parsers learn probabilities for the relative frequencies of parts-of-speech and syntactic rules, and this has proven to be generally reliable. In spite of the ubiquity of probabilistic techniques for parsing, however, little attention has been given to the linguistic significance of the probabilistic data and what it might say about human performance. Hawkins proposes a general theory of grammaticalization based on the minimization of syntactic domains. Given that a sentence of any language will contain at least one noun phrase, one verb, and possibly additional noun phrases and prepositional phrases, "minimize domains" suggests that these phrases will order themselves according to whichever pattern requires the least effort to recognize the higher syntactic structure of the sentence. These effects are directly measurable through corpus statistics, and can be interpreted as potential heuristics for probabilistic parsers. In this study, we examine Japanese data from the Kyoto Treebank and test Hawkins' predictions for noun phrase ordering by noun phrase weight as well as by generic semantic types. The discussion will focus primarily on how accurately Hawkins' predictions are reflected in the corpus statistics, and will conclude with observations about how they might be applied to the decision mechanisms of probabilistic parsers. -------------------------------------------------------------------------- Hai Huang: TBA -------------------------------------------------------------------------- Jens Stephan: Evaluation and Visualization of a Dialogue System Evaluations have become a necessary standard in almost any type of research. However, there are many areas where there is no common agreement on how to evaluate, which is the case for the complex problem of evaluating dialogue systems. 
The evaluation of the multi-party, multimodal dialogue system MRE(1) provides a good example of what questions are important for such an evaluation, how to actually do the evaluation, and finally how to make special problems of the system visible so that the evaluation results can be used to improve the system's performance. After a brief introduction of the MRE domain and architecture, I will break the task down to a set of general evaluation questions. From there I will explain what kinds of metrics and visualizations are suited to answer those questions and what kind of data is needed, as well as how that data was obtained. Along the road, examples of actual system problems and performances will be presented. The topics of data formatting and visualization will receive some special attention by introducing the MRE Evaluation Toolkit as well as the corpus it operates on. -------------------------------------------------------------------------- Chen-kang Yang: Using the Omega Ontology to Determine Selectional Restrictions for Word Sense Disambiguation Word sense disambiguation is fundamental for language processing. Though purely statistical methods are effective for this task, they neglect the syntactic and semantic aspects. In this study, we adopt a hybrid approach by applying an unsupervised machine learning method to learn verbs' selectional restrictions on their subjects/objects. The system then uses these learned selectional restrictions for word sense disambiguation of the subjects/objects. Instead of words, the training data contains ontological taxonomy hierarchies that are retrieved from the Omega ontology. Unlike other similar systems, we are able to automatically find the best match among classes from different levels of the ontology. This gives us more flexibility and is closer to human instinct. Our system performs better than other similar systems, though it still needs cooperating methods for better results. |
| 06 Aug 04 | Hae-Chang Rim |
Information Retrieval using Word Senses: Root Sense Tagging Approach
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Information retrieval using word senses is emerging as a good research challenge in semantic information retrieval. In this presentation, I am going to propose a new method for using word senses in information retrieval: the root sense tagging method. This method assigns coarse-grained word senses defined in WordNet to query terms and document terms in an unsupervised way, using co-occurrence information constructed automatically. The sense tagger is crude, but performs consistent disambiguation by considering only the single most informative word as evidence to disambiguate the target word. We also allow multiple-sense assignment to alleviate the problem caused by incorrect disambiguation. Experimental results on a large-scale TREC collection show that the proposed approach to improving retrieval effectiveness is successful, while most of the previous work failed to improve performance even on small text collections. The proposed method also shows promising results when it is combined with pseudo relevance feedback and a state-of-the-art retrieval function such as BM25.
|
| 16 Jul 04 | Hal Daume III and Radu Soricut |
Practice Talks for ACL (+workshops)
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
|
| 09 Jul 04 | Kevin Knight |
Survey of Trees and Grammars
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I'll give a survey of trees and grammars, at least the parts that seem most relevant to ongoing work at ISI. This will be a theory talk. I'll start with context-free grammars, which were developed in the 1950s, and cover other tree-generating systems. I'll also talk about tree-transforming systems. |
| 02 Jul 04 | Hal Daume III |
A Phrase-Based HMM Approach to Document/Abstract Alignment
Time: 1:30 pm - 3:00 pm Location: 11 Large Abstract: I will present work that extends the standard hidden Markov model to a version that can emit multiple symbols in a single time step. Using this model, we are able to automatically create phrase-to-phrase mappings in an alignment process. I've applied this model to the task of creating alignments between documents and their human-written abstracts, yielding an overall alignment F-score of 0.548, a significant improvement on the best results to date of 0.363. These results are published in an EMNLP paper this year, but the talk will be an extended version of the talk I will give there (namely, I will discuss the mechanics of the extended HMM in more detail in this seminar).
|
| 25 Jun 04 | Dan Gildea |
Syntactic Supervision and Tree-Based Alignment
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Tree-based probability models of translation have been proposed to take advantage of parse trees on one, both, or neither side of a parallel corpus. I will present comparative results for these three approaches for the task of word alignment on Chinese-English and French-English data, as well as some analysis of what is going on behind the numbers.
|
| 21 Jun 04 | Emil Ettelaie |
Speech-to-Speech Translation: A Phrase Classification Approach
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This talk will be about automatic speech-to-speech translation. In our system, a doctor speaks one language, the patient speaks another language, and the machine translates their utterances from one language to the other. The talk will be followed by a demo of our system. One approach we have been successful with is phrase classification, i.e., classifying a noisy speech-recognized utterance into one of many meaning categories. Phrase classification is computationally cheap and can provide high quality translations for in domain utterances almost instantaneously. Speed is important for speech translation, where processing delay is a great concern. In this talk, different aspects of building a classification-based speech translator are discussed. Following an overview of automatic speech-to-speech translation and its challenges, a comparison of different classification methods is presented and data collection techniques for that application are introduced.
|
| 17 Jun 04 | Marcello Federico |
Statistical Machine Translation at ITC-irst
Time: 3:00 pm - 4:30 pm Location: 4th Floor Abstract: My presentation will survey recent activities on Chinese-English SMT carried out at ITC-irst (Trento, Italy). After an overview of the complete architecture of our system, I will focus on progress made in Chinese word-segmentation, phrase-based modeling and decoding, log-linear modeling and minimum error training, and language model adaptation. Experimental results will be provided in terms of BLEU and NIST scores on two translation tasks: basic traveling expressions and news reports, respectively adopted by the C-STAR consortium and for the 2002 and 2003 NIST MT evaluation campaigns. Bio: Marcello Federico has been a permanent researcher at ITC-irst since 1991. During 1998-2003, he led the "Multilingual natural speech technologies" (MUNST) research line at ITC-irst. Since 2004, he has been head of the "Cross-language information processing" (Hermes) research line. His interests include automatic speech recognition, statistical language modeling, information retrieval, and machine translation.
|
| 24 May 04 | Philipp Koehn |
Challenges in Statistical Machine Translation
Time: 4:00 pm - 5:00 pm Location: 11 Large Abstract: In recent years a standard model in statistical machine translation has emerged, which is based on the translation of sequences of words (so-called "phrases") at a time. I will describe this model, how to train and decode with it, but the focus of this talk will be how to address the challenges to advance and move beyond the model: my thesis work on noun phrase translation, making use of syntax, and better modeling, such as discriminative training. Bio: Philipp Koehn is the author of papers on natural language processing, machine translation, and machine learning. He received his PhD from the University of Southern California in 2003 (advisor: Kevin Knight), and is currently employed as a postdoc at the Massachusetts Institute of Technology, working with Michael Collins. He has worked at AT&T Laboratories on text-to-speech systems, and at WhizBang! Labs on text categorization.
|
| 21 May 04 | Tom Murray and Rahul Bhagat |
Statistical Learning for Dialogue System and A Community of Words
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Natural Language Understanding: A fast and accurate Statistical Learning Approach for Dialogue Systems Natural Language Understanding (NLU) is an essential module of a good dialogue system. To achieve satisfactory performance levels, real time dialogue systems need the NLU module to be both fast and accurate. Finite State Model (FSM) based systems are fast and accurate but lack robustness and flexibility. The Statistical Learning Model (SLM) based systems are robust and flexible but lack accuracy and are at most times slow. In this talk, I am going to talk about an SLM based NLU approach for dialogue utterances that is both accurate and fast. The system has high accuracy and produces frames in real time. A Community of Words: Understanding Social Relationships from E-mail A corpus of e-mail messages presents a number of challenges for NLP techniques, with its nearly unconstrained structure and vocabulary, mistyped words and ungrammatical sentences, and extensive contextual information that is never explicitly stated. Yet, the intrinsically social nature of such communication provides an opportunity to study not just a bag of words, but also the relationships, competencies, and activities behind them. This talk presents work with Eduard Hovy as part of the MKIDS project.
|
| 30 Apr 04 | Liang Zhou |
Automating the Building of Summarization Systems
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Summarization requires one to identify the internal structure of information and to bring that to the surface both operationally and organizationally. How does one put this theory to practice and build real summarization systems? How do the systems built based on this idea perform?
|
| 28 Apr 04 | Dragos Munteanu, Radu Soricut and Hal Daume III |
Practice Talks for HLT/NAACL
Time: 3:00 pm - 5:00 pm Location: 11 Large Abstract: TBA
|
| 23 Apr 04 | Hal Daume III |
A Tree-Position Kernel for Document Compression
Time: 3:00 pm - 4:00 pm Location: 10 Large Abstract: I'll describe our entry into the DUC 2004 automatic document summarization competition. We competed only in the single document, headline generation task. Our system is based on a novel kernel dubbed the tree position kernel, combined with two other well-known kernels. Our system performs well on white-box evaluations, but does very poorly in the overall DUC evaluation. C'est la vie. Slides: 04-tree-position-kernel.ps.bz2 04-tree-position-kernel.pdf |
| 16 Apr 04 | Rada Mihalcea (UNT) |
Graph-based Ranking Algorithms for Language Processing
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: Although we live in a predominantly statistical world, there are still many language processing applications that long for accurate representations of text meaning. Even applications that found partial solutions in statistical modeling, including information retrieval, machine translation, or automatic summarization, are likely to get a significant boost from deeper text understanding. In this talk, I will present an innovative method for automatic extraction of conceptual graphs as a means to represent text meaning. The method relies on a novel adaptation of graph-based ranking algorithms - traditionally (and successfully) used in citation analysis, Web page ranking, and social networks. I will show how such algorithms can be adapted to semantic networks, resulting in an efficient unsupervised method for resolving the semantic ambiguity of all words in open text, and identifying relations between entities in the text. I will also outline a number of applications that are enabled by this representation, including keyphrase extraction, domain classification, and extractive summarization. BIO: Rada Mihalcea is an Assistant Professor of Computer Science at University of North Texas. Her research interests are in lexical semantics, minimally supervised natural language learning, and multilingual natural language processing. She is currently involved in a number of research projects, including word sense disambiguation, shallow semantic parsing, (non-traditional) methods for building annotated corpora with volunteer contributions over the Web, word alignment for language pairs with scarce resources, and graph-based ranking algorithms for language processing. Her research is supported by NSF and the state of Texas. |
| 13 Apr 04 | Jill Burstein (ETS) |
Automated Essay Evaluation: From NLP research through deployment as a business
Time: 3:00 pm - 4:30 pm Location: 4 Large Abstract: Automated essay scoring was initially motivated by its potential cost savings for large-scale writing assessments. However, as automated essay scoring became more widely available and accepted, teachers and assessment experts realized that the potential of the technology could go way beyond just essay scoring. Over the past five years or so, there has been rapid development, and commercial deployment of automated essay evaluation for both large-scale assessment and classroom instruction. A number of factors contribute to an essay score, including varying sentence structure, grammatical correctness, appropriate word choice, errors in spelling and punctuation, use of transitional words/phrases, and organization and development. Instructional software capabilities exist that provide essay scores and evaluations of student essay writing in all of these domains. The foundation of automated essay evaluation software is rooted in NLP research. This talk will walk through the development of CriterionSM, e-rater, and Critique writing analysis tools, automated essay evaluation software developed at Educational Testing Service - from NLP research through deployment as a business. (Preview of an HLT/NAACL-2004 Invited Speaker Presentation) Jill Burstein Educational Testing Service Princeton, NJ
|
| 09 Apr 04 | Eduard Hovy |
Three (and a half?) Trends: The Future of NLP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: An interesting (disturbing?) new trend is beginning to manifest itself in NLP, one that is focused on performance and hence very attractive in the context of inter-system competitive evaluations such as TREC and DUC, but one that does not provide much insight about language or NLP methods to the researcher interested in these topics. This addition of a new paradigm to NLP has implications for all of us.
|
| 02 Apr 04 | Stephan Vogel |
The CMU Statistical Machine Translation System
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The presentation will give an overview of the SMT activities at the Language Technologies Institute, Carnegie Mellon University, in large vocabulary text translation tasks, esp. Chinese-English and Arabic-English, as well as in limited domain speech-to-speech translation tasks. The CMU SMT system is, like most modern statistical MT systems, based on phrase translation. Several approaches have been developed to extract the phrase pairs from parallel corpora, and current research investigates different scoring approaches for these translation pairs. Details of the decoder, esp. on hypothesis recombination, pruning, and efficient n-best list generation, will be given. Recently, the SMT system has been extended to use partial translations generated from example-based and grammar-based translation systems, thereby performing multi-engine machine translation. Bio: Stephan Vogel is a researcher at the Language Technologies Institute, Carnegie Mellon University, where he heads the statistical machine translation team. He received a Diploma in Physics from Philipps University Marburg, Germany, and a Masters of Philosophy from the University of Cambridge, England. After working for a number of years on the history of science, he turned to computer science, especially natural language processing. Before coming to CMU, he worked for several years at the Technical University of Aachen on statistical machine translation, and also in the Interactive Systems Lab at the University of Karlsruhe.
|
| 26 Mar 04 | Shlomo Argamon |
On Writing, Our Selves: Explorations in Stylistic Text Categorization
Time: 1:30 pm - 3:00 pm Location: 11 Large Abstract: This talk will survey results of several recent projects we have been undertaking in automated text categorization based upon the style, rather than the topic, of the documents. I will describe a general text-categorization framework using machine learning along with general principles for choosing stylistically relevant sets of features for learning effective classification models. Applications of these methods include determining author gender and text genre in published books and articles, authorship attribution of email messages, and analysis of language use in different scientific fields. In many cases, the models that are learned also give some insight into the respective styles being distinguished, which I will also discuss. Shlomo Argamon is an associate professor at the Illinois Institute of Technology Chicago.
|
| 25 Mar 04 | Jon Patrick (U. of Sydney) |
ScamSeek: Capturing Financial Scams at the Coalface by Language Technology
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: The Scamseek project aims to build a surveillance tool for identifying financial scams on the Internet by performing document classification of Internet pages. There are three principal types of documents of concern: those that give financial advice by unregistered advisors, unlawful investment schemes, and share ramping. The first phase of the project has been completed and a working system, known as ScamAlert, installed at the Australian Securities and Investment Commission (ASIC). The independent audit of the performance of the system proved satisfactory with a result for precision of .75, recall .43, and F=.54, along with identification of 4 scams misclassified by the client. Significant improvement in recall is foreshadowed in the 2nd phase of the project. The results are satisfying in the context of the structure of the data where the density of scam documents is about 1.8% of the total corpus. The good performance of the operational system is ascribed to the combination of using a strong linguistic model of language (Systemic Functional Linguistics) to define the scam documents in parallel with a rich statistical analysis of the structure of non-scam documents and scam look-alikes. A large amount of the experimental program has concentrated on understanding and exploiting the interaction between the linguistically described aspects of the documents and the statistical properties. Each type of data has been used to inform and modify the usage of the other. The operational aspects of the project have proven to be as challenging as the research objectives. The project has a budget of $2.2M over 15 months. It has been managed so as to create a balance in resources between the needs of both the research objectives and the engineering objectives. Software development has concentrated on three aspects. 
Firstly, to produce an environment for the strong directive management of computational linguistics experiments, secondly, in the aid of the linguists to create tools to support their manual analysis, and thirdly the best practice of software engineering principles to ensure a clean automated rollout of the production system for ASIC. The contributing partners in the Scamseek project are The Capital Markets Co-operative Research Centre (CMCRC), ASIC, the University of Sydney and Macquarie University. |
| 12 Mar 04 | Deepak Ravichandran |
About My Thesis Proposal
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
Slides: TP.pdf |
| 20 Feb 04 | Hal Daume III |
Some Results in Automatic Evaluation for Summarization and MT
Time: 3:00 pm - 4:00 pm Location: 4 Large Abstract: I will be presenting some recent results of mine regarding the possibility of automatic evaluation in summarization. I will discuss both my own findings, as well as those of people here and at Columbia, and attempt to explain in a principled fashion why there are disparate opinions on the plausibility of performing automatic evaluation in this task. I will discuss my (perhaps pessimistic) views on the plausibility of doing any sort of evaluation of summarization, automatic or otherwise. The results and experimental setups developed in connection with summarization will be extended to machine translation. I will review possible reasons why metrics such as BLEU have experienced significantly more success in machine translation than in summarization. I will also connect the evaluation criteria developed in the context of summarization to machine translation, and discuss the automation of these methods. In short: I'll talk about why I've been doing so much data elicitation recently. This will be a highly informal seminar and participation is highly encouraged.
Slides: sumeval.ps |
| 06 Feb 04 | Mark Hopkins |
What's in a Translation Rule?
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We propose a theory that gives formal semantics to word-level alignments defined over parallel corpora. We use our theory to introduce a linear algorithm that can be used to derive from word-aligned, parallel corpora the minimal set of syntactically motivated transformation rules that explain human translation data. (joint work with Michel Galley, Kevin Knight, and Daniel Marcu)
|
| 30 Jan 04 | Paul Kingsbury (Penn) |
PropBank: the next stage of Treebank and Inducing a Chronology of the Pali Canon
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: PropBank: the next stage of Treebank Natural-language engineers the world over are coming to a consensus that a degree of semantic knowledge is a necessary addition to purely structural representations of language. This talk describes the Propbank project at Penn, which provides a complete shallow semantic parse of the Treebank II corpus. Inducing a Chronology of the Pali Canon: Works such as Kroch (1989), Taylor (1994) and Han (2000) have demonstrated that syntactic change can be described mathematically as the competition between innovating and archaic formations. This paper demonstrates how this same mathematical description can be turned around to predict the date of a historical text. The Middle Indic period showed dramatic change in the morphological system, such as the collapse of the past-tense verbal system. Whereas Sanskrit had three competing formations, each with multiple possible morphological realizations, Pali (a Middle Indo-Aryan language) had only a single formation, based mostly on the sigmatic aorist although many archaic nonsigmatic aorists are also attested. The proportions of the archaic and innovative forms can be easily calculated for each text in the Pali Canon and these proportions used to assign an approximate date for each text. The accuracy of the method can be assessed qualitatively by comparing the derived chronology to chronologies based on various non-linguistic criteria, or quantitatively by comparing the derived chronology to a known dating scheme. For the latter it is necessary to turn to a different dataset, such as that describing the rise of do-support in Early Modern English, as described in Ellegard (1953) and Kroch (1989). Bio: Paul Kingsbury graduated summa cum laude in linguistics from Ohio State University in 1993 with a thesis on "Some sources for L-words in Sanskrit". 
He subsequently entered the University of Pennsylvania to study historical linguistics and Sanskrit, but (like most historical students) was diverted to computational issues. He joined the Propbank project in 2000 and soon thereafter engineered a major rethinking of the methods and goals of the project, in order to make the annotation linguistically meaningful. He completed his doctorate in 2002 with a thesis entitled 'The Chronology of the Pali Canon: the case of the aorist'.
|
| 16 Jan 04 | John Prager (IBM) |
Using Constraints to Improve Question-Answering Accuracy
Time: 2:00 pm - 3:00 pm Location: 11 Large Abstract: Leading Question-Answering systems employ a variety of means to boost the accuracy of their answers. Such methods include redundancy (getting the same answer from multiple documents/sources), deeper parsing of questions and texts (hence improving the accuracy of confidence measures), inferencing (proving the answer from information in texts plus background knowledge) and sanity-checking (verifying that answers are consistent with known facts). To our knowledge, however, no QA system deliberately asks additional questions in order to derive constraints on the answers to the original questions. We present in this talk the method of QA-by-Dossier-with-Constraints (QDC). This is an extension of the simpler method of QA-by-Dossier, in which definitional questions ("Who/what is X") are addressed by asking a set of questions about anticipated properties of X. In QDC, the collection of Dossier candidate answers, along with possibly other answers to questions asked expressly for this purpose, are subjected to satisfying a set of naturally-arising constraints. For example, for a "Who is X" question, the system will ask about birth, accomplishment and death dates, which, if they exist, must occur in that order, and also obey other constraints such as lifespan. Temporal, spatial and kinship relationships seem to be particularly amenable to this treatment, but it would seem that almost any "factoid" question can benefit from QDC. We will discuss the setting-up and application of constraint networks, and talk about how (and whether) to develop the constraint sets automatically. We will demonstrate several applications of QDC, and present one evaluation in which the F-measure for a set of questions improved with QDC from .39 to .69. |
| 19 Dec 03 | Robert Krovetz (Ask Jeeves) |
More than One Sense Per Discourse
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Previous research has indicated that when a polysemous word appears two or more times in a discourse, it is extremely likely that they will all share the same sense (Gale et al. 92). However, those results were based on a coarse-grained distinction between senses (e.g., "sentence" in the sense of a 'prison sentence' vs. a 'grammatical sentence'). I conducted an analysis of multiple senses within two sense-tagged corpora, Semcor and DSO. These corpora used WordNet for their sense inventory. I found significantly more occurrences of multiple-senses per discourse than reported in (Gale et al. 92) (33% instead of 4%). I also found classes of ambiguous words in which as many as 45% of the senses in the class co-occur within a document. I will discuss the implications of these results for the task of word-sense tagging and for the way in which senses should be represented. |
| 25 Nov 03 | Hang Li (MSR Beijing) |
Using Bilingual Data to Mine and Rank Translations
Time: 10:30 am - 12:00 pm Location: 11th Floor Large Abstract: In this talk, I will introduce some of the technologies which we have developed in the project on an English reading assistant system called English Reading Wizard. The technologies include a method for mining translations from the web (non-parallel corpora), a method for word translation disambiguation based on bootstrapping, which is called Bilingual Bootstrapping, and a general method of bootstrapping, which is called Collaborative Bootstrapping. First, I will introduce the main features of English Reading Wizard. Next, I will introduce each of the methods. The translation mining method is based on a naïve Bayesian ensemble and the EM algorithm. Bilingual Bootstrapping uses the asymmetric translation relationship between words in the two languages in translation and can construct reliable classifiers for word translation disambiguation. Collaborative Bootstrapping contains the co-training algorithm as its special case, and it uses the strategy of uncertainty reduction in training of the two classifiers. Bio: Hang Li is a researcher at the Natural Language Computing Group of Microsoft Research in Beijing, China. He is also an adjunct professor at Xian Jiaotong University. Hang Li obtained a B.S. in Electrical Engineering from Kyoto University (Japan) in 1988 and an M.S. in Computer Science from Kyoto University in 1990. He earned his Ph.D. in Computer Science from the University of Tokyo in 1998. From 1990 to 2001, Hang Li worked at the Research Laboratories of NEC Corporation in Kawasaki, Japan. He joined Microsoft Research in 2001. His research interests include statistical learning, natural language processing, data mining, and information retrieval. Hang Li's web site: http://research.microsoft.com/users/hangli/
|
| 17 Nov 03 | Dr. Kato and Dr. Fukumoto (NTCIR) |
An Overview of the QA Challenge + NTCIR -- The Way Ahead
Time: 10:30 am - 12:00 pm Location: 4th Floor Abstract: An Overview of Question Answering Challenge (Jun'ichi Fukumoto and Tsuneaki Kato): In this talk, we will present an overview of the Question Answering Challenge (QAC), which is the question answering task of the NTCIR Workshop. QAC-1 (the first evaluation of QAC) was carried out at NTCIR Workshop 3 in October 2002, and QAC-2 will be at NTCIR Workshop 4 in December 2003. In the QAC, systems to be evaluated are expected to return exact answers consisting of a noun or noun compound denoting, for example, the names of persons, organizations, or various artifacts, or numerical expressions such as money, size, or date. Those basically range over the Named Entity (NE) elements of MUC and IREX but are not limited to them. QAC consists of three kinds of subtasks: Task 1, where the systems are allowed to return five ranked candidate answers; Task 2, where the systems are required to return a complete list of answers; and Task 3, where the systems are required to answer series of questions that contain anaphora and zero-anaphora. We will present the results of QAC-1 and our vision and prospects for QAC-2. NTCIR -- the Way Ahead (Noriko Kando): Dr. Noriko Kando is the leader of the NTCIR (Test Collections and Evaluation of IR, Text Summarization, Q&A, etc.) project, and an associate professor at the National Institute of Informatics (NII). She received her Ph.D. in 1995 from Keio University. Her research interests include evaluation of information retrieval systems, technologies to "Make Information Usable for Users", cross-lingual information retrieval, and analysis of text structure, genre, citations and links. She is a member of the editorial boards of International Journal on Information Processing and Management, ACM Transactions on Asian Language Information Processing, etc. Jun'ichi Fukumoto and Tsuneaki Kato are task organizers of QAC. Dr. Jun'ichi Fukumoto is an associate professor at Ritsumeikan University. He received his Ph.D. in 1999 from the University of Manchester Institute of Science and Technology. His research interests include Q&A, automatic summarization, and dialogue processing. Dr. Tsuneaki Kato is an associate professor at the University of Tokyo. He received his Dr. of Engineering in 1995 from Tokyo Institute of Technology. His research interests include multimodal dialogue processing, multimodal presentation generation, and domain-independent question answering. He is a member of the editorial committee of the Transactions on Information and Systems of the Institute of Electronics, Information and Communication Engineers.
|
| 27 Oct 03 | Christopher Manning (Stanford) |
Natural Language Parsing: Graphs, the A* Algorithm, and Modularity
Time: 10:00 am - 11:00 am Location: 11 Large Abstract: Probabilistic parsing methods have in recent years transformed our ability to robustly find correct parses for open domain sentences. Much of this work has been within a common architecture of heuristic search for good parses in lexicalized probabilistic context-free grammars, with many layers of back-off to avoid problems of sparse data. In this talk, I will outline some different ideas that we have been pursuing. I will connect stochastic parsing with finding shortest paths in hypergraphs, and show how this approach naturally provides a chart parser for arbitrary probabilistic context-free grammars (finding shortest paths in a hypergraph is easy; the central problem of parsing is that the hypergraph has to be constructed on the fly). From this viewpoint, a natural approach is to use the A* algorithm to cut down the work in finding the best parse. On unlexicalized grammars, this can reduce the parsing work done dramatically, by at least 97%. This approach is competitive with methods standardly used in statistical parsers, while ensuring optimality, unlike most heuristic approaches to best-first parsing. Finally, I will present a novel modular generative model in which semantic (lexical dependency) and syntactic structures are scored separately. This factored model is conceptually simple, linguistically interesting, admits exact inference with an extremely effective A* algorithm, and provides straightforward opportunities for separately improving the component models. In particular, I will mention some of the work we have done focusing on the PCFG component to produce a very high accuracy unlexicalized grammar. This is joint work with Dan Klein. About the Speaker: Christopher Manning is an Assistant Professor of Computer Science and Linguistics at Stanford University. He received his Ph.D. from Stanford University in 1995, and served on the faculty of the Computational Linguistics Program at Carnegie Mellon University (1994-1996) and the University of Sydney Linguistics Department (1996-1999) before returning to Stanford. His research interests include probabilistic models of language, natural language parsing, constraint-based linguistic theories, syntactic typology, information extraction and text mining, and computational lexicography. He is the author of three books, including Foundations of Statistical Natural Language Processing (MIT Press, 1999, with Hinrich Schuetze). Chris' schedule is available in Postscript or PDF format. |
| 17 Oct 03 | Hovy, Marcu, Knight, Byrd, Narayanan, Traum, Gordon |
Introduction to CL Research
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The annual Computational Linguistics Open House will be held at USC's Information Sciences Institute from 3:00-4:30pm in the 11th floor Conference Room. Researchers from ISI, including Eduard Hovy, Daniel Marcu, and Kevin Knight will present overviews of their latest research. We will also hear about the research activities of Dani Byrd of the Linguistics Department, Shri Narayanan's group in EE, and David Traum and Andrew Gordon of USC's Institute for Creative Technologies.
|
| 10 Oct 03 | Philipp Koehn |
Advances in Statistical MT: Phrases, Noun Phrases and Beyond
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: (This is a practice run for a talk I will give a few times over the next weeks when interviewing for job positions.) I will review the state of the art in statistical machine translation (SMT), present my dissertation work, and sketch out the research challenges of syntactically structured statistical machine translation. The currently best methods in SMT build on the translation of phrases (any sequences of words) instead of single words. Phrase translation pairs are automatically learned from parallel corpora. While SMT systems generate translation output that often conveys a lot of the meaning of the original text, it is frequently ungrammatical and incoherent. The research challenge at this point is to introduce syntactic knowledge to the state of the art in order to improve translation quality. My approach breaks up the translation process along linguistic lines. I will present my thesis work on noun phrase translation and ideas about clause structure.
|
| 03 Oct 03 | Anton Leuski |
A Year in Paradise
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I would like to talk about some of the things I did during the last year. I will discuss and demonstrate CuSTaRD, a cross-lingual information retrieval, organization, summarization, and visualization system that was built for the Surprise Language exercise. I will focus in more detail on iNeATS, the interactive multi-document summarization part of CuSTaRD. The other project I plan to present is eArchivarius, a system for accessing collections of electronic mail.
|
| 02 Oct 03 | Ana-Maria Popescu |
TBA
Time: 4:00 pm - 5:00 pm Location: 11 Large Abstract: |
| 15 Sep 03 | Beata Klebanov |
Analyzing Sentences into Facts: Simple is Beautiful
Time: 2:30 pm - 4:00 pm Location: 11 Large Abstract: I present my summer project - writing rule-based software for simplifying texts. Task definition and motivations will be discussed, as well as human and automatic evaluation, the latter using a question answering system. This is joint work with Daniel Marcu and Kevin Knight.
Slides: klebanov_facts.ppt |
| 12 Sep 03 | Lara Taylor |
Discourse Coherence for Ordering Information
Time: 2:30 pm - 4:00 pm Location: 11 Large Abstract: In this talk, I look at how the notion of discourse coherence can be modeled computationally. I begin with the following idea: if you take a text and shuffle its sentences into a random order, that text will no longer make sense. In other words, the text will be "incoherent". Our task is to learn how to reassemble a shuffled text into an order that humans would consider to be coherent. I discuss practical and theoretical motivations for the task, evaluations of our model, increases in performance achieved over the summer, and directions for future research. This work was done in collaboration with Kevin Knight, Daniel Marcu, Jonathan Graehl and Nick Mote.
Slides: taylor_ordering.ppt |
| 05 Sep 03 | Nishit Rathod and Anish Nair |
Deciphering Hindi Scripts
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: A major hurdle in building automated information retrieval systems for Hindi text is the lack of a uniform encoding for text representation. Standards do exist, but no one seems interested. Every web content publisher seems to have their own encoding system, making information extraction a nightmare. We explore an unsupervised approach to convert any given "unknown" encoding to UTF-8, by treating it as a decipherment problem. We also study how a small amount of supervision can improve decoding accuracy.
|
| 03 Sep 03 | Alex Fraser and Franz Och |
JHU MT Workshop
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We will present the results of the 2003 Johns Hopkins University Summer Workshop on "Syntax for Statistical Machine Translation". We will describe a large effort to extend a high-performing phrase-based MT system as baseline by adding new features representing syntactic knowledge that deal with specific problems of the underlying baseline. We investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions test if a certain constituent occurs in the source and the target language parse tree. More sophisticated features will be derived from an alignment model where whole sub-trees in source and target can be aligned node by node. We present results on the Chinese-English large data track of the recent TIDES MT evaluations. This is joint work with the other workshop team members: Daniel Gildea, Anoop Sarkar, Sanjeev Khudanpur, Kenji Yamada, Libin Shen, Shankar Kumar, David Smith, Viran Jain, Katherine Eng, Jin Zhen and Dragomir Radev. See http://www.clsp.jhu.edu/ws03/groups/translate/ for more.
Slides: fraser_mt.pdf.bz2 |
| 29 Aug 03 | Stefan Riezler |
Deepening Representations
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 27 Aug 03 | Michel Galley and Mark Hopkins |
Syntax for Statistical MT
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 22 Aug 03 | Satoshi Sekine |
Information Extraction, IR and QA
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 15 Aug 03 | Beata Klebanov |
On Her Masters Research
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 01 Aug 03 | Shou-de Lin |
Toward deciphering the 2-dimensional ancient Luwian script by discovering its writing order
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 29 Jul 03 | Michael Brasser |
A Model of Word Movement for Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11 Small Abstract: |
| 25 Jul 03 | Jonathan Graehl and Kevin Knight |
Super-Carmel for Trees
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 18 Jul 03 | Doug Oard |
A Maryland Yankee in King Eduard's Court: Some Remarks on a Year in Paradise
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 27 Jun 03 | Michael Fleischman |
Offline Strategies for Online Question Answering: Answering Questions Before They Are Asked and Maximum Entropy Models for FrameNet Classification
Time: 3:00 pm - 4:00 pm Location: 10 Large Abstract: |
| 12 Jun 03 | Dina Demner-Fushman |
Measuring the Effect of Dictionary Coverage on Cross-Language Retrieval
Time: 11:00 am - 12:00 pm Location: 11 Large Abstract: Bilingual term lists have proven to be a useful basis for dictionary-based Cross-Language Information Retrieval (CLIR), but there is ample anecdotal evidence that differences in vocabulary coverage can have a substantial impact on retrieval effectiveness. This issue has recently been explored using ablation studies in which progressively smaller term lists were synthesized using sampling techniques. The ablation techniques used in those studies have not, however, been validated using real term lists. In this talk I will report the results of what we believe is the first large coverage study to use naturally occurring term lists. Thirty-five bilingual term lists were obtained from a variety of sources, each with English as one of the two paired languages. From these, we created 35 English-to-English term lists by taking each term that was present in the English side of the list as its own translation. When used with an English information retrieval test collection, this allowed us to measure the reduction in retrieval effectiveness that could be attributed to deficiencies in the coverage of English terms. Eight types of untranslatable terms were identified in a collection of news stories, of which named entities were found to have the greatest impact on retrieval effectiveness. Differences in named entity coverage were found to produce large differences in retrieval effectiveness for term lists of similar sizes. Controlling for named entity effects yielded a clear relationship between retrieval effectiveness and the size of the translatable English vocabulary. The functional dependence that we observed is consistent with one previously applied ablation technique and inconsistent with another. Our results indicate that the outcome of a widely cited landmark study of query expansion effects for CLIR was likely affected by a flawed ablation model. 
We conclude our talk with a suggestion for further work on that topic, and a simple prescription for avoiding such problems in the future.
|
| 23 May 03 | Liang Zhou |
A Web-Trained Extraction Summarization System and Headline Summarization at ISI
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: 1) A serious bottleneck in the development of trainable text summarization systems is the shortage of training data. Constructing such data is a very tedious task, especially because there are in general many different correct ways to summarize a text. Fortunately we can utilize the Internet as a source of suitable training data. In this paper, we present a summarization system that uses the web as the source of training data. The procedure involves structuring the articles downloaded from various websites, building adequate corpora of (summary, text) and (extract, text) pairs, training on positive and negative data, and automatically learning to perform the task of extraction-based summarization systems. 2) Headlines are useful for users who only need information on the main topics of a story. We present a headline summarization system that is built at ISI for this purpose and is a top performer for DUC2003's task 1, generating very short summaries (10 words or less).
|
| 20 May 03 | Michel Galley |
Discourse Segmentation of Multi-Party Conversation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 16 May 03 | Chin-Yew Lin |
Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
|
| 09 May 03 | Doug Oard |
Coping with Surprise: The Case of Cebuano
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: For ten days in March, nine research teams worked together to build Cebuano language resources and systems for a "dry run" of the TIDES Surprise Language experiment. Cebuano is spoken widely in the southern Philippines, but there had previously been little work on computational linguistics for that language. As we prepare for the actual Surprise Language experiment this June, we will use this talk to look back on what worked, what didn't, and what lessons there are to be learned from our experience in March. Come prepared to share the excitement, offer your ideas, and understand why we have tried to ask Ed to cancel all vacations during the month of June (just kidding...).
|
| 02 May 03 | Hal Daumé III |
Acquiring Paraphrase Templates from Document/Abstract Pairs
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present an approach to automatically extracting paraphrase templates from document/abstract pairs. This methodology relies on word-based alignments created by off-the-shelf software. Our paraphrases are evaluated by human evaluators for precision and automatically for applicability. We find that 77% of the extracted paraphrases are judged to be always correct and that the generalized templates of 60% are judged to be applicable most of the time and 87% are judged to be applicable sometimes.
|
| 25 Apr 03 | Quamrul Tipu |
Statistical MT with Bilingual Morphology
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Traditional statistical MT systems mostly work on the word and phrase level. For different language pairs, the performance of such systems varies from some 15% to 35%. These systems suffer from problems such as sparse data, with huge vocabulary sizes leading to less reliable probability estimates. In our current research, we aim to come up with a better MT system by looking inside the words. In almost every language, a root (stem) can have many different forms (inflectional, derivational, etc.). If we can identify the roots, the size of the vocabulary will be quite small, and we can have better probability estimates, reducing the sparse data problem and potentially leading to higher accuracy. We are trying to come up with a model that induces morphology automatically from a bilingual corpus and achieves this improvement.
|
| 04 Apr 03 | Donghui Feng |
Natural Language Understanding in MRE
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: In this talk, I will present my current work on language understanding in the Mission Rehearsal Exercise (MRE) project. One of the challenges in a dialogue system is to provide a robust understanding/parsing component. We applied both a Finite State Model and a Statistical Learning Model to the parsing of separate sentences of dialogue utterances. Their performance is evaluated and compared on a new blind set. We hope to combine them into a better solution for this specific application.
|
| 21 Mar 03 | Gareth Jones |
An Investigation of the Application of Broad Coverage Automatic Pronoun Resolution in Information Retrieval
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Term weighting methods have been shown to give significant increases in information retrieval performance. Term weights are typically calculated using frequency counts across the whole retrieval collection, frequency of each term within individual documents and compensation for varying document length. The presence of pronominal references in documents effectively reduces the within-document term frequency of associated words, with a consequent effect on term weights and information retrieval behaviour. This presentation will describe an experimental investigation into the impact on information retrieval performance of broad coverage automatic pronoun resolution. Results using a standard information retrieval test collection indicate that calculating term weights using a pronoun-resolved version of the document test collection can improve both fixed cutoff and average retrieval precision.
|
| 14 Mar 03 | Kareem Darwish |
Improving the Efficiency and Effectiveness of Structured Query Methods
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: One of the key challenges in retrieval is what to do when a query term needs to be replaced with more than one term. This problem arises in applications such as cross-language information retrieval and thesaurus expansion. One solution is to use structured query methods, which treat all the possible replacements as if they were one query term by computing a joint document frequency and a joint term frequency. This presentation will review prior work on structured query techniques and then introduce three new variants that aim to improve computational efficiency and to leverage estimates of replacement probabilities to improve retrieval effectiveness. The methods have now been tested in cross-language retrieval and OCR-degraded text retrieval applications in which replacement probability estimates could be obtained. In both applications, the new structured query methods showed statistically significant improvements in retrieval effectiveness over previously known structured query methods.
|
| 07 Mar 03 | Scott Klemmer |
Books with Voices: Paper Transcripts as a Tangible Interface to Oral Histories
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Our contextual inquiry into the practices of oral historians unearthed a curious incongruity. While oral historians consider interview recordings a central historical artifact, these recordings sit unused after a written transcript is produced. We hypothesized that this is largely because books are more usable than recordings. Therefore, we created Books with Voices: bar-code augmented paper transcripts enabling fast, random access to digital video interviews on a PDA. We present quantitative results of an evaluation of this tangible interface with 13 participants. They found this lightweight, structured access to original recordings to offer substantial benefits with minimal overhead. Oral historians found a level of emotion in the video not available in the printed transcript. The video also helped readers clarify the text and observe nonverbal cues. |
| 28 Feb 03 | Radu Soricut |
Sentence Level Discourse Parsing using Syntactic and Lexical Information
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We introduce two probabilistic models that can be used to identify elementary discourse units and build sentence-level discourse parse trees. The models use syntactic and lexical features. A discourse parsing algorithm that implements these models derives discourse parse trees with an error reduction of 18.8% over a state-of-the-art decision-based discourse parser. A set of empirical evaluations shows that our discourse parsing model is sophisticated enough to yield discourse trees at an accuracy level that matches near-human levels of performance.
|
| 21 Feb 03 | Nate Chambers |
Statistical Language Generation in a Dialogue System
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The large corpora of written text available to the language community have largely been utilized for language understanding; they have been somewhat ignored in the context of language generation. Recent developments in stochastic generation have allowed such systems to shift the burden from hand-crafted databases (lexicons, grammars, ontologies) to the knowledge implicitly found in written text. However, when building a dialogue system, generation is largely interactive, very different from the written structure of most corpora. In this talk, I will discuss my recent work on applying a stochastic generator, HALogen, and its newswire language model to a dialogue system, TRIPS. I'll describe the difficulties in mapping the TRIPS semantic form into HALogen's representation, the critical differences between newswire and dialogue, and the possibility of using HALogen and a large newswire model as a domain-independent generator.
|
| 07 Feb 03 | Jeongwon Cha |
Automatic Pattern Learning for Information Extraction using Web Data
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will give a status report on my work on information extraction during the last 10 months. The motivation of this work is to learn extraction patterns automatically using a seed template and a web search engine. My approach is to generate linguistic patterns and surface patterns and combine them to compensate for the respective weaknesses of the two pattern types. On DUC01-test-disasters (67 documents) and DUC01-training-disasters (54 documents), I obtained F-measures of 0.34 and 0.26, respectively. In this talk, I will give a status report on the ReAD project (with Dr. Chin-Yew Lin).
|
| 31 Jan 03 | Philipp Koehn |
Noun Phrase Translation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will give a status report on my current thesis work on noun phrase translation. The motivation of this work is to break up the machine translation problem into smaller, more manageable units. The treatment of noun phrase translation as a subtask of machine translation is both linguistically and empirically motivated. My approach is to generate an n-best list of candidate translations with a statistical machine translation system and rerank the candidates with additional features. For about 90% of all noun phrases we can find an acceptable translation in the 100-best list, while an acceptable translation comes out on the very top for only about 60% of the noun phrases. I will discuss a variety of linguistic and empirical features that (may) help to move the acceptable translations higher in the list. I will also present results on modeling issues such as phrase-based translation and compound splitting. This talk is also intended as a fishing expedition for feature suggestions by the audience.
|
| 24 Jan 03 | Doug Oard & Anton Leuski |
Access to Archival Collections of Electronic Mail
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Since its inception more than 30 years ago, electronic mail (email) has developed into a powerful communication medium with applications that extend well beyond simple asynchronous message exchange between individuals. Automated tools to support the use of email in individual, organizational and social contexts have received increasing attention in recent years. Among the tasks that are now supported are filtering (e.g., spam detection), aggregation (e.g., mailing list digests), workflow management (e.g., help desk routing), and reuse (e.g., retrospective search). We are interested in how today's email will be used in the future -- some will certainly be preserved (indeed, some MUST be preserved!), and those records will serve as powerful evidence of how we lived our lives and organized our societies. The challenges of managing many types of electronic record collections are receiving increasing attention, but we are not aware of any work yet on supporting access to electronic mail archives. That will be the focus of this talk. We will introduce the Open Archival Information Systems (OAIS) model, and then focus on two key processes: ingestion and access. Our focus in ingestion is on support for review and redaction, which we believe will be key enablers to acquisition and near-term access. For access, we will address both browsing based on provenance (original order) and user-guided reorganization based on search and visualization. Along the way, we will identify potentially productive opportunities to apply natural language processing technologies such as topic segmentation, link detection, and summarization. We will then describe two test collections, and demonstrate a system that we have developed to explore user-guided reorganization through visualization for one of those collections. We will conclude the talk by sketching out a research agenda. 
At that point, we will expect suggestions and comments from the audience. Knowing this audience, it is unlikely that we will need to wait that long :-).
|
