BEGIN:VCALENDAR
CALSCALE:GREGORIAN
PRODID:-//NL//Seminar Calendar//EN
VERSION:2.0
X-WR-CALNAME:NL
BEGIN:VEVENT
DESCRIPTION:
DTEND:20030801T160000
DTSTART:20030801T150000
LOCATION:11 Large
SUMMARY:Toward deciphering the 2-dimensional ancient Luwian script by discovering its writing order [Shou-de Lin]
UID:20030801T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The presentation will give an overview of the SMT activities at the
 Language Technologies Institute, Carnegie Mellon University, in large
 vocabulary text translation tasks, esp. the Chinese-English and
 Arabic-English, as well as in limited domain speech-to-speech translation
 tasks.  The CMU SMT system is, like most modern statistical MT systems,
 based on phrase translation.  Several approaches have been developed to
 extract the phrase pairs from parallel corpora and current research
 investigates different scoring approaches for these translation pairs.
 Details of the decoder, esp. on hypothesis recombination, pruning, and
 efficient n-best list generation will be given.  Recently, the SMT system
 has been extended to use partial translations generated from example-based
 and grammar-based translation systems, thereby performing multi-engine
 machine translation.
 
 Bio:
 
 Stephan Vogel is a researcher at the Language Technologies Institute,
 Carnegie Mellon University, where he heads the statistical machine
 translation team.  He received a Diploma in Physics from Philipps
 University Marburg, Germany, and a Master of Philosophy from the
 University of Cambridge, England.  After working for a number of years on
 the history of science, he turned to computer science, especially natural
 language processing.  Before coming to CMU, he worked for several years at
 the Technical University of Aachen on statistical machine translation, and
 also in the Interactive Systems Lab at the University of Karlsruhe.
 

DTEND:20040402T160000
DTSTART:20040402T150000
LOCATION:11 Large
SUMMARY:The CMU Statistical Machine Translation System [Stephan Vogel]
UID:20040402T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will present work that extends the standard hidden Markov model to a
 version that can emit multiple symbols in a single time step.  Using this
 model, we are able to automatically create phrase-to-phrase mappings in an
 alignment process.  I've applied this model to the task of creating
 alignments between documents and their human-written abstracts, yielding
 an overall alignment F-score of 0.548, a significant improvement on the
 best results to date of 0.363.  These results are published in an EMNLP
 paper this year, but the talk will be an extended version of the talk I
 will give there (namely, I will discuss the mechanics of the extended HMM
 in more detail in this seminar).
 

DTEND:20040702T150000
DTSTART:20040702T133000
LOCATION:11 Large
SUMMARY:A Phrase-Based HMM Approach to Document/Abstract Alignment [Hal Daume III]
UID:20040702T133000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We present an approach to automatically extracting paraphrase templates
 from document/abstract pairs. This methodology relies on word-based
 alignments created by off-the-shelf software. Our paraphrases are
 evaluated by human evaluators for precision and automatically for
 applicability. We find that 77% of the extracted paraphrases are judged
 to be always correct and that the generalized templates of 60% are
 judged to be applicable most of the time and 87% are judged to be
 applicable sometimes.
 

DTEND:20030502T160000
DTSTART:20030502T150000
LOCATION:11 Large
SUMMARY:Acquiring Paraphrase Templates from Document/Abstract Pairs [Hal Daumé III]
UID:20030502T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND:20031002T170000
DTSTART:20031002T160000
LOCATION:11 Large
SUMMARY:TBA [Ana-Maria Popescu]
UID:20031002T160000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Automatic word alignment plays a critical role in statistical machine
 translation. Unfortunately the relationship between alignment quality and
 statistical machine translation performance has not been well understood.
 In the recent literature the alignment task has frequently been decoupled
 from the translation task, and assumptions have been made about measuring
 alignment quality for machine translation which, it turns out, are not
 justified. In particular, none of the tens of papers published over the
 last five years has shown that significant decreases in Alignment Error
 Rate (AER) result in significant increases in translation quality. I will
 explain this state of affairs and present steps towards measuring
 alignment quality in a way which is predictive of statistical machine
 translation quality.
 
 I will also provide a brief overview of some of my other work on training
 and search for word alignment.
 

DTEND:20060203T163000
DTSTART:20060203T150000
LOCATION:11 Large
SUMMARY:Measuring Word Alignment Quality for Statistical Machine Translation [Alex Fraser]
UID:20060203T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: (note: this is a very tentative title -- comments welcome!)
 
 We present a novel extension of syntax-directed translation for
 statistical MT. Formally speaking, our model is based on tree-to-string
 transducers that recursively convert a parse-tree in the source-language
 into a string in the target-language. These transduction rules have
 multi-level trees on the source-side, giving this system more
 transformational power due to the extended domain of locality. We also
 present efficient algorithms for decoding based on dynamic programming.
 Initial experiments on English-to-Chinese translation show promising
 results in both speed and translation quality.
 
 Joint work with Kevin Knight and Aravind Joshi.
 
 Bio:
 
 Liang Huang is a 3rd-year PhD student from the University of Pennsylvania.
 He is mainly interested in algorithms and formalisms for parsing and
 syntax-based machine translation. His recent work has been on k-best
 parsing algorithms (with David Chiang) and synchronous binarization for MT
 (with Hao Zhang, Dan Gildea, and Kevin Knight).

DTEND:20060303T163000
DTSTART:20060303T150000
LOCATION:11th Floor (Large)
SUMMARY:Syntax-Directed Translation with Extended Domain of Locality [Liang Huang (Penn)]
UID:20060303T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I would like to talk about some of the things I did during the last 
 year. I will discuss and demonstrate CuSTaRD, a cross-lingual 
 information retrieval, organization, summarization, and visualization 
 system that was built for the Surprise Language exercise. I will focus 
 in more detail on iNeATS, the interactive multi-document summarization 
 part of CuSTaRD. The other project I plan to present is eArchivarius, a 
 system for accessing collections of electronic mail.
 

DTEND:20031003T160000
DTSTART:20031003T150000
LOCATION:11 Large
SUMMARY:A Year in Paradise [Anton Leuski]
UID:20031003T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We will present the results of the 2003 Johns Hopkins University
 Summer Workshop on "Syntax for Statistical Machine Translation".
 
 We will describe a large effort to extend a high-performing
 phrase-based MT system as baseline by adding new features representing
 syntactic knowledge that deal with specific problems of the underlying
 baseline. We investigate a broad range of possible feature functions,
 from very simple binary features to sophisticated tree-to-tree
 translation models. Simple feature functions test if a certain
 constituent occurs in the source and the target language parse
 tree. More sophisticated features will be derived from an alignment
 model where whole sub-trees in source and target can be aligned node
 by node. We present results on the Chinese-English large data track of
 the recent TIDES MT evaluations.
 
 This is joint work with the other workshop team members: Daniel
 Gildea, Anoop Sarkar, Sanjeev Khudanpur, Kenji Yamada, Libin Shen,
 Shankar Kumar, David Smith, Viran Jain, Katherine Eng, Jin Zhen and
 Dragomir Radev.
 
 See http://www.clsp.jhu.edu/ws03/groups/translate/ for more.
 

DTEND:20030903T160000
DTSTART:20030903T150000
LOCATION:11 Large
SUMMARY:JHU MT Workshop [Alex Fraser and Franz Och]
UID:20030903T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I will present my current work on language understanding
 in the Mission Rehearsal Exercise (MRE) project. One of the challenges
 in a dialogue system is to provide a robust understanding/parsing
 component. We applied both a Finite State Model and a Statistical
 Learning Model to parse individual sentences of dialogue utterances.
 Their performance is evaluated and compared on a new blind set,
 and we hope to combine them into a better solution for this
 specific application.
 

DTEND:20030404T160000
DTSTART:20030404T150000
LOCATION:11 Large
SUMMARY:Natural Language Understanding in MRE [Donghui Feng]
UID:20030404T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Test collections for information retrieval tasks have traditionally
 assumed that what we are searching for are documents (e.g., Web pages,
 news stories, or academic documents).  Most information that is generated
 is, however, not originally generated as part of a document, but rather
 as what we might refer to as "conversational media" (e.g., email, speech,
 or instant messaging).  In this talk, I'll describe the creation of two
 test collections for conversational media, an email collection being
 created in the TREC Enterprise Search track and a spoken word test
 collection for the Cross-Language Evaluation Forum (CLEF).  I'll spend
 most of the talk describing the details of the CLEF test collection,
 illustrating the issues with some of the results that we have obtained
 from our experiments with that collection.  I'll conclude with a few
 remarks about the implications of what we are learning for DARPA's new
 GALE program.  This is joint work with Charles University, the IBM TJ
 Watson Research Center, the Johns Hopkins University, the Survivors of the
 Shoah Visual History Foundation, and the University of West Bohemia.
 
 
 About the speaker:
 
 Douglas Oard is an Associate Professor at the University of Maryland,
 College Park, with a joint appointment in the College of Information
 Studies and the Institute for Advanced Computer Studies.  He holds a Ph.D.
 in Electrical Engineering from the University of Maryland, and his
 research interests center around the use of emerging technologies to
 support information seeking by end users.  In 2002 and 2003, Doug spent a
 year in paradise here at USC-ISI.  His recent work has focused on
 interactive techniques for cross-language information retrieval and on
 searching conversational text and speech.  Additional information is
 available at http://www.glue.umd.edu/~oard/.

DTEND:20050805T163000
DTSTART:20050805T150000
LOCATION:11 Large
SUMMARY:The CLEF Cross-Language Speech Retrieval Test Collection [Doug Oard (Maryland)]
UID:20050805T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: (This talk occurs in the morning on the same day as the Bayesian tutorial.)
 
 The goal of our research is to support cooperative work performed by
 stakeholders sitting around a table. To support such cooperation, various
 table-based systems with a shared electronic display on the tabletop have
 been developed. These systems, however, share a common problem:
 stakeholders cannot all read shared information such as text and images
 equally well, because some viewing angles are unfavorable. To solve this
 problem, we propose the Lumisight Table, a system that simultaneously
 displays personalized information in each required direction on a single
 horizontal screen by multiplexing the views, and that captures
 stakeholders' gestures to manipulate the information.
 
 About the Speaker:
 
 Mitsunori Matsushita is a research scientist at NTT Communication Science
 Labs., Nippon Telegraph and Telephone Corporation (NTT). He received B.E.,
 M.E., and Dr.E. degrees from Osaka University in 1993, 1995, and 2003,
 respectively. In 1995, he joined NTT, and has been engaged in research
 on natural language understanding, information visualization, and
 interaction design.
 

DTEND:20050622T240000
DTSTART:20050622T110000
LOCATION:11 Large
SUMMARY:Lumisight Table: A Face-to-face Collaboration Support System That Optimizes Direction of Projected Information to Each Stakeholder [Mitsunori Matsushita]
UID:20050622T110000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I present our approach to identifying an argument structure, defined as a
 simple hierarchical structure of claim and reasons.  The claim is also
 classified as "in favor of" or "against" the topic. The experiment is
 performed on the comments from the general public sent to government
 officials in response to proposed regulations.
 
 

DTEND:20060505T163000
DTSTART:20060505T150000
LOCATION:11 Large
SUMMARY:Recognizing Argument Structures in Texts [Namhee Kwon]
UID:20060505T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The ABC (Assess by Computer) system has been developed and used in the
 School of Computer Science at the University of Manchester for formative
 and (principally) summative assessment at undergraduate and postgraduate
 level. We believe that fully automatic marking of constructed answers -
 especially free text answers - is not a sensible aim. Instead - drawing on
 parallels in the history of machine translation - we take a
 "human-computer collaborative" approach, in which the system does what it
 can to support the efficiency and consistency of the human marker, who
 keeps the final judgement.
 
 Our current work focuses on what are generally referred to as "short text
 answers" as contrasted to "essays". However we prefer to contrast
 "factual" with "discursive" answers, and speculate that the former may be
 amenable to simple statistical techniques, while the latter require more
 sophisticated natural language analysis. I will show some examples of real
 exam data and the techniques we are using and developing to handle them.
 

DTEND:20041105T163000
DTSTART:20041105T150000
LOCATION:11 Large
SUMMARY:A Human-Computer Collaborative Approach to Computer Aided Assessment [Mary Wood (Manchester)]
UID:20041105T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: A major hurdle in building automated information retrieval systems for
 Hindi text is the lack of a uniform encoding for text representation.
 Standards do exist, but no one seems interested. Every web content
 publisher seems to have their own encoding system, making information
 extraction a nightmare. We explore an unsupervised approach to
 convert any given "unknown" encoding to UTF-8, by treating it as a
 decipherment problem. We also study how a small amount of supervision
 can improve decoding accuracy.
 

DTEND:20030905T160000
DTSTART:20030905T150000
LOCATION:11 Large
SUMMARY:Deciphering Hindi Scripts [Nishit Rathod and Anish Nair]
UID:20030905T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Information retrieval using word senses is emerging as a good research
 challenge on semantic information retrieval. In this presentation, I am
 going to propose a new method using word senses in information retrieval:
 the root sense tagging method. This method assigns coarse-grained word
 senses defined in WordNet to query terms and document terms in an
 unsupervised way, using automatically constructed co-occurrence
 information. The sense
 tagger is crude, but performs consistent disambiguation by considering
 only the single most informative word as evidence to disambiguate the
 target word. We also allow multiple-sense assignment to alleviate the
 problem caused by incorrect disambiguation.
 
 Experimental results on a large-scale TREC collection show that the
 proposed approach successfully improves retrieval effectiveness, while
 most previous work failed to improve performance even on small
 text collections. The proposed method also shows promising results when
 combined with pseudo relevance feedback and a state-of-the-art retrieval
 function such as BM25.
 

DTEND:20040806T163000
DTSTART:20040806T150000
LOCATION:11 Large
SUMMARY:Information Retrieval using Word Senses: Root Sense Tagging Approach [Hae-Chang Rim]
UID:20040806T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We propose a theory that gives formal semantics to word-level
 alignments defined over parallel corpora. We use our theory to
 introduce a linear algorithm that can be used to derive from
 word-aligned, parallel corpora the minimal set of syntactically
 motivated transformation rules that explain human translation data.
 
 (joint work with Michel Galley, Kevin Knight, and Daniel Marcu)
 

DTEND:20040206T160000
DTSTART:20040206T150000
LOCATION:11 Large
SUMMARY:What's in a Translation Rule? [Mark Hopkins]
UID:20040206T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Automatic Natural Language applications often require the processing of
 structured data. Traditional machine learning approaches attempt to
 represent structured syntactic/semantic objects by means of flat feature
 representations, i.e. attribute-value vectors. This raises two problems:
 
 1. There is no well defined theoretical motivation for such a feature model.
 Structural properties may not fit in any flat feature representation.
 
 2. To define effective flat features, a deep knowledge about the
 linguistic phenomenon is required.
 
 Kernel methods for Natural Language Processing aim to solve both the above
 problems as kernel functions can be used to define similarities between
 linguistic objects without explicitly defining the target feature space.  
 In this way, a linguistic phenomenon can be modeled at a more abstract
 level where the modeling is easier. Such a property is extremely useful when
 the representation of linguistic phenomena is still not well understood.
 For example, the feature design of semantic role labeling appears to be
 quite complex, since several non-definitive feature sets have been
 proposed.
 
 As a viable alternative to manual feature design, kernel methods propose
 two steps: (1) they generate all substructures of the target
 syntactic/semantic structures and (2) they let the learning algorithm
 (e.g. Support Vector Machines) select the most relevant substructures.
 In this talk, we (1) introduce the PropBank and FrameNet predicate
 argument structures, (2) present the standard approaches to the automatic
 labeling of semantic roles and (3) show advanced semantic role labeling
 models based on kernel methods.
 
 About the speaker:
 
 Alessandro Moschitti is a researcher at the Computer Science Department of
 the University of Rome "Tor Vergata". In 1998 he received his master's degree
 in Computer Science from the University of Rome "La Sapienza". In 2003 he
 finished his PhD in Computer Science at "Tor Vergata" University.  
 Between 2002 and 2004 he worked as an associate researcher at the
 University of Texas at Dallas. His research interests concern machine
 learning approaches for Natural Language Processing and Information
 Retrieval. His deep expertise relates to automated text categorization and
 semantic role labeling.  Recently, he has devised new kernels which enable
 Support Vector and other kernel-based machines to carry out advanced
 semantic processing.
 
 

DTEND:20050706T153000
DTSTART:20050706T140000
LOCATION:11 Large
SUMMARY:Kernel Methods for Semantic Role Labeling [Alessandro Moschitti (Rome)]
UID:20050706T140000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will give a status report on my information extraction work during the
 last 10 months. The motivation of this work is to learn extraction
 patterns automatically using seed templates and a web search engine. My
 approach is to generate linguistic patterns and surface patterns and
 combine them to compensate for the respective weaknesses of the two
 pattern types. On DUC01-test-disasters (67 documents) and
 DUC01-training-disasters (54 documents) I got F-measures of 0.34 and
 0.26, respectively. In this talk, I will give a status report on the ReAD
 project (with Dr. Chin-Yew Lin).
 

DTEND:20030207T160000
DTSTART:20030207T150000
LOCATION:11 Large
SUMMARY:Automatic Pattern Learning for Information Extraction using Web Data [Jeongwon Cha]
UID:20030207T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Text-to-text applications -- Machine Translation, Summarization, Question
 Answering -- do not usually involve generic Natural Language Generation
 (NLG) systems in their generation components, but rather use
 application-specific algorithms. The main reason for this state of affairs
 is that virtually all the formalisms used by current generic NLG systems
 require information that cannot be reliably extracted from unrestricted
 text.
 
 This thesis proposal is about meeting the demand for natural language
 generation in the context of text-to-text applications. I introduce a new
 representation formalism (WIDL-expressions), propose generation algorithms
 that operate on representations specific to this formalism, and discuss a
 generic sentence realization framework for text-to-text applications. The
 generation mechanism is based on algorithms for intersecting
 WIDL-expressions with probabilistic language models. I present both
 theoretical and empirical results concerning the correctness and
 efficiency of these algorithms. I also discuss the practical aspects
 arising from implementing this generation mechanism.
 
 In a concrete application of the proposed generation mechanisms, I present
 an end-to-end Machine Translation application. I also discuss another
 possible application for Automated Summarization, namely automated
 headline generation.
 

DTEND:20050707T163000
DTSTART:20050707T150000
LOCATION:11 Small
SUMMARY:Natural Language Generation for Text-to-Text Applications Using an Information-Slim Representation [Radu Soricut]
UID:20050707T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Our contextual inquiry into the practices of oral historians
 unearthed a curious incongruity. While oral historians consider interview
 recordings a central historical artifact, these recordings sit unused
 after a written transcript is produced. We hypothesized that this is
 largely because books are more usable than recordings. Therefore, we
 created Books with Voices: bar-code augmented paper transcripts enabling
 fast, random access to digital video interviews on a PDA. We present
 quantitative results of an evaluation of this tangible interface with 13
 participants. They found this lightweight, structured access to original
 recordings to offer substantial benefits with minimal overhead. Oral
 historians found a level of emotion in the video not available in the
 printed transcript. The video also helped readers clarify the text and
 observe nonverbal cues.
 
 http://guir.berkeley.edu/oral-history/
 

DTEND:20030307T160000
DTSTART:20030307T150000
LOCATION:11 Large
SUMMARY:Books with Voices: Paper Transcripts as a Tangible Interface to Oral Histories [Scott Klemmer]
UID:20030307T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
 

DTEND:20050408T163000
DTSTART:20050408T150000
LOCATION:11 Large
SUMMARY:Search Engines for HLT Applications [Jamie Callan (CMU)]
UID:20050408T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The Inversion Transduction Grammar (ITG) of Wu (1997) generates a
 synchronous parse tree for a given pair of sentences in two languages. By
 allowing inversion of the order of children at any level of the
 synchronous parse tree, ITG can do recursive, systematic word reordering.
 We made a version of ITG where the nonterminals are lexicalized by word
 pairs and the inversions are dependent on the so-lexicalized nonterminals.  
 We found that after lexicalization, the Alignment Error Rate (AER)
 against the gold standard is reduced for short sentences. ITG parsing
 complexity is a high-degree polynomial. We proposed a pruning technique that
 utilizes IBM Model 1 to estimate the inside and outside probability of a
 bitext cell. Taking a step further, we applied A* parsing, which has been
 used for monolingual parsing, to ITG.  I will talk about the heuristic
 estimates we used for A* parsing for Viterbi alignment selection and
 decoding.
 

DTEND:20050608T163000
DTSTART:20050608T150000
LOCATION:4th floor
SUMMARY:Lexicalization and A* Searching for Inversion Transduction Grammar [Hao Zhang (Rochester)]
UID:20050608T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: An interesting (disturbing?) new trend is beginning to manifest itself in
 NLP, one that is focused on performance and hence very attractive in the
 context of inter-system competitive evaluations such as TREC and DUC, but
 one that does not provide much insight about language or NLP methods to
 the researcher interested in these topics.  This addition of a new
 paradigm to NLP has implications for all of us.
 

DTEND:20040409T163000
DTSTART:20040409T150000
LOCATION:11 Large
SUMMARY:Three (and a half?) Trends: The Future of NLP [Eduard Hovy]
UID:20040409T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Justin Busch:
 Weight and Semantic Class Issues in Japanese Noun Phrase Ordering
 
 Many current designs for automatic parsers learn probabilities for the
 relative frequencies of parts-of-speech and syntactic rules, and this has
 proven to be generally reliable. In spite of the ubiquity of probabilistic
 techniques for parsing, however, little attention has been given to the
 linguistic significance of the probabilistic data and what it might say
 about human performance.
 
 Hawkins proposes a general theory of grammaticalization based on the
 minimization of syntactic domains. Given that a sentence of any language
 will contain at least one noun phrase, one verb, and possibly additional
 noun phrases and prepositional phrases, "minimize domains" suggests that
 these phrases will order themselves according to whichever pattern
 requires the least effort to recognize the higher syntactic structure of
 the sentence. These effects are directly measurable through corpus
 statistics, and can be interpreted as potential heuristics for
 probabilistic parsers.  In this study, we examine Japanese data from the
 Kyoto Treebank and test Hawkins' predictions for noun phrase ordering by
 noun phrase weight as well as by generic semantic types. The discussion
 will focus primarily on how accurately Hawkins' predictions are reflected
 in the corpus statistics, and will conclude with observations about how
 they might be applied to the decision mechanisms of probabilistic parsers.
 
 --------------------------------------------------------------------------
 
 Hai Huang:
 TBA
 
 --------------------------------------------------------------------------
 
 Jens Stephan:
 Evaluation and Visualization of a Dialogue System
 
 Evaluations have become a necessary standard for almost any type of
 research. However, there are many areas where there is no common agreement
 on how to evaluate, which is the case for the complex problem of evaluating
 dialogue systems. The evaluation of the multi-party, multimodal dialogue
 system MRE(1) provides a good example of what questions are important for
 such an evaluation, how to actually do the evaluation, and finally how to
 make specific problems of the system visible so that the evaluation
 results can be used to improve the system's performance.
 
 After a brief introduction of the MRE domain and architecture, I will
 break the task down to a set of general evaluation questions. From there I
 will explain what kinds of metrics and visualizations are suited to answer
 those questions and what kind of data is needed, as well as how that data
 was obtained. Along the road, examples of actual system problems and
 performances will be presented. The topics of data formatting and
 visualization will receive some special attention by introducing the MRE
 Evaluation Toolkit as well as the corpus it operates on.
 
 --------------------------------------------------------------------------
 
 Chen-kang Yang:
 Using the Omega Ontology to Determine Selectional Restrictions for Word Sense Disambiguation
 
 Word sense disambiguation is fundamental for language processing. Though
 purely statistical methods are effective for this task, they neglect the
 syntactic and semantic aspects. In this study, we adopt a hybrid approach
 by applying an unsupervised machine learning method to learn verbs'
 selectional restrictions on their subjects/objects. The system then uses
 these learned selectional restrictions for word sense disambiguation of
 the subjects/objects. Instead of words, the training data contains
 ontological taxonomy hierarchies that are retrieved from the Omega
 ontology. Unlike other similar systems, we are able to automatically find
 the best match among classes from different levels of the ontology. This
 provides us more flexibility and is closer to human instinct. Our system
 performs better than other similar systems, though it still needs
 cooperating methods for better results.

DTEND:20040809T163000
DTSTART:20040809T150000
LOCATION:11 Large
SUMMARY:CL Student Presentations [Justin Busch, Hai Huang, Jens Stephan & Chen-kang Yang]
UID:20040809T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I'll give a survey of trees and grammars, at least the parts that seem
 most relevant to ongoing work at ISI.  This will be a theory talk.  I'll
 start with context-free grammars, which were developed in the 1950s, and
 cover other tree-generating systems.  I'll also talk about
 tree-transforming systems.

DTEND:20040709T163000
DTSTART:20040709T150000
LOCATION:11 Large
SUMMARY:Survey of Trees and Grammars [Kevin Knight]
UID:20040709T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: For ten days in March, nine research teams worked together to build
 Cebuano language resources and systems for a "dry run" of the TIDES Surprise
 Language experiment. Cebuano is spoken widely in the southern
 Philippines, but there had previously been little work on computational
 linguistics for that language. As we prepare for the actual Surprise
 Language experiment this June, we will use this talk to look back on what
 worked, what didn't, and what lessons there are to be learned from our
 experience in March. Come prepared to share the excitement, offer your
 ideas, and understand why we have tried to ask Ed to cancel all vacations
 during the month of June (just kidding...).
 

DTEND:20030509T160000
DTSTART:20030509T150000
LOCATION:11 Large
SUMMARY:Coping with Surprise: The Case of Cebuano [Doug Oard]
UID:20030509T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: ISI's Tactical Language Project is a system designed to teach Americans
 how to speak Arabic through a video game environment. We've taken an FPS
 engine (Unreal 2003) and redone the graphics so it looks like you're in a
 typical Lebanese village. We took away the guns, added speech recognition,
 and set the player in the middle of it all. The theory is that if you
 learn well in a classroom, you'll perform well in a classroom, but if you
 learn well in a pseudo-naturalistic environment, you'll perform better in
 real life.
 
 In a pedagogical context, speech recognition is hard: we're trying
 to recover signal from noisy language-learner speech--with all of its
 mispronunciations, disfluencies, and grammatical errors. Language
 understanding is hopeless unless you have a good approximation of what
 kinds of mistakes learners make, and you can build a system to anticipate
 them.
 
 Suppose an English language learner says "Water". Is he asking you for
 water? Is he telling you there's a puddle in front of you? Is he saying
 his name is "Walter", but with horrible pronunciation? There's a lot of
 ambiguity involved. In order to disambiguate, we need to look at the
 speech signal itself, the utterance's context, the learner's past language
 performance, and details about the learner's mother language as it relates
 to English, etc., etc... Only then can we hope to guess what the learner
 is actually trying to say.
 
 And then, of course, once we've made a good guess at the learner's speech
 intentions, what do we do about it? How do we correct him? How do we
 balance the consideration of inherent qualities of learner motivation,
 language errors, learning objectives, and possibly low-confidence speech
 recognition, as we generate good pedagogical feedback?
 
 This is NLP (primarily statistical) with a bit of pedagogy theory and
 linguistic (SLA and phonology) theory sprinkled in.
 

DTEND:20041210T163000
DTSTART:20041210T150000
LOCATION:11 Large
SUMMARY:Developing a Language Model for Second Language Learner Speech [Nick Mote]
UID:20041210T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The Arabic language exhibits diglossia, i.e., the coexistence of two forms
 of language, a variety with standard orthography and sociopolitical clout
 which is not natively spoken by anyone (Modern Standard Arabic, MSA) and
 varieties that are primarily spoken and lack writing standards (Arabic
 dialects). There are important resources currently available for MSA with
 much on-going NLP work; for example, there is an Arabic Treebank and
 several syntactic parsers for MSA.  However, Arabic dialect resources and
 NLP research are still in their infancy. I will present work done at
 the Johns Hopkins CLSP Summer Workshop on parsing of Arabic dialects, in
 particular, Levantine Arabic.  We have experimented with three approaches
 to leveraging MSA resources to create a parser for Levantine Arabic, as
 well as methods for induction of MSA-Levantine translation lexicons and a
 Levantine part-of-speech tagger. Using these methods we obtain error
 reductions of up to 15% compared with applying an MSA parser directly to
 Levantine text.
 
 Rambow et al. Parsing Arabic Dialects: Final Report. Johns Hopkins
 University Center for Language and Speech Processing Workshop 2005.  
 http://www.clsp.jhu.edu/ws2005/groups/arabic/documents/finalreport.pdf
 
 Chiang et al. Parsing Arabic Dialects. To appear in Proc. EACL 2006.
 
 This is joint work with O. Rambow, M. Diab, N. Habash, R. Hwa, K. Sima'an,
 V. Lacey, R. Levy, C. Nichols and S. Shareef.

DTEND:20060210T160000
DTSTART:20060210T150000
LOCATION:11 Large
SUMMARY:Parsing Arabic Dialects [David Chiang]
UID:20060210T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We discuss the relevance of k-best parsing to recent applications in
 natural language parsing, and develop algorithms that substantially
 improve on previously-used algorithms with respect to efficiency,
 scalability, and accuracy. We demonstrate these algorithms in experiments
 on Bikel's implementation of Collins' lexicalized PCFG model, and on a
 synchronous CFG based decoder for statistical machine translation. We show
 in particular how the improved output of our algorithms has the potential
 to improve results from parse reranking systems and other applications.
 
 In this talk, I will demonstrate the convergence of several popular
 parsing formalisms (weighted deduction, shared forest, semiring) under the
 powerful hypergraph formalism. If time permits, I will also show how
 generic Dynamic Programming can be formalised as hypergraph searching.
 
 Joint work with David Chiang (University of Maryland)
 
 
 

DTEND:20050610T163000
DTSTART:20050610T150000
LOCATION:11 Large
SUMMARY:Better k-best Parsing, Hypergraphs and Dynamic Programming [Liang Huang (Penn)]
UID:20050610T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We revisit the idea of history-based parsing, and present a history-based
 parsing framework that strives to be simple, general, and flexible.  We
 also provide a decoder for this probability model that is linear-space,
 optimal, and anytime.  A parser based on this framework, when evaluated on
 Section 23 of the Penn Treebank, compares favorably with other
 state-of-the-art approaches, in terms of both accuracy and speed.
 
 

DTEND:20060310T163000
DTSTART:20060310T150000
LOCATION:10th Floor
SUMMARY:Exploring the Potential of Intractable Parsers [Mark Hopkins]
UID:20060310T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: (This is a practice run for a talk I will give a few times over the next
 weeks when interviewing for job positions.)
 
 I will review the state of the art in statistical machine translation
 (SMT), present my dissertation work, and sketch out the research
 challenges of syntactically structured statistical machine translation.
 
 The currently best methods in SMT build on the translation of phrases (any
 sequences of words) instead of single words. Phrase translation pairs are
 automatically learned from parallel corpora. While SMT systems generate
 translation output that often conveys a lot of the meaning of the original
 text, it is frequently ungrammatical and incoherent.
 
 The research challenge at this point is to introduce syntactic knowledge
 to the state of the art in order to improve translation quality. My
 approach breaks up the translation process along linguistic lines. I will
 present my thesis work on noun phrase translation and ideas about clause
 structure.
 

DTEND:20031010T160000
DTSTART:20031010T150000
LOCATION:11 Large
SUMMARY:Advances in Statistical MT: Phrases, Noun Phrases and Beyond [Philipp Koehn]
UID:20031010T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This summer we held a three-month workshop on syntax-driven machine
 translation, in which we learned syntactic transformations automatically
 from Chinese/English translated corpora and applied them to translate new
 text.  We'll give a progress report!
 

DTEND:20040910T163000
DTSTART:20040910T150000
LOCATION:11 Large
SUMMARY:About Syntax Fest 2004 (Part I) [Various]
UID:20040910T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Bilingual term lists have proven to be a useful basis for
 dictionary-based Cross-Language Information Retrieval (CLIR), but
 there is ample anecdotal evidence that differences in vocabulary
 coverage can have a substantial impact on retrieval effectiveness.
 This issue has recently been explored using ablation studies in which
 progressively smaller term lists were synthesized using sampling
 techniques. The ablation techniques used in those studies have not,
 however, been validated using real term lists. In this talk I will
 report the results of what we believe is the first large coverage
 study to use naturally occurring term lists. Thirty-five bilingual term
 lists were obtained from a variety of sources, each with English as
 one of the two paired languages. From these, we created 35
 English-to-English term lists by taking each term that was present in
 the English side of the list as its own translation. When used with
 an English information retrieval test collection, this allowed us to
 measure the reduction in retrieval effectiveness that could be
 attributed to deficiencies in the coverage of English terms. Eight
 types of untranslatable terms were identified in a collection of news
 stories, of which named entities were found to have the greatest
 impact on retrieval effectiveness. Differences in named entity
 coverage were found to produce large differences in retrieval
 effectiveness for term lists of similar sizes. Controlling for named
 entity effects yielded a clear relationship between retrieval
 effectiveness and the size of the translatable English vocabulary.
 The functional dependence that we observed is consistent with one
 previously applied ablation technique and inconsistent with another.
 Our results indicate that the outcome of a widely cited landmark study
 of query expansion effects for CLIR was likely affected by a flawed
 ablation model. We conclude our talk with a suggestion for further
 work on that topic, and a simple prescription for avoiding such
 problems in the future.
 

DTEND:20030612T240000
DTSTART:20030612T110000
LOCATION:11 Large
SUMMARY:Measuring the Effect of Dictionary Coverage on Cross-Language Retrieval [Dina Demner-Fushman]
UID:20030612T110000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
 

DTEND:20040312T163000
DTSTART:20040312T150000
LOCATION:11 Large
SUMMARY:About My Thesis Proposal [Deepak Ravichandran]
UID:20040312T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: These are two practice talks.
 
 -----------------------------------------------------------------------------
 FIRST TALK:
 
 The traditional approach to diagnosing learner speech errors in Computer
 Aided Language Learning is to create a linguistic profile of the
 learner/user. We, however, propose that work must also be done to model
 the linguistic profile of a typical native listener.
 
 Not all errors in second language learner speech are created equal.
 Different errors sound more "severe" or "harsh" to native speaker ears and
 should therefore be treated with more emphasis in pedagogical interaction.
 
 The Tactical Language Training System (TLTS) is a speech-enabled
 virtual-reality based computer learning environment designed to teach
 Arabic spoken communication to American English speakers. This talk
 addresses the ways the TLTS contextualizes non-native speech errors, and
 how this contextualization fits in the corrective exchanges between a
 non-native learner and a pedagogical agent built to model a native
 listener.
 
 The pedagogical system used in TLTS includes:
 
   * Automatic Speech Recognition (ASR) models which are built on a
     combination of both annotated and unannotated non-native speech with
     native speech data.
 
   * A stochastic generative model for errors in learner speech that
     creates mispronunciation grammars for the ASR.
 
   * Reweighting of system-perceived mispronunciation severity based on
     aggregate native speaker judgements of quality pronunciation and
     intelligibility.
 
   * Contextualization of feedback based on lexical and phonetic
     inventories of the native and non-native languages.
 
 
 -----------------------------------------------------------------------------
 SECOND TALK:
 
 We present a novel feature-enriched approach that learns to detect the
 conversation focus of threaded discussions by combining NLP analysis and
 IR techniques. Using the graph-based algorithm HITS, we integrate
 different features such as lexical similarity, poster trustworthiness, and
 speech act analysis of human conversations with feature-oriented link
 generation functions. It is the first quantitative study to analyze human
 conversation focus in the context of online discussions that takes into
 account heterogeneous sources of evidence. Experimental results using a
 threaded discussion corpus from an undergraduate class show that it
 achieves significant performance improvements compared with the baseline
 system.
 

DTEND:20060512T163000
DTSTART:20060512T150000
LOCATION:11 Large
SUMMARY:Pedagogical Contextualization of Language Learner Speech Errors AND Learning to Detect Conversation Focus of Threaded Discussions [Nick Mote and Donghui Feng]
UID:20060512T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Textual data is everywhere, in email and scientific papers, in online
 newspapers and e-commerce sites. The Web contains more than 200 terabytes
 of text, not even counting the contents of dynamic textual databases. This
 enormous source of knowledge is seriously underexploited. Textual
 documents on the Web are very hard to model computationally: they are
 mostly unstructured, time-dependent, collectively authored, multilingual,
 and of uneven importance.  Traditional grammar-based techniques don't
 scale up to address such problems. Novel representations and analytical
 tools are needed.
 
 I will discuss several current projects at Michigan related to text mining
 from a variety of genres. Depending on the amount of time, I will talk
 about (a) lexical centrality for multidocument summarization, (b)
 syntax-based sentence alignment, (c) graph-based classification, (d)
 lexical models of Web growth, and (e) mining protein interactions from
 scientific papers. As it turns out, the right representations, when
 complemented with traditional NLP and IR techniques, turn many of these
 into instances of better studied problems in areas such as social
 networks, statistical mechanics, sequence analysis, and computational
 phylogenetics.
 
 
 
 About the Speaker:
 
 Dragomir R. Radev is Assistant Professor of Information, Electrical
 Engineering and Computer Science, and Linguistics at the University of
 Michigan, Ann Arbor.  He leads the CLAIR (Computational Linguistics
 And Information Retrieval) group which currently includes 12
 undergraduate and graduate students.  Dragomir holds a Ph.D. in
 Computer Science from Columbia University.  Before joining Michigan,
 he was a Research Staff Member at IBM's TJ Watson Research Center in
 Hawthorne, NY.  He is the author of more than 45 papers on information
 retrieval, text summarization, graph models of the Web, question
 answering, machine translation, text generation, and information
 extraction.  Dr. Radev's current research on probabilistic and
 link-based methods for exploiting very large textual repositories,
 representing and acquiring knowledge of genome regulation, and
 semantic entity and relation extraction from Web-scale text document
 collections is supported by NSF and NIH.  Dragomir serves on the
 HLT-NAACL advisory committee, was recently reelected as treasurer of
 NAACL, is a member of the editorial boards of JAIR and Information
 Retrieval, and is a four-time finalist at the ACM international
 programming finals (as contestant in 1993 and as coach in
 1995-1997). Dragomir received a graduate teaching award at Columbia
 and recently, the U. of Michigan award for Outstanding Research
 Mentorship (UROP).
 

DTEND:20041112T163000
DTSTART:20041112T150000
LOCATION:11 Large
SUMMARY:Words, links, and patterns: novel representations for Web-scale text mining [Dragomir Radev]
UID:20041112T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I look at how the notion of discourse coherence can be
 modeled computationally. I begin with the following idea: if you take
 a text and shuffle its sentences into a random order, that text will
 no longer make sense. In other words, the text will be "incoherent".
 Our task is to learn how to reassemble a shuffled text into an order
 that humans would consider to be coherent.
 
 I discuss practical and theoretical motivations for the task,
 evaluations of our model, increases in performance achieved over the
 summer, and directions for future research.
 
 This work was done in collaboration with Kevin Knight, Daniel Marcu,
 Jonathan Graehl and Nick Mote.
 

DTEND:20030912T160000
DTSTART:20030912T143000
LOCATION:11 Large
SUMMARY:Discourse Coherence for Ordering Information [Lara Taylor]
UID:20030912T143000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Automated essay scoring was initially motivated by its potential cost
 savings for large-scale writing assessments.  However, as automated essay
 scoring became more widely available and accepted, teachers and assessment
 experts realized that the potential of the technology could go way beyond
 just essay scoring.  Over the past five years or so, there has been rapid
 development, and commercial deployment of automated essay evaluation for
 both large-scale assessment and classroom instruction.  A number of
 factors contribute to an essay score, including varying sentence
 structure, grammatical correctness, appropriate word choice, errors in
 spelling and punctuation, use of transitional words/phrases, and
 organization and development. Instructional software capabilities exist
 that provide essay scores and evaluations of student essay writing in all
 of these domains.  The foundation of automated essay evaluation software
 is rooted in NLP research.  This talk will walk through the development of
 Criterion(SM), e-rater, and Critique writing analysis tools, automated essay
 evaluation software developed at Educational Testing Service - from NLP
 research through deployment as a business.
 
 (Preview of an HLT/NAACL-2004 Invited Speaker Presentation)
 
 Jill Burstein
 Educational Testing Service
 Princeton, NJ
 

DTEND:20040413T163000
DTSTART:20040413T150000
LOCATION:4 Large
SUMMARY:Automated Essay Evaluation: From NLP research through deployment as a business [Jill Burstein (ETS)]
UID:20040413T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The last decade has seen a plethora of papers in NLP devoted to Machine
 Learning algorithms. However, most of these papers have devoted their
 effort exclusively to improving the system performance on the accuracy
 axis. Most of the sophisticated NLP algorithms are extremely slow and do
 not scale up easily when applied to large amounts of data.
 
 I will talk about the importance of randomized algorithms and their
 potential in speeding up some NLP algorithms. This talk will be a survey
 of some recent advances in Theoretical Computer Science/Math seen from an
 NLP point of view. I am not going to present any results. But I am hoping
 that this talk will clarify my thinking process, get feedback from people
 and help me collaborate with others.
 

DTEND:20040813T163000
DTSTART:20040813T150000
LOCATION:11 Large
SUMMARY:Randomized algorithms and its application to NLP [Deepak Ravichandran]
UID:20040813T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I'm going to talk about what I've been working on recently.  My thesis
 proposal is something having to do with the interaction of search,
 learning and features in supervised natural language problems.  I will be
 focusing on the task of coreference, since it is a well-studied problem,
 yet nevertheless not really solved and quite difficult.  It is also a 
 great pedagogical example for why we should care about something *other* 
 than standard Markov random fields for structured prediction, since, for 
 the coreference problem (and pretty much every other "real" natural 
 language problem) inference in such models is intractable.
 
 The contents of this talk will be roughly 40% from a paper I have at ICML
 this year on efficient, accurate supervised learning techniques for
 structured prediction (and why I feel inclined to make the very
 controversial statement that supervised learning for NLP problems is
 solved); it will be roughly 40% about an application of this technique to
 the coreference resolution problem and an exploration of the feature space
 for solving this problem (submitted to HLT); and it will be roughly 20%
 about looking forward to what I want to accomplish in the remainder of my
 thesis, not covered by the first 80%.

DTEND:20050613T240000
DTSTART:20050613T103000
LOCATION:11 Small
SUMMARY:Search, Learning and Features (my thesis proposal proposal) [Hal Daume III]
UID:20050613T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will describe some recent work on "natural logics", logics for languages
 that are more similar to human languages than traditional first order
 predicate logic, giving particular attention to questions about what the
 syntax encodes about semantic relations among sentences. On everyone's
 view, some but not all entailments are syntactically encoded (in a sense
 that I will define precisely), but, beyond this starting point,
 controversy starts almost immediately. Considering some particular
 examples, I will sketch methods for addressing some of the basic
 questions.
 
 

DTEND:20050513T163000
DTSTART:20050513T150000
LOCATION:11 Large
SUMMARY:Natural Logic [Ed Stabler (UCLA)]
UID:20050513T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Although a considerable number of generic Natural Language Generation
 (NLG) systems have been produced over the years, none of them is usually
 employed in end-to-end, text-to-text applications such as Machine
 Translation, Summarization, Question Answering, etc. In this talk, we
 identify the likely reasons for this state of affairs, and propose
 WIDL-expressions as a flexible formalism that facilitates the integration
 of a generic NLG engine within end-to-end language processing
 applications.
  
 WIDL-expressions compactly represent probability distributions over finite
 sets of candidate realizations, and have optimal algorithms for text
 realization via interpolation with language model probability
 distributions. We show the effectiveness of our WIDL-based NLG engine for
 both sentence realization and document realization tasks. By employing
 language models that capture sentence-level properties, we perform Machine
 Translation and Headline Generation at state-of-the-art levels or better.
 By employing language models that capture document-level properties such
 as text coherence, we synthesize output for Multi-document Summarization
 that displays both high content selection performance and increased
 coherence.
 
 

DTEND:20060414T163000
DTSTART:20060414T150000
LOCATION:11 Large
SUMMARY:Natural Language Generation for Text-to-Text Applications using an Information-Slim Representation [Radu Soricut]
UID:20060414T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: One of the key challenges in retrieval is what to do when a query term
 needs to be replaced with more than one term. This problem arises in
 applications such as cross language information retrieval and
 thesaurus expansion. One solution is to use structured query methods,
 which treat all the possible replacements as if they were one query
 term by computing a joint document frequency and a joint term
 frequency. This presentation will review prior work on structured
 query techniques and then introduce three new variants that aim to
 improve computational efficiency and to leverage estimates of
 replacement probabilities to improve retrieval effectiveness. The
 methods have now been tested in cross-language retrieval and
 OCR-degraded text retrieval applications in which replacement
 probabilities could be estimated. In both applications, the
 new structured query methods showed statistically significant
 improvements in retrieval effectiveness over previously known
 structured query methods.
 

DTEND:20030314T160000
DTSTART:20030314T150000
LOCATION:11 Large
SUMMARY:Improving the Efficiency and Effectiveness of Structured Query Methods [Kareem Darwish]
UID:20030314T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND:20030815T160000
DTSTART:20030815T150000
LOCATION:11 Large
SUMMARY:On Her Masters Research [Beata Klebanov]
UID:20030815T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: (Yarowsky et al., 2001) present an algorithm for bootstrapping a POS
 tagger for an arbitrary target language, using an existing POS tagger for
 a source language and a parallel corpus in the source and target
 languages.  The source text is annotated with the POS tagger; the parallel
 corpus is word-aligned; the POS tags are "projected" from source to target
 language; and finally smoothing is performed before training a POS tagger
 for the target language on the projected annotations.
 
 I will talk about my work (jointly with my advisor, Steve Abney, at U. of
 Michigan) in which we extend this algorithm by projecting from multiple
 source languages onto a target language, then combining the outputs to
 compute a consensus POS tagger.  Our hypothesis is that systematic
 transfer errors from different source-target pairs can be reduced by using
 multiple source languages.  I will present experimental results for three
 different source languages (English, German, and Spanish), and two
 different target languages (French and Czech).  Our results indicate that
 using multiple source languages improves performance.

DTEND:20050715T163000
DTSTART:20050715T150000
LOCATION:11 Large
SUMMARY:Inducing POS Taggers by Projecting from Multiple Source Languages [Victoria Li Fossum (Michigan)]
UID:20050715T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I'll present the investigation I'm carrying out at ISI
 lately under Daniel Marcu's supervision.  Following the noisy-channel
 framework, we propose a statistical model for learning the argument
 structures of verbs automatically.  We show that we are able to learn both
 lexicalized and generalized structures and achieve good results, relying
 only on basic NLP tools like a POS tagger and named-entity recognizer. We
 also present a comparison of the structures we learn with the predicted
 ones in PropBank.
 

DTEND:20041115T163000
DTSTART:20041115T150000
LOCATION:8th floor multipurpose room (#849) -- NOT the conference room
SUMMARY:Unsupervised learning of verb argument structures [Thiago Pardo]
UID:20041115T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I present my summer project - writing rule-based software for
 simplifying texts. Task definition and motivations will be
 discussed, as well as human and automatic evaluation, the
 latter using a question answering system.
 
 This is joint work with Daniel Marcu and Kevin Knight.
 

DTEND:20030915T160000
DTSTART:20030915T143000
LOCATION:11 Large
SUMMARY:Analyzing Sentences into Facts: Simple is Beautiful [Beata Klebanov]
UID:20030915T143000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Although we live in a predominantly statistical world, there are still
 many language processing applications that long for accurate
 representations of text meaning. Even applications that found partial
 solutions in statistical modeling, including information retrieval,
 machine translation, or automatic summarization, are likely to get a
 significant boost from deeper text understanding.
 
 In this talk, I will present an innovative method for automatic extraction
 of conceptual graphs as a means to represent text meaning.  The method
 relies on a novel adaptation of graph-based ranking algorithms -
 traditionally (and successfully) used in citation analysis, Web page
 ranking, and social networks. I will show how such algorithms can be
 adapted to semantic networks, resulting in an efficient unsupervised
 method for resolving the semantic ambiguity of all words in open text, and
 identifying relations between entities in the text. I will also outline a
 number of applications that are enabled by this representation, including
 keyphrase extraction, domain classification, and extractive summarization.
 
 BIO: Rada Mihalcea is an Assistant Professor of Computer Science at
 University of North Texas. Her research interests are in lexical
 semantics, minimally supervised natural language learning, and
 multilingual natural language processing. She is currently involved in a
 number of research projects, including word sense disambiguation, shallow
 semantic parsing, (non-traditional) methods for building annotated corpora
 with volunteer contributions over the Web, word alignment for language
 pairs with scarce resources, and graph-based ranking algorithms for
 language processing. Her research is supported by NSF and the state of
 Texas.

DTEND:20040416T240000
DTSTART:20040416T103000
LOCATION:11 Large
SUMMARY:Graph-based Ranking Algorithms for Language Processing [Rada Mihalcea (UNT)]
UID:20040416T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Broad-coverage repositories of semantic relations between verbs could
 benefit many NLP tasks. We present a semi-automatic method for extracting
 fine-grained semantic relations between verbs. We detect similarity,
 strength, antonymy, enablement, and temporal happens-before relations
 between pairs of strongly associated verbs using lexico-syntactic patterns
 over the Web. On a set of 29,165 strongly associated verb pairs, our
 extraction algorithm yielded 65.5% accuracy. We provide the resource,
 called VerbOcean, for download at http://semantics.isi.edu/ocean/. We will
 also discuss current work on disambiguating the verbs in the network as
 well as refining the semantic relations using path analysis.
 
 

DTEND:20040816T153000
DTSTART:20040816T140000
LOCATION:11 Large
SUMMARY:VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations [Patrick Pantel & Tim Chklovski]
UID:20040816T140000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Ranked lists of output trees from syntactic statistical NLP applications
 frequently contain multiple repeated entries. This redundancy leads to
 misrepresentation of tree weight and reduced information for debugging and
 tuning purposes. It is chiefly due to nondeterminism in the weighted
 automata that produce the results. I will introduce an algorithm that
 determinizes such automata while preserving proper weights, returning the
 sum of the weight of all multiply derived trees. I will also report
 results of the application of the algorithm to machine translation and
 Data Oriented Parsing.
 
 

DTEND:20051216T163000
DTSTART:20051216T150000
LOCATION:11 Large
SUMMARY:A Better N-Best List - Practical Determinization of Weighted Finite Tree Automata [Jonathan May]
UID:20051216T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Leading Question-Answering systems employ a variety of means to boost the
 accuracy of their answers.  Such methods include redundancy (getting the
 same answer from multiple documents/sources), deeper parsing of questions
 and texts (hence improving the accuracy of confidence measures),
 inferencing (proving the answer from information in texts plus background
 knowledge) and sanity-checking (verifying that answers are consistent with
 known facts).  To our knowledge, however, no QA system deliberately asks
 additional questions in order to derive constraints on the answers to the
 original questions.
 
 We present in this talk the method of QA-by-Dossier-with-Constraints (QDC).
 This is an extension of the simpler method of QA-by-Dossier, in which
 definitional questions ("Who/what is X") are addressed by asking a set of
 questions about anticipated properties of X.  In QDC, the collection of
 Dossier candidate answers, along with possibly other answers to questions
 asked expressly for this purpose, are subjected to satisfying a set of
 naturally-arising constraints.  For example, for a "Who is X" question, the
 system will ask about birth, accomplishment and death dates, which, if they
 exist, must occur in that order, and also obey other constraints such as
 lifespan.  Temporal, spatial and kinship relationships seem to be
 particularly amenable to this treatment, but it would seem that almost any
 "factoid" question can benefit from QDC.  We will discuss the setting-up
 and application of constraint networks, and talk about how (and whether) to
 develop the constraint sets automatically.  We will demonstrate several
 applications of QDC, and present one evaluation in which the F-measure for
 a set of questions improved with QDC from .39 to .69.

DTEND:20040116T150000
DTSTART:20040116T140000
LOCATION:11 Large
SUMMARY:Using Constraints to Improve Question-Answering Accuracy [John Prager (IBM)]
UID:20040116T140000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
 

DTEND:20040716T163000
DTSTART:20040716T150000
LOCATION:11 Large
SUMMARY:Practice Talks for ACL (+workshops) [Hal Daume III and Radu Soricut]
UID:20040716T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Following the recent adoption by the machine translation community of
 automatic evaluation using the BLEU/NIST scoring process, we conduct an
 in-depth study of a similar idea for evaluating summaries. The results
 show that automatic evaluation using unigram co-occurrences between
 summary pairs correlates surprisingly well with human evaluations, based
 on various statistical metrics; while direct application of the BLEU
 evaluation procedure does not always give good results.
 

DTEND:20030516T160000
DTSTART:20030516T150000
LOCATION:11 Large
SUMMARY:Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics [Chin-Yew Lin]
UID:20030516T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This talk will address the problem of assessing the correctness of MT
 output on the word level. I will give an overview of word confidence
 measures for SMT.  Different variants of word posterior probabilities that
 can be directly used as confidence measures will be presented. Their
 connection with the Bayes decision rule and the underlying error measure
 will be shown. Experimental comparison of different word confidence
 measures will be presented on a translation task consisting of technical
 manuals.
 
 Additionally, I will show how word confidence measures can be applied in
 an interactive SMT system. This system predicts translations, taking parts
 of the sentence into account that have already been accepted or typed by
 the user. Through the use of confidence measures, the performance of the
 prediction engine can be improved.
 
 
 About the Speaker:
 
 Nicola Ueffing is a graduate research assistant at the group for "Human
 Language Technology and Pattern Recognition" (Lehrstuhl fuer Informatik
 VI) at RWTH Aachen University. She received her diploma in mathematics
 from RWTH Aachen University in 2000. Her research topic is statistical
 machine translation, focusing on confidence measures for SMT. In 2003, she
 was a member of the team working on "Confidence Estimation for SMT" at the
 CLSP workshop at JHU.
 

DTEND:20041217T163000
DTSTART:20041217T150000
LOCATION:11 Large
SUMMARY:Word-Level Confidence Measures for SMT [Nicola Ueffing]
UID:20041217T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: My presentation will overview recent activities on Chinese-English SMT
 carried out at ITC-irst (Trento, Italy).  After an overview of the
 complete architecture of our system, I will focus on progress made in
 Chinese word-segmentation, phrase-based modeling and decoding, log-linear
 modeling and minimum error training, and language model adaptation.
 Experimental results will be provided in terms of Bleu and Nist scores on
 two translation tasks:  basic traveling expressions and news reports,
 respectively adopted by the C-STAR consortium and for the 2002 and 2003
 NIST MT evaluation campaigns.
 
 Bio:
 
 Marcello Federico has been a permanent researcher at ITC-irst since 1991.  
 During 1998-2003, he led the "Multilingual natural speech technologies"
 (MUNST)  research line at ITC-irst.  Since 2004, he has been head of the
 "Cross-language information processing" (Hermes) research line. His
 interests include automatic speech recognition, statistical language
 modeling, information retrieval, and machine translation.
 
 

DTEND:20040617T163000
DTSTART:20040617T150000
LOCATION:4th Floor
SUMMARY:Statistical Machine Translation at ITC-irst [Marcello Federico]
UID:20040617T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: As a discipline of biology, the field of neuroscience suffers greatly from
 information overload, non-standardization and complexity. In the absence
 of a mathematical theoretical structure for the subject, scientists use
 their own ad-hoc methods of collating and synthesizing information from
 both the primary literature and their own data. In order to eventually
 formalize and accelerate the development of theoretical approaches in the
 subject, we are combining an Electronic Laboratory Notebook (ELN) with
 asset management of the primary research literature to construct a
 knowledge engineering framework based around the organizational unit of a
 neuroscience laboratory. This project, called "NeuroScholar"
 (http://www.neuroscholar.org/) is open-source, and is being tested and
 used in the laboratories of Prof. Larry Swanson and Prof. Alan Watts at
 USC. In each laboratory, the system will operate on top of a "laboratory
 corpus" of knowledge resources (data files, full-text PDF files, etc.)
 that summarizes the relevant knowledge for that laboratory. Not only will
 this collection provide a valuable resource for the members of the
 laboratory, it provides a platform for natural language processing and
 knowledge engineering to answer formally-defined research questions. The
 Society for Neuroscience's annual meeting attracts over 30,000 attendees,
 who collectively form a potential user base for this software.
 
 I will talk about the ideas underlying the project, the current
 implementation of NeuroScholar, developments from collaboration with the
 natural language group at ISI and possible collaborations for the future.
 
 

DTEND:20050617T240000
DTSTART:20050617T103000
LOCATION:11 Large
SUMMARY:The neuroscience laboratory as a knowledge factory: challenges, approaches and tools [Gully Burns]
UID:20050617T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In the 1990s, researchers applied their new developments in transducer
 theory using widely available easy-to-use toolkits for string transducers,
 and made well-known advances in parsing, machine translation, and other
 areas. Rapid prototyping via software such as the AT&T toolkit and carmel
 was useful for proofs of concept and in many cases led to unforeseen
 developments in novel areas. In the current NLP research environment, tree-based
 strategies and new models have shown promising results in advancing
 the state of the art, and recent developments in weighted tree automata
 theory are enriching the bedrock created 40 years ago, but as of yet there
 is no toolkit available with the necessary capabilities to turn promise
 into solution.
 
 Tiburon is the first probabilistic tree transducer toolkit. Similar in form
 and function to the string-based toolkits of yesteryear, it is designed to
 be easy to use, with simple but expressive definitions of tree automata
 and a concise set of vital operations that can be used to construct many
 useful tree-based NLP projects. Although a work in progress, Tiburon is
 already a usable tool with active users between the ages of 6 and 41. I
 will describe the current status of the system, demonstrate ease of use
 and potential power, and discuss the challenges ahead.

DTEND:20060317T163000
DTSTART:20060317T150000
LOCATION:4th Floor
SUMMARY:Tiburon: A Finite State Tree Automata Toolkit [Jon May]
UID:20060317T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: An Overview of Question Answering Challenge
 Jun'ichi Fukumoto and Tsuneaki Kato
 
 In this talk, we will present an overview of Question Answering
 Challenge (QAC), which is the question answering task of the NTCIR
 Workshop.  QAC-1 (the first evaluation of QAC) was carried out
 at NTCIR Workshop 3 in October 2002, and QAC-2 will be at
 NTCIR Workshop 4 in December 2003.  In the QAC, systems to be
 evaluated are expected to return exact answers consisting of a noun
 or noun compound denoting, for example, the names of persons,
 organizations, or various artifacts or numerical expressions such
 as money, size, or date.  Those basically range over the Named
 Entity (NE) elements of MUC and IREX but are not limited to them.
   QAC consists of three kinds of subtasks: Task 1, where the systems
 are allowed to return five ranked possible answers; Task 2, where
 the systems are required to return a complete list of answers; and
 Task 3, where the systems are required to answer a series of questions
 that have anaphora and zero-anaphora.  We will present the results of
 QAC-1, and the vision and prospects for QAC-2.
 
 NTCIR -- the Way Ahead
 Noriko Kando
 
 Dr. Noriko Kando is the leader of the NTCIR (Test Collections and Evaluation
 of IR, Text Summarization, Q&A, etc.) project, and an associate professor
 at the National Institute of Informatics (NII).  She received her Ph.D. in 1995
 from Keio University.  Her research interests include evaluation of
 information retrieval systems, technologies to "Make Information Usable
 for Users", cross-lingual information retrieval, and analysis of text
 structure, genre, citation & link.  She is a member of the editorial boards of
 International Journal on Information Processing and Management,
 ACM Transactions on Asian Language Information Processing, etc.
 
 Jun'ichi Fukumoto and Tsuneaki Kato are task organizers of QAC.
   Dr. Jun'ichi Fukumoto is an associate professor at Ritsumeikan
 University.  He received his Ph.D. in 1999 from the University of Manchester
 Institute of Science and Technology.  His research interests include
 Q&A, automatic summarization, and dialogue processing.
 Dr. Tsuneaki Kato is an associate professor at the University of Tokyo.
 He received his Dr. of Engineering in 1995 from the Tokyo Institute of
 Technology.  His research interests include multimodal dialogue
 processing, multimodal presentation generation and domain-independent
 question answering.  He is a member of the editorial committee of the
 Transactions on Information and Systems of The Institute of Electronics,
 Information and Communication Engineers.
 

DTEND:20031117T240000
DTSTART:20031117T103000
LOCATION:4th Floor
SUMMARY:An Overview of the QA Challenge + NTCIR -- The Way Ahead [Dr. Kato and Dr. Fukumoto (NTCIR)]
UID:20031117T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The annual Computational Linguistics Open House will be held at USC's Information
 Sciences Institute from 3:00-4:30pm in the 11th floor Conference Room. Researchers from
 ISI, including Eduard Hovy, Daniel Marcu, and Kevin Knight will present overviews of
 their latest research.  We will also hear about the research activities of Dani Byrd of
 the Linguistics Department, Shri Narayanan's group in EE, and David Traum and Andrew
 Gordon of USC's Institute for Creative Technologies.
 

DTEND:20031017T163000
DTSTART:20031017T150000
LOCATION:11 Large
SUMMARY:Introduction to CL Research [Hovy, Marcu, Knight, Byrd, Narayanan, Traum, Gordon]
UID:20031017T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This summer we held a three-month workshop on syntax-driven machine 
 translation, in which we learned syntactic transformations automatically
 from Chinese/English translated corpora and applied them to translate new
 text.  We'll give a progress report!
 
 

DTEND:20040917T163000
DTSTART:20040917T150000
LOCATION:11 Large
SUMMARY:About Syntax Fest 2004 (Part II) [Various]
UID:20040917T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
 

DTEND:20050218T163000
DTSTART:20050218T150000
LOCATION:11 Large
SUMMARY:TBA [Inderjeet Mani (Georgetown)]
UID:20050218T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND:20030718T160000
DTSTART:20030718T150000
LOCATION:11 Large
SUMMARY:A Maryland Yankee in King Eduard's Court: Some Remarks on a Year in Paradise [Doug Oard]
UID:20030718T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This talk is the second of three tutorial lectures on ontologies.  It
 first shows some details of various Upper Ontologies: ResearchCYC, SUMO,
 DOLCE, and the Penman Upper Model.  It then discusses the problem of
 creating content for the 'Middle Model' zone of ontologies, and outlines a
 methodology for moving from words to word senses to concepts.  It
 concludes by describing ISI's Omega ontology and showing how Omega has
 been used in annotation projects to support semantic labeling of texts.
 
 Please bring a pen or pencil and some paper; there is a small exercise!
 

DTEND:20050318T163000
DTSTART:20050318T150000
LOCATION:11 Large
SUMMARY:Methodologies of ontology content construction [Ed Hovy]
UID:20050318T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Previous research has indicated that when a polysemous word appears two
 or more times in a discourse, it is extremely likely that all occurrences will
 share the same sense (Gale et al. 92). However, those results were
 based on a coarse-grained distinction between senses (e.g., "sentence"
 in the sense of a 'prison sentence' vs. a 'grammatical
 sentence'). I conducted an analysis of multiple senses within two
 sense-tagged corpora, Semcor and DSO. These corpora used WordNet for
 their sense inventory. I found significantly more occurrences of
 multiple senses per discourse than reported in (Gale et al. 92) (33%
 instead of 4%). I also found classes of ambiguous words in which as
 many as 45% of the senses in the class co-occur within a document. I
 will discuss the implications of these results for the task of
 word-sense tagging and for the way in which senses should be
 represented.

DTEND:20031219T163000
DTSTART:20031219T150000
LOCATION:11 Large
SUMMARY:More than One Sense Per Discourse [Robert Krovetz (Ask Jeeves)]
UID:20031219T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In the past decade, researchers have explored many approaches to
 automatically extract large collections of knowledge from text. In this
 talk, we present Espresso, a weakly-supervised, general-purpose, and
 broad-coverage algorithm for harvesting binary semantic relations. The
 main contributions are: i) a method for exploiting generic patterns by
 filtering incorrect instances using the Web; and ii) a principled measure
 of pattern and instance reliability enabling the filtering algorithm. We
 present an empirical comparison of Espresso with various state of the art
 systems, on different size and genre corpora, on extracting various
 general and specific relations. Experimental results show that our
 exploitation of generic patterns substantially increases system recall
 with small effect on overall precision.
 

DTEND:20060519T163000
DTSTART:20060519T150000
LOCATION:11 Large
SUMMARY:Espresso: Making Use of Generic Patterns for Mining Relations from Small and Large Corpora [Patrick Pantel]
UID:20060519T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: As DARPA's TIDES (Translingual Information Detection, Extraction, and
 Summarization) program comes to an end, I will give a summary of what we
 have learned from TIDES in summarization and a brief overview of our
 current effort in developing automatic evaluation methods that go beyond
 surface n-gram matching. Topics to be covered:
 
 (1) Summary of DUCs 2001 - 2004
 (2) Automatic Evaluations in Summarization and MT
 (3) Basic Elements - New Efforts in Summarization at ISI

DTEND:20041119T163000
DTSTART:20041119T150000
LOCATION:11 Large
SUMMARY:After TIDES, What's Left? - Finding Basic Elements [Chin-Yew Lin]
UID:20041119T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will be presenting some recent results of mine regarding the possibility
 of automatic evaluation in summarization.  I will discuss both my own 
 findings, as well as those of people here and at Columbia, and attempt to 
 explain in a principled fashion why there are disparate opinions on the 
 plausibility of performing automatic evaluation in this task.  I will
 discuss my (perhaps pessimistic) views on the plausibility of doing any
 sort of evaluation of summarization, automatic or otherwise.
 
 The results and experimental setups developed in connection with 
 summarization will be extended to machine translation.  I will review 
 possible reasons why metrics such as BLEU have experienced significantly 
 more success in machine translation than in summarization.  I will also 
 connect the evaluation criteria developed in the context of summarization 
 to machine translation, and discuss the automation of these methods.
 
 In short: I'll talk about why I've been doing so much data elicitation 
 recently.
 
 This will be a highly informal seminar and participation is highly
 encouraged.
 

DTEND:20040220T160000
DTSTART:20040220T150000
LOCATION:4 Large
SUMMARY:Some Results in Automatic Evaluation for Summarization and MT [Hal Daume III]
UID:20040220T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Narratology analyzes the discursive structure of narratives as finalized
 products of human invention, such as novels, short-stories, or
 fairy-tales. Those narratives are rendered in a given surface form;
 Narratology focuses on narratives in natural language. Narratologists
 assume that each narrative surface representation is associated with a
 neutral, abstract event sequence, the "Story" (histoire, sjuzhet). The
 abstractness of Story is illustrated by the fact that the same Story can
 be realized in different surface texts. By discursive structure or
 "Discourse" (discours, fabula), narratologists mean the relation between an
 abstract Story and its concrete expression in a sequential text. For
 example, if the chronological order of the Story is not respected in its
 textual recount, we are dealing with the Discourse parameter of order.
 Other Discourse parameters include the frequency with which Story events
 are evoked, the point of view from which they are narrated (perceived,
 evaluated,...), or framed narratives with several narrative levels.
 
 The Story Generator Algorithms project at the University of Hamburg
 evaluated several existing Story Generators with respect to their
 discursive abilities. It became obvious that most Story Generators
 concentrate on creating a coherent and chronological abstract Story,
 which is directly mapped onto natural language. This results in a
 predominance of 1:1 relations between Story and surface, and in most
 cases corresponds to a default or zero instantiation of Discourse
 parameters. As a consequence, Story Generator outputs tend to be very
 explicit and straightforward, and are likely to be perceived as uniform
 and boring.
 
 Narratological expert knowledge might be useful to future enhanced Story
 Generators and to Natural Language Generation systems dealing with
 narrative. One of the aims of Computational Narratology is to model that
 expert knowledge. Ideally, narratological knowledge will be integrated
 into a Narratological Structurer, as a processing component of an
 advanced system that creates narratives. In such a system, the
 Narratological Structurer will be the interface between a Story Generator
 and subsequent Natural Language Generation modules. The talk also
 presents examples of the knowledge that is being modelled.
 
 
 About the Speaker:
 
 Birte Lönneker graduated from the University of Hamburg, Germany, with a
 degree in French with Finno-Ugristics (Finnish) and Business
 Administration. Since then, her main fields of publication have been Cognitive
 Linguistics and electronic resources for Natural Language Processing,
 with special focus on frames and metaphors, as well as electronic
 dictionaries, corpora, and recently part-of-speech tagging. Her PhD on
 Concept Frames and Relations, also published as a book in 2003, was
 co-supervised at the Institute for Romance Languages and at the
 Department of Informatics in Hamburg. For her Slovenian-German online
 dictionary, Birte Lönneker was twice awarded the EURALEX Laurence Urdang
 Award. From 2002 to 2004, she received various research grants for
 Slovenia, where she was working in the Corpus Laboratory of the Institute
 of Slovenian Language.
 
 Since 2004, Birte Lönneker has carried out research on Story Generator
 Algorithms within the Narratology Research Group Hamburg. She is also a
 board member of the German Cognitive Linguistics Association.
 

DTEND:20050620T113000
DTSTART:20050620T100000
LOCATION:11 Small
SUMMARY:Between Story Generation and Natural Language Generation [Birte Loenneker (Hamburg)]
UID:20050620T100000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND:20030520T160000
DTSTART:20030520T150000
LOCATION:11 Large
SUMMARY:Discourse Segmentation of Multi-Party Conversation [Michel Galley]
UID:20030520T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, we introduce a methodology for analyzing judgment opinions.
 We define a judgment opinion as consisting of a valence, a holder, and a
 topic. We decompose the task of opinion analysis into four parts: 1)
 recognizing the opinion; 2) identifying the valence; 3)  identifying the
 holder; and 4) identifying the topic. We evaluate our methodology using
 both intrinsic and extrinsic measures.

DTEND:20060421T163000
DTSTART:20060421T150000
LOCATION:11 Large
SUMMARY:Identifying and Analyzing Judgment Opinions [Soo-Min Kim]
UID:20060421T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The large corpora of written text that are available to the language
 community have largely been utilized for language understanding; they have
 been somewhat ignored in the context of language generation. Recent
 developments in stochastic generation have allowed such systems to shift
 the burden from hand crafted databases (lexicons, grammars, ontologies) to
 the knowledge implicitly found in written text. However, when building a
 dialogue system, generation is largely interactive, very different from
 the written structure of most corpora.
 
 In this talk, I will discuss my recent work on applying a stochastic
 generator, HALogen, and its newswire language model to a dialogue system,
 TRIPS. I'll describe the difficulties in mapping the TRIPS semantic form
 into HALogen's representation, the critical differences between newswire
 and dialogue, and the possibility of using HALogen and a large newswire
 model as a domain independent generator. 
 

DTEND:20030221T160000
DTSTART:20030221T150000
LOCATION:11 Large
SUMMARY:Statistical Language Generation in a Dialogue System [Nate Chambers]
UID:20030221T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This talk will be about automatic speech-to-speech translation.  In our
 system, a doctor speaks one language, the patient speaks another language,
 and the machine translates their utterances from one language to the
 other.  The talk will be followed by a demo of our system.
 
 One approach we have been successful with is phrase classification, i.e.,
 classifying a noisy speech-recognized utterance into one of many meaning
 categories.  Phrase classification is computationally cheap and can
 provide high-quality translations for in-domain utterances almost
 instantaneously. Speed is important for speech translation, where
 processing delay is a great concern.
 
 In this talk, different aspects of building a classification-based speech
 translator are discussed. Following an overview of automatic
 speech-to-speech translation and its challenges, a comparison of different
 classification methods is presented and data collection techniques for
 that application are introduced.
 
 

DTEND:20040621T160000
DTSTART:20040621T150000
LOCATION:11 Large
SUMMARY:Speech-to-Speech Translation: A Phrase Classification Approach [Emil Ettelaie]
UID:20040621T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Term weighting methods have been shown to give significant increases
 in information retrieval performance. Term weights are typically
 calculated using frequency counts across the whole retrieval
 collection, frequency of each term within individual documents and
 compensation for varying document length. The presence of pronominal
 references in documents effectively reduces the within-document term
 frequency of associated words with a consequent effect on term weights
 and information retrieval behaviour. This presentation will describe
 an experimental investigation into the impact on information retrieval
 performance of broad coverage automatic pronoun resolution. Results
 using a standard information retrieval test collection indicate that
 calculating term weights using a pronoun resolved version of the
 document test collection can improve both fixed cutoff and average
 retrieval precision.
 

DTEND:20030321T160000
DTSTART:20030321T150000
LOCATION:11 Large
SUMMARY:An Investigation of the Application of Broad Coverage Automatic Pronoun Resolution in Information Retrieval [Gareth Jones]
UID:20030321T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Natural Language Understanding: A fast and accurate Statistical Learning Approach for Dialogue Systems
 
 Natural Language Understanding (NLU) is an essential module of a good
 dialogue system. To achieve satisfactory performance levels, real time
 dialogue systems need the NLU module to be both fast and accurate. Finite
 State Model (FSM) based systems are fast and accurate but lack robustness
 and flexibility. The Statistical Learning Model (SLM) based systems are
 robust and flexible but lack accuracy and are often slow.
 
 In this talk, I am going to talk about an SLM based NLU approach for
 dialogue utterances that is both accurate and fast. The system has high
 accuracy and produces frames in real time.
 
 A Community of Words: Understanding Social Relationships from E-mail
 
 A corpus of e-mail messages presents a number of challenges for NLP
 techniques, with its nearly unconstrained structure and vocabulary,
 mistyped words and ungrammatical sentences, and extensive contextual
 information that is never explicitly stated. Yet, the intrinsically social
 nature of such communication provides an opportunity to study not just a
 bag of words, but also the relationships, competencies, and activities
 behind them.
 
 This talk presents work with Eduard Hovy as part of the MKIDS project.
 

DTEND:20040521T163000
DTSTART:20040521T150000
LOCATION:11 Large
SUMMARY:Statistical Learning for Dialogue System and A Community of Words [Tom Murray and Rahul Bhagat]
UID:20040521T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I am going to be talking about stuff that I have been working on over the
 past 6-9 months. This includes randomized algorithms and their application
 to 2 NLP problems: noun clustering and noun-pair clustering. I will also
 be commenting on my experience of working with very large amounts of
 real natural language text. (This includes processing and working with data
 available from the web. This corpus is not the standard newspaper text
 that we are so used to in the NLP community.) This talk will also cover a
 large part of my thesis work.

DTEND:20050422T163000
DTSTART:20050422T150000
LOCATION:11 Large
SUMMARY:Working with Large Corpus, High speed clustering and its applications [Deepak Ravichandran]
UID:20050422T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND:20030822T160000
DTSTART:20030822T150000
LOCATION:11 Large
SUMMARY:Information Extraction, IR and QA [Satoshi Sekine]
UID:20030822T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: EM has proved to be a great and useful technique for unsupervised learning
 problems in natural language.  Unfortunately, it cannot solve every
 problem out there, either because the E-step is intractable, the M-step is
 intractable or both.  Typically our community resorts to a Viterbi
 approximation in this case, which really isn't very justified and can
 easily diverge from our expectations (no pun intended). Moreover, EM --
 like all maximum likelihood methods -- suffers from a need for ad-hoc and
 undesirable smoothing.  All of these problems -- intractable E- or
 M-steps, the Viterbi approximation, and the annoyance of smoothing -- are
 solved by using Bayesian methods. Moreover, from a theoretical point of
 view, the Bayesian paradigm is much more foundationally well justified
 than the frequentist use of estimators (such as the maximum likelihood
 estimator), at some cost in computation (though not as much as you might
 believe).
 
 In this tutorial, I will discuss Bayesian methods as they can be used in
 natural language processing.  The first half will be background (some of
 which you probably won't have seen, some of which you probably will have
 seen, but which will probably be presented in a different way than you're
 used to) including graphical models, EM, priors and pro- (and con-)
 Bayesian arguments.  The second half of the tutorial will focus on solving
 complex inference problems, essentially building on what we've seen from
 EM.  I'll cover MAP (*not* Bayesian -- if you can't tell me why, then you
 should come to the tutorial!), summing, Monte Carlo, MCMC, Laplace,
 variational and expectation propagation.  Time permitting, I will briefly
 discuss Bayesian discriminative models (basically what a Bayesian uses
 instead of SVMs), non-parametric (infinite) models and Bayesian decision
 theory, all of which make use of the inference techniques we will have
 already covered.
 
 This tutorial is intended to be largely self-contained, though I will
 expect that you know what probabilities are, what distributions are and
 the standard manipulations of conditional/joint distributions. Familiarity
 with EM would be helpful, but I'll cover this topic in some depth since it
 will be important for understanding the rest of the tutorial.  I hope --
 though this never really seems to come to fruition -- that this will be a
 semi-interactive talk and I will attempt to adjust according to what
 people are interested in and what is putting people to sleep.
 
 (see http://www.isi.edu/~hdaume/bayesnlp/ for more information)
 

DTEND:20050622T163000
DTSTART:20050622T130000
LOCATION:11 Large
SUMMARY:Beyond EM: Bayesian Techniques for NLP Researchers [Hal Daume III]
UID:20050622T130000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This is a practice tutorial for one I am giving at HLT/NAACL one week
 later.  Comments/feedback are very welcome.
 
 ----------------------------------------------------------------------
 
 Expectation Maximization (EM) has proved to be a great and useful
 technique for unsupervised learning problems in speech and language
 processing.  Unfortunately, its range of applications is limited either by
 intractable E- or M-steps, or by its reliance on the maximum likelihood
 estimator.  The natural language processing community typically resorts to
 ad-hoc approximation methods to get (some reduced form of) EM to apply to
 NLP tasks.  However, many of the problems that plague EM can be solved
 with Bayesian methods, which are theoretically more well justified.  In
 this tutorial, I discuss Bayesian methods as they can be used in natural
 language processing.  The two primary foci of this tutorial are specifying
 prior distributions and performing the necessary computations to perform
 inference in Bayesian models.  I focus on unsupervised techniques (for
 which EM is the obvious choice), but discuss supervised and discriminative
 techniques at the conclusion with pointers to relevant literature.
 
 Depending on one's inference technique of choice, the math required to
 build Bayesian learning models can be difficult.  Compounding this problem
 is the fact that current written tutorials on Bayesian techniques tend to
 focus on continuous-valued problems, a poor match for the high-dimension
 discrete world of text.  This combination makes the cost of entrance to
 the Bayesian learning literature often too high.  The goal of this
 tutorial is to provide sufficient motivation, intuition and vocabulary
 mapping so that one can easily understand recent papers in Bayesian
 learning that are published at conferences like NIPS, and increasingly at
 ACL.  In addition to the standard tutorial materials (slides), this
 tutorial is accompanied by a technical report that spells out all the
 mathematical derivations in great detail, for those who wish to start
 research projects in this field.
 
 This tutorial should be accessible to anyone with a basic understanding of
 statistics.  I use a query-focused summarization task as a motivating
 running example for the tutorial, which should be of interest to
 researchers in natural language processing and in information retrieval.  
 Additionally, though the tutorial does not focus on speech problems, those
 attendees interested in graphical modeling techniques for automatic speech
 recognition might also find the tutorial of interest.

DTEND:20060524T240000
DTSTART:20060524T090000
LOCATION:4th Floor
SUMMARY:Beyond EM: Bayesian Techniques for Human Language Technology Researchers [Hal Daume III]
UID:20060524T090000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: As part of an effort to encode the commonsense knowledge we need in
 natural language understanding, I have been looking at several very common
 words and their uses in diverse corpora, and asking what we have to know
 to understand this word in this context.  In this talk, I will describe
 the investigations of the uses of two words -- the adverb "now" and the
 preposition "like".
 
 One might think that "now" simply expresses a temporal property of an
 event.  But in fact in almost every instance, it is used to point up a
 contrast -- "This is true now.  Something else was true then."  It is thus
 more of a relation than a property.  I will describe several categories of
 such relations.  Another question of interest about "now" is "How long a
 period is the word "now" describing in its various uses?": "I'm typing an
 abstract now" vs. "We travel by automobile now."  I suggest some
 categories of knowledge that need to be encoded to answer this question.
 
 When we successfully understand "A is like B", we have figured out some
 property that A and B have in common.  How can we find that property
 computationally?  In the data I looked at, in 80% of the instances, the
 property is explicit in the nearby text, and I will talk about how we can
 identify it.  For the remainder I examine the knowledge we would need in
 order to infer the common property.
 

DTEND:20041022T163000
DTSTART:20041022T150000
LOCATION:11 Large
SUMMARY:Like Now:  Two Explorations in Deep Lexical Semantics [Jerry Hobbs]
UID:20041022T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I'll describe our entry into the DUC 2004 automatic document summarization
 competition.  We competed only in the single document, headline generation
 task.  Our system is based on a novel kernel dubbed the tree position
 kernel, combined with two other well-known kernels.  Our system performs
 well on white-box evaluations, but does very poorly in the overall DUC
 evaluation.  C'est la vie.

DTEND:20040423T160000
DTSTART:20040423T150000
LOCATION:10 Large
SUMMARY:A Tree-Position Kernel for Document Compression [Hal Daume III]
UID:20040423T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Natural language interfaces designed for agents that interact with users
 in shared environments (e.g. training simulators, videogames) must
 incorporate knowledge about the users' context in order to address the
 many ambiguities of situated language use. We introduce a model of
 situated language acquisition that operates in two phases.  First,
 intentional context is represented and inferred from user actions using
 probabilistic context free grammars.  Then, utterances are mapped onto
 this representation in a noisy channel framework.  The acquisition model
 is trained on unconstrained speech collected from subjects playing an
 interactive game, and tested using an understanding task.  Discussion of
 results focuses on the implications both for theoretical models of
 cognition and for natural language applications in shared
 environments.
 

DTEND:20050623T240000
DTSTART:20050623T103000
LOCATION:11 Small
SUMMARY:Intentional Context in Situated Language Learning [Michael Fleischman (MIT)]
UID:20050623T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: 1) A serious bottleneck in the development of trainable text summarization
 systems is the shortage of training data. Constructing such data is a very
 tedious task, especially because there are in general many different
 correct ways to summarize a text. Fortunately we can utilize the Internet
 as a source of suitable training data. In this paper, we present a
 summarization system that uses the web as the source of training data. The
 procedure involves structuring the articles downloaded from various
 websites, building adequate corpora of (summary, text) and (extract,
 text) pairs, training on positive and negative data, and automatically
 learning to perform the task of extraction-based summarization systems.
 
 2) Headlines are useful for users who only need information on the main
 topics of a story. We present a headline summarization system that is
 built at ISI for this purpose and is a top performer for DUC2003's task 1,
 generating very short summaries (10 words or less). 
 

DTEND:20030523T160000
DTSTART:20030523T150000
LOCATION:11 Large
SUMMARY:A Web-Trained Extraction Summarization System and Headline Summarization at ISI [Liang Zhou]
UID:20030523T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: 3:30pm  Mark Hopkins (UCLA)
 Tree Sequence Automata: A Unifying Framework for Tree Relation Formalisms
 
 There exist a wide variety of competing formalisms for representing a
 language of ordered tree pairs.  These include (bottom-up and top-down)  
 tree transducers, synchronous tree-substitution grammars (STSGs),
 synchronous tree-adjoining grammars (STAGs), and inversion transduction
 grammars (ITGs).  Since these formalisms have all developed independently
 of one another, it is difficult to compare their respective
 representational power.  This work seeks to make this task simpler by
 viewing these formalisms as instances of a general unifying formalism,
 which we call tree sequence automata (TSA).  By casting these different
 formalisms in a single framework, we can compare them directly by studying
 the specific subclass of TSA that they fall into.
 
 4:00pm  Jason Riesa (Johns Hopkins)
 A case study in building a cost-effective speech-to-speech machine translation system with sparse resources: English - Iraqi Arabic
 
 The Arabic spoken dialect of Iraq is a language deprived of the vast
 resources that researchers enjoy when working with its written
 counterpart, Modern Standard Arabic (MSA). The Iraqi Arabic lexicon and
 grammar are also sufficiently distinct that the use of existing tools
 or corpora for MSA yields little or no positive effect on machine
 translation output quality.  One can see that building a machine
 translation system normally dependent on a large parallel corpus is a
 particularly difficult task when given just a 37,000 line translated
 parallel text based on transcribed speech. This talk will explore the
 constraints involved in working with this type of data, how we endeavored
 to mitigate such problems as a non-standard orthography and a highly
 inflected grammar, and propose a cost-effective way for dealing with such
 projects in the future.
 
 4:30pm  Preslav Nakov (UC Berkeley)
 Multilingual Word Alignment
 
 Recently there has been a growing number of available multilingual
 parallel texts. One such source is the European Union, which publishes its
 official documents in the official languages of all member states
 (sometimes also in the languages of the candidates). Another source is
 the United Nations. These corpora are a great source of training data for
 machine translation between new language pairs. But they also offer the
 opportunity to obtain better pairwise word alignments by looking at
 multiple languages in parallel. In this talk I will present my research as
 a summer intern at ISI on getting better French (Fr) to English (En) word
 alignments using an additional language (Xx). First, I will introduce two
 heuristics which start with pairwise alignments between Fr-Xx, En-Xx and
 Fr-En and then combine them probabilistically (in a linear model) or
 graph-theoretically (by looking at in- and out-degrees for each word).  
 Then I will present two Model1 inspired alignment models: (a) from "Fr and
 Xx" to En; and (b) from Fr to "En and Xx".

DTEND:20050824T170000
DTSTART:20050824T153000
LOCATION:11 Large
SUMMARY:Summer Student Presentations [Hopkins, Riesa, and Nakov]
UID:20050824T153000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I present an algorithm, Searn (for "search-learn") that is designed to
 solve structured prediction problems: problems whose goal is to learn to
 predict complex objects such as parts-of-speech, parse trees,
 translations, etc.  Searn functions by "breaking apart" structured
 prediction problems into classification problems in the process of search.  
 I analyze Searn in the framework of learning reductions and show that good
 performance on the underlying classification problems implies good search
 performance.  Moreover, Searn is computationally efficient in a superset
 of the settings where previous algorithms are efficient and is not limited
 by conditional independence assumptions (as in CRFs).  This excessively
 simple and general algorithm turns out to have excellent state-of-the-art
 performance.
 
 This is joint work with John Langford (TTI-C) and Daniel Marcu; and, to a
 lesser extent, with Drew Bagnell (CMU) and Bianca Zadrozny (IBM TJ
 Watson).

DTEND:20060224T163000
DTSTART:20060224T150000
LOCATION:11 Large
SUMMARY:Search-based Structured Prediction [Hal Daume III]
UID:20060224T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Since its inception more than 30 years ago, electronic mail (email)
 has developed into a powerful communication medium with applications
 that extend well beyond simple asynchronous message exchange between
 individuals. Automated tools to support the use of email in
 individual, organizational and social contexts have received
 increasing attention in recent years. Among the tasks that are now
 supported are filtering (e.g., spam detection), aggregation (e.g.,
 mailing list digests), workflow management (e.g., help desk routing),
 and reuse (e.g., retrospective search). We are interested in how
 today's email will be used in the future -- some will certainly be
 preserved (indeed, some MUST be preserved!), and those records will
 serve as powerful evidence of how we lived our lives and organized our
 societies. The challenges of managing many types of electronic record
 collections are receiving increasing attention, but we are not aware
 of any work yet on supporting access to electronic mail archives.
 That will be the focus of this talk.
 
 We will introduce the Open Archival Information Systems (OAIS) model,
 and then focus on two key processes: ingestion and access. Our focus
 in ingestion is on support for review and redaction, which we believe
 will be key enablers to acquisition and near-term access. For access,
 we will address both browsing based on provenance (original order) and
 user-guided reorganization based on search and visualization. Along
 the way, we will identify potentially productive opportunities to
 apply natural language processing technologies such as topic
 segmentation, link detection, and summarization. We will then
 describe two test collections, and demonstrate a system that we have
 developed to explore user-guided reorganization through visualization
 for one of those collections. We will conclude the talk by sketching
 out a research agenda. At that point, we will expect suggestions and
 comments from the audience. Knowing this audience, it is unlikely
 that we will need to wait that long :-).
 

DTEND:20030124T160000
DTSTART:20030124T150000
LOCATION:11 Large
SUMMARY:Access to Archival Collections of Electronic Mail [Doug Oard & Anton Leuski]
UID:20030124T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Parallel texts -- texts that are translations of each other -- are an
 important resource in many cross-lingual NLP applications, such as lexical
 acquisition, cross-language IR, and annotation projection. However, their
 importance is paramount for Statistical Machine Translation (SMT), as they
 provide the training data from which all the translation knowledge is
 learned. The state of the art in SMT is advanced enough that, given
 sufficient parallel data (i.e. a few million words) for any language pair
 in a given domain, a generic SMT system trained on it will achieve a
 reasonable translation performance in that domain. The main reason why SMT
 systems exist only for a handful of languages is that, for most language
 pairs, parallel training data is simply not available.
 
 One way to alleviate this lack of parallel data is to exploit a much
 richer and more diverse resource: comparable corpora, texts which are not
 strictly parallel but related. The prototypical example of comparable
 texts are two news articles in different languages which report on the
 same event. I will present methods for automatic extraction of parallel
 data from such corpora. I will show how to detect parallel data at various
 levels of granularity: parallel documents, parallel sentences, and even
 parallel sub-sentence fragments. The parallel corpora obtained using these
 methods help improve translation performance for both resource-scarce
 language pairs (such as Romanian-English) and resource-rich ones (such as
 Arabic-English).
 

DTEND:20060324T163000
DTSTART:20060324T150000
LOCATION:11 Large
SUMMARY:Automatic creation of parallel corpora [Dragos Munteanu]
UID:20060324T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In recent years a standard model in statistical machine
 translation has emerged, which is based on the translation
 of sequences of words (so-called "phrases") at a time.
 I will describe this model, how to train and decode with it,
 but the focus of this talk will be how to address the
 challenges to advance and move beyond the model: my thesis
 work on noun phrase translation, making use of syntax, and
 better modeling, such as discriminative training.
 
 Bio: Philipp Koehn is the author of papers on natural language
 processing, machine translation, and machine learning. He
 received his PhD from the University of Southern California
 in 2003 (advisor: Kevin Knight), and is currently employed as
 a postdoc at the Massachusetts Institute of Technology, working
 with Michael Collins. He has worked at AT&T Laboratories on
 text-to-speech systems, and at WhizBang! Labs on text
 categorization.
 

DTEND:20040524T170000
DTSTART:20040524T160000
LOCATION:11 Large
SUMMARY:Challenges in Statistical Machine Translation [Philipp Koehn]
UID:20040524T160000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will present some preliminary results on the problem of domain 
 adaptation in maximum entropy models, specifically in the case when there 
 is a large amount of "out of domain" data, and only a very small amount of 
 "in domain" data.  The model and algorithms I present are based on the 
 technique of conditional Expectation Maximization (CEM) and allow for 
 relatively fast optimization of these models.  Preliminary results on some 
 tasks are quite promising.
 
 

DTEND:20040924T163000
DTSTART:20040924T150000
LOCATION:11 Large
SUMMARY:Domain Adaptation in Maximum Entropy Models [Hal Daume III]
UID:20040924T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Traditional statistical MT systems mostly work at the word
 and phrase level. For different language pairs, the performance of such
 systems varies from some 15% to 35%. These systems suffer from problems
 such as sparse data, with huge vocabulary sizes leading to less
 reliable probability estimates. In our current research, we aim to
 come up with a better MT system by looking inside the words. In almost
 every language, a root (stem) can have many different forms
 (inflectional, derivational, etc.). If we can identify the roots, the
 size of the vocabulary will be quite small, and we can have better
 probability estimates, reducing the sparse data problem and
 potentially leading to higher accuracy. We are trying to come up with
 a model that induces morphology automatically from a bilingual corpus
 and achieves this improvement.
 

DTEND:20030425T160000
DTSTART:20030425T150000
LOCATION:11 Large
SUMMARY:Statistical MT with Bilingual Morphology [Quamrul Tipu]
UID:20030425T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND:20030725T160000
DTSTART:20030725T150000
LOCATION:11 Large
SUMMARY:Super-Carmel for Trees [Jonathan Graehl and Kevin Knight]
UID:20030725T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Tree-based probability models of translation have been proposed to take
 advantage of parse trees on one, both, or neither side of a parallel
 corpus.  I will present comparative results for these three approaches for
 the task of word alignment on Chinese-English and French-English data, as
 well as some analysis of what is going on behind the numbers.
 

DTEND:20040625T160000
DTSTART:20040625T150000
LOCATION:11 Large
SUMMARY:Syntactic Supervision and Tree-Based Alignment [Dan Gildea]
UID:20040625T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The Scamseek project aims to build a surveillance tool for identifying
 financial scams on the Internet by performing document classification of
 Internet pages. There are three principal types of documents of concern:
 those that give financial advice by unregistered advisors, unlawful
 investment schemes, and share ramping.
 
 The first phase of the project has been completed and a working system,
 known as ScamAlert, installed at the Australian Securities and Investments
 Commission (ASIC). The independent audit of the performance of the system
 proved satisfactory with a result for precision of .75, recall .43, and
 F=.54, along with identification of 4 scams misclassified by the client.
 Significant improvement in recall is foreshadowed in the 2nd phase of the
 project.  The results are satisfying in the context of the structure of
 the data where the density of scam documents is about 1.8% of the total
 corpus.
 
 The good performance of the operational system is ascribed to the
 combination of using a strong linguistic model of language (Systemic
 Functional Linguistics) to define the scam documents in parallel with a
 rich statistical analysis of the structure of non-scam documents and scam
 look-alikes. A large amount of the experimental program has concentrated
 on understanding and exploiting the interaction between the linguistically
 described aspects of the documents and the statistical properties. Each
 type of data has been used to inform and modify the usage of the other.
 
 The operational aspects of the project have proven to be as challenging as
 the research objectives. The project has a budget of $2.2M over 15 months.
 It has been managed so as to create a balance in resources between the
 needs of both the research objectives and the engineering objectives.
 Software development has concentrated on three aspects: first, producing
 an environment for the strong directive management of computational
 linguistics experiments; second, creating tools to support the linguists'
 manual analysis; and third, applying best-practice software engineering
 principles to ensure a clean automated rollout of the production system
 for ASIC.
 
 The contributing partners in the Scamseek project are The Capital Markets
 Co-operative Research Centre (CMCRC), ASIC, the University of Sydney and
 Macquarie University.

DTEND:20040325T240000
DTSTART:20040325T103000
LOCATION:11 Large
SUMMARY:ScamSeek: Capturing Financial Scams at the Coalface by Language Technology [Jon Patrick (U. of Sydney)]
UID:20040325T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Speech is a crucial component in human computer interaction. While
 tremendous progress has been made in automatic speech recognition, speech
 transcription -- which is the output of automatic speech recognition -- is
 far from providing all the information that one could retrieve from
 speech. For example, prominence, pause, rhythm, and rate of speech all
 carry important information in speech and are crucial in speech
 perception. Inclusion of such information can facilitate better machine
 recognition and understanding of speech.
 
 In this talk, we will introduce the research efforts and results in speech
 rate, prominence, disfluency and utterance boundary detection. We will
 also show some interesting applications utilizing these features in
 natural language understanding and dialog management.

DTEND:20050325T163000
DTSTART:20050325T150000
LOCATION:11 Large (THIS HAS CHANGED!!!)
SUMMARY:Metalinguistic feature study for spontaneous speech in human computer interaction [Dagen Wang]
UID:20050325T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Parsing and translating natural languages can be viewed as
 structured-prediction problems. We outline the crucial design
 decisions that must be made to build a machine to solve structured
 prediction problems, and explain our particular choices for these two
 large-scale NLP problems.  Our approach uses a purely discriminative
 learning method that scales up well to problems of this size.  Unlike
 currently popular methods, this one does not require a great deal of
 feature engineering a priori, because it performs feature selection
 over a compound feature space as it learns.  Accuracy on constituent
 parsing was at least as good as other comparable methods.  To our
 knowledge, it is the first purely discriminative learning algorithm
 for translation with tree-structured models.  Experiments demonstrate
 the method's versatility, accuracy, and efficiency.
 

DTEND:20060623T163000
DTSTART:20060623T150000
LOCATION:11 Large
SUMMARY:Discriminative Training for Large-Scale NLP [Joseph Turian (NYU)]
UID:20060623T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I will introduce some of the technologies which
 we have developed in the project on an English reading assistant system
 called English Reading Wizard. The technologies include a method for
 mining translations from the web (non-parallel corpora), a method for word
 translation disambiguation based on bootstrapping, which is called
 Bilingual Bootstrapping, and a general method of bootstrapping, which is
 called Collaborative Bootstrapping. First, I will introduce the main
 features of English Reading Wizard. Next, I will introduce each of the
 methods. The translation mining method is based on a naïve Bayesian
 ensemble and the EM algorithm. Bilingual Bootstrapping uses the
 asymmetric translation relationship between words in the two languages
 in translation and can construct reliable classifiers for word
 translation disambiguation. Collaborative Bootstrapping contains the
 co-training algorithm as its special case, and it uses the strategy of
 uncertainty reduction in training of the two classifiers.
 
 Bio:
 
 Hang Li is a researcher at the Natural Language Computing Group
 of Microsoft Research in Beijing, China. He is also an adjunct professor at
 Xian Jiaotong University. Hang Li obtained a B.S. in Electrical
 Engineering from Kyoto University (Japan) in 1988 and a M.S. in Computer
 Science from Kyoto University in 1990. He earned his Ph.D. in Computer
 Science from the University of Tokyo in 1998. From 1990 to 2001, Hang
 Li worked at the Research Laboratories of NEC Corporation in Kawasaki,
 Japan. He joined Microsoft Research in 2001.  His research interests
 include statistical learning, natural language processing, data mining,
 and information retrieval. Hang Li's web site:
   http://research.microsoft.com/users/hangli/
 

DTEND:20031125T240000
DTSTART:20031125T223000
LOCATION:11th Floor Large
SUMMARY:Using Bilingual Data to Mine and Rank Translations [Hang Li (MSR Beijing)]
UID:20031125T223000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: 3:00pm  Victoria Fossum (Michigan)
 Exploring the Continuum between Phrase-based and Syntax-based Machine Translation
 
 State-of-the-art statistical machine translation systems use lexical
 phrases as the basic unit of translation.  Phrase-based systems can
 capture those aspects of translation that are sensitive to local context.  
 Syntax-based systems, on the other hand, make use of linguistically
 motivated syntactic structure, can capture long-distance dependencies and
 reorderings, and offer greater generalization in translation rules.  
 However, their performance lags that of phrase-based systems.
 
 Hierarchical phrase-based translation, introduced by [Chiang 05], provides
 an elegant framework for exploring the continuum between phrase-based and
 syntax-based translation.  This system combines the "formal machinery" of
 syntax-based systems without any "linguistic commitment" to a particular
 syntactic structure [Chiang 05].
 
 I will present results from my re-implementation of Chiang's hierarchical
 phrase-based system, and (if time permits) compare those results with the
 following systems on Chinese-English translation: ISI's phrase-based
 system, and ISI's syntax-based system.  Between now and December 2005, I
 plan to incrementally explore the space between phrase-based and
 syntax-based systems by augmenting these hierarchical phrase-based rules
 with richer syntactic annotation.
 
 
 3:30pm  Liang Huang (Penn) and Hao Zhang (Rochester)
 Efficient Integration of n-gram Language Models with Syntax-based Decoding
 
 We first give an overview of the ISI syntax-based MT system which is based
 on tree-to-string (xRs) translation rules. The biggest problem at this
 stage is the inefficiency of the integration of n-gram models.  Without
 n-gram models, the xRs translation rules can be easily binarized with
 respect to the foreign language to ensure cubic-time decoding. With n-gram
 models, however, binarization without considering both languages will lead
 to exponential complexity.
 
 Inspired by Inversion Transduction Grammar (ITG) (Wu, 97), we will focus
 on the so-called ITG binarizable rules which account for over 99% of the
 whole rule set. A simple linear-time algorithm will be presented to do the
 binarization. Decoding with ITG-like rules is of low polynomial complexity
 in both time and space. We will discuss experimental results on both
 efficiency and accuracy of decoding with the new binarization.  If time
 permits, we will also present the "hook trick" (inspired by (Eisner and
 Satta, 99)) to even further reduce the polynomial complexity of the
 decoding process.

DTEND:20050826T163000
DTSTART:20050826T150000
LOCATION:11 Large
SUMMARY:Summer Student Presentations [Fossum, Huang and Zhang]
UID:20050826T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Many research efforts are addressing the problem of enabling automatic
 summarization of opinions and assessments stated on the web in product
 reviews, discussion forums, and blogs. One key difficulty is that relevant
 assessments scattered throughout web pages are obscured by variations in
 natural language. In this paper, we focus on a novel aspect of enabling
 aggregations of assessments of the degree to which a given property holds for
 a given entity (for instance, how touristy is Boston). We present
 GrainPile, a user interface for extracting from the web, aggregating and
 quantifying degree assessments of unconstrained topics. The interface
 provides a variety of functions: a) identification of dimensions of
 comparison (properties) relevant to a particular entity or set of
 entities, b) comparisons of like entities on user-specified properties
 (for example, which university is more prestigious, Yale or Cornell), c)
 tracing the derived opinions back to their sources (so that the reasons
 for the opinions can be found). A central contribution in GrainPile is the
 evaluated demonstration of feasibility of mapping the recognized
 expressions (such as fairly, very, extremely, and so on) to a common scale
 of numerical values and aggregating across all the extracted assessments
 to derive an overall assessment of degree. GrainPile's novel
 assessment and aggregation of degree expressions is shown to strongly
 outperform an interpretation-free, co-occurrence based method.
 
 Full paper:
 
 http://www.isi.edu/~timc/papers/IUI06-grainpile-chkl.pdf
 
 

DTEND:20060126T140000
DTSTART:20060126T130000
LOCATION:4th floor
SUMMARY:GrainPile: Deriving Quantitative Overviews of Free Text Assessments on the Web [Tim Chklovski]
UID:20060126T130000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This talk will survey results of several recent projects we have been
 undertaking in automated text categorization based upon the style,
 rather than the topic, of the documents.  I will describe a general
 text-categorization framework using machine learning along with general
 principles for choosing stylistically relevant sets of features for
 learning effective classification models.  Applications of these methods
 include determining author gender and text genre in published books and
 articles, authorship attribution of email messages, and analysis of
 language use in different scientific fields.  In many cases, the models
 that are learned also give some insight into the respective styles being
 distinguished, which I will also discuss.
 
 Shlomo Argamon is an associate professor at the Illinois Institute of
 Technology, Chicago.
 

DTEND:20040326T150000
DTSTART:20040326T133000
LOCATION:11 Large
SUMMARY:On Writing, Our Selves: Explorations in Stylistic Text Categorization [Shlomo Argamon]
UID:20040326T133000@NL
URL:http://www.isi.edu/natural-language/nl-seminar/
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: These are two practice talks for our upcoming thesis defenses.  The titles 
 and abstracts are:
 
 --------------------------------------------------------------------------
 
 NATURAL LANGUAGE GENERATION FOR TEXT-TO-TEXT APPLICATIONS USING AN INFORMATION-SLIM REPRESENTATION
 
 Radu Soricut
 
 In this talk, I describe a new natural language generation paradigm, based
 on direct transformation of textual information into well-formed textual
 output.  I support this language generation paradigm with theoretical
 contributions in the field of formal languages, new algorithms, empirical
 results, and software implementations. At the core of this work is a novel
 representation formalism for probability distributions over finite
 languages. Due to its convenient representation and computational
 properties, this formalism supports a wide range of language generation
 needs, from sentence realization to text planning.
 
 Based on this formalism, I describe, implement, and analyze theoretically
 a family of algorithms that perform language generation using direct
 transformations of text. These algorithms use stochastic models of
 language to drive the generation process. I perform extensive empirical
 evaluations using my implementatio