(MT in a growing field of language technologies)
The Sheraton Bucks County Hotel
Langhorne, Pennsylvania
October 28-31, 1998
TUTORIALS
This tutorial will present various approaches to the problem of Machine Translation taking us from early methodological approaches, through the "classical" architectures of 1970s and 1980s MT systems to the latest ideas, ending with a consideration of some outstanding topics for MT researchers. On the way we will also consider how various external factors (use and users) affect MT system design.
The tutorial will be divided into six topics, as follows:
1. Historical perspective.
2. 2nd generation: Transfer vs interlingua, rule-based systems.
3. Making life easier I: sublanguage and controlled language systems.
4. Making life easier II: Tools for translators.
5. New paradigms: EBMT and statistical MT.
6. Hard problems don't go away.
1. Historical perspective.
Just to get into the mood, a whistle-stop tour of the main historical
landmarks in the last 50 years. We will also introduce some
terminology, and remind ourselves of some of the problems that
computers face when translating natural languages.
2. 2nd generation: Transfer vs interlingua, rule-based systems.
As MT
system design became more sophisticated, it reflected advances in both
computer science and formal linguistics. This is reflected in the much
vaunted "2nd generation" system design, with modular programming
styles, linguistically sophisticated stratificational data structures
and translation schemas, and linguistic rule-writing formalisms. A
central feature was the "Transfer vs Interlingua" debate, and ideas
about incorporating real-world knowledge into MT systems.
3. Making life easier I: sublanguage and controlled language systems.
As the difficulty of achieving fully automatic high-quality translation
of unrestricted texts has come to be realised, it has been accepted
that we must choose between low-quality (but still usable) output,
human-assisted MT systems, or restricted input. In this segment, we
will concentrate on the latter option, clarifying the difference
between the two variants, and briefly discussing some of the
sociological issues involved.
4. Making life easier II: Tools for translators.
Continuing the theme of the previous segment, we will look at some
of the methods of Human-Aided MT, especially the more recent and
corpus-based tools such as Translation Memory, and other tools which
take advantage of recent research in corpus linguistics.
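As a rough illustration of the Translation Memory idea mentioned above, the sketch below looks up the stored segment most similar to a new source segment and reuses its translation. The memory contents, the `best_match` name, and the 0.7 threshold are all hypothetical, and the similarity measure from Python's standard `difflib` module is only a stand-in; real tools may match segments quite differently.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: (source segment, stored translation) pairs.
MEMORY = [
    ("Press the red button.", "Appuyez sur le bouton rouge."),
    ("Close the cover.", "Fermez le couvercle."),
]


def best_match(segment, memory=MEMORY, threshold=0.7):
    """Return (score, source, translation) for the closest stored segment.

    Returns None if the memory is empty or no segment reaches the threshold.
    """
    if not memory:
        return None
    scored = [
        (SequenceMatcher(None, segment.lower(), src.lower()).ratio(), src, tgt)
        for src, tgt in memory
    ]
    score, src, tgt = max(scored)  # highest similarity score wins
    return (score, src, tgt) if score >= threshold else None
```

A near match such as `best_match("Press the green button.")` would return the stored translation for a translator to edit, while an unrelated segment returns `None`.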
5. New paradigms: EBMT and statistical MT.
In the early 1990s, an apparently new paradigm of MT system design
emerged, generating heated debate. Now that the dust has settled,
we can review more calmly the so-called "empirical vs rationalist"
debate, and note that the former methods come in many flavours, often
integrated in hybrid approaches.
6. Hard problems don't go away.
We end our survey of methodological approaches with a look at the more
difficult problems still facing MT researchers. Among these are old
favourites like "structure-preserving translation as first choice", and
the incorporation of common-sense knowledge. Other problems have
emerged, especially with the widespread use of MT by novice users on
the Internet for all sorts of purposes: proper-name recognition,
robust handling of error-ridden and idiosyncratic text (especially in
e-mails), aspects of dialogue translation (again in e-mails, as well as
speech translation) and so on. Thinking briefly about the particular
problems of speech translation, we will consider "metadialogue"
translation errors, language-switching, and interpretation of
metalinguistic features (e.g. pragmatic effects of deliberate use of
pauses and apparent repairs in spoken language).
After lying dormant for over two decades, automated text summarization has experienced a tremendous resurgence of interest in the past few years. This tutorial reviews the state of the art in automatic summarization, with particular emphasis on multilinguality, to the extent this has been addressed to date. The tutorial begins by outlining the major types of summary, then describes the typical decomposition of summarization into three stages, and explains in detail the major approaches to each stage. Next, discussion turns to the difficult issue of evaluation--measuring how good a summary is. Finally, we will outline the major open problems and research challenges that remain to be solved.
Eduard Hovy directs the Natural Language Group at the Information Sciences Institute of the University of Southern California. He currently leads a project that is building the summarization system SUMMARIST. Other related work he has been or is currently involved with includes concept clustering, large ontology construction, machine translation, and text generation.
Daniel Marcu's recently completed PhD at University of Toronto builds theoretical and algorithmic foundations of rhetorical parsing and discourse-based summarization. He is one of the members of the organizing committee of the AAAI-98 Spring Symposium on Intelligent Text Summarization.
Computer Assisted Language Learning (CALL) Programs come in all shapes and sizes these days. They are on the Web and they're on your software store shelf. In this tutorial, we will work through a variety of CALL programs and see what they've got to offer to you as the MT developer, whether you are in the beginning or advanced phases of system development.
The tutorial will be divided into 4 topics:
1. Identifying and Locating CALL Programs: Where Do you Find Them?
2. What is Out There?: Developing a Taxonomy of CALL Programs
3. Reviewing the Reviews on CALL: Evaluating the Evaluations and Making the Most out of Them
4. The Value of CALL To MT Development: What Can the MT Developer Get out of CALL that Cannot be Obtained Elsewhere?
1. Identifying and Locating CALL Programs: Where Do you Find Them?
For starters, you have to know which sources will alert you to the very existence of CALL programs and tell you where they can be obtained. Next, you have to develop a strategy for keeping your information fresh so that you have access to the latest software releases.
In this tutorial, you will be provided with references to the print and electronic sources of information on CALL and learn about which professional organizations, societies and universities have made CALL their "calling." You will learn how to develop your resources on CALL and implement a constantly evolving strategy which will enable you to keep informed of the latest software releases and developments in CALL.
2. What is Out There?: Developing a Taxonomy of CALL Programs
In this portion of the tutorial, a taxonomy of the CALL programs that are available as off-the-shelf and as Internet-based products will be presented. You will also learn how to build and maintain your own taxonomy for the languages you are interested in.
3. Reviewing the Reviews on CALL: Evaluating the Evaluations and Making the Most out of Them
Review articles can be an excellent source of information on CALL programs. In this tutorial, first off, we will learn where to find CALL review articles.
We will examine exemplary review articles and learn about what kinds of things typically appear in review articles. Since not all CALL review articles are created equal--some are purely marketing pieces and are not critical reviews--we will learn how to separate the two and derive value from each of them.
We will learn how to make the information in review articles work for us in a couple of key ways.
Review articles are useful in alerting us to the very existence of software programs. In addition, review articles can be useful in helping us decide whether to obtain the software. In these ways, reviews can be a powerful "pre-purchase" tool.
Reviews can also be used "post-purchase." That is, once we've got the software on our machines, reviews can be used in a variety of creative ways. They can help us zero in on particular aspects of a CALL program. They can help us find things in the software that might not be readily or easily apparent through the program itself--we all know that it takes time to discover all the features of a program and interfaces and manuals are not always particularly good about pointing these things out. Reviews can also be used to help shed light on particular aspects or elements of the software which might be difficult to see otherwise.
This portion of the tutorial will be highly interactive. You will learn the strategies for sifting the gems out of review articles and using them effectively as "pre" and "post purchase" tools.
4. The Value of CALL To MT Development: What Can You as an MT developer Get out of CALL that You Cannot Obtain Elsewhere?
One of the major motivations for CALL is to fill the niche need to teach material that is not available in any other source, such as textbooks, dictionaries, grammars or other linguistic resources. Many CALL developers are instructors who create CALL programs out of a desire to have a vehicle for teaching aspects of the second language in unique ways. For example, a teacher of Japanese as a second language may have a unique way of explaining grammatical structures--perhaps those which are most difficult for students to grasp--and the teacher finds that CALL is the best way to deliver the instruction. The teacher may find CALL to be the best way to deliver the instruction for a variety of reasons. For instance, it might not make sense to publish the material in print if it is just one or two lessons that the teacher would like to develop.
CALL also can meet the need for delivering instruction in a particular manner which is not possible in any other way. For instance, the CALL developer creates a program making use of online dictionaries to enable the student to look up lexical items at a much faster rate than they can with print dictionaries. Thus, the mode of instruction is key.
The MT developer looking at what CALL programs teach and how they teach can discover a treasure chest of linguistic insights. The MT developer in the beginning phases as well as the MT developer in advanced phases of development can find all kinds of goodies in CALL programs that s/he will not be able to find anywhere else. A CALL author may have a particularly wonderful explanation of Spanish clitics, for example, which may inspire you to rework that algorithm you've been working on that hasn't ever worked quite right. Another program may have electronic dictionaries or data sources that both the beginning and the advanced MT developer would like to get their hands on.
In this tutorial, we will explore the range of goodies CALL offers and the various techniques for sifting them out of programs.
The idea of representing a source text in a language-neutral format and then generating the target text off the latter is simple and well known. The devil, as always, is in the details.
Ontological semantics is a computational-linguistic theory devoted to the issues of deriving, representing and manipulating meanings of concrete natural language texts. While the theory can serve as the basis for many information technology applications, in the area of machine translation it is buttressed by a detailed and tested development methodology developed in the framework of the Mikrokosmos R&D project.
This tutorial will address the following topics:
1. Introduction to the concerns, assumptions, content and justification of ontological semantics. Comparison of ontological semantics with other semantic theories.
2. The body of the theory:
- text meaning representation as the dynamic knowledge repository;
- ontology as the repository of static language-neutral meanings;
- the lexicon as the repository of static language-dependent meaning;
- the semantic analyzer as the main dynamic knowledge source for deriving text meaning representations;
- the text planner as the main dynamic knowledge source for manipulating text meaning representations.
3. Methodological issues:
- "effectiveness" of the theory: a theoretical determination of a constructive methodology;
- issues in massive knowledge acquisition;
- microtheories as vehicles for description of language phenomena;
- issues related to computational architectures for application systems.
4. Status of theory and methodology development:
- what parts of the theory have already been practically tested in applications;
- what parts of the theory have already been developed;
- what parts of the theory are under development;
- what parts of the theory await development (look here if you need a dissertation topic).
5. Applications of ontological semantics outside machine translation.
Contrary to fears and beliefs of five years ago, the WWW will not be an English-only resource. Information is readily accessible in growing numbers of languages. Cross Language Information Retrieval (CLIR) supports the view that foreign language documents are sources of information and not just noise to be eliminated.
This tutorial will present CLIR techniques and recent experiments attacking the problems raised by trying to access documents written in one language with a query expressed in another.
The tutorial will be divided into four parts, as follows:
1. The Science of Information Retrieval.
2. Why Cross Language Information Retrieval is not Information Retrieval and is not Machine Translation.
3. Linguistic Techniques for Cross Language Information Retrieval.
4. Experiments and Results.
1. The Science of Information Retrieval.
Information
Retrieval has been around for forty years. It has developed a
methodology of research that it uses for judging one retrieval system
as better than another. We will see where Information Retrieval is
coming from, looking at Information Retrieval testbeds and techniques
and surveying experiments from Cleverdon in the 1960's to the Text
Retrieval Conferences of the 1990's.
2. Why Cross Language Information Retrieval is not Information
Retrieval and is not Machine Translation.
Though sharing considerable
overlap with both Information Retrieval and Machine Translation,
Cross Language Information Retrieval poses, we will see, three
specific problems: finding translation equivalence candidates for
open-domain queries, filtering out some translation candidates, and
weighting the remaining candidates.
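The three problems named above (finding candidates, filtering them, weighting the rest) can be pictured with a minimal dictionary-based sketch. Everything here is an assumption for illustration only: the toy English-French dictionary, the target-collection frequencies, the `translate_query` name, and the equal-split weighting scheme are hypothetical and are not taken from the tutorial.

```python
# Hypothetical toy bilingual dictionary (English -> French).
BILINGUAL_DICT = {
    "bank": ["banque", "rive", "berge"],
    "interest": ["intérêt"],
    "rate": ["taux", "tarif"],
}

# Hypothetical document frequencies in the target-language collection.
TARGET_DOC_FREQ = {"banque": 120, "rive": 15, "intérêt": 90, "taux": 80, "tarif": 30}


def translate_query(query, dictionary=BILINGUAL_DICT, doc_freq=TARGET_DOC_FREQ, min_freq=1):
    """Return {target_term: weight} for a source-language query."""
    weighted = {}
    for term in query.lower().split():
        # Step 1: find translation candidates; unknown terms pass through unchanged.
        candidates = dictionary.get(term, [term])
        # Step 2: filter out candidates that never occur in the target collection,
        # falling back to all candidates if none survive.
        kept = [c for c in candidates if doc_freq.get(c, 0) >= min_freq] or candidates
        # Step 3: weight the remaining candidates so that each source term
        # contributes a total weight of 1 to the translated query.
        for c in kept:
            weighted[c] = weighted.get(c, 0.0) + 1.0 / len(kept)
    return weighted
```

For the query "bank rate", this sketch drops "berge" (unseen in the toy collection) and splits each source term's weight evenly across its surviving translations.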
3. Linguistic Techniques for Cross Language Information
Retrieval.
In this part of the tutorial, we will cover
automatic language identification and term normalization, using
heuristics as well as currently available natural language processing
techniques.
4. Experiments and Results.
This final section will cover a number of recent experiments attacking the
problems raised in section 2. We will see what has been learned so far
and what remains to be done.
With the introduction of commercial text dictation systems (such as Dragon NaturallySpeaking) and text translation systems (e.g. IBM Personal Translator), two important technologies in natural language processing have become available to the general public. However, speech translation cannot be reduced to just speech recognition and text translation. Utterances that occur in unrehearsed, spoken dialogs are very different from written text. For such spontaneous input, speech recognition systems have a higher error rate and conventional text translation systems often fail due to ungrammaticalities, missing punctuation, and recognition errors. To make full use of all information present in the speech signal, a more integrated approach to speech translation is required. The main section of the tutorial will cover common approaches and problems in speech recognition and speech translation. Some algorithms used in speech translation systems will be explained in more detail to provide a better understanding of the problems and possibilities. A description of past, present, and future speech translation systems with video and/or live demonstrations will round off the tutorial.
Evaluation has always been a fundamental part of the MT discourse. Yet many participants in the field will claim that there is no generally agreed upon method for evaluation. In part, this sense springs from the realization that there are different purposes for MT, different interests of the participants in the process, radically different theoretical approaches, and, of course, different languages. Also, MT evaluation has some unique difficulties over evaluations of other language systems; in particular, there is never a single "right" translation, and therefore never a solid ground truth against which MT output may be compared. This tutorial faces all of these issues, discussing the different evaluation needs of different MT stakeholders, the problems of using subjective judgments for evaluation, and a variety of classic and new approaches to evaluation.
Maintained by Katya Shuldiner