The MT research group is developing GAZELLE (formerly Japangloss), a system that translates Japanese, Arabic, and Spanish texts into English. The system operates over unrestricted newspaper text. The goal is to improve on existing systems' accuracy and to make it easier and faster to develop MT systems for new language pairs. The R&D work includes parsing and semantic analysis of the various languages, English sentence planning and generation, Interlingua/ontology construction, and lexicon creation for Japanese, Arabic, Spanish, and English. This project started in 1994 and is funded by the Department of Defense.
Designing and developing a new system called SUMMARIST to perform automated text summarization robustly, using a combination of statistical and linguistics-based techniques, and avoiding brittle techniques such as parsing and semantic analysis as far as possible. This project was started in late 1996 and is funded for the initial three years by the Department of Defense under the TIPSTER program.
The C*ST*RD system is being built to address the increasingly common problem of information overload. This system is designed to incorporate web access, an information retrieval engine, the MT system GAZELLE, and the text summarization system SUMMARIST, in order to perform multilingual information retrieval and document clustering and sorting, and to provide user-friendly interfaces. This project started in 1997 and is funded by DARPA.
Working on the construction and assembly of large concept taxonomies for use in Natural Language Processing and other applications, to be used as Interlingua lexicons for Machine Translation and as concept fusion resources for text summarization. Constructing the large ontology SENSUS (in collaboration with other projects at USC/ISI), and helping to build a large ontology to serve as a "standard" by merging the upper regions of the Pangloss, CYC, EDR, and other concept ontologies (in collaboration with researchers at IBM Santa Teresa, CYC, Stanford University, EDR Tokyo, and various others). This project is currently unfunded.
Extending and testing theories of discourse structure, and developing text planning, sentence planning, and sentence generation algorithms to structure coherent multisentential texts for various application domains. Earlier, this work included the development of Penman, one of the largest English-language sentence generators built. Current work includes the development of:
- NITROGEN, a robust statistics-based sentence generator that works even when given underspecified input,
Construction of Large Lexicons for Various Languages
Recent work included the development and use of finite-state technology for a semi-automated process of creating large lexicons out of machine-readable dictionaries on CDs and other online resources. The system, called ELECTROLEX, has been used to construct a lexicon of 142,000 Japanese items by merging the contents of 11 CD-based dictionaries and online lexicons, and to extract over 40,000 Arabic words from a CD dictionary.
Research done in the early 1990s focused on developing theoretical foundations and algorithms for some central problems in multimedia human-computer interaction, including the management of multimodal interactions under discourse principles and the automatic coordination and allocation of information across media. This work was done in collaboration with Dr. Yigal Arens of USC/ISI.