This work concerns the creation and use of large (50,000 items or more) concept taxonomies and ontologies for natural language processing and other applications, built by combining online resources such as dictionaries and thesauri, statistical methods over text, and traditional human knowledge acquisition interfaces. In particular, it involves creating and organizing a 70,000-item concept taxonomy for use in the Pangloss Machine Translation system, the Penman sentence generation system, and eventually other systems as appropriate.
This research addresses the need for acquiring large semantic and lexical knowledge resources, both for Penman-specific work and to support the sharing of knowledge across Pangloss modules at other sites. Since Pangloss uses an Interlingua, which by definition is language-neutral, an obvious candidate for shared knowledge is the definitional framework of the Interlingua symbols. This is the point of least representational difference (lexical, syntactic, etc.) between parsers, analyzers, and generator.
The Pangloss Ontology is a taxonomy of approximately 50,000 symbols that represent the semantic meanings conveyed in translations. The Ontology is being constructed at USC/ISI by extracting knowledge from a variety of sources. It is represented in Loom, FrameKit, and Prolog, and is distributed with appropriate access routines to the other Pangloss sites.
The topmost levels of the Ontology, which we call the Ontology Base (OB), consist of approx. 400 terms. The OB contains nodes that represent generalized distinctions required for the processing of the parsers, analyzers, and generator. While the idiosyncratic processing requirements of each lexeme are stored either in a lexicon (for morphological and syntactic information) or in the Ontology body (for semantic information), general semantic and syntactic patterns are captured as nodes in the OB. The OB is a merge of the Penman Upper Model (based on Systemic-Functional Linguistics), the top-level ONTOS ontology (a semantic network; see [Nirenburg and Defrise 92]), and, for nouns, the LDOCE semantic categories. It maintains the distinctions present in the Upper Model so that all subordinated Ontology terms can be properly generated in English; it maintains the LDOCE categories so that ULTRA can make the necessary distinctions when parsing nouns; and it maintains the ONTOS distinctions so that semantic analysis can proceed properly. The function of the Ontology Base and its relation with the Interlingua are described in [Hovy and Nirenburg 92].
The primary source for the Ontology body is the semantic database WordNet [Miller 85]. To construct the main body of the Ontology, work was performed to automatically connect WordNet concepts and English lexical items by discovering pairs of corresponding senses [Knight 93].
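To make the idea of discovering pairs of corresponding senses concrete, the following is a minimal sketch that pairs dictionary senses with concept glosses by word overlap. It is not the algorithm of [Knight 93]; the overlap heuristic, the stopword list, the threshold, and all names and sample glosses (`match_senses`, `BANK-1`, `bank_n_1`, and so on) are assumptions made purely for illustration.

```python
# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "that", "is", "where", "along", "beside"}


def tokenize(text):
    """Lowercase, strip simple punctuation, and drop stopwords and empty tokens."""
    words = {w.strip(".,;()").lower() for w in text.split()}
    return {w for w in words if w and w not in STOPWORDS}


def overlap_score(gloss_a, gloss_b):
    """Jaccard overlap between the content words of two definitions."""
    tokens_a, tokens_b = tokenize(gloss_a), tokenize(gloss_b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_senses(concept_glosses, dictionary_senses, threshold=0.2):
    """Pair each dictionary sense with its best-scoring concept, if the score is high enough."""
    matches = {}
    for sense_id, definition in dictionary_senses.items():
        best = max(
            concept_glosses,
            key=lambda c: overlap_score(concept_glosses[c], definition),
        )
        if overlap_score(concept_glosses[best], definition) >= threshold:
            matches[sense_id] = best
    return matches


# Invented sample data, not taken from WordNet or any real dictionary.
concept_glosses = {
    "BANK-1": "a financial institution that accepts deposits of money",
    "BANK-2": "sloping land beside a river",
}
dictionary_senses = {
    "bank_n_1": "land along the side of a river",
    "bank_n_2": "an institution where money is kept",
    "bank_n_3": "a row of switches",
}

matches = match_senses(concept_glosses, dictionary_senses)
```

Senses with no sufficiently similar concept (here `bank_n_3`) are left unmatched, which is the kind of residue a human knowledge engineer would then review in a semi-automatic setting.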
In addition to housing the symbols to represent semantic meaning, the Ontology contains pointers from each symbol to appropriate lexical items in various languages. The Penman English lexicon currently contains about 50,000 spelling forms (corresponding to approx. 90,000 words); the Japangloss Japanese lexicon contains over 120,000 words.
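The arrangement described above, symbols organized in a taxonomy with pointers to lexical items in several languages, can be sketched as a small data structure. This is an illustrative sketch only: the actual Ontology is represented in Loom, FrameKit, and Prolog, and the class names, method names, symbols, and language codes below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology symbol with taxonomic parents and per-language lexical pointers."""
    symbol: str
    parents: list = field(default_factory=list)
    lexical_items: dict = field(default_factory=dict)  # language code -> list of words


class Ontology:
    def __init__(self):
        self.concepts = {}

    def add(self, symbol, parents=(), lexical_items=None):
        """Register a symbol; parents are assumed to have been added already."""
        self.concepts[symbol] = Concept(symbol, list(parents), dict(lexical_items or {}))

    def ancestors(self, symbol):
        """Return all symbols reachable by following parent links upward."""
        seen = []
        stack = list(self.concepts[symbol].parents)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self.concepts[current].parents)
        return seen

    def words_for(self, symbol, language):
        """Return the lexical items a symbol points to in the given language."""
        return self.concepts[symbol].lexical_items.get(language, [])


# Invented example entries.
onto = Ontology()
onto.add("OBJECT")
onto.add("ANIMAL", parents=["OBJECT"])
onto.add("DOG", parents=["ANIMAL"],
         lexical_items={"en": ["dog"], "ja": ["犬"], "es": ["perro"]})
```

Keeping the lexical pointers on the symbol, rather than in each language's lexicon, lets one language-neutral symbol serve as the shared index from which every language's words are reached.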
Knowledge-based machine translation (KBMT) systems have achieved excellent results in constrained domains, but have not yet scaled up to newspaper text. The reason is that knowledge resources (lexicons, grammar rules, world models) must be painstakingly handcrafted from scratch. One of the hypotheses being tested in the PANGLOSS machine translation project is whether or not these resources can be semi-automatically acquired on a very large scale. This paper focuses on the construction of a large ontology (or knowledge base, or world model) for supporting KBMT. It contains representations for some 70,000 commonly encountered objects, processes, qualities, and relations. The ontology was constructed by merging various online dictionaries, semantic networks, and bilingual resources, through semi-automatic methods. Some of these methods (e.g., conceptual matching of semantic taxonomies) are broadly applicable to problems of importing/exporting knowledge from one KB to another. Other methods (e.g., bilingual matching) allow a knowledge engineer to build up an index to a KB in a second language, such as Spanish or Japanese.
As the need grows to share knowledge across projects and to reuse existing knowledge resources built for other purposes, the demand for polytheoretic knowledge resources, which support processing by modules based on different theories and possibly written in different programming languages, is becoming more acute. This paper describes one such resource, the semantic concept ontology used in the Pangloss Machine Translation system.
We address the problem of constructing, in a principled way, an ontology of terms to be used in an interlingua for machine translation. Given our belief that a true language-neutral ontology of terms can only be approached asymptotically, the construction method outlined involves a stepwise folding in of one language at a time. This is effected in three steps: first building for each language a taxonomy of the linguistic generalizations required to analyze and generate that language, then organizing the domain entities in terms of that taxonomy, and finally merging the result with the existing interlingua ontology in a well-defined way. This methodology is based not on intuitive grounds about what is and is not 'true' about the world, which is a question of language-independence, but instead on practical concerns, namely what information the analysis and generation programs require in order to perform their tasks, a question of language-neutrality. After each merging is complete, the resulting taxonomy contains, declaratively and explicitly represented, those distinctions required to control the analysis and generation of the linguistic phenomena. The paper is based on current work of the Pangloss MT project.
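The final merging step can be pictured with a small sketch. The abstract does not spell out the merge procedure, so the version below simply assumes that a taxonomy is a mapping from each term to its set of parent terms, that merging takes the union of all parent links, and that a merge producing a cycle is rejected. The function names and the sample terms (`COUNTABLE-OBJECT`, `FLAT-OBJECT`) are invented for illustration.

```python
def check_acyclic(taxonomy):
    """Raise ValueError if following parent links can lead back to the start."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return
        if node in visiting:
            raise ValueError(f"cycle through {node!r}")
        visiting.add(node)
        for parent in taxonomy[node]:
            visit(parent)
        visiting.discard(node)
        done.add(node)

    for node in taxonomy:
        visit(node)


def merge_taxonomies(interlingua, language_taxonomy):
    """Fold one language's taxonomy into the interlingua ontology by unioning parent links."""
    merged = {}
    for taxonomy in (interlingua, language_taxonomy):
        for node, parents in taxonomy.items():
            merged.setdefault(node, set()).update(parents)
            for parent in parents:
                merged.setdefault(parent, set())  # make sure every parent is a node
    check_acyclic(merged)
    return merged


# Invented sample taxonomies: term -> set of parent terms.
interlingua = {"THING": set(), "PROCESS": {"THING"}, "OBJECT": {"THING"}}
language_specific = {
    "OBJECT": {"THING"},
    "COUNTABLE-OBJECT": {"OBJECT"},
    "FLAT-OBJECT": {"COUNTABLE-OBJECT"},
}

merged = merge_taxonomies(interlingua, language_specific)
```

Because the merge only adds nodes and links, every distinction already present in the interlingua ontology survives each fold-in, and the result explicitly contains the distinctions the newly added language requires.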