The Natural Language Group

Projects

Current Projects

CLEAR: Cross-Lingual Event and Argument Retrieval

The automatic extraction of events from text has empowered tasks as varied as the prediction of political stability forecasting or the automatic creation of in-depth biomedical information resources. However, most training data for event extraction models is available only in English. Under IARPA’s BETTER program, ISI researchers are developing an innovative end-to-end, cross-lingual system which will provide personalized, multi-lingual semantic extraction and retrieval from text in foreign languages, using only English training data. Our collaborator on this effort is UMass Amherst. This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 2019-19051600007. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

CORAL: Combined Representations for Adept Learning

Modern applications of machine learning (ML) constitute transformative technologies, whose adoption is spreading rapidly and impacting myriad sectors of the economy. However, ML’s revolutionary potential is limited by its cost—both the cost of creating corpora of labeled training data and the difficulty of adapting successful models to perform well in related domains. The goals of CORAL are to create revolutionary machine learning (ML) technology that will achieve state-of-the-art (SOTA) performance on a broad range of tasks, but require a factor of 106 less training data than the SOTA to automatically create a new capability for a new domain and task, and require 106 less training data to automatically adapt to a related domain. We additionally will develop and test information-theoretic techniques that characterize the limits of ML and adaptation. Our collaborators are the California Institute of Technology, Carnegie Mellon University, Columbia University, the NVIDIA Corporation, and the University of Illinois, Urbana-Champaign. This research is based upon work supported by DARPA's LwLL program, Contract XXXX.  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

CWIC: Communicating Intelligently with Computers

In this project, we work on human/robot communication, and on creative language generation.This work is supported by Contract W911NF-15-1-0543 with the US Defense Advanced Research Projects Agency (DARPA).

 

Discerning Group Biases in Online Communities via Linguistic Analysis

Cultural models are shared understandings of the world by particular cultures, which help individuals interpret the world around them. These models help provide people with internal mental heuristics for what is right, what is just, what is edible, among others. Understanding these models helps us to better understand the nature of the cultures which adopt them, as they provide rules for everyday life. The underlying mechanisms for these models are currently collected through ethnographic analysis undertaken by social scientists using specialized tools like interviews with key informants, pile sorts, and other techniques to better extract the underlying workings of a culture. While social scientists can use specialized tools like interviews with key informants, pile sorts, this “thick data” about the cultures being studied. It is cumbersome and time-consuming, and often requires months to years of fieldwork from social scientists to understand these models and can not easily scale across different types of stereotypes. Fortunately,  these shared understandings are redundancies that can be exploited by machine learning techniques to quickly understand the underlying principles of a culture.  In the past decade, enormous quantities of textual data,  including product reviews,  discussion forums,  and digital contents,  have been generated by online communities.  This provides an opportunity and challenge for computer scientists and social scientists to study the underlying group biases present in the online content at scale. The primary scientific goal of the project is to develop a computational framework for rapid data-driven ethnographic analysis of online communities using statistical machine learning and computational linguistic tools.  This work is carried out with funding from the DARPA UGB program, HR0011890019.

ELICIT:  A System for Extracting and Organizing Causal Information

Optimal decision making requires ready access to disparate sources of structured (e.g., databases) and unstructured (e.g., natural language) information.  ELICIT researchers are developing a framework that integrates concepts of causality, factual knowledge, and meta-reasoning into a model-driven knowledge graph representation that allows decision makers to access relevant knowledge.  Teammates include Rensselaer Polytechnic Institute, CMU, and Lockheed Martin ATL. Sponsored by the Defense Advanced Research Projects Agency (DARPA) Causal Exploration program (FA8650-17-C-7715)

ELISA: Exploiting Language Information for Situational Awareness

Today's automatic parsers, translators, extractors, and dictionaries cover a tiny fraction of the world's languages. Can we use general knowledge of how language works to extend the reach of natural language tools?  In this project, we develop technology for rapidly constructing information extraction (IE), machine translation (MT), and topic and sentiment processing capabilities for new languages.  Our collaborators are ICSI, Brno University of Technology, University of Pennsylvania, University of Notre Dame, Rensselaer Polytechnic Institute, and Next Century, Inc.  This work is carried out with funding from DARPA (HR0011-15-C-0115).

EvidxExtraction (Evidence Extraction Systems for the Molecular Interaction Literature)

Biomedical databases describe the claims made by scientists in detail, but rarely provide descriptions of any supporting evidence that a consulting scientist could use to understand why the claims are being made. Currently, the process of curating evidence into databases is manual, time-consuming and expensive; thus, the evidence is recorded in papers but not generally captured in database systems. Although experimental evidence is complex, it conforms to certain principles of experimental design. Exploiting these principles has permitted us to devise a preliminary, robust, general-purpose representation for experimental evidence. A major goal of our project is that we will develop methods to extract this evidence from scientific papers automatically (1) by using natural language processing to read information from the text used by scientists to describe their results and (2) by using image processing on a specific subtype of figure that is common in molecular biology papers. We will develop these tools and package them so that they may then also be used for evidence pertaining to other areas of research in biomedicine. Funding statement: This work is carried out with funding from an NIH R01 (LM012592).
 

LESTAT: Learning Event Schema Temporally and Transmodally

Understanding a complex event requires organizing its typical sequence of actions and the pattern of duration and participation in these actions. By organizing events in this manner, downstream analytics (or analysts) are able to identify the early warning signs of a process and recognize important but missing information. While handcrafted models provide this capability for well-understood processes, a system that learns schemas from data offers the opportunity to extend event-focused analytics to new domains. We are developing the LESTAT system, which semi-automatically discovers schemas for complex events (CE). These automatically discovered schemas will be general, composable, and specializable to support domain-specific contexts. LESTAT’s schemas will be the basis for systems developed elsewhere to track and anticipate the activities of entities of interest based on detecting individually weak event signals in multi-modal, multilingual input. Our collaborators are Arizona State University and the University of Central Florida. This research is based upon work supported by DARPA's KAIROS program, Contract FA8750-19-2-0500.  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

MICS: Machine Intelligence from Common Sense

Our goal is to automatically learn a common sense repository (CSR) of knowledge that can be applied to diverse problems. In this project we learn common sense from textual and visual sources, as well as from rich pre-existing common sense knowledge bases (e.g. ConceptNet, YAGO) and observations of relations between objects and concepts as they appear in visual and linguistic sources (i.e. images, videos, documents). The resulting CSR will be a dense, low-dimensional space that efficiently and accurately encapsulates hundreds of millions of observations. To enable more robust reasoning and question answering over this space than the state of the art, we will develop novel inference techniques for both transitive and multi-hop reasoning over the space (e.g. bats have wings, wings enable flying, thus, bats can fly). We apply our system to shared tasks and via an interactive demonstration system which supports direct questioning (e.g. Can the Grand Canyon fly to New York?). Our collaborators are Columbia University, UCLA, and UMass-Amherst. This research is based upon work supported by DARPA's MCS program, Contract N660011924032.  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

SARAL: Summarization and Domain-Adaptive Retrieval Across Languages

How can a monolingual English speaker access and understand text and speech material in low-resource foreign languages?  In this project, we develop cross-lingual retrieval and summarization techniques that will work for any language in the world, given minimal resources to work with.  Our collaborators are University of Notre Dame, Rensselaer Polytechnic Institute, Massachusetts Institute of Technology, Northeastern University, University of Massachusetts, and Idiap. This work is supported by the IARPA MATERIAL program.  (Acknowledgement: This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via  AFRL Contract FA8650-17-C-9116.  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.)

Previous Projects

AMR: Abstract Meaning Representation

The AMR Bank is a set of English sentences paired with simple, readable semantic representations.  We hope that it will spur new research in natural language understanding, generation, and translation.  Please visit the AMR page for details. Thanks to NSF (IIS-0908532) for funding the initial design of AMR, and to DARPA MRP (FA-8750-09-C-0179) for supporting a group to construct consensus annotations and the AMR Editor. The initial AMR bank was built under DARPA DEFT FA-8750-13-2-0045 (PI: Stephanie Strassel; co-PIs: Kevin Knight, Daniel Marcu, and Martha Palmer) and DARPA BOLT HR0011-12-C-0014 (PI: Kevin Knight).

DECODE: Deciphering Historical Manuscripts

In collaboration with colleagues at Uppsala University (Sweden), we are collecting enciphered manuscripts from the European modern era (1600-1800) and developing software to automatically decipher them. This work is supported by the Swedish Research Council and a gift from Google, Inc.

 

DIG: Domain-Specific Insight Graphs

DIG is a domain-specific indexing, search and analysis system. The DIG system harnesses state-of-the-art open source software combined with an open architecture and flexible set of APIs to facilitate the integration of a variety of extraction and analysis tools.  Please visit the DIG page for more details.  This research is supported in part by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under contract number FA8750-14-C-0240, and in part by the National Science Foundation under Grant No. 1117913.

 

L2K2R2: Learning to Know to Read

Scientists are overwhelmed with scientific literature. If we can build machines to read scientific papers and understand them, we can help science move faster.  This work is sponsored by DARPA Big Mechanism (W911NF-14-1-0364).

Groups: