Who did what, and to whom? When and where did they do it?
Answers to these simple questions can help you understand an event, which is why this is the type of information event extraction is on the hunt for.
Event extraction is a natural language processing (NLP) technique that automatically pulls names, dates, events, relationships, and more from large volumes of text.
The process takes natural language – typically things written by humans (think: newspaper articles) – and turns it into structured data that can be searched or put in a database. There, event information can be processed, organized, easily retrieved, and analyzed.
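To make "structured data" concrete, here is a minimal sketch of what a single extracted event record might look like. The field names (`event_type`, `agent`, `patient`, `date`, `location`) and the example sentence are illustrative assumptions, not the schema used by the ISI team, and the record is filled in by hand rather than produced by an extraction model.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Event:
    """One hypothetical event record: who did what, to whom, when and where."""
    event_type: str            # what happened
    agent: Optional[str]       # who did it
    patient: Optional[str]     # to whom or what it was directed
    date: Optional[str]        # when (None if the text does not say)
    location: Optional[str]    # where
    source_text: str           # the sentence the record was drawn from


# Hand-written example; a real system would fill these fields automatically.
record = Event(
    event_type="protest",
    agent="demonstrators",
    patient="inflation",
    date=None,
    location="Vietnam",
    source_text="Demonstrators in Vietnam protested against inflation.",
)

# Serialize to JSON so the record can be stored in a database or search index.
record_json = json.dumps(asdict(record), ensure_ascii=False)
print(record_json)
```

Once text is reduced to records like this, questions such as "which protests happened in Vietnam?" become simple field lookups rather than free-text searches.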
Valuable insights can be gained from vast amounts of text. Event extraction is used to predict political instability; track the path of disease outbreaks; identify social unrest; follow natural disasters; and much more.
But it comes with its own set of challenges. A team of AI researchers at the Information Sciences Institute (ISI), a research institute of USC Viterbi School of Engineering, has developed a method to address one of those challenges: how to extract event information from foreign-language text.
NLP from Afrikaans to Yiddish
Anti-inflation protests in Vietnam; a cholera outbreak in Malawi; statements by German chancellor Angela Merkel about Ukraine. These are examples of events across the globe that would each be written about in a number of different languages.
If you were interested in one of these events, the ability to search a variety of news outlets, documents, and text from countries – and languages – around the world would provide greater context. And in many cases, it’s about more than just context. Global events can have impacts that cross boundaries and languages, which means the language of interest for an event may shift from one day to the next (e.g., when a disease emerges in an unexpected location).
Elizabeth Boschee, Director of ISI’s Boston Office and Associate Director of ISI’s AI Division, along with her team, has developed a system that extracts events from foreign-language texts using only English training data.
Why is that important? “When an English-speaking user is interested in adapting the system to a new type of event, all they need to do is provide some examples of that new event in their own language—and the system does the rest,” said Boschee.
She explained that in previous work, extracting events in a foreign language would require one of two approaches. “You could manually create examples in each language of interest, but this is a time-consuming process that has to be re-done for each language.”
A second approach is to “first automatically translate all of the documents of interest into English and then run an English event extraction system, but this both reduces accuracy and is usually prohibitively time-consuming and expensive when a user wants to search over millions rather than thousands of documents.”
The ISI team’s system takes advantage of massively multilingual underlying language models. Boschee explained, “This allows you to train models using English data and deploy them on any language that’s represented in the language model.”
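The idea of training on English and deploying on other languages can be illustrated with a toy sketch. It assumes that a multilingual model places words with the same meaning close together in one shared vector space, regardless of language. The `SHARED_SPACE` table below is a hand-written stand-in for such a model, and the nearest-centroid classifier is a deliberately simple substitute chosen for illustration. Neither is the ISI team's actual model or method.

```python
import math
from collections import defaultdict

# Hand-written stand-in for a multilingual encoder (an assumption for this
# sketch): translations of the same concept get nearby vectors.
SHARED_SPACE = {
    "protest": (0.9, 0.1),
    "demonstration": (0.8, 0.2),
    "outbreak": (0.1, 0.9),
    "epidemic": (0.2, 0.8),
    "protesta": (0.85, 0.15),   # Spanish for "protest"
    "Ausbruch": (0.15, 0.85),   # German for "outbreak"
}


def encode(word):
    """Look up a word's vector in the shared space."""
    return SHARED_SPACE[word]


def train_centroids(examples):
    """Average the vectors of the training words for each label."""
    grouped = defaultdict(list)
    for word, label in examples:
        grouped[label].append(encode(word))
    return {
        label: tuple(sum(axis) / len(vectors) for axis in zip(*vectors))
        for label, vectors in grouped.items()
    }


def classify(word, centroids):
    """Assign the label whose centroid is closest to the word's vector."""
    vector = encode(word)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Training data is English only.
english_training = [
    ("protest", "protest_event"),
    ("demonstration", "protest_event"),
    ("outbreak", "disease_event"),
    ("epidemic", "disease_event"),
]
centroids = train_centroids(english_training)

# The trained classifier is applied to words it never saw, in other languages.
predictions = {word: classify(word, centroids) for word in ("protesta", "Ausbruch")}
print(predictions)
```

Because the classifier only ever sees vectors, not the words themselves, anything learned from English examples carries over to any language the shared space covers.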
The result? The system makes global events available on-demand in 100 languages ranging from Afrikaans to Yiddish, and returns results in English.
Boschee said, “The idea is that you take that technology and make it accessible. To some extent, a lot of these research technologies are pretty far from being usable; for example, they take hours to run. But we’ve transformed ours into something that could be run on the fly.”
ISI-CLEAR is the resulting technology. It’s an event extraction system and accompanying user interface for event visualization and search.
How is it different from a simple Google search?
Boschee gave an example: “When you have a lot of data in foreign languages, it’s hard to search ‘protests against Putin in Belarus this month’ and get accurate results. With keyword searches like Google, you can look for the word ‘protest,’ and you can look for the word ‘Belarus,’ but how do you get protests against Putin in Belarus? How do you distinguish that from general protests in Belarus? So, with a keyword-based search, you’ll get a lot of non-relevant results.”
By doing event extraction directly on the foreign-language data sources, the team can extract events such as “protests against Putin in Belarus this month” from globally sourced text, then translate them into English for the English-speaking user. This significantly broadens the reach of the technology and the data sources available and provides far more accurate search results.
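The difference between keyword search and event-based search can be sketched in a few lines. The three documents, their pre-filled `event` fields, and the function names below are made up for illustration. In a real system the event fields would come from an extraction model, not be written by hand.

```python
# Invented example documents, each paired with a hand-written event record.
documents = [
    {
        "text": "Thousands joined a protest in Belarus over fuel prices.",
        "event": {"type": "protest", "target": "fuel prices", "location": "Belarus"},
    },
    {
        "text": "A protest against Putin was held in Belarus on Sunday.",
        "event": {"type": "protest", "target": "Putin", "location": "Belarus"},
    },
    {
        "text": "Putin commented on a protest in Belarus during a press conference.",
        "event": {"type": "statement", "target": None, "location": None},
    },
]


def keyword_search(docs, keywords):
    """Return documents whose text contains every keyword, ignoring case."""
    return [
        doc for doc in docs
        if all(keyword.lower() in doc["text"].lower() for keyword in keywords)
    ]


def event_search(docs, **criteria):
    """Return documents whose event record matches every given field."""
    return [
        doc for doc in docs
        if all(doc["event"].get(field) == value for field, value in criteria.items())
    ]


keyword_hits = keyword_search(documents, ["protest", "Putin", "Belarus"])
event_hits = event_search(documents, type="protest", target="Putin", location="Belarus")

print(len(keyword_hits), "keyword hits;", len(event_hits), "event hit")
```

The keyword search returns the third document as well, because all three words appear in it even though Putin is not the target of the protest. The event search returns only the document in which the protest is actually directed at Putin.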
The team will present their paper, “Massively Multi-Lingual Event Understanding: Extraction, Visualization, and Search,” as a demonstration at the 61st Annual Meeting of the Association for Computational Linguistics (ACL’23), taking place in Toronto, Canada, from July 9 to July 14, 2023.
Published on July 11th, 2023
Last updated on July 11th, 2023