Great research happens on all sides of town. On Wednesday, April 20, 2022, the University of Southern California’s vice president of research, Dr. Ishwar K. Puri, took a break from the main campus and visited the Information Sciences Institute (ISI), on the Westside of Los Angeles, in the heart of Silicon Beach. The internationally known scientist and engineer, who specializes in fire safety, nanotechnology and 3D cell printing and has a background in aerospace and mechanical engineering, met five Ph.D. students and their advisors, all working on artificial intelligence, networking, and cybersecurity topics:
Modeling “Newsworthiness” for Lead-Generation Across Corpora, by Alexander Spangher
If this research had a less academic headline, it would be “Finding Untold Stories: How Models for Newsworthiness Can Surface Stories that Journalists Tell Every Day”. Spangher shows that “newsworthiness” can be modeled and that the layout of the print newspaper contains a good proxy signal to predict it. He also shows that models of newsworthiness based on news article text and newspaper page layout transfer well to other domains: legal text (i.e., court cases and laws) and verbal summaries (i.e., city council meeting minutes). Finally, he proposes that these insights be used to help journalists find stories. “Today, journalists spend hours reading through court cases, attending city council meetings, and reading new laws,” explained Spangher. “Many of these don’t end up being interesting, but when a journalist finds the few that are interesting and writes about them, these stories become extremely valuable for local and national readers, helping them understand and participate in democracy. However, today, newsrooms are also facing catastrophic revenue declines, and fewer journalists are able to take the time to do this. We attempt to address this problem by using machine learning to surface potentially interesting court cases, city council meeting minutes and state-level laws. We do this by training models on the text of news articles. We ask the question: would this text appear on the front page of a major newspaper (i.e., it is very newsworthy) or not (i.e., it is less newsworthy)?” He hopes that this work will lead to the creation of recommendation tools for journalists that will surface leads of possible interest.
Alexander Spangher, Emilio Ferrara, Nanyun Peng, Jonathan May; Information Sciences Institute, University of Southern California
Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation, by Mozhdeh Gheini
Gheini is working on improving an existing practice in training machine learning models. “The currently dominating paradigm in training machine learning models is the pretraining/fine-tuning paradigm (also known as transfer learning),” she explained. As a concrete example for translation, training a model that translates from Uzbek to English is not that simple, and the main challenge is often the lack of data. There is not enough existing parallel data in the form of (Uzbek sentence, equivalent English sentence) pairs. A model trained from scratch on so little data will perform poorly. “However, luckily, there are cases where we have more data, for instance, French to English. It’s been shown that it’s indeed beneficial to first train a system on French–English data (pretraining), and then take that model and update its parameters using your limited Uzbek–English data (fine-tuning) instead of solely relying on Uzbek data to train a system from scratch,” said Gheini. She adds: “Now what we address is that such models often have millions to billions of parameters. So if you change everything, then you need to have and store a separate version of each and every single parameter for both French–English and Uzbek–English. Let’s say tomorrow, you want to also add an Urdu–English model. The database will just keep growing. In this research we ask: what if, instead of modifying all parameters, we could get a new model by only updating certain modules? Then instead of storing a whole new set of parameters for Uzbek–English, we can just store those modules, which will be a lot less than all the parameters, for each new language pair. We showed that a certain module in the state-of-the-art translation models called cross-attention is enough to change (fine-tune) for competitive performance. This way we significantly reduce the number of new parameters that need to be stored.”
In short: asked whether it is really necessary to change all the parameters to create the new model, this work answers “no, change what you minimally need, and no more.”
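To make the storage argument concrete, here is a rough back-of-the-envelope sketch. It is not the authors' code: the module names, layer count and sizes are hypothetical, and it only counts how many parameters would need to be stored per new language pair if just the decoder's cross-attention modules were updated. A real setup may need to update other pieces as well, which this sketch ignores.

```python
# Hypothetical parameter inventory for a small encoder-decoder translation model.
# All names and sizes are illustrative assumptions, not figures from the research.
def build_param_inventory(num_layers=6, d_model=512, d_ff=2048, vocab=32000):
    params = {"embeddings.weight": vocab * d_model}
    attn = 4 * d_model * d_model   # query, key, value and output projections
    ffn = 2 * d_model * d_ff       # two feed-forward projections
    for i in range(num_layers):
        params[f"encoder.{i}.self_attn"] = attn
        params[f"encoder.{i}.ffn"] = ffn
        params[f"decoder.{i}.self_attn"] = attn
        params[f"decoder.{i}.cross_attn"] = attn
        params[f"decoder.{i}.ffn"] = ffn
    return params


def trainable_names(params, keyword="cross_attn"):
    """Names of the modules that would be updated; everything else stays frozen."""
    return [name for name in params if keyword in name]


def storage_per_new_pair(params, keyword="cross_attn"):
    """Number of parameters to store for each additional language pair."""
    return sum(params[name] for name in trainable_names(params, keyword))


params = build_param_inventory()
total = sum(params.values())
per_pair = storage_per_new_pair(params)
print(f"full model: {total:,} parameters")
print(f"stored per new language pair: {per_pair:,} ({per_pair / total:.1%})")
```

In this toy inventory, each new language pair adds only the cross-attention entries instead of a full copy of the model, which is the kind of saving Gheini describes.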
Mozhdeh Gheini, Xiang Ren, Jonathan May; Information Sciences Institute, University of Southern California
Auditing for Discrimination in Algorithms Delivering Job Ads, by Basileal Imana
Imana studied how Facebook decides to show job advertisements to men and women, and found that Facebook prevents women from seeing some of them. This occurs even when an employer sets up their job ad parameters to reach qualified men and women equally on the social media platform. LinkedIn, on the other hand, was a lot less biased, with no evidence of discrimination. This work demonstrated that some platforms need to re-examine how their ad delivery algorithms shape access to opportunities such as employment, and highlighted the need for more transparency. The main goal of this work is to bridge the gap between prior auditing methods and what the law says about discrimination, by adding a new control for job qualification as a potential confounding factor in ad delivery. These results give other researchers and policymakers a new tool to audit for discrimination and assess the legal liability of ad platforms.
Basileal Imana, Aleksandra Korolova and John Heidemann; University of Southern California, USC/Information Sciences Institute
Harm-DoS: Hash Algorithm Replacement for Mitigating Denial-of-Service Vulnerabilities in Binary Executables, by Nicolaas Weideman
Weideman is working on automated vulnerability detection in programs. In this context, a vulnerability is a flaw in a program that allows an attacker to exploit it for malicious purposes. Such flaws are often found in hash tables, a data structure used in all areas of computing for its efficient lookup performance. Unfortunately, hash tables are often implemented with hash algorithms that make them vulnerable to denial-of-service attacks. This work offers a solution: Harm-DoS, an automated approach capable of detecting and mitigating such vulnerabilities in the binary code of real-world programs. These flaws can be found everywhere: some programming languages, such as PHP, had built-in hash tables with this vulnerability before it was discovered and fixed. Because the hash table is built into the programming language, any program written in that language was affected. The USC system has already discovered such a vulnerability in a component of Reddit. The team reported the vulnerability to the social platform’s security team and proposed this solution, which Reddit adopted.
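For readers curious why a weak hash algorithm opens the door to denial-of-service, the toy sketch below may help. It is not Harm-DoS, which works on binary executables; it is a simplified Python illustration with a hypothetical table class, a deliberately weak hash (a sum of character codes) and, as an assumed replacement, a keyed BLAKE2b hash from the standard library. It counts how many key comparisons the table performs when an attacker submits keys chosen to collide.

```python
import hashlib
import itertools
import os


class ChainedTable:
    """Toy hash table with separate chaining that counts key comparisons."""

    def __init__(self, hash_fn, num_buckets=64):
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(num_buckets)]
        self.comparisons = 0

    def insert(self, key):
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        for existing in bucket:
            self.comparisons += 1
            if existing == key:
                return
        bucket.append(key)


def weak_hash(key):
    # Order-insensitive sum of character codes: anagrams always collide.
    return sum(ord(c) for c in key)


SECRET = os.urandom(16)  # per-process secret, unknown to the attacker


def keyed_hash(key):
    # Keyed BLAKE2b: without the secret, collisions are hard to precompute.
    digest = hashlib.blake2b(key.encode(), key=SECRET, digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Attacker-chosen keys: all 720 orderings of the same six letters.
malicious_keys = ["".join(p) for p in itertools.permutations("abcdef")]


def comparisons_for(hash_fn):
    table = ChainedTable(hash_fn)
    for key in malicious_keys:
        table.insert(key)
    return table.comparisons


weak_cost = comparisons_for(weak_hash)
keyed_cost = comparisons_for(keyed_hash)
print("weak hash comparisons: ", weak_cost)
print("keyed hash comparisons:", keyed_cost)
```

With the weak hash, every malicious key lands in the same bucket, so the work grows quadratically with the number of keys; with the keyed hash the same keys spread across buckets. Swapping the hash algorithm while leaving the rest of the table untouched is the general idea behind the "hash algorithm replacement" in the project's title.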
Nicolaas Weideman, Haoda Wang, Tyler Kann, Spencer Zahabizadeh, Wei-Cheng Wu, Rajat Tandon, Jelena Mirkovic, Christophe Hauser; University of Southern California – Information Sciences Institute
A Graph-based Approach for Inferring Semantic Descriptions of Wikipedia Tables, by Binh Vu
The goal of this research is to develop methods to automatically integrate structured data sources with different schemas into knowledge graphs (KGs), such as Wikidata, a knowledge graph database hosted by the Wikimedia Foundation. The challenge is that data pulled from different sources may be in different formats (e.g., spreadsheets, JSON, CSV). In addition, data publishers may use different, vague, or abbreviated terms to describe their data attributes (e.g., name/writer/creator). Wikipedia has millions of high-quality tables covering a wide range of domains. They may be obvious for humans to interpret, but they are difficult for machines because of all those different formats and inconsistencies in attributes. With this work, Vu wants to understand how the data is stored and automatically transform it into a common format that is easily usable: once the concepts and relationships have been identified, the data becomes accessible in a machine-readable format. This can also potentially provide useful data to data-driven applications or intelligent systems (e.g., question-answering).
Binh Vu, Craig A. Knoblock, Pedro Szekely, Minh Pham, and Jay Pujara; Information Sciences Institute, University of Southern California
“How are you going to change the world?”
During the presentations, Puri asked many questions, challenged the students on how they were planning to change the world, and was particularly intrigued by ISI’s work on translation and the use of French for machine learning, a language he studied a long time ago “with mixed results,” he confessed jokingly.
“I enjoyed learning about ISI and getting to know the terrific research being done by students. ISI is a world class scientific research enterprise that should be both protected and enhanced by USC,” said Puri after the presentations. He also pointed to the “impressive world class leadership in computing and AI research” in a tweet after the visit.
Dr. Terry Benzel, director of Networking and Cybersecurity research at ISI, emphasized the importance of the broad range of research at ISI. “We were very pleased to show the students’ work in multiple disciplines. The work on cybersecurity and privacy demonstrated how our work is addressing these important areas. Not only is the research and the graduate study tackling these hard problems but the work is already undergoing technology transfer, as is the case with the Harm-DoS project that has already been adopted by Reddit.”
Dr. Yolanda Gil, director for major strategic AI and data science initiatives at ISI, recalled that “Dr. Puri’s interest in the students’ research was very palpable. During the student presentations, he asked a lot of questions that connected their research to other work at different schools at USC and to challenging problems that we hear in the news. The students were very engaged, and came away with a deeper understanding of why their research really matters.”
Later that day, Puri met ISI’s senior leadership to discuss the institute’s AI strategy. “We are thrilled that the vice president of research for USC spent five hours with us learning about the work of some of our Ph.D. students, meeting with the division directors, touring our facilities, and providing some great ideas on continuing to grow the institute,” said Dr. Craig Knoblock, Michael Keston executive director of ISI.
Published on April 29th, 2022
Last updated on May 5th, 2022