Uncovering Hidden Authors with AI

by Stephanie Lee


A cutting-edge AI research program to unmask anonymous authors is underway at USC Viterbi’s Information Sciences Institute (ISI). Supported by a recent grant from the Intelligence Advanced Research Projects Activity (IARPA), it could have sweeping implications for online communication, ranging from misinformation to privacy rights.

The project, dubbed SADIRI (Stylometric Authorship Discernment & Interpretation for Realistic Inputs), aims to advance the field of authorship attribution—the process of identifying the individual behind anonymous written works—using language AI. 

“There has always been a reason to know who wrote things that are anonymous,” Elizabeth Boschee, associate director of ISI’s AI Division and principal investigator (PI) of the SADIRI project, said. “But with the volume of information in today’s world, it’s really a new frontier.”

Recent high-profile cases have already demonstrated the potential of machine learning techniques to unmask the people behind harmful anonymous online content. For instance, in 2022, AI analyses of writing styles helped computer scientists identify the authors behind Q, the anonymous figure at the center of the far-right QAnon conspiracy movement, ending years of speculation.

The SADIRI program aims to further enhance these kinds of authorship attribution techniques, expanding the capability to include multiple languages and differing lengths of text.

The technology works by using text data to generate an “authorial fingerprint” that represents an author’s unique writing style rather than the content of the writing. Many language forensics tools focus on content, including common words, phrases or topics, which makes them ineffective if an author shifts subjects. SADIRI aims to overcome this by developing an AI system that analyzes subtle patterns of stylistic elements, such as the use of commas, bullet points or passive voice, that can only be picked up by a machine looking at hundreds of thousands of examples. This enables the technology to recognize an author’s stylistic fingerprint across any type of content or subject.

In its first phase, the program has already achieved promising results, accurately identifying the authors of 350-word English-language text documents, roughly the length of a short news article. But the team has bigger goals for the future, including expanding capabilities to multiple languages, such as Spanish, Russian and Chinese. They also strive to maintain accuracy with much shorter text samples—potentially even tweets.

If successful, the technology could open up powerful new applications ranging from cyber forensics to intelligence gathering, while also addressing ethical concerns around civil liberties and privacy rights.

“One of the most concerning issues of the modern world is whether true anonymity can exist anymore,” Boschee said. While anonymity can facilitate harmful behavior, it also plays a vital role in safeguarding individuals who speak out against oppressive governments, criminal groups and so on.

To address this, SADIRI has a complementary focus on developing authorship anonymization methods—sophisticated algorithms to help writers mask their identities from machine detection. The technology will play a similar role to a voice distorter, but for text, according to Boschee.

Though authorship anonymization may sound counter to the goals of attribution, the two capabilities must advance together to maintain a delicate balance between two essential objectives in the digital world: accountability and privacy protection.

“As one system gets better, then the other system has to get better in order to protect against it,” Boschee said. “Hopefully it becomes a virtuous cycle, where all the technology is getting better all the time.”

The SADIRI program is a collaboration among ISI, which is part of the USC Viterbi School of Engineering, and the University of Michigan, the University of Maryland and the University of Birmingham. It is slated to complete its first research phase in June 2024.

Published on April 15th, 2024

Last updated on May 16th, 2024
