When “Probably” Isn’t Enough: USC Researchers Find AI and Humans Speak Different Languages of Uncertainty
A doctor finishes a long day and turns to a familiar shortcut. Instead of drafting an email from scratch, she feeds a few notes into an AI assistant. One line stands out: “Serious side effects are unlikely with this vaccine.” She means it to reassure. In her mind, “unlikely” translates to something like a 1 or 2 percent chance.
The AI rewrites the message into a polished email for a patient: “While serious side effects are infrequent, patients should be aware of a meaningful risk.”
The doctor glances at it, makes no changes and hits send.
A patient opens the email later that evening. The tone feels different. Not alarming exactly but not comforting either. He hesitates. Maybe he will wait. Maybe skip the shot for now.
No facts have changed. Only the language has.
That quiet shift is at the center of new research from USC Viterbi’s Mayank Kejriwal, who has spent years studying how artificial intelligence communicates with humans. “If AI systems and humans interpret words like ‘likely,’ ‘probably’ and ‘unlikely’ differently, that can lead to miscommunication in high-stakes situations,” said Kejriwal, research assistant professor in the Daniel J. Epstein Department of Industrial & Systems Engineering and research lead at the USC Information Sciences Institute at the School of Advanced Computing.
A common language that isn’t shared
The paper, “An Evaluation of Estimative Uncertainty in Large Language Models,” published in “npj Complexity,” was co-authored by Kejriwal, ISI-ISE Ph.D. student Zhisheng Tang, whom Kejriwal advises, and Ke Shen, Kejriwal’s former Ph.D. student who now works at Amazon.
Words like “likely” and “unlikely” seem simple, but they carry enormous weight. They shape decisions in medicine, finance, government and everyday life. The study reveals that humans and AI systems frequently attach different probabilities to these terms.
In conversation, people rely on what linguists call words of estimative probability. These are the phrases that fill in the gray areas when we do not have exact numbers. Someone says a storm is likely, a treatment is unlikely to fail, a deal is probably going through. Most listeners translate those words into rough mental percentages without even thinking about it.
AI models do something similar, but not in the same way. They are trained on vast amounts of text and learn patterns in how words appear together. They can produce language that sounds natural, even nuanced. But beneath that fluency is a statistical system that does not share human intuition.
“The model is not thinking in words the way we are,” said Kejriwal, a recent recipient of a 2026 ARIS Award. “It’s ultimately producing numbers, and those numbers don’t always map cleanly to the words we use.”
To measure the gap, the research team set up a large experiment. They started with 17 common probability phrases, including “almost certain,” “probable,” “we doubt” and “little chance.” These terms have been studied for decades, including by intelligence agencies trying to standardize how analysts express uncertainty.
They compared human interpretations with those generated by several AI systems, including GPT-3.5, GPT-4, LLaMA models and ERNIE-4.0. Each system was asked to assign numerical probabilities to those words across dozens of scenarios, from short sentences to longer passages and even translations into Chinese.
Where the meanings drift
The results revealed a pattern that is easy to miss but hard to ignore.
For most of the phrases, the AI systems did not match human expectations. In 13 out of 17 cases, the differences were statistically significant. The mismatch was most pronounced for middle-ground terms. Words like “likely” and “probable” were often interpreted by models as stronger signals than humans intended. Where a person might hear “likely” as roughly two-thirds probable, GPT-4 frequently pushed closer to four-fifths.
That tilt suggests the systems tend to express greater confidence when using imprecise language. At the same time, they flatten distinctions. Humans often hear meaningful differences between “likely,” “very likely” and “probably.” The models frequently compress those into overlapping ranges, reducing nuance even as they sound precise.
At the extremes, the gap narrowed. Phrases like “almost certain” or “almost no chance” aligned closely between humans and machines.
The study also found that subtle changes in wording can shift how confidently AI speaks. When researchers replaced neutral phrasing with gendered pronouns, such as changing “they” to “he” or “she,” the models’ answers became noticeably less varied. For several probability terms, especially those expressing moderate confidence, the models stopped adjusting their estimates across different scenarios and instead returned nearly identical values each time.
The researchers believe the cause lies in training data: sentences with gendered subjects tend to be more formulaic, pushing models toward fixed outputs regardless of which pronoun was used. The mere presence of a gendered pronoun, not which one it was, made AI responses more rigid and less context-sensitive. The same sentence can read as more confident simply because a pronoun changed.
Why it matters now
The implications stretch far beyond a single email.
In legal settings, a client’s decision to settle or fight a case may hinge on whether AI-drafted language describes an outcome as “unlikely” or merely “improbable.” In government, the interpretation of uncertainty can shape policy decisions. Imagine an AI-assisted briefing that describes a geopolitical event as “likely” to escalate. If policymakers interpret that as a 60 percent chance while the system is operating closer to 80 percent, the resulting decisions about sanctions, troop movements or diplomacy could shift in ways that neither side fully understands.
In finance, similar gaps can carry real costs. An AI-generated market summary that calls a downturn “unlikely” might still reflect a materially higher probability than investors assume. That difference can influence trading behavior, risk tolerance and ultimately market outcomes.
Even in everyday life, these differences can quietly influence choices.
One of the more surprising findings is that more advanced AI does not necessarily handle this better. In several cases, GPT-3.5 aligned more closely with human interpretations than GPT-4.
The researchers describe the gap as a breakdown in communication rather than a simple technical flaw, one that becomes harder to ignore as AI is increasingly used to translate complex information into plain language.
Kejriwal believes the solution is not to abandon AI, but to use it more carefully. Systems could ask for clarification when vague terms are used or provide numerical ranges alongside words like “likely” or “unlikely.”
“The right thing for the system to do,” he said, “is to ask what you mean, or at least signal how it is interpreting the word, so you can catch the difference before it matters.”
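To make the second suggestion concrete, here is a minimal sketch in Python of how a drafting tool might print its working interpretation next to each estimative word. It is an illustration under stated assumptions, not the researchers' method: the function name `annotate` and the idea of the writer supplying a phrase-to-range table are hypothetical, and the only numbers used come from the doctor example above, in which “unlikely” meant a 1 or 2 percent chance.

```python
import re


def annotate(text, ranges):
    """Append a percent range after each estimative phrase found in text.

    `ranges` maps a phrase to a (low, high) pair of percentages. The table
    is supplied by the caller; nothing here comes from the study itself.
    """
    if not ranges:
        return text

    # Try longer phrases first so "very likely" wins over "likely".
    phrases = sorted(ranges, key=len, reverse=True)
    # Word boundaries stop "likely" from matching inside "unlikely".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    # Lowercased lookup so "Unlikely" and "unlikely" share one entry.
    lookup = {phrase.lower(): bounds for phrase, bounds in ranges.items()}

    def add_range(match):
        low, high = lookup[match.group(1).lower()]
        return f"{match.group(1)} (roughly {low}-{high}%)"

    return pattern.sub(add_range, text)


# The doctor's note, annotated with what she said she meant.
note = "Serious side effects are unlikely with this vaccine."
print(annotate(note, {"unlikely": (1, 2)}))
# Serious side effects are unlikely (roughly 1-2%) with this vaccine.
```

With the range printed beside the word, a rewrite that drifts toward “a meaningful risk” becomes easy to spot before the email is sent.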
Published on April 22nd, 2026
Last updated on April 22nd, 2026