
New study finds AI-generated empathy has its limits

Conversational agents (CAs) like Alexa and Siri are designed to answer questions, offer suggestions, and even show empathy. However, new research finds that, compared with humans, they do a poor job of interpreting and exploring a user's experience.

CAs run on large language models (LLMs) that ingest massive amounts of human-produced data and can therefore be prone to the same biases as the humans the information comes from.

Researchers from Cornell University, Olin College, and Stanford University tested this theory by prompting CAs to display empathy while conversing with or about 65 distinct human identities.

The team found that CAs make value judgments about certain identities, such as gay and Muslim, and can be encouraging of identities related to harmful ideologies, including Nazism.

“I think automated empathy could have a tremendous impact and enormous potential to achieve positive things, for example, in education or in the healthcare sector,” said lead author Andrea Cuadra, now a postdoctoral researcher at Stanford.

“It’s extremely unlikely that (automated empathy) won’t happen,” she said, “so it’s important that as it happens, we have critical insights so we can be more intentional about mitigating potential harms.”

Cuadra will present “The Illusion of Empathy? Notes on Displays of Emotion in Human-Computer Interaction” at CHI ’24, the Association for Computing Machinery’s conference on Human Factors in Computing Systems, May 11-18 in Honolulu. Co-authors at Cornell University included Nicola Dell, associate professor; Deborah Estrin, professor of computer science; and Malte Jung, associate professor of information science.

The researchers found that, overall, LLMs received high ratings for emotional reactions but low ratings for interpretations and explorations. In other words, LLMs can answer a query based on their training, but they cannot go deeper.

Dell, Estrin and Jung said they were inspired to think about this work while Cuadra was studying older adults’ use of earlier-generation CAs.

“She witnessed intriguing uses of technology for transactional purposes, such as frailty health assessments, as well as for open-ended reminiscence experiences,” Estrin said. “Along the way, she saw clear examples of the tension between compelling and disruptive ‘empathy.’”

Funding for this research came from the National Science Foundation; a PhD fellowship from the Cornell Tech Digital Living Initiative; a Stanford PRISM Baker Postdoctoral Fellowship; and the Stanford Institute for Human-Centered Artificial Intelligence.