AI headphones let the wearer hear a single person in a crowd by looking at them just once

Noise-canceling headphones have become very good at creating an auditory blank slate. But allowing certain sounds from the wearer’s environment to pass through the cancellation still poses a challenge for researchers. The latest edition of Apple’s AirPods Pro, for instance, automatically adjusts sound levels for wearers by detecting when they are in conversation, but the user has little control over whom to listen to or when this happens.

A team at the University of Washington has developed an artificial intelligence system that lets a headphone wearer look at a person speaking for three to five seconds to “enroll” them. The system, called “Target Speech Hearing,” then cancels all other sounds in the environment and plays just the enrolled speaker’s voice in real time, even as the listener moves around in noisy places and no longer faces the speaker.

The team presented its findings May 14 in Honolulu at the ACM CHI Conference on Human Factors in Computing Systems. Code for the proof-of-concept device is available for others to build on. The system is not commercially available.

“We now tend to think of AI as web-based chatbots that answer questions,” said lead author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science and Engineering. “But in this project, we developed AI to modify the auditory perception of anyone wearing headphones, depending on their preferences. With our devices you can now clearly hear a single speaker even if you are in a noisy environment with many other people talking.”

To use the system, a person wearing standard headphones fitted with microphones taps a button while directing their head at someone speaking. The sound waves from that speaker’s voice should then reach the microphones on both sides of the headset simultaneously; there is a 16-degree margin of error. The headphones send that signal to an on-board embedded computer, where the team’s machine learning software learns the desired speaker’s vocal patterns. The system latches onto that speaker’s voice and continues to play it back to the listener, even as the pair moves around. The system’s ability to focus on the enrolled voice improves as the speaker keeps talking, giving the system more training data.
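The real system runs binaural neural networks on the headset’s embedded computer. The sketch below is only a hypothetical illustration of the enrollment-then-extraction flow described above: the function names (enroll_speaker, extract_target) are invented for this example rather than taken from the released code, and simple signal processing stands in for the actual networks.

```python
# Hypothetical sketch of the enrollment-then-extraction flow. Names and the
# placeholder signal processing are illustrative; the real system uses neural
# networks running on an embedded computer attached to the headphones.

import numpy as np

SAMPLE_RATE = 16_000          # assumed audio sample rate
ENROLL_SECONDS = 4            # the paper reports a 3-5 second look at the speaker


def enroll_speaker(left_mic: np.ndarray, right_mic: np.ndarray) -> np.ndarray:
    """Derive a speaker "embedding" from a short binaural snippet.

    While the wearer faces the target, that voice arrives at both microphones
    nearly simultaneously (within roughly a 16-degree margin), so the two
    channels are almost identical for that voice. A real implementation uses a
    neural enrollment network; here we average the channels and return a coarse
    spectral envelope as a stand-in embedding.
    """
    aligned = 0.5 * (left_mic + right_mic)            # in-phase voice reinforced
    spectrum = np.abs(np.fft.rfft(aligned))
    bands = np.array_split(spectrum, 64)
    return np.array([b.mean() for b in bands])


def extract_target(frame_left: np.ndarray, frame_right: np.ndarray,
                   embedding: np.ndarray) -> np.ndarray:
    """Return audio containing only the enrolled speaker for one frame.

    The real system conditions a target speech extraction network on the
    embedding; this stub simply passes the mixture through unchanged.
    """
    _ = embedding                                      # used by the real network
    return 0.5 * (frame_left + frame_right)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake binaural enrollment snippet captured while facing the speaker.
    n = SAMPLE_RATE * ENROLL_SECONDS
    left = rng.standard_normal(n)
    right = left + 0.01 * rng.standard_normal(n)       # nearly identical channels

    embedding = enroll_speaker(left, right)

    # Streaming loop: process short frames and keep only the target voice.
    frame = SAMPLE_RATE // 100                         # 10 ms frames
    for start in range(0, n, frame):
        out = extract_target(left[start:start + frame],
                             right[start:start + frame], embedding)
        # On the real device, `out` would be written to the headphone speakers.
```

The design choice the sketch tries to convey is that facing the speaker makes their voice arrive in phase at both microphones, which is what distinguishes the target from other voices during enrollment.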

The team tested its system on 21 subjects, who on average rated the clarity of the enrolled speaker’s voice nearly twice as high as that of the unfiltered audio.

This work builds on the team’s previous “semantic hearing” research, which allowed users to select specific classes of sounds (such as birds or voices) they wanted to hear and cancel out other sounds in the environment.

Currently, the TSH system can enroll only one speaker at a time, and it can enroll a speaker only when no other loud voice is coming from the same direction as the target speaker’s voice. If users are not satisfied with the sound quality, they can run another enrollment on the speaker to improve clarity.
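A minimal sketch of those constraints and the re-enrollment step, assuming a hypothetical TSHSession class that is not the project’s actual API:

```python
# Illustrative sketch only: one enrolled speaker at a time, with optional
# re-enrollment when the wearer finds the output unclear.

from typing import Optional
import numpy as np


class TSHSession:
    def __init__(self) -> None:
        self.embedding: Optional[np.ndarray] = None    # one speaker at a time

    def enroll(self, left: np.ndarray, right: np.ndarray) -> None:
        """Enroll the speaker the wearer is currently facing.

        Assumes no other loud voice arrives from the same direction, since the
        system cannot separate two voices during enrollment.
        """
        aligned = 0.5 * (left + right)                 # stand-in for the real network
        self.embedding = np.abs(np.fft.rfft(aligned))[:64]

    def re_enroll_if_unclear(self, left: np.ndarray, right: np.ndarray,
                             satisfied: bool) -> None:
        """Run enrollment again if the wearer is not happy with the clarity."""
        if not satisfied:
            self.enroll(left, right)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    snippet = rng.standard_normal(16_000 * 4)          # fake 4-second snippet
    session = TSHSession()
    session.enroll(snippet, snippet)                               # first enrollment
    session.re_enroll_if_unclear(snippet, snippet, satisfied=False)  # retry
```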

The team is working to expand the system to earbuds and hearing aids in the future.

Other co-authors on the paper were Bandhav Veluri, Malek Itani and Tuochao Chen, UW doctoral students in the Allen School, and Takuya Yoshioka, director of research at AssemblyAI. This research was funded by a Moore Inventor Fellow award, a Thomas J. Cable Professorship and a UW CoMotion Innovation Gap Fund.