Imagine having control over what you hear: tuning out unwanted noise and focusing on the sounds that matter to you, enjoying the tranquility of nature and the birds chirping in a park without the chatter of other hikers, or blocking out the constant traffic noise on a busy street while still hearing important sounds like emergency sirens and car horns.
This is the vision of a team led by researchers at the University of Washington. Working with Microsoft, the team has developed deep-learning algorithms that let users pick which sounds filter through their headphones in real time. For instance, a user might erase car horns when working indoors but not when walking along busy streets.
The system, called “semantic hearing,” enables wearers to focus on or ignore specific sounds from real-world environments in real time while preserving their spatial cues. It works by having the headphones stream captured audio to a connected smartphone, which cancels all environmental sounds.
Users then select which sounds they want to hear from 20 sound classes, such as sirens, baby cries, speech, vacuum cleaners, and bird chirps, either through a smartphone app or by voice command. The headphones play only the selected sounds, canceling out all other noise.
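To make that selection step concrete, here is a minimal sketch, in PyTorch, of how a class-conditioned extraction network could take a binaural audio chunk plus a multi-hot vector of chosen classes and return a filtered waveform. The class names, layer sizes, and the ConditionedExtractor module are illustrative assumptions, not the team’s published architecture.

```python
import torch
import torch.nn as nn

# A subset of the 20 sound classes mentioned above (names are placeholders).
SOUND_CLASSES = ["siren", "baby_cry", "speech", "vacuum_cleaner", "bird_chirp"]

class ConditionedExtractor(nn.Module):
    """Toy class-conditioned masking network; untrained, so it only illustrates the interface."""

    def __init__(self, n_classes: int, channels: int = 32):
        super().__init__()
        self.encoder = nn.Conv1d(2, channels, kernel_size=16, stride=8)     # stereo waveform in
        self.condition = nn.Linear(n_classes, channels)                     # embed the class selection
        self.mask = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.decoder = nn.ConvTranspose1d(channels, 2, kernel_size=16, stride=8)

    def forward(self, audio: torch.Tensor, selected: torch.Tensor) -> torch.Tensor:
        # audio: (batch, 2, samples) binaural chunk; selected: (batch, n_classes) multi-hot vector
        feats = self.encoder(audio)
        cond = self.condition(selected).unsqueeze(-1)       # broadcast the selection over time
        mask = torch.sigmoid(self.mask(feats * cond))       # class-conditioned soft mask over the features
        return self.decoder(feats * mask)                   # decode back to a stereo waveform

# Example: select sirens and bird chirps for a ~10 ms stereo chunk at 44.1 kHz.
model = ConditionedExtractor(n_classes=len(SOUND_CLASSES))
chunk = torch.randn(1, 2, 441)
selected = torch.zeros(1, len(SOUND_CLASSES))
selected[0, SOUND_CLASSES.index("siren")] = 1.0
selected[0, SOUND_CLASSES.index("bird_chirp")] = 1.0
filtered = model(chunk, selected)
```

The team’s actual network is transformer-based (see the runtime figure below); the toy convolutional layers here only show what conditioning on the user’s selection might look like.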
“Understanding what a bird sounds like and extracting it from all other sounds in an environment requires real-time intelligence that today’s noise-canceling headphones haven’t achieved,” said senior author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science & Engineering. “The challenge is that the sounds headphone wearers hear need to sync with their visual senses. You can’t hear someone’s voice two seconds after they talk to you. This means the neural algorithms must process sounds in under a hundredth of a second.”
Because so little time is available for processing, the semantic hearing system must run on a device such as a connected smartphone rather than on more powerful cloud servers. And because sounds from different directions reach each ear at slightly different times, the system must also preserve these delays and other spatial cues so that wearers can still meaningfully perceive the sounds in their environment.
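As a rough illustration of what preserving those spatial cues means, the sketch below estimates the arrival-time offset between the two ear signals with a plain cross-correlation. The toy siren, the 44.1 kHz sample rate, and the 20-sample offset are assumptions made for the example, not part of the team’s evaluation.

```python
import numpy as np

def interaural_delay_samples(left: np.ndarray, right: np.ndarray) -> int:
    """Signed estimate of the offset between the two ear signals via cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    return int(np.argmax(corr)) - (len(right) - 1)

sample_rate = 44_100
t = np.arange(int(0.05 * sample_rate)) / sample_rate      # 50 ms toy snippet
siren = np.sin(2 * np.pi * 700 * t) * np.exp(-200 * t)    # short decaying tone standing in for a siren blip
left = siren
right = np.roll(siren, 20)                                # the right ear hears it about 0.45 ms later

offset = interaural_delay_samples(left, right)
print(f"arrival-time offset: {abs(offset) / sample_rate * 1000:.2f} ms")

# Any per-ear processing (here it would be the sound-extraction network) has to leave
# this offset, and the level difference between ears, intact; otherwise the listener
# can no longer tell which direction the siren is coming from.
```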
The team tested its prototype in environments including parks, streets, and offices, where it extracted sirens, bird chirps, alarms, and other target sounds while removing all other background noise. When 22 participants rated the system’s audio output for the target sound, they reported an overall improvement in quality compared with the original recording.
Results show that the system can operate with 20 sound classes and that the transformer-based network has a runtime of 6.56 ms on a connected smartphone.
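As a point of reference for that runtime figure, the per-chunk latency of any such model can be measured with a simple timing loop like the one below; the stand-in convolutional network is an assumption, and the number it prints illustrates the method rather than reproducing the 6.56 ms result.

```python
import time
import torch
import torch.nn as nn

# Stand-in network (an assumption; not the team's transformer-based model).
model = nn.Sequential(
    nn.Conv1d(2, 64, kernel_size=16, stride=8),
    nn.ReLU(),
    nn.ConvTranspose1d(64, 2, kernel_size=16, stride=8),
).eval()

chunk = torch.randn(1, 2, 441)          # ~10 ms of stereo audio at 44.1 kHz

with torch.no_grad():
    for _ in range(10):                 # warm-up so timings are stable
        model(chunk)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(chunk)
    per_chunk_ms = (time.perf_counter() - start) / runs * 1_000

# To stay in sync with what the wearer sees, processing each chunk must finish
# well before the next chunk arrives (i.e., comfortably under ~10 ms).
print(f"average inference time: {per_chunk_ms:.2f} ms per chunk")
```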
However, in some cases the system had difficulty distinguishing between sounds that share similar characteristics, such as vocal music and human speech. The researchers suggest that training the models on more real-world data could improve these results.