Tuesday, March 5, 2024

Surrey develops energy-efficient text-to-audio AI system

Text-to-Audio (TTA) systems have recently gained attention for their ability to synthesize general audio from text descriptions. However, previous TTA approaches have offered limited generation quality at a high computational cost.

In a new study, researchers at the University of Surrey have proposed an energy-efficient text-to-audio AI system called AudioLDM, which generates an audio clip to match a text prompt submitted by the user.

The system can process prompts and deliver clips using less computational power than current AI systems without compromising sound quality or the user’s ability to manipulate clips.

Surrey’s open-sourced text-to-audio model is built in a semi-supervised way with a method called Contrastive Language-Audio Pretraining (CLAP). The pre-trained CLAP models enabled researchers to train the AudioLDM system on massive amounts of diverse audio data without text labeling, significantly improving model capacity.
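For readers curious about the contrastive idea behind CLAP, the following is a minimal sketch, not the researchers' implementation. It assumes audio clips and their text descriptions have already been mapped to embedding vectors, and it computes a symmetric contrastive loss that is low when each clip is most similar to its own description. The function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct match for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over paired audio/text embeddings.

    The temperature is an assumed, illustrative value.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    # Similarity of every audio clip to every text description.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]
    # Transpose to score every text description against every audio clip.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (hypothetical): two audio clips and two descriptions.
audio_embs = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(audio_embs, [[0.9, 0.1], [0.1, 0.9]])
mismatched = contrastive_loss(audio_embs, [[0.1, 0.9], [0.9, 0.1]])
print(matched < mismatched)
```

Training against a loss of this kind pulls matching audio and text into a shared embedding space, which is what makes it plausible to learn from audio alone and still respond to text prompts afterwards.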

“What makes AudioLDM special is not just that it can create sound clips from text prompts, but that it can create new sounds based on the same text without requiring retraining,” said Wenwu Wang, Professor in Signal Processing and Machine Learning at the University of Surrey. “This saves time and resources since it doesn’t require additional training. As generative AI becomes part and parcel of our daily lives, it’s important that we start thinking about the energy required to power up the computers that run these technologies. AudioLDM is a step in the right direction.”

The research team has made AudioLDM available for the general public to try on its Hugging Face Space. The code is also open-sourced on GitHub, where it has more than 1,000 stars. The user community has already used AudioLDM to create music clips in a variety of genres.

Sound designers could use such a system in a variety of applications, such as filmmaking, game design, digital art, virtual reality, the metaverse, and digital assistance for the visually impaired.

“Generative AI has the potential to transform every sector, including music and sound creation,” said Haohe Liu, project lead from the University of Surrey. “With AudioLDM, we show that anyone can create high-quality and unique samples in seconds with very little computing power. While there are some legitimate concerns about the technology, there is no doubt that AI will open doors for many within these creative industries and inspire an explosion of new ideas.”

Journal reference:

  1. Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. DOI: 10.48550/arXiv.2301.12503