Amazon’s AI could help Alexa better recognize speakers’ emotion

Amazon Dot Speaker

Tone of voice may be the best way to figure out what someone is feeling. Detecting emotion from speech has many potential applications: it could reveal early signs of dementia or an impending heart attack, and it could make conversational AI systems more engaging and responsive.

The concept of AI that classifies emotions is not new. But traditional approaches are supervised, meaning they rely on training data labeled with the speaker’s emotional state. Recently, scientists at Amazon took a different approach.

To help the smart assistant Alexa better understand the people it interacts with, Amazon developed a more effective way to teach it to scan voices for signs of emotion. That means smart assistants may soon understand humanity’s wants and needs better than ever before. And then, of course, use it to sell us things.

Instead of sourcing a fully annotated “emotion” corpus to teach the system, they fed an adversarial autoencoder a publicly available data set containing 10,000 utterances from 10 different speakers.

The architecture of the adversarial autoencoder: the latent representation has two components (emotion classes and style), whose outputs feed into two adversarial discriminators.

And the results were positive: compared to a conventionally trained algorithm, the new self-teaching Alexa AI was able to judge valence, or emotional value, in people’s voices with up to 4% greater accuracy.

Viktor Rozgic, a co-author of the paper, explained that ‘adversarial autoencoders are two-part models comprising an encoder, which learns to produce a compact (or latent) representation of input speech encoding all properties of the training example, and a decoder, which reconstructs the input from the compact representation.’

The Amazon team’s emotion representation consists of three network nodes, one for each of three emotional measures: valence, or whether the speaker’s emotion is positive or negative; activation, or whether the speaker is alert and engaged or passive; and dominance, or whether the speaker feels in control of the situation.
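
To make the architecture more concrete, here is a minimal sketch in PyTorch of an adversarial autoencoder of this kind: an encoder that splits the latent representation into a three-node emotion component and a style component, a decoder that reconstructs the input from both, and one adversarial discriminator per latent component. The layer sizes, feature dimensions, and class names are illustrative assumptions, not details from Amazon’s paper.

```python
# Illustrative sketch only; sizes and structure are assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 128     # dimensionality of the input acoustic features (assumed)
EMOTION_DIM = 3    # one node each for valence, activation, dominance
STYLE_DIM = 32     # "style" component of the latent representation (assumed)


class Encoder(nn.Module):
    """Maps input speech features to a latent split into emotion and style."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(),
                                  nn.Linear(256, 128), nn.ReLU())
        self.emotion_head = nn.Linear(128, EMOTION_DIM)  # valence / activation / dominance
        self.style_head = nn.Linear(128, STYLE_DIM)      # everything else about the utterance

    def forward(self, x):
        h = self.body(x)
        return self.emotion_head(h), self.style_head(h)


class Decoder(nn.Module):
    """Reconstructs the input features from the full latent representation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMOTION_DIM + STYLE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, FEAT_DIM))

    def forward(self, emotion, style):
        return self.net(torch.cat([emotion, style], dim=-1))


class Discriminator(nn.Module):
    """Adversarial critic attached to one latent component: outputs a logit for
    whether a code came from the prior or from the encoder."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z)


encoder, decoder = Encoder(), Decoder()
emotion_disc = Discriminator(EMOTION_DIM)   # discriminator for the emotion component
style_disc = Discriminator(STYLE_DIM)       # discriminator for the style component
```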

The training is conducted in three phases. The first involves individually training the encoder and decoder on unlabeled data. In the second phase, adversarial training is used to tune the encoder. And in the third, the encoder is tuned so that the latent emotion representation predicts the emotion labels of the training data.
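
The three-phase schedule might look roughly like the following, reusing the modules from the sketch above. The specific losses, optimizers, Gaussian priors, and the sample_prior() helper are assumptions made for illustration; the article does not spell out the paper’s exact training objectives.

```python
# Illustrative training steps; reuses encoder, decoder, emotion_disc, style_disc
# and the *_DIM constants from the previous sketch.
import torch
import torch.nn.functional as F

def sample_prior(batch_size, dim):
    # Stand-in prior over latent codes (assumed Gaussian for this sketch).
    return torch.randn(batch_size, dim)

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(list(emotion_disc.parameters()) + list(style_disc.parameters()), lr=1e-4)

def phase1_step(x):
    """Phase 1: train encoder + decoder on unlabeled data to reconstruct the input."""
    emotion, style = encoder(x)
    loss = F.mse_loss(decoder(emotion, style), x)
    opt_ae.zero_grad(); loss.backward(); opt_ae.step()

def phase2_step(x):
    """Phase 2: adversarial tuning so encoder outputs match the latent priors."""
    ones, zeros = torch.ones(x.size(0), 1), torch.zeros(x.size(0), 1)
    # Discriminators learn to separate prior samples from encoder outputs.
    emotion, style = encoder(x)
    d_loss = (F.binary_cross_entropy_with_logits(emotion_disc(sample_prior(x.size(0), EMOTION_DIM)), ones)
              + F.binary_cross_entropy_with_logits(emotion_disc(emotion.detach()), zeros)
              + F.binary_cross_entropy_with_logits(style_disc(sample_prior(x.size(0), STYLE_DIM)), ones)
              + F.binary_cross_entropy_with_logits(style_disc(style.detach()), zeros))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()
    # The encoder is then tuned to fool both discriminators.
    emotion, style = encoder(x)
    g_loss = (F.binary_cross_entropy_with_logits(emotion_disc(emotion), ones)
              + F.binary_cross_entropy_with_logits(style_disc(style), ones))
    opt_ae.zero_grad(); g_loss.backward(); opt_ae.step()

def phase3_step(x, labels):
    """Phase 3: tune the encoder so the emotion latent predicts the emotion labels
    (here treated as a 3-dimensional valence/activation/dominance target)."""
    emotion, _ = encoder(x)
    loss = F.mse_loss(emotion, labels)
    opt_ae.zero_grad(); loss.backward(); opt_ae.step()
```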

In tests using sentence-level feature vectors hand-engineered to capture relevant information about a speech signal, the network was 3% more accurate than a conventionally trained network at assessing valence, according to the blog post.

Moreover, they say that when the network was fed a sequence of representations of the acoustic characteristics of 20-millisecond frames, or audio snippets, the improvement over the conventionally trained network rose to 4%.
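
For readers unfamiliar with the distinction, the sketch below contrasts the two kinds of input described above: a single hand-engineered sentence-level feature vector versus a sequence of 20-millisecond frame features. The use of librosa, MFCCs, and mean/standard-deviation pooling here is an assumption for illustration; the article does not specify the exact feature sets used in the paper.

```python
# Illustrative feature extraction; the file name and feature choices are placeholders.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-level: one feature vector per 20 ms frame (a sequence fed to the network).
frame_len = int(0.020 * sr)          # 20 ms worth of samples
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=frame_len)
frame_features = mfcc.T              # shape: (num_frames, 13)

# Sentence-level: summary statistics pooled over the whole utterance.
sentence_features = np.concatenate([frame_features.mean(axis=0),
                                    frame_features.std(axis=0)])  # shape: (26,)
```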

This new type of algorithm suggests that future products, including smart assistants, could become genuinely emotionally intelligent. The results are described in a paper to be presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).