As the start of this year’s Interspeech approaches, “generative AI” has become a buzzword in both the machine learning community and the popular press, where it generally refers to models that synthesize text or images.
Text-to-speech (TTS) models, an important research area at Interspeech, have in some sense always been “generative”. But the field, explains Jasha Droppo, a senior principal scientist in the Alexa AI organization, has also been reshaped by the new generative-AI paradigm.
The first neural TTS models were trained to produce “point estimates” of the target speech, says Droppo, whose own Interspeech paper is on speech synthesis.
“Let’s say you’re estimating spectrograms, and a spectrogram is basically a picture where every pixel, every little element of the image, is how much energy is in the signal at that time and that frequency,” Droppo explains. “We want to estimate a time slice of the spectrogram, for example, and get the energy content over frequency for that particular time slice. And the best we could do at the time was to look at the distance between it and the speech that we wanted the model to create.
“But in text-to-speech data, there are many valid ways to express the text. You can change the pace; you can change the stress; you could insert breaks in different places. So this concept that there is a single point estimate that is the correct answer was just flawed.”
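To make that point-estimate picture concrete, here is a minimal Python sketch, not drawn from any Amazon system; the 80-bin frame size, the random data, and the squared-error loss are all illustrative assumptions. It scores a predicted spectrogram frame purely by its distance to a single reference frame, so a rendition with slightly shifted timing is penalized even though it may be a perfectly valid way to speak the text.

    # A minimal sketch of point-estimate training for spectrogram prediction.
    # Everything here (frame size, data, loss) is assumed for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n_freq_bins = 80  # assumed mel-spectrogram resolution

    # One time slice of a reference spectrogram: energy per frequency bin.
    target_frame = rng.random(n_freq_bins)

    # Two renditions of the same text: one close to the reference, one with
    # slightly shifted timing that is just as valid a way to say the words.
    prediction_a = target_frame + 0.05 * rng.standard_normal(n_freq_bins)
    prediction_b = np.roll(target_frame, 3)

    def point_estimate_loss(pred, target):
        """Mean squared distance between a predicted frame and the one reference frame."""
        return float(np.mean((pred - target) ** 2))

    print(point_estimate_loss(prediction_a, target_frame))  # small
    print(point_estimate_loss(prediction_b, target_frame))  # large, despite being valid speech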
Generative AI offers an alternative to point-estimate training. Large language models (LLMs), for example, compute probability distributions over sequences of words; at generation time, they simply sample from those distributions.
“Applying generative modeling to text-to-speech has this trait that you don’t produce a single answer,” says Droppo. “You’re estimating the likelihood of being correct over all sorts of answers.”
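As a rough illustration of the difference, the sketch below uses made-up scores over a toy vocabulary and contrasts the point-estimate view, which commits to the single most likely output, with the generative view, which assigns a likelihood to every candidate and samples from that distribution.

    # A minimal sketch contrasting a point estimate with sampling from a
    # probability distribution. The logits and vocabulary are invented.
    import numpy as np

    rng = np.random.default_rng(0)

    logits = np.array([2.0, 1.5, 0.2, -1.0])        # assumed model scores over 4 candidates
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax: likelihood of each candidate

    point_estimate = int(np.argmax(probs))          # single-answer view: always the same output
    sample = int(rng.choice(len(probs), p=probs))   # generative view: any candidate, weighted by probability

    print(probs, point_estimate, sample)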
The first of these generative approaches to TTS, says Droppo, was normalizing flows, which pass data through a series of invertible transformations (flows) so that it approximates a prior distribution (normalizing). Next came diffusion modeling, which incrementally adds noise to data samples and trains a model to denoise the results, until it can eventually generate data from random input.
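The sketch below illustrates only the forward, noise-adding half of that diffusion recipe, using an assumed noise schedule on a toy signal; an actual TTS system would also train a network to reverse the process step by step, which is what lets it generate data from random input.

    # A minimal sketch of the forward (noising) process used in diffusion
    # modeling. The noise schedule and the toy signal are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    x0 = np.sin(np.linspace(0, 2 * np.pi, 64))   # a clean "data" sample
    betas = np.linspace(1e-4, 0.05, 100)         # assumed per-step noise schedule
    alphas_bar = np.cumprod(1.0 - betas)

    def noisy_sample(x0, t):
        """Sample the forward process at step t: part signal, part Gaussian noise."""
        eps = rng.standard_normal(x0.shape)
        return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

    # Early steps stay close to the data; by the last step it is nearly pure noise.
    print(np.corrcoef(x0, noisy_sample(x0, 5))[0, 1])
    print(np.corrcoef(x0, noisy_sample(x0, 99))[0, 1])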
Spectrum quantization
Most recently, says Droppo, a new approach known as spectrum quantization has generated excitement among TTS scientists.
“If we were to have an acoustic tokenizer, something that takes, say, a 100-millisecond segment of the spectrogram and transforms it into an integer, and if we have the right component like that, we take this continuous problem, this problem of estimating the spectrogram image, and transform it into a unit prediction problem,” says Droppo. “The model doesn’t care where the integers came from. It just knows that there is a sequence and that there is a certain structure at a high level.”
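One way to picture such a tokenizer is the sketch below: each block of spectrogram frames covering roughly 100 milliseconds is mapped to the integer index of its nearest codebook vector. The codebook here is random and all sizes are assumptions made for illustration; in practice the tokenizer itself would be learned.

    # A minimal sketch of an acoustic tokenizer: nearest-codebook-vector
    # quantization of short spectrogram segments. Codebook and sizes are assumed.
    import numpy as np

    rng = np.random.default_rng(0)

    n_freq_bins, frames_per_segment, codebook_size = 80, 10, 512

    # Stand-in for a learned codebook: one vector per integer token.
    codebook = rng.standard_normal((codebook_size, n_freq_bins * frames_per_segment))

    def tokenize(spectrogram):
        """Turn a (time, frequency) spectrogram into a sequence of integer tokens."""
        n_segments = spectrogram.shape[0] // frames_per_segment
        segments = spectrogram[: n_segments * frames_per_segment]
        segments = segments.reshape(n_segments, -1)              # one vector per ~100 ms
        dists = np.linalg.norm(segments[:, None] - codebook[None], axis=-1)
        return dists.argmin(axis=-1)                             # index of nearest codebook entry

    speech = rng.random((200, n_freq_bins))                      # stand-in spectrogram
    print(tokenize(speech))                                      # a sequence of integer tokens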
In this regard, explains Droppo, a spectrum quantization model is a lot like a causal LLM, which is trained on the task of predicting the next word in a sequence of words.
“That’s all a causal LLM sees, too,” says Droppo. “It doesn’t see the text; it sees text tokens. Spectrum quantization allows the model to look at speech in exactly the same way that the model looks at text. And now we can take all the code and modeling and insight that we’ve used to scale large language models and bring it to this modeling problem.”
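The sketch below makes that point concrete with a deliberately tiny stand-in for a causal language model, a bigram count table: the same next-token code runs unchanged whether the integer tokens come from a text tokenizer or from an acoustic tokenizer. The token values and vocabulary sizes are invented for illustration.

    # A minimal sketch of modality-blind next-token modeling over integer IDs.
    # The bigram table stands in for a causal LLM; all token values are invented.
    import numpy as np
    from collections import defaultdict

    def train_bigram(token_sequences, vocab_size):
        """Count next-token statistics from sequences of integer tokens."""
        counts = defaultdict(lambda: np.ones(vocab_size))  # add-one smoothing
        for seq in token_sequences:
            for prev, nxt in zip(seq[:-1], seq[1:]):
                counts[prev][nxt] += 1
        return counts

    def next_token_probs(counts, prev):
        """Probability distribution over the next token, given the previous one."""
        c = counts[prev]
        return c / c.sum()

    text_tokens = [[5, 17, 42, 17, 5], [42, 17, 5, 17]]   # IDs from a text tokenizer
    audio_tokens = [[301, 44, 44, 78], [44, 78, 301]]     # IDs from an acoustic tokenizer

    # Exactly the same code serves both modalities.
    print(next_token_probs(train_bigram(text_tokens, 64), 17).argmax())
    print(next_token_probs(train_bigram(audio_tokens, 512), 44).argmax())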
Unified Speech
Droppo’s work is not limited to TTS, however; the majority of the papers he has coauthored at Amazon are on automatic speech recognition (ASR) and related techniques for processing acoustic input signals. The breadth of his work gives him a more holistic view of speech as a subject of research.
“In my experience as a human being, I can’t separate the process of generating speech from the process of understanding speech,” says Droppo. “It seems very unified to me. And I think if I had to build the perfect machine, it wouldn’t really distinguish between working out what it’s trying to say and trying to understand what the other party in the conversation is saying.”
More specifically, Droppo says, “the problems of making speech recognition end-to-end and making TTS end-to-end have similar aspects, such as being able to deal with words that are not well represented in the data. An ASR system will struggle to transcribe a word it has never heard, and a TTS system will struggle to correctly pronounce a word it has never encountered before.”
As an example, Alexa AI researchers have used audio data generated by TTS models to train ASR models. But, says Droppo, this is just the tip of the iceberg. “At Amazon,” he says, “it has been my mission to bring text-to-speech and speech-to-text closer together.”