Pronunciation Detection for Alexa’s new English learning experience

In January 2023, Alexa launched a language-learning experience in Spain, helping Spanish speakers learn beginner-level English. The experience was developed in collaboration with Vaughan, the leading English-language-training provider in Spain, to provide an immersive English learning program with a special focus on pronunciation evaluation.

We are now expanding this offering to Mexico and to the Spanish-speaking population of the United States, with more languages to come in the future. The language-learning experience includes structured lessons on vocabulary, grammar, expression, and pronunciation, with practice exercises and quizzes. To try it, set your device language to Spanish and tell Alexa “Quiero aprender inglés.”

Mini-lesson content page: lessons covering vocabulary, grammar, expression, and pronunciation.

The highlight of this Alexa skill is its pronunciation feature, which provides accurate feedback every time a customer pronounces a word or phrase. At this year’s International Conference on Acoustics, Speech, and Signal Processing (ICASSP), we presented a paper that describes our novel approach to mispronunciation detection.

Pronunciation correction: blue highlighting indicates correct pronunciation; red highlighting indicates incorrect pronunciation. For incorrectly pronounced words or sentences, Alexa gives instructions on how to pronounce them.

Our method uses a novel phonetic recurrent-neural-network-transducer (RNN-T) model that predicts phonemes, the smallest units of speech, from the student’s utterance. The model can therefore provide fine-grained pronunciation assessment at the word, syllable, or phoneme level. For example, if a student pronounces the word “rabbit” as “rabid”, the model emits the five-phoneme sequence R AE B IH D. It can then detect the mispronounced phoneme (D) and syllable (-bid) by using Levenshtein alignment to compare the emitted phoneme sequence with the reference sequence R AE B IH T.
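The Levenshtein alignment step can be sketched in a few lines of standard dynamic programming. This is a minimal illustration, not the production implementation; the function name and the “rabbit”/“rabid” example are assumptions carried over from the text.

```python
def align_phonemes(hyp, ref):
    """Levenshtein alignment of a hypothesized phoneme sequence against a
    reference. Returns (op, hyp_phoneme, ref_phoneme) tuples, where op is
    'match', 'sub', 'ins' (extra phoneme), or 'del' (missing phoneme)."""
    m, n = len(hyp), len(ref)
    # Edit-distance table
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # extra phoneme in hyp
                          d[i][j - 1] + 1,        # missing phoneme from ref
                          d[i - 1][j - 1] + cost) # match / substitution
    # Trace back to recover the alignment
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           d[i][j] == d[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else 1):
            op = "match" if hyp[i - 1] == ref[j - 1] else "sub"
            ops.append((op, hyp[i - 1], ref[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("ins", hyp[i - 1], None))
            i -= 1
        else:
            ops.append(("del", None, ref[j - 1]))
            j -= 1
    return list(reversed(ops))

# "rabbit" mispronounced as "rabid": the final T is realized as D.
errors = [op for op in align_phonemes("R AE B IH D".split(),
                                      "R AE B IH T".split())
          if op[0] != "match"]
print(errors)  # [('sub', 'D', 'T')]
```

Everything except the substitution at the final position aligns as a match, so the single flagged error localizes the mispronunciation to the last phoneme.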


The paper highlights two knowledge gaps that previous pronunciation-modeling work has not addressed. The first is the ability to disambiguate similar-sounding phonemes from different languages (e.g., the rolled “r” sound in Spanish vs. the “r” sound in English). We tackled this challenge by designing a multilingual pronunciation lexicon and building a large code-mixed phonetic dataset for training.

The second gap is the ability to learn the unique mispronunciation patterns of language learners. We address this by exploiting the autoregression of the RNN-T model, that is, the dependence of its outputs on the inputs and outputs that preceded them. This context awareness means that the model can capture frequent mispronunciation patterns from the training data. Our pronunciation model achieved state-of-the-art performance in both phoneme prediction accuracy and mispronunciation detection.

L2 data augmentation

One of the biggest technical challenges in building a phonetic-recognition model for non-native (L2) speakers is that datasets for mispronunciation diagnosis are very limited. In our Interspeech 2022 paper “L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis”, we proposed to bridge this gap with data augmentation. Specifically, we built a phoneme paraphraser that can generate realistic L2 phoneme sequences for speakers from a particular locale — e.g., phoneme sequences representative of a native Spanish speaker speaking English.


As is common with grammatical-error-correction tasks, we used a sequence-to-sequence model, but we inverted the task direction, training the model to mispronounce words rather than to correct mispronunciations. To enrich and diversify the generated L2 phoneme sequences, we proposed a diversified beam search with a preference loss that favors human-like mispronunciations.

For each input phoneme or speech fragment, the model produces several candidate phonemes as output, and sequences of phonemes are modeled as a tree, with the options branching at each new phoneme. Typically, the top-scoring phoneme sequences are extracted from the tree through beam search, which pursues only the branches of the tree with the highest probabilities. In our paper, however, we propose a beam search method that also prioritizes uncommon phonemes, or phoneme candidates that differ from most of the others at the same depth in the tree.

From established sources in the language-learning literature, we also constructed lists of common phoneme-level mispronunciations, represented as phoneme pairs: a standard phoneme in the language and its nonstandard variant. We constructed a preference loss function that biases model training toward outputs that use the nonstandard variants on our list.
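One expansion step of such a decoding scheme can be sketched as follows. This is a toy illustration of the two ideas described above (a diversity penalty and a preference bonus for listed nonstandard variants), not the paper’s actual algorithm; the confusion pairs, weights, and probabilities are all invented for the example.

```python
import math

# Hypothetical L2 confusion pairs (standard phoneme, common nonnative
# variant); a real system would curate these from the literature.
CONFUSION_PAIRS = {("IH", "IY"), ("T", "D"), ("V", "B")}

def expand_beams(beams, step_candidates, ref_phoneme, beam_width,
                 div_weight=0.5, pref_weight=0.3):
    """One step of diversity- and preference-aware beam search.

    beams:           list of (phoneme_sequence, score) pairs kept so far
    step_candidates: list of (phoneme, log_prob) options at this position
    ref_phoneme:     the standard phoneme at this position in the input
    """
    expansions = []
    for seq, score in beams:
        for ph, logp in step_candidates:
            s = score + logp
            # Preference bonus: reward known L2 substitutions of the
            # reference phoneme, steering output toward human-like errors.
            if (ref_phoneme, ph) in CONFUSION_PAIRS:
                s += pref_weight
            expansions.append((seq + [ph], s))
    # Diversity penalty: demote expansions whose newest phoneme already
    # appears in a higher-ranked expansion at the same depth.
    expansions.sort(key=lambda e: e[1], reverse=True)
    seen, reranked = {}, []
    for seq, s in expansions:
        rank = seen.get(seq[-1], 0)
        seen[seq[-1]] = rank + 1
        reranked.append((seq, s - div_weight * rank))
    reranked.sort(key=lambda e: e[1], reverse=True)
    return reranked[:beam_width]

# Toy step: paraphrasing the final phoneme of the reference "R AE B IH T".
beams = [(["R", "AE", "B", "IH"], 0.0)]
cands = [("T", math.log(0.7)), ("D", math.log(0.2)), ("S", math.log(0.1))]
top = expand_beams(beams, cands, ref_phoneme="T", beam_width=2)
```

With these toy numbers, the preference bonus lifts the T→D substitution above the rarer S candidate, so the kept beams end in T and D, i.e. the beam retains a plausible L2 mispronunciation alongside the standard phoneme.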

In experiments, we saw improvements of up to 5% in mispronunciation detection accuracy over a baseline model trained without augmented data.

Balancing false rejections and false acceptances

An important consideration in designing a pronunciation model for a language-learning experience is balancing the rates of false rejections and false acceptances. A false rejection occurs when the model detects a mispronunciation, but the customer’s pronunciation was in fact correct or only slightly accented. A false acceptance occurs when a customer mispronounces a word, but the model fails to register it.
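The two rates can be computed as follows; this is a minimal sketch, with invented function names and toy data, just to make the definitions concrete.

```python
def rejection_acceptance_rates(was_correct, was_flagged):
    """False-rejection and false-acceptance rates for a pronunciation checker.

    was_correct: per-utterance ground truth (True = pronounced correctly)
    was_flagged: model decisions (True = flagged as mispronounced)
    """
    correct = [f for c, f in zip(was_correct, was_flagged) if c]
    wrong = [f for c, f in zip(was_correct, was_flagged) if not c]
    # False rejection: a correct pronunciation that the model flagged.
    frr = sum(correct) / len(correct) if correct else 0.0
    # False acceptance: a mispronunciation the model failed to flag.
    far = sum(not f for f in wrong) / len(wrong) if wrong else 0.0
    return frr, far

# Toy evaluation set: three correct utterances, two mispronounced ones.
truth = [True, True, True, False, False]
flags = [False, True, False, True, False]
frr, far = rejection_acceptance_rates(truth, flags)
print(round(frr, 3), round(far, 3))  # 0.333 0.5
```

Lowering the flagging threshold drives the false-acceptance rate down but the false-rejection rate up, which is exactly the trade-off the design features below address.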


Our system has two design features intended to balance these two metrics. To reduce false acceptances, we first combine our pronunciation lexicons for English and Spanish into a single lexicon, with multiple similar phonemic representations for each word. Then we use the lexicon to automatically annotate unlabeled speech samples that fall into three categories: native Spanish, native English, and code-switched Spanish and English. Training the model on this dataset enables it to distinguish very subtle differences between phonemes.
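The lexicon merge amounts to taking, for each word, the union of its pronunciations across the source languages. A minimal sketch, with made-up entries (ARPAbet-style for English, a simplified phoneme set for Spanish):

```python
# Hypothetical single-word lexicons; real ones cover full vocabularies.
english_lexicon = {"data": [["D", "EY", "T", "AH"]]}
spanish_lexicon = {"data": [["D", "A", "T", "A"]]}

def merge_lexicons(*lexicons):
    """Union of pronunciation lexicons: each word keeps every
    pronunciation from every source lexicon, without duplicates."""
    merged = {}
    for lex in lexicons:
        for word, prons in lex.items():
            bucket = merged.setdefault(word, [])
            for p in prons:
                if p not in bucket:
                    bucket.append(p)
    return merged

merged = merge_lexicons(english_lexicon, spanish_lexicon)
print(merged["data"])  # both the English and the Spanish pronunciation
```

Annotating code-switched speech against this merged lexicon is what exposes the model to near-identical phonemes from both languages in the same utterance.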

To reduce false rejections, we use a multireference pronunciation lexicon, in which each word is associated with multiple reference pronunciations. For example, the word “data” can be pronounced as either “DAY-tah” or “DAH-tah”, and the system accepts both as correct.
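The acceptance check against a multireference lexicon is then a simple any-match test. A minimal sketch, assuming ARPAbet-style phonemes and an invented two-entry lexicon for “data”:

```python
def is_accepted(pred_phonemes, word, lexicon):
    """Accept a pronunciation if it matches ANY reference for the word."""
    return any(pred_phonemes == ref for ref in lexicon.get(word, []))

# Hypothetical multireference entry: both common variants of "data".
lexicon = {"data": [["D", "EY", "T", "AH"],   # "DAY-tah"
                    ["D", "AA", "T", "AH"]]}  # "DAH-tah"

print(is_accepted(["D", "AA", "T", "AH"], "data", lexicon))  # True
print(is_accepted(["D", "AE", "D", "AH"], "data", lexicon))  # False
```

Because a word is only rejected when the prediction matches none of its references, adding legitimate variant pronunciations directly lowers the false-rejection rate without loosening the check for genuine errors.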

In ongoing work, we are investigating further approaches to improving our pronunciation evaluation function. One of these is building a multilingual model that can be used for pronunciation evaluation across many languages. We are also extending the model to diagnose other properties of mispronunciation, such as tone and lexical stress.
