Automatic speech recognition (ASR) models, which transcribe spoken utterances, are a key component of voice assistants. They are increasingly deployed on devices at the edge of the Internet, where they enable quick responses (since they do not require cloud processing) and continued service, even during connectivity outages.
But ASR models need regular updates as new words and names enter the public conversation. If all locally collected data is to remain on device, updating a global model requires federated learning, in which devices compute updates locally and transmit only gradients (incremental adjustments to model weights) to the cloud.
A key question in federated learning is how to annotate the locally stored data so that it can be used to update the local model. At this year's International Conference on Acoustics, Speech, and Signal Processing (ICASSP), my colleagues and I presented an answer to that question. One part of our answer is to use self-supervision, in which one version of a model labels data for another version, together with data augmentation. The other part is to use noisy, weak supervision signals based on implicit customer feedback, such as the rephrasing of a request, and on natural-language-understanding (NLU) semantics determined across multiple turns of a session with the conversational agent.
| Transcription | Play Halo by Beyoncé on the main speaker |
| ASR hypothesis | Play hello of out over hand-speaker |
| NLU semantics | PlaySong, Artist: Beyoncé, Song: Halo, Device: main speaker |
| Semantic cost | 2/3 |

Table: Examples of the weak supervision available for an utterance. Here, the semantic cost (the fraction of slots that are wrong) is used as a feedback signal: two of the three slots are misrecognized, so the cost is 2/3.
To test our approach, we simulated a federated learning (FL) setup in which hundreds of devices update their local models using data they do not share. These updates are aggregated and combined with updates from cloud servers, which perform replay training on historical data to prevent regression of the ASR model. These innovations enable a 10% relative improvement in word error rate (WER) on new use cases, with minimal degradation on other test sets, in the absence of strong supervision signals such as ground-truth transcriptions.
Noisy students
Semi-supervised learning often uses a large, powerful teacher model to label training data for a smaller, more efficient student model. On edge devices, which often have compute, communication, and memory constraints, a larger teacher model may not be practical.
Instead, we adopt the so-called noisy-student, or iterative-pseudo-labeling, paradigm, in which the local ASR model acts as its own teacher. Once the model has labeled the locally stored audio, we discard the examples whose label confidence is too high (they will not teach the model anything new) or too low (they are probably wrong). Once we have a pool of strong pseudo-labeled examples, we augment them by adding elements such as noise and background speech, with the aim of improving the trained model's robustness.
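A minimal sketch of this filter-and-augment step, assuming a hypothetical `asr_model.transcribe` interface that returns a hypothesis with a confidence score; the confidence thresholds and signal-to-noise ratios are illustrative, not values from the paper:

```python
import random

import numpy as np

# Illustrative confidence band, not values from the paper: drop labels the
# model is already sure of (nothing new to learn) and labels it is unsure
# of (probably wrong).
MIN_CONF, MAX_CONF = 0.4, 0.95

def mix(signal: np.ndarray, interference: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Add an interfering clip to a waveform at a target signal-to-noise ratio."""
    interference = np.resize(interference, signal.shape)
    sig_pow = np.mean(signal**2) + 1e-12
    int_pow = np.mean(interference**2) + 1e-12
    scale = np.sqrt(sig_pow / (int_pow * 10 ** (snr_db / 10)))
    return signal + scale * interference

def pseudo_label(asr_model, local_audio):
    """Self-label on-device audio, keeping only mid-confidence examples."""
    kept = []
    for audio in local_audio:
        hypothesis, confidence = asr_model.transcribe(audio)  # assumed interface
        if MIN_CONF <= confidence <= MAX_CONF:
            kept.append((audio, hypothesis))
    return kept

def augment(examples, noise_clips, speech_clips):
    """Perturb the audio (not the labels) with noise and background speech."""
    return [
        (mix(mix(audio, random.choice(noise_clips)),
             random.choice(speech_clips), snr_db=15.0), label)
        for audio, label in examples
    ]
```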
We turn to weak supervision to prevent feedback loops in which the model is trained on its own erroneous labels. Users typically interact with conversational agents over multiple turns in a session, and later interactions may indicate whether a request was handled properly. Canceling or repeating a request signals user dissatisfaction, and users may also be asked for explicit feedback. These types of interactions add a source of ground truth alongside the self-labels.
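As an illustration of how such session signals might be collapsed into a scalar feedback score (the event names and score values here are hypothetical, not taken from the paper):

```python
# Hypothetical mapping from implicit session signals to a scalar score.
IMPLICIT_FEEDBACK = {
    "completed": 1.0,   # request carried out with no follow-up correction
    "rephrased": 0.3,   # user repeated or reworded the request
    "cancelled": 0.0,   # user barged in or cancelled the request
}

def implicit_reward(session_events: list[str]) -> float:
    """Score an utterance by the most pessimistic later signal in its session."""
    return min((IMPLICIT_FEEDBACK[event] for event in session_events), default=1.0)
```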
In particular, we use reinforcement learning to update the local models. In reinforcement learning, a model repeatedly interacts with its environment and tries to learn a policy that maximizes a certain reward function.
We simulate rewards using synthetic scores based on (1) implicit feedback and (2) semantics inferred by an on-device natural-language-understanding (NLU) model. We can convert the inferred semantics into a feedback score by computing a semantic cost metric (e.g., the fraction of named entities labeled by the NLU model that are missing from the ASR hypothesis).
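A sketch of one such semantic cost, matching the table above; the slot-matching rule and the hypothesis string are illustrative assumptions chosen to reproduce the 2/3 cost:

```python
def semantic_cost(nlu_slots: dict[str, str], asr_hypothesis: str) -> float:
    """Fraction of NLU slot values missing from the ASR hypothesis.

    Higher cost means a worse hypothesis; 1 - cost can serve as a reward.
    """
    if not nlu_slots:
        return 0.0
    hyp = asr_hypothesis.lower()
    wrong = sum(1 for value in nlu_slots.values() if value.lower() not in hyp)
    return wrong / len(nlu_slots)

# Reproducing the table's 2/3 with a stand-in hypothesis that gets the
# artist right but mangles the song and device: two of three slots are wrong.
slots = {"Artist": "Beyonce", "Song": "Halo", "Device": "main speaker"}
print(semantic_cost(slots, "play hello by beyonce on the hand speaker"))  # 0.666...
```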
To exploit this noisy feedback, we update the model using a combination of the self-learning loss and a reinforcement-learning-based loss. Since feedback scores cannot be used directly to update the ASR model, we use an objective that maximizes the probability of predicting hypotheses with high reward scores.
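A PyTorch-flavored sketch of this combined objective, assuming a hypothetical `model.log_prob(audio, text)` interface and batch layout; the REINFORCE-style weighting and the mixing weight `alpha` are illustrative, not the paper's exact formulation:

```python
import torch

def combined_loss(model, batch, alpha: float = 0.5):
    """Self-learning loss plus a reward-weighted term that raises the
    probability of high-reward hypotheses."""
    # Standard self-learning term: fit the filtered pseudo-labels.
    self_learning = -model.log_prob(batch.audio, batch.pseudo_labels).mean()

    # REINFORCE-style term: weight each hypothesis's log-probability by its
    # feedback score (e.g., 1 - semantic cost), so high-reward hypotheses
    # are pushed up while low-reward ones contribute little.
    log_probs = torch.stack(
        [model.log_prob(batch.audio, hyp) for hyp in batch.hypotheses]
    )
    rl_term = -(batch.rewards * log_probs).mean()

    return alpha * self_learning + (1.0 - alpha) * rl_term
```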
In our experiments, we used data from 3,000 training rounds across 400 devices, using self-labels and weak supervision to compute gradients, or model updates. A cloud orchestrator combines these updates with updates generated by 40 pseudo-devices on cloud servers, which compute model updates from historically transcribed data.
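In outline, one orchestrator round might look like the following FedAvg-style sketch; the uniform weighting and the server step size are assumptions:

```python
import numpy as np

SERVER_LR = 0.1  # illustrative server-side step size

def orchestrator_round(global_weights: np.ndarray,
                       device_grads: list[np.ndarray],
                       cloud_grads: list[np.ndarray]) -> np.ndarray:
    """Combine gradients from edge devices (computed on self-labeled, weakly
    supervised data) with gradients from cloud pseudo-devices replaying
    historically transcribed data, then apply the averaged update to the
    global model."""
    all_grads = np.stack(device_grads + cloud_grads)
    return global_weights - SERVER_LR * all_grads.mean(axis=0)
```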
We see improvements of more than 10% on test sets with new data, i.e., utterances in which words or phrases are five times more popular in the current period than before. The cloud pseudo-devices perform replay training, which prevents catastrophic forgetting, or degradation on older data, when models are updated.