Training a machine learning model can be thought of as exploring a landscape that maps settings of the model parameters against the average error rate. The goal of training is to find the bottom of the lowest basin in the landscape, or the parameter settings that yield the lowest error rate, or "loss".
A critical hyperparameter during training is the learning rate, which determines how large an effect the learning from a given batch of training data can have on a model's parameter settings. It is common to vary the learning rate throughout training: for example, we might use a high rate at first to rapidly explore the whole landscape, but slow the learning rate over time to ensure that we don't leap over a global minimum.
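As a concrete illustration of that decay idea, here is a minimal Python sketch of one common hand-designed schedule, exponential decay; the function name and constants are illustrative choices of ours, not taken from the work described below.

```python
# A minimal sketch of a hand-designed, decaying learning-rate schedule:
# start high to explore the landscape, then shrink the rate so training
# settles into a minimum rather than leaping over it.

def exponential_decay(initial_lr: float, decay: float, epoch: int) -> float:
    """Return the learning rate for a given epoch under exponential decay."""
    return initial_lr * (decay ** epoch)

for epoch in range(5):
    lr = exponential_decay(initial_lr=0.1, decay=0.5, epoch=epoch)
    print(f"epoch {epoch}: learning rate = {lr:.4f}")
```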
Varying the learning rate is known as learning rate scheduling, and it is instrumental in achieving stable convergence and maximum accuracy. Yet crafting optimal schedules often relies on painstaking trial-and-error experimentation. As models grow more complex, manual tuning becomes increasingly unscalable, and human-designed schedules fail to respond to the intricacies of the loss landscape, model parameters, and datasets.
At Amazon, we are developing algorithms that can learn to schedule by leveraging data from previous experiments. In a series of recent papers, we described three phases of our research:
- Deriving stability guarantees for a simplified problem (non-negative matrix factorization) and using them to develop a learnable scheduler;
- Extending that approach to deep neural networks; and
- Distilling the results into an efficient heuristic scheduler.
Analyzing stochastic non-negative matrix factorization
In the first paper, “Efficient learning rate schedules for stochastic non-negative matrix factorization via reinforcement learning”, which we presented at ICLR 2023, we analyze stochastic non-negative matrix factorization (NMF), a well-studied unsupervised learning technique. NMF involves decomposing a non-negative matrix into two low-rank non-negative factor matrices.
Because of its popularity and mathematical simplicity, NMF served as an appealing test bed before we tackled more complex models. Interestingly, our way of posing this well-studied matrix decomposition problem as a learnable problem is related to the popular parameter-efficient fine-tuning (PEFT) methods used today for more efficient compression and training of large language models.
In our first paper, we considered an optimization scheme for NMF that uses stochastic gradient descent, the standard machine learning algorithm, to minimize the difference between the original matrix and the matrix reconstituted from the factor matrices. To measure the difference, we used the Frobenius norm, which is the square root of the sum of the squares of the individual differences for all matrix entries.
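To make the setup concrete, the sketch below factorizes a small random non-negative matrix by minimizing the Frobenius norm of the reconstruction error. It uses full-batch projected gradient descent with a fixed learning rate, a simplification of the stochastic scheme analyzed in the paper, and all dimensions and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative matrix V and random non-negative factors W and H.
V = rng.random((20, 15))
rank, lr = 5, 0.01
W = rng.random((20, rank))
H = rng.random((rank, 15))

for step in range(2000):
    # Gradient (up to a constant factor) of the squared Frobenius error ||WH - V||_F^2.
    R = W @ H - V
    grad_W = R @ H.T
    grad_H = W.T @ R
    # Gradient step, then projection back onto the non-negative orthant.
    W = np.maximum(W - lr * grad_W, 0.0)
    H = np.maximum(H - lr * grad_H, 0.0)

print("Frobenius norm of the error:", np.linalg.norm(V - W @ H))
```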
Assuming noisy gradients, that is, noisy estimates of slopes in the loss landscape, we established an upper bound on learning rates that guarantees stability, or convergence to a local minimum under repeated training epochs.
This yielded valuable insights. First, it precisely quantified how the learning rate controls the trade-off between convergence speed and potential divergence. Second, it showed that stability can be assured through proper learning rate initialization and clipping, or capping the amount by which any one model parameter can be changed during model updates.
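The sketch below shows what clipping a model update might look like in practice; the plain SGD update and the `max_step` bound are our own illustrative choices, not the stability conditions derived in the paper.

```python
import numpy as np

def clipped_sgd_update(params, grads, lr, max_step):
    """One SGD update in which no single parameter may move by more than max_step.

    Capping the per-parameter change is the kind of clipping that, together with
    a suitable initial learning rate, keeps repeated updates from diverging.
    """
    step = np.clip(lr * grads, -max_step, max_step)
    return params - step

# Example: a large gradient entry is prevented from producing a huge jump.
params = np.array([0.5, -0.2, 1.0])
grads = np.array([0.1, 50.0, -0.3])
print(clipped_sgd_update(params, grads, lr=0.1, max_step=0.5))
```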
With convergence guarantees in hand, we shifted our focus to learning which schedules work well for a particular problem. Reinforcement learning (RL) agents search for and generate sequences of decisions that should lead to a better end state. This can be applied directly to learning rate schedules that maximize convergence speed while respecting stability bounds.
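Schematically, the interaction looks something like the loop below. The agent interface (`act`, `observe`), the multiplier actions, and the loss-improvement reward are illustrative assumptions, not the exact formulation in the paper.

```python
def train_with_rl_scheduler(agent, train_one_epoch, init_lr, num_epochs):
    """Illustrative loop in which an RL agent chooses the learning rate each epoch.

    `agent.act` and `agent.observe` are hypothetical methods standing in for
    whatever policy and update rule the RL algorithm uses.
    """
    lr, prev_loss = init_lr, None
    for epoch in range(num_epochs):
        state = (epoch, lr, prev_loss)
        action = agent.act(state)              # e.g., multiply the rate by 0.5, 1.0, or 2.0
        lr = lr * action
        loss = train_one_epoch(lr)             # run one epoch of training at the chosen rate
        reward = 0.0 if prev_loss is None else prev_loss - loss
        agent.observe(state, action, reward)   # credit schedules that speed up convergence
        prev_loss = loss
    return lr
```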
Empirically, the automated schedules our RL agent discovered consistently outperformed, on NMF tasks, popular heuristics such as step decay, which systematically lowers the learning rate after successive epochs. This provided a promising proof of concept for meta-learned scheduling in simplified domains where stability can be analytically assured.
Tackling deep neural network optimization
Given what we had learned about using RL to generate NMF schedules, we next sought to extend the adaptive scheduling paradigm to deep neural networks. Unfortunately, deriving theoretical guarantees is far harder for complex nonconvex neural network training objectives. Without assurances about stability, the optimization landscape becomes even more treacherous.
Nevertheless, in another 2023 ICLR paper, “Learned learning rate schedules for deep neural network training using reinforcement learning”, we hypothesized that data-driven scheduling could still improve on hand-tuned learning rates and schedules. We used the reinforcement learning framework we had developed for NMF to generate schedules for computer vision and natural language processing tasks.
The automated schedules successfully reduced training time and improved generalization relative to standard heuristics such as cosine annealing. This demonstrated the empirical viability of our approach in the absence of stability guarantees. By learning online from data, the scheduler adapted to nuances of the loss landscape and gradient trajectories.
But using RL to find optimal schedules for this problem is still expensive, and it becomes more expensive as model and dataset sizes grow. So our next step was to distill our approach into a simple and usable algorithm.
The GreedyLR scheduler
At this year’s Conference on Pattern Recognition and Machine Learning (PRML), we won the best-presentation award for a lightweight learned scheduler called GreedyLR, which sets the learning rate based on recent improvements in the training loss. In comparisons with popular scheduler and optimizer combinations, GreedyLR performed equivalently or better more than 90% of the time. It also enabled faster convergence than techniques such as stochastic line search, which adjusts the learning rate by solving optimization problems during training.
In each training epoch, GreedyLR adapts the learning rate based on changes in the validation loss. Its core logic is simple: increase the learning rate if the loss improves and decrease it if the loss worsens. But GreedyLR employs additional techniques to make this greedy heuristic work well in practice (a minimal sketch of the heuristic follows the list):
- A patience parameter prevents overreaction to noisy loss fluctuations.
- A smoothing window computes a rolling average of the validation loss for more robust comparisons.
- Thresholds prevent needless updates when changes in the loss are insignificant.
- Cooldown and warmup stages continue increasing or decreasing the learning rate even if the loss trend reverses.
- Configurable upper and lower bounds on the learning rate range enable it to benefit from human intuition without sacrificing the ability to explore counterintuitive settings.
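Below is a minimal sketch of a GreedyLR-style scheduler built from these ideas. The class name, defaults, and update rule are illustrative simplifications (cooldown and warmup stages are omitted), not the exact algorithm from the paper.

```python
from collections import deque

class GreedyLRSketch:
    """Hedged sketch of a GreedyLR-style scheduler; names and defaults are illustrative."""

    def __init__(self, lr=1e-3, factor=1.1, patience=3, window=5,
                 threshold=1e-4, min_lr=1e-6, max_lr=1.0):
        self.lr = lr
        self.factor = factor                        # multiplicative step up or down
        self.patience = patience                    # epochs to wait before reacting
        self.threshold = threshold                  # ignore insignificant loss changes
        self.min_lr, self.max_lr = min_lr, max_lr   # human-set bounds on the rate
        self.losses = deque(maxlen=window)          # smoothing window
        self.best = float("inf")
        self.good_epochs = 0
        self.bad_epochs = 0

    def step(self, val_loss):
        # Compare a rolling average of the validation loss, not the raw value.
        self.losses.append(val_loss)
        smoothed = sum(self.losses) / len(self.losses)

        if smoothed < self.best - self.threshold:    # loss is improving
            self.best = smoothed
            self.good_epochs += 1
            self.bad_epochs = 0
        elif smoothed > self.best + self.threshold:  # loss is worsening
            self.bad_epochs += 1
            self.good_epochs = 0
        # Changes within the threshold are ignored entirely.

        if self.good_epochs >= self.patience:        # sustained improvement: speed up
            self.lr = min(self.lr * self.factor, self.max_lr)
            self.good_epochs = 0
        elif self.bad_epochs >= self.patience:       # sustained worsening: slow down
            self.lr = max(self.lr / self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr

# Typical use inside a training loop (illustrative):
# scheduler = GreedyLRSketch(lr=1e-3)
# for epoch in range(num_epochs):
#     val_loss = train_one_epoch_and_validate(scheduler.lr)
#     scheduler.step(val_loss)
```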
Overall, these enhancements make GreedyLR respond intelligently to trends in the loss rather than reacting impulsively. The algorithm adjusts the learning rate adaptively during training to accelerate convergence without compromising stability.
In our experiments, we found that GreedyLR is able to produce diverse, dynamic schedules, as shown in the figures below, alongside standard schedules such as linear, constant, and cosine decay that are popular today.
GreedyLR achieved faster convergence, especially for large models, making it a promising general-purpose scheduler. It also performed better than more advanced methods such as hypergradient descent, which can be considered a first-order version of GreedyLR. While hypergradient descent tries to achieve faster convergence by using gradient descent to learn one learning rate per parameter or parameter group, GreedyLR just uses a single global, reactive learning rate. This is particularly interesting because hypergradient descent needs a billion learning rates for a billion-parameter model, versus a single learning rate for GreedyLR.
Conclusion and future prospects
Together, these contributions demonstrate the potential of learned optimizers to accelerate deep learning. By automatically adapting to training dynamics, they can find more optimal solutions than human-designed algorithms that rely on rules of thumb. The ease of use and consistent gains of GreedyLR make it a compelling, general-purpose scheduler ready for broad adoption. We plan to keep improving the efficiency of our learning-based methods to further boost the productivity of deep learning practitioners.