Making large language model training more efficient

Large language models (LLMs) undergo several stages of training on mixed datasets with different distributions, including pretraining, instruction fine-tuning, and reinforcement learning from human feedback. Finding the optimal mix of data distributions across datasets is crucial to building accurate models, but it typically requires training and evaluating the model numerous times on a very large set of combinations.

At the most recent Conference on Empirical Methods in Natural Language Processing (EMNLP), my colleagues and I proposed a training framework that reduces the computational cost of using mixed data distributions to train LLMs, or other neural-network-based models, by up to 91%. At the same time, the method actually improves the quality of the resulting models.

While the default approach to optimizing data distributions involves weighting the different datasets used to train a single model, we instead train a separate model on each dataset and then merge the models to produce a composite model.

This unconventional approach won a special award for “efficient modeling, training, and inference” at EMNLP and has the potential to make large-scale training much more efficient and accessible.

Distribution-edited models

Traditional training methods (e.g., instruction fine-tuning) choose the optimal blend of training data through grid search, an exhaustive search method that simply compares results for a wide range of different weight values. This is costly not only in terms of time and resources but also in terms of flexibility: once the model is trained, it cannot be changed without incurring similar costs.

To address these limitations, we propose fine-tuning a pretrained model on data distributions that correspond to different tasks and then subtracting the parameter values of the original model from those of the fine-tuned models. We call the resulting differences in parameter values distribution vectors, and we produce a composite model by adding a weighted sum of distribution vectors to the parameters of the original model.
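In symbols, the construction looks roughly as follows. This is a minimal sketch; the mixing weights, written λ_i below, are notation introduced here for illustration rather than quoted from the paper:

    \Delta\theta_{D_i} = \theta_{D_i} - \theta,
    \qquad
    \theta_{\mathrm{DEM}} = \theta + \sum_{i=1}^{N} \lambda_i \, \Delta\theta_{D_i}

where θ denotes the parameters of the original model, θ_{D_i} the parameters of the model fine-tuned on dataset D_i, and λ_i the weight assigned to the i-th distribution vector.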


We call the resulting model a distribution-edited model (DEM) to highlight its use of weight-vector arithmetic for model editing. The weights are based on each fine-tuned model’s perplexity, a measure of how well the model predicts held-out validation data.

This approach rests on two key observations: (1) training the model separately on each dataset allows better modeling of the underlying properties of each dataset, since there is no interference from other data distributions during training; and (2) perplexity can be computed in a single forward pass over validation data, which is much more efficient than grid search. The first point helps improve model quality, and the second helps make training much more efficient.
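As a rough illustration of the second point, the following Python snippet computes a model’s perplexity on a small validation batch with a single forward pass. It is only a sketch, not the paper’s code: the stand-in model name ("gpt2") and the toy validation text are assumptions, and in the actual method the model being scored would be a candidate merged model.

    # Sketch only: perplexity of a (candidate) model on validation text,
    # computed with one forward pass. Model name and text are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; the paper works with 3B-13B LLMs
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    val_text = "Instruction: summarize the text below. Response: ..."
    batch = tokenizer(val_text, return_tensors="pt")

    with torch.no_grad():
        # Passing the inputs as labels makes the model return the mean
        # cross-entropy over all predicted tokens in a single forward pass.
        out = model(**batch, labels=batch["input_ids"])

    perplexity = torch.exp(out.loss).item()
    print(f"validation perplexity: {perplexity:.2f}")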

In more detail, here are the steps of the procedure (a short code sketch after the list illustrates steps 2 through 4):

  1. Individual distribution training: The original model is fine-tuned on each individual dataset using standard training procedures. Checkpoints, or snapshots of the model’s state after training on a particular dataset, are stored for the subsequent steps.
  2. Distribution vector computation: Distribution vectors are computed by subtracting the base model’s parameters from those of the fine-tuned models. These vectors capture the unique characteristics of each dataset.
  3. Optimization of fusion coefficients: The optimal coefficients for combining the distribution vectors are found based on perplexity on a validation set, using a single forward pass per combination.
  4. Merging of distribution vectors: A linear combination of the distribution vectors with the tuned weights creates a unified model that effectively captures the joint distribution of the different datasets.
  5. Resulting properties (flexibility and scalability): DEM enables incremental updates when new datasets are introduced, without requiring full retraining. This makes it ideal for dynamic and large-scale training scenarios.
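The sketch below makes steps 2 through 4 concrete using plain PyTorch state dictionaries. The function names, the simple grid of candidate weights, and the eval_perplexity callback are assumptions introduced for illustration; this is not the authors’ implementation.

    # Sketch (assumptions, not the authors' code): building a distribution-
    # edited model from a base checkpoint and several fine-tuned checkpoints.
    # All state-dict entries are assumed to be floating-point tensors.
    import itertools

    def distribution_vectors(base_state, finetuned_states):
        """Step 2: subtract the base parameters from each fine-tuned checkpoint."""
        return [
            {name: ft[name] - base_state[name] for name in base_state}
            for ft in finetuned_states
        ]

    def merge(base_state, vectors, weights):
        """Step 4: add a weighted sum of distribution vectors to the base model."""
        merged = {name: p.clone() for name, p in base_state.items()}
        for w, vec in zip(weights, vectors):
            for name in merged:
                merged[name] += w * vec[name]
        return merged

    def select_weights(base_state, vectors, candidate_weights, eval_perplexity):
        """Step 3: choose the weight combination with the lowest validation
        perplexity; eval_perplexity(state_dict) is assumed to run a single
        forward pass over the validation set, as in the snippet above."""
        best, best_ppl = None, float("inf")
        for weights in itertools.product(candidate_weights, repeat=len(vectors)):
            ppl = eval_perplexity(merge(base_state, vectors, weights))
            if ppl < best_ppl:
                best, best_ppl = weights, ppl
        return best, best_ppl

Because each candidate combination costs only one forward pass over the validation data rather than a training run, sweeping even a moderately sized grid of weights is far cheaper than retraining the model for each data mixture.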

With distribution-edited models (DEMs), a base model is fine-tuned on data distributions corresponding to different tasks (θD1 – θDN). The parameter values of the original model (θ) are then subtracted from those of the fine-tuned models, producing a set of distribution vectors (ΔθD1 – ΔθDN). The DEM is a composite model produced by adding a weighted sum of the distribution vectors to the parameters of the original model.

Evaluation and future work

When evaluating our approach, we focused on training LLMs of increasing size, from 3 billion to 13 billion parameters, during the instruction fine-tuning stage. Our study showed that DEM reduces training costs by up to 91% while achieving up to a 16.1% improvement in quality compared to traditional data-mixing strategies, highlighting DEM’s potential to democratize access to advanced training techniques and to benefit organizations developing neural models at scale. In addition, DEM’s flexibility means that researchers and practitioners can quickly adapt to new data requirements without sacrificing performance.



The main takeaways from the study can be summarized as follows:

  • Superior performance: DEM has been validated on popular benchmarks such as MMLU, BBH, and HELM, where it achieved up to a 16.1% improvement over data mixing on individual tasks.
  • Effectiveness across domains: Experiments on datasets such as MathQA, Super-Natural Instructions (SNI), and Chain-of-Thought (CoT) demonstrate DEM’s ability to excel across different domains.
  • Scalability: DEM shows improved performance across different model sizes (3B, 7B, and 13B), giving strong evidence that the approach scales.

The effectiveness of DEM underscores the importance of innovation in making machine learning more efficient and accessible. As the machine learning community continues to scale models and datasets, frameworks like DEM will be essential for maintaining efficiency without sacrificing performance. Future research could investigate the framework’s effectiveness in other training scenarios and its extension to other model architectures, such as encoder-decoder models or mixtures of experts.
