A better training method for reinforcement learning with human feedback

Reinforcement learning with human feedback (RLHF) is the standard method for aligning large language models (LLMs) with human preferences, such as preferences for nontoxic language and factually accurate responses. Recently, one of the most popular RLHF methods has been direct preference optimization, in which the LLM chooses between two output options, one of which has been labeled as preferred by human annotators.

With direct preference optimization (DPO), however, as with other similar direct-alignment algorithms, LLMs run the risk of learning spurious correlations from the data. In toxicity datasets, for example, it is common for the serious, thoughtful responses to be longer than the offensive responses. Under RLHF, an LLM could thus learn to prefer longer responses over shorter ones, which may not be preferable in general.
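To make this kind of length bias concrete, here is a minimal Python sketch that checks whether the preferred responses in a preference dataset are systematically longer than the rejected ones. The "chosen" and "rejected" field names and the toy data are illustrative assumptions, not part of any particular pipeline.

```python
# Minimal sketch: check whether "chosen" responses in a preference dataset
# are systematically longer than "rejected" ones (a possible spurious cue).
# The field names and toy data below are illustrative assumptions.

def mean(xs):
    return sum(xs) / len(xs)

def length_bias(pairs):
    """pairs: list of dicts with 'chosen' and 'rejected' response strings."""
    chosen_lens = [len(p["chosen"].split()) for p in pairs]
    rejected_lens = [len(p["rejected"].split()) for p in pairs]
    # Fraction of pairs in which the chosen response is the longer one.
    longer_frac = mean([1.0 if c > r else 0.0
                        for c, r in zip(chosen_lens, rejected_lens)])
    return mean(chosen_lens), mean(rejected_lens), longer_frac

pairs = [
    {"chosen": "Thank you for raising this; here is a careful answer.",
     "rejected": "That is a dumb question."},
    {"chosen": "I understand your concern, and I'd suggest the following steps.",
     "rejected": "Just deal with it."},
]
print(length_bias(pairs))  # chosen responses are longer in both toy pairs
```

If the reported fraction is close to 1, a model trained on such pairs could learn to reward length itself rather than the quality the annotators actually cared about.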

At this year’s International Conference on Learning Representations (ICLR), we presented a method for limiting such spurious correlations, which we call SeRA, for self-reviewing and alignment. First, after the initial round of RLHF on human-annotated data, we use the LLM itself to generate additional training examples. Then we use the LLM’s output probabilities to gauge the strength of preference within each training pair, keeping only those pairs in which the preferred response is strongly preferred.

To evaluate our approach, we compared a model trained using SeRA with three baseline models on four benchmark datasets. For each test input, we compared our model’s output with that of each of the baselines, using an off-the-shelf LLM to choose the better response. The SeRA-trained model’s win rate in these pairwise comparisons was higher than all three baselines’ across the board, sometimes by as much as 20% to 40%.
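For readers who want a picture of how such a pairwise evaluation works, here is a hedged sketch; judge_prefers_first is a hypothetical stand-in for an off-the-shelf LLM judge, not a real API.

```python
import random

def judge_prefers_first(prompt, response_a, response_b):
    # Hypothetical placeholder: a real implementation would send the prompt
    # and both responses to an off-the-shelf LLM judge and parse its verdict.
    return random.random() < 0.5

def win_rate(prompts, our_outputs, baseline_outputs):
    """Fraction of test inputs on which the judge prefers our model's output."""
    wins = sum(
        judge_prefers_first(p, ours, theirs)
        for p, ours, theirs in zip(prompts, our_outputs, baseline_outputs)
    )
    return wins / len(prompts)
```

The same win-rate computation is repeated once per baseline, giving one pairwise score per comparison.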

Direct preference optimization

Reinforcement learning is a trial-and-error method in which an agent interacts with the world and, depending on the actions it takes, receives greater or lesser rewards. Over time, the agent tries to learn a policy that maximizes its cumulative reward.
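In standard notation (our own shorthand, not taken from the text above), the agent seeks a policy π that maximizes the expected discounted sum of rewards over trajectories τ it generates, where r(s_t, a_t) is the reward for taking action a_t in state s_t and γ is a discount factor:

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad \pi^{*} = \arg\max_{\pi} J(\pi)
```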

In classic reinforcement learning, the interaction with the world can be literal: a robot, for example, might receive a large reward for successfully navigating to a prescribed location and a negative reward for colliding with an obstacle. In RLHF, however, the reward depends on how well an LLM’s output accords with a paradigm case specified by a human.

With traditional RLHF, the rewards are calculated by a separate reward model, which is also trained on human-annotated data. But this is a time-consuming approach that does not scale well. With DPO, there is no need for a second model: the LLM receives the reward if it chooses the human-preferred output and not if it doesn’t.
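As a rough sketch of how DPO turns that choice into a training signal, the PyTorch snippet below computes the standard DPO loss from sequence log-likelihoods. It is a minimal illustration under our own assumptions, not the exact training code behind this work.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a batch of preference pairs.

    Each argument is a tensor of sequence log-likelihoods, log pi(y | x),
    under the policy being trained (logp_*) or a frozen reference model
    (ref_logp_*). beta scales the implicit reward.
    """
    # Implicit rewards are the log-likelihood ratios against the reference model.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    # The loss pushes the chosen response's implicit reward above the rejected one's.
    margin = reward_chosen - reward_rejected
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-likelihoods for two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -9.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.5, -9.2]))
print(loss.item())
```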

The disadvantage of DPO is that it treats all the training pairs alike: the reward is the same whether the preferred output is strongly preferred or only mildly preferred. This increases the chances that the model will learn spurious correlations.

If, for example, choosing highly toxic responses incurred a greater penalty than choosing mildly toxic responses, the model could infer that toxicity, and not response length, was the relevant feature of the training data. DPO irons out these differences; SeRA reintroduces them.

SeRA

With SeRA, we begin by performing conventional DPO using a dataset of human-annotated example pairs. After this first pass through the data, the LLM has learned something about the types of outputs that humans prefer.

We then use the updated model to generate a new set of training examples. For each generated response pair, we assign each response a preference score based on the updated model’s likelihood of generating that response. We then keep only the pairs in which the preferred response’s score is significantly higher than the non-preferred response’s.

With SeRA (self-reviewing and alignment), the updated model generates a new response pair (a winner, or y_w, and a loser, or y_l) for each sample input (x). Each response receives a preference score based on the updated model’s likelihood of generating it. Pairs in which the score of the preferred response is significantly higher than that of the non-preferred response (green) are retained; the others (red) are discarded.
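A minimal sketch of this scoring-and-filtering step follows. It assumes the preference score is a DPO-style implicit reward (the scaled log-likelihood ratio between the updated model and a reference model) and uses an illustrative margin threshold; both are our assumptions for the example.

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Preference score for one response: the scaled log-likelihood ratio
    between the updated policy and the reference model (an assumption,
    in the spirit of DPO's implicit reward)."""
    return beta * (logp_policy - logp_ref)

def keep_pair(score_preferred, score_rejected, margin_threshold=1.0):
    """Retain a pair only if the preferred response is scored sufficiently
    higher than the non-preferred one; the threshold is a hyperparameter."""
    return (score_preferred - score_rejected) > margin_threshold

# Toy example with made-up log-likelihoods for one generated pair.
s_w = implicit_reward(logp_policy=-8.0, logp_ref=-20.0)    # preferred response
s_l = implicit_reward(logp_policy=-18.0, logp_ref=-15.0)   # rejected response
print(keep_pair(s_w, s_l))  # True: margin of 1.5 exceeds the threshold of 1.0
```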

Using the same metric, we also filter the data in the original, human-annotated dataset. Then we combine filtered samples from the original dataset with filtered samples from our new, generated dataset and perform DPO again. This process repeats, with the generated samples making up a larger and larger fraction of the dataset, until model performance converges.
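Putting the steps together, the overall procedure might be organized like the skeleton below. The callables generate_pairs, pair_margin, and run_dpo are hypothetical stand-ins for the generation, scoring, and DPO-training steps described above, and the way the generated pool grows each round is our own simplification.

```python
def sera_style_loop(model, ref_model, human_pairs,
                    generate_pairs, pair_margin, run_dpo,
                    num_rounds=4, margin_threshold=1.0):
    """Skeleton of the iterative procedure sketched in this post.

    The three callables are placeholders: generate_pairs(model, n) should
    return model-generated preference pairs, pair_margin(model, ref_model, p)
    the implicit-reward margin of a pair, and run_dpo(model, ref_model, data)
    an updated model after one round of DPO.
    """
    generated_pool = []
    for round_idx in range(num_rounds):
        # Generate new pairs with the current model and add them to the pool,
        # so generated data makes up a growing share of the training mix.
        generated_pool += generate_pairs(model, len(human_pairs))

        kept_human = [p for p in human_pairs
                      if pair_margin(model, ref_model, p) > margin_threshold]
        kept_generated = [p for p in generated_pool
                          if pair_margin(model, ref_model, p) > margin_threshold]

        model = run_dpo(model, ref_model, kept_human + kept_generated)
        # In practice, stop early once validation performance converges.
    return model
```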

The intuition here is that if a dataset is designed to capture some intended contrast but also contains spurious correlations, then the intended contrast (between, say, toxic and nontoxic responses) will be stronger than the unintended contrast (between, say, long and short responses).

This intuition held for the four benchmark datasets we used to evaluate our method, and we believe it is plausible for other spurious correlations as well. But there may be cases where it does not hold, so in applications of the SeRA method, the model’s convergence behavior should be monitored.

While we used DPO in our experiments, we also demonstrate in our paper how to generalize our method to other direct-alignment algorithms. Finally, there is a certain risk that, when we use model-generated data to train a model, we could get into a feedback loop in which the model overamplifies some aspect of the initial dataset. As a consequence, in each pass through the data, the model’s reward is based not only on the current iteration but also on previous iterations, to ensure continuity in the characteristic features of the training data.
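One simple way to realize that idea, sketched here under our own assumptions rather than taken verbatim from the paper, is to average the implicit-reward margins computed by the current checkpoint and earlier ones when deciding which pairs to keep:

```python
def ensemble_margin(per_iteration_scores):
    """per_iteration_scores: list of (preferred_score, rejected_score) tuples,
    one per model iteration (the current checkpoint plus earlier ones).
    Returns the average implicit-reward margin across iterations, so that
    filtering reflects previous models as well as the current one."""
    margins = [w - l for w, l in per_iteration_scores]
    return sum(margins) / len(margins)

# Example: margins of one generated pair under three successive checkpoints.
print(ensemble_margin([(2.1, 0.4), (1.8, 0.9), (2.5, 0.2)]))  # -> about 1.63
```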

Acknowledgments: Sravan Bodapati
