Large language models (LLMs) have demonstrated impressive capabilities across a range of tasks, but as several cases have made clear, they carry the risk of producing inappropriate, unsafe, or biased output. When generating responses, a successfully trained LLM should comply with a set of policies specified by its creator; for example, the developer may restrict the LLM from generating toxic responses. We refer to this as attribute control, since it regulates an attribute of the LLM's output.
In a paper we presented at EMNLP 2024, we propose a new method for training an LLM to comply with a set of constraints while retaining its performance. We first define a successfully trained LLM as one that satisfies the following criteria: (1) control satisfaction: the LLM's output complies with the creator-specified policy in most cases; (2) utility preservation: the LLM maintains performance comparable to that of the original LLM on utility benchmarks; and (3) training efficiency: the cost of fine-tuning with attribute control is comparable to that of typical fine-tuning.
Our work is inspired by the classic ideas of constraint-driven learning and posterior regularization, in which the model's output is forced to comply with a particular distribution. Specifically, we train an auxiliary model to control a specific output attribute, in this case toxicity. During fine-tuning, the auxiliary model estimates the closest distribution that, given the current state of the LLM, satisfies the constraints, and it penalizes the gap between this estimate and the LLM's current distribution.
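To make this concrete, below is a minimal PyTorch-style sketch of a loss in this spirit: the usual language-modeling cross-entropy plus a KL penalty between the LLM's current token distribution and the constrained distribution estimated by the auxiliary model. The function name, the KL direction, and the weighting are illustrative assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def regularized_lm_loss(lm_logits, target_ids, constrained_logits, kl_weight=0.1):
    """Language-modeling cross-entropy plus a penalty on the gap between
    the LLM's distribution and the auxiliary model's constrained estimate
    (the weighting and KL direction here are illustrative)."""
    vocab = lm_logits.size(-1)
    # Standard token-level cross-entropy on the fine-tuning data.
    ce = F.cross_entropy(lm_logits.reshape(-1, vocab), target_ids.reshape(-1))
    # KL(q || p): q is the constrained estimate, p is the current LLM distribution.
    log_p = F.log_softmax(lm_logits, dim=-1)
    log_q = F.log_softmax(constrained_logits, dim=-1)
    kl = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
    return ce + kl_weight * kl

# Toy usage with random tensors standing in for real model outputs.
batch, seq_len, vocab = 2, 8, 100
lm_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
constrained_logits = torch.randn(batch, seq_len, vocab)  # from the auxiliary model
target_ids = torch.randint(0, vocab, (batch, seq_len))
loss = regularized_lm_loss(lm_logits, target_ids, constrained_logits)
loss.backward()
```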
The natural way to do this is to iteratively push the LLM closer to the feasible generation region, making the estimate progressively more accurate. However, this approach is sequential and incurs significant overhead in training time. In our paper, we also present a parallelized algorithm that updates the base LLM and the regularizer simultaneously, each based on the other's state from the previous iteration. Empirically, parallelization achieves the same level of performance as sequential fine-tuning, and its time complexity is the same as that of typical, unregularized fine-tuning.
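The difference between the two schedules can be sketched with plain-Python placeholders; `update_llm` and `update_regularizer` below are hypothetical stand-ins for one optimization step each, not the paper's implementation.

```python
def update_llm(llm_state, regularizer_state, batch):
    # Placeholder for one gradient step on the regularized loss.
    return llm_state + 1

def update_regularizer(llm_state, regularizer_state, batch):
    # Placeholder for re-estimating the closest feasible distribution
    # given a (possibly stale) snapshot of the LLM.
    return regularizer_state + 1

def sequential_finetune(llm, reg, batches):
    for batch in batches:
        reg = update_regularizer(llm, reg, batch)  # must finish first ...
        llm = update_llm(llm, reg, batch)          # ... before this step runs
    return llm, reg

def parallel_finetune(llm, reg, batches):
    for batch in batches:
        # Both updates read only the previous iteration's states, so they
        # can run concurrently; per-step cost matches unregularized tuning.
        new_reg = update_regularizer(llm, reg, batch)
        new_llm = update_llm(llm, reg, batch)
        llm, reg = new_llm, new_reg
    return llm, reg

print(sequential_finetune(0, 0, range(3)), parallel_finetune(0, 0, range(3)))
```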
We also explore adaptive regularization (i.e., applying a domain-specific regularizer to the relevant portions of the training data) to improve performance and prevent catastrophic forgetting.
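As a toy illustration of what adaptive regularization means in practice, one can gate the regularization term by the domain each training example was drawn from; the domain tags and weights below are assumptions for the sake of the example, not values from the paper.

```python
# Apply the toxicity regularizer only to examples from the toxic portion
# of the training mix; leave general-domain data (e.g., WikiText) unregularized.
REGULARIZER_WEIGHT = {"toxic": 1.0, "wikitext": 0.0}  # illustrative weights

def loss_for_example(example, lm_loss, kl_penalty):
    """Combine the LM loss with a domain-dependent regularization penalty."""
    weight = REGULARIZER_WEIGHT.get(example["domain"], 0.0)
    return lm_loss + weight * kl_penalty

print(loss_for_example({"domain": "toxic"}, lm_loss=2.3, kl_penalty=0.4))
print(loss_for_example({"domain": "wikitext"}, lm_loss=2.3, kl_penalty=0.4))
```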
Utility is preserved
In experiments, we fine-tuned Llama-7B and Falcon-7B models on a blended corpus that combined toxic data (data containing toxic responses) and WikiText (a general corpus) in equal proportions. With the adaptive regularizer, our approach generally retained performance better than standard reinforcement learning (RL) and filtering methods while meeting the toxicity control standards.
Benchmark Performance of Llama-7B and Falcon-7B with toxicity control
| Model | Method | Toxicity (lower is better) | MMLU (5-shot) (higher is better) | Commonsense reasoning (0-shot) (higher is better) |
| --- | --- | --- | --- | --- |
| Llama-7B | Baseline | 23 | 35.1 | 75.6 |
| | Filtering | 21.9 | 34.6 | 75.1 |
| | RL | 15.2 | 33.6 | 73.2 |
| | NADO decoding | 15.2 | 31.1 | 71.4 |
| | Ours w/o adaptive regularizer | 15.2 | 30.4 | 71.9 |
| | Ours w/ adaptive regularizer | 14.2 | 33.9 | 73.6 |
| Falcon-7B | Baseline | 14 | 27.2 | 76.1 |
| | Filtering | 13.6 | 26.4 | 74.9 |
| | RL | 9.8 | 25.4 | 74.4 |
| | NADO decoding | 7.3 | 23.6 | 72.5 |
| | Ours w/o adaptive regularizer | 7.1 | 23.1 | 71.8 |
| | Ours w/ adaptive regularizer | 7.3 | 26.1 | 74.5 |
Generation quality is preserved
Sequences produced by our model were indistinguishable in quality from those produced by the base model when OPT-30B acted as the judge, showing that our method preserves generation quality. Our model's outputs were also preferred over those of models trained with the filtering and RL approaches.
Win rates in pairwise comparisons (judged by OPT-30B)

| Win rate of column model vs. row model | Base | Filtering | RL | Ours |
| --- | --- | --- | --- | --- |
| Base | N/A | 44.3 | 45.1 | 51.4 |
| Filtering | 55.7 | N/A | 53.4 | 61.6 |
| RL | 54.9 | 46.6 | N/A | 61.3 |
| Ours | 48.6 | 38.4 | 38.7 | N/A |
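For reference, win rates like those in the table above can be tallied from a judge model's pairwise verdicts as in the sketch below; `judge_prefers_a` is a hypothetical stand-in for a call to the judge (here, OPT-30B comparing two continuations), not code from the paper.

```python
def win_rate(pairs, judge_prefers_a):
    """Percentage of prompts on which model A's output is preferred by the judge."""
    wins = sum(1 for prompt, out_a, out_b in pairs
               if judge_prefers_a(prompt, out_a, out_b))
    return 100.0 * wins / len(pairs)

# Toy usage: a stand-in judge that simply prefers the shorter output.
toy_pairs = [("p1", "short", "a much longer answer"),
             ("p2", "also short", "long"),
             ("p3", "tiny", "huge response here")]
print(win_rate(toy_pairs, lambda p, a, b: len(a) < len(b)))
```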
Toxicity classification and generation
One of the most interesting aspects of our method is that it allows the LLM to learn from toxic content. In experiments, we fine-tuned Llama-7B models on a toxicity classification task using the Jigsaw dataset, which contains toxic content. With standard supervised fine-tuning, the model's performance on the classification task improved, but the increased exposure to toxic content made it more likely to generate toxic content itself. With our method, by contrast, performance on the classification task improved while generation toxicity was reduced.
Jigsaw performance of the Llama-7B model with toxicity control

| Model | API toxicity (lower is better) | Classification ROC AUC (higher is better) |
| --- | --- | --- |
| Baseline | 0.315 | 0.910 |
| SFT (LM loss) | 0.344 | 0.966 |
| Ours (LM loss) | 0.288 | 0.959 |
| SFT (classification) | 0.314 | 0.972 |
Acknowledgments: I would like to acknowledge our intern, Tao Meng (UCLA), who led the work on this paper, and our co-authors, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, and Rahul Gupta, for their contributions.