Differential privacy for deep learning at GPT scale

Deep learning models are data driven, and that data may contain sensitive information that requires privacy protection. Differential privacy (DP) is a formal framework for protecting the privacy of individuals in data sets, ensuring that adversaries cannot learn whether a given data record was or was not used to train a machine learning model. Applying DP in deep learning typically means limiting the contribution that each training sample makes to the model's parameter updates, an approach known as per-sample gradient clipping.
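To make the mechanism concrete, here is a minimal NumPy sketch of one DP-SGD update with per-sample gradient clipping. This is an illustration under our own simplifying assumptions (toy gradients, our own function names), not the papers' implementation.

```python
import numpy as np

def clip_per_sample(per_sample_grads, threshold):
    """Rescale each sample's gradient so its L2 norm is at most `threshold`."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, threshold / (norm + 1e-12)))
    return clipped

def dp_sgd_step(params, per_sample_grads, threshold=1.0,
                noise_multiplier=1.0, lr=0.1, rng=None):
    """One DP-SGD update: clip per-sample gradients, sum them, add Gaussian
    noise calibrated to the clipping threshold, then take an averaged step."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = clip_per_sample(per_sample_grads, threshold)
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_multiplier * threshold, size=params.shape)
    return params - lr * noisy_sum / len(per_sample_grads)

params = np.zeros(4)
grads = [np.full(4, 5.0), np.full(4, 0.1)]  # one large, one small gradient
new_params = dp_sgd_step(params, grads)
```

Clipping bounds each sample's influence on the update, which is what makes the added Gaussian noise sufficient for a formal DP guarantee.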

However, per-sample gradient clipping makes deep learning much more computationally expensive than it would otherwise be, which has prevented the development of large DP models at the scale of GPT language models, with their billions of parameters.

In 2022, at workshops at the International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NeurIPS), we presented two papers that advance DP for deep learning. In the first paper, "Automatic clipping: Differentially private deep learning made easier and stronger," we described an automatic method that improves the efficiency of tuning the gradient clipping process by an order of magnitude (say, 5 to 10 times).

Typically, gradient clipping involves an expensive ablation study to choose a clipping threshold, above which a data record's contribution to the model's parameter updates is capped or cut off. Our approach uses normalization instead, eliminating the tuning of the clipping threshold.


In the second paper, "Differentially private bias-term only fine-tuning of foundation models" (DP-BiTFiT), which won the best-paper award at the NeurIPS workshop on Trustworthy and Socially Responsible Machine Learning (TSRML), we introduced DP-BiTFiT, a parameter-efficient method for fine-tuning models under DP.

Generally, a neural network has two types of parameters: the weights, which make up more than 99% of the parameters and capture most of the information from the training data, and the biases, which shift (offset) the model output. We show that privately fine-tuning the bias terms alone is enough to achieve high accuracy under DP constraints, making DP learning 2 to 30 times faster, reducing memory use by 50% to 88%, and incurring only 1/1,000 of the communication cost in the distributed setting.

Together, these two techniques make fine-tuning a DP GPT-2 as efficient as fine-tuning a standard GPT-2 in a parameter-efficient way. We have made both methods publicly available to encourage researchers to experiment with and benefit from faster DP deep learning.

Automatic clipping

The standard deep-learning training process has a tunable hyperparameter called the learning rate, which determines the extent to which the model weights can change during updates. The per-sample gradient clipping threshold is similar, but it imposes a limit on a per-sample basis. The existing approach to DP training requires an ablation study to tune the clipping threshold and the learning rate simultaneously. As such, if K (say, five, in practice) different clipping thresholds are evaluated, the model's hyperparameter-tuning stage becomes K times more expensive.
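As a back-of-the-envelope illustration (the candidate values below are ours, not from the paper), jointly ablating learning rates and clipping thresholds multiplies the number of full training runs by K:

```python
from itertools import product

learning_rates = [1e-4, 3e-4, 1e-3]
clip_thresholds = [0.1, 0.5, 1.0, 5.0, 10.0]  # K = 5 candidate thresholds

# Every (learning rate, threshold) pair requires a full DP training run,
# so tuning costs K times more than tuning the learning rate alone.
runs = list(product(learning_rates, clip_thresholds))
print(len(runs))  # 3 learning rates x 5 thresholds = 15 runs
```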

Two ablation studies considering different learning rates and per-sample gradient clipping thresholds. Left: GPT-2's BLEU score on the E2E dataset, trained with DP-AdamW. Right: Classification accuracy of ResNet18 on the image dataset, trained with DP-SGD. The differing patterns of results illustrate the need to tune both hyperparameters simultaneously.


To solve this problem, we introduced automatic clipping, which uses per-sample gradient normalization instead of per-sample gradient clipping. This (1) eliminates the clipping threshold, (2) enlarges the small gradients that would otherwise go unclipped, and (3) admits a provable convergence analysis. Equipped with our automatic clipping, the DP stochastic gradient descent (DP-SGD) optimization algorithm has the same asymptotic convergence rate as standard (non-DP) SGD, even in the nonconvex optimization setting where deep learning operates.
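A minimal sketch of the idea, with our own function name and stability constant (the exact formulation is in the paper): each per-sample gradient is normalized rather than clipped at a tuned threshold, so there is no threshold to search over, and small gradients are scaled up rather than left untouched.

```python
import numpy as np

def normalize_per_sample(per_sample_grads, gamma=0.01):
    """Automatic clipping sketch: divide each per-sample gradient by its
    L2 norm plus a small stability constant `gamma`. Every resulting
    gradient has norm strictly below 1 (bounded sensitivity with no
    threshold to tune), and tiny gradients are enlarged, not discarded."""
    return [g / (np.linalg.norm(g) + gamma) for g in per_sample_grads]

grads = [np.full(4, 5.0), np.full(4, 1e-3)]  # one large, one tiny gradient
normalized = normalize_per_sample(grads)
```

The normalized gradients can then be summed and noised exactly as in ordinary DP-SGD, with sensitivity bounded by 1 instead of by a tuned threshold.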

Our experiments across multiple computer-vision and language tasks show that automatic clipping can achieve state-of-the-art DP accuracy at the same training speed as per-sample gradient clipping.

Performance of GPT-2 on the E2E dataset, measured by BLEU and ROUGE scores under DP and non-DP settings (higher is better). We compare full fine-tuning with automatic clipping to state-of-the-art parameter-efficient fine-tuning methods such as LoRA. Additional performance measures are included in the full paper. The best two GPT-2 models for each row are marked in bold.

DP-BitFit

The first advantage of differentially private bias-term fine-tuning (DP-BiTFiT) is that it is model-agnostic: we can apply it to any model by simply freezing all the weights during fine-tuning and updating only the bias terms. In sharp contrast, prior alternatives such as low-rank adaptation (LoRA) and adapters apply exclusively to transformers and involve the extra tuning of adaption ranks.
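In a typical framework, this amounts to a one-line parameter filter. The toy model below is ours; the selection rule (train anything named `bias`, freeze the rest) is the essence of the method.

```python
import numpy as np

# Toy "model": parameter name -> array, mimicking a framework's named parameters.
model = {
    "layer1.weight": np.zeros((512, 512)),
    "layer1.bias": np.zeros(512),
    "layer2.weight": np.zeros((512, 512)),
    "layer2.bias": np.zeros(512),
}

def bitfit_trainable(named_params):
    """DP-BiTFiT selection: freeze every weight, train only the bias terms."""
    return {name: p for name, p in named_params.items()
            if name.endswith(".bias")}

trainable = bitfit_trainable(model)
fraction = (sum(p.size for p in trainable.values())
            / sum(p.size for p in model.values()))  # tiny trainable fraction
```

Because the rule inspects only parameter names, it applies unchanged to transformers, CNNs, or any other architecture with bias terms.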

The second advantage of DP-BiTFiT is its parameter efficiency. In a study spanning a range of foundation models, we found that bias terms made up only around 0.1% of model parameters. This means that DP-BiTFiT delivers large efficiency improvements in training time, memory footprint, and communication cost in the distributed-learning setting.

Parameter efficiency of DP-BiTFiT. The last two columns show the total number of parameters and the percentage of trainable parameters. Note that DP-BiTFiT optimizes only about 0.1% of the total parameters.

The third advantage of DP-BiTFiT is its computational advantage over other parameter-efficient methods, such as DP-LoRA. Even though both approaches fine-tune about the same number of parameters, DP-BiTFiT still enjoys a major advantage in memory savings, as it does not have to store and access expensive activation tensors when computing the bias gradients; those are needed only when computing the weight gradients. We verify this rigorously through the chain rule of backpropagation, under which DP-BiTFiT has a much simpler computational graph because activations are not used.
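The memory argument can be seen directly from the backpropagation rules for a single linear layer (toy shapes and variable names are ours): the weight gradient is an outer product with the stored input activations, but the bias gradient is just the upstream gradient itself.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 8, 3
x = rng.normal(size=(batch, d_in))       # layer input: must be STORED for weight grads
g_out = rng.normal(size=(batch, d_out))  # upstream gradient w.r.t. the layer output

# Per-sample weight gradients require the activations x ...
weight_grads = np.einsum("bo,bi->boi", g_out, x)

# ... but per-sample bias gradients are the upstream gradient itself,
# so bias-only fine-tuning never needs to store or revisit activations.
bias_grads = g_out
```

This is why bias-only DP training avoids the dominant memory cost of per-sample weight-gradient computation.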

The same computational graph of backpropagation (black), with the changes introduced by three different DP procedures (red). Because DP-BiTFiT (bottom right) modifies only the model's bias terms, it requires far less computational overhead than prior approaches (left: GhostClip; top right: Opacus) and therefore has a simpler computational graph.

Empirically, we observed a significant boost in efficiency when switching from DP full fine-tuning to DP-BiTFiT, while still achieving state-of-the-art accuracy on large foundation models such as GPT-2-large, ResNet 152, RoBERTa-large, and vision transformers. For example, comparing DP-BiTFiT with DP full fine-tuning, we observe a 4- to 10-fold speedup and a 2- to 10-fold memory saving on GPT-2.

Maximum throughput and batch size under different fine-tuning methods. Left: the E2E dataset with GPT2-small/medium/large. Right: 50,000 images of 512×512 pixels with ResNet 50/101/152. The speed and memory savings offered by DP-BiTFiT are significant, especially on large models.

Acknowledgments: We would like to acknowledge our coauthors on the papers for their contributions: Sheng Zha and George Karypis. We thank Huzefa Rangwala for reviewing this post.
