Federated learning is a process in which distributed devices, each with its own store of locally collected data, contribute to a global machine learning model without transmitting the data itself. By keeping data local, federated learning both reduces network traffic and protects data privacy.
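To make the setup concrete, here is a minimal sketch of one round of a generic federated-averaging scheme. It is an illustration of federated learning in general, not the training setup used in our paper, and the linear model and helper names are placeholders:

```python
# One round of a generic federated-averaging scheme (illustration only).
import numpy as np

def local_update(global_weights, local_data, lr=0.01):
    """Each device refines the global weights on its own data and
    returns only the updated weights, never the data itself."""
    w = global_weights.copy()
    for x, y in local_data:                    # x: feature vector, y: target
        grad = 2 * x * (np.dot(w, x) - y)      # gradient of squared error
        w -= lr * grad
    return w

def federated_round(global_weights, device_datasets):
    """The server averages the devices' locally updated weights."""
    updates = [local_update(global_weights, data) for data in device_datasets]
    return np.mean(updates, axis=0)
```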
Continual learning is the process of continually updating a model as new data becomes available. The key is to avoid "catastrophic forgetting", in which model updates based on new data overwrite existing parameter settings, degrading performance on the older data.
In a paper we presented at this year's Conference on Empirical Methods in Natural Language Processing (EMNLP), we combine these two techniques with a new method for performing continual federated learning that improves on its predecessors.
One way to protect against catastrophic forgetting is to have each device retain samples of the data it has already seen. When new data comes in, it is merged with the old data, and the model is retrained on the combined dataset.
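In rough terms, the per-device update loop looks like the following sketch. The `train` and `select_samples` functions are placeholders; how samples are selected is the focus of our method, described below:

```python
def train(model, data):
    """Placeholder for the device's local training routine."""
    return model

def select_samples(data, budget):
    """Placeholder for the sample selection strategy (the subject of this post)."""
    return data[:budget]

def continual_update(model, memory, new_data, budget):
    combined = memory + new_data                # merge retained and newly acquired samples
    model = train(model, combined)              # retrain on the combined dataset
    memory = select_samples(combined, budget)   # keep at most `budget` samples for next time
    return model, memory
```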
The heart of our method is a procedure for selecting which data samples to retain. We present the procedure in two varieties: uncoordinated, in which each device selects its own samples locally, and coordinated, in which sample selection is coordinated across devices by a central server.
In experiments, we compared both sample selection methods to three predecessors. The methods' relative performance depended on how many past samples a device could store. At 50 and 100 samples, both versions of our method significantly outperformed their predecessors, with the uncoordinated method offering slightly better performance than the coordinated one.
At 20 samples, our methods again enjoyed a significant advantage over the benchmarks, but the coordinated version was the top performer. At 10 samples and fewer, the other methods began to overtake ours.
Gradient-based sample selection
For any given data, the graph of a machine learning model's loss function against its parameter settings can be imagined as a landscape, with peaks representing high-error outputs and valleys representing low-error outputs. Given the model's current parameter settings, a particular point on that landscape, the goal of the machine learning algorithm is to choose a direction leading downhill, toward lower-error outputs. The negative of that downhill direction is known as the gradient.
A common way of choosing samples for retention is to maximize the diversity of their gradients, which ensures a corresponding diversity in the types of information the samples contain. Since a gradient is simply a direction in a multidimensional space, choosing samples whose gradients sum to zero maximizes diversity: all the gradients point in different directions.
The problem of optimizing gradient diversity can be formulated as assigning each gradient a coefficient of 1 or 0, such that the coefficient-weighted sum over all gradients is as close to zero as possible. The sum of the coefficients, in turn, must equal the memory budget available for storing samples. If we have room on our device for N samples, we want N of the coefficients to be 1 and the rest to be 0.
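In symbols (the notation here is ours, for illustration): if the M candidate samples have gradients $g_1, \dots, g_M$ and the memory budget is N, the selection problem is

$$
\min_{c \in \{0,1\}^{M}} \; \Bigl\lVert \sum_{i=1}^{M} c_i\, g_i \Bigr\rVert \qquad \text{subject to} \qquad \sum_{i=1}^{M} c_i = N .
$$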
This is, however, an NP-complete problem, as solving it requires systematically trying out different combinations of N gradients. We relax this requirement so that, while the sum of the coefficients must still be N, the coefficients themselves can be fractional. This is a computationally tractable problem, since it requires only successive improvements to an initial guess. Finally, we choose the N samples with the highest coefficients.
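A minimal sketch of this relaxed selection step, assuming the per-sample gradients are available as rows of a NumPy array and using an off-the-shelf constrained solver (the optimizer in our actual implementation may differ):

```python
import numpy as np
from scipy.optimize import minimize

def select_diverse_samples(gradients, budget):
    """gradients: (M, D) array, one row per candidate sample; budget: N memory slots."""
    M = gradients.shape[0]

    def objective(c):
        s = gradients.T @ c          # coefficient-weighted sum of gradients
        return float(s @ s)          # squared norm, to be driven toward zero

    res = minimize(
        objective,
        np.full(M, budget / M),      # initial guess: uniform fractional coefficients
        bounds=[(0.0, 1.0)] * M,     # relaxation: 0 <= c_i <= 1 instead of c_i in {0, 1}
        constraints=[{"type": "eq", "fun": lambda c: c.sum() - budget}],  # sum c_i = N
        method="SLSQP",
    )
    # Keep the N samples with the largest coefficients.
    return np.argsort(res.x)[-budget:]
```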
In our experiments, this uncoordinated approach was the best-performing method of doing continual federated learning with an N of 50 or higher: each device simply optimized gradient diversity locally. Presumably, with enough bites at the apple, local sampling gives good enough coverage of the data that matters to the model as a whole.
At an N of 20, however, more careful sample selection is required, and this is where our coordinated method worked best.
Coordinated method
The coordinated method alternates between summing gradients locally and globally. First, each device finds a local set of coefficients whose weighted gradient sum is as close to zero as possible. Then it sends the aggregated gradient over all its local samples, along with the computed coefficients, to a central server. Aggregating gradients, rather than sending them individually, protects against potential attacks that try to reverse-engineer locally stored data from its gradients.
Usually, the local choice of coefficients does not yield a sum of exactly zero. The central server considers the remaining, nonzero sums from all the devices and calculates the minimal change to them that will produce a global sum of zero. The adjusted sums are then sent back to the devices as new, nonzero targets for optimization.
This process can be repeated as many times as necessary, but in our experiments, we found that a single iteration was generally enough to achieve a global sum very close to zero. After the final iteration, each device selects the data samples with the N largest coefficients.
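A sketch of one round of this exchange, under the same assumptions as the earlier snippet (per-sample gradients as NumPy arrays, an off-the-shelf solver standing in for the actual optimizer); subtracting the average residual is the minimal least-squares adjustment that makes the global sum zero:

```python
import numpy as np
from scipy.optimize import minimize

def local_coefficients(gradients, budget, target=None):
    """Optimize fractional coefficients so the weighted gradient sum approaches
    `target` (zero by default); return the coefficients and the achieved sum."""
    M = gradients.shape[0]
    target = np.zeros(gradients.shape[1]) if target is None else target

    def objective(c):
        diff = gradients.T @ c - target
        return float(diff @ diff)

    res = minimize(
        objective,
        np.full(M, budget / M),
        bounds=[(0.0, 1.0)] * M,
        constraints=[{"type": "eq", "fun": lambda c: c.sum() - budget}],
        method="SLSQP",
    )
    return res.x, gradients.T @ res.x

def coordinated_round(device_gradients, budget):
    # Step 1: each device optimizes locally toward zero and reports only its
    # aggregated (summed) gradient and coefficients, not per-sample gradients.
    residuals = np.array([local_coefficients(g, budget)[1] for g in device_gradients])
    # Step 2: the server shifts each residual by the average of all residuals,
    # so the shifted targets sum to zero across devices.
    targets = residuals - residuals.mean(axis=0)
    # Step 3: each device re-optimizes toward its new nonzero target and
    # keeps the samples with the N largest coefficients.
    selections = []
    for g, t in zip(device_gradients, targets):
        c, _ = local_coefficients(g, budget, target=t)
        selections.append(np.argsort(c)[-budget:])
    return selections
```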
As baselines for our experiments, we used three previous sampling strategies. One was a naive uniform sampling method, which simply samples from all the data currently on the device; the other two used weighted sampling to try to ensure a better balance between previously seen and newly acquired data.
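For illustration, the baselines amount to something like the following sketch; the exact weighting schemes used by the prior methods differ, and the even old/new split here is only an assumed example:

```python
import random

def uniform_sample(old_data, new_data, budget):
    # Sample uniformly from everything currently on the device.
    pool = old_data + new_data
    return random.sample(pool, min(budget, len(pool)))

def weighted_sample(old_data, new_data, budget):
    # Reserve roughly half the budget for previously seen data, half for new data.
    k_old = min(len(old_data), budget // 2)
    k_new = min(len(new_data), budget - k_old)
    return random.sample(old_data, k_old) + random.sample(new_data, k_new)
```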
At an N of 10, the random-sampling approaches caught up with our approach, and at an N of 5, they surpassed it. But in practice, distributed devices are often able to store more than five or ten samples, and our paper provides a guide to optimizing the sample selection strategy for device capacity.