ICLR: Why does Deep Learning work and what are its limits?

At this year’s International Conference on Learning Representations (ICLR), René Vidal, a professor of radiology and electrical engineering at the University of Pennsylvania and an Amazon Scholar, was a senior area chair, overseeing a team of reviewers charged with evaluating paper submissions. And the topic of the papers on which his team focused, says Vidal, was the theory of deep learning.

René Vidal, Rachleff University Professor at the University of Pennsylvania, holds joint appointments in the School of Medicine’s Department of Radiology and the Department of Electrical and Systems Engineering. He is a Penn Integrates Knowledge University Professor and an Amazon Scholar.

“While representation learning and deep learning have been incredibly successful and have produced spectacular results in many application domains, deep networks remain black boxes,” Vidal explains. “How to design deep networks remains an art; there is a lot of trial and error on every single data set. So broadly, the goal of the mathematics of deep learning is to have theorems, mathematical proofs that guarantee the performance of deep networks.

“You can ask questions like ‘Why is it the case that deep networks generalize from one data set to another?’ Can you get a theorem that tells you the classification error on a new data set versus the classification error on the training data set?

“Then there are questions that pertain to optimization. These days you minimize a loss function over sometimes billions of parameters. And because the optimization problems are so large and you have so many training examples, for computational reasons you are limited to very simple optimization methods. These are nonconvex problems: can you understand what you converge to?”
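At that scale, the “very simple” methods in question are variants of stochastic gradient descent, where each update touches only a small minibatch of the training examples. A minimal sketch on a toy nonconvex problem (the sine-fitting task and all hyperparameters here are invented for illustration); which minimum it converges to depends on the initialization, which is exactly the kind of open question Vidal describes:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny nonconvex least-squares problem: fit y = a * sin(b * x).
x = rng.uniform(-3, 3, 200)
y = 1.5 * np.sin(2.0 * x) + 0.05 * rng.standard_normal(200)

a, b = 1.0, 1.5  # which minimum we reach depends on this initialization
lr, batch = 0.05, 16
for step in range(3000):
    idx = rng.integers(0, len(x), batch)  # minibatch, for computational reasons
    xb, yb = x[idx], y[idx]
    resid = a * np.sin(b * xb) - yb
    # Stochastic gradients of the squared error with respect to a and b.
    a -= lr * np.mean(2 * resid * np.sin(b * xb))
    b -= lr * np.mean(2 * resid * a * xb * np.cos(b * xb))

print(f"a={a:.2f}, b={b:.2f}")
```

Because the loss is nonconvex in `b`, restarting from a different initialization can land in a different critical point, which is why theory about what SGD converges to is hard to come by.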

Double descent

In particular, says Vidal, two topics in the theory of deep learning have drawn increased attention recently. The first is the so-called double-descent phenomenon. The conventional wisdom in AI used to hold that the size of a neural network had to be carefully tailored to both the problem it addressed and the amount of available training data. If the network was too small, it couldn’t learn complex patterns in the data; but if it got too big, it could simply memorize the right answer for every item in its training set, a particularly insidious case of overfitting, and it would not generalize to new inputs.


As a consequence, for a given problem and a given set of training data, as the size of a neural network grows, its error rate on previously unseen data from the test set falls. At some point, however, the error rate begins to rise again as the network begins to overfit the data.

In the last few years, however, a number of papers have reported the surprising result that as the network continues to grow, the error rate eventually goes back down. This is the double-descent phenomenon, and no one is sure why it happens.

“The error drops as the size of the model grows, and then it goes back up when the model overfits,” Vidal explains. “And it reaches a peak at the so-called interpolation threshold, which is exactly the point at which, during training, you can achieve zero error, because the network is large enough to memorize the data. But from then on, the test error goes down again. There have been many papers trying to explain why this is happening.”

The neural tangent kernel

Another interesting trend in the theory of deep networks, says Vidal, involves new forms of analysis based on the neural tangent kernel.


Machine learning systems often operate on “features” extracted from input data. In a natural-language-understanding system, for example, the features might include words’ parts of speech, as assessed by an automatic syntactic parser, or whether a sentence is in the active or passive voice.

“In the past, say, the year 2000, the way we learned was by using so-called kernel methods,” Vidal explains. “Kernel methods are based on taking your data and embedding it, with a fixed embedding, in a very high-dimensional space where everything becomes linear. You can then use classical linear learning techniques in the embedding space, but the embedding itself was fixed.
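Kernel ridge regression is the textbook instance of what Vidal describes: the kernel is an inner product in a fixed, implicit embedding space, and the learning that happens there is linear. A minimal sketch (the toy data and hyperparameters are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data.
x = rng.uniform(-2, 2, 50)
y = np.sign(np.sin(2 * x))

def rbf_kernel(a, b, gamma=2.0):
    """Fixed RBF kernel: an inner product in a fixed, implicit embedding space."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Kernel ridge regression: linear learning in the (fixed) embedding space.
lam = 1e-3
K = rbf_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict(x_new):
    return rbf_kernel(x_new, x) @ alpha

train_mse = np.mean((predict(x) - y) ** 2)
print(f"train MSE: {train_mse:.4f}")
```

The only learned objects are the linear coefficients `alpha`; the embedding induced by `rbf_kernel` never changes, which is exactly the limitation deep learning removes.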

“You can think of deep learning as learning the embedding of the input data into a high-dimensional space. In fact, that is precisely representation learning. The neural-tangent-kernel regime is a type of initialization, a type of neural network, and a type of training such that you can approximate the learning dynamics of a deep network using kernels.
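In this regime, the kernel in question is the inner product of the network’s parameter gradients at two inputs. A minimal sketch of computing the empirical neural tangent kernel of a one-hidden-layer network at a random initialization (the architecture, activation, and width are illustrative choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000  # hidden width; NTK theory studies the limit m -> infinity

# Random initialization of a one-hidden-layer net f(x) = v . tanh(w*x + b) / sqrt(m).
w = rng.standard_normal(m)
b = rng.standard_normal(m)
v = rng.standard_normal(m)

def param_gradient(x):
    """Gradient of f(x) with respect to all parameters (w, b, v), flattened."""
    h = np.tanh(w * x + b)
    dh = 1.0 - h ** 2          # derivative of tanh
    dv = h / np.sqrt(m)
    dw = v * dh * x / np.sqrt(m)
    db = v * dh / np.sqrt(m)
    return np.concatenate([dw, db, dv])

def ntk(x1, x2):
    """Empirical neural tangent kernel at initialization."""
    return param_gradient(x1) @ param_gradient(x2)

print(ntk(0.5, 0.5), ntk(0.5, -0.3))
```

When the width is large and the weights barely move during training, this kernel stays approximately constant, and the network’s training dynamics reduce to kernel regression with `ntk`, which is what makes the regime analyzable.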

“That regime is very unrealistic: networks with infinite width, or initializations such that you don’t change the weights too much during training. In this highly idealized and specialized setting, things are easier, and we can understand them. But at some point we need to let go of these unrealistic assumptions and acknowledge that the problem is hard: you want the weights to change dramatically during training, because if they don’t, you don’t learn much.”


In fact, Vidal has engaged with this topic himself, in a paper accepted to this year’s Conference on Artificial Intelligence and Statistics (AISTATS), whose co-authors include members of his old research team from Johns Hopkins University.

“The three assumptions we try to get rid of are, one, can we get theorems for networks with finite width, as opposed to infinite width?” says Vidal. “Number two is, can we get theorems for gradient-descent-like methods that have a finite step size? Because many previous theorems assumed a really teeny little step size, like, infinitesimally small. And the third assumption we relax is that the initialization can be much more general.”

The limits of representation learning

When ICLR was founded, in 2013, it was a forum for researchers to explore alternatives to machine learning methods, such as kernel methods, that represented data in fixed, predetermined ways. But now deep learning, which uses learned representations, has taken over the field of machine learning, and the difference between ICLR and the other major machine learning conferences has shrunk.

However, as someone who spent 20 years as a professor of biomedical engineering at Johns Hopkins, Vidal is keenly aware of the limitations of representation learning. For some applications, he says, domain knowledge is still important.


“That happens in domains where data or labels may not be abundant,” he explains. “This is the case, for example, in the medical domain, where there may be only 100 patients in a study, or where you may not be able to put the data on a website where everyone can annotate it.

“Just to give you a concrete example, I had a project where we needed to analyze a blood test, and we had to classify white blood cells into different types. Nobody is ever going to take images of millions of cells, and you don’t want a pathologist annotating each and every object the way we do in computer vision.

“So all we could get was the actual result of the blood test: what are the concentrations? And you may have a million cells in class one, class two, and class three, and you just have these very weak labels. But the domain experts said, we can perform cell sorting by applying centrifugation and I don’t know what, and then we get cells of only one type in this sample.
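The setup Vidal sketches, where only class concentrations are observed, is what the weak-supervision literature calls learning from label proportions. A minimal illustrative sketch (the two-cluster “cells,” the bag sizes, and the proportions are invented for the example, not taken from the project he describes): a logistic model is fit so that its average prediction in each bag matches that bag’s known proportion.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bag(n, prop):
    """A 'bag' of cells: mix of two 2-D clusters; only `prop` is observed."""
    n1 = int(n * prop)
    return np.vstack([rng.normal(0, 1, (n - n1, 2)),
                      rng.normal(2.5, 1, (n1, 2))])

# Three bags (e.g., fractions from centrifugation) with known type-1 proportions.
props = np.array([0.1, 0.5, 0.9])
bags = [make_bag(300, p) for p in props]

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Gradient descent on sum over bags of (mean prediction - known proportion)^2.
w, b = np.zeros(2), 0.0
lr = 0.5
for _ in range(2000):
    gw, gb = np.zeros(2), 0.0
    for x, p in zip(bags, props):
        pred = sigmoid(x @ w + b)
        err = pred.mean() - p
        grad = pred * (1 - pred) / len(x)  # d mean(pred) / d logits
        gw += 2 * err * (x * grad[:, None]).sum(axis=0)
        gb += 2 * err * grad.sum()
    w -= lr * gw
    b -= lr * gb

fitted = [sigmoid(x @ w + b).mean() for x in bags]
print("bag proportions:", props, "fitted:", np.round(fitted, 2))
```

No individual cell is ever labeled; the bag-level concentrations alone are enough to push the model toward separating the two clusters, which is the kind of domain-knowledge-driven supervision Vidal is pointing at.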

“If you staff things with 100% people who are all data scientists and machine learning people, they tend to believe that all you need is a larger network and more data. But I think, as at Amazon, where you have to think about the customer, you need to recognize that solving a real problem is not always about more data and more annotations.”
