Do large language models really need all those layers?

Large language models (LLMs) have been around for a while but have really captured the public's attention this year with the emergence of ChatGPT. LLMs are typically pretrained on massive volumes of data; recent variants are additionally tuned to follow instructions and incorporate human feedback using reinforcement learning.

A fascinating ability these LLMs demonstrate is in-context learning, where a model can learn to perform a task just from a few (or sometimes even zero) good examples provided along with a new input. Following this learning paradigm, larger LLMs also prove more capable of performing a wide range of tasks than smaller ones, when the amount of pretraining data is held fixed.

In a paper we are presenting at this year's meeting of the Association for Computational Linguistics (ACL), we examine the importance of model scale for in-context learning from the perspective of architectural interpretability. We specifically ask the question: are all LLM components really needed to perform in-context learning?


We conduct our study as a case study of the OPT-66B model, a 66-billion-parameter LLM that was open-sourced by Meta last year to serve as an open replica of GPT-3 (and was the largest publicly available decoder-only LLM at the time of our study). We found that a significant portion of the model could be discarded without affecting performance, indicating that OPT-66B, and quite likely other prominent LLMs, are undertrained.

We believe our findings are useful for helping to build more powerful LLMs by identifying (or, more generally, providing methods to identify) architectural elements that may need to be trained better.

LLM building blocks

Modern LLMs use the transformer architecture, which depends on an attention mechanism: the model learns to predict which prior tokens in the sequence it should attend to when predicting the current token.


Specifically, LLMs use multihead attention, meaning that they apply multiple attention mechanisms, or heads, in parallel. OPT-66B has 64 layers with 72 attention heads in each layer. The output of the multihead attention passes through a separate feed-forward network (FFN) at each layer.
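To make the layer structure concrete, here is a minimal toy sketch of one decoder layer (causal multihead attention followed by an FFN, with residual connections), written in plain NumPy with random weights and illustrative dimensions; a real model like OPT-66B uses learned weights, layer normalization, and far larger sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(x, n_heads, rng):
    """Causal multihead self-attention; x has shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    out = np.zeros_like(x)
    for h in range(n_heads):
        # Per-head query/key/value projections (random here, learned in a real model).
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                      for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(d_head)
        # Causal mask: each position attends only to itself and prior tokens.
        scores[np.triu_indices(seq_len, k=1)] = -np.inf
        out[:, h * d_head:(h + 1) * d_head] = softmax(scores) @ v
    return out

def ffn(x, rng):
    """Two-layer feed-forward network with ReLU, the usual 4x expansion."""
    d_model = x.shape[1]
    W1 = rng.standard_normal((d_model, 4 * d_model)) / np.sqrt(d_model)
    W2 = rng.standard_normal((4 * d_model, d_model)) / np.sqrt(4 * d_model)
    return np.maximum(x @ W1, 0) @ W2

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))   # 5 tokens, toy model width 16
h = multihead_attention(x, n_heads=4, rng=rng) + x   # attention + residual
y = ffn(h, rng) + h                                  # FFN + residual
print(y.shape)   # (5, 16): same shape in, same shape out
```

OPT-66B stacks 64 such layers, each with 72 heads instead of the 4 used here.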

Our first method of analyzing OPT-66B was to assign a score to each attention head and FFN indicating how important it was for a given task. Based on these scores, we then pruned the model.

We found that important attention heads are primarily clustered in the model's intermediate layers, and important FFNs are primarily in the later layers. The ability to perform zero-/few-shot in-context learning on 14 different natural-language-processing (NLP) datasets/tasks remained nearly intact even when up to 70% (~15.7B parameters in OPT-66B) of the attention heads were removed.
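As a toy illustration of score-based pruning (the importance scores below are random placeholders, not the paper's measured values), one can rank all heads globally and keep only the top fraction:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical importance scores for OPT-66B's layout: 64 layers x 72 heads.
scores = rng.random((64, 72))

def prune_mask(scores, frac):
    """Boolean mask keeping heads above the global `frac` quantile,
    i.e., removing the `frac` least important heads model-wide."""
    cutoff = np.quantile(scores.ravel(), frac)
    return scores > cutoff

mask = prune_mask(scores, frac=0.70)   # drop the 70% least important heads
print(round(float(mask.mean()), 2))    # fraction of heads kept, roughly 0.30
```

In the study, removing heads at this rate left zero-/few-shot in-context learning performance nearly unchanged.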

A heat map representing the attention heads' aggregated importance scores for five-shot in-context learning across 14 NLP tasks, at each layer of the OPT-66B model.

The attention heads that are important (and unimportant) for in-context learning also appear to overlap across tasks and shots. This indicates that a common task-agnostic subset of attention heads is responsible for in-context learning. We also found that up to 20% of the FFNs (~8.5B parameters) can be removed with minimal decline in zero-/few-shot in-context learning performance.
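One simple way to quantify such cross-task overlap is the Jaccard similarity between the sets of top-scoring heads for each pair of tasks. The sketch below uses random placeholder scores, so the printed overlap sits near the chance baseline (~0.18 for two random 30% subsets); the task-agnostic subset reported in the study corresponds to overlap well above that baseline.

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, n_heads, n_tasks = 64, 72, 14
# Hypothetical per-task importance scores, one row per task.
scores = rng.random((n_tasks, n_layers * n_heads))

def top_heads(s, k):
    """Indices of the k highest-scoring heads."""
    return set(np.argsort(s)[-k:])

k = int(0.30 * n_layers * n_heads)                 # top 30% of heads per task
sets = [top_heads(scores[t], k) for t in range(n_tasks)]
# Mean pairwise Jaccard overlap between tasks' important-head sets.
overlaps = [len(a & b) / len(a | b)
            for i, a in enumerate(sets) for b in sets[i + 1:]]
print(round(float(np.mean(overlaps)), 3))
```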

Our second analysis technique was to quantify the capacity of all attention heads in OPT-66B to perform a pair of task-agnostic primitive operations associated with in-context learning. These primitives are prefix matching and copying: explicitly searching for a prior occurrence of the current token in context and copying over the token that succeeded it (its suffix).

Prefix matching and copying.
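The scoring idea can be sketched as follows. This is a hypothetical simplification for illustration (the function name and exact scoring are mine, not the paper's): a head gets a high prefix-matching score if, at a repeated token, it puts its attention mass on the token that followed the earlier occurrence.

```python
import numpy as np

def prefix_matching_score(tokens, attn):
    """Average attention mass a head places on tokens that immediately
    follow an earlier occurrence of the current token.
    tokens: sequence of token ids; attn: (n, n) attention-weight matrix."""
    n = len(tokens)
    score, count = 0.0, 0
    for i in range(n):
        # Positions j whose predecessor j-1 holds the same token as position i.
        targets = [j for j in range(1, i) if tokens[j - 1] == tokens[i]]
        if targets:
            score += attn[i, targets].sum()
            count += 1
    return score / count if count else 0.0

# Repeated sequence "abcabc": an ideal induction head attends from the
# second "b" (index 4) back to the token after the first "b", i.e., "c".
tokens = list("abcabc")
attn = np.zeros((6, 6))
attn[4, 2] = 1.0
print(prefix_matching_score(tokens, attn))
```

A head that always performs this lookup perfectly would score 1.0 at every repeated token; here only one of the repeated positions does, so the score is fractional.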

Heads that specialize in these two operations were first discovered by the machine learning research company Anthropic and are called induction heads. We found that a small set of heads in OPT-66B has nontrivial scores for both primitives. We also found that these heads overlap (to varying degrees) with the task-specific important heads identified earlier. This indicates that induction heads are capable of more sophisticated behaviors associated with in-context learning, such as latent concept matching, but that they are not the only heads with such capabilities.


Our overarching observation that only a core set of attention heads and FFNs appears to matter for in-context learning indicates that OPT-66B, and quite likely other prominent LLMs, are undertrained. This also reinforces recent research questioning the efficacy of keeping the amount of pretraining data fixed when scaling up models, suggesting that the amount of pretraining data seen must be scaled hand in hand with the models themselves to attain optimal performance. It would be interesting to see how newer variants of LLMs released since our study, such as those tuned to follow instructions, fare in such analyses.
