From Llama 3 70B to 120B: How to Self-Augment an LLM?
Experiments with Llama 3 8B: Removing, duplicating, and reordering layers
Meta only released two versions of Llama 3: 8B and 70B. Another version, called Llama 3 120B, has also appeared and has drawn a lot of discussion on social networks for its surprising performance on specific tasks such as creative writing.
This “Llama 3 120B” is not a product of Meta but a version made by Maxime Labonne with Mergekit. In a previous article, I presented Mergekit and showed how to use it to create powerful LLMs by simply merging several LLMs into one:
However, Llama 3 120B is not a merge of several LLMs. It was made from a single LLM, Llama 3 70B, by simply duplicating 60 of its 80 layers. Despite its simplicity, this process seems to have worked relatively well.
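To give an idea of what this looks like in practice, here is a minimal sketch of a Mergekit "passthrough" configuration that duplicates layers of Llama 3 70B by stacking overlapping layer ranges. The model name and the exact layer ranges below are illustrative assumptions; the configuration used for the published Llama 3 120B may differ.

```yaml
# Sketch of a passthrough self-merge: each slice copies a range of layers
# from the same model, and overlapping ranges duplicate the shared layers.
slices:
  - sources:
      - model: meta-llama/Meta-Llama-3-70B-Instruct
        layer_range: [0, 20]
  - sources:
      - model: meta-llama/Meta-Llama-3-70B-Instruct
        layer_range: [10, 30]
  - sources:
      - model: meta-llama/Meta-Llama-3-70B-Instruct
        layer_range: [20, 40]
  - sources:
      - model: meta-llama/Meta-Llama-3-70B-Instruct
        layer_range: [30, 50]
  - sources:
      - model: meta-llama/Meta-Llama-3-70B-Instruct
        layer_range: [40, 60]
  - sources:
      - model: meta-llama/Meta-Llama-3-70B-Instruct
        layer_range: [50, 70]
  - sources:
      - model: meta-llama/Meta-Llama-3-70B-Instruct
        layer_range: [60, 80]
merge_method: passthrough
dtype: float16
```

With seven overlapping 20-layer slices, the resulting model has 140 layers instead of 80, i.e., 60 layers are duplicated, which roughly matches the jump from 70B to about 120B parameters. A configuration like this can then be run with Mergekit's mergekit-yaml command-line tool.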
In this article, I explain how to reproduce Llama 3 120B. I show how to apply the same process to Llama 3 8B and evaluate the resulting models. Going further than simple duplication, I also show how to reorder and remove layers, e.g., to reduce the size of an LLM with Mergekit. We will see that while most of these operations damage the model's performance, some of it can be recovered through fine-tuning.
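As an illustration of layer removal, the same passthrough method can be used to skip a block of layers. The sketch below, with illustrative layer indices of my own choosing, would keep only 24 of the 32 layers of Llama 3 8B by dropping layers 16 to 23; as noted above, such a pruned model usually needs fine-tuning to recover part of its performance.

```yaml
# Sketch of layer removal: the two slices skip layers 16-23,
# producing a 24-layer model from the 32-layer Llama 3 8B.
slices:
  - sources:
      - model: meta-llama/Meta-Llama-3-8B
        layer_range: [0, 16]
  - sources:
      - model: meta-llama/Meta-Llama-3-8B
        layer_range: [24, 32]
merge_method: passthrough
dtype: float16
```

Reordering layers works the same way: listing the slices in a different order changes the order in which the copied layers are stacked in the new model.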
The following notebook implements layer duplication, reordering, and deletion for Llama 3 models (though it is technically applicable to any transformer model):