Meta released the long-awaited Llama 3 405B, an extremely large model with 405 billion parameters.
According to public benchmarks, the model performs on par with GPT-4. It is the first time an open model has reached this level of performance. This is impressive but not unexpected: as Meta mentions in its paper, LLM scaling laws predicted this level of performance for a 405B-parameter model trained with the same pipeline as Llama 3 8B and 70B.
Llama 3 405B has a very standard Llama architecture with more attention heads and more layers. The quality of the training data also seems to have played a major role in Llama 3 405B's performance. Note: Meta's data pipeline looks very similar to what Hugging Face did to create FineWeb-Edu, but at a much larger scale.
"our performance gains are primarily driven by improvements in data quality and diversity as well as by increased training scale." (Llama 3 paper, Section 3.2)
Llama 3 405B is already very good, but fine-tuning it on your own data would make it even better for your applications.
What would such a fine-tuning cost?
In this article, I answer this question. I estimate the memory consumption of fully fine-tuning Llama 3 405B, as well as of fine-tuning it with parameter-efficient methods such as LoRA and QLoRA. We will see that while fully fine-tuning the model could cost millions, LoRA and QLoRA fine-tuning can be much more affordable, provided you can find a provider with enough 80 GB GPUs.
I used this notebook to estimate memory consumption for inference and fine-tuning:
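If you just want the back-of-the-envelope version, the core of such an estimate is simple arithmetic on the parameter count. Below is a minimal sketch in Python; the per-parameter byte counts (bf16 weights, AdamW with fp32 master weights and two fp32 optimizer states) are common assumptions of mine, not the notebook's exact figures.

```python
# Back-of-the-envelope memory estimates for Llama 3 405B.
# The byte counts per parameter are assumptions, not measured values.

N_PARAMS = 405e9
GB = 1e9  # 1 GB = 1e9 bytes, consistent with "810 GB" for bf16 weights

def weights_memory_gb(bytes_per_param: float) -> float:
    """Memory needed just to load the weights at a given precision."""
    return N_PARAMS * bytes_per_param / GB

def full_fine_tuning_memory_gb() -> float:
    """Weights and gradients in bf16, plus fp32 master weights and AdamW states."""
    # 2 (weights) + 2 (gradients) + 4 (fp32 copy) + 8 (two fp32 optimizer states)
    return N_PARAMS * (2 + 2 + 4 + 8) / GB

if __name__ == "__main__":
    for name, bpp in [("bf16", 2.0), ("fp8", 1.0), ("4-bit", 0.5)]:
        print(f"{name:>5}: {weights_memory_gb(bpp):,.0f} GB to load the weights")
    print(f"full fine-tuning (excluding activations): "
          f"{full_fine_tuning_memory_gb():,.0f} GB")
```

Note that this ignores activations, which depend on the batch size and sequence length, so the real footprint during training is higher still.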
What about inference?
Running Llama 3 405B for inference tasks is much more affordable than fine-tuning since we don't have to store the gradients, the optimizer states, and all the activations needed for backpropagation.
In bf16, the model occupies 810 GB of memory. You need at least 11 GPUs with 80 GB of memory just to load it, or fewer if you offload some parts of the model to CPU memory. The FP8 version, also released by Meta, would only require half this memory for similar results. A 4-bit version of the model would only require between 202 GB and 250 GB of memory. However, note that this is only the memory required to load the model: add roughly 50% more for the inference itself, without batch decoding and for short sequences of tokens. With this margin, 5 H100 GPUs would be enough for the 4-bit version. Alternatively, you could use a framework optimized for inference on CPUs, such as llama.cpp. I estimate that it would "only" require 400 GB of CPU RAM, which is much more affordable, but slower than GPUs.
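As a sanity check on these GPU counts, the number of 80 GB devices needed is a simple ceiling division; the 50% inference margin below is the rule of thumb used above, not a measured value.

```python
import math

def gpus_needed(model_gb: float, margin: float = 0.0, gpu_gb: float = 80.0) -> int:
    """Minimum number of GPUs to hold the model, plus an optional inference margin."""
    return math.ceil(model_gb * (1 + margin) / gpu_gb)

print(gpus_needed(810))              # bf16, loading only    -> 11 GPUs
print(gpus_needed(250, margin=0.5))  # 4-bit + 50% margin    -> 5 GPUs
```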
Fine-tuning Llama 3 405B: Is It Even Possible?
We know how to fine-tune Llama 3.
While the 8B and 70B versions are large, they are relatively easy and affordable to fine-tune, especially with parameter-efficient fine-tuning methods such as LoRA.
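To make the contrast with full fine-tuning concrete, here is a minimal LoRA setup with Hugging Face's peft library. The hyperparameters (rank, alpha, dropout, target modules) are illustrative assumptions, and I use the 8B checkpoint as the example since it is the one that fits comfortably on a single GPU.

```python
# Minimal LoRA setup with Hugging Face transformers + peft.
# Rank, alpha, dropout, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # example checkpoint; gated on the Hub
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                       # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trained
```

Only the low-rank adapter weights are trained, so the optimizer states and gradients are tiny compared to full fine-tuning; the base model's weights only need to be loaded, as for inference.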