Fine-tune the Token Embeddings and the Language Modeling Head of Llama 3
If you have enough GPU RAM
We can easily adapt a pre-trained large language model (LLM) to new tasks thanks to low-rank adaptation (LoRA). LoRA freezes the entire model and adds a small number of trainable parameters on top of it. By training only these new parameters instead of the entire model, LoRA, and its quantized variant QLoRA, save a lot of GPU memory and make it possible to fine-tune LLMs on consumer hardware.
LoRA is usually applied only to the attention and MLP modules. The token embeddings and the language modeling head remain unchanged after LoRA fine-tuning. This is often suboptimal: the token embeddings learned during pre-training are general-purpose, without any domain or task specialization, and some of them may even have been left untrained. This is the case, for instance, for some of the special tokens of Llama 3 8B.
Ideally, we should retrain the token embeddings and the language modeling head to better adapt the model to a new task or domain.
But what is the cost of this retraining? Is it really worth it? When should we do it?
In this article, I investigate the impact of retraining the token embeddings and language modeling head of Llama 3 during (Q)LoRA fine-tuning. We will see that, due to the large vocabulary of Llama 3 (128k tokens), this retraining is indeed very costly in GPU memory, but still feasible on consumer hardware. More importantly, we will see that retraining the token embeddings and language modeling head of Llama 3 can significantly improve fine-tuning results.
The notebook demonstrating the impact of this retraining is available here: