Thank you, great explanation. I am now curious which other tricks can be used to fit Llama 3.1 8B on 24 GB for full fine-tuning. (I saw that Torchtune allows it.)
For full fine-tuning of an 8B model with 24 GB of VRAM, you need:
- gradient checkpointing
- FlashAttention
- bfloat16
- paged AdamW 8-bit
- a batch size of 1 (or 2)
- a short sequence length (less than 1,024, maybe 512 or 256)
So yes, it's possible, but it won't perform well for tasks that process long sequences.
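As a rough illustration (not from the article), here is what that setup could look like with Hugging Face Transformers. The checkpoint name, learning rate, and dataset handling are placeholders, and FlashAttention 2 plus bitsandbytes need to be installed for `flash_attention_2` and `paged_adamw_8bit` to work:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder checkpoint

# Load the model in bfloat16 with FlashAttention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="./llama31-8b-full-ft",
    per_device_train_batch_size=1,   # batch size of 1
    gradient_accumulation_steps=16,  # compensate for the tiny batch
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # bfloat16 training
    optim="paged_adamw_8bit",        # paged 8-bit AdamW via bitsandbytes
    learning_rate=1e-5,
    num_train_epochs=1,
)

# train_dataset is assumed to be pre-tokenized with sequences truncated to <= 512 tokens
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```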
Great article. I have a question: Do the data types used by the optimizer need to match the data types used for training the model? For example, if I train the model using BF16 or FP32, do these need to be the same as the data type used by the optimizer? If not, theoretically, in any fine-tuning scenario, would using a quantized (8-bit) optimizer be the best choice?
The data types don't need to be the same; they are independent.
Indeed, AdamW 8-bit is probably our best option.
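To make that independence concrete, here is a minimal sketch (mine, not from the article) using bitsandbytes: the model weights are loaded in bfloat16 while the optimizer keeps its own states in 8-bit. The checkpoint name and learning rate are placeholders.

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; weights loaded in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
)

# The optimizer stores its states (momentum and variance) in 8-bit,
# regardless of the dtype used for the model weights and gradients.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)
```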
When doing pre-training rather than SFT, is the conclusion that "AdamW 8-bit is probably our best option" still valid?
Good question! I don't do pre-training often enough to be 100% sure, but I believe 8-bit AdamW might also work well for pre-training.