6 Comments
3 hrs ago · Liked by Benjamin Marie

Thank you. Great explanation. I am now curious which other tricks can be used to fit Llama 3.1 8B on a 24 GB GPU for full fine-tuning. (I saw that Torchtune allows it.)

author

For full fine-tuning an 8B model with 24 GB of VRAM, you need:

- gradient checkpointing

- FlashAttention

- bfloat16

- paged AdamW 8-bit

- batch size of 1 (or 2)

- a short sequence length (fewer than 1,024 tokens, maybe 512 or 256)

So yes, it's possible, but it won't perform well on tasks that process long sequences.
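As a minimal sketch of how those settings fit together with Hugging Face Transformers (assuming bitsandbytes and flash-attn are installed; the model ID, learning rate, and gradient accumulation steps are illustrative placeholders, not the author's exact recipe):

```python
# Illustrative sketch: full fine-tuning of an 8B model on a single 24 GB GPU.
# The model ID and hyperparameters below are placeholders, not a validated recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "meta-llama/Llama-3.1-8B"  # assumed model ID

# bfloat16 weights + FlashAttention-2 to reduce activation memory
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="./llama-3.1-8b-full-ft",
    per_device_train_batch_size=1,   # batch size of 1 (or 2)
    gradient_accumulation_steps=16,  # recover a larger effective batch size
    gradient_checkpointing=True,     # recompute activations instead of storing them
    bf16=True,                       # bfloat16 mixed-precision training
    optim="paged_adamw_8bit",        # paged 8-bit AdamW from bitsandbytes
    learning_rate=1e-5,
    num_train_epochs=1,
)
# Tokenize the dataset with a short max_length (e.g., 512 or 256) before
# passing it to Trainer; longer sequences will not fit in 24 GB.
```

Gradient accumulation stands in for a larger per-device batch, so activation memory stays bounded by a single short sequence.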

Oct 10 · Liked by Benjamin Marie

Great article. I have a question: Do the data types used by the optimizer need to match the data types used for training the model? For example, if I train the model using BF16 or FP32, do these need to be the same as the data type used by the optimizer? If not, theoretically, in any fine-tuning scenario, would using a quantized (8-bit) optimizer be the best choice?

author

The data types don't need to be the same. They are independent.

Indeed, AdamW 8-bit is probably our best option.
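As a minimal sketch of that independence (assuming a recent bitsandbytes release and a CUDA GPU; the toy layer and learning rate are placeholders), the model can train in bf16 while the optimizer keeps its states in 8-bit:

```python
# Illustrative sketch: bf16 model parameters paired with an 8-bit optimizer.
# The optimizer's state precision is chosen independently of the training dtype.
import torch
import bitsandbytes as bnb

# Toy bf16 "model" (placeholder for a real network); requires a CUDA GPU.
model = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="cuda")

# bitsandbytes stores the AdamW moments in 8-bit regardless of the parameter
# dtype (bf16 here; fp32 would work the same with a recent bitsandbytes).
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)

x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")
loss = model(x).float().pow(2).mean()  # accumulate the loss in fp32
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

With the Hugging Face Trainer, passing optim="paged_adamw_8bit" (as in the earlier sketch) selects the same family of 8-bit optimizers without constructing one manually.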

Oct 11 · Liked by Benjamin Marie

When doing pre-training rather than SFT, is the conclusion that "AdamW 8-bit is probably our best option" still valid?

author

Good question! I don't do pre-training often enough to be 100% sure, but I believe 8-bit AdamW might also work well for pre-training.
