Thank you, great explanation. I am now curious which other tricks can be used to fit Llama 3.1 8B on 24 GB for full fine-tuning. (I saw that Torchtune allows it.)
For full fine-tuning of an 8B model with 24 GB of VRAM, you need:
- gradient checkpointing
- FlashAttention
- bfloat16
- paged AdamW 8-bit
- a batch size of 1 (or 2)
- a short sequence length (less than 1,024, maybe 512 or 256)
So yes, it's possible, but it won't perform well for tasks that process long sequences.
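As a rough illustration (not from the article), here is what that setup could look like with Hugging Face Transformers. The checkpoint name, learning rate, and dataset handling are placeholders, and FlashAttention 2 plus bitsandbytes need to be installed for `flash_attention_2` and `paged_adamw_8bit` to work:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder checkpoint

# Load the model in bfloat16 with FlashAttention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="./llama31-8b-full-ft",
    per_device_train_batch_size=1,   # batch size of 1
    gradient_accumulation_steps=16,  # compensate for the tiny batch
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # bfloat16 training
    optim="paged_adamw_8bit",        # paged 8-bit AdamW via bitsandbytes
    learning_rate=1e-5,
    num_train_epochs=1,
)

# train_dataset is assumed to be pre-tokenized with sequences truncated to <= 512 tokens
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```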
Great article. I have a question: Do the data types used by the optimizer need to match the data types used for training the model? For example, if I train the model using BF16 or FP32, do these need to be the same as the data type used by the optimizer? If not, theoretically, in any fine-tuning scenario, would using a quantized (8-bit) optimizer be the best choice?
The data types don't need to be the same; they are independent.
Indeed, AdamW 8-bit is probably our best option.
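To make that independence concrete, here is a minimal sketch (mine, not from the article) using bitsandbytes: the model weights are loaded in bfloat16 while the optimizer keeps its own states in 8-bit. The checkpoint name and learning rate are placeholders.

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; weights loaded in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
)

# The optimizer stores its states (momentum and variance) in 8-bit,
# regardless of the dtype used for the model weights and gradients.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)
```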
When doing pre-training rather than SFT, is the conclusion that "AdamW 8-bit is probably our best option" still valid?
Good question! I don't do pre-training often enough to be 100% sure, but I believe 8-bit AdamW might also work well for pre-training.