Yes, you might be able to set a larger eval batch size, but the difference won't be significant. Gradient checkpointing already saves so much memory that the maximum training batch size supported by the GPU is very close to the one for evaluation.
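To make this concrete, here is a minimal sketch of how the two batch sizes and gradient checkpointing sit together in Hugging Face's TrainingArguments; the values are illustrative placeholders, not recommendations:

```python
from transformers import TrainingArguments

# Illustrative values; tune both batch sizes to your GPU's memory.
args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,   # limited by activation memory during backprop
    per_device_eval_batch_size=8,    # could be slightly larger, but gains are small
    gradient_checkpointing=True,     # recompute activations instead of storing them
)
```

With gradient checkpointing enabled, the training pass keeps far fewer activations in memory, which is why the two limits end up so close.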
Hi Benjamin, I suppose there is an error here: ‘If you train TinyLlama for one epoch on openassistant-guanaco which contains 9,846 steps, with a batch size of 8, it yields 1,231 training steps’. It is 9,846 examples and not 9,846 steps, isn’t it?
Indeed! I made the correction. Thank you.
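For reference, the corrected arithmetic as a minimal sketch (the dataset size and batch size are taken from the quoted sentence; no gradient accumulation assumed):

```python
import math

num_examples = 9846  # size of the openassistant-guanaco training split
batch_size = 8

# One optimizer step per batch; the final partial batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
print(steps_per_epoch)  # 1231
```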
What about learning rate scheduler after the warmup? Shall we use rate decay (cosine, linear) or keep the rate constant?
I recommend linear decay. This is done by setting lr_scheduler_type to "linear". For most use cases, it will work at least as well as cosine.
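If it helps, this is a minimal sketch of the corresponding TrainingArguments; the learning rate and warmup ratio are placeholder assumptions, not values from the article:

```python
from transformers import TrainingArguments

# Placeholder hyperparameters for illustration only.
args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,          # assumed starting rate, not from the article
    lr_scheduler_type="linear",  # linear decay down to 0 after the warmup
    warmup_ratio=0.03,           # assumed fraction of steps spent warming up
)
```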
Good point!