5 Comments
Aug 3 · Liked by Benjamin Marie

Hi Benjamin, I suppose there is an error here: ‘If you train TinyLlama for one epoch on openassistant-guanaco which contains 9,846 steps, with a batch size of 8, it yields 1,231 training steps’. It should be 9,846 examples, not 9,846 steps, shouldn’t it?

author

Indeed! I made the correction. Thank you.
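For reference, a minimal sketch of the arithmetic behind the corrected sentence (assuming one device and no gradient accumulation):

```python
# Minimal sketch: steps per epoch from dataset size and batch size.
# Numbers taken from the thread; no gradient accumulation assumed.
import math

num_examples = 9846   # examples in openassistant-guanaco's training split
batch_size = 8        # per-device training batch size

steps_per_epoch = math.ceil(num_examples / batch_size)
print(steps_per_epoch)  # 1231
```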

Apr 1 · Liked by Benjamin Marie

What about the learning rate scheduler after the warmup? Should we use rate decay (cosine, linear) or keep the rate constant?

author

I recommend linear decay. This is done when lr_scheduler_type is set to "linear". For most use cases, it will work at least as well as cosine.
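Purely as an illustration, here is a minimal sketch of how that setting looks with Hugging Face's TrainingArguments; the hyperparameter values below are assumptions, not recommendations from the article:

```python
# Minimal sketch (assumed values): warmup followed by linear decay.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # hypothetical output directory
    learning_rate=1e-4,              # assumed peak learning rate
    warmup_ratio=0.03,               # assumed warmup fraction of total steps
    lr_scheduler_type="linear",      # linear decay after the warmup phase
    num_train_epochs=1,
    per_device_train_batch_size=8,
)
```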

[deleted] Apr 8 · edited Apr 8 · Liked by Benjamin Marie
Comment deleted
author

Good point!

Yes, you might be able to set a larger eval batch size, but the difference won't be large. Gradient checkpointing already saves so much memory that the maximum training batch size supported by the GPU is very close to the maximum one for evaluation.
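As an illustration only, a minimal sketch of the relevant TrainingArguments fields; the batch sizes shown are placeholders, not measured limits:

```python
# Minimal sketch (assumed values): separate train/eval batch sizes with gradient checkpointing.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # hypothetical output directory
    per_device_train_batch_size=8,     # assumed maximum that fits with checkpointing enabled
    per_device_eval_batch_size=8,      # could be slightly larger, but the gain is small
    gradient_checkpointing=True,       # trades extra compute for a large cut in activation memory
)
```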
