Yes, you might be able to set a larger eval batch size, but the difference won't be significant. Gradient checkpointing already saves so much memory that the maximum training batch size supported by the GPU is very close to the one for evaluation.
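To make this concrete, here is a minimal sketch of how the two batch sizes and gradient checkpointing sit together in Hugging Face's TrainingArguments; the values are illustrative placeholders, not recommendations:

```python
from transformers import TrainingArguments

# Illustrative values; tune both batch sizes to your GPU's memory.
args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,   # limited by activation memory during backprop
    per_device_eval_batch_size=8,    # could be slightly larger, but gains are small
    gradient_checkpointing=True,     # recompute activations instead of storing them
)
```

With gradient checkpointing enabled, the training pass keeps far fewer activations in memory, which is why the two limits end up so close.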
Hi Benjamin, I suppose there is an error here: ‘If you train TinyLlama for one epoch on openassistant-guanaco which contains 9,846 steps, with a batch size of 8, it yields 1,231 training steps’. It is 9,846 examples and not 9,846 steps, isn’t it?
Indeed! I made the correction. Thank you.
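For reference, the corrected arithmetic as a minimal sketch (the dataset size and batch size are taken from the quoted sentence; no gradient accumulation assumed):

```python
import math

num_examples = 9846  # size of the openassistant-guanaco training split
batch_size = 8

# One optimizer step per batch; the final partial batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
print(steps_per_epoch)  # 1231
```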
What about learning rate scheduler after the warmup? Shall we use rate decay (cosine, linear) or keep the rate constant?
I recommend linear decay. This is done by setting lr_scheduler_type to "linear". For most use cases, it will work at least as well as cosine.
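If it helps, this is a minimal sketch of the corresponding TrainingArguments; the learning rate and warmup ratio are placeholder assumptions, not values from the article:

```python
from transformers import TrainingArguments

# Placeholder hyperparameters for illustration only.
args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,          # assumed starting rate, not from the article
    lr_scheduler_type="linear",  # linear decay down to 0 after the warmup
    warmup_ratio=0.03,           # assumed fraction of steps spent warming up
)
```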
Good point!