2 Comments
22 hrs ago · Liked by Benjamin Marie

Wow. I remember when minibatch/batch normalization and gradient accumulation were offered as a performance improvement to lessen the number of weight updates in backpropagation. It was carried forward because that's how it's always been done.

Now we await differences in model performance after the Transformers change.
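As an illustrative aside, here is a minimal sketch of gradient accumulation in a plain PyTorch training loop, showing how it trades one weight update per micro-batch for one update every N micro-batches. The `model`, `dataloader`, and `accumulation_steps` names are placeholders, and the normalization comment is an assumption about the loss-scaling issue the Transformers fix reportedly addressed.

```python
import torch

def train_with_accumulation(model, dataloader, accumulation_steps=4, lr=1e-4):
    # Hypothetical loop: model and dataloader are assumed to exist.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(logits, targets)
        # Dividing by accumulation_steps approximates averaging the loss over
        # the full effective batch; the recent fix concerns normalizing by the
        # actual token count rather than assuming equal-sized micro-batches.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one weight update per N micro-batches
            optimizer.zero_grad()
```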

Oct 21 · Liked by Benjamin Marie

Glad to see it fixed!
