7 Comments
Oct 11 · edited Oct 11 · Liked by Benjamin Marie

On the issue of gradient accumulation, I learned this from a friend late this spring:

For example, with `bs=1` and `gradient_accumulation_steps=2`, suppose the two sequences have the following `attention_mask` values:

- seq1: [1,1,1,1,1,1,1,1,0,0]

- seq2: [1,1,0,0,0,0,0,0,0,0]

In theory, when computing the loss for backpropagation, seq1 should get 80% of the weight and seq2 should get 20%, in proportion to their numbers of non-padding tokens. But the HF implementation is simply (seq1_loss + seq2_loss)/2, which means that two sequences with different actual lengths are given the same weight.

I don't know whether this is related to the issue, but the bias may be mitigated when `gradient_accumulation_steps` is large, so the effect might not be so bad.
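A minimal numeric sketch of that weighting difference (the per-token losses below are random placeholders; only the attention masks come from the example above):

```python
import torch

# The two attention masks from the example (bs=1, gradient_accumulation_steps=2).
mask1 = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=torch.float)  # 8 real tokens
mask2 = torch.tensor([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.float)  # 2 real tokens

# Placeholder per-token losses; only how they are weighted matters here.
tok_loss1 = torch.rand(10)
tok_loss2 = torch.rand(10)

# Each micro-batch computes a mean loss over its own non-padding tokens.
seq1_loss = (tok_loss1 * mask1).sum() / mask1.sum()
seq2_loss = (tok_loss2 * mask2).sum() / mask2.sum()

# Accumulation that weights every sequence equally, regardless of length.
equal_weight = (seq1_loss + seq2_loss) / 2

# Token-weighted accumulation: seq1 gets 8/10 of the weight, seq2 gets 2/10,
# which is what a single batch containing both sequences would compute.
token_weight = ((tok_loss1 * mask1).sum() + (tok_loss2 * mask2).sum()) / (mask1 + mask2).sum()

print(equal_weight.item(), token_weight.item())  # generally differ unless the lengths are equal
```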

author

Interesting!

This is not so difficult to verify. I could create a dataset in which all the sequences have the same length and use it for training.
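For instance, one quick way to get such a dataset is to tokenize a corpus and pack it into fixed-size blocks, so that every training sequence has exactly the same number of tokens and no padding. The tokenizer and dataset names below are only illustrative:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative choices; any causal LM tokenizer and text dataset would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
block_size = 512

def tokenize(examples):
    return tokenizer(examples["text"])

def pack(examples):
    # Concatenate all token ids, then slice them into equal fixed-size blocks:
    # every resulting sequence has exactly block_size tokens, no padding needed.
    ids = sum(examples["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    chunks = [ids[i : i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
packed = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)
```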

Oct 11 · Liked by Benjamin Marie

I talked to him about this again. He said that the HF implementation (total_loss/num_seqs), in addition to weighting sequences equally regardless of how much padding they contain, also loses information about sequence length (i.e., the number of tokens). So he made a correction himself (total_loss/num_seqs * max_seq_len), which uses `max_seq_len` to reintroduce the "number of tokens" information so that the training results are closer to those of a genuinely large batch size. In theory, it would be more accurate to count the actual number of tokens in each sequence rather than use `max_seq_len`, but it doesn't seem easy to record or pass that information across gradient accumulation steps.

So indeed, if you experiment with a dataset whose sequences all have the same length, you should expect to get the same results as with a genuinely large batch size.
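For what it's worth, here is one possible reading of that correction as a numeric sketch. The per-token losses are placeholders, `max_seq_len` is taken per micro-batch after dynamic padding, and the denominators are only there to make the three numbers comparable:

```python
import torch

# Two micro-batches under dynamic padding (per-token losses are placeholders).
# Micro-batch A: two sequences padded to length 8 (8 + 6 = 14 real tokens).
# Micro-batch B: two sequences padded to length 3 (3 + 2 = 5 real tokens).
loss_a, mask_a = torch.rand(2, 8), torch.tensor([[1.0] * 8, [1.0] * 6 + [0.0] * 2])
loss_b, mask_b = torch.rand(2, 3), torch.tensor([[1.0] * 3, [1.0] * 2 + [0.0] * 1])

def mean_token_loss(loss, mask):
    # What a forward pass typically returns: mean loss over non-padding tokens.
    return (loss * mask).sum() / mask.sum()

# 1) Equal weighting of the two micro-batches (the behavior discussed above).
equal = (mean_token_loss(loss_a, mask_a) + mean_token_loss(loss_b, mask_b)) / 2

# 2) The max_seq_len correction, read as "reweight each micro-batch by its
#    padded length" (8 and 3 here).
corrected = (mean_token_loss(loss_a, mask_a) * 8 + mean_token_loss(loss_b, mask_b) * 3) / (8 + 3)

# 3) Exact token counting: what one large batch of all four sequences gives.
exact = ((loss_a * mask_a).sum() + (loss_b * mask_b).sum()) / (mask_a.sum() + mask_b.sum())

print(equal.item(), corrected.item(), exact.item())
```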

Oct 11 · Liked by Benjamin Marie

As always, I'm looking forward to what you share!


Wow, HF accumulates the loss? I had assumed it accumulated the gradients.

This point seems very salient then…

author

Not sure... I'm waiting for the professionals to get their hands on it. Daniel, the author of Unsloth, was able to reproduce the issue. He is working on it, but he is also very busy.


But I didn't confirm it myself afterwards; I'm just mentioning it here.
