Insights from the Falcons

Benjamin Marie

Dec 18, 2023

How to pre-train large language models

5 Comments

Ronan McGovern

The Blip

Dec 19, 2023 · Liked by Benjamin Marie

I think the issue with multi-query attention is more about quality than parallelization, because you're basically using the same K and V values for every query head. Grouped-query attention takes an approach that sits between multi-query and multi-head: each group of query heads shares one set of K and V.

Benjamin Marie

Dec 19, 2023 · edited Dec 19, 2023 · Author

Yes, that is what previous work mainly pointed out. But in their paper, they show that the quality with multi-query attention is actually almost comparable to that of the vanilla Transformer (Table 9). If I understood correctly, this is why they use the "parallelization" angle to motivate GQA.
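
To make the comparison in this thread concrete, here is a minimal NumPy sketch of how multi-head, grouped-query, and multi-query attention differ only in the number of K/V heads. It is an illustration under simplifying assumptions (no masking, no projections, a single sequence), not the Falcon implementation; the function name `grouped_query_attention` and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(queries, keys, values):
    """Hypothetical helper, for illustration only.

    queries:      (n_q_heads,  seq_len, head_dim)
    keys, values: (n_kv_heads, seq_len, head_dim)

    n_kv_heads == n_q_heads -> multi-head attention
    n_kv_heads == 1         -> multi-query attention
    anything in between     -> grouped-query attention
    """
    n_q_heads, _, head_dim = queries.shape
    n_kv_heads = keys.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads

    # Each K/V head is shared by `group_size` consecutive query heads.
    keys = np.repeat(keys, group_size, axis=0)
    values = np.repeat(values, group_size, axis=0)

    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(head_dim)
    return softmax(scores) @ values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_q_heads, seq_len, head_dim = 8, 4, 16
    q = rng.normal(size=(n_q_heads, seq_len, head_dim))

    for n_kv_heads, name in [(8, "multi-head"), (2, "grouped-query"), (1, "multi-query")]:
        k = rng.normal(size=(n_kv_heads, seq_len, head_dim))
        v = rng.normal(size=(n_kv_heads, seq_len, head_dim))
        out = grouped_query_attention(q, k, v)
        # Only the distinct K/V heads need to be cached at inference time.
        cached_floats_per_token = 2 * n_kv_heads * head_dim
        print(name, out.shape, cached_floats_per_token)
```

The output shape is the same in all three cases; what changes is how many distinct K/V heads exist. With a single K/V head every query head attends over identical keys and values, which is the quality concern raised above, while fewer K/V heads also mean less K/V state to store and move around.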
