5 Comments

I think the issue with multi-query attention is more about quality than parallelization, because you're basically sharing the same K and V values across all query heads. Grouped-query attention takes an approach that sits between multi-query and multi-head.
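
To make the "in between" point concrete, here is a minimal sketch of which K/V head each query head reads from under the three schemes. The function names and the contiguous grouping of query heads are my own assumptions for illustration, not something taken from the paper.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the K/V head that query head `q_head` reads from.

    Assumption: query heads are split into contiguous, equally sized groups,
    and each group shares one K/V head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def sharing_pattern(n_q_heads: int, n_kv_heads: int) -> list[int]:
    """List the K/V head used by every query head, in order."""
    return [kv_head_for_query_head(h, n_q_heads, n_kv_heads) for h in range(n_q_heads)]


# Multi-head: one K/V head per query head (nothing is shared).
print(sharing_pattern(8, 8))
# Grouped-query: 2 K/V heads, each shared by a group of 4 query heads.
print(sharing_pattern(8, 2))
# Multi-query: a single K/V head shared by all query heads.
print(sharing_pattern(8, 1))
```

Setting `n_kv_heads` to 1 gives multi-query and setting it to `n_q_heads` gives multi-head, so grouped-query is just every value in between.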

author
Dec 19, 2023 · edited Dec 19, 2023

Yes, that is what previous work mainly pointed out. But in their paper, they show that the quality with multi-query is actually almost comparable to the vanilla transformer (Table 9). If I understood correctly, this is why they use the "parallelization" angle to motivate GQA.


I think they use the parallel approach, so that would be the right-hand diagram in the figure.

author

Thanks for pointing this out. I misread the figure. It's corrected.


Great piece btw, thanks
