I think the issue with multi-query is more about quality than parallelization, because you're basically using the same K and V values for every query head. Grouped-query attention takes an approach that sits between multi-query and multi-head.
Yes, that is what previous work mainly pointed out. But in their paper, they show that the quality with multi-query is actually almost comparable to the vanilla Transformer (Table 9). If I understood correctly, this is why they use the "parallelization" angle to motivate GQA.
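To make the "between multi-query and multi-head" point concrete, here is a minimal sketch of how query heads could be mapped to shared K/V heads. The function name `kv_head_assignment` and the contiguous grouping are assumptions made for illustration, not something taken from the paper.

```python
def kv_head_assignment(num_query_heads: int, num_kv_heads: int) -> list[int]:
    """For each query head, return the index of the K/V head it reads from.

    Assumes contiguous groups of equal size (an illustrative choice).
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return [head // group_size for head in range(num_query_heads)]


# Multi-head: every query head has its own K/V head.
print(kv_head_assignment(8, 8))
# Grouped-query: query heads share K/V heads in groups of four.
print(kv_head_assignment(8, 2))
# Multi-query: every query head reads the same single K/V head.
print(kv_head_assignment(8, 1))
```

With 8 query heads, setting the number of K/V heads to 8 gives multi-head, 1 gives multi-query, and anything in between gives grouped-query attention.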
I think they use the parallel approach, so that would be the right-hand diagram in the figure.
Thanks for pointing this out. I misread the figure. It's corrected.
Great piece btw, thanks