5 Comments

It seems this experiment is no longer supported by Optimum-Benchmark. Do you have an updated version of the code, for example testing with vLLM or ROCm? Thank you.

author

Yes, there was a big update of Optimum-Benchmark. It changed completely. I don't have a new version of this article yet, but I'm working on it.

Note, however, that Optimum-Benchmark doesn't benchmark vLLM, and I don't know whether it is compatible with ROCm.

FlashAttention-2 in vLLM locks up a lot of GPU memory based on the maximum sequence length during inference, which is a big problem, and I haven't found a good way around it other than limiting the sequence length. This means that even though Phi-3 has a 128k context, if you run inference with vLLM and actually use 128k-token inputs, it will cause an OOM error. Do you have any suggestions?

author

128k is an extremely long sequence. There is not much to do, I think. FlashAttention-2 actually reduces memory consumption a lot for such a long sequence.

You may be able to reduce memory consumption with --gpu-memory-utilization xx, e.g., xx=0.20. Or use --enforce-eager, but it will slow down inference.
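
For example, with vLLM's Python LLM class the same options can be passed as constructor arguments. This is a minimal sketch, assuming a recent vLLM version; the Phi-3 checkpoint name, the max_model_len value, and the prompt are only illustrative:

from vllm import LLM, SamplingParams

# Cap vLLM's share of GPU memory (the default is 0.90) and skip CUDA graph capture.
# max_model_len further shrinks the KV cache by limiting the maximum context length.
llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",  # illustrative checkpoint
    gpu_memory_utilization=0.20,
    enforce_eager=True,   # saves memory but slows down inference
    max_model_len=32768,  # optional: trade context length for memory
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)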

Thank you so much
