Marlin: Nearly Ideal Inference Speed for 4-bit Models with vLLM (1k+ tokens/sec)
Up to 4x faster inference
Large language models (LLMs) are often too large to be directly used on consumer hardware. To reduce their size, various techniques have been proposed to quantize LLMs and lower their memory consumption. While recent algorithms for 4-bit quantization are often released along with their own optimized CUDA kernels, the inference throughput of quantized LLMs remains far from optimal.
Inference with a 4-bit model, i.e., a model whose weights are quantized to the INT4 data type, involves mixed-precision INT4xFP16 operations: the INT4 weights must be dequantized to FP16 before they can be multiplied with the FP16 activations. These operations are slow even on modern GPUs, hence the need for optimized CUDA kernels.
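To illustrate the problem, here is a minimal PyTorch sketch of the naive approach: dequantize the weights, then run a standard FP16 matmul. It is illustrative only; real INT4 kernels pack two 4-bit values per byte and fuse dequantization into the matmul itself, which is exactly the part Marlin optimizes. All names below are my own, not from the Marlin codebase.

```python
# Illustrative only: the naive way to run an INT4xFP16 matmul is to
# dequantize the weights to FP16 first, then call a standard matmul.
# The dequantization step adds memory traffic and kernel launches that
# optimized kernels such as Marlin avoid by fusing it into the matmul.
import torch

def naive_int4_matmul(x_fp16, w_int4, scales):
    # w_int4: weights as integers in [-8, 7], stored in an int8 tensor for
    # simplicity (real kernels pack two 4-bit values per byte)
    # scales: per-output-column FP16 quantization scales
    w_fp16 = w_int4.to(torch.float16) * scales  # dequantize: extra memory traffic
    return x_fp16 @ w_fp16                      # standard FP16 matmul

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
w = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8, device="cuda")
s = torch.rand(4096, dtype=torch.float16, device="cuda")
y = naive_int4_matmul(x, w, s)
```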
IST-DASLab (Institute of Science and Technology Austria) proposes the Mixed Auto-Regressive Linear kernel (Marlin), an extremely optimized INT4xFP16 matmul kernel that delivers close to ideal (4x) inference speedups. The 4x figure is the theoretical optimum: at small batch sizes, inference is bound by memory bandwidth, so loading weights that are 4x smaller can make it at most 4x faster.
In this article, I explain how Marlin achieves this speedup. Then, we will see how to convert existing GPTQ models to the Marlin format, using Mistral 7B for demonstration, and check the inference speed with vLLM. A minimal sketch of these two steps follows below.
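The sketch assumes an existing GPTQ checkpoint of Mistral 7B. It relies on AutoGPTQ's `use_marlin` flag (introduced in auto-gptq 0.7) to repack the GPTQ weights into the Marlin format, and on vLLM's `quantization` argument to select the Marlin kernel; the model ID, output path, and prompt are placeholders, not the article's exact notebook code.

```python
# Sketch under the assumptions above (auto-gptq >= 0.7 with Marlin support,
# a GPTQ-quantized Mistral 7B checkpoint, a CUDA GPU).
from auto_gptq import AutoGPTQForCausalLM

gptq_model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # placeholder GPTQ checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    gptq_model_id,
    use_marlin=True,   # repack the GPTQ weights into the Marlin format
    device="cuda:0",
)
model.save_quantized("mistral-7b-marlin")  # save the Marlin-format weights

# Inference with vLLM: pass quantization="marlin" so vLLM uses the Marlin kernel.
from vllm import LLM, SamplingParams

llm = LLM(model="mistral-7b-marlin", quantization="marlin")
outputs = llm.generate(
    ["Tell me about gradient descent."],
    SamplingParams(temperature=0.8, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```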
The notebook showing how to convert LLMs to the Marlin format and run them with vLLM is available here: