Falcon Mamba, Jamba, RWKV... Can You Use Them on Your Computer?
A close look at quantization and parameter-efficient fine-tuning (LoRA/QLoRA) for SSMs, RWKV, and hybrid models
State-of-the-art large language models (LLMs) rely heavily on the Transformer architecture. Larger Transformer models tend to learn more effectively, and they train efficiently because attention over an entire sequence can be computed in parallel.
However, Transformers have limitations, especially during inference: the computational cost of attention grows quadratically with the length of the input sequence. Techniques such as ALiBi and RoPE help models handle longer token sequences, but they are positional encoding schemes and do not remove the quadratic cost of attention itself.
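To see where the quadratic cost comes from, recall the standard self-attention formula: for a sequence of n tokens, the score matrix QKᵀ has n × n entries, so time and memory for this step grow as O(n²d) in the sequence length n (d being the head dimension).

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V,
\qquad QK^\top \in \mathbb{R}^{n \times n}
$$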
Given that attention is the primary bottleneck, attention-free neural architectures have emerged as alternatives. Notable among these are RWKV and state-space models (SSMs), which generate each new token in constant time and memory regardless of how long the context already is, so the total cost grows only linearly with sequence length. While promising, these models tend to slightly underperform Transformers and are more challenging to train.
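To make the contrast concrete, here is a simplified form of the recurrence used by SSMs such as Mamba (RWKV uses a different but similarly recurrent update). Each step only updates a fixed-size hidden state, so the per-token cost does not depend on how many tokens came before:

$$
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
$$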
To address the limitations of pure SSMs, hybrid models combining SSM and Transformer blocks have been introduced. These hybrids are easier to train than pure SSMs and, although their attention blocks make them slightly less efficient at inference, they offer a good balance between performance and efficiency.
Large-scale models like Jamba 1.5, RWKV-6, and Falcon Mamba—trained on trillions of tokens—are now emerging with impressive performance.
But how well are they supported by popular frameworks, and can we easily quantize and fine-tune them on consumer hardware?
In this article, we’ll explore quantization and fine-tuning for Jamba 1.5, RWKV-6, and Falcon Mamba.
I tested various frameworks for quantization, including bitsandbytes, HQQ, AutoRound, AutoGPTQ, and AutoAWQ. For LoRA/QLoRA fine-tuning, I evaluated Hugging Face Transformers and PEFT.
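As a concrete illustration of the kind of QLoRA setup the notebook tests, here is a minimal sketch that loads a model in 4-bit with Transformers and bitsandbytes, then attaches LoRA adapters with PEFT. The checkpoint name and the LoRA target modules are assumptions for illustration; the appropriate target modules differ between SSM, RWKV, and hybrid blocks, so adjust them for the model you actually load.

```python
# Minimal QLoRA sketch: 4-bit loading with bitsandbytes + LoRA adapters with PEFT.
# The model ID and target_modules below are illustrative assumptions, not verified settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/falcon-mamba-7b"  # assumed checkpoint name

# 4-bit NF4 quantization configuration (standard QLoRA settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training and attach LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed module names for Mamba-style blocks; RWKV and hybrid
    # architectures expose different projection layers.
    target_modules=["in_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

If 4-bit loading or the LoRA attachment fails for a given architecture, that is usually the first sign that the model is not yet fully supported by these frameworks, which is exactly what the experiments below look at.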
I implemented all these tests in the following notebook:
You can also use this notebook with other models if you want to check whether they support quantization or LoRA/QLoRA fine-tuning.