Fine-tune a Mixture of Experts on Your Computer
Cheap supervised fine-tuning for MoEs with QLoRA
A wave of mixture of experts (MoE) and merged models is currently surging on the Hugging Face Hub. In a previous article, we reviewed Phixtral, a merge of several Phi-2 LLMs, and made our own MoE.
While these new LLMs simply merge several existing LLMs, they often outperform, without any fine-tuning, the individual models they combine.
In this article, I first discuss the memory requirements for fine-tuning MoE/merged LLMs and then show how to fine-tune them using QLoRA. For demonstration, I use my Maixtchup model, but the same process works for any other MoE model made of Llama 2 or Mistral 7B models.
I have implemented this fine-tuning in the following notebook:
You can get Maixtchup from the Hugging Face Hub:
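For reference, here is a minimal way to pull the tokenizer from the Hub with Transformers. The repository id below is an assumption; check the Hub page linked above for the exact name.

```python
from transformers import AutoTokenizer

model_id = "kaitchup/Maixtchup-4x7b"  # hypothetical repo id, verify it on the Hub page
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer)
```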
How Much Memory Do We Need to Fine-tune 4xMistral 7B?
Maixtchup is a 24 billion parameter model. Fine-tuning it on consumer hardware is challenging.
One fp16 (16-bit) parameter occupies 2 bytes in memory, so 24 billion fp16 parameters occupy 48 GB. Even an expensive 48 GB GPU, such as an RTX A6000, wouldn't be enough to load the entire model. To load it on a consumer GPU (e.g., with 24 GB of VRAM), we could offload 3 of the 4 experts to CPU RAM, but fine-tuning would then become extremely slow.
The best alternative is to quantize the model to 4-bit and then fine-tune an adapter on top of it. This is the QLoRA method. If you don’t know about QLoRA, we used it several times in previous articles and you can find a detailed explanation of how it works here:
With 4-bit quantization, one parameter occupies only 0.5 bytes, which divides the model’s memory footprint by 4. Loading a quantized Maixtchup takes about 14 GB rather than the 12 GB this arithmetic suggests, because not all the parameters are quantized.
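As a sketch of what this looks like in code, here is the memory arithmetic and a 4-bit NF4 quantization configuration with bitsandbytes, assuming the standard Transformers/bitsandbytes integration (the notebook’s exact settings may differ):

```python
import torch
from transformers import BitsAndBytesConfig

# Rough memory arithmetic for a 24B-parameter model
n_params = 24e9
print(f"fp16:  ~{n_params * 2 / 1e9:.0f} GB")    # ~48 GB
print(f"4-bit: ~{n_params * 0.5 / 1e9:.0f} GB")  # ~12 GB, plus ~2 GB for the layers
                                                 # that stay in 16-bit (e.g., embeddings, norms)

# 4-bit NF4 quantization, as used by QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```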
So we need a GPU with at least 14 GB of VRAM just to load the model, but we still don’t know how much memory the fine-tuning itself will require. It depends on several hyperparameters, mainly the training batch size and the maximum sequence length of the training examples.
Since we will fine-tune the model on long training examples (see next section), I set the maximum sequence length to 1,024. I will also activate FlashAttention 2 to reduce the computational cost and memory consumption of training over long sequences of tokens.
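Putting these pieces together, here is a sketch of the model loading with 4-bit quantization, FlashAttention 2, and a LoRA adapter attached with PEFT. The LoRA hyperparameters and target modules below are common defaults for Mistral-based models, not necessarily the values used in the notebook.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "kaitchup/Maixtchup-4x7b"  # hypothetical repo id, verify it on the Hub

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,           # the BitsAndBytesConfig from the previous snippet
    attn_implementation="flash_attention_2",  # requires the flash-attn package and an Ampere+ GPU
    torch_dtype=torch.bfloat16,
    device_map={"": 0},                       # keep everything on GPU 0
)
model = prepare_model_for_kbit_training(model)

# The LoRA adapter: only these small matrices are trained
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# The maximum sequence length (1,024) is then passed to the trainer (see below).
```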
It remains to see how much memory is needed for a given batch size. To find out, I ran the fine-tuning with different batch sizes.
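If you want to reproduce this measurement on your own hardware, PyTorch can report the peak VRAM it has allocated during a run; note that nvidia-smi will show slightly more because of the CUDA context. A minimal sketch:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run a few training steps here, e.g., trainer.train() ...

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak VRAM allocated by PyTorch: {peak_gb:.1f} GB")
```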
If you target a 24 GB GPU (RTX 3090/4090), the maximum batch size is 3, but for training efficiency it is recommended to use a value divisible by 2, so I set it to 2.
If you target a 16 GB GPU (e.g., RTX 4060 Ti 16 GB, RTX 4070 Ti Super, or RTX 4080), even a batch size of 1 consumes too much memory. You will need to reduce the maximum sequence length to 512; a batch size of 2 then works.
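For illustration, these two configurations could be expressed with Transformers’ TrainingArguments as follows; every value other than the batch size is a placeholder, not the notebook’s exact hyperparameters.

```python
from transformers import TrainingArguments

# 24 GB GPU (RTX 3090/4090): batch size 2 with 1,024-token sequences.
# 16 GB GPU: batch size 2, but with the maximum sequence length reduced to 512.
training_args = TrainingArguments(
    output_dir="./maixtchup-qlora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # raises the effective batch size without extra VRAM
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)
# The maximum sequence length (1,024 or 512) is set on the SFT trainer,
# not in TrainingArguments.
```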
Distilled Supervised Fine-tuning
This supervised fine-tuning (SFT) aims to make a chat model using Maixtchup as the base model. We want this chat model to be able to answer instructions, so we need an instruction dataset for training: instructions paired with correct answers.
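For illustration only, here is the kind of record such a dataset contains and one possible way to flatten it into a single training sequence; the actual dataset and prompt format are those used in the notebook.

```python
# A toy instruction/answer pair, just to show the expected structure
example = {
    "instruction": "Explain what a mixture of experts is in one paragraph.",
    "output": "A mixture of experts routes each token to a subset of expert networks ...",
}

# One possible way to flatten it into a single training sequence
prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Response:\n{example['output']}"
)
print(prompt)
```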