Hi Everyone,
In this edition of The Weekly Kaitchup:
A New Zephyr 7B Based on Gemma
PEFT: LoRA with AWQ and AQLM
Genstruct 7B: Generate an Instruction Dataset from Raw Text
The Kaitchup now has 2,391 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
A New Zephyr 7B Based on Gemma
Hugging Face has fine-tuned and aligned Google’s Gemma 7B, following a recipe similar to the one they used to train their original Zephyr.
According to standard public benchmarks, this new Zephyr is not as good as the original one based on Mistral 7B. However, it scores better on MT-Bench.
Nonetheless, this release is informative as many have reported difficulties in fine-tuning Gemma 7B. This recipe published by Hugging Face seems to work relatively well and can be a good starting point.
This recipe is available here:
The model is on the Hugging Face Hub:
Note also that several bugs have been identified in the Gemma models released on the Hugging Face Hub. These bugs are currently being corrected. I expect the Gemma models will be easier to fine-tune (with more stable learning curves) in the coming weeks. Meanwhile, if you want to fine-tune Gemma, I recommend using Unsloth, which has already implemented many corrections.
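For reference, here is a minimal sketch of what loading Gemma 7B with Unsloth and attaching a LoRA adapter looks like. The sequence length and LoRA hyperparameters are illustrative, not recommendations; check Unsloth's documentation for the settings it suggests for Gemma.

```python
import torch
from unsloth import FastLanguageModel

# Illustrative settings; adjust max_seq_length and LoRA hyperparameters to your use case
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-7b",
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=True,  # QLoRA-style 4-bit loading
)

# Attach a LoRA adapter with Unsloth's helper
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```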
PEFT: LoRA with AWQ and AQLM
The latest PEFT release (0.9.0) supports fine-tuning LoRA adapters on top of AWQ and AQLM quantized models. Note: GPTQ models were already supported.
To make it work, in principle, you only have to load the model without passing any quantization_config. The quantization configuration will be automatically retrieved from the config.json in the model’s directory.
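As a rough sketch, here is what that looks like with an AWQ checkpoint (the model ID is only an example; an AQLM model would be loaded the same way, and AWQ requires the autoawq package):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Example AWQ checkpoint (illustrative)
model_id = "TheBloke/Mistral-7B-v0.1-AWQ"

# No quantization_config passed: it is read from the model's config.json
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Standard LoRA configuration on top of the quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```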
However, be aware that if you fine-tune a LoRA adapter on top of a model quantized with these methods, you can't merge the adapter into the model once fine-tuning is done.
Next week, I’ll probably write a tutorial about fine-tuning AQLM models and report on how well it performs. If it works, it could be a very cheap way to fine-tune Mixtral-8x7B on consumer hardware.
Genstruct 7B: Generate an Instruction Dataset from Raw Text
Fine-tuning an instruct/chat LLM on a particular domain with specific knowledge can be very difficult, as we usually lack the instruction data needed for this fine-tuning.
There are many methods to generate synthetic instruction datasets, but none of them is both easy to set up and able to generate long, complex instructions.
With Genstruct 7B, it seems much simpler. It is an LLM, based on Mistral 7B, fine-tuned to generate instruction datasets from raw text. To build it, Nous Research followed an approach similar to Ada-Instruct:
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
Genstruct 7B is available on the HF Hub:
HF Hub: NousResearch/Genstruct-7B
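If you want to try it, a minimal sketch looks like this. The raw text and the prompt structure are only indicative: check the model card on the HF Hub for the exact prompt format Genstruct expects.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Genstruct-7B"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Raw text from your target domain (illustrative); Genstruct generates an
# instruction/answer pair grounded in this passage
title = "Quantization"
content = "Quantization reduces the precision of a model's weights to shrink its memory footprint."

# NOTE: the prompt below is a simplification; follow the template given on the model card
prompt = f"[[[Title]]] {title}\n[[[Content]]] {content}\n\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```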
The Salt
In The Salt this week, I published a long review of LongRoPE. LongRoPE is a new extension of RoPE that is much more robust for modeling extremely long contexts. I explain how LongRoPE works to extend the context size of LLMs to 2 million tokens.
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I mention which of the AI notebooks I have checked and updated, with a brief description of the changes.
This week I have updated the notebook implementing the fine-tuning of Mistral 7B on consumer hardware, using TRL and QLoRA.
#22 Fine-tune Mistral 7B on Your Computer with QLoRa and TRL
To sum up, I checked that the notebook still runs properly, added support for FlashAttention-2, used a longer maximum sequence length, and switched from float16 to bfloat16, among other less significant changes. All these changes significantly accelerate and improve the fine-tuning.
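For illustration, the model-loading part of these changes looks roughly like this. It is a sketch, not the exact notebook code:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit QLoRA quantization with bfloat16 compute instead of float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```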
The article has also been updated to reflect these changes:
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!