Hi Everyone,
In this edition of The Weekly Kaitchup:
Starling: Another RLAIF Model
IPO: A Better Learning Objective to Align LLMs
Intel Extension for Transformers: QLoRA Support and Faster Speculative Decoding
The Kaitchup now has 1,169 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
Starling: Another RLAIF Model
Two weeks ago, I introduced Intel’s NeuralChat, a model similar to Hugging Face’s Zephyr models. These models are trained and aligned using only synthetic data, with training examples rated by GPT-4. This is the RLAIF method (“reinforcement learning” with AI feedback).
Another model joined the trend:
A 7-billion-parameter model trained on 184k prompts and pairs of responses rated, again, by GPT-4.
A notable difference from previous RLAIF models is the use of Advantage-Induced Policy Alignment (APA). Thanks to APA and a larger training dataset, Starling LM 7B achieves better results than Zephyr and NeuralChat.
RLAIF models are getting closer to GPT-4 but remain far from it for reasoning, math, and coding.
IPO: A Better Learning Objective to Align LLMs
As we see with RLAIF models, direct preference optimization (DPO) is now very popular for aligning LLMs, as it is much simpler than the typical reinforcement learning with human feedback (RLHF): DPO doesn’t need a reward model.
Nonetheless, DPO is prone to overfitting the training data. In this new work, Google DeepMind demonstrates why this overfitting happens and proposes a new learning objective for aligning LLMs with human preferences: identity preference optimization (IPO).
A General Theoretical Paradigm to Understand Learning from Human Preferences
According to DeepMind, IPO is superior to DPO and RLHF because it makes the KL regularization more effective.
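For reference, and up to notation, the per-pair IPO loss from the paper is a simple squared error, where y_w is the preferred response, y_l the rejected one, π_ref the reference (SFT) model, and τ the regularization strength:

\mathcal{L}_{\mathrm{IPO}}(\theta) = \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\left(\log\frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\tau}\right)^{2}\right]

The squared error pulls the log-likelihood-ratio gap between the preferred and rejected responses toward the fixed target 1/(2τ). DPO’s logistic loss has no such target: when the preferences in the training data are (nearly) deterministic, that gap can grow without bound, the policy drifts arbitrarily far from the reference model, and the KL regularization effectively vanishes. This is the overfitting DeepMind analyzes.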
IPO is already implemented in Hugging Face TRL. You can use it by simply setting loss_type="ipo" in the arguments of the DPOTrainer. I’ll try it with Mistral 7B, following this training recipe but changing the training objective to IPO:
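Here is a minimal sketch of what that switch looks like with TRL. The dataset below is a toy placeholder (any preference dataset with prompt, chosen, and rejected text columns works), the hyperparameters are only illustrative, and depending on your TRL version, beta and loss_type may need to be passed through a DPOConfig instead of directly to the trainer:

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Toy preference dataset; in practice, use a real dataset
# with "prompt", "chosen", and "rejected" columns
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["France is a country in Europe."],
})

training_args = TrainingArguments(
    output_dir="./mistral-7b-ipo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
    remove_unused_columns=False,  # required by DPOTrainer's data collator
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # TRL creates a frozen reference copy of the model
    args=training_args,
    beta=0.1,              # regularization strength (tau in the IPO paper)
    loss_type="ipo",       # switch from the default DPO loss to the IPO loss
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()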
Intel Extension for Transformers: QLoRA Support and Faster Speculative Decoding
Intel is very active in improving the CPU support for Transformers. Within a week, they enabled QLoRA on the CPU and sped up speculative decoding. You can find this extension here:
To use QLoRA on the CPU, you only have to import AutoModelForCausalLM from the Intel extension instead of importing it from Transformers, as follows:
import torch
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM

# Load the model quantized to 4-bit for the CPU
model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for LoRA fine-tuning
model = prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=True
)
model.gradient_checkpointing_enable()

# Attach a LoRA adapter
peft_config = LoraConfig(
    r=8,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, peft_config)
There is nothing else to change compared to regular QLoRA fine-tuning. Intel’s extension already supports NF4, INT4, and FP4 quantization.
For speculative decoding, they claim a 3x acceleration.
Speculative decoding first uses a small, fast model to generate a draft output (a sequence of tokens). The larger target model then verifies the draft and corrects it only where needed, hence the decoding acceleration.
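If you just want to see the mechanism in action without Intel’s extension, Hugging Face Transformers exposes the same idea as “assisted generation”: you pass a small draft model to generate() through the assistant_model argument. A minimal sketch (the OPT checkpoints are only an example of a target/draft pair sharing the same tokenizer; this is not Intel’s API):

from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "facebook/opt-1.3b"   # large target model
draft_name = "facebook/opt-125m"    # small draft model

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")

# The draft model proposes several tokens per step; the target model checks
# them in a single forward pass and keeps the longest accepted prefix.
outputs = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))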
You can find how to use speculative decoding with Intel’s extension here:
What to Read on Substack
Note: These recommendations are articles that I’ve read, found instructive, and that I think may interest The Kaitchup’s readers.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!