Hi Everyone,
In this edition of The Weekly Kaitchup:
Starling: Another RLAIF Model
IPO: A Better Learning Objective to Align LLMs
Intel Extension for Transformers: QLoRA Support and Faster Speculative Decoding
The Kaitchup now has 1,169 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
Starling: Another RLAIF Model
Two weeks ago, I introduced Intel’s NeuralChat, a model similar to Hugging Face’s Zephyr models. These models are trained and aligned using only synthetic data, with training examples rated by GPT-4. This is the RLAIF method (“reinforcement learning” with AI feedback).
Another model joined the trend:
A 7-billion-parameter model trained on 184k prompts and pairs of responses rated, again, by GPT-4.
A notable difference from previous RLAIF models is the use of Advantage-Induced Policy Alignment (APA). Thanks to APA and a larger training dataset, Starling LM 7B achieves better results than Zephyr and NeuralChat.
RLAIF models are getting closer to GPT-4 but remain far from it for reasoning, math, and coding.
IPO: A Better Learning Objective to Align LLMs
As we see with RLAIF models, direct preference optimization (DPO) is now very popular for aligning LLMs, as it is much simpler than the typical reinforcement learning with human feedback (RLHF): DPO doesn’t need a reward model.
Nonetheless, DPO is prone to overfitting the training data. In this new work, Google DeepMind demonstrates why this overfitting happens and proposes a new learning objective for aligning LLMs with human preferences: identity preference optimization (IPO).
A General Theoretical Paradigm to Understand Learning from Human Preferences
According to DeepMind, IPO is superior to DPO and RLHF because it makes the KL regularization more effective.
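For reference, and up to notation, the per-pair IPO loss from the paper is a simple squared error, where y_w is the preferred response, y_l the rejected one, π_ref the reference (SFT) model, and τ the regularization strength:

\mathcal{L}_{\mathrm{IPO}}(\theta) = \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\left(\log\frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\tau}\right)^{2}\right]

The squared error pulls the log-likelihood-ratio gap between the preferred and rejected responses toward the fixed target 1/(2τ). DPO’s logistic loss has no such target: when the preferences in the training data are (nearly) deterministic, that gap can grow without bound, the policy drifts arbitrarily far from the reference model, and the KL regularization effectively vanishes. This is the overfitting DeepMind analyzes.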
IPO is already implemented in Hugging Face TRL. You can use it by simply setting loss_type="ipo" in the arguments of the DPOTrainer. I’ll try it with Mistral 7B, following this training recipe but changing the training objective to IPO:
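Here is a minimal sketch of what that switch looks like with TRL. The dataset below is a toy placeholder (any preference dataset with prompt, chosen, and rejected text columns works), the hyperparameters are only illustrative, and depending on your TRL version, beta and loss_type may need to be passed through a DPOConfig instead of directly to the trainer:

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Toy preference dataset; in practice, use a real dataset
# with "prompt", "chosen", and "rejected" columns
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["France is a country in Europe."],
})

training_args = TrainingArguments(
    output_dir="./mistral-7b-ipo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
    remove_unused_columns=False,  # required by DPOTrainer's data collator
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # TRL creates a frozen reference copy of the model
    args=training_args,
    beta=0.1,              # regularization strength (tau in the IPO paper)
    loss_type="ipo",       # switch from the default DPO loss to the IPO loss
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()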
Intel Extension for Transformers: QLoRA Support and Faster Speculative Decoding
Intel is very active in improving the CPU support for Transformers. Within a week, they enabled QLoRA on the CPU and sped up speculative decoding. You can find this extension here:
To use QLoRA on the CPU, you only have to import AutoModelForCausalLM from the Intel extension instead of importing it from Transformers, as follows:
import torch
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM

# Load the model quantized to 4-bit for the CPU
model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for LoRA fine-tuning
model = prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=True
)
model.gradient_checkpointing_enable()

# Attach a LoRA adapter
peft_config = LoraConfig(
    r=8,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, peft_config)
There is nothing else to change compared to regular QLoRA fine-tuning. Intel’s extension already supports NF4, INT4, and FP4 quantization.
For speculative decoding, they claim a 3x acceleration.
Speculative decoding first uses a small, fast model to generate a draft output (a sequence of tokens). The larger target model then verifies the draft and corrects it only where needed, hence the decoding acceleration.
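If you just want to see the mechanism in action without Intel’s extension, Hugging Face Transformers exposes the same idea as “assisted generation”: you pass a small draft model to generate() through the assistant_model argument. A minimal sketch (the OPT checkpoints are only an example of a target/draft pair sharing the same tokenizer; this is not Intel’s API):

from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "facebook/opt-1.3b"   # large target model
draft_name = "facebook/opt-125m"    # small draft model

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")

# The draft model proposes several tokens per step; the target model checks
# them in a single forward pass and keeps the longest accepted prefix.
outputs = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))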
You can find how to use speculative decoding with Intel’s extension here:
What to Read on Substack
Note: These recommendations are articles that I’ve read, found instructive, and that I think may interest The Kaitchup’s readers.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!