3 Comments
Mar 23 · Liked by Benjamin Marie

There is something I don't understand.

You keep the SFT model as the reference, but in the Argilla blog they keep the original base model.

What is the best way to use DPO, please?

Reference:

«

Finally, they describe with a few lines of code, how you can configure a DPOTrainer class and run the train. Here is what you will need:

model, the fine-tuned version of your model (the result from SFT);

model_ref, the non-fine-tuned version of the model that's being fine-tuned. Usually it’s the original checkpoint you used before SFT.

training_args, same TrainerArguments class object present in transformers library, containing a list of training parameters such as per_device_train_batch_size, max_steps, gradient_accumulation_steps, learning_rate, evaluation_strategy, output_dir, etc.

beta, temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. »

https://argilla.io/blog/mantisnlp-rlhf-part-3/

Author

Yes, you can use the main model as the reference, i.e., not fine-tuned, and the SFT model for initialisation. You can also use the main model for initialisation and the SFT model as the reference. You can even use the SFT model for both initialisation and as the reference. In my experiments, it didn't make much difference. You can see it as a hyperparameter.
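
To make the options concrete, here is a minimal sketch with TRL's DPOTrainer (the checkpoint names and the toy dataset are hypothetical, and it assumes the older TRL API where beta and the tokenizer are passed directly to DPOTrainer; recent versions move them into DPOConfig and processing_class):

```python
# Minimal sketch, not from the article: a DPO setup with an explicit reference model.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_name = "my-org/base-model"  # hypothetical: original pre-trained checkpoint
sft_name = "my-org/sft-model"    # hypothetical: checkpoint produced by SFT

tokenizer = AutoTokenizer.from_pretrained(sft_name)

# Toy preference dataset with the columns DPOTrainer expects.
preference_dataset = Dataset.from_dict({
    "prompt": ["What is DPO?"],
    "chosen": ["DPO fine-tunes a model directly on preference pairs."],
    "rejected": ["I don't know."],
})

# Option A (Argilla blog): initialise from the SFT model,
# keep the original base model as the frozen reference.
model = AutoModelForCausalLM.from_pretrained(sft_name)
ref_model = AutoModelForCausalLM.from_pretrained(base_name)
# The other options discussed above: swap the two checkpoints,
# or load the SFT checkpoint for both.

training_args = TrainingArguments(
    output_dir="dpo-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=5e-6,
    max_steps=10,
    remove_unused_columns=False,  # DPOTrainer does its own preprocessing
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,  # DPO temperature, typically 0.1 to 0.5
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```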

Author

I just rechecked the DPO paper:

"the deviation from the base reference policy πref, namely the ini-

tial SFT model π

SFT. In practice, the language model policy πθ is also initialized to π

SFT"

To reproduce the official DPO method, we should use the SFT model both as the reference and for initialisation.
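
Continuing the sketch from the previous comment (same hypothetical sft_name, training_args, preference_dataset and tokenizer), the paper-faithful setup loads the SFT checkpoint as the trainable policy and uses the same SFT model as the reference; with TRL you can either load the SFT checkpoint a second time for ref_model or pass ref_model=None, in which case the trainer makes a frozen copy of the policy:

```python
# Paper-faithful sketch: the SFT model is both the initialisation and the reference.
model = AutoModelForCausalLM.from_pretrained(sft_name)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL copies `model`, so the frozen reference is also the SFT model
    beta=0.1,
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```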
