3 Comments
Mar 23 · Liked by Benjamin Marie

There is something I don't understand.

You keep the SFT model as the reference, but in the Argilla blog they keep the original base model.

What is the best way to use DPO, please?

Reference:

«

Finally, they describe with a few lines of code, how you can configure a DPOTrainer class and run the train. Here is what you will need:

model, the fine-tuned version of your model (the result from SFT);

model_ref, the non-fine-tuned version of the model that's being fine-tuned. Usually it’s the original checkpoint you used before SFT.

training_args, same TrainerArguments class object present in transformers library, containing a list of training parameters such as per_device_train_batch_size, max_steps, gradient_accumulation_steps, learning_rate, evaluation_strategy, output_dir, etc.

beta, temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. »

https://argilla.io/blog/mantisnlp-rlhf-part-3/

Author

Yes, you can use the main model as the reference, i.e., not fine-tuned, and the SFT model for initialisation. You can also use the main model for initialisation and the SFT model as the reference. You can even use the SFT model for both initialisation and as the reference. In my experiments, it didn't make much difference. You can see it as a hyperparameter.
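
To make the options concrete, here is a minimal sketch with TRL's DPOTrainer (the checkpoint names and the toy dataset are hypothetical, and it assumes the older TRL API where beta and the tokenizer are passed directly to DPOTrainer; recent versions move them into DPOConfig and processing_class):

```python
# Minimal sketch, not from the article: a DPO setup with an explicit reference model.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_name = "my-org/base-model"  # hypothetical: original pre-trained checkpoint
sft_name = "my-org/sft-model"    # hypothetical: checkpoint produced by SFT

tokenizer = AutoTokenizer.from_pretrained(sft_name)

# Toy preference dataset with the columns DPOTrainer expects.
preference_dataset = Dataset.from_dict({
    "prompt": ["What is DPO?"],
    "chosen": ["DPO fine-tunes a model directly on preference pairs."],
    "rejected": ["I don't know."],
})

# Option A (Argilla blog): initialise from the SFT model,
# keep the original base model as the frozen reference.
model = AutoModelForCausalLM.from_pretrained(sft_name)
ref_model = AutoModelForCausalLM.from_pretrained(base_name)
# The other options discussed above: swap the two checkpoints,
# or load the SFT checkpoint for both.

training_args = TrainingArguments(
    output_dir="dpo-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=5e-6,
    max_steps=10,
    remove_unused_columns=False,  # DPOTrainer does its own preprocessing
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,  # DPO temperature, typically 0.1 to 0.5
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```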

Author

I just rechecked the DPO paper:

"the deviation from the base reference policy πref, namely the ini-

tial SFT model π

SFT. In practice, the language model policy πθ is also initialized to π

SFT"

To reproduce the official DPO method, we should use the SFT model both as the reference and for initialisation.
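
Continuing the sketch from the previous comment (same hypothetical sft_name, training_args, preference_dataset and tokenizer), the paper-faithful setup loads the SFT checkpoint as the trainable policy and uses the same SFT model as the reference; with TRL you can either load the SFT checkpoint a second time for ref_model or pass ref_model=None, in which case the trainer makes a frozen copy of the policy:

```python
# Paper-faithful sketch: the SFT model is both the initialisation and the reference.
model = AutoModelForCausalLM.from_pretrained(sft_name)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL copies `model`, so the frozen reference is also the SFT model
    beta=0.1,
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```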
