7 Comments

I am looking to run a fine-tuned small language model on an edge device. The edge device is limited, so I am obviously looking to quantize.

To be efficient, I prefer to keep the quantized base model on the hardware and, if I need to push updates or adjust the fine-tune, push only the LoRA adapters and let the 'merge' or 'apply' process take place on the edge.

This saves me from having to push an entire base model plus LoRA adapter and reduces the transfer to just the adapter.

I know this is possible, but I want to limit the degradation in performance, since naively applying an adapter to a quantized model has repercussions, as you've noted.
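To be concrete, the 'apply' path I have in mind is just attaching the adapter on top of the already-quantized base, without merging. Roughly like this (a sketch with placeholder model/adapter names, assuming a transformers + PEFT + bitsandbytes stack):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization settings for the base model kept on the device
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Quantized base model already stored on the edge device (placeholder name)
base = AutoModelForCausalLM.from_pretrained(
    "my-base-model", quantization_config=bnb_config, device_map="auto"
)

# Only the adapter gets pushed; it stays separate, no merge into the base weights
model = PeftModel.from_pretrained(base, "my-lora-adapter")
```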

author

To the best of my knowledge, there is no straightforward way to do that. We can't merge an adapter into a quantized model: the model has to be dequantized first and re-quantized after the merge. An edge device might not have enough resources to do that.
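For reference, the usual workaround looks roughly like this (a sketch with placeholder model/adapter names, assuming a transformers + PEFT + bitsandbytes stack); it needs the full-precision base model and enough memory to hold it, which is exactly what an edge device usually lacks:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 1. Reload the base model in half precision, i.e., "dequantized" (placeholder name)
base = AutoModelForCausalLM.from_pretrained("my-base-model", torch_dtype=torch.float16)

# 2. Merge the LoRA adapter into the base weights
merged = PeftModel.from_pretrained(base, "my-lora-adapter").merge_and_unload()
merged.save_pretrained("merged-model")

# 3. Quantize the merged model again for deployment
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized = AutoModelForCausalLM.from_pretrained(
    "merged-model", quantization_config=bnb_config, device_map="auto"
)
```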

A method that could avoid this dequantize/re-quantize round trip is QA-LoRA, but its implementation remains in limbo. Taking the QA-LoRA code and adapting it to your model might be the way to go.

https://github.com/yuhuixu1993/qa-lora


Could you expand on this comment: "Note: If you plan to release your model fine-tuned with LoftQ, you will need to release the model along with your adapter. The model itself is also modified."

In what way is the base model changed? Could you link to the source for this?

author

Good catch!

The model doesn't change. I should have removed this note.

When I wrote the draft of this article, I misunderstood how the script works. I thought it applied a modified quantization, but then I realized that the saved weights are not 4-bit...

The script simply serializes the base model after calling get_peft_model.
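For context, my understanding is that the script boils down to something like this (a rough sketch; the model name, rank, and target modules are placeholders, not the exact values the script uses):

```python
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# Full-precision base model (placeholder name)
base = AutoModelForCausalLM.from_pretrained("my-base-model")

# LoftQ-style initialization of the LoRA adapter via PEFT
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder modules
)
peft_model = get_peft_model(base, lora_config)

# The script then just serializes the (still full-precision) base model and the adapter
peft_model.get_base_model().save_pretrained("loftq-base")
peft_model.save_pretrained("loftq-adapter")
```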


I dug in some more, and actually, if you only release your adapter, that's not enough.

The base model is changed during LoftQ: https://github.com/yxli2123/LoftQ/issues/23

I still have some open questions on whether things are handled properly: https://github.com/yxli2123/LoftQ/issues/26

author

I tried with and without the model saved with LoftQ and obtained exactly the same results. I also can't see any difference between the original model and the one created by PEFT for LoftQ.
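For what it's worth, my comparison was essentially a parameter-by-parameter check, along these lines (the paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder paths: the original checkpoint and the base model saved by the LoftQ script
original = AutoModelForCausalLM.from_pretrained("original-model", torch_dtype=torch.float16)
loftq_base = AutoModelForCausalLM.from_pretrained("loftq-saved-base", torch_dtype=torch.float16)

# Report any tensor that differs between the two checkpoints
for (name, p_orig), (_, p_loftq) in zip(
    original.named_parameters(), loftq_base.named_parameters()
):
    if not torch.equal(p_orig, p_loftq):
        print(f"weights differ: {name}")
```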

Also, the code in PEFT appears to do nothing more than save the base model.

However, the model should be changed by LoftQ, since LoftQ searches for a better quantization of the base weights. But I'm not sure that this search is implemented in PEFT.


Yeah, I'm pretty confused... plus, the results aren't better.
