24 Comments
Apr 7

Hi Benjamin, I'm having trouble testing the model after training for 3 epochs on my custom dataset. I followed the same steps as you did.

During generation at test time, the code throws an error:

ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to call `tokenizer.padding_side = 'left'` before tokenizing the input.

How can I fix this?

Author · Apr 7 (edited)

Hi!

Setting GenerationConfig(padding_side="left"... should work.

Apr 7 (edited)

Thanks for getting back to me. I tried that and it did not work. Apparently there is a bug caused by use_cache=True:

-> https://github.com/huggingface/trl/issues/1217#issuecomment-1889282654

I set use_cache to False for both the base model and inference and tried it. It's working now.

I have a new question now; please pardon my ignorance, I am new to this field.

I want to fine-tune this 7B model to generate text related to fitness suggestions.

My input prompt will have around 200 words and the output will have 170 words at most.

How many training samples will be needed for this task?

Author

I think one thousand samples would work. But more is better of course.


How did you set `use_cache=False` for both the base and LoRA weights?

I tried setting `model.config.use_cache` before and after loading the LoRA weights:

# load base model weights ...
model.config.use_cache = False

# load LoRA weights on top of the base model
model = PeftModel.from_pretrained(model, "./results/checkpoint-20/")
model.config.use_cache = False

But the same error still persisted.

Author

Are you referring to the warning printed at the beginning of fine-tuning? I think we can't get rid of that warning, even if we disable caching.

Apr 16

Hi Ben, no I'm not referring to the warning at the beginning of fine-tuning.

I'm seeing this ValueError at inference time whenever I set attn_implementation="flash_attention_2", no matter what value I set for model.config.use_cache:

```
File "/home/tianlu_zhang/.conda/envs/finetune/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 992, in forward
    raise ValueError(
ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to call `tokenizer.padding_side = 'left'` before tokenizing the input.
```

If I just load the base model without specifying flash_attention_2, the inference code works properly:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)

Author

Are you using the most recent version of Transformers? I also noticed some bugs when using FlashAttention with Mistral 7B in recent weeks, but I don't have these bugs anymore.

In your case, the error message mentions that you didn't set the padding side to left. Did you? Your error doesn't seem related to the cache.
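
For reference, here is a minimal sketch of what the error message asks for (the model name and prompts below are placeholders, not your exact setup):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder
tokenizer.padding_side = "left"            # what the ValueError asks for
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default; unk also works

prompts = ["First prompt ...", "Second prompt ..."]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
```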

Mar 23

Hi Benjamin! I'm a newbie in quantization. Can I ask a very basic and general question? When I follow your instructions from this article and load a quantized model with

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)

the loaded model has the expected size of 3.84 GB, but the number of model parameters surprises me:

Trainable parameters: 262410240

Total parameters: 3752071168

How come quantization reduced the total number of parameters from 7.2B to 3.7B? Shouldn't the total number of model parameters stay the same after quantization?

Author

Great question!

Quantization does not reduce the number of parameters. It's some kind of bug in the parameter counting. More information here:

https://github.com/huggingface/transformers/issues/25978
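
Roughly speaking, bitsandbytes stores 4-bit weights packed two per byte, so numel() on those tensors reports about half the real parameter count. Here is a sketch of a count that compensates for this (illustrative only, not the exact logic used by Transformers):

```
from bitsandbytes.nn import Params4bit

def count_parameters(model) -> int:
    total = 0
    for param in model.parameters():
        n = param.numel()
        # 4-bit parameters are packed two per stored byte,
        # so numel() undercounts them by a factor of ~2.
        if isinstance(param, Params4bit):
            n *= 2
        total += n
    return total
```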

Feb 24 (edited)

Hi Benjamin, I noticed you used the Guanaco format to fine-tune Mistral. I see different fine-tuning tutorials using different formats, e.g., [INST] and [/INST]. I wonder if that will somehow change the fine-tuning performance? Or will I be fine as long as I use the same format for training and inference?

And I am assuming that if I want to fine-tune an already fine-tuned model like Zephyr, I will have to follow the format they used to fine-tune the Mistral base model? Thanks.

Author

The format of the prompt is not very important. Choose a format that is clear and easy to use for your application. The only requirement is that it must be the same for fine-tuning and inference.

Zephyr is already fine-tuned. Fine-tuning on new data will possibly undo the previous fine-tuning. So if you fine-tune it again with a new prompt format, it will likely forget Zephyr's format.
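
For illustration, here is a sketch of keeping one template for both training and inference (the Guanaco-style markers below are just an example; any consistent markers work):

```
def format_prompt(instruction: str, response: str = "") -> str:
    # Guanaco-style markers; the exact markers matter less than consistency.
    return f"### Human: {instruction}\n### Assistant: {response}"

# Training example: the response is included so the model learns to produce it.
train_text = format_prompt("Suggest a 20-minute workout.", "Warm up for 5 minutes ...")

# Inference: the response is left empty so the model completes it.
prompt = format_prompt("Suggest a 20-minute workout.")
```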


Thanks for your clarification

Nov 14, 2023

Hi, great post! I just have a quick question about the generation part. I checked the notebook, and the generated result does not seem to know when to end until it reaches the maximum number of tokens. I also face a similar problem in my fine-tuning project; do you know how to fix this kind of issue? Many thanks!

Author

Yes, you have to set add_eos_token=True when you call AutoTokenizer.from_pretrained, but only for fine-tuning, not for inference.
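
For instance, something like this (the model name is a placeholder; an untested sketch):

```
from transformers import AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder

# Fine-tuning: append EOS so the model learns where a response ends.
train_tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True)

# Inference: default behaviour, no EOS appended to the prompt.
inference_tokenizer = AutoTokenizer.from_pretrained(model_name)
```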

Nov 15, 2023

I think I have fixed it somehow; the problem seems to be that max_seq_length was set too long for my training data. I changed it from 1024 to 512, and the new model's generated results know when to stop. Still, I wonder whether having too many padding tokens would affect my training?

Author

I see. I'm not sure why decreasing max_seq_length helps.

The reverse would have been understandable: for instance, if your training examples are all longer than 512 tokens, then the EOS token might be truncated (depending on the implementation, i.e., whether EOS is added before or after truncation), and in that case the model would not see EOS during training. In this situation, increasing to 1024 would help.

Padding tokens are basically ignored during training. Their number has no effect.
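
If you want to check whether EOS survives truncation in your setup, here is a quick sketch (the model name is a placeholder):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder
    add_eos_token=True,
)

long_text = "word " * 2000  # clearly longer than max_seq_length
ids = tokenizer(long_text, truncation=True, max_length=512)["input_ids"]

# True: EOS survived truncation. False: the model never sees EOS on examples this long.
print(tokenizer.eos_token_id in ids)
```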

Nov 14, 2023

Yeah, I think I did do that, but the generation still does not stop. Really not sure what's going on :( Here's the code for the tokenizer:

tokenizer = AutoTokenizer.from_pretrained(
    base_model,
    padding_side="right",
    add_eos_token=True,
    add_bos_token=True,
    fast_tokenizer=True,
)
tokenizer.pad_token = tokenizer.unk_token


Hi kz, did you figure out what the reason was in the end? I've run into a similar issue... Thanks.

Comment deleted
Author

Change to device_map='auto' when you load the model with from_pretrained.
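
For example (a sketch; the model name and dtype are placeholders to adapt to your setup):

```
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # placeholder model name
    device_map="auto",             # let Accelerate split the layers across available GPUs
    torch_dtype=torch.bfloat16,
)
```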

Comment deleted

Comment deleted
Nov 18, 2023 (edited)
Comment deleted
Author

That's interesting. I also usually set the device map to auto for multi-GPU settings. Mistral implements various optimisations for inference; maybe that's why we need to explicitly call Accelerate.

Thank you for pointing out this solution.

Feb 22

Hi Benjamin, it seems the original question was deleted. Could you give a high-level summary of what the issue was, if you still recall... Thanks.
