7 Comments
Jul 25 · Liked by Benjamin Marie

Hello Sir,

Thank you for all the different tutorials. I have a question about the instruction dataset. In my case, I want to fine-tune, for example, LLAMA3 for extracting features from a given real estate ad. My question is:

Is it better to format the dataset using the prompt format of LLAMA3, or is it okay to use a different format, like the Alpaca format? For example:

"""

Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

### Instruction:

{instruction}

### Input:

{input}

### Response:

"""

Could you provide a note discussing how to prepare the dataset?

Thank you!

author

It's totally OK to use a different prompt format. The only thing that matters is that the prompt format used during fine-tuning is the same one used at inference time.

You can use the Alpaca prompt format for fine-tuning Llama 3.
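To make this concrete, here is a minimal sketch of filling the Alpaca template quoted above for one training example. The feature-extraction instruction and the sample ad text are hypothetical placeholders, not from any real dataset; the key point is that the exact same template string must be reused at inference time (with the response left empty).

```python
# Alpaca prompt template, as quoted in the question above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)

def format_example(instruction: str, input_text: str, response: str) -> str:
    """Render one fine-tuning example with the Alpaca prompt format."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction, input=input_text, response=response
    )

# Hypothetical real-estate example for illustration only.
prompt = format_example(
    instruction="Extract the number of bedrooms from the real estate ad.",
    input_text="Sunny 3-bedroom apartment with balcony, close to the metro.",
    response="3",
)
print(prompt)
```

At inference time you would render the same template with `response=""` and let the fine-tuned model complete the text after "### Response:".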


Thank you, sir!

May 5 · Liked by Benjamin Marie

Can the files train.clean.pp.dedup.norm.spm8k.en and train.clean.pp.dedup.norm.spm8k.es be used for machine translation training as they are? Does the content of the files need to be numericalized for NLP training, meaning that each token is mapped to a unique numerical ID?

author

Yes, they can be used as-is, without further preprocessing.

May 6 · Liked by Benjamin Marie

I still have a question. I remember that when training deep learning models, especially for NLP, we usually need to convert the text to IDs. Why, in this blog post of yours, can the generated English txt files be used directly for training without converting them to IDs?

author

Because the framework does this conversion for you.
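For intuition, here is a toy illustration of the conversion a training framework performs internally: each subword token in a pre-tokenized line is looked up in a vocabulary that maps it to a unique numerical ID. This is only a sketch; real frameworks load a fixed vocabulary (e.g. one produced alongside a SentencePiece model) rather than building it on the fly.

```python
def build_vocab(lines, specials=("<pad>", "<unk>", "<s>", "</s>")):
    """Assign a unique ID to every token seen in the corpus."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for line in lines:
        for tok in line.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(line, vocab):
    """Map a whitespace-tokenized line to its list of token IDs,
    falling back to the <unk> ID for out-of-vocabulary tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in line.split()]

# Tiny made-up corpus in SentencePiece-style subword notation.
corpus = ["▁the ▁cat ▁sat", "▁the ▁dog ▁ran"]
vocab = build_vocab(corpus)
ids = numericalize("▁the ▁cat ▁ran", vocab)
print(ids)  # → [4, 5, 8]
```

So the text files stay human-readable on disk, and the ID mapping happens transparently inside the training pipeline.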
