7 Comments
Jul 25 · Liked by Benjamin Marie

Hello Sir,

Thank you for all the different tutorials. I have a question about the instruction dataset. In my case, I want to fine-tune, for example, LLAMA3 for extracting features from a given real estate ad. My question is:

Is it better to format the dataset using the prompt format of LLAMA3, or is it okay to use a different format, like the Alpaca format? For example:

"""

Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

### Instruction:

{instruction}

### Input:

{input}

### Response:

"""

Could you provide a note discussing how to prepare the dataset?

Thank you!

author

It's totally OK to use a different prompt format. The only thing that matters is that the prompt format used during fine-tuning is the same one used at inference time.

You can use the Alpaca prompt format for fine-tuning Llama 3.
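To make this concrete, here is a minimal sketch of filling the Alpaca template quoted above for one training example. The feature-extraction instruction and the sample ad text are hypothetical placeholders, not from any real dataset; the key point is that the exact same template string must be reused at inference time (with the response left empty).

```python
# Alpaca prompt template, as quoted in the question above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)

def format_example(instruction: str, input_text: str, response: str) -> str:
    """Render one fine-tuning example with the Alpaca prompt format."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction, input=input_text, response=response
    )

# Hypothetical real-estate example for illustration only.
prompt = format_example(
    instruction="Extract the number of bedrooms from the real estate ad.",
    input_text="Sunny 3-bedroom apartment with balcony, close to the metro.",
    response="3",
)
print(prompt)
```

At inference time you would render the same template with `response=""` and let the fine-tuned model complete the text after "### Response:".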


Thank you, sir!

May 5 · Liked by Benjamin Marie

Can the files train.clean.pp.dedup.norm.spm8k.en and train.clean.pp.dedup.norm.spm8k.es be used for machine translation training as they are? Does the content of the files need to be numericalized for NLP training, meaning that each token is mapped to a unique numerical ID?

author

Yes, they can be used as-is, without further preprocessing.

May 6 · Liked by Benjamin Marie

I still have a question. I remember that when training deep learning models, especially for NLP, we usually need to convert the text to IDs. Why, in this blog post of yours, can the generated English txt files be used directly for training without converting them to IDs?

author

Because the framework does this conversion for you.
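For intuition, here is a toy illustration of the conversion a training framework performs internally: each subword token in a pre-tokenized line is looked up in a vocabulary that maps it to a unique numerical ID. This is only a sketch; real frameworks load a fixed vocabulary (e.g. one produced alongside a SentencePiece model) rather than building it on the fly.

```python
def build_vocab(lines, specials=("<pad>", "<unk>", "<s>", "</s>")):
    """Assign a unique ID to every token seen in the corpus."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for line in lines:
        for tok in line.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(line, vocab):
    """Map a whitespace-tokenized line to its list of token IDs,
    falling back to the <unk> ID for out-of-vocabulary tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in line.split()]

# Tiny made-up corpus in SentencePiece-style subword notation.
corpus = ["▁the ▁cat ▁sat", "▁the ▁dog ▁ran"]
vocab = build_vocab(corpus)
ids = numericalize("▁the ▁cat ▁ran", vocab)
print(ids)  # → [4, 5, 8]
```

So the text files stay human-readable on disk, and the ID mapping happens transparently inside the training pipeline.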
