Example script for fine-tuning on a new destination language #4
by gmallen
Hi! Can you provide an example script for fine-tuning the model on a new language?
Thanks,
Hi @gmallen,
To get started, please take a look at this. It's my own implementation (it contains only the important config/format I used), not an official one.
- Set `MODEL_ID` and `NEW_LANG_CODE`, as in the snippet below.
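For example (both values are placeholders, not from the original post; substitute your own base checkpoint and ISO 639 language code):

```python
# Hypothetical values - use your own base checkpoint and target language code.
MODEL_ID = "google/gemma-2-9b-it"
NEW_LANG_CODE = "gd"  # e.g. Scottish Gaelic
```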
- Model loading: this is optimised via Unsloth, just so that it runs in Colab:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_ID,
    max_seq_length = 2048,
    dtype = None,         # auto-detects bfloat16/float16
    load_in_4bit = True,  # 4-bit quantisation to fit Colab memory
)
```

- Apply LoRA adaptors (standard procedure); a minimal sketch follows.
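A minimal sketch of that step with Unsloth's `get_peft_model` (the rank, alpha, and target modules are illustrative defaults, not values from the original post):

```python
# Assumption: typical LoRA settings for Gemma-style models; tune for your setup.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = True,
)
```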
- Data formatting: we must wrap our raw text in the exact JSON schema the model expects.
```python
import json

def format_to_google_schema(examples):
    texts = []
    for source, target in zip(examples["en"], examples["target"]):
        # One translation request per example, in the model's JSON schema.
        json_payload = json.dumps([
            {
                "type": "text",
                "source_lang_code": "en",
                "target_lang_code": NEW_LANG_CODE,
                "text": source,
            }
        ], ensure_ascii=False)
        # Gemma chat-template turn markers around the request and the reference translation.
        full_prompt = (
            f"<start_of_turn>user\n{json_payload}<end_of_turn>\n"
            f"<start_of_turn>model\n{target}<end_of_turn>"
        )
        texts.append(full_prompt)
    return {"text": texts}
```
- Dataset: use your own JSONL loading logic, e.g. the sketch below.
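For instance, with the Hugging Face `datasets` library (the file name and field names are assumptions matching the formatting function above):

```python
from datasets import load_dataset

# Assumption: a JSONL file where each line has "en" and "target" fields.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(format_to_google_schema, batched=True)
```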
- Training loop: use `SFTTrainer` or any other trainer you are comfortable with, then call `trainer.train()`; see the sketch after this list.
- Inference check: `FastLanguageModel.for_inference(model)`.
- That's all: at inference time, build the inputs (JSON payloads and prompt) the same way, and save the model.
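A minimal `SFTTrainer` sketch under the assumptions above (the hyperparameters and save paths are illustrative, not from the original post; recent TRL versions move `dataset_text_field`/`max_seq_length` into `SFTConfig`):

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",  # column produced by format_to_google_schema
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        logging_steps = 10,
        optim = "adamw_8bit",
        output_dir = "outputs",
    ),
)
trainer.train()
model.save_pretrained("finetuned-lora")  # assumption: local path for the LoRA adapters
tokenizer.save_pretrained("finetuned-lora")
```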
The training objective is standard causal language modeling, but strict adherence to the JSON format is non-negotiable.
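To illustrate, a hedged inference sketch that builds the same JSON payload (the sample sentence and generation settings are placeholders):

```python
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference mode

payload = json.dumps([{
    "type": "text",
    "source_lang_code": "en",
    "target_lang_code": NEW_LANG_CODE,
    "text": "Hello, how are you?",  # placeholder source sentence
}], ensure_ascii=False)
prompt = f"<start_of_turn>user\n{payload}<end_of_turn>\n<start_of_turn>model\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens (the model's translation).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```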
Please refer to the Gemma cookbook for more details on the required structure.
Please reach out if you need further help.