Example script for fine-tuning on a new destination language

#4
by gmallen - opened

Hi! Can you provide an example script for fine-tuning the model on a new language?
Thanks,

Hi @gmallen ,

To get started, please take a look at my implementation below. It contains only the important config/format I used; it is not official.

  1. Set MODEL_ID and NEW_LANG_CODE.
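    Both values below are placeholders, use this repo's model id and your target language's code:

    MODEL_ID = "your-org/your-model"   # placeholder: the model id of this repo
    NEW_LANG_CODE = "xx"               # placeholder: code of the new target language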
  2. Model loading: this is optimised via Unsloth so that it runs in Colab.

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = MODEL_ID,
        max_seq_length = 2048,
        dtype = None,          # None lets Unsloth pick float16/bfloat16 automatically
        load_in_4bit = True,   # 4-bit quantisation so the model fits in Colab VRAM
    )
  3. Apply LoRA adapters (standard procedure); a sketch is below.
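    A minimal sketch of this step with Unsloth (the rank and target modules are my assumptions; tune them as needed):

    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,                                   # LoRA rank (assumption)
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],
        lora_alpha = 16,
        lora_dropout = 0,
        bias = "none",
        use_gradient_checkpointing = "unsloth",   # saves VRAM on long sequences
        random_state = 3407,
    )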
  4. Data formatting: we must wrap the raw text in the exact JSON schema the model expects.

    import json

    def format_to_google_schema(examples):
        texts = []
        for source, target in zip(examples['en'], examples['target']):
            json_payload = json.dumps([
                {
                    "type": "text",
                    "source_lang_code": "en",
                    "target_lang_code": NEW_LANG_CODE,
                    "text": source
                }
            ], ensure_ascii=False)

            # Gemma-style turn markers around the payload and the target translation
            full_prompt = (
                f"<start_of_turn>user\n{json_payload}<end_of_turn>\n"
                f"<start_of_turn>model\n{target}<end_of_turn>"
            )
            texts.append(full_prompt)
        return {"text": texts}

  5. Dataset: use your own JSONL loading logic, e.g. as sketched below.
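    A sketch of the loading step, assuming a local train.jsonl (the file name is an assumption; the column names match format_to_google_schema above):

    from datasets import load_dataset

    # each JSONL line: {"en": "source sentence", "target": "sentence in the new language"}
    dataset = load_dataset("json", data_files = "train.jsonl", split = "train")
    dataset = dataset.map(format_to_google_schema, batched = True)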
  6. Training loop: use SFTTrainer, or any other trainer you are comfortable with, then run
    trainer.train()
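    A fuller sketch with trl's SFTTrainer (the hyperparameters are my assumptions, and keyword names shift a little between trl versions):

    import torch
    from trl import SFTTrainer
    from transformers import TrainingArguments

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",   # the column produced by format_to_google_schema
        max_seq_length = 2048,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            num_train_epochs = 1,      # assumption; tune to your data size
            learning_rate = 2e-4,
            fp16 = not torch.cuda.is_bf16_supported(),
            bf16 = torch.cuda.is_bf16_supported(),
            logging_steps = 10,
            output_dir = "outputs",
        ),
    )
    trainer.train()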
  7. Inference check: FastLanguageModel.for_inference(model)
  8. That's all: build inputs (JSON payload plus prompt) the same way as in training, generate, and save the model; a sketch follows.
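A minimal sketch of steps 7 and 8, continuing from the variables above (the example sentence, generation settings, and output directory are my assumptions):

    FastLanguageModel.for_inference(model)   # switch Unsloth to its fast generation path

    # build the payload exactly as in training, but stop after the model turn marker
    payload = json.dumps([{
        "type": "text",
        "source_lang_code": "en",
        "target_lang_code": NEW_LANG_CODE,
        "text": "Hello, how are you?",
    }], ensure_ascii = False)
    prompt = f"<start_of_turn>user\n{payload}<end_of_turn>\n<start_of_turn>model\n"

    inputs = tokenizer(prompt, return_tensors = "pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens = 128)
    print(tokenizer.decode(outputs[0], skip_special_tokens = True))

    # save the LoRA adapters and tokenizer
    model.save_pretrained("new-lang-adapter")
    tokenizer.save_pretrained("new-lang-adapter")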

The training objective is standard causal language modeling, but strict adherence to the JSON format is non-negotiable.
Please refer to the Gemma cookbook for more details on the required structure.
Please reach out if you need further help.
