Example script for fine-tuning on a new destination language

#4
by gmallen - opened

Hi! Can you provide an example script for fine-tuning the model on a new language?
Thanks,

Google org

Hi @gmallen ,

To get started, please take a look at this. It's my own implementation (it contains only the important config/format I used), not an official one.

  1. Set model_id and new_lang_code.
  2. Model loading: this is optimised via Unsloth, just so that it runs in Colab.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = MODEL_ID,
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )
  3. Apply LoRA adapters. (Standard procedure.)
  4. Data formatting. (We must wrap the raw text in the exact JSON schema the model expects.)
    def format_to_google_schema(examples):
        texts = []
        for source, target in zip(examples['en'], examples['target']):
            json_payload = json.dumps([
                {
                    "type": "text",
                    "source_lang_code": "en",
                    "target_lang_code": NEW_LANG_CODE,
                    "text": source
                }
            ], ensure_ascii=False)
            full_prompt = f"user\n{json_payload}\nmodel\n{target}"
            texts.append(full_prompt)
        return {"text": texts}
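As a quick sanity check, the formatter above can be exercised on a toy batch before mapping it over a real dataset. The column names `en`/`target` and the `NEW_LANG_CODE` value here are placeholders from my snippet, not anything official:

```python
import json

NEW_LANG_CODE = "xx"  # hypothetical placeholder; use your new language's code

def format_to_google_schema(examples):
    texts = []
    for source, target in zip(examples['en'], examples['target']):
        # Wrap the source sentence in the JSON schema the model expects
        json_payload = json.dumps([
            {
                "type": "text",
                "source_lang_code": "en",
                "target_lang_code": NEW_LANG_CODE,
                "text": source
            }
        ], ensure_ascii=False)
        full_prompt = f"user\n{json_payload}\nmodel\n{target}"
        texts.append(full_prompt)
    return {"text": texts}

# Toy batch in the shape datasets.map(batched=True) would pass
batch = {"en": ["Hello world"], "target": ["Bonjour le monde"]}
formatted = format_to_google_schema(batch)
print(formatted["text"][0])
```

The same function can then be applied with `dataset.map(format_to_google_schema, batched=True)`.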

  5. Dataset: use your JSONL loading logic.
  6. Training loop: use SFTTrainer or any other trainer you are comfortable with.
    trainer.train()
  7. Inference check: FastLanguageModel.for_inference(model)
  8. That's all: create inputs (JSON payloads and prompts) and save the model.
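For the dataset step, a minimal JSONL file has one source/target pair per line; the `en`/`target` keys below are just an assumption matching the formatter above, and the "loading logic" can be as plain as this sketch:

```python
import json
import tempfile

# Hypothetical two-row JSONL training file; the 'en'/'target' keys are an assumption
rows = [
    {"en": "Good morning", "target": "Guten Morgen"},
    {"en": "Thank you", "target": "Danke"},
]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
    path = f.name

# Minimal JSONL loading logic: one JSON object per line
with open(path, encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f]

print(len(dataset), dataset[0]["target"])
```

In practice you would point `datasets.load_dataset("json", data_files=...)` at the same file instead of reading it by hand.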

The training objective is standard causal language modeling, but strict adherence to the JSON format is non-negotiable.
Please refer to the Gemma cookbook for more details on the required structure.
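Since the format is non-negotiable, a tiny validator can catch malformed rows before training. The key names are taken from my snippet above; this is my own sketch, not an official check:

```python
import json

# Keys taken from the payload format in the snippet above
REQUIRED_KEYS = {"type", "source_lang_code", "target_lang_code", "text"}

def validate_payload(json_payload: str) -> bool:
    """Return True if the payload is a non-empty JSON list of objects
    carrying exactly the expected keys."""
    try:
        items = json.loads(json_payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(items, list) or not items:
        return False
    return all(isinstance(item, dict) and set(item) == REQUIRED_KEYS
               for item in items)

good = json.dumps([{"type": "text", "source_lang_code": "en",
                    "target_lang_code": "xx", "text": "Hello"}])
bad = json.dumps([{"source_lang_code": "en", "text": "Hello"}])
print(validate_payload(good), validate_payload(bad))  # → True False
```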
Please reach out if you need further help.

I tried this model and got these results…

Hi @srikanta-221 , the chat template translates the language code into the plain language name in the Jinja file.
Can you explain the rationale behind passing the language code instead of the language name directly, please?
Thanks
