# Mayan-mT5: Q'eqchi' Translation Adapter (Phase 1)
This repository contains the LoRA adapter weights (checkpoint 22000) for bidirectional machine translation between English/Spanish and Q'eqchi'. It is designed to be used in conjunction with the `google/mt5-base` model.

**Status:** Phase 1 complete. This model was trained on a synthetic corpus. The accompanying research paper is currently under peer review.
## Repository Cross-Links

- Training Code & Generator (GitHub): `achulzhanov/mayan-mt5`
- Training Dataset (Hugging Face): `achulz/mayan-mt5-qeqchi-dataset`
## Usage and Inference

Because this is a PEFT/LoRA adapter, you must load the `google/mt5-base` base model first, then apply these adapter weights on top of it.
### Required Task Prefix

This model was trained with a unified task schema that tags the source data type. Because Phase 1 relies entirely on synthetic data, you must include the word "synthetic" in the task prefix, even when the text you pass in at inference time is human-written. If you omit it, the model will not activate the task-specific behavior it learned during training.
The required format is:

```
translate synthetic [Source Language] to [Target Language]:
```

Valid source/target languages: English, Spanish, Q'eqchi'.
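Since a malformed or incomplete prefix silently degrades output, it can help to build it programmatically. The helper below is a hypothetical sketch (the name `build_task_prefix` is not part of the released code); it assembles the Phase 1 prefix and rejects unsupported language names.

```python
# Hypothetical helper (not part of the released repository): constructs the
# Phase 1 unified task prefix and validates the language names.
VALID_LANGUAGES = {"English", "Spanish", "Q'eqchi'"}

def build_task_prefix(source_lang: str, target_lang: str) -> str:
    """Return the Phase 1 task prefix, e.g. "translate synthetic English to Q'eqchi': "."""
    if source_lang not in VALID_LANGUAGES or target_lang not in VALID_LANGUAGES:
        raise ValueError(f"Unsupported language pair: {source_lang} -> {target_lang}")
    # "synthetic" is mandatory in Phase 1, even for human-written inputs.
    return f"translate synthetic {source_lang} to {target_lang}: "
```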
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

base_model_id = "google/mt5-base"
peft_model_id = "achulz/mayan-mt5-qeqchi-adapter"

# 1. Load the tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)

# 2. Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, peft_model_id)

# 3. Format the input with the correct task prefix
# Note: "synthetic" is mandatory, as defined in the Phase 1 unified task schema.
task_prefix = "translate synthetic English to Q'eqchi': "
source_text = "The dog is sleeping in the house."
input_text = task_prefix + source_text

# 4. Tokenize and generate
inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
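For repeated use in both directions, the tokenize/generate/decode steps above can be folded into a small convenience function. This is a hypothetical sketch (the name `translate` is not part of the released code); it takes the already-loaded `model` and `tokenizer` as arguments.

```python
# Hypothetical wrapper (not part of the released repository) around the
# prefix -> tokenize -> generate -> decode steps shown above.
def translate(model, tokenizer, text, source_lang, target_lang, max_new_tokens=128):
    # "synthetic" is required by the Phase 1 unified task schema.
    prefix = f"translate synthetic {source_lang} to {target_lang}: "
    inputs = tokenizer(prefix + text, return_tensors="pt",
                       max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the objects loaded above, a call like `translate(model, tokenizer, "El perro duerme.", "Spanish", "Q'eqchi'")` handles the other supported direction with the same pipeline.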
## Limitations & Bias

This Phase 1 adapter was trained exclusively on a synthetic, rule-based generated dataset. It demonstrates strong foundational grammar, but its vocabulary coverage is limited, and its output may contain "synthetic artifacts": phrasing that sounds unnatural to native speakers. It is intended as a baseline for Phase 2 refinement and Phase 3 RLHF, not as a production-ready, standalone translation service.