Mayan-mT5: Q'eqchi' Translation Adapter (Phase 1)

This repository contains the LoRA adapter weights (Checkpoint 22000) for bidirectional machine translation between English/Spanish and Q'eqchi'. It is designed to be used in conjunction with the google/mt5-base model.

Status: Phase 1 complete. This model was trained on a synthetic corpus. The accompanying research paper is currently under peer review.

Usage and Inference

Because this is a PEFT/LoRA adapter, you must load the base mt5-base model first, then apply these weights.

Required Task Prefix

This model was trained with a unified task schema that tags the provenance of the source data. Because Phase 1 relies entirely on synthetic data, the word "synthetic" must appear in the task prefix, even when the text you pass in for inference is human-written. Omitting it prevents the model from activating the representations it learned during training.

The required format is: translate synthetic [Source Language] to [Target Language]:

  • Valid source/target languages: English, Spanish, Q'eqchi'

A complete loading and inference example:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

base_model_id = "google/mt5-base"
peft_model_id = "achulz/mayan-mt5-qeqchi-adapter"

# 1. Load Tokenizer and Base Model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)

# 2. Load the LoRA Adapter
model = PeftModel.from_pretrained(base_model, peft_model_id)

# 3. Format the Input with the correct Task Prefix
# Note: You MUST include "synthetic" as defined in the Phase 1 unified task schema.
task_prefix = "translate synthetic English to Q'eqchi': "
source_text = "The dog is sleeping in the house."
input_text = task_prefix + source_text

# 4. Tokenize and Generate
inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translation)
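Because the model only responds correctly when the prefix matches the training schema exactly, it can help to build the prefix programmatically rather than typing it by hand. The sketch below is illustrative only; the helper name `make_task_prefix` is not part of the repository:

```python
# Languages supported by the Phase 1 adapter, as listed above.
SUPPORTED_LANGUAGES = {"English", "Spanish", "Q'eqchi'"}

def make_task_prefix(src: str, tgt: str) -> str:
    """Build the Phase 1 task prefix for a supported language pair.

    Raises ValueError for unsupported names or identical source/target,
    so schema mistakes fail loudly instead of silently degrading output.
    """
    if src not in SUPPORTED_LANGUAGES or tgt not in SUPPORTED_LANGUAGES or src == tgt:
        raise ValueError(f"unsupported language pair: {src} -> {tgt}")
    # "synthetic" is mandatory in Phase 1, even for human-written input text.
    return f"translate synthetic {src} to {tgt}: "

print(make_task_prefix("Spanish", "Q'eqchi'"))
# -> translate synthetic Spanish to Q'eqchi':
```

The resulting string can be concatenated with the source text exactly as in the example above (`input_text = task_prefix + source_text`).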

Limitations & Bias

This Phase 1 adapter was trained exclusively on a synthetic, rule-based generated dataset. While it demonstrates solid foundational grammar, it may produce "synthetic artifacts" or phrasing that reads as unnatural to native speakers, and its vocabulary coverage remains limited. It is intended as a baseline for Phase 2 refinement and Phase 3 RLHF, not as a production-ready, standalone translation service.

