AigizK/bashkir-russian-parallel-corpora
| Current Model | Architecture | Focus |
|---|---|---|
| 🔴 Large Model | NLLB-1.3B (QLoRA) | Best Quality (SOTA) |
| 🟡 Medium Model | M2M-100 (418M) | Balanced |
| 🟢 Small (This Model) | MarianMT (Full FT) | Fastest / CPU Friendly |
This is the lightweight model from Team DevLake. Unlike the larger models, it was trained by full fine-tuning, using transfer learning from an English-Turkic MarianMT checkpoint (Helsinki-NLP/opus-mt-en-trk), with a manually expanded vocabulary to support Bashkir Cyrillic characters.
It is designed for environments with limited resources (CPU deployment, mobile, etc.).
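One way to sanity-check the expanded vocabulary is to confirm that the Bashkir-specific Cyrillic letters do not encode to the unknown token. The sketch below is an illustration, not the procedure used to build the model: `uncovered_chars` and `check_checkpoint` are hypothetical helpers, and the code assumes the tokenizer exposes `unk_token_id` and returns `input_ids` when called, as Hugging Face tokenizers do.

```python
# The nine Bashkir letters (upper and lower case) absent from the Russian alphabet.
BASHKIR_EXTRA = "ӘәӨөҮүҒғҠҡҢңҘҙҪҫҺһ"


def uncovered_chars(tokenizer, chars=BASHKIR_EXTRA):
    """Return the characters in `chars` that encode to the unknown token."""
    missing = []
    for ch in chars:
        ids = tokenizer(ch, add_special_tokens=False)["input_ids"]
        if tokenizer.unk_token_id in ids:
            missing.append(ch)
    return missing


def check_checkpoint(model_id="Voldis/marian-rus-bak"):
    """Load the tokenizer for `model_id` and report uncovered letters."""
    # Deferred import: only needed when a real checkpoint is loaded.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return uncovered_chars(tokenizer)
```

Calling `check_checkpoint()` downloads the tokenizer; an empty list would mean that every listed letter is covered.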
You must prepend the special token `>>bak<<` to every source sentence; the token is part of the input text and is followed by a space.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Voldis/marian-rus-bak"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Note the prefix!
text = ">>bak<< Добрый день, друзья!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
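Because the prefix is easy to forget, it may help to add it in one place when translating several sentences at once. The following is a minimal sketch under that assumption: `with_target_prefix` and `translate_batch` are hypothetical helper names, not part of the model or of `transformers`, and the padding and truncation settings are illustrative choices.

```python
def with_target_prefix(sentences, token=">>bak<<"):
    """Prepend the special token to each sentence unless it is already there."""
    prefixed = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence.startswith(token):
            prefixed.append(sentence)
        else:
            prefixed.append(f"{token} {sentence}")
    return prefixed


def translate_batch(sentences, model_id="Voldis/marian-rus-bak", max_length=128):
    """Translate a list of Russian sentences in a single padded batch."""
    # Deferred import keeps the prefix helper usable without transformers.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(
        with_target_prefix(sentences),
        return_tensors="pt",
        padding=True,       # pad to the longest sentence in the batch
        truncation=True,    # cut inputs that exceed max_length tokens
        max_length=max_length,
    )
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

For repeated calls, loading the tokenizer and model once outside the function and passing them in would avoid reloading the weights each time.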
```bibtex
@inproceedings{tyurin-2026-devlake,
    title = "{D}ev{L}ake at {L}o{R}es{MT} 2026: The Impact of Pre-training and Model Scale on {R}ussian-{B}ashkir Low-Resource Translation",
    author = "Tyurin, Vyacheslav",
    booktitle = "Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.loresmt-1.18",
    doi = "10.18653/v1/2026.loresmt-1.18",
    pages = "209--212",
}
```
Base model
Helsinki-NLP/opus-mt-en-trk