# rubai-corrector-base
Base ByT5 correction checkpoint for building task-specific Rubai correctors.
This is the foundation model of the rubai-corrector line. It is meant to be fine-tuned for a concrete task, such as:
- transcript display cleanup
- punctuation and comma recovery
- OCR and ASR typo repair
- apostrophe normalization
- mixed Uzbek/Russian cleanup
- domain-specific formatting rules
If you want a ready-to-use ASR display model, use rubai-corrector-transcript-uz. If you want the OCR-specialized old-books model, use rubai-corrector-ocr-books-uz. This package is the base for further adaptation.
## Authors
- Sardor Islomov — lead author
- Davron Ibrokhimov
## Model Family
| Model | Use Case |
|---|---|
| rubai-corrector-base (this model) | Fine-tuning base for new correction tasks |
| rubai-corrector-transcript-uz | Ready-to-use transcript display normalization |
| rubai-corrector-ocr-books-uz | OCR correction for old Uzbek books |
All three models share the same ByT5 architecture. The transcript and OCR models are fine-tuned from this base for their respective tasks.
## Quick Smoke Test
The model uses the `correct:` instruction prefix.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "men ozim kordim"
inputs = tokenizer([f"correct: {text}"], return_tensors="pt", padding=True)
output_ids = model.generate(**inputs, max_new_tokens=128)
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)
```
Expected output:
```
Men o'zim ko'rdim
```
For a local runnable example suite, see `test_model.py`.
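For more than a handful of sentences, the same call pattern extends to batches. The wrapper below is a minimal sketch; the `correct` helper and its batch size are illustrative, not part of this package.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

def correct(texts, batch_size=8, max_new_tokens=128):
    # Prefix each input, tokenize as a padded batch, and decode the outputs.
    results = []
    for i in range(0, len(texts), batch_size):
        batch = [f"correct: {t}" for t in texts[i:i + batch_size]]
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results

print(correct(["men ozim kordim", "togri yoldan boring"]))
```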
## Real Base Examples
These are real outputs from this packaged checkpoint.
### Abbreviations
Input: telefon rqami qaysi
Output: Telefon raqami qaysi
### Apostrophes
Input: men ozim kordim
Output: Men o'zim ko'rdim
Input: togri yoldan boring
Output: To'g'ri yo'ldan boring
### OCR and ASR Noise
Input: rnen universitetda oqiyrnan
Output: Men universitetda o'qiyman
Input: bu juda rnuhirn masala
Output: Bu juda muhim masala
### Numbers and Dates
Input: narxi yigirma besh ming so'm
Output: Narxi 25 000 so'm
Input: uchrashuv o'n beshinchi yanvar kuni
Output: Uchrashuv 15-yanvar kuni
### Mixed Uzbek and Russian
Input: men segodnya bozorga bordim
Output: Men сегодня bozorga bordim
Input: privet kak делa
Output: Привет как дела
## Fine-Tuning
This package includes a standalone fine-tuning script, `finetune.py`.
It keeps the same core training behavior as the original project line:
- input prefix: `correct:`
- ByT5 / `T5ForConditionalGeneration`
- Adafactor optimizer
- linear warmup scheduler
- seq2seq supervised fine-tuning on `input -> output` pairs
Example:
```bash
python finetune.py \
  --model-path islomov/rubai-corrector-base \
  --train-file ./data/train.jsonl \
  --eval-file ./data/valid.jsonl \
  --output-dir ./outputs/my-domain-corrector \
  --learning-rate 5e-5 \
  --num-train-epochs 2 \
  --per-device-train-batch-size 16 \
  --gradient-accumulation-steps 4 \
  --max-source-length 512 \
  --max-target-length 512 \
  --bf16
```
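As a rough picture of what such a script does internally, the sketch below reproduces the behaviors listed above (the `correct:` prefix, Adafactor, linear warmup) in plain PyTorch. It is illustrative, not the packaged `finetune.py`; `train_pairs` and the hyperparameters are placeholder assumptions.

```python
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Adafactor, get_linear_schedule_with_warmup)

tokenizer = AutoTokenizer.from_pretrained("islomov/rubai-corrector-base")
model = AutoModelForSeq2SeqLM.from_pretrained("islomov/rubai-corrector-base")

# Placeholder training data in the input/output pair format described below.
train_pairs = [{"input": "men ozim kordim", "output": "Men o'zim ko'rdim"}]

def collate(batch):
    # Prepend the instruction prefix and tokenize sources and targets.
    sources = [f"correct: {ex['input']}" for ex in batch]
    targets = [ex["output"] for ex in batch]
    enc = tokenizer(sources, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    labels = tokenizer(targets, padding=True, truncation=True,
                       max_length=512, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
# Adafactor with a fixed learning rate and a linear warmup schedule.
optimizer = Adafactor(model.parameters(), lr=5e-5,
                      scale_parameter=False, relative_step=False)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=len(loader) * 2)

model.train()
for epoch in range(2):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```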
## Input Data Format
Training data is JSONL. Each line must contain:
- `input`: noisy or source text
- `output`: target corrected text
Example:
```json
{"input":"men ozim kordim","output":"Men o'zim ko'rdim"}
{"input":"narxi yigirma besh ming so'm","output":"Narxi 25 000 so'm"}
{"input":"rnen universitetda oqiyrnan","output":"Men universitetda o'qiyman"}
{"input":"men segodnya bozorga bordim","output":"Men сегодня bozorga bordim"}
```
A tiny sample file is included with this package.
You can point `finetune.py` either to a JSONL file directly or to a directory containing `data.jsonl`.
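Before launching a run, a quick sanity check of the file can save time. A minimal sketch, assuming the example path from above:

```python
import json

path = "./data/train.jsonl"
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        row = json.loads(line)
        # Every row needs non-empty "input" and "output" fields.
        assert "input" in row and "output" in row, f"line {lineno}: missing key"
        assert row["input"].strip() and row["output"].strip(), f"line {lineno}: empty field"
```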
## How This Base Was Trained
This model starts from `google/byt5-small` and was built with a 3-stage curriculum on Uzbek text correction data.
### Stage 1 — Foundation
The foundation stage used ~1,000,000 synthetic correction pairs generated from Uzbek text with transformations such as:
- apostrophe removal
- comma removal
- lowercasing
- OCR-like character substitutions
- `h`/`x` swaps
- abbreviation-like corruption
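For illustration, a Stage 1-style corruption function might look like the sketch below. The rules and probabilities are invented for the example; the actual generation pipeline is not published here.

```python
import random

def corrupt(text: str) -> str:
    # Apply each corruption independently with an illustrative probability.
    out = text
    if random.random() < 0.5:
        out = out.replace("'", "")    # apostrophe removal
    if random.random() < 0.5:
        out = out.replace(",", "")    # comma removal
    if random.random() < 0.5:
        out = out.lower()             # lowercasing
    if random.random() < 0.3:
        out = out.replace("m", "rn")  # OCR-like character substitution
    if random.random() < 0.3:
        out = out.replace("x", "h")   # h/x swap
    return out

clean = "Men o'zim ko'rdim"
print({"input": corrupt(clean), "output": clean})
```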
### Stage 2 — Curated Mix
Stage 2 added ~408,000 curated rows covering:
- general error correction
- text denormalization (numbers, dates, formatting)
- Russian Latin-to-Cyrillic recovery
- focused apostrophe and `h`/`x` restoration
- anti-Cyrillic guardrails (prevent unwanted script switching)
### Stage 3 — Polish
Stage 3 used ~32,000 rows for fine-grained behavior tuning:
- comma and punctuation restoration
- exact-copy preservation (teach the model not to over-correct)
- format restoration (numbers, dates, addresses)
- mixed-script guardrails (prevent script bleeding between Uzbek and Russian)
- period hallucination prevention
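As an illustration, guardrail and exact-copy rows use the same JSONL format as any other pair; the target simply equals the source, or keeps the source's script. The rows below are hypothetical examples in that spirit, not samples from the actual Stage 2/3 data:

```json
{"input":"Bu matn allaqachon to'g'ri.","output":"Bu matn allaqachon to'g'ri."}
{"input":"men segodnya bozorga bordim","output":"Men сегодня bozorga bordim"}
```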
## Training Details
- Architecture: `T5ForConditionalGeneration` with ByT5 tokenizer
- Precision: BF16 mixed precision
- Optimizer: Adafactor
- Scheduler: linear warmup + linear decay
- Max sequence length: 512
- Gradient checkpointing: enabled
- Curriculum learning: enabled (length-sorted batches)
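If you prefer the Hugging Face `Trainer` API over a manual loop, these settings map roughly onto `Seq2SeqTrainingArguments` as sketched below. This is an approximate reconstruction, not the original configuration; `group_by_length` only approximates length-sorted batching, and the warmup step count is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./outputs/my-domain-corrector",
    bf16=True,                    # BF16 mixed precision
    optim="adafactor",            # Adafactor optimizer
    lr_scheduler_type="linear",   # linear decay after warmup
    warmup_steps=500,             # linear warmup (placeholder value)
    gradient_checkpointing=True,
    group_by_length=True,         # approximates length-sorted batches
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    learning_rate=5e-5,
)
```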
## Notes
- This base model is for continuation training and task-specific adaptation.
- It can be used directly for inference, but that is not its main role in the model family.
- For Rubai STT postprocessing out of the box, use rubai-corrector-transcript-uz.
- For old-book OCR correction, use rubai-corrector-ocr-books-uz.
## Acknowledgements
Special thanks to Davron Ibrokhimov for sponsoring this work and making it possible to keep these models open.
Thank you to the community that supports Uzbek language technology. In particular:
- MetaSell for support and resources
- Kotib for their support and collaboration on Uzbek STT
- Global Move for backing open Uzbek NLP work
Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.
Support my work and the open-source movement: https://tirikchilik.uz/islomovs