rubai-corrector-base

Base ByT5 correction checkpoint for building task-specific Rubai correctors.

This is the foundation model of the rubai-corrector line. It is meant to be fine-tuned for a concrete task, such as:

  • transcript display cleanup
  • punctuation and comma recovery
  • OCR and ASR typo repair
  • apostrophe normalization
  • mixed Uzbek/Russian cleanup
  • domain-specific formatting rules

If you want a ready-to-use ASR display model, use rubai-corrector-transcript-uz. If you want the OCR-specialized old-books model, use rubai-corrector-ocr-books-uz. This package is the base for further adaptation.

Model Family

Model                               Use Case
rubai-corrector-base (this model)   Fine-tuning base for new correction tasks
rubai-corrector-transcript-uz       Ready-to-use transcript display normalization
rubai-corrector-ocr-books-uz        OCR correction for old Uzbek books

All three models share the same ByT5 architecture. The transcript model is fine-tuned from this base for ASR display text.

Quick Smoke Test

The model uses the correct: instruction prefix.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "men ozim kordim"
inputs = tokenizer([f"correct: {text}"], return_tensors="pt", padding=True)
output_ids = model.generate(**inputs, max_new_tokens=128)
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)

Expected output:

Men o'zim ko'rdim

For a local runnable example suite, see test_model.py.

Real Base Examples

These are real outputs from this packaged checkpoint.

Abbreviations

Input:  telefon rqami qaysi
Output: Telefon raqami qaysi

Apostrophes

Input:  men ozim kordim
Output: Men o'zim ko'rdim

Input:  togri yoldan boring
Output: To'g'ri yo'ldan boring

OCR And ASR Noise

Input:  rnen universitetda oqiyrnan
Output: Men universitetda o'qiyman

Input:  bu juda rnuhirn masala
Output: Bu juda muhim masala
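The "rn" → "m" confusion in the examples above is a classic OCR artifact: two narrow glyphs read as one. A minimal sketch of how such noise can be synthesized to build (noisy, clean) training pairs — the confusion table is illustrative, not the one used to train this checkpoint:

```python
# Sketch: inject OCR-style character confusions into clean text.
# The substitution table below is illustrative only.
OCR_CONFUSIONS = {
    "m": "rn",   # "m" is commonly misread as "rn"
    "w": "vv",   # "w" as "vv"
}

def add_ocr_noise(text: str) -> str:
    """Replace characters with their common OCR misreadings."""
    return "".join(OCR_CONFUSIONS.get(ch, ch) for ch in text)

print(add_ocr_noise("men universitetda oqiyman"))
# rnen universitetda oqiyrnan
```

Running the clean sentence through this corruptor reproduces exactly the noisy input shown above, which is the pattern the model learns to invert.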

Numbers And Dates

Input:  narxi yigirma besh ming so'm
Output: Narxi 25 000 so'm

Input:  uchrashuv o'n beshinchi yanvar kuni
Output: Uchrashuv 15-yanvar kuni

Mixed Uzbek And Russian

Input:  men segodnya bozorga bordim
Output: Men сегодня bozorga bordim

Input:  privet kak делa
Output: Привет как дела
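The desired behavior in the mixed examples above is that each word ends up in a single consistent script. A rough way to audit outputs for script bleeding is to classify each word by the scripts of its letters; this helper is hypothetical, not part of the package:

```python
# Sketch: flag unwanted Latin/Cyrillic mixing inside a single word.
# Hypothetical audit helper for model outputs.
def script_of(word: str) -> str:
    """Classify a word as latin, cyrillic, mixed, or other."""
    latin = sum("a" <= c.lower() <= "z" for c in word)
    cyrillic = sum("\u0400" <= c <= "\u04FF" for c in word)
    if latin and cyrillic:
        return "mixed"
    if cyrillic:
        return "cyrillic"
    if latin:
        return "latin"
    return "other"

words = "Men сегодня bozorga bordim".split()
print([script_of(w) for w in words])
# ['latin', 'cyrillic', 'latin', 'latin']
```

Any word reported as "mixed" would indicate the kind of script bleeding the guardrail training stages try to prevent.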

Fine-Tuning

This package includes a standalone fine-tuning script, finetune.py.

The script keeps the same core training behavior as the rest of the rubai-corrector model line:

  • input prefix: correct:
  • ByT5 / T5ForConditionalGeneration
  • Adafactor optimizer
  • linear warmup scheduler
  • seq2seq supervised fine-tuning on input -> output pairs

Example:

python finetune.py \
  --model-path islomov/rubai-corrector-base \
  --train-file ./data/train.jsonl \
  --eval-file ./data/valid.jsonl \
  --output-dir ./outputs/my-domain-corrector \
  --learning-rate 5e-5 \
  --num-train-epochs 2 \
  --per-device-train-batch-size 16 \
  --gradient-accumulation-steps 4 \
  --max-source-length 512 \
  --max-target-length 512 \
  --bf16

Input Data Format

Training data is JSONL. Each line must contain:

  • input: noisy or source text
  • output: target corrected text

Example:

{"input":"men ozim kordim","output":"Men o'zim ko'rdim"}
{"input":"narxi yigirma besh ming so'm","output":"Narxi 25 000 so'm"}
{"input":"rnen universitetda oqiyrnan","output":"Men universitetda o'qiyman"}
{"input":"men segodnya bozorga bordim","output":"Men сегодня bozorga bordim"}

A tiny sample file is included in this package.

You can point finetune.py either to a JSONL file directly or to a directory containing data.jsonl.
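Malformed rows are easy to catch before a training run with a short stdlib-only check; the helper and the sample file name below are illustrative, not part of finetune.py:

```python
import json

def validate_jsonl(path: str) -> int:
    """Count rows with non-empty string 'input' and 'output' fields.

    Raises ValueError on the first malformed line.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            for key in ("input", "output"):
                if not isinstance(row.get(key), str) or not row[key]:
                    raise ValueError(f"line {lineno}: missing or empty {key!r}")
            count += 1
    return count

# Demo on a small sample (file name is illustrative)
with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write('{"input":"men ozim kordim","output":"Men o\'zim ko\'rdim"}\n')
    f.write('{"input":"togri yoldan boring","output":"To\'g\'ri yo\'ldan boring"}\n')

print(validate_jsonl("sample.jsonl"))  # 2
```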

How This Base Was Trained

This model starts from google/byt5-small and was built with a 3-stage curriculum on Uzbek text correction data.

Stage 1 — Foundation

The foundation stage used ~1,000,000 synthetic correction pairs generated from Uzbek text with transformations such as:

  • apostrophe removal
  • comma removal
  • lowercasing
  • OCR-like character substitutions
  • h/x swaps
  • abbreviation-like corruption
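Several of these transformations can be imagined as simple string operations applied with some probability to clean text; the sketch below illustrates four of them and is not the actual generation code:

```python
import random

def corrupt(text: str, rng: random.Random) -> str:
    """Apply a random subset of Stage-1-style corruptions. Illustrative only;
    the probabilities are made up for the sketch."""
    if rng.random() < 0.8:
        text = text.replace("'", "")   # apostrophe removal
    if rng.random() < 0.5:
        text = text.replace(",", "")   # comma removal
    if rng.random() < 0.5:
        text = text.lower()            # lowercasing
    if rng.random() < 0.3:
        text = text.replace("h", "x")  # h/x swap
    return text

rng = random.Random(0)
print(corrupt("Men o'zim ko'rdim", rng))
```

Pairing each corrupted string with its original yields the (input, output) supervision the seq2seq objective needs.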

Stage 2 — Curated Mix

Stage 2 added ~408,000 curated rows covering:

  • general error correction
  • text denormalization (numbers, dates, formatting)
  • Russian Latin-to-Cyrillic recovery
  • focused apostrophe and h/x restoration
  • anti-Cyrillic guardrails (prevent unwanted script switching)

Stage 3 — Polish

Stage 3 used ~32,000 rows for fine-grained behavior tuning:

  • comma and punctuation restoration
  • exact-copy preservation (teach the model not to over-correct)
  • format restoration (numbers, dates, addresses)
  • mixed-script guardrails (prevent script bleeding between Uzbek and Russian)
  • period hallucination prevention
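The exact-copy objective above can be measured with a simple over-correction rate: the fraction of already-clean inputs the model fails to return unchanged. A hypothetical metric helper, not part of this package:

```python
def over_correction_rate(clean_inputs, predictions):
    """Fraction of clean inputs the model altered (lower is better)."""
    assert len(clean_inputs) == len(predictions)
    changed = sum(c != p for c, p in zip(clean_inputs, predictions))
    return changed / len(clean_inputs)

clean = ["Men o'zim ko'rdim", "Narxi 25 000 so'm"]
preds = ["Men o'zim ko'rdim", "Narxi 25 000 so'm."]  # a hallucinated period
print(over_correction_rate(clean, preds))  # 0.5
```

The second prediction shows exactly the period hallucination that Stage 3 targets; a well-tuned corrector should drive this rate toward zero on clean text.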

Training Details

  • Architecture: T5ForConditionalGeneration with ByT5 tokenizer
  • Precision: BF16 mixed precision
  • Optimizer: Adafactor
  • Scheduler: linear warmup + linear decay
  • Max sequence length: 512
  • Gradient checkpointing: enabled
  • Curriculum learning: enabled (length-sorted batches)
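The scheduler above (linear warmup followed by linear decay) can be written as a pure function of the optimizer step. This sketch mirrors the behavior of transformers' get_linear_schedule_with_warmup; it is illustrative, not the training code itself:

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier on the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # ramp linearly from 0 up to 1 over the warmup phase
        return step / max(1, warmup_steps)
    # then decay linearly from 1 down to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print(linear_warmup_decay(50, 100, 1000))    # 0.5  (halfway through warmup)
print(linear_warmup_decay(100, 100, 1000))   # 1.0  (warmup finished)
print(linear_warmup_decay(1000, 100, 1000))  # 0.0  (fully decayed)
```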

Notes

  • This base model is for continuation training and task-specific adaptation.
  • It can be used directly for inference, but that is not its main role in the model family.
  • For Rubai STT postprocessing out of the box, use rubai-corrector-transcript-uz.
  • For old-book OCR correction, use rubai-corrector-ocr-books-uz.

Acknowledgements

Special thanks to Davron Ibrokhimov for sponsoring this work and making it possible to keep these models open.

Thank you to the community that supports Uzbek language technology. In particular:

  • MetaSell for support and resources
  • Kotib for their support and collaboration on Uzbek STT
  • Global Move for backing open Uzbek NLP work

Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.

Support my work and the open-source movement: https://tirikchilik.uz/islomovs

Model size: 0.3B parameters (F32, Safetensors)