# rubai-corrector-base
Base ByT5 correction checkpoint for building task-specific Rubai correctors.
This is the foundation model of the rubai-corrector line. It is meant to be fine-tuned for a concrete task, such as:
- transcript display cleanup
- punctuation and comma recovery
- OCR and ASR typo repair
- apostrophe normalization
- mixed Uzbek/Russian cleanup
- domain-specific formatting rules
If you want a ready-to-use ASR display model, use rubai-corrector-transcript-uz. If you want the OCR-specialized old-books model, use rubai-corrector-ocr-books-uz. This package is the base for further adaptation.
## Authors
- Sardor Islomov — lead author
- Davron Ibrokhimov
## Model Family
| Model | Use Case |
|---|---|
| rubai-corrector-base (this model) | Fine-tuning base for new correction tasks |
| rubai-corrector-transcript-uz | Ready-to-use transcript display normalization |
| rubai-corrector-ocr-books-uz | OCR correction for old Uzbek books |
All three models share the same ByT5 architecture. The transcript and OCR models are fine-tuned from this base for their respective tasks.
## Quick Smoke Test
The model uses the `correct:` instruction prefix.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "men ozim kordim"
inputs = tokenizer([f"correct: {text}"], return_tensors="pt", padding=True)
output_ids = model.generate(**inputs, max_new_tokens=128)
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)
```
Expected output:
```
Men o'zim ko'rdim
```
For a local runnable example suite, see `test_model.py`.
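For more than a handful of sentences, the same call pattern extends to batches. The wrapper below is a minimal sketch; the `correct` helper and its batch size are illustrative, not part of this package.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

def correct(texts, batch_size=8, max_new_tokens=128):
    # Prefix each input, tokenize as a padded batch, and decode the outputs.
    results = []
    for i in range(0, len(texts), batch_size):
        batch = [f"correct: {t}" for t in texts[i:i + batch_size]]
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results

print(correct(["men ozim kordim", "togri yoldan boring"]))
```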
## Real Base Examples
These are real outputs from this packaged checkpoint.
### Abbreviations
Input: telefon rqami qaysi
Output: Telefon raqami qaysi
### Apostrophes
Input: men ozim kordim
Output: Men o'zim ko'rdim
Input: togri yoldan boring
Output: To'g'ri yo'ldan boring
### OCR and ASR Noise
Input: rnen universitetda oqiyrnan
Output: Men universitetda o'qiyman
Input: bu juda rnuhirn masala
Output: Bu juda muhim masala
### Numbers and Dates
Input: narxi yigirma besh ming so'm
Output: Narxi 25 000 so'm
Input: uchrashuv o'n beshinchi yanvar kuni
Output: Uchrashuv 15-yanvar kuni
### Mixed Uzbek and Russian
Input: men segodnya bozorga bordim
Output: Men сегодня bozorga bordim
Input: privet kak делa
Output: Привет как дела
## Fine-Tuning
This package includes a standalone fine-tuning script, `finetune.py`.
It keeps the same core training behavior as the original project line:
- input prefix: `correct:`
- ByT5 / `T5ForConditionalGeneration`
- Adafactor optimizer
- linear warmup scheduler
- seq2seq supervised fine-tuning on `input -> output` pairs
Example:
```bash
python finetune.py \
  --model-path islomov/rubai-corrector-base \
  --train-file ./data/train.jsonl \
  --eval-file ./data/valid.jsonl \
  --output-dir ./outputs/my-domain-corrector \
  --learning-rate 5e-5 \
  --num-train-epochs 2 \
  --per-device-train-batch-size 16 \
  --gradient-accumulation-steps 4 \
  --max-source-length 512 \
  --max-target-length 512 \
  --bf16
```
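As a rough picture of what such a script does internally, the sketch below reproduces the behaviors listed above (the `correct:` prefix, Adafactor, linear warmup) in plain PyTorch. It is illustrative, not the packaged `finetune.py`; `train_pairs` and the hyperparameters are placeholder assumptions.

```python
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Adafactor, get_linear_schedule_with_warmup)

tokenizer = AutoTokenizer.from_pretrained("islomov/rubai-corrector-base")
model = AutoModelForSeq2SeqLM.from_pretrained("islomov/rubai-corrector-base")

# Placeholder training data in the input/output pair format described below.
train_pairs = [{"input": "men ozim kordim", "output": "Men o'zim ko'rdim"}]

def collate(batch):
    # Prepend the instruction prefix and tokenize sources and targets.
    sources = [f"correct: {ex['input']}" for ex in batch]
    targets = [ex["output"] for ex in batch]
    enc = tokenizer(sources, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    labels = tokenizer(targets, padding=True, truncation=True,
                       max_length=512, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
# Adafactor with a fixed learning rate and a linear warmup schedule.
optimizer = Adafactor(model.parameters(), lr=5e-5,
                      scale_parameter=False, relative_step=False)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=len(loader) * 2)

model.train()
for epoch in range(2):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```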
## Input Data Format
Training data is JSONL. Each line must contain:
- `input`: noisy or source text
- `output`: target corrected text
Example:
```json
{"input":"men ozim kordim","output":"Men o'zim ko'rdim"}
{"input":"narxi yigirma besh ming so'm","output":"Narxi 25 000 so'm"}
{"input":"rnen universitetda oqiyrnan","output":"Men universitetda o'qiyman"}
{"input":"men segodnya bozorga bordim","output":"Men сегодня bozorga bordim"}
```
A tiny sample file is included with this package.
You can point `finetune.py` either to a JSONL file directly or to a directory containing `data.jsonl`.
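Before launching a run, a quick sanity check of the file can save time. A minimal sketch, assuming the example path from above:

```python
import json

path = "./data/train.jsonl"
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        row = json.loads(line)
        # Every row needs non-empty "input" and "output" fields.
        assert "input" in row and "output" in row, f"line {lineno}: missing key"
        assert row["input"].strip() and row["output"].strip(), f"line {lineno}: empty field"
```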
## How This Base Was Trained
This model starts from `google/byt5-small` and was built with a 3-stage curriculum on Uzbek text correction data.
### Stage 1 — Foundation
The foundation stage used ~1,000,000 synthetic correction pairs generated from Uzbek text with transformations such as:
- apostrophe removal
- comma removal
- lowercasing
- OCR-like character substitutions
- `h`/`x` swaps
- abbreviation-like corruption
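For illustration, a Stage 1-style corruption function might look like the sketch below. The rules and probabilities are invented for the example; the actual generation pipeline is not published here.

```python
import random

def corrupt(text: str) -> str:
    # Apply each corruption independently with an illustrative probability.
    out = text
    if random.random() < 0.5:
        out = out.replace("'", "")    # apostrophe removal
    if random.random() < 0.5:
        out = out.replace(",", "")    # comma removal
    if random.random() < 0.5:
        out = out.lower()             # lowercasing
    if random.random() < 0.3:
        out = out.replace("m", "rn")  # OCR-like character substitution
    if random.random() < 0.3:
        out = out.replace("x", "h")   # h/x swap
    return out

clean = "Men o'zim ko'rdim"
print({"input": corrupt(clean), "output": clean})
```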
### Stage 2 — Curated Mix
Stage 2 added ~408,000 curated rows covering:
- general error correction
- text denormalization (numbers, dates, formatting)
- Russian Latin-to-Cyrillic recovery
- focused apostrophe and `h`/`x` restoration
- anti-Cyrillic guardrails (prevent unwanted script switching)
### Stage 3 — Polish
Stage 3 used ~32,000 rows for fine-grained behavior tuning:
- comma and punctuation restoration
- exact-copy preservation (teach the model not to over-correct)
- format restoration (numbers, dates, addresses)
- mixed-script guardrails (prevent script bleeding between Uzbek and Russian)
- period hallucination prevention
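As an illustration, guardrail and exact-copy rows use the same JSONL format as any other pair; the target simply equals the source, or keeps the source's script. The rows below are hypothetical examples in that spirit, not samples from the actual Stage 2/3 data:

```json
{"input":"Bu matn allaqachon to'g'ri.","output":"Bu matn allaqachon to'g'ri."}
{"input":"men segodnya bozorga bordim","output":"Men сегодня bozorga bordim"}
```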
## Training Details
- Architecture: `T5ForConditionalGeneration` with ByT5 tokenizer
- Precision: BF16 mixed precision
- Optimizer: Adafactor
- Scheduler: linear warmup + linear decay
- Max sequence length: 512
- Gradient checkpointing: enabled
- Curriculum learning: enabled (length-sorted batches)
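If you prefer the Hugging Face `Trainer` API over a manual loop, these settings map roughly onto `Seq2SeqTrainingArguments` as sketched below. This is an approximate reconstruction, not the original configuration; `group_by_length` only approximates length-sorted batching, and the warmup step count is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./outputs/my-domain-corrector",
    bf16=True,                    # BF16 mixed precision
    optim="adafactor",            # Adafactor optimizer
    lr_scheduler_type="linear",   # linear decay after warmup
    warmup_steps=500,             # linear warmup (placeholder value)
    gradient_checkpointing=True,
    group_by_length=True,         # approximates length-sorted batches
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    learning_rate=5e-5,
)
```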
## Notes
- This base model is for continuation training and task-specific adaptation.
- It can be used directly for inference, but that is not its main role in the model family.
- For Rubai STT postprocessing out of the box, use rubai-corrector-transcript-uz.
- For old-book OCR correction, use rubai-corrector-ocr-books-uz.
## Acknowledgements
Special thanks to Davron Ibrokhimov for sponsoring this work and making it possible to keep these models open.
Thank you to the community that supports Uzbek language technology. In particular:
- MetaSell for support and resources
- Kotib for their support and collaboration on Uzbek STT
- Global Move for backing open Uzbek NLP work
Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.
Support my work and the open-source movement: https://tirikchilik.uz/islomovs