---
language:
- uz
- ru
library_name: transformers
pipeline_tag: text-generation
tags:
- uzbek
- russian
- asr-postprocessing
- transcript-normalization
- byt5
---

# rubai-corrector-transcript-uz

Transcript-display normalization model for Uzbek ASR output with mixed Uzbek/Russian support. Built on the ByT5 architecture.

This is the transcript-display variant of the **rubai-corrector** model family. For the fine-tuning foundation checkpoint, see [rubai-corrector-base](https://huggingface.co/rubai/rubai-corrector-base).

## Authors

- **[Sardor Islomov](https://www.linkedin.com/in/islomov-sardor/)** — lead author
- [Davron Ibrokhimov](https://www.linkedin.com/in/davron-ibrokhimov-8b62b8287/)

This checkpoint is tuned for:
- display-ready punctuation and casing
- apostrophe normalization
- OCR/ASR typo cleanup
- Latin-script Russian → Cyrillic Russian recovery
- mixed Uzbek/Russian transcript cleanup
- selected text-to-number normalization patterns

## Intended Use

Use this model after ASR to convert noisy transcript text into cleaner display text.

It is best suited for:
- Rubai-style Uzbek ASR postprocessing
- Uzbek display-text cleanup
- mixed Uzbek/Russian lines where Russian appears in Latin transcription

Primary upstream ASR models this normalizer is intended to follow within the same Rubai model family:
- [islomov/rubaistt_v2_medium](https://huggingface.co/islomov/rubaistt_v2_medium)
- [Kotib/uzbek_stt_v1](https://huggingface.co/Kotib/uzbek_stt_v1)

The model is focused on line-level transcript outputs that resemble the text produced by those ASR models.
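
Because the model targets line-level text, a longer transcript is best corrected one line at a time. The sketch below shows only the pure preprocessing step, assuming the `correct: ` prefix used in the Quick Start; `prepare_lines` is an illustrative helper, not part of the released package.

```python
def prepare_lines(transcript: str) -> list[str]:
    """Split a multi-line ASR transcript into non-empty lines and
    add the 'correct: ' instruction prefix the model expects."""
    lines = [line.strip() for line in transcript.splitlines()]
    return [f"correct: {line}" for line in lines if line]

# A two-line transcript becomes two prefixed model inputs.
prepared = prepare_lines("tlefon rqami\nsegodnya xoroshiy den.")
print(prepared)
# → ['correct: tlefon rqami', 'correct: segodnya xoroshiy den.']
```

Each prepared string can then be tokenized and passed to `model.generate` exactly as in the Quick Start snippet below.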

It is not optimized for:
- literal no-edit transcript preservation
- noisy Gemini-style mixed-script metadata targets with forced Cyrillic inside Uzbek morphology
- aggressive general denormalization beyond the transcript-display objective

## Model Family

| Model | Use Case |
|---|---|
| [rubai-corrector-base](https://huggingface.co/rubai/rubai-corrector-base) | Fine-tuning base for new correction tasks |
| **rubai-corrector-transcript-uz** (this model) | ASR transcript display normalization, mixed Uzbek/Russian |

Both models share the same ByT5 architecture. This variant is fine-tuned from the base with additional transcript-display objectives.
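
ByT5 operates on raw UTF-8 bytes rather than subword tokens, which is one reason this family can handle apostrophes, OCR digit confusions, and mixed Latin/Cyrillic script without out-of-vocabulary gaps. A quick illustration of that byte-level view, in plain Python with no tokenizer required:

```python
# ByT5 models consume raw UTF-8 bytes, so Latin and Cyrillic text
# share one universal "vocabulary" of 256 byte values.
latin = "salom"       # Uzbek, Latin script
cyrillic = "привет"   # Russian, Cyrillic script

latin_bytes = latin.encode("utf-8")
cyrillic_bytes = cyrillic.encode("utf-8")

print(len(latin_bytes))     # 5  (one byte per ASCII character)
print(len(cyrillic_bytes))  # 12 (two bytes per Cyrillic character)
```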

## Quick Start

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-transcript-uz"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "bugun yaxshi kun. segodnya xoroshiy den."
inputs = tokenizer(f"correct: {text}", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)
```

Expected output:

```text
Bugun yaxshi kun. Сегодня хороший день.
```

## Real Example Outputs

The examples below are taken from this exact checkpoint's saved eval/test outputs.

### Abbreviations / shorthand

```text
Input: tlefon rqami
Output: Telefon raqami
```

```text
Input: telefon rqami qaysi
Output: Telefon raqami qaysi
```

### Apostrophes

```text
Input: ozbekiston gozal mamlakat bolgan
Output: O'zbekiston go'zal mamlakat bo'lgan
```

```text
Input: men ozim kordim
Output: Men o'zim ko'rdim.
```

### OCR / ASR noise

```text
Input: 0zbekiston Respub1ikasi
Output: O'zbekiston Respublikasi
```

```text
Input: 5alom dostlar
Output: Salom do'stlar
```

### Numbers

```text
Input: uchrashuv o'n beshinchi yanvar kuni
Output: Uchrashuv 15-yanvar kuni
```

```text
Input: narxi yigirma besh ming so'm
Output: Narxi 25 000 so'm
```

### Mixed Uzbek + Russian

```text
Input: bugun yaxshi kun. segodnya xoroshiy den.
Output: Bugun yaxshi kun. Сегодня хороший день.
```

```text
Input: men bozorga bordim. tam ya kupil xleb.
Output: Men bozorga bordim. Там я купил хлеб.
```

### Russian only

```text
Input: segodnya xoroshaya pogoda
Output: Сегодня хорошая погода
```

```text
Input: privet kak dela
Output: Привет как дела
```

### Mixed script

```text
Input: privet kak делa
Output: Привет как дела
```

```text
Input: zaklad bersa keyin gaplashamiz
Output: Заклад bersa keyin gaplashamiz
```

### Display-text cleanup

```text
Input: mustahkamlik sinovida spark boshqa avtomobillarni ortda qoldirdi.
Output: Mustahkamlik sinovida Spark boshqa avtomobillarni ortda qoldirdi.
```

```text
Input: kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin
Output: Kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin.
```

## Known Tradeoff

This model is more display-oriented than the base checkpoint.

That means:
- it is better at final punctuation and finished-sentence formatting
- it may add a final period where an older reference transcript omitted it

## Files

- `test_model.py`: small runnable example/test script for local use and HF packaging

## How This Model Was Trained

This model is fine-tuned from [rubai-corrector-base](https://huggingface.co/rubai/rubai-corrector-base) (which itself is built on [google/byt5-small](https://huggingface.co/google/byt5-small)).

The fine-tuning added transcript-display objectives on top of the base correction capabilities:
- Uzbek transcript/display pairs from ASR output
- Russian recovery pairs from Latin-script ASR output
- punctuation and formatting polish data

The model expects the `correct: ` instruction prefix at inference time.
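
Since a missing prefix silently degrades output quality, a small guard before tokenization can help. This is only a sketch; `with_prefix` is an illustrative helper, not part of the model package.

```python
PREFIX = "correct: "

def with_prefix(text: str) -> str:
    """Return the text with the 'correct: ' instruction prefix,
    adding it only when it is not already present."""
    return text if text.startswith(PREFIX) else PREFIX + text

print(with_prefix("salom dostlar"))           # correct: salom dostlar
print(with_prefix("correct: salom dostlar"))  # correct: salom dostlar
```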

## Acknowledgements

Special thanks to [Davron Ibrokhimov](https://www.linkedin.com/in/davron-ibrokhimov-8b62b8287/) for sponsoring this work and making it possible to keep these models open.

Thank you to the community that supports Uzbek language technology. In particular:
- [MetaSell](https://metasell.ai/) for support and resources
- [Kotib](https://kotib.ai/) for their support and collaboration on Uzbek STT
- [Global Move](https://globalmove.uz/) for backing open Uzbek NLP work

Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.

Support my work and the open-source movement: https://tirikchilik.uz/islomovs