---
language:
- uz
- ru
library_name: transformers
pipeline_tag: text-generation
tags:
- uzbek
- russian
- asr-postprocessing
- transcript-normalization
- byt5
---
# rubai-corrector-transcript-uz
Transcript-display normalization model for Uzbek ASR output, with mixed Uzbek/Russian support. Built on the ByT5 architecture.
This is the transcript-display variant of the **rubai-corrector** model family. For the fine-tuning foundation checkpoint, see [rubai-corrector-base](https://huggingface.co/rubai/rubai-corrector-base).
## Authors
- **[Sardor Islomov](https://www.linkedin.com/in/islomov-sardor/)** — lead author
- [Davron Ibrokhimov](https://www.linkedin.com/in/davron-ibrokhimov-8b62b8287/)
This checkpoint is tuned for:
- display-ready punctuation and casing
- apostrophe normalization
- OCR / ASR typo cleanup
- recovery of Latin-script Russian into Cyrillic
- mixed Uzbek/Russian transcript cleanup
- selected text-to-number normalization patterns
## Intended Use
Use this model after ASR to convert noisy transcript text into clean, display-ready text.
It is best for:
- Rubai-style Uzbek ASR postprocessing
- Uzbek display text cleanup
- mixed Uzbek/Russian lines where Russian appears in Latin transcription
Primary upstream ASR models this normalizer is intended to follow within the same Rubai model family:
- [islomov/rubaistt_v2_medium](https://huggingface.co/islomov/rubaistt_v2_medium)
- [Kotib/uzbek_stt_v1](https://huggingface.co/Kotib/uzbek_stt_v1)
The model is focused on line-level transcript outputs that look like the text produced by those ASR models.
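Because the model operates on line-level transcripts, a multi-line ASR output is best split into individual lines, each prefixed with the `correct: ` instruction before generation. A minimal preprocessing sketch (`prepare_lines` is an illustrative helper, not shipped with the model):

```python
def prepare_lines(asr_text: str) -> list[str]:
    # Split multi-line ASR output into lines, drop empty ones, and prepend
    # the "correct: " instruction prefix the model expects. Each prefixed
    # line is then passed to the model separately.
    return [
        f"correct: {line.strip()}"
        for line in asr_text.splitlines()
        if line.strip()
    ]

print(prepare_lines("tlefon rqami\nprivet kak dela"))
# → ['correct: tlefon rqami', 'correct: privet kak dela']
```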
It is not optimized for:
- literal no-edit transcript preservation
- noisy Gemini-style mixed-script metadata targets with forced Cyrillic inside Uzbek morphology
- aggressive general denormalization beyond the transcript-display objective
## Model Family
| Model | Use Case |
|---|---|
| [rubai-corrector-base](https://huggingface.co/rubai/rubai-corrector-base) | Fine-tuning base for new correction tasks |
| **rubai-corrector-transcript-uz** (this model) | ASR transcript display normalization, mixed Uzbek/Russian |
Both models share the same ByT5 architecture. This variant is fine-tuned from the base with additional transcript-display objectives.
## Quick Start
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the checkpoint from the Hugging Face Hub
model_path = "islomov/rubai-corrector-transcript-uz"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "bugun yaxshi kun. segodnya xoroshiy den."

# The model expects the "correct: " instruction prefix at inference time
inputs = tokenizer(f"correct: {text}", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)
```
Expected output:
```text
Bugun yaxshi kun. Сегодня хороший день.
```
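Several lines can also be corrected in one batch by padding the tokenized inputs. A hedged sketch assuming the same checkpoint and prefix convention as above (`build_inputs` and `batch_correct` are illustrative helpers, not part of the released package):

```python
def build_inputs(lines):
    # Every line gets the "correct: " instruction prefix the model expects.
    return [f"correct: {line}" for line in lines]

def batch_correct(model, tokenizer, lines, max_new_tokens=256):
    # Pad the batch so lines of different lengths can be generated together.
    enc = tokenizer(build_inputs(lines), return_tensors="pt", padding=True)
    output_ids = model.generate(**enc, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# Usage (downloads the checkpoint on first run):
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("islomov/rubai-corrector-transcript-uz")
# model = AutoModelForSeq2SeqLM.from_pretrained("islomov/rubai-corrector-transcript-uz")
# print(batch_correct(model, tokenizer, ["tlefon rqami", "privet kak dela"]))
```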
## Real Example Outputs
The examples below come from this checkpoint's saved evaluation and test outputs.
### Abbreviations / shorthand
```text
Input: tlefon rqami
Output: Telefon raqami
```
```text
Input: telefon rqami qaysi
Output: Telefon raqami qaysi
```
### Apostrophes
```text
Input: ozbekiston gozal mamlakat bolgan
Output: O'zbekiston go'zal mamlakat bo'lgan
```
```text
Input: men ozim kordim
Output: Men o'zim ko'rdim.
```
### OCR / ASR noise
```text
Input: 0zbekiston Respub1ikasi
Output: O'zbekiston Respublikasi
```
```text
Input: 5alom dostlar
Output: Salom do'stlar
```
### Numbers
```text
Input: uchrashuv o'n beshinchi yanvar kuni
Output: Uchrashuv 15-yanvar kuni
```
```text
Input: narxi yigirma besh ming so'm
Output: Narxi 25 000 so'm
```
### Mixed Uzbek + Russian
```text
Input: bugun yaxshi kun. segodnya xoroshiy den.
Output: Bugun yaxshi kun. Сегодня хороший день.
```
```text
Input: men bozorga bordim. tam ya kupil xleb.
Output: Men bozorga bordim. Там я купил хлеб.
```
### Russian only
```text
Input: segodnya xoroshaya pogoda
Output: Сегодня хорошая погода
```
```text
Input: privet kak dela
Output: Привет как дела
```
### Mixed script
```text
Input: privet kak делa
Output: Привет как дела
```
```text
Input: zaklad bersa keyin gaplashamiz
Output: Заклад bersa keyin gaplashamiz
```
### Display-text cleanup
```text
Input: mustahkamlik sinovida spark boshqa avtomobillarni ortda qoldirdi.
Output: Mustahkamlik sinovida Spark boshqa avtomobillarni ortda qoldirdi.
```
```text
Input: kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin
Output: Kadrlarda kranning mashina old oynasi ustiga qulaganligini ko'rish mumkin.
```
## Known Tradeoff
This model is more display-oriented than [rubai-corrector-base](https://huggingface.co/rubai/rubai-corrector-base).
That means:
- it is better at final punctuation and finished-sentence formatting
- it may add a final period where a reference transcript omitted one
## Files
- `test_model.py`: a small runnable example/test script for local use and HF packaging
## How This Model Was Trained
This model is fine-tuned from [rubai-corrector-base](https://huggingface.co/rubai/rubai-corrector-base) (which itself is built on [google/byt5-small](https://huggingface.co/google/byt5-small)).
The fine-tuning added transcript-display objectives on top of the base correction capabilities:
- Uzbek transcript/display pairs from ASR output
- Russian recovery pairs from Latin-script ASR output
- Punctuation and formatting polish data
The model expects the `correct: ` instruction prefix at inference time.
## Acknowledgements
Special thanks to [Davron Ibrokhimov](https://www.linkedin.com/in/davron-ibrokhimov-8b62b8287/) for sponsoring this work and making it possible to keep these models open.
Thank you to the community that supports Uzbek language technology. In particular:
- [MetaSell](https://metasell.ai/) for support and resources
- [Kotib](https://kotib.ai/) for their support and collaboration on Uzbek STT
- [Global Move](https://globalmove.uz/) for backing open Uzbek NLP work
Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.
Support my work and the open-source movement: https://tirikchilik.uz/islomovs