---
language:
- uz
- ru
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- speech-recognition
- whisper
- uzbek
- russian
- stt
---

# OmoN-STT

OmoN-STT is an automatic speech recognition (ASR) model built on Whisper.

The model was fine-tuned on the author's own audio dataset and is intended for recognizing Uzbek and Russian speech.

Base model: islomov/rubaistt_v2_medium

---

# Features

- Uzbek speech recognition
- Russian speech recognition
- support for long audio
- high accuracy on conversational speech
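
The card does not describe how long recordings are handled. Whisper models consume at most 30 seconds of audio per forward pass, so one common approach is to split a long waveform into 30-second windows and transcribe them one by one. The sketch below only computes the window boundaries; `window_spans` is a hypothetical helper, and the non-overlapping split is an assumption, not necessarily what OmoN-STT does.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16_000   # Hz, the rate used in this card's usage example
WINDOW_SECONDS = 30    # Whisper's fixed input window


def window_spans(num_samples: int,
                 sample_rate: int = SAMPLE_RATE,
                 window_seconds: int = WINDOW_SECONDS) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of consecutive non-overlapping windows."""
    window = sample_rate * window_seconds
    for start in range(0, num_samples, window):
        # The last window may be shorter than 30 seconds
        yield start, min(start + window, num_samples)


# A 70-second recording at 16 kHz splits into 30 s + 30 s + 10 s
spans = list(window_spans(70 * SAMPLE_RATE))
print(spans)
```

Each `(start, end)` pair can be used to slice the waveform before passing it to the processor. Hard cuts may split words at window boundaries; the `transformers` ASR pipeline offers `chunk_length_s` and `stride_length_s` options that use overlapping windows instead.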

---

# Training

The model was trained on top of:

islomov/rubaistt_v2_medium

Training parameters:

- epochs: 3
- batch size: 8
- gradient accumulation: 4
- learning rate: 2e-5
- GPU: RTX 3080Ti 16GB
- training time: ~5 hours

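
To make the batch settings concrete: with gradient accumulation, each optimizer step sees batch size × accumulation steps samples. The sketch below collects the listed values in an illustrative `TrainingConfig` class (a hypothetical name, not part of any library) and assumes a single GPU, as the card lists only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as listed in this card (the class itself is illustrative)."""
    base_model: str = "islomov/rubaistt_v2_medium"
    epochs: int = 3
    batch_size: int = 8
    gradient_accumulation: int = 4
    learning_rate: float = 2e-5

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on a single GPU
        return self.batch_size * self.gradient_accumulation


config = TrainingConfig()
print(config.effective_batch_size)  # 32
```

If training used the `transformers` trainer (the card does not say), these values would correspond to `num_train_epochs`, `per_device_train_batch_size`, `gradient_accumulation_steps`, and `learning_rate` in `Seq2SeqTrainingArguments`.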
# Architecture

Whisper encoder-decoder transformer.

# Author

OmoN Mullaboyev

# Usage

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("omullaboyev/OmoN-STT")
model = WhisperForConditionalGeneration.from_pretrained("omullaboyev/OmoN-STT")

# Load the audio as mono and resample it to 16 kHz, the rate Whisper expects
audio, sr = librosa.load("audio.wav", sr=16000)

# Convert the waveform into log-mel input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate token ids without tracking gradients
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

# Decode token ids to text; batch_decode returns a list with one string per input
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(text)