---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- mozilla-foundation/common_voice_23_0
- InflexionLab/ISSAI-KSC2-Structured
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:
- name: whisper-large-v3-kazakh-ksc2
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Kazakh Speech Corpus 2 (KSC2)
type: issai/ksc2
metrics:
- name: Wer
type: wer
value: 17.7
---
# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)
This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **17.7%**.
To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.
**Developed by:** Inflexion Lab

**License:** Apache License 2.0
## Model Description
- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with auxiliary Russian (ru)
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`
## Performance
The model was evaluated on the held-out test split of the KSC2 dataset.
| Metric | Score |
|:---:|:---:|
| **WER** | **~17.7%** |
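WER counts word-level substitutions, deletions, and insertions against the reference transcript, normalized by reference length. A minimal self-contained sketch of the metric (in practice a library such as `jiwer` or `evaluate` is used for scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between the first i reference words and first j hypothesis words
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            if ref[i - 1] == hyp[j - 1]:
                dp[j] = prev_diag                       # match: no edit
            else:
                dp[j] = 1 + min(prev_diag, dp[j], dp[j - 1])  # sub / del / ins
            prev_diag = cur
    return dp[-1] / len(ref)

# One substitution out of three reference words -> WER of 1/3
print(round(wer("мен қазақша сөйлеймін", "мен қазақша сөйледім"), 3))  # 0.333
```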
## Training Data & Methodology
The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.
### Dataset Composition (80/20 Split)
We used an **80% / 20%** data-mixing strategy to prevent model degradation and improve stability when the model encounters non-Kazakh phonemes.
1. **Kazakh Speech Corpus 2 (KSC2) - ~80%**
   * **Volume:** ~1,200 hours.
   * **Processing:** The original transcripts are plain lowercase text. We used **Gemma 27B** to restructure the text, restoring proper capitalization and punctuation.
   * **Sources:** Parliament speeches, TV/radio broadcasts, podcasts, and crowdsourced recordings.
2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   * **Volume:** ~250 hours.
   * **Purpose:** High-quality Russian speech helps the model distinguish between the two languages and handle loanwords and code-switching without hallucinating or degrading into gibberish.
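The 80/20 mix can be realized as probabilistic sampling between the two corpora during training. The sketch below is illustrative only (the card does not describe the actual data loader); `sample_source` is a hypothetical helper:

```python
import random

def sample_source(rng: random.Random, p_kazakh: float = 0.8) -> str:
    """Pick which corpus the next training example is drawn from (illustrative)."""
    return "ksc2" if rng.random() < p_kazakh else "common_voice_ru"

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
share = draws.count("ksc2") / len(draws)
print(f"KSC2 share: {share:.2%}")  # roughly 80%
```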
## Usage
### Using with Hugging Face `transformers`
You can use this model directly with the Hugging Face `pipeline`.
```python
from transformers import pipeline

# Load the ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file (for long audio, pass chunk_length_s to enable chunking)
result = pipe("path/to/your/audio.mp3")
print(result["text"])
```
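Whisper models expect 16 kHz mono audio; the `pipeline` resamples file inputs for you, but if you pass raw sample arrays they must already be at 16 kHz. A minimal linear-interpolation resampling sketch (illustrative only; in practice use a dedicated library such as `librosa` or `torchaudio`):

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample a mono signal with linear interpolation (illustrative only)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second at 44.1 kHz becomes 16,000 samples at 16 kHz
one_second = [0.0] * 44_100
print(len(resample_linear(one_second, 44_100)))  # 16000
```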