---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- mozilla-foundation/common_voice_23_0
- InflexionLab/ISSAI-KSC2-Structured
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:
- name: whisper-large-v3-kazakh-ksc2
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Kazakh Speech Corpus 2 (KSC2)
type: issai/ksc2
metrics:
- name: Wer
type: wer
value: 17.7
---
# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)
This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **17.7%**.
To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.
**Developed by:** Inflexion Lab
**License:** Apache License 2.0
## Model Description
- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with Russian (ru) as auxiliary training data
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`
## Performance
The model was evaluated on the held-out test split of the KSC2 dataset.
| Metric | Score |
|:---:|:---:|
| **WER** | **~17.7%** |
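The WER reported above counts word-level substitutions, deletions, and insertions divided by the number of reference words. A minimal, dependency-free sketch of the metric (evaluation pipelines typically use a library such as `jiwer` or `evaluate` instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N, via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("бұл тест сөйлемі", "бұл тес сөйлемі"))  # 1 substitution over 3 words ≈ 0.33
```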
## Training Data & Methodology
The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.
### Dataset Composition (80/20 Split)
We used an **80% / 20%** data-mixing strategy to prevent model degradation and improve stability when the model encounters non-Kazakh phonemes.
1. **Kazakh Speech Corpus 2 (KSC2) - ~80%**
* **Volume:** ~1,200 hours.
* **Processing:** The original transcripts are in plain lowercase. We utilized **Gemma 27B** to syntactically restructure the text, restoring proper capitalization and punctuation.
* **Sources:** Parliament speeches, TV/Radio broadcasts, podcasts, and crowdsourced recordings.
2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
* **Volume:** ~250 hours.
* **Purpose:** Including high-quality Russian speech helps the model distinguish between languages and handle loanwords or code-switching without hallucinating or degrading into gibberish.
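The 80/20 mix described above amounts to drawing each training example from the Kazakh pool with probability 0.8, otherwise from the Russian pool. A minimal illustration in plain Python (a real pipeline would more likely use `datasets.interleave_datasets` with `probabilities=[0.8, 0.2]`; the pools and function name here are hypothetical):

```python
import random

def mix_datasets(kazakh, russian, p_kazakh=0.8, n_samples=10, seed=0):
    """Draw each example from the Kazakh pool with probability p_kazakh,
    otherwise from the Russian pool (sampling with replacement)."""
    rng = random.Random(seed)
    return [
        rng.choice(kazakh) if rng.random() < p_kazakh else rng.choice(russian)
        for _ in range(n_samples)
    ]

# Hypothetical pools: roughly 8 of 10 draws come from the Kazakh side
batch = mix_datasets(["kk_sample"], ["ru_sample"], n_samples=10)
print(batch)
```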
## Usage
### Using with Hugging Face `transformers`
You can use this model directly with the Hugging Face `pipeline`.
```python
from transformers import pipeline

# Load the ASR pipeline (set device=0 to run on a GPU)
pipe = pipeline(
    "automatic-speech-recognition",
    model="InflexionLab/sybyrla",
    chunk_length_s=30,  # split long audio into 30-second windows
)

# Transcribe an audio file
result = pipe("path/to/your/audio.mp3")
print(result["text"])
```