---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- mozilla-foundation/common_voice_23_0
- InflexionLab/ISSAI-KSC2-Structured
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:
- name: whisper-large-v3-kazakh-ksc2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kazakh Speech Corpus 2 (KSC2)
      type: issai/ksc2
    metrics:
    - name: Wer
      type: wer
      value: 17.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **17.7%**.

To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.

**Developed by:** Inflexion Lab

**License:** Apache License 2.0

## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with Russian (ru) as auxiliary
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`

## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Score |
|:---:|:---:|
| **WER** | **~17.7%** |

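For reference, WER is the word-level edit distance between hypothesis and reference, normalized by the number of reference words. A minimal self-contained sketch of the metric (the evaluation itself was likely run with a standard library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / len(ref)

# One substitution in a three-word reference -> WER = 1/3
print(wer("қайырлы таң достар", "қайырлы күн достар"))  # ~0.333
```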
## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.

### Dataset Composition (80/20 Split)

We used an **80% / 20%** data-mixing strategy to prevent model degradation and improve stability when the model encounters non-Kazakh phonemes.

1. **Kazakh Speech Corpus 2 (KSC2) - ~80%**
   * **Volume:** ~1,200 hours.
   * **Processing:** The original transcripts are plain lowercase. We used **Gemma 27B** to restructure the text, restoring proper capitalization and punctuation.
   * **Sources:** Parliament speeches, TV/radio broadcasts, podcasts, and crowdsourced recordings.

2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   * **Volume:** ~250 hours.
   * **Purpose:** Including high-quality Russian speech helps the model distinguish between the languages and handle loanwords or code-switching without hallucinating or degrading into gibberish.

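In practice, an 80/20 mix like the one above can be produced by sampling from the two corpora with fixed probabilities (with 🤗 `datasets`, `interleave_datasets([...], probabilities=[0.8, 0.2])` does this directly). A dependency-free sketch of the idea; the corpus inputs here are illustrative placeholders, not the actual training loader:

```python
import random

def mix_corpora(kazakh, russian, p_kazakh=0.8, seed=42):
    """Yield examples where ~80% come from the Kazakh corpus and ~20% from
    the Russian one, stopping once either source is exhausted."""
    rng = random.Random(seed)
    kk, ru = iter(kazakh), iter(russian)
    while True:
        source = kk if rng.random() < p_kazakh else ru
        try:
            yield next(source)
        except StopIteration:
            return  # one corpus ran out; stop to preserve the ratio

# Roughly four Kazakh examples for every Russian one
mixed = list(mix_corpora(["kk"] * 1000, ["ru"] * 1000))
print(mixed.count("kk") / len(mixed))  # close to 0.8
```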
## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# (pass chunk_length_s to the pipeline to enable chunking for long audio)
result = pipe("path/to/your/audio.mp3")

print(result["text"])
```
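For audio longer than 30 seconds, the same `pipeline` supports chunked long-form inference via its standard `chunk_length_s`, `batch_size`, and `return_timestamps` arguments. A sketch, assuming the same model id (the import is kept inside the function so the heavy dependency only loads when the pipeline is actually built):

```python
def build_long_form_pipe(model_id: str = "InflexionLab/sybyrla",
                         chunk_length_s: int = 30,
                         batch_size: int = 8):
    """Build an ASR pipeline that splits long audio into chunks
    and decodes several chunks per forward pass."""
    from transformers import pipeline  # lazy import of the heavy dependency
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,   # enable long-form chunking
        batch_size=batch_size,           # chunks decoded per batch
        return_timestamps=True,          # segment timestamps in the output
    )

# Example (downloads the model on first call):
# pipe = build_long_form_pipe()
# out = pipe("path/to/long_audio.mp3", generate_kwargs={"language": "kazakh"})
# print(out["text"])
```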