---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- mozilla-foundation/common_voice_23_0
- InflexionLab/ISSAI-KSC2-Structured
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:
- name: whisper-large-v3-kazakh-ksc2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kazakh Speech Corpus 2 (KSC2)
      type: issai/ksc2
    metrics:
    - name: Wer
      type: wer
      value: 17.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for automatic speech recognition (ASR) in Kazakh. It reaches a Word Error Rate (WER) of approximately **17.7%** on the held-out test split of KSC2.

To handle real-world speech in the region, where code-switching between Kazakh and Russian is common, the model was trained on a mix of roughly 80% Kazakh and 20% Russian data.

**Developed by:** Inflexion Lab  
**License:** Apache License 2.0

## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with Russian (ru) as auxiliary training data
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`

## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Score |
|:---:|:---:|
| **WER** | **~17.7%** |
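
For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally defined: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not necessarily the evaluation script behind the score above. The card does not state which text normalization was applied, so the sketch compares whitespace-separated tokens as they are, which means casing and punctuation differences count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a 4-word reference.
example_wer = word_error_rate("a b c d", "a x c")
print(f"{example_wer:.1%}")  # 50.0%
```

Because the score is normalized by the reference length only, a hypothesis with many inserted words can produce a WER above 100%.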

## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.

### Dataset Composition (80/20 Split)
We used an **80% / 20%** data mixing strategy to prevent model degradation and improve stability when encountering non-Kazakh phonemes.

1.  **Kazakh Speech Corpus 2 (KSC2) - ~80%**
    * **Volume:** ~1,200 hours.
    * **Processing:** The original transcripts are in plain lowercase. We used **Gemma 27B** to restructure the text, restoring proper capitalization and punctuation.
    * **Sources:** Parliament speeches, TV/Radio broadcasts, podcasts, and crowdsourced recordings.

2.  **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
    * **Volume:** ~250 hours.
    * **Purpose:** Including high-quality Russian speech helps the model distinguish between languages and handle loanwords or code-switching without hallucinating or degrading into gibberish.
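
As a quick sanity check on the proportions, the sketch below computes each source's share from the approximate hour counts listed above. It assumes the mix is measured in audio hours, which the card does not state (the split could equally be defined per utterance), and the helper name `mix_shares` is hypothetical.

```python
# Approximate hour counts quoted in the list above.
hours = {
    "KSC2 (Kazakh)": 1200,
    "Common Voice 23.0 (Russian)": 250,
}


def mix_shares(hours_by_source: dict) -> dict:
    """Return each source's fraction of the total audio hours."""
    total = sum(hours_by_source.values())
    return {name: h / total for name, h in hours_by_source.items()}


shares = mix_shares(hours)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# KSC2 (Kazakh): 82.8%
# Common Voice 23.0 (Russian): 17.2%
```

Measured by hours, the rounded figures come out near 83% / 17%, close to the nominal 80/20 split.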

## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline`.

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# The pipeline handles chunking automatically if configured (see batch inference below).
result = pipe("path/to/your/audio.mp3")

print(result["text"])