---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- mozilla-foundation/common_voice_23_0
- InflexionLab/ISSAI-KSC2-Structured
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:
- name: whisper-large-v3-kazakh-ksc2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kazakh Speech Corpus 2 (KSC2)
      type: issai/ksc2
    metrics:
    - name: Wer
      type: wer
      value: 17.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It provides robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **17.7%**. To handle real-world acoustic environments in the region, the model was trained on a strategic mix of Kazakh and Russian data.

**Developed by:** Inflexion Lab

**License:** Apache License 2.0

## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with Russian (ru) as an auxiliary language
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`

## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Score |
|:---:|:---:|
| **WER** | **~17.7%** |

## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in everyday speech.

### Dataset Composition (80/20 Split)

We used an **80% / 20%** data mixing strategy to prevent model degradation and improve stability when the model encounters non-Kazakh phonemes.

1. **Kazakh Speech Corpus 2 (KSC2) - ~80%**
   * **Volume:** ~1,200 hours.
   * **Processing:** The original transcripts are in plain lowercase.
     We utilized **Gemma 27B** to syntactically restructure the text, restoring proper capitalization and punctuation.
   * **Sources:** Parliament speeches, TV/radio broadcasts, podcasts, and crowdsourced recordings.
2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   * **Volume:** ~250 hours.
   * **Purpose:** Including high-quality Russian speech helps the model distinguish between the two languages and handle loanwords or code-switching without hallucinating or degrading into gibberish.

## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline`.

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# The pipeline handles chunking automatically if configured (see batch inference below).
result = pipe("path/to/your/audio.mp3")
print(result["text"])
```
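If you want to sanity-check transcriptions against reference texts, the WER metric reported above is word-level edit distance divided by the number of reference words. Dedicated toolkits such as `jiwer` or `evaluate` are normally used for this; the following is only an illustrative pure-Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[-1][-1] / max(len(ref), 1)


print(wer("бүгін ауа райы жақсы", "бүгін ауа райы жаман"))  # 0.25
```

A score of 0.177 corresponds to the ~17.7% WER reported in the Performance table above.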