diaslmb committed on
Commit 6903ee1 · verified · 1 Parent(s): eb1bb23

Update README.md

Files changed (1): README.md (+23 -11)

README.md CHANGED
---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- issai/ksc2
- mozilla-foundation/common_voice_23_0
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:

      value: 12.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **12.7%**.

To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.

**Developed by:** Inflexion Lab
**License:** Apache License 2.0

## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with auxiliary Russian (ru)
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`

## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Value |
|:---:|:---:|
| **WER** | **~12.7%** |
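WER counts word-level substitutions, insertions, and deletions against the reference transcript. A minimal self-contained sketch of the metric (toolkits such as `jiwer` implement the same computation; this is not the evaluation script used for the number above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(bool(hyp))
    # dp[i][j]: word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,          # deletion
                dp[i][j - 1] + 1,          # insertion
                dp[i - 1][j - 1] + cost,   # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```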
## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw transcripts and the prevalence of code-switching in daily speech.
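The punctuation gap is repaired here with an LLM (Gemma 27B, per the model tags). The exact prompt used was not published, so the wording below is purely illustrative of how such a restoration instruction could be framed:

```python
def build_restoration_prompt(raw_transcript: str) -> str:
    """Illustrative prompt for LLM-based punctuation/casing restoration.

    The README describes restoring capitalization and punctuation with
    Gemma 27B; this prompt wording is an assumption, not the published one.
    """
    return (
        "Restore capitalization and punctuation in the following Kazakh ASR "
        "transcript. Do not add, remove, or change any words.\n\n"
        f"Transcript: {raw_transcript}\n"
        "Restored:"
    )
```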

### Dataset Composition (80/20 Split)
We used an **80% / 20%** data-mixing strategy to prevent model degradation and improve stability when the model encounters non-Kazakh phonemes.

1. **Kazakh Speech Corpus 2 (KSC2, by ISSAI) - ~80%**
   * **Volume:** ~1,200 hours.
   * **Processing:** The original transcripts are in plain lowercase. We used **Gemma 27B** to restructure the text syntactically, restoring proper capitalization and punctuation.
   * **Sources:** Parliament speeches, TV/Radio broadcasts, podcasts, and crowdsourced recordings.

2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   * **Volume:** ~250 hours.
   * **Purpose:** Including high-quality Russian speech helps the model distinguish between the two languages and handle loanwords and code-switching without hallucinating or producing gibberish.
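The 80/20 interleaving above can be approximated by sampling the next utterance's source with fixed probabilities. A minimal stand-in sketch (the actual training pipeline is not published; this version works on in-memory lists):

```python
import random

def mix_corpora(kk_utts, ru_utts, p_kk=0.8, seed=42):
    """Probabilistically interleave two corpora, approximating the 80/20
    Kazakh/Russian mix described above. Illustrative only."""
    rng = random.Random(seed)
    kk, ru = list(kk_utts), list(ru_utts)
    mixed, i, j = [], 0, 0
    while i < len(kk) and j < len(ru):
        # Draw the next source: Kazakh with probability p_kk, else Russian.
        if rng.random() < p_kk:
            mixed.append(kk[i])
            i += 1
        else:
            mixed.append(ru[j])
            j += 1
    # Append whatever remains once one source is exhausted.
    mixed.extend(kk[i:])
    mixed.extend(ru[j:])
    return mixed
```

At scale, Hugging Face `datasets` offers the same behavior via `interleave_datasets(..., probabilities=[0.8, 0.2])` on streaming datasets.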

## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline`.

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# The pipeline handles chunking automatically if configured (see batch inference below).
result = pipe("path/to/your/audio.mp3")

print(result["text"])
```
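For recordings longer than Whisper's 30-second window, the pipeline's chunked inference can be enabled explicitly with its standard `chunk_length_s` and `batch_size` call arguments. A sketch (wrapping the call in a function and the specific argument values are my additions; requires `transformers` and `torch`):

```python
def transcribe_long(audio_path, model_id="InflexionLab/sybyrla"):
    """Long-form transcription sketch: chunk_length_s splits the audio into
    overlapping 30 s windows, and batch_size decodes chunks in parallel."""
    from transformers import pipeline  # lazy import so the module loads without torch

    pipe = pipeline("automatic-speech-recognition", model=model_id)
    return pipe(
        audio_path,
        chunk_length_s=30,       # enable chunked inference for long audio
        batch_size=8,            # decode several chunks per forward pass
        return_timestamps=True,  # also return per-chunk timestamps
    )
```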