diaslmb committed on
Commit 6903ee1 · verified · 1 Parent(s): eb1bb23

Update README.md

Files changed (1): README.md (+23 -11)

README.md CHANGED
---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- issai/ksc2
- mozilla-foundation/common_voice_23_0
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:

      value: 12.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **12.7%**.

To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.

**Developed by:** Inflexion Lab
**License:** Apache License 2.0

## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with auxiliary Russian (ru)
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`

## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Value |
|:---:|:---:|
| **WER** | **~12.7%** |
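WER counts word-level substitutions, insertions, and deletions against the reference transcript. A minimal self-contained sketch of the metric (toolkits such as `jiwer` implement the same computation; this is not the evaluation script used for the number above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(bool(hyp))
    # dp[i][j]: word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,          # deletion
                dp[i][j - 1] + 1,          # insertion
                dp[i - 1][j - 1] + cost,   # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```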
## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw transcripts and the prevalence of code-switching in daily speech.
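The punctuation gap is repaired here with an LLM (Gemma 27B, per the model tags). The exact prompt used was not published, so the wording below is purely illustrative of how such a restoration instruction could be framed:

```python
def build_restoration_prompt(raw_transcript: str) -> str:
    """Illustrative prompt for LLM-based punctuation/casing restoration.

    The README describes restoring capitalization and punctuation with
    Gemma 27B; this prompt wording is an assumption, not the published one.
    """
    return (
        "Restore capitalization and punctuation in the following Kazakh ASR "
        "transcript. Do not add, remove, or change any words.\n\n"
        f"Transcript: {raw_transcript}\n"
        "Restored:"
    )
```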

### Dataset Composition (80/20 Split)
We used an **80% / 20%** data-mixing strategy to prevent model degradation and improve stability when the model encounters non-Kazakh phonemes.

1. **Kazakh Speech Corpus 2 (KSC2, by ISSAI) - ~80%**
   * **Volume:** ~1,200 hours.
   * **Processing:** The original transcripts are in plain lowercase. We used **Gemma 27B** to restructure the text syntactically, restoring proper capitalization and punctuation.
   * **Sources:** Parliament speeches, TV/Radio broadcasts, podcasts, and crowdsourced recordings.

2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   * **Volume:** ~250 hours.
   * **Purpose:** Including high-quality Russian speech helps the model distinguish between the two languages and handle loanwords and code-switching without hallucinating or producing gibberish.
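The 80/20 interleaving above can be approximated by sampling the next utterance's source with fixed probabilities. A minimal stand-in sketch (the actual training pipeline is not published; this version works on in-memory lists):

```python
import random

def mix_corpora(kk_utts, ru_utts, p_kk=0.8, seed=42):
    """Probabilistically interleave two corpora, approximating the 80/20
    Kazakh/Russian mix described above. Illustrative only."""
    rng = random.Random(seed)
    kk, ru = list(kk_utts), list(ru_utts)
    mixed, i, j = [], 0, 0
    while i < len(kk) and j < len(ru):
        # Draw the next source: Kazakh with probability p_kk, else Russian.
        if rng.random() < p_kk:
            mixed.append(kk[i])
            i += 1
        else:
            mixed.append(ru[j])
            j += 1
    # Append whatever remains once one source is exhausted.
    mixed.extend(kk[i:])
    mixed.extend(ru[j:])
    return mixed
```

At scale, Hugging Face `datasets` offers the same behavior via `interleave_datasets(..., probabilities=[0.8, 0.2])` on streaming datasets.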

## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline`.

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# The pipeline handles chunking automatically if configured (see batch inference below).
result = pipe("path/to/your/audio.mp3")

print(result["text"])
```
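For recordings longer than Whisper's 30-second window, the pipeline's chunked inference can be enabled explicitly with its standard `chunk_length_s` and `batch_size` call arguments. A sketch (wrapping the call in a function and the specific argument values are my additions; requires `transformers` and `torch`):

```python
def transcribe_long(audio_path, model_id="InflexionLab/sybyrla"):
    """Long-form transcription sketch: chunk_length_s splits the audio into
    overlapping 30 s windows, and batch_size decodes chunks in parallel."""
    from transformers import pipeline  # lazy import so the module loads without torch

    pipe = pipeline("automatic-speech-recognition", model=model_id)
    return pipe(
        audio_path,
        chunk_length_s=30,       # enable chunked inference for long audio
        batch_size=8,            # decode several chunks per forward pass
        return_timestamps=True,  # also return per-chunk timestamps
    )
```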