---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
# …
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- issai/ksc2
- mozilla-foundation/common_voice_23_0
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:
# …
      value: 12.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **12.7%**.

To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.

**Developed by:** Inflexion Lab
**License:** Apache License 2.0
## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with auxiliary Russian (ru)
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`
## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Value |
|:---:|:---:|
| **WER** | **~12.7%** |
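WER counts word-level substitutions, deletions, and insertions against a reference transcript, normalized by reference length. As a quick illustration of the metric (this is not the evaluation script used for this model), a self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four -> WER 0.25
print(wer("бүгін ауа райы жақсы", "бүгін ауа райы жаксы"))
```

Note that a WER of ~12.7% means roughly one word error per eight reference words.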
## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.

### Dataset Composition (80/20 Split)

We used an **80% / 20%** data-mixing strategy to prevent model degradation and to improve stability when the model encounters non-Kazakh phonemes.

1. **Kazakh Speech Corpus 2 (KSC2) - ~80%**
   * **Volume:** ~1,200 hours.
   * **Processing:** The original transcripts are plain lowercase. We used **Gemma 27B** to restructure the text, restoring proper capitalization and punctuation.
   * **Sources:** Parliament speeches, TV/radio broadcasts, podcasts, and crowdsourced recordings.

2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   * **Volume:** ~250 hours.
   * **Purpose:** Including high-quality Russian speech helps the model distinguish between the two languages and handle loanwords or code-switching without hallucinating or degrading into gibberish.
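The 80/20 mix described above can be thought of as probabilistic sampling between the two corpora. A toy, self-contained sketch of that idea (the corpus lists below are hypothetical stand-ins; the actual training data loader is not published):

```python
import random

def mix_batches(kazakh, russian, n, p_kazakh=0.8, seed=0):
    """Sample n utterances, drawing ~80% from the Kazakh corpus and ~20% from the Russian one."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n):
        corpus = kazakh if rng.random() < p_kazakh else russian
        mixed.append(rng.choice(corpus))
    return mixed

# Hypothetical stand-ins for KSC2 and Common Voice utterance lists
ksc2 = [("kk", f"ksc2_utt_{i}") for i in range(1000)]
common_voice_ru = [("ru", f"cv_utt_{i}") for i in range(1000)]

batch = mix_batches(ksc2, common_voice_ru, n=10_000)
kazakh_share = sum(1 for lang, _ in batch if lang == "kk") / len(batch)
print(f"Kazakh share: {kazakh_share:.2f}")  # close to 0.80
```

Sampling by probability rather than concatenating the corpora keeps Russian utterances interleaved throughout training instead of clustered at one end.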
## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline`.

```python
from transformers import pipeline

# Load the fine-tuned ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# The pipeline handles chunking automatically if configured (see batch inference below).
result = pipe("path/to/your/audio.mp3")

print(result["text"])
```
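For long-form audio, chunking and batching can be enabled on the same pipeline via its `chunk_length_s` and `batch_size` arguments; a short sketch (the parameter values are illustrative, not settings tuned for this model):

```python
from transformers import pipeline

# Chunked, batched long-form inference
pipe = pipeline(
    "automatic-speech-recognition",
    model="InflexionLab/sybyrla",
    chunk_length_s=30,  # split long audio into ~30 s windows
    batch_size=8,       # decode several windows per forward pass
)

result = pipe("path/to/your/long_audio.mp3", return_timestamps=True)
print(result["text"])
```

With `return_timestamps=True`, `result` also contains a `"chunks"` list with per-segment start/end times.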
|