---
language:
- kk
- ru
license: apache-2.0
tags:
- automatic-speech-recognition
- whisper
- generated_from_trainer
- kazakh
- ksc2
- common-voice
- gemma-27b
datasets:
- mozilla-foundation/common_voice_23_0
- InflexionLab/ISSAI-KSC2-Structured
metrics:
- wer
base_model: openai/whisper-large-v3
model-index:
- name: whisper-large-v3-kazakh-ksc2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kazakh Speech Corpus 2 (KSC2)
      type: issai/ksc2
    metrics:
    - name: Wer
      type: wer
      value: 17.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately **17.7%**.

To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.

**Developed by:** Inflexion Lab

**License:** Apache License 2.0

## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with Russian (ru) as auxiliary
- **Task:** Automatic Speech Recognition (ASR)
- **Base Model:** `openai/whisper-large-v3`

## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Score |
|:---:|:---:|
| **WER** | **~17.7%** |

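For reference, WER is the word-level edit distance between hypothesis and reference, normalized by the number of reference words. A minimal self-contained sketch of the metric (the evaluation itself was likely run with a standard library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / len(ref)

# One substitution in a three-word reference -> WER = 1/3
print(wer("қайырлы таң достар", "қайырлы күн достар"))  # ~0.333
```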
## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.

### Dataset Composition (80/20 Split)

We used an **80% / 20%** data-mixing strategy to prevent model degradation and improve stability when the model encounters non-Kazakh phonemes.

1. **Kazakh Speech Corpus 2 (KSC2) - ~80%**
   * **Volume:** ~1,200 hours.
   * **Processing:** The original transcripts are plain lowercase. We used **Gemma 27B** to restructure the text, restoring proper capitalization and punctuation.
   * **Sources:** Parliament speeches, TV/radio broadcasts, podcasts, and crowdsourced recordings.

2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   * **Volume:** ~250 hours.
   * **Purpose:** Including high-quality Russian speech helps the model distinguish between the languages and handle loanwords or code-switching without hallucinating or degrading into gibberish.

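In practice, an 80/20 mix like the one above can be produced by sampling from the two corpora with fixed probabilities (with 🤗 `datasets`, `interleave_datasets([...], probabilities=[0.8, 0.2])` does this directly). A dependency-free sketch of the idea; the corpus inputs here are illustrative placeholders, not the actual training loader:

```python
import random

def mix_corpora(kazakh, russian, p_kazakh=0.8, seed=42):
    """Yield examples where ~80% come from the Kazakh corpus and ~20% from
    the Russian one, stopping once either source is exhausted."""
    rng = random.Random(seed)
    kk, ru = iter(kazakh), iter(russian)
    while True:
        source = kk if rng.random() < p_kazakh else ru
        try:
            yield next(source)
        except StopIteration:
            return  # one corpus ran out; stop to preserve the ratio

# Roughly four Kazakh examples for every Russian one
mixed = list(mix_corpora(["kk"] * 1000, ["ru"] * 1000))
print(mixed.count("kk") / len(mixed))  # close to 0.8
```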
## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# (pass chunk_length_s to the pipeline to enable chunking for long audio)
result = pipe("path/to/your/audio.mp3")

print(result["text"])
```
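For audio longer than 30 seconds, the same `pipeline` supports chunked long-form inference via its standard `chunk_length_s`, `batch_size`, and `return_timestamps` arguments. A sketch, assuming the same model id (the import is kept inside the function so the heavy dependency only loads when the pipeline is actually built):

```python
def build_long_form_pipe(model_id: str = "InflexionLab/sybyrla",
                         chunk_length_s: int = 30,
                         batch_size: int = 8):
    """Build an ASR pipeline that splits long audio into chunks
    and decodes several chunks per forward pass."""
    from transformers import pipeline  # lazy import of the heavy dependency
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,   # enable long-form chunking
        batch_size=batch_size,           # chunks decoded per batch
        return_timestamps=True,          # segment timestamps in the output
    )

# Example (downloads the model on first call):
# pipe = build_long_form_pipe()
# out = pipe("path/to/long_audio.mp3", generate_kwargs={"language": "kazakh"})
# print(out["text"])
```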