nielsr (HF Staff) committed
Commit afd13f4 · verified · 1 Parent(s): e27369e

Add comprehensive model card for Kinyarwanda Whisper ASR model

This PR adds a comprehensive model card for the `akera/whisper-large-v3-kin-full` model, based on the paper "[How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu](https://huggingface.co/papers/2510.07221)".

The updates include:
- **Metadata**: Added `license`, `library_name`, `pipeline_tag`, `language`, `tags`, and `base_model`.
- **Content**: Added a description, the full paper abstract, link to the GitHub repository, link to the Hugging Face collection, a detailed `Usage` example with a `transformers` pipeline code snippet, the `Training Configs` table, the `Results` table, and a citation.

These additions will significantly improve the discoverability, clarity, and usability of the model for researchers and practitioners interested in ASR for African languages.

Please review and merge if everything looks good!

Files changed (1)
  1. README.md +147 -0
README.md ADDED
@@ -0,0 +1,147 @@
---
license: mit
library_name: transformers
pipeline_tag: automatic-speech-recognition
language: rw
tags:
- whisper
- kinyarwanda
- speech
- asr
base_model: openai/whisper-large-v3
---

# Kinyarwanda Whisper Evaluation Model: whisper-large-v3-kin-full

This model is the `whisper-large-v3-kin-full` checkpoint from the paper [How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu](https://huggingface.co/papers/2510.07221). It was fine-tuned from `openai/whisper-large-v3` on approximately 1,400 hours of Kinyarwanda speech data.

The paper investigates the data requirements for ASR development in low-resource African languages, demonstrating that practical ASR performance (WER < 13%) is achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). This checkpoint represents the full ~1,400-hour fine-tuning run and offers actionable benchmarks and deployment guidance.

## Abstract

The abstract of the paper is the following:

The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release the accompanying code and models.

## GitHub Repository

For more details, including installation, training scripts, and the full evaluation methodology, please refer to the [official GitHub repository](https://github.com/SunbirdAI/kinyarwanda-whisper-eval).

Explore the full collection of Kinyarwanda Whisper models:
👉 [https://huggingface.co/collections/Sunbird/kinyarwanda-hackathon-68872541c41c5d166d9bffad](https://huggingface.co/collections/Sunbird/kinyarwanda-hackathon-68872541c41c5d166d9bffad)

## Usage

You can use this model for automatic speech recognition in Kinyarwanda with the 🤗 Transformers library.

First, ensure you have the `transformers` library and audio processing dependencies installed:

```bash
pip install transformers accelerate soundfile librosa
```

Here's a quick example for performing inference:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import soundfile as sf
import librosa
import numpy as np

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor and model
model_id = "akera/whisper-large-v3-kin-full"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

# Create an ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Example: transcribe an audio file.
# Replace the dummy file below with a real Kinyarwanda recording (16 kHz mono).
# For demonstration, we synthesize a short sine-wave WAV file.
dummy_audio_path = "dummy_kinyarwanda_audio.wav"
samplerate = 16000  # Whisper models expect 16 kHz input
duration = 3        # seconds
frequency = 880     # Hz
t = np.linspace(0.0, duration, int(samplerate * duration), endpoint=False)
amplitude = np.iinfo(np.int16).max * 0.5
data = amplitude * np.sin(2.0 * np.pi * frequency * t)
sf.write(dummy_audio_path, data.astype(np.int16), samplerate)

# If your audio is not already 16 kHz mono, downmix and resample first:
# audio, sr = sf.read(dummy_audio_path)
# if audio.ndim > 1:  # soundfile returns (frames, channels); transpose for librosa
#     audio = librosa.to_mono(audio.T)
# if sr != 16000:
#     audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
# result = pipe({"array": audio, "sampling_rate": 16000})

# Or pass the file path directly if the file is already 16 kHz mono
result = pipe(dummy_audio_path)
print(f"Transcription: {result['text']}")
```

## Training Configs

The following table from the GitHub repository provides context on the training configurations used for the models in the evaluation:

| Config             | Hours | Model ID on Hugging Face            |
| ------------------ | ----- | ----------------------------------- |
| `baseline.yaml`    | 0     | openai/whisper-large-v3             |
| `train_1h.yaml`    | 1     | akera/whisper-large-v3-kin-1h-v2    |
| `train_50h.yaml`   | 50    | akera/whisper-large-v3-kin-50h-v2   |
| `train_100h.yaml`  | 100   | akera/whisper-large-v3-kin-100h-v2  |
| `train_150h.yaml`  | 150   | akera/whisper-large-v3-kin-150h-v2  |
| `train_200h.yaml`  | 200   | akera/whisper-large-v3-kin-200h-v2  |
| `train_500h.yaml`  | 500   | akera/whisper-large-v3-kin-500h-v2  |
| `train_1000h.yaml` | 1000  | akera/whisper-large-v3-kin-1000h-v2 |
| `train_full.yaml`  | ~1400 | akera/whisper-large-v3-kin-full     |

## Performance

Evaluation on the `dev_test[:300]` subset (as reported in the GitHub README):

| Model                                 | Hours | WER (%) | CER (%) | Score |
| ------------------------------------- | ----- | ------- | ------- | ----- |
| `openai/whisper-large-v3`             | 0     | 33.10   | 9.80    | 0.861 |
| `akera/whisper-large-v3-kin-1h-v2`    | 1     | 47.63   | 16.97   | 0.754 |
| `akera/whisper-large-v3-kin-50h-v2`   | 50    | 12.51   | 3.31    | 0.932 |
| `akera/whisper-large-v3-kin-100h-v2`  | 100   | 10.90   | 2.84    | 0.943 |
| `akera/whisper-large-v3-kin-150h-v2`  | 150   | 10.21   | 2.64    | 0.948 |
| `akera/whisper-large-v3-kin-200h-v2`  | 200   | 9.82    | 2.56    | 0.951 |
| `akera/whisper-large-v3-kin-500h-v2`  | 500   | 8.24    | 2.15    | 0.963 |
| `akera/whisper-large-v3-kin-1000h-v2` | 1000  | 7.65    | 1.98    | 0.967 |
| `akera/whisper-large-v3-kin-full`     | ~1400 | 7.14    | 1.88    | 0.970 |

> Score = 1 - (0.6 × CER + 0.4 × WER)
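
As a worked example, the score formula above can be applied directly. This is a minimal sketch under the assumption that WER and CER enter the formula as fractions; `composite_score` is a hypothetical helper, and the Score values reported in the table may be computed from unrounded metrics on the full evaluation set, so applying the formula to the rounded table values can give slightly different numbers.

```python
def composite_score(wer_pct: float, cer_pct: float) -> float:
    """Combine WER and CER (given in percent) into a single score,
    weighting CER at 0.6 and WER at 0.4, per the formula above."""
    wer = wer_pct / 100.0
    cer = cer_pct / 100.0
    return 1.0 - (0.6 * cer + 0.4 * wer)

# e.g. for WER 9.82% and CER 2.56% (the 200h checkpoint's rounded metrics):
print(round(composite_score(9.82, 2.56), 3))
```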

## Citation

If you find our work useful, please consider citing:

```bibtex
@article{sunbirdai2025asr_african_languages,
  title={How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu},
  author={Sunbird AI},
  year={2025},
  journal={arXiv preprint arXiv:2510.07221},
  url={https://arxiv.org/abs/2510.07221}
}
```