nielsr (HF Staff) committed
Commit afd13f4 · verified · 1 Parent(s): e27369e

Add comprehensive model card for Kinyarwanda Whisper ASR model

This PR adds a comprehensive model card for the `akera/whisper-large-v3-kin-full` model, based on the paper "[How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu](https://huggingface.co/papers/2510.07221)".

The updates include:
- **Metadata**: Added `license`, `library_name`, `pipeline_tag`, `language`, `tags`, and `base_model`.
- **Content**: Added a description, the full paper abstract, link to the GitHub repository, link to the Hugging Face collection, a detailed `Usage` example with a `transformers` pipeline code snippet, the `Training Configs` table, the `Results` table, and a citation.

These additions will significantly improve the discoverability, clarity, and usability of the model for researchers and practitioners interested in ASR for African languages.

Please review and merge if everything looks good!

Files changed (1)
  1. README.md +147 -0
README.md ADDED
@@ -0,0 +1,147 @@
---
license: mit
library_name: transformers
pipeline_tag: automatic-speech-recognition
language: rw
tags:
- whisper
- kinyarwanda
- speech
- asr
base_model: openai/whisper-large-v3
---

# Kinyarwanda Whisper Evaluation Model: whisper-large-v3-kin-full

This model is the `whisper-large-v3-kin-full` checkpoint from the paper [How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu](https://huggingface.co/papers/2510.07221). It was fine-tuned from `openai/whisper-large-v3` on approximately 1,400 hours of Kinyarwanda speech data.

The paper investigates the data requirements for ASR development in low-resource African languages, demonstrating that practical ASR performance (WER < 13%) is achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). This checkpoint represents the full ~1,400-hour fine-tuning run and offers actionable benchmarks and deployment guidance.

## Abstract

The abstract of the paper is the following:

The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release the accompanying code and models.

## GitHub Repository

For more details, including installation, training scripts, and the full evaluation methodology, please refer to the [official GitHub repository](https://github.com/SunbirdAI/kinyarwanda-whisper-eval).

Explore the full collection of Kinyarwanda Whisper models:
👉 [https://huggingface.co/collections/Sunbird/kinyarwanda-hackathon-68872541c41c5d166d9bffad](https://huggingface.co/collections/Sunbird/kinyarwanda-hackathon-68872541c41c5d166d9bffad)

## Usage

You can use this model for automatic speech recognition in Kinyarwanda with the 🤗 Transformers library.

First, ensure you have the `transformers` library and audio processing dependencies installed:

```bash
pip install transformers accelerate soundfile librosa
```

Here's a quick example for performing inference:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import soundfile as sf
import librosa
import numpy as np

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor and model
model_id = "akera/whisper-large-v3-kin-full"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

# Create an ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Example: transcribe an audio file.
# Replace the dummy file below with a real Kinyarwanda recording (16 kHz mono).
# For demonstration, we synthesize a short sine-wave WAV file.
dummy_audio_path = "dummy_kinyarwanda_audio.wav"
samplerate = 16000  # Whisper models expect 16 kHz input
duration = 3        # seconds
frequency = 880     # Hz
t = np.linspace(0.0, duration, int(samplerate * duration), endpoint=False)
amplitude = np.iinfo(np.int16).max * 0.5
data = amplitude * np.sin(2.0 * np.pi * frequency * t)
sf.write(dummy_audio_path, data.astype(np.int16), samplerate)

# If your audio is not already 16 kHz mono, downmix and resample first:
# audio, sr = sf.read(dummy_audio_path)
# if audio.ndim > 1:  # soundfile returns (frames, channels); transpose for librosa
#     audio = librosa.to_mono(audio.T)
# if sr != 16000:
#     audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
# result = pipe({"array": audio, "sampling_rate": 16000})

# Or pass the file path directly if the file is already 16 kHz mono
result = pipe(dummy_audio_path)
print(f"Transcription: {result['text']}")
```

## Training Configs

The following table from the GitHub repository provides context on the training configurations used for the models in the evaluation:

| Config             | Hours | Model ID on Hugging Face            |
| ------------------ | ----- | ----------------------------------- |
| `baseline.yaml`    | 0     | openai/whisper-large-v3             |
| `train_1h.yaml`    | 1     | akera/whisper-large-v3-kin-1h-v2    |
| `train_50h.yaml`   | 50    | akera/whisper-large-v3-kin-50h-v2   |
| `train_100h.yaml`  | 100   | akera/whisper-large-v3-kin-100h-v2  |
| `train_150h.yaml`  | 150   | akera/whisper-large-v3-kin-150h-v2  |
| `train_200h.yaml`  | 200   | akera/whisper-large-v3-kin-200h-v2  |
| `train_500h.yaml`  | 500   | akera/whisper-large-v3-kin-500h-v2  |
| `train_1000h.yaml` | 1000  | akera/whisper-large-v3-kin-1000h-v2 |
| `train_full.yaml`  | ~1400 | akera/whisper-large-v3-kin-full     |

## Performance

Evaluation on the `dev_test[:300]` subset (as reported in the GitHub README):

| Model                                 | Hours | WER (%) | CER (%) | Score |
| ------------------------------------- | ----- | ------- | ------- | ----- |
| `openai/whisper-large-v3`             | 0     | 33.10   | 9.80    | 0.861 |
| `akera/whisper-large-v3-kin-1h-v2`    | 1     | 47.63   | 16.97   | 0.754 |
| `akera/whisper-large-v3-kin-50h-v2`   | 50    | 12.51   | 3.31    | 0.932 |
| `akera/whisper-large-v3-kin-100h-v2`  | 100   | 10.90   | 2.84    | 0.943 |
| `akera/whisper-large-v3-kin-150h-v2`  | 150   | 10.21   | 2.64    | 0.948 |
| `akera/whisper-large-v3-kin-200h-v2`  | 200   | 9.82    | 2.56    | 0.951 |
| `akera/whisper-large-v3-kin-500h-v2`  | 500   | 8.24    | 2.15    | 0.963 |
| `akera/whisper-large-v3-kin-1000h-v2` | 1000  | 7.65    | 1.98    | 0.967 |
| `akera/whisper-large-v3-kin-full`     | ~1400 | 7.14    | 1.88    | 0.970 |

> Score = 1 - (0.6 × CER + 0.4 × WER)
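
As a worked example, the score formula above can be applied directly. This is a minimal sketch under the assumption that WER and CER enter the formula as fractions; `composite_score` is a hypothetical helper, and the Score values reported in the table may be computed from unrounded metrics on the full evaluation set, so applying the formula to the rounded table values can give slightly different numbers.

```python
def composite_score(wer_pct: float, cer_pct: float) -> float:
    """Combine WER and CER (given in percent) into a single score,
    weighting CER at 0.6 and WER at 0.4, per the formula above."""
    wer = wer_pct / 100.0
    cer = cer_pct / 100.0
    return 1.0 - (0.6 * cer + 0.4 * wer)

# e.g. for WER 9.82% and CER 2.56% (the 200h checkpoint's rounded metrics):
print(round(composite_score(9.82, 2.56), 3))
```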

## Citation

If you find our work useful, please consider citing:

```bibtex
@article{sunbirdai2025asr_african_languages,
  title={How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu},
  author={Sunbird AI},
  year={2025},
  journal={arXiv preprint arXiv:2510.07221},
  url={https://arxiv.org/abs/2510.07221}
}
```