<p align="center">
  <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="80" />
</p>
# 🎙️ NextInnoMind / next\_bemba\_ai\_medium

**Multilingual Whisper ASR (Automatic Speech Recognition)**

Fine-tuned Whisper model for Bemba and English using language tokens.

Developed and maintained by **NextInnoMind**, led by **Chalwe Silas**.
---

### 🧪 Model Type

`WhisperForConditionalGeneration`, fine-tuned from [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)

* **Framework**: `Transformers`
* **Checkpoint Format**: `Safetensors`
* **Languages**: `Bemba`, `English` (with `<|bem|>` language token support)
---

## 📖 Model Description

This model is a Whisper Medium variant fine-tuned for **Bemba** and **English**, enabling robust multilingual transcription. It supports the use of language tokens (e.g., `<|bem|>`) to help guide decoding, particularly for low-resource languages like Bemba.
---

## 📊 Training Details

* **Base Model**: [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium)
* **Datasets**:
  * BembaSpeech (curated dataset of Bemba audio + transcripts)
  * English subset of [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)
* **Training Time**: 8 epochs (~55 hours on an A100 GPU)
* **Learning Rate**: 1e-5
* **Batch Size**: 16
* **Framework**: Transformers + Accelerate
* **Tokenizer**: WhisperProcessor with `language="<|bem|>"` and `task="transcribe"`
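The original training script is not published on this card, but the hyperparameters above can be collected into a plain config mapping (key names are illustrative, not from the actual script):

```python
# Hyperparameters from this card, gathered as a plain mapping.
# Key names are illustrative; the actual training script is not published here.
TRAIN_CONFIG = {
    "base_model": "openai/whisper-medium",
    "learning_rate": 1e-5,
    "per_device_batch_size": 16,
    "num_epochs": 8,
    "language_token": "<|bem|>",
    "task": "transcribe",
}
```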
---

## 🚀 Usage

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="NextInnoMind/next_bemba_ai_medium",
    chunk_length_s=30,
    return_timestamps=True,
)

# Transcribe a local audio file
result = pipe("path_to_audio.wav")
print(result["text"])
```
> 💡 Tip: For Bemba, use the language token `<|bem|>` to improve transcription accuracy.
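One way to pass the token is through `generate_kwargs` (a sketch; whether this checkpoint accepts `<|bem|>` under the `language` argument is an assumption based on this card — adjust if the processor expects a different language name):

```python
# Decoding options assumed from this card's description of the <|bem|> token.
BEMBA_KWARGS = {"language": "<|bem|>", "task": "transcribe"}

def transcribe_bemba(audio_path: str) -> str:
    """Transcribe a Bemba audio file, steering the decoder with <|bem|>."""
    from transformers import pipeline  # deferred import: heavy dependency

    asr = pipeline(
        "automatic-speech-recognition",
        model="NextInnoMind/next_bemba_ai_medium",
        chunk_length_s=30,
    )
    return asr(audio_path, generate_kwargs=BEMBA_KWARGS)["text"]
```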
---

## 🌍 Applications

* **Multilingual Education**: Bemba-English subtitles and transcription
* **Broadcast & Media**: Transcribe bilingual radio or TV content
* **Research**: Language preservation and Bantu-English linguistic studies
* **Voice Accessibility**: Multilingual ASR tools and captioning
---

## ⚠️ Limitations & Biases

* Slight performance drop on highly noisy or code-switched audio
* Trained on formal, clean speech; informal speech may lower accuracy
* The `<|bem|>` token is required for optimal Bemba decoding
---

## 📈 Evaluation

| Language | WER (Word Error Rate) | Dataset              |
| -------- | --------------------- | -------------------- |
| Bemba    | ~15.2%                | BembaSpeech Eval Set |
| English  | ~10.5%                | Common Voice EN      |
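WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal pure-Python sketch for illustration — real evaluations typically use a library such as `jiwer` or `evaluate`:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```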
---

## 🌱 Environmental Impact

* **Hardware**: 1× A100 40GB
* **Training Time**: ~55 hours
* **Carbon Emissions**: Estimated ~25.8 kg CO₂ *(via [ML CO2 Impact](https://mlco2.github.io/impact))*
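As a sanity check, the figure is roughly consistent with ~55 hours at an assumed ~400 W average draw and an assumed grid intensity of ~1.17 kg CO₂/kWh (both placeholder values for illustration, not measurements):

```python
# Back-of-envelope check of the CO2 estimate above.
# Power draw and grid intensity are assumptions, not measured values.
hours = 55                    # training time reported on this card
gpu_power_kw = 0.4            # assumed average draw of one A100 40GB
intensity_kg_per_kwh = 1.17   # assumed grid carbon intensity

energy_kwh = hours * gpu_power_kw
co2_kg = energy_kwh * intensity_kg_per_kwh
print(f"{energy_kwh:.1f} kWh -> {co2_kg:.1f} kg CO2")
```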
---

## 📄 Citation

```bibtex
@misc{nextbembaai2025,
  title={NextInnoMind next_bemba_ai_medium: Multilingual Whisper ASR model for Bemba and English},
  author={Silas Chalwe and NextInnoMind},
  year={2025},
  howpublished={\url{https://huggingface.co/NextInnoMind/next_bemba_ai_medium}},
}
```
---

## 🧑‍💻 Maintainers

* **Chalwe Silas** (Lead Developer & Dataset Curator)
* Team **NextInnoMind**

📬 Contact:

* [silaschalwe@outlook.com](mailto:silaschalwe@outlook.com)
* [mchalwesilas@gmail.com](mailto:mchalwesilas@gmail.com)

🌐 GitHub: [SilasChalwe](https://github.com/SilasChalwe)
---

## 🔗 Related Resources

* [BembaSpeech Dataset](https://huggingface.co/datasets/NextInnoMind/BembaSpeech)
* [NextInnoMind on GitHub](https://github.com/SilasChalwe)
---

Fine-tuned in Zambia.