# 🫢 NextInnoMind / next_bemba_ai_medium

**Multilingual Whisper ASR (Automatic Speech Recognition)**

Fine-tuned Whisper model for Bemba and English using language tokens. Developed and maintained by **NextInnoMind**, led by **Chalwe Silas**.

---

### 🧪 Model Type

`WhisperForConditionalGeneration`, fine-tuned from [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)

* Framework: `Transformers`
* Checkpoint format: `Safetensors`
* Languages: `Bemba`, `English` (with `<|bem|>` language token support)

---

## 📜 Model Description

This model is a Whisper Medium variant fine-tuned for **Bemba** and **English**, enabling robust multilingual transcription. It supports language tokens (e.g., `<|bem|>`) to guide decoding, which is particularly helpful for low-resource languages such as Bemba.

---

## 📚 Training Details

* **Base Model**: [`openai/whisper-medium`](https://huggingface.co/openai/whisper-medium)
* **Datasets**:
  * BembaSpeech (curated dataset of Bemba audio and transcripts)
  * English subset of [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)
* **Training Time**: 8 epochs (~55 hours on an A100 GPU)
* **Learning Rate**: 1e-5
* **Batch Size**: 16
* **Framework**: Transformers + Accelerate
* **Tokenizer**: WhisperProcessor with `language="<|bem|>"` and `task="transcribe"`

---

## 🚀 Usage

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="NextInnoMind/next_bemba_ai_medium",
    chunk_length_s=30,
    return_timestamps=True,
)

# Example
result = pipe("path_to_audio.wav")
print(result["text"])
```

> 📌 Tip: For Bemba, use the language token `<|bem|>` to improve transcription accuracy.
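The tip above can be made concrete. Whisper pins the language and task by forcing tokens at fixed positions of the decoder prompt (`<|startoftranscript|><|lang|><|task|>`). The sketch below builds such a prompt, assuming the fine-tuned tokenizer defines `<|bem|>`; the helper function and the token ids shown are illustrative, not part of the published API:

```python
def build_forced_decoder_ids(token_to_id, language="<|bem|>", task="<|transcribe|>"):
    """Build Whisper-style forced_decoder_ids.

    Decoder position 0 is <|startoftranscript|> (emitted automatically);
    forcing positions 1 and 2 pins the language and task tokens.
    """
    return [(1, token_to_id[language]), (2, token_to_id[task])]


# Illustrative usage (downloads the checkpoint; real ids come from the
# fine-tuned tokenizer, not the placeholder numbers below):
#
# from transformers import WhisperProcessor, WhisperForConditionalGeneration
# processor = WhisperProcessor.from_pretrained("NextInnoMind/next_bemba_ai_medium")
# model = WhisperForConditionalGeneration.from_pretrained("NextInnoMind/next_bemba_ai_medium")
# ids = {t: processor.tokenizer.convert_tokens_to_ids(t)
#        for t in ("<|bem|>", "<|transcribe|>")}
# model.generation_config.forced_decoder_ids = build_forced_decoder_ids(ids)

# Toy mapping, just to show the shape of the result:
print(build_forced_decoder_ids({"<|bem|>": 51865, "<|transcribe|>": 50359}))
# [(1, 51865), (2, 50359)]
```

Forcing the language token this way skips Whisper's automatic language detection, which is unreliable for languages that were not in the original pretraining set.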
--- ## ๐Ÿ” Applications * **Multilingual Education**: Bemba-English subtitles and transcription * **Broadcast & Media**: Transcribe bilingual radio or TV content * **Research**: Language preservation and Bantu-English linguistic studies * **Voice Accessibility**: Multilingual ASR tools and captioning --- ## โš ๏ธ Limitations & Biases * Slight performance drop with highly noisy or code-switched audio * Trained on formal and clean speech; informal speech may lower accuracy * `<|bem|>` is required for optimal Bemba decoding --- ## ๐Ÿ“Š Evaluation | Language | WER (Word Error Rate) | Dataset | | -------- | --------------------- | -------------------- | | Bemba | \~15.2% | BembaSpeech Eval Set | | English | \~10.5% | Common Voice EN | --- ## ๐ŸŒฑ Environmental Impact * **Hardware**: A100 40GB x1 * **Training Time**: \~55 hours * **Carbon Emissions**: Estimated \~25.8 kg COโ‚‚ *(via [ML CO2 Impact](https://mlco2.github.io/impact))* --- ## ๐Ÿ“„ Citation ```bibtex @misc{nextbembaai2025, title={NextInnoMind next_bemba_ai_medium: Multilingual Whisper ASR model for Bemba and English}, author={Silas Chalwe and NextInnoMind}, year={2025}, howpublished={\url{https://huggingface.co/NextInnoMind/next_bemba_ai_medium}}, } ``` --- ## ๐Ÿง‘โ€๐Ÿ’ป Maintainers * **Chalwe Silas** (Lead Developer & Dataset Curator) * Team **NextInnoMind** ๐Ÿ“ฌ Contact: * [silaschalwe@outlook.com](mailto:silaschalwe@outlook.com) * [mchalwesilas@gmail.com](mailto:mchalwesilas@gmail.com) ๐Ÿ”— GitHub: [SilasChalwe](https://github.com/SilasChalwe) --- ## ๐Ÿ“Œ Related Resources * [BembaSpeech Dataset](https://huggingface.co/datasets/NextInnoMind/BembaSpeech) * [NextInnoMind on GitHub](https://github.com/SilasChalwe) --- Fine tuned in Zambia.