---
language:
- ki
tags:
- automatic-speech-recognition
- asr
- kikuyu
- wav2vec2
- mms
- speech
- kenyan-languages
- low-resource
license: apache-2.0
datasets:
- thinkKenya/kenyan_audio_datasets
model-index:
- name: MMS 1B Kikuyu ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kenyan Audio Datasets (Kikuyu)
      type: thinkKenya/kenyan_audio_datasets
      config: Kikuyu
      split: test
      args:
        language: ki
    metrics:
    - name: Word Error Rate
      type: wer
      value: 35.74
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Kikuyu Speech Sample
  src: https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav
---

# MMS 1B Kikuyu ASR Model

## Model Description

This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B-parameter model, adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. It uses language adapters to fine-tune the pre-trained MMS model efficiently for Kikuyu, achieving a Word Error Rate (WER) of **35.74%** on the test set.

## Model Details

- **Model Type**: Wav2Vec2ForCTC with language adapters
- **Base Model**: `facebook/mms-1b-all`
- **Language**: Kikuyu (ISO 639-1: `ki`)
- **Task**: Automatic Speech Recognition (ASR)
- **Architecture**: Wav2Vec2 with CTC head and language-specific adapters
- **Parameters**: ~1B total; only the adapter layers are fine-tuned (sketched below)
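
Only the adapter layers carry gradients during fine-tuning. A minimal sketch of that setup, assuming the MMS adapter helpers (`init_adapter_layers`, `freeze_base_model`, `_get_adapters`) available in recent `transformers` releases; the 24-token vocabulary size is taken from this card:

```python
from transformers import Wav2Vec2ForCTC

# Load the multilingual base checkpoint. The CTC head is re-initialized
# for the 24-token Kikuyu vocabulary, so size mismatches are expected.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    vocab_size=24,
    ignore_mismatched_sizes=True,
)

# Re-initialize the language adapters, freeze the backbone, and mark
# only the adapter weights (plus the new CTC head) as trainable.
model.init_adapter_layers()
model.freeze_base_model()
for param in model._get_adapters().values():
    param.requires_grad = True
```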

## Training Data

The model was trained on the Kikuyu subset of the [Kenyan Audio Datasets](https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets):

- **Training samples**: 98,206 audio-text pairs
- **Test samples**: 32,736 audio-text pairs
- **Total dataset size**: 130,942 samples
- **Audio format**: 16 kHz sampling rate
- **Text preprocessing**: normalized, lowercased, special characters removed (see the sketch below)
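
The exact normalization script is not published with this card; a minimal sketch of the preprocessing described above (lowercasing, punctuation removal, whitespace cleanup, with the Kikuyu vowels ĩ and ũ left intact; the punctuation set is an assumption):

```python
import re

# Punctuation to strip; the exact character set used in training is assumed.
PUNCT = re.compile(r"[,?.!;:%()'\"“”‘’-]")

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = PUNCT.sub("", text.lower())
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Wĩ mwega?"))  # -> "wĩ mwega"
```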

## Training Configuration

### Hyperparameters

- **Batch size**: 64 effective (8 per device × 4 GPUs × 2 gradient-accumulation steps)
- **Learning rate**: 3e-4
- **Weight decay**: 0.01
- **Warmup steps**: 500
- **Total training steps**: 12,280
- **Epochs**: 8
- **Mixed precision**: fp16
- **Gradient checkpointing**: enabled

### Hardware & Environment

- **GPUs**: 4× NVIDIA GPUs
- **Framework**: PyTorch with Accelerate
- **Optimization**: AdamW optimizer with linear warmup scheduling (see the sketch below)
- **Distributed training**: multi-GPU via Accelerate
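
The training script itself is not included in this card, but the hyperparameters above map onto a standard AdamW + linear-warmup setup. A sketch, assuming `model` is the adapter-enabled checkpoint from the earlier snippet:

```python
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Only the adapter (and CTC-head) parameters require gradients.
optimizer = AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=3e-4,
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=12_280
)

# fp16 mixed precision and 2-step gradient accumulation, as listed above.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=2)
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)
```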

## Vocabulary

The model uses a character-level vocabulary designed for Kikuyu, containing **24 tokens**:

```
Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
```
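
Such a vocabulary can be serialized to `vocab.json` and wrapped in a CTC tokenizer. A sketch (the index ordering here is illustrative, not the checkpoint's actual mapping):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

chars = list("abcdefghijkmnortuwyĩũ")  # the 21 Kikuyu characters
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)       # word separator (replaces spaces)
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)   # also serves as the CTC blank
assert len(vocab) == 24

with open("vocab.json", "w") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```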

## Performance

### Metrics

- **Best WER**: 35.74% (reached at training step 1700)
- **Training time**: ~3 hours on 4 GPUs
- **Evaluation subset**: 2,048 examples per evaluation step

### Training Progress

WER improved steadily during training (values above 100% are possible early on, when insertion errors outnumber reference words):

- Step 100: WER 100.52%
- Step 500: WER 43.34%
- Step 800: WER 40.02%
- Step 1200: WER 37.48%
- Step 1500: WER 36.66%
- **Step 1700: WER 35.74% (best)**

## Usage

### Quick Start

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

# Load audio; the file must be 16 kHz mono (see the resampling note below)
audio, sr = sf.read("kikuyu_audio.wav")
assert sr == 16000, "expected 16 kHz audio"

# Process audio
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Run inference
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(f"Transcription: {transcription}")
```
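
`soundfile` does not resample, so the snippet above assumes a 16 kHz mono file. If your recording has a different sampling rate, one option is to load it with `librosa`, which resamples and downmixes on load (the filename is a placeholder):

```python
import librosa

# Returns mono float32 audio resampled to 16 kHz.
audio, sr = librosa.load("kikuyu_audio_44k.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
```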

### With Pipeline

```python
from transformers import pipeline

# Initialize the ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="nickdee96/mms-1b-kik-accelerate-2multi",
)

# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])
```
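
For long recordings, the pipeline can transcribe in overlapping chunks. `chunk_length_s` and `stride_length_s` are standard arguments of the ASR pipeline; the values below are illustrative:

```python
# Transcribe a long file in 30 s chunks with 5 s of context on each side.
result = asr("long_kikuyu_audio.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```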

## Model Architecture

The model leverages the MMS (Massively Multilingual Speech) architecture:

1. **Wav2Vec2 backbone**: pre-trained on 1,000+ languages
2. **Language adapters**: lightweight adapter layers trained specifically for Kikuyu
3. **CTC head**: Connectionist Temporal Classification for alignment-free transcription
4. **Feature extractor**: convolutional layers that encode raw audio into latent features

## Limitations and Bias

### Limitations

- **Domain specificity**: trained primarily on read speech, so it may not generalize well to conversational or spontaneous speech
- **Audio quality**: performance may degrade on low-quality or noisy audio
- **Vocabulary coverage**: limited to the characters present in the training data
- **Code-switching**: may not handle Kikuyu-English code-switching well

### Bias Considerations

- The model reflects the linguistic patterns and potential biases of the training dataset
- Performance may vary across Kikuyu dialects and speaker demographics
- The dataset composition may not represent all varieties of spoken Kikuyu

## Training Infrastructure

### Optimizations Applied

- **Vocabulary caching**: vocabulary extraction results are cached between runs
- **Multiprocessing**: parallel data preprocessing with 16 worker processes (see the sketch below)
- **Feature extraction**: batched audio processing
- **Memory**: gradient checkpointing and mixed-precision training
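
The parallel preprocessing maps onto the standard Hugging Face Datasets pattern. A sketch, assuming the dataset exposes `audio` and `text` columns (hypothetical names) and reusing the `processor` from the usage examples:

```python
from datasets import load_dataset

dataset = load_dataset("thinkKenya/kenyan_audio_datasets", "Kikuyu")

def prepare_example(batch):
    # Hypothetical columns: decode the audio and encode the target text.
    audio = batch["audio"]
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# 16 worker processes, matching the configuration above.
dataset = dataset.map(prepare_example, num_proc=16)
```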

### Reproducibility

- **Seed**: 42
- **Framework versions**: PyTorch 2.x, Transformers 4.x, Accelerate
- **Training logs**: available in the model repository
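
Seeding everything in one call, e.g. with Accelerate's helper:

```python
from accelerate.utils import set_seed

set_seed(42)  # seeds Python's random, NumPy, and PyTorch (CPU and CUDA)
```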

## Citation

```bibtex
@misc{kikuyu-asr-2024,
  title        = {MMS 1B Kikuyu ASR Model},
  author       = {Kikuyu ASR Team},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}
```

## Acknowledgments

- **Base model**: Facebook's MMS team for the pre-trained multilingual model
- **Dataset**: thinkKenya for the Kenyan Audio Datasets
- **Infrastructure**: Microsoft Azure for compute resources
- **Frameworks**: Hugging Face Transformers and Accelerate

## License

This model is released under the Apache 2.0 license, in line with the base MMS model's licensing.

## Model Card Contact

For questions or issues with this model, please open an issue in the repository or contact the model authors.