--- language: - ki tags: - automatic-speech-recognition - asr - kikuyu - wav2vec2 - mms - speech - kenyan-languages - low-resource license: apache-2.0 datasets: - thinkKenya/kenyan_audio_datasets model-index: - name: MMS 1B Kikuyu ASR results: - task: name: Automatic Speech Recognition type: automatic-speech-recognition dataset: name: Kenyan Audio Datasets (Kikuyu) type: thinkKenya/kenyan_audio_datasets config: Kikuyu split: test args: language: ki metrics: - name: Word Error Rate type: wer value: 35.74 - name: Character Error Rate type: cer value: N/A pipeline_tag: automatic-speech-recognition widget: - example_title: Kikuyu Speech Sample src: https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav --- # MMS 1B Kikuyu ASR Model ## Model Description This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B parameter model, specifically adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. The model uses language adapters to efficiently fine-tune the pre-trained MMS model for the Kikuyu language, achieving a Word Error Rate (WER) of **35.74%** on the test set. ## Model Details - **Model Type**: Wav2Vec2ForCTC with language adapters - **Base Model**: `facebook/mms-1b-all` - **Language**: Kikuyu (ISO 639-1: `ki`) - **Task**: Automatic Speech Recognition (ASR) - **Architecture**: Wav2Vec2 with CTC head and language-specific adapters - **Parameters**: ~1B total parameters (only adapter layers fine-tuned) ## Training Data The model was trained on the [Kenyan Audio Datasets](https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets) Kikuyu subset: - **Training samples**: 98,206 audio-text pairs - **Test samples**: 32,736 audio-text pairs - **Total dataset size**: 130,942 samples - **Audio format**: 16kHz sampling rate - **Text preprocessing**: Normalized, lowercase, special characters removed ## Training Configuration ### Hyperparameters - **Batch size**: 64 (8 per device × 4 GPUs × 2 gradient accumulation steps) - **Learning rate**: 3e-4 - **Weight decay**: 0.01 - **Warmup steps**: 500 - **Total training steps**: 12,280 - **Epochs**: 8 - **Mixed precision**: fp16 - **Gradient checkpointing**: Enabled ### Hardware & Environment - **GPUs**: 4x NVIDIA GPUs - **Framework**: PyTorch with Accelerate - **Optimization**: AdamW optimizer with linear warmup scheduling - **Distributed training**: Multi-GPU with Accelerate ## Vocabulary The model uses a character-level vocabulary specifically designed for Kikuyu, containing **24 tokens**: ``` Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ Special tokens: [PAD], [UNK], | (word separator) ``` ## Performance ### Metrics - **Best WER**: 35.74% (achieved at training step 1700) - **Training time**: ~3 hours on 4 GPUs - **Evaluation subset**: 2,048 examples per evaluation step ### Training Progress The model showed consistent improvement during training: - Step 100: WER 100.52% - Step 500: WER 43.34% - Step 800: WER 40.02% - Step 1200: WER 37.48% - Step 1500: WER 36.66% - **Step 1700: WER 35.74% (best)** ## Usage ### Quick Start ```python from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor import torch import soundfile as sf # Load model and processor model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi") processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi") # Load audio file (16kHz) audio, sr = sf.read("kikuyu_audio.wav") # Process audio inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True) # Generate transcription with torch.no_grad(): logits = model(inputs.input_values).logits # Decode prediction predicted_ids = torch.argmax(logits, dim=-1) transcription = processor.batch_decode(predicted_ids)[0] print(f"Transcription: {transcription}") ``` ### With Pipeline ```python from transformers import pipeline # Initialize ASR pipeline asr = pipeline("automatic-speech-recognition", model="nickdee96/mms-1b-kik-accelerate-2multi") # Transcribe audio result = asr("kikuyu_audio.wav") print(result["text"]) ``` ## Model Architecture The model leverages the MMS (Massively Multilingual Speech) architecture with: 1. **Wav2Vec2 Backbone**: Pre-trained on 1000+ languages 2. **Language Adapters**: Lightweight adapter layers specifically trained for Kikuyu 3. **CTC Head**: Connectionist Temporal Classification for sequence-to-sequence learning 4. **Feature Extraction**: Convolutional layers for audio feature extraction ## Limitations and Bias ### Limitations - **Domain specificity**: Trained primarily on read speech, may not generalize well to conversational or spontaneous speech - **Audio quality**: Performance may degrade on low-quality or noisy audio - **Vocabulary coverage**: Limited to characters present in training data - **Code-switching**: May not handle Kikuyu-English code-switching well ### Bias Considerations - The model reflects the linguistic patterns and potential biases present in the training dataset - Performance may vary across different Kikuyu dialects or speaker demographics - The dataset composition may not represent all varieties of spoken Kikuyu ## Training Infrastructure ### Optimizations Applied - **Vocabulary caching**: Efficient vocabulary extraction with caching - **Multiprocessing**: Parallel data processing with 16 processes - **Feature extraction optimization**: Batched audio processing - **Memory optimization**: Gradient checkpointing and mixed precision training ### Reproducibility - **Seed**: 42 - **Framework versions**: PyTorch 2.x, Transformers 4.x, Accelerate - **Training logs**: Available in model repository ## Citation ```bibtex @misc{kikuyu-asr-2024, title={MMS 1B Kikuyu ASR Model}, author={Kikuyu ASR Team}, year={2024}, publisher={Hugging Face}, journal={Hugging Face Model Hub}, howpublished={\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}} } ``` ## Acknowledgments - **Base model**: Facebook's MMS team for the pre-trained multilingual model - **Dataset**: thinkKenya for the Kenyan Audio Datasets - **Infrastructure**: Microsoft Azure for compute resources - **Framework**: Hugging Face Transformers and Accelerate libraries ## License This model is released under the Apache 2.0 license, following the base MMS model licensing. ## Model Card Contact For questions or issues with this model, please open an issue in the repository or contact the model authors.