---
language:
- ki
tags:
- automatic-speech-recognition
- asr
- kikuyu
- wav2vec2
- mms
- speech
- kenyan-languages
- low-resource
license: apache-2.0
datasets:
- thinkKenya/kenyan_audio_datasets
model-index:
- name: MMS 1B Kikuyu ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kenyan Audio Datasets (Kikuyu)
      type: thinkKenya/kenyan_audio_datasets
      config: Kikuyu
      split: test
      args:
        language: ki
    metrics:
    - name: Word Error Rate
      type: wer
      value: 35.74
    - name: Character Error Rate
      type: cer
      value: N/A
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Kikuyu Speech Sample
  src: https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav
---
# MMS 1B Kikuyu ASR Model
## Model Description
This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B parameter model, specifically adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. The model uses language adapters to efficiently fine-tune the pre-trained MMS model for the Kikuyu language, achieving a Word Error Rate (WER) of **35.74%** on the test set.
## Model Details
- **Model Type**: Wav2Vec2ForCTC with language adapters
- **Base Model**: `facebook/mms-1b-all`
- **Language**: Kikuyu (ISO 639-1: `ki`)
- **Task**: Automatic Speech Recognition (ASR)
- **Architecture**: Wav2Vec2 with CTC head and language-specific adapters
- **Parameters**: ~1B total parameters (only adapter layers fine-tuned)
## Training Data
The model was trained on the [Kenyan Audio Datasets](https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets) Kikuyu subset:
- **Training samples**: 98,206 audio-text pairs
- **Test samples**: 32,736 audio-text pairs
- **Total dataset size**: 130,942 samples
- **Audio format**: 16kHz sampling rate
- **Text preprocessing**: Normalized, lowercase, special characters removed
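
The exact preprocessing rules are not published with the model; a minimal sketch of the normalization described above (lowercasing, stripping special characters, collapsing whitespace) could look like:

```python
import re

def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation/special characters, collapse whitespace.

    A sketch of the normalization described above; the rules actually used
    during training may differ (e.g. which characters count as "special").
    """
    text = text.lower()
    # \w is Unicode-aware in Python 3, so Kikuyu vowels like ĩ and ũ survive
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("Wĩra,  Mwega!")` yields `"wĩra mwega"`.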
## Training Configuration
### Hyperparameters
- **Batch size**: 64 (8 per device × 4 GPUs × 2 gradient accumulation steps)
- **Learning rate**: 3e-4
- **Weight decay**: 0.01
- **Warmup steps**: 500
- **Total training steps**: 12,280
- **Epochs**: 8
- **Mixed precision**: fp16
- **Gradient checkpointing**: Enabled
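
The effective batch size of 64 follows from per-device size × GPU count × gradient-accumulation steps. The settings above can be captured in a plain config object (a sketch; the field names are illustrative, not taken from the actual training script):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values from the hyperparameter table above; names are illustrative
    per_device_batch_size: int = 8
    num_gpus: int = 4
    gradient_accumulation_steps: int = 2
    learning_rate: float = 3e-4
    weight_decay: float = 0.01
    warmup_steps: int = 500
    num_epochs: int = 8
    fp16: bool = True
    gradient_checkpointing: bool = True

    @property
    def effective_batch_size(self) -> int:
        # 8 per device x 4 GPUs x 2 accumulation steps = 64
        return (self.per_device_batch_size
                * self.num_gpus
                * self.gradient_accumulation_steps)
```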
### Hardware & Environment
- **GPUs**: 4x NVIDIA GPUs
- **Framework**: PyTorch with Accelerate
- **Optimization**: AdamW optimizer with linear warmup scheduling
- **Distributed training**: Multi-GPU with Accelerate
## Vocabulary
The model uses a character-level vocabulary specifically designed for Kikuyu, containing **24 tokens**:
```
Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
```
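
A CTC vocabulary of this shape is just a character-to-id mapping; a sketch of how such a `vocab.json` could be assembled (the actual file ships with the model):

```python
import json

# The 21 Kikuyu characters listed above
chars = list("abcdefghijkmnortuwyĩũ")

vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)        # word separator, stands in for space
vocab["[UNK]"] = len(vocab)    # unknown character
vocab["[PAD]"] = len(vocab)    # CTC blank / padding

# json.dump(vocab, open("vocab.json", "w"), ensure_ascii=False)
# would produce a file loadable by Wav2Vec2CTCTokenizer
```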
## Performance
### Metrics
- **Best WER**: 35.74% (achieved at training step 1700)
- **Training time**: ~3 hours on 4 GPUs
- **Evaluation subset**: 2,048 examples per evaluation step
### Training Progress
The model showed consistent improvement during training:
- Step 100: WER 100.52%
- Step 500: WER 43.34%
- Step 800: WER 40.02%
- Step 1200: WER 37.48%
- Step 1500: WER 36.66%
- **Step 1700: WER 35.74% (best)**
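
The WER reported above is word-level edit distance divided by reference length. A self-contained sketch of the computation (in practice a library such as `jiwer` or `evaluate` is used instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the ref prefix seen so far and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion of a reference word
                       d[j - 1] + 1,      # insertion of a hypothesis word
                       prev + (r != h))   # substitution (or match)
            prev = cur
    return d[-1] / len(ref)
```

For example, one substituted word out of two gives `wer("tũgendi mũciĩ", "tũgendi mũcii") == 0.5`.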
## Usage
### Quick Start
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import soundfile as sf

# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

# Load audio file (16 kHz)
audio, sr = sf.read("kikuyu_audio.wav")

# Process audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# Generate transcription
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode prediction
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(f"Transcription: {transcription}")
```
### With Pipeline
```python
from transformers import pipeline

# Initialize ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="nickdee96/mms-1b-kik-accelerate-2multi",
)

# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])
```
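
Both snippets assume 16 kHz input; recordings at other rates should be resampled first, normally with a proper resampler such as `torchaudio.functional.resample` or `librosa.resample`. For illustration only, a dependency-free linear-interpolation sketch:

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Naive linear-interpolation resampling (illustration only; a polyphase
    resampler such as torchaudio.functional.resample is the right tool)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate  # fractional index into the input
        i = int(pos)
        frac = pos - i
        nxt = samples[i + 1] if i + 1 < len(samples) else samples[i]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out
```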
## Model Architecture
The model leverages the MMS (Massively Multilingual Speech) architecture with:
1. **Wav2Vec2 Backbone**: Pre-trained on 1000+ languages
2. **Language Adapters**: Lightweight adapter layers specifically trained for Kikuyu
3. **CTC Head**: Connectionist Temporal Classification for sequence-to-sequence learning
4. **Feature Extraction**: Convolutional layers for audio feature extraction
## Limitations and Bias
### Limitations
- **Domain specificity**: Trained primarily on read speech, may not generalize well to conversational or spontaneous speech
- **Audio quality**: Performance may degrade on low-quality or noisy audio
- **Vocabulary coverage**: Limited to characters present in training data
- **Code-switching**: May not handle Kikuyu-English code-switching well
### Bias Considerations
- The model reflects the linguistic patterns and potential biases present in the training dataset
- Performance may vary across different Kikuyu dialects or speaker demographics
- The dataset composition may not represent all varieties of spoken Kikuyu
## Training Infrastructure
### Optimizations Applied
- **Vocabulary caching**: Efficient vocabulary extraction with caching
- **Multiprocessing**: Parallel data processing with 16 processes
- **Feature extraction optimization**: Batched audio processing
- **Memory optimization**: Gradient checkpointing and mixed precision training
### Reproducibility
- **Seed**: 42
- **Framework versions**: PyTorch 2.x, Transformers 4.x, Accelerate
- **Training logs**: Available in model repository
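
Reproducibility hinges on seeding every random-number source before training. A minimal sketch of the pattern (Accelerate users would typically call `accelerate.utils.set_seed(42)`, which seeds Python, NumPy, and PyTorch in one step):

```python
import random

SEED = 42

def set_seed(seed: int = SEED) -> None:
    """Seed Python's RNG. In the actual multi-GPU setup, NumPy and PyTorch
    must be seeded as well (e.g. via accelerate.utils.set_seed)."""
    random.seed(seed)
    # numpy.random.seed(seed) and torch.manual_seed(seed) belong here too
    # when those libraries are in play.
```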
## Citation
```bibtex
@misc{kikuyu-asr-2024,
  title        = {MMS 1B Kikuyu ASR Model},
  author       = {Kikuyu ASR Team},
  year         = {2024},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}
```
## Acknowledgments
- **Base model**: Facebook's MMS team for the pre-trained multilingual model
- **Dataset**: thinkKenya for the Kenyan Audio Datasets
- **Infrastructure**: Microsoft Azure for compute resources
- **Framework**: Hugging Face Transformers and Accelerate libraries
## License
This model is released under the Apache 2.0 license, following the base MMS model licensing.
## Model Card Contact
For questions or issues with this model, please open an issue in the repository or contact the model authors.