---
language:
- ki
tags:
- automatic-speech-recognition
- asr
- kikuyu
- wav2vec2
- mms
- speech
- kenyan-languages
- low-resource
license: apache-2.0
datasets:
- thinkKenya/kenyan_audio_datasets
model-index:
- name: MMS 1B Kikuyu ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kenyan Audio Datasets (Kikuyu)
      type: thinkKenya/kenyan_audio_datasets
      config: Kikuyu
      split: test
      args:
        language: ki
    metrics:
    - name: Word Error Rate
      type: wer
      value: 35.74
    - name: Character Error Rate
      type: cer
      value: N/A
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Kikuyu Speech Sample
  src: https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav
---
# MMS 1B Kikuyu ASR Model
## Model Description
This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B-parameter model, adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. It uses language adapters to fine-tune the pre-trained MMS model efficiently for Kikuyu, achieving a Word Error Rate (WER) of **35.74%** on the test set.
## Model Details
- **Model Type**: Wav2Vec2ForCTC with language adapters
- **Base Model**: `facebook/mms-1b-all`
- **Language**: Kikuyu (ISO 639-1: `ki`)
- **Task**: Automatic Speech Recognition (ASR)
- **Architecture**: Wav2Vec2 with CTC head and language-specific adapters
- **Parameters**: ~1B total parameters (only adapter layers fine-tuned)
## Training Data
The model was trained on the [Kenyan Audio Datasets](https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets) Kikuyu subset:
- **Training samples**: 98,206 audio-text pairs
- **Test samples**: 32,736 audio-text pairs
- **Total dataset size**: 130,942 samples
- **Audio format**: 16kHz sampling rate
- **Text preprocessing**: Normalized, lowercase, special characters removed
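The exact cleanup script is not published with this card. A minimal sketch of a normalization step consistent with the description above (lowercasing, dropping characters outside the model's character set, collapsing whitespace); the `KEEP` set is an assumption taken from the vocabulary listed later in this card:

```python
import re

# Hypothetical normalization matching the preprocessing described above; the
# kept character set is assumed from the vocabulary section of this card.
KEEP = set("abcdefghijkmnortuwyĩũ ")

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch in KEEP)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Wĩ mwega?"))  # -> "wĩ mwega"
```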
## Training Configuration
### Hyperparameters
- **Batch size**: 64 (8 per device × 4 GPUs × 2 gradient accumulation steps)
- **Learning rate**: 3e-4
- **Weight decay**: 0.01
- **Warmup steps**: 500
- **Total training steps**: 12,280
- **Epochs**: 8
- **Mixed precision**: fp16
- **Gradient checkpointing**: Enabled
### Hardware & Environment
- **GPUs**: 4x NVIDIA GPUs
- **Framework**: PyTorch with Accelerate
- **Optimization**: AdamW optimizer with linear warmup scheduling
- **Distributed training**: Multi-GPU with Accelerate
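The training loop itself is not included here. A minimal sketch of the optimizer, scheduler, and Accelerate setup implied by the hyperparameters above; `train_dataloader` is a placeholder for the prepared Kikuyu dataloader, and optimizing all of `model.parameters()` is a simplification of the adapter-only recipe:

```python
import torch
from accelerate import Accelerator
from transformers import Wav2Vec2ForCTC, get_linear_schedule_with_warmup

model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")
model.gradient_checkpointing_enable()  # trade compute for memory

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=2)

# In the MMS adapter recipe only the adapter layers are trainable;
# passing model.parameters() here is a simplification.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=12_280
)

# `train_dataloader` is a placeholder for the prepared Kikuyu DataLoader.
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)
```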
## Vocabulary
The model uses a character-level vocabulary designed for Kikuyu, containing **24 tokens** (21 characters plus 3 special tokens):
```
Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
```
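The actual `vocab.json` ships with the checkpoint, so in practice you load the processor from the Hub. Purely as an illustration of how such a character-level CTC vocabulary is assembled:

```python
import json
from transformers import Wav2Vec2CTCTokenizer

chars = list("abcdefghijkmnortuwyĩũ")  # 21 Kikuyu characters
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)                # word separator
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
print(len(tokenizer))  # 24
```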
## Performance
### Metrics
- **Best WER**: 35.74% (achieved at training step 1700)
- **Training time**: ~3 hours on 4 GPUs
- **Evaluation subset**: 2,048 examples per evaluation step
### Training Progress
The model showed consistent improvement during training:
- Step 100: WER 100.52%
- Step 500: WER 43.34%
- Step 800: WER 40.02%
- Step 1200: WER 37.48%
- Step 1500: WER 36.66%
- **Step 1700: WER 35.74% (best)**
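WER can be recomputed on any set of reference/hypothesis pairs with the `evaluate` library; the strings below are made-up examples, not dataset samples:

```python
import evaluate

wer_metric = evaluate.load("wer")

references = ["wĩ mwega"]        # ground-truth transcripts (made up)
predictions = ["wĩ mwega mũno"]  # model outputs (made up)

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```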
## Usage
### Quick Start
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import soundfile as sf
# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
# Load audio file (16kHz)
audio, sr = sf.read("kikuyu_audio.wav")
# Process audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
# Generate transcription
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Decode prediction
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(f"Transcription: {transcription}")
```
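The snippet above assumes the recording is already mono 16 kHz. If it is not, resample first; a small sketch using `torchaudio`:

```python
import torchaudio

waveform, sr = torchaudio.load("kikuyu_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
audio = waveform.mean(dim=0).numpy()  # downmix to mono, shape (num_samples,)
```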
### With Pipeline
```python
from transformers import pipeline
# Initialize ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="nickdee96/mms-1b-kik-accelerate-2multi",
)
# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])
```
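For long recordings, the pipeline can transcribe in fixed-length chunks; `chunk_length_s` is a standard ASR pipeline argument, and 30 s is just an example value:

```python
result = asr("long_kikuyu_audio.wav", chunk_length_s=30)
print(result["text"])
```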
## Model Architecture
The model leverages the MMS (Massively Multilingual Speech) architecture with:
1. **Wav2Vec2 Backbone**: Pre-trained on 1000+ languages
2. **Language Adapters**: Lightweight adapter layers specifically trained for Kikuyu
3. **CTC Head**: Connectionist Temporal Classification head for alignment-free mapping of audio frames to character sequences
4. **Feature Extraction**: Convolutional layers for audio feature extraction
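To see how small the adapter footprint is relative to the ~1B backbone, you can count parameters directly; this sketch assumes the checkpoint keeps the standard MMS attention-adapter module name (`adapter_layer`):

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

total = sum(p.numel() for p in model.parameters())
# Assumes MMS-style adapters named "adapter_layer" inside each encoder layer.
adapter = sum(p.numel() for n, p in model.named_parameters() if "adapter_layer" in n)

print(f"Total parameters:   {total / 1e6:.0f}M")
print(f"Adapter parameters: {adapter / 1e6:.1f}M")
```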
## Limitations and Bias
### Limitations
- **Domain specificity**: Trained primarily on read speech, so it may not generalize well to conversational or spontaneous speech
- **Audio quality**: Performance may degrade on low-quality or noisy audio
- **Vocabulary coverage**: Limited to characters present in training data
- **Code-switching**: May not handle Kikuyu-English code-switching well
### Bias Considerations
- The model reflects the linguistic patterns and potential biases present in the training dataset
- Performance may vary across different Kikuyu dialects or speaker demographics
- The dataset composition may not represent all varieties of spoken Kikuyu
## Training Infrastructure
### Optimizations Applied
- **Vocabulary caching**: Efficient vocabulary extraction with caching
- **Multiprocessing**: Parallel data processing with 16 processes (see the sketch after this list)
- **Feature extraction optimization**: Batched audio processing
- **Memory optimization**: Gradient checkpointing and mixed precision training
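The multiprocessed preprocessing maps roughly onto the standard `datasets` pattern below; the dataset config name `Kikuyu` and the column names `audio`/`text` are assumptions based on this card, not verified against the dataset:

```python
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

# Config and column names are assumptions; adjust to the actual dataset schema.
ds = load_dataset("thinkKenya/kenyan_audio_datasets", "Kikuyu", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_values"] = processor(audio["array"], sampling_rate=16_000).input_values[0]
    batch["labels"] = processor(text=batch["text"]).input_ids
    return batch

ds = ds.map(prepare, num_proc=16, remove_columns=ds.column_names)
```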
### Reproducibility
- **Seed**: 42
- **Framework versions**: PyTorch 2.x, Transformers 4.x, Accelerate
- **Training logs**: Available in model repository
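Seeding is a one-liner with either `transformers` or `accelerate`:

```python
from accelerate.utils import set_seed

set_seed(42)  # seeds Python, NumPy and PyTorch RNGs (CPU and CUDA)
```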
## Citation
```bibtex
@misc{kikuyu-asr-2024,
  title        = {MMS 1B Kikuyu ASR Model},
  author       = {Kikuyu ASR Team},
  year         = {2024},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}
```
## Acknowledgments
- **Base model**: Facebook's MMS team for the pre-trained multilingual model
- **Dataset**: thinkKenya for the Kenyan Audio Datasets
- **Infrastructure**: Microsoft Azure for compute resources
- **Framework**: Hugging Face Transformers and Accelerate libraries
## License
This model is released under the Apache 2.0 license, following the base MMS model licensing.
## Model Card Contact
For questions or issues with this model, please open an issue in the repository or contact the model authors.