---
language: 
- ki
tags:
- automatic-speech-recognition
- asr
- kikuyu
- wav2vec2
- mms
- speech
- kenyan-languages
- low-resource
license: apache-2.0
datasets:
- thinkKenya/kenyan_audio_datasets
model-index:
- name: MMS 1B Kikuyu ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kenyan Audio Datasets (Kikuyu)
      type: thinkKenya/kenyan_audio_datasets
      config: Kikuyu
      split: test
      args: 
        language: ki
    metrics:
    - name: Word Error Rate
      type: wer
      value: 35.74
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Kikuyu Speech Sample
  src: https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav
---

# MMS 1B Kikuyu ASR Model

## Model Description

This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B-parameter model, adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. It uses language adapters to efficiently fine-tune the pre-trained MMS model for Kikuyu, achieving a Word Error Rate (WER) of **35.74%** on the test set.

## Model Details

- **Model Type**: Wav2Vec2ForCTC with language adapters
- **Base Model**: `facebook/mms-1b-all`
- **Language**: Kikuyu (ISO 639-1: `ki`)
- **Task**: Automatic Speech Recognition (ASR)
- **Architecture**: Wav2Vec2 with CTC head and language-specific adapters
- **Parameters**: ~1B total parameters (only adapter layers fine-tuned)

## Training Data

The model was trained on the [Kenyan Audio Datasets](https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets) Kikuyu subset:

- **Training samples**: 98,206 audio-text pairs
- **Test samples**: 32,736 audio-text pairs  
- **Total dataset size**: 130,942 samples
- **Audio format**: 16kHz sampling rate
- **Text preprocessing**: Normalized, lowercase, special characters removed
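A minimal sketch of that normalization step (the exact kept character set is an assumption based on the vocabulary listed later in this card; the actual training script may differ):

```python
import re

def normalize_text(text: str) -> str:
    """Lowercase and strip characters outside the assumed Kikuyu character set."""
    text = text.lower()
    # Keep lowercase Latin letters, the Kikuyu vowels ĩ/ũ, and spaces.
    text = re.sub(r"[^a-zĩũ ]", "", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Wĩ mwega?  NĨ wega!"))  # -> "wĩ mwega nĩ wega"
```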

## Training Configuration

### Hyperparameters
- **Batch size**: 64 (8 per device × 4 GPUs × 2 gradient accumulation steps)
- **Learning rate**: 3e-4
- **Weight decay**: 0.01
- **Warmup steps**: 500
- **Total training steps**: 12,280
- **Epochs**: 8
- **Mixed precision**: fp16
- **Gradient checkpointing**: Enabled
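As a sanity check, the hyperparameters above are internally consistent; a short calculation using the sample counts from the Training Data section:

```python
import math

per_device_batch = 8
num_gpus = 4
grad_accum = 2
effective_batch = per_device_batch * num_gpus * grad_accum  # 8 x 4 x 2 = 64

train_samples = 98_206  # from the Training Data section
epochs = 8
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 1,535
total_steps = steps_per_epoch * epochs  # 12,280, matching the value above
```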

### Hardware & Environment
- **GPUs**: 4x NVIDIA GPUs
- **Framework**: PyTorch with Accelerate
- **Optimization**: AdamW optimizer with linear warmup scheduling
- **Distributed training**: Multi-GPU with Accelerate

## Vocabulary

The model uses a character-level vocabulary specifically designed for Kikuyu, containing **24 tokens**:

```
Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
```
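A CTC vocabulary of this shape is typically built as a character-to-id mapping; a minimal sketch (the ordering and ids here are assumptions, not the model's actual mapping):

```python
chars = list("abcdefghijkmnortuwyĩũ")  # the 21 characters listed above
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)      # word separator (replaces the space character)
vocab["[UNK]"] = len(vocab)  # unknown token
vocab["[PAD]"] = len(vocab)  # padding token, also used as the CTC blank
print(len(vocab))  # 24 tokens, matching the count reported above
```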

## Performance

### Metrics
- **Best WER**: 35.74% (achieved at training step 1700)
- **Training time**: ~3 hours on 4 GPUs
- **Evaluation subset**: 2,048 examples per evaluation step

### Training Progress
The model showed consistent improvement during training:
- Step 100: WER 100.52%
- Step 500: WER 43.34% 
- Step 800: WER 40.02%
- Step 1200: WER 37.48%
- Step 1500: WER 36.66%
- **Step 1700: WER 35.74% (best)**
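The WER figures above are the word-level edit distance divided by the number of reference words; a minimal reference implementation (the training run likely used a library such as `evaluate` or `jiwer` instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance DP table: d[i][j] = distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```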

## Usage

### Quick Start

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import soundfile as sf

# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

# Load audio file (expects 16 kHz; resample first if your recording differs)
audio, sr = sf.read("kikuyu_audio.wav")

# Process audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# Generate transcription
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode prediction
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(f"Transcription: {transcription}")
```

### With Pipeline

```python
from transformers import pipeline

# Initialize ASR pipeline
asr = pipeline("automatic-speech-recognition", 
               model="nickdee96/mms-1b-kik-accelerate-2multi")

# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])
```

## Model Architecture

The model leverages the MMS (Massively Multilingual Speech) architecture with:

1. **Wav2Vec2 Backbone**: Pre-trained on 1000+ languages
2. **Language Adapters**: Lightweight adapter layers specifically trained for Kikuyu
3. **CTC Head**: Connectionist Temporal Classification for sequence-to-sequence learning
4. **Feature Extraction**: Convolutional layers for audio feature extraction
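Greedy CTC decoding, which `processor.batch_decode` performs in the Usage examples above, collapses repeated frame predictions and drops the blank token; a toy sketch with made-up token ids:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# A blank between two identical tokens keeps both occurrences.
print(ctc_greedy_collapse([2, 2, 3, 1, 1, 0, 1]))  # -> [2, 3, 1, 1]
```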

## Limitations and Bias

### Limitations
- **Domain specificity**: Trained primarily on read speech; may not generalize well to conversational or spontaneous speech
- **Audio quality**: Performance may degrade on low-quality or noisy audio
- **Vocabulary coverage**: Limited to characters present in training data
- **Code-switching**: May not handle Kikuyu-English code-switching well

### Bias Considerations
- The model reflects the linguistic patterns and potential biases present in the training dataset
- Performance may vary across different Kikuyu dialects or speaker demographics
- The dataset composition may not represent all varieties of spoken Kikuyu

## Training Infrastructure

### Optimizations Applied
- **Vocabulary caching**: Efficient vocabulary extraction with caching
- **Multiprocessing**: Parallel data processing with 16 processes  
- **Feature extraction optimization**: Batched audio processing
- **Memory optimization**: Gradient checkpointing and mixed precision training
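The vocabulary-caching optimization can be sketched as follows (the function name and cache file are illustrative, not taken from the actual training script):

```python
import json
import os

def extract_vocab(texts, cache_path="vocab_cache.json"):
    """Return the sorted character set of the corpus, caching it on disk
    so repeated runs skip the full pass over the data."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    chars = sorted(set("".join(texts)) - {" "})  # space maps to "|" elsewhere
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(chars, f, ensure_ascii=False)
    return chars
```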

### Reproducibility
- **Seed**: 42
- **Framework versions**: PyTorch 2.x, Transformers 4.x, Accelerate
- **Training logs**: Available in model repository

## Citation

```bibtex
@misc{kikuyu-asr-2024,
  title={MMS 1B Kikuyu ASR Model},
  author={Kikuyu ASR Team},
  year={2024},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}
```

## Acknowledgments

- **Base model**: Facebook's MMS team for the pre-trained multilingual model
- **Dataset**: thinkKenya for the Kenyan Audio Datasets
- **Infrastructure**: Microsoft Azure for compute resources
- **Framework**: Hugging Face Transformers and Accelerate libraries

## License

This model is released under the Apache 2.0 license, following the base MMS model licensing.

## Model Card Contact

For questions or issues with this model, please open an issue in the repository or contact the model authors.