---
language:
- ki
tags:
- automatic-speech-recognition
- asr
- kikuyu
- wav2vec2
- mms
- speech
- kenyan-languages
- low-resource
license: apache-2.0
datasets:
- thinkKenya/kenyan_audio_datasets
model-index:
- name: MMS 1B Kikuyu ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kenyan Audio Datasets (Kikuyu)
      type: thinkKenya/kenyan_audio_datasets
      config: Kikuyu
      split: test
      args:
        language: ki
    metrics:
    - name: Word Error Rate
      type: wer
      value: 35.74
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Kikuyu Speech Sample
  src: https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav
---

# MMS 1B Kikuyu ASR Model

## Model Description

This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B-parameter model, adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. It uses language adapters to fine-tune the pre-trained MMS model efficiently for Kikuyu, achieving a Word Error Rate (WER) of **35.74%** on the test set.

## Model Details

- **Model Type**: Wav2Vec2ForCTC with language adapters
- **Base Model**: `facebook/mms-1b-all`
- **Language**: Kikuyu (ISO 639-1: `ki`)
- **Task**: Automatic Speech Recognition (ASR)
- **Architecture**: Wav2Vec2 with CTC head and language-specific adapters
- **Parameters**: ~1B total; only the adapter layers are fine-tuned (sketched below)
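
Only the adapter layers carry gradients during fine-tuning. A minimal sketch of that setup, assuming the MMS adapter helpers (`init_adapter_layers`, `freeze_base_model`, `_get_adapters`) available in recent `transformers` releases; the 24-token vocabulary size is taken from this card:

```python
from transformers import Wav2Vec2ForCTC

# Load the multilingual base checkpoint. The CTC head is re-initialized
# for the 24-token Kikuyu vocabulary, so size mismatches are expected.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    vocab_size=24,
    ignore_mismatched_sizes=True,
)

# Re-initialize the language adapters, freeze the backbone, and mark
# only the adapter weights (plus the new CTC head) as trainable.
model.init_adapter_layers()
model.freeze_base_model()
for param in model._get_adapters().values():
    param.requires_grad = True
```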

## Training Data

The model was trained on the Kikuyu subset of the [Kenyan Audio Datasets](https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets):

- **Training samples**: 98,206 audio-text pairs
- **Test samples**: 32,736 audio-text pairs
- **Total dataset size**: 130,942 samples
- **Audio format**: 16 kHz sampling rate
- **Text preprocessing**: normalized, lowercased, special characters removed (see the sketch below)
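
The exact normalization script is not published with this card; a minimal sketch of the preprocessing described above (lowercasing, punctuation removal, whitespace cleanup, with the Kikuyu vowels ĩ and ũ left intact; the punctuation set is an assumption):

```python
import re

# Punctuation to strip; the exact character set used in training is assumed.
PUNCT = re.compile(r"[,?.!;:%()'\"“”‘’-]")

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = PUNCT.sub("", text.lower())
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Wĩ mwega?"))  # -> "wĩ mwega"
```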

## Training Configuration

### Hyperparameters

- **Batch size**: 64 effective (8 per device × 4 GPUs × 2 gradient-accumulation steps)
- **Learning rate**: 3e-4
- **Weight decay**: 0.01
- **Warmup steps**: 500
- **Total training steps**: 12,280
- **Epochs**: 8
- **Mixed precision**: fp16
- **Gradient checkpointing**: enabled

### Hardware & Environment

- **GPUs**: 4× NVIDIA GPUs
- **Framework**: PyTorch with Accelerate
- **Optimization**: AdamW optimizer with linear warmup scheduling (see the sketch below)
- **Distributed training**: multi-GPU via Accelerate
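
The training script itself is not included in this card, but the hyperparameters above map onto a standard AdamW + linear-warmup setup. A sketch, assuming `model` is the adapter-enabled checkpoint from the earlier snippet:

```python
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Only the adapter (and CTC-head) parameters require gradients.
optimizer = AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=3e-4,
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=12_280
)

# fp16 mixed precision and 2-step gradient accumulation, as listed above.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=2)
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)
```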

## Vocabulary

The model uses a character-level vocabulary designed for Kikuyu, containing **24 tokens**:

```
Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
```
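
Such a vocabulary can be serialized to `vocab.json` and wrapped in a CTC tokenizer. A sketch (the index ordering here is illustrative, not the checkpoint's actual mapping):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

chars = list("abcdefghijkmnortuwyĩũ")  # the 21 Kikuyu characters
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)       # word separator (replaces spaces)
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)   # also serves as the CTC blank
assert len(vocab) == 24

with open("vocab.json", "w") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```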

## Performance

### Metrics

- **Best WER**: 35.74% (reached at training step 1700)
- **Training time**: ~3 hours on 4 GPUs
- **Evaluation subset**: 2,048 examples per evaluation step

### Training Progress

WER improved steadily during training (values above 100% are possible early on, when insertion errors outnumber reference words):

- Step 100: WER 100.52%
- Step 500: WER 43.34%
- Step 800: WER 40.02%
- Step 1200: WER 37.48%
- Step 1500: WER 36.66%
- **Step 1700: WER 35.74% (best)**

## Usage

### Quick Start

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

# Load audio; the file must be 16 kHz mono (see the resampling note below)
audio, sr = sf.read("kikuyu_audio.wav")
assert sr == 16000, "expected 16 kHz audio"

# Process audio
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Run inference
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(f"Transcription: {transcription}")
```
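
`soundfile` does not resample, so the snippet above assumes a 16 kHz mono file. If your recording has a different sampling rate, one option is to load it with `librosa`, which resamples and downmixes on load (the filename is a placeholder):

```python
import librosa

# Returns mono float32 audio resampled to 16 kHz.
audio, sr = librosa.load("kikuyu_audio_44k.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
```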

### With Pipeline

```python
from transformers import pipeline

# Initialize the ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="nickdee96/mms-1b-kik-accelerate-2multi",
)

# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])
```
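
For long recordings, the pipeline can transcribe in overlapping chunks. `chunk_length_s` and `stride_length_s` are standard arguments of the ASR pipeline; the values below are illustrative:

```python
# Transcribe a long file in 30 s chunks with 5 s of context on each side.
result = asr("long_kikuyu_audio.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```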

## Model Architecture

The model leverages the MMS (Massively Multilingual Speech) architecture:

1. **Wav2Vec2 backbone**: pre-trained on 1,000+ languages
2. **Language adapters**: lightweight adapter layers trained specifically for Kikuyu
3. **CTC head**: Connectionist Temporal Classification for alignment-free transcription
4. **Feature extractor**: convolutional layers that encode raw audio into latent features

## Limitations and Bias

### Limitations

- **Domain specificity**: trained primarily on read speech, so it may not generalize well to conversational or spontaneous speech
- **Audio quality**: performance may degrade on low-quality or noisy audio
- **Vocabulary coverage**: limited to the characters present in the training data
- **Code-switching**: may not handle Kikuyu-English code-switching well

### Bias Considerations

- The model reflects the linguistic patterns and potential biases of the training dataset
- Performance may vary across Kikuyu dialects and speaker demographics
- The dataset composition may not represent all varieties of spoken Kikuyu

## Training Infrastructure

### Optimizations Applied

- **Vocabulary caching**: vocabulary extraction results are cached between runs
- **Multiprocessing**: parallel data preprocessing with 16 worker processes (see the sketch below)
- **Feature extraction**: batched audio processing
- **Memory**: gradient checkpointing and mixed-precision training
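
The parallel preprocessing maps onto the standard Hugging Face Datasets pattern. A sketch, assuming the dataset exposes `audio` and `text` columns (hypothetical names) and reusing the `processor` from the usage examples:

```python
from datasets import load_dataset

dataset = load_dataset("thinkKenya/kenyan_audio_datasets", "Kikuyu")

def prepare_example(batch):
    # Hypothetical columns: decode the audio and encode the target text.
    audio = batch["audio"]
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# 16 worker processes, matching the configuration above.
dataset = dataset.map(prepare_example, num_proc=16)
```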

### Reproducibility

- **Seed**: 42
- **Framework versions**: PyTorch 2.x, Transformers 4.x, Accelerate
- **Training logs**: available in the model repository
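
Seeding everything in one call, e.g. with Accelerate's helper:

```python
from accelerate.utils import set_seed

set_seed(42)  # seeds Python's random, NumPy, and PyTorch (CPU and CUDA)
```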

## Citation

```bibtex
@misc{kikuyu-asr-2024,
  title        = {MMS 1B Kikuyu ASR Model},
  author       = {Kikuyu ASR Team},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}
```

## Acknowledgments

- **Base model**: Facebook's MMS team for the pre-trained multilingual model
- **Dataset**: thinkKenya for the Kenyan Audio Datasets
- **Infrastructure**: Microsoft Azure for compute resources
- **Frameworks**: Hugging Face Transformers and Accelerate

## License

This model is released under the Apache 2.0 license, in line with the base MMS model's licensing.

## Model Card Contact

For questions or issues with this model, please open an issue in the repository or contact the model authors.