asr-whisper-largev2-v5

This model is a domain-adapted version of openchs/asr-whisper-helpline-sw-v1 fine-tuned on real phone call recordings from the Tanzania Child Helpline system, powered by OpenCHS.

Model Description

This ASR model is fine-tuned on domain-specific data to bridge the gap between clean, read speech (Common Voice) and real-world telephony audio. While the base model achieved 23.56% WER on Common Voice validation data, this model is optimized for actual call center environments and the audio quality problems of real phone calls.

Key Characteristics:

Domain: Child helpline phone call transcription (Swahili)
Best Checkpoint: Step 4,500
Validation WER: 45.74% on real phone call audio
Validation Loss: 1.133
Training Dataset: Custom Swahili ASR v6 (~46.5 hours of augmented telephony speech)

Performance Context: The higher WER compared to the base model (45.74% vs 23.56%) reflects the significant domain shift from clean Common Voice recordings to real telephony audio. This is expected and represents realistic performance on production audio with:

Telephone bandwidth limitations (8kHz audio, upsampled to 16kHz)
Background noise and cross-talk
Natural conversational speech (vs. read speech)
Authentic Tanzanian Swahili dialects and speaking styles

Training Strategy

Three-Stage Training Pipeline:

Stage 1 - Common Voice 17.0: Initial fine-tuning from Whisper Large v2 (10,000 steps)
Stage 2 - Common Voice 23.0: Continued training on updated Common Voice data (7,500 steps → 23.56% WER)
Stage 3 (This Model) - Real Phone Calls: Domain adaptation on actual helpline recordings (4,500 steps → 45.74% WER on telephony)

This model represents Stage 3 with domain-specific optimization for production deployment.

Intended Uses & Limitations

Intended Uses

Primary:

Transcribing Swahili speech in Tanzania Child Helpline call center environments
Real-time or batch processing of telephony audio (8kHz phone quality)
Production ASR system for helpline service documentation and analytics

Secondary:

General Swahili ASR for telephony/call center applications
Research baseline for domain adaptation studies (clean speech → telephony)
Transfer learning base for similar low-resource telephony ASR tasks

Key Improvements Over Base Model

✅ Domain Adaptation: Fine-tuned on ~46.5 hours of augmented real phone calls
✅ Telephony Robustness: Optimized for phone bandwidth (8kHz) and call quality variations
✅ Dialect Coverage: Trained on authentic Tanzanian Swahili dialects from real conversations
✅ Production Ready: Validated on actual helpline audio (not just clean datasets)

Limitations

⚠️ Domain-Specific Vocabulary:

Optimized for child helpline and healthcare-related conversations
May underperform on technical, legal, or specialized domains outside training data scope

⚠️ Dialect Specificity:

Best performance on Tanzanian Swahili dialects represented in training data
May have reduced accuracy on coastal, northern, or other regional variants not well-represented

⚠️ Audio Quality Requirements:

Designed for telephony audio (8kHz-16kHz); may need retuning for high-fidelity audio
Performance degrades with severe background noise or very poor connections (though trained on augmented noisy data)

⚠️ Code-Switching:

Limited handling of Swahili-English code-switching common in urban Tanzania
May struggle with mixed-language utterances

⚠️ Model Size:

Large model (Whisper Large v2 architecture) requires GPU for real-time transcription
Consider quantization or distillation for edge deployment

Training and Evaluation Data

The model was trained on the Swahili ASR Dataset v6, a private dataset curated specifically for this task.

Data Privacy & Access

Status: 🔒 Private / Internal Use Only

The dataset is not publicly available due to strict privacy and Personally Identifiable Information (PII) concerns. The source audio consists of real calls to the Tanzania Child Helpline. While the model weights are shared, the training data remains confidential to protect the identities of callers, many of whom are minors.

Dataset Volume (Hours & Samples)

The dataset utilizes a 5x augmentation strategy to maximize the utility of the available domain-specific audio.

Split	Unique Samples	Original Duration	Augmented Duration	Notes
Training	31,720	~9.3 hours	~46.5 hours	1 original + 4 augmented versions per file
Validation	1,813	~2.7 hours	~2.7 hours	Original audio only (No augmentation)
Test	907	~1.3 hours	~1.3 hours	Original audio only (No augmentation)
TOTAL	34,440	~13.3 hours	~50.5 hours
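The augmented durations in the table follow from applying the 5x factor to the training split only. The short check below reproduces the totals from the figures in the table; the variable names are illustrative and not taken from the training code.

```python
# Original durations in hours, copied from the table above.
original_hours = {"train": 9.3, "validation": 2.7, "test": 1.3}

# 1 original + 4 augmented versions per training file; other splits are untouched.
AUGMENTATION_FACTOR = 5

augmented_hours = {
    split: hours * (AUGMENTATION_FACTOR if split == "train" else 1)
    for split, hours in original_hours.items()
}

# Round to one decimal to match the precision used in the table.
train_augmented = round(augmented_hours["train"], 1)
total_original = round(sum(original_hours.values()), 1)
total_augmented = round(sum(augmented_hours.values()), 1)

print(train_augmented, total_original, total_augmented)
```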

Data Characteristics

Source: Real-world phone call audio (not studio recordings)
Language: Tanzanian Swahili with natural conversational characteristics
Format: Telephony quality (primarily 8kHz, upsampled to 16kHz for Whisper)
Content: Domain-relevant vocabulary (child welfare, healthcare, family support)
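Upsampling 8kHz telephony audio to 16kHz doubles the number of samples but does not restore any content above 4kHz. The card does not say which resampler was used, so the sketch below only illustrates the idea with plain linear interpolation (the function name is hypothetical). Production code would normally use a band-limited resampler from an audio library instead.

```python
def upsample_2x_linear(samples):
    """Double the sample rate by inserting the midpoint between neighbouring samples."""
    if not samples:
        return []
    out = []
    for current, nxt in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + nxt) / 2)
    # Repeat the last sample so the output is exactly twice as long as the input.
    out.extend([samples[-1], samples[-1]])
    return out


# One second of 8kHz audio becomes one second of 16kHz audio.
audio_8k = [0.0] * 8000
audio_16k = upsample_2x_linear(audio_8k)
print(len(audio_8k), len(audio_16k))
```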

Audio Augmentation Strategy

To make the model robust against the noisy environment of a call center, the training set (~9.3 hours) was expanded to ~46.5 hours using a one-technique-per-augmentation strategy.

Each augmented copy of a training sample was produced with exactly one of the following techniques, chosen with the weighted probability shown:

Volume Variation (±6 dB): Simulating distant or loud speakers (22.2%)
VTLP (Vocal Tract Length Perturbation): Simulating different speaker characteristics (15.6%)
Colored Noise: Simulating background static/environment (White/Pink/Brown noise at 30% coverage) (15.5%)
Time Stretch: Variation in speaking speed (0.9x - 1.1x) (12.6%)
Pitch Shift: Variation in tone (±2 semitones) (12.4%)
Packet Loss: Simulating VoIP connection drops (15% crop) (12.3%)
Codec Masking: Simulating compression artifacts (9.4%)
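The weighted selection can be sketched as follows. This is a minimal illustration rather than the project's augmentation code. It assumes the technique for each of the four augmented copies is drawn independently (the card does not say whether repeats are allowed), and the function names are hypothetical. The helper `db_to_gain` shows how a ±6 dB volume change maps to a linear amplitude factor.

```python
import random

# Technique weights in percent, as listed above; they sum to 100.
AUGMENTATION_WEIGHTS = {
    "volume_variation": 22.2,
    "vtlp": 15.6,
    "colored_noise": 15.5,
    "time_stretch": 12.6,
    "pitch_shift": 12.4,
    "packet_loss": 12.3,
    "codec_masking": 9.4,
}


def pick_techniques(n_copies=4, rng=None):
    """Draw one technique per augmented copy, independently, by weight (assumption)."""
    rng = rng or random.Random()
    names = list(AUGMENTATION_WEIGHTS)
    weights = list(AUGMENTATION_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n_copies)


def db_to_gain(db):
    """Convert a decibel change into a linear amplitude factor."""
    return 10 ** (db / 20)


print(pick_techniques(rng=random.Random(42)))
print(db_to_gain(6.0), db_to_gain(-6.0))
```

A +6 dB change roughly doubles the amplitude and a -6 dB change roughly halves it.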

Note: Validation and Test splits contain only original audio to ensure unbiased evaluation metrics.

Training Procedure

Training Hyperparameters

Optimization:

learning_rate: 1e-05
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_steps: 500
optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
max_training_steps: 12,000 (stopped at 6,000, best at 4,500)
seed: 42
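To give a feel for the learning rate during the run, the sketch below evaluates a linear warmup followed by cosine decay with hard restarts. It follows the usual form of this schedule and is written in plain Python instead of calling the library. The number of cycles is not stated in this card, so `num_cycles=1` is an assumption; with a single cycle no restart occurs inside the 12,000-step schedule.

```python
import math


def lr_at_step(step, base_lr=1e-5, warmup_steps=500, total_steps=12_000, num_cycles=1):
    """Linear warmup, then cosine decay with hard restarts (num_cycles is assumed)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return base_lr * max(0.0, cosine)


# Learning rate at the start, mid-warmup, peak, best checkpoint, and stopping point.
for step in (0, 250, 500, 4_500, 6_000):
    print(step, lr_at_step(step))
```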

Batch Configuration:

per_device_train_batch_size: 16
per_device_eval_batch_size: 16
gradient_accumulation_steps: 1
Effective batch size: 16

Memory Optimization:

gradient_checkpointing: true (with use_reentrant=False)
mixed_precision_training: Native AMP (FP16)
dataloader_num_workers: 2

Evaluation & Checkpointing:

evaluation_strategy: steps
eval_steps: 500
save_steps: 500
logging_steps: 50
save_total_limit: 3

Best Model Selection:

load_best_model_at_end: true
metric_for_best_model: "wer"
greater_is_better: false
early_stopping_patience: 3 evaluations (1,500 steps)

Infrastructure:

GPU: RunPod A40 (48GB VRAM)
Training time: ~6.5 hours for 6,000 steps
Checkpoint size: ~3GB per checkpoint

Training Results

Training Loss	Epoch	Step	Validation Loss	WER	Notes
0.9509	0.0417	500	0.8714	49.7126	Initial adaptation
0.6505	0.0833	1000	0.8277	52.6501
0.4923	0.125	1500	0.8766	50.7503
0.3597	1.0014	2000	0.9145	48.4994
0.2188	1.0431	2500	0.9662	48.4036
0.1351	1.0848	3000	1.0237	46.5358
0.1057	1.1264	3500	1.0614	47.3819
0.0839	2.0028	4000	1.1110	46.6156
0.0541	2.0445	4500	1.1333	45.7375	✅ Best checkpoint
0.0411	2.0862	5000	1.1670	47.1264	Performance degradation
0.0321	2.1278	5500	1.1806	46.5358
0.0243	3.0042	6000	1.2159	46.8870	Overfitting signs

Training Observations:

Convergence: Best WER achieved at step 4,500 (45.74%)
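Early signs of overfitting: Validation loss rose steadily from its minimum at step 1,000 (0.8277) while training loss continued decreasing; WER kept improving until step 4,500 and stayed above that level afterwards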
Model selection: Weights restored to step 4,500 checkpoint for optimal generalization
Training curve: WER fell from 49.71% at step 500 to 45.74% at step 4,500, though not monotonically (it peaked at 52.65% at step 1,000)
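The stopping point and the selected checkpoint can be reproduced by replaying the WER column of the results table through a patience-3 rule. The function below is a simplified stand-in for the trainer's early-stopping logic, written for illustration only; its name is hypothetical.

```python
# (step, validation WER %) pairs copied from the results table.
WER_BY_STEP = [
    (500, 49.7126), (1000, 52.6501), (1500, 50.7503), (2000, 48.4994),
    (2500, 48.4036), (3000, 46.5358), (3500, 47.3819), (4000, 46.6156),
    (4500, 45.7375), (5000, 47.1264), (5500, 46.5358), (6000, 46.8870),
]


def replay_early_stopping(history, patience=3):
    """Return (best_step, best_wer, stop_step) for a lower-is-better metric."""
    best_step, best_wer = None, float("inf")
    evals_without_improvement = 0
    stop_step = None
    for step, wer in history:
        if wer < best_wer:
            best_step, best_wer = step, wer
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            stop_step = step
            break
    return best_step, best_wer, stop_step


best_step, best_wer, stop_step = replay_early_stopping(WER_BY_STEP)
print(best_step, best_wer, stop_step)
```

With the table's values, the rule selects step 4,500 as best and stops at step 6,000, after the evaluations at 5,000, 5,500, and 6,000 fail to improve on it.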

Final Metrics (Step 4,500):

Training loss: 0.0541
Validation loss: 1.1333
Validation WER: 45.74%
Total training time: ~4.7 hours
Total training samples processed: ~72,000 (4,500 steps × batch size 16)

Domain Adaptation Summary

Stage	Dataset	WER	Domain Gap
Stage 1 (Base)	Common Voice 17.0	23.62%	Clean read speech
Stage 2 (Base)	Common Voice 23.0	23.56%	Clean read speech
Stage 3 (This Model)	Real Phone Calls v6	45.74%	Telephony, conversational

Domain Gap Analysis: The ~22 percentage point WER increase from Common Voice (23.56%) to real phone calls (45.74%) quantifies the domain adaptation challenge:

📞 Telephony bandwidth vs. full-bandwidth audio
🎤 Conversational vs. read speech
🔊 Real noise conditions vs. clean recordings
🗣️ Natural disfluencies vs. prepared text

This gap is expected and normal for production ASR systems deployed on telephony audio.

Performance Comparison

Model	Test Domain	WER	Notes
Whisper Large v2 (zero-shot)	Common Voice 17.0	89.05%	Baseline
Base model (v1) - Stage 1	Common Voice 17.0	23.62%	Clean speech tuning
Base model (v1) - Stage 2	Common Voice 23.0	23.56%	Clean speech tuning
This model (v5)	Real phone calls	45.74%	Telephony adaptation

Key Insight: Although its WER is higher, this model is optimized for the actual production domain (telephony). The base model's WER on telephony audio is not reported here, but it is expected to be significantly worse there despite its lower WER on clean data.

Usage

Quick Start

from transformers import pipeline

# Load the model
pipe = pipeline("automatic-speech-recognition",
                model="openchs/asr-whisper-largev2-v5")

# Transcribe phone call audio
result = pipe("path/to/phone_call.wav")
print(result["text"])

Advanced Usage with Audio Preprocessing

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-largev2-v5")
model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-largev2-v5")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and preprocess audio (handles telephony audio)
audio, sr = librosa.load("path/to/phone_call.wav", sr=16000, mono=True)

# Process audio
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device)

# Generate transcription with a Swahili language hint
predicted_ids = model.generate(
    input_features,
    language="sw",
    task="transcribe",
    max_length=448
)

# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Production Deployment Recommendations

Audio Requirements:

Sample rate: 16kHz (model will work with 8kHz telephony audio upsampled to 16kHz)
Format: Mono (single channel)
Duration: Optimal <30 seconds per segment for memory efficiency
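One simple way to respect the 30-second guideline is to cut long recordings into fixed-length segments before transcription. The sketch below shows the arithmetic on a plain list of samples; the function name is hypothetical. Fixed cuts can split words, so a real deployment would more likely cut at silences or rely on the long-form chunking options of the inference library.

```python
SAMPLE_RATE = 16_000      # Hz, the rate the model expects
MAX_SEGMENT_SECONDS = 30  # upper bound suggested above


def split_into_segments(samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SEGMENT_SECONDS):
    """Split a mono sample sequence into consecutive chunks of at most max_seconds."""
    chunk = int(sample_rate * max_seconds)
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]


# Example: 70 seconds of silence yields two full segments and one shorter one.
audio = [0.0] * (70 * SAMPLE_RATE)
segments = split_into_segments(audio)
print([len(segment) / SAMPLE_RATE for segment in segments])
```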

Inference Optimization:

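import torch
from transformers import pipeline

# Use half-precision (FP16) on the first GPU for faster inference
pipe = pipeline("automatic-speech-recognition",
                model="openchs/asr-whisper-largev2-v5",
                torch_dtype=torch.float16, device=0)

# Enable batch processing for multiple files (placeholder paths)
audio_files = ["path/to/call_1.wav", "path/to/call_2.wav"]
batch_size = 8
results = pipe(audio_files, batch_size=batch_size)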

Real-time Considerations:

GPU required for real-time transcription (RTF < 1.0)
CPU inference possible but slower (RTF ~3-5 on modern CPUs)
Consider model quantization for edge deployment
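The real-time factor (RTF) used above is the processing time divided by the audio duration, so values below 1.0 mean transcription is faster than real time. The helper below is a small sketch for measuring it around any transcription callable. The function names are hypothetical, and the lambda in the example is a placeholder where a real deployment would pass its own transcription function.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe, audio, audio_seconds):
    """Time one call to `transcribe` (any callable) and return its RTF."""
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# 15 seconds of compute for a 30-second call gives an RTF of 0.5.
print(real_time_factor(15.0, 30.0))
# Placeholder callable standing in for a real transcription function.
print(measure_rtf(lambda audio: None, [], 30.0))
```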

Evaluation Methodology

Validation Set:

500 samples randomly selected from 1,813-sample validation split
Evaluated every 500 training steps
Represents diverse call scenarios and speakers

WER Calculation:

Standard Word Error Rate: (Substitutions + Deletions + Insertions) / Total Words
Normalized text (lowercase, punctuation handling)
Swahili-specific text normalization applied
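The WER formula can be sketched as a word-level edit distance divided by the number of reference words. The normalization here only lowercases and strips ASCII punctuation. It is a generic assumption and does not reproduce the Swahili-specific rules used for the reported numbers (for example, it would also strip the apostrophe in words such as "ng'ombe"). Function names are illustrative.

```python
import string


def normalize(text):
    """Lowercase and strip ASCII punctuation (generic; not the card's Swahili rules)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("Habari za asubuhi.", "habari ya asubuhi"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.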

Best Model Selection:

Automatic selection based on lowest validation WER
Early stopping after 3 evaluations without improvement
Final model: Step 4,500 checkpoint

Future Work

Test set evaluation: Comprehensive evaluation on held-out 907-sample test set
Code-switching support: Improve Swahili-English mixed utterance handling
Model compression: Quantization and distillation for faster inference
Streaming ASR: Adapt for real-time streaming transcription
Dialect expansion: Include more regional Swahili variants
Noise robustness: Further augmentation with extreme noise conditions
Benchmark comparison: Evaluate against other Swahili ASR systems

Citation

If you use this model in your research or production systems, please cite:

@misc{openchs-swahili-asr-v5,
  title={Domain-Adapted Swahili ASR for Tanzania Child Helpline Telephony},
  author={OpenCHS Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/openchs/asr-whisper-largev2-v5}},
  note={Fine-tuned from openchs/asr-whisper-helpline-sw-v1 on real phone call data}
}

Framework Versions

Transformers: 4.56.2
PyTorch: 2.8.0+cu128
Datasets: 2.21.0
Tokenizers: 0.22.1

License

Apache 2.0

Acknowledgments

Base model: openchs/asr-whisper-helpline-sw-v1
Foundation model: OpenAI Whisper Large v2
Training infrastructure: RunPod (A40 GPU)
Project: OpenCHS - Open Source Child Helpline System
Data collection: Tanzania Child Helpline operations team

Model Status: ✅ Production Ready - Optimized for Tanzania Child Helpline telephony transcription

Last Updated: 2025-11-17 (Checkpoint 4,500 restored as best performing model)
