Swahili-English Translation Model (General Domain Expansion v2)

This model is a fine-tuned version of openchs/sw-en-opus-mt-mul-en-v1 designed to excel at both general Swahili-English translation and specialized helpline/crisis support conversations. It uses a domain-aware training approach with explicit domain tags to maintain high performance across different contexts.

Model Details

Basic Information

  • Model Type: MarianMT Neural Machine Translation
  • Base Model: openchs/sw-en-opus-mt-mul-en-v1 (Helsinki-NLP/opus-mt architecture)
  • Language Pair: Swahili (sw) β†’ English (en)
  • Version: 2.0 (General Domain Expansion)
  • Training Approach: Domain-aware fine-tuning with knowledge distillation
  • Model Size: ~77.1M parameters (F32)

Key Features

  • Domain-Aware Architecture: Uses <HELPLINE> and <GENERAL> tags for context-specific translation
  • Dual-Domain Optimization: Maintains specialized helpline performance while expanding general capabilities
  • Knowledge Distillation: Learned from a teacher model specialized in helpline translations
  • Production-Ready: Meets the production thresholds of at least 96% helpline retention and at least 120% general improvement

Training Data Composition

Dataset                    Samples   Weight   Purpose
CCAligned General Corpus   ~200k     1.0x     General translation capability
Helpline Conversations     ~40k      5.0x     Crisis support and child protection
Total Training Samples     ~240k     -        After filtering and oversampling

Data Sources:

  • CCAligned corpus (general domain)
  • OpenCHs helpline conversation dataset (helpline domain)

Data Processing:

  • Token-based filtering (3-512 tokens, maximum 3.5:1 length ratio)
  • Deduplication applied
  • Train/Validation split: 98%/2%
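
A minimal sketch of the length filter described above, assuming pairs are already tokenized (function and variable names are illustrative, not from the training code):

def keep_pair(src_ids, tgt_ids, min_len=3, max_len=512, max_ratio=3.5):
    """Return True if a source/target pair passes the token-length filters."""
    ns, nt = len(src_ids), len(tgt_ids)
    if not (min_len <= ns <= max_len and min_len <= nt <= max_len):
        return False
    # Reject pairs whose token-length ratio exceeds 3.5:1 in either direction
    return max(ns, nt) / min(ns, nt) <= max_ratio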

Training Procedure

Training Architecture

Base Configuration:

Base Model: openchs/sw-en-opus-mt-mul-en-v1
Teacher Model: openchs/sw-en-opus-mt-mul-en-v1 (frozen, CPU-offloaded)
Training Method: Supervised fine-tuning with knowledge distillation
Optimization: AdamW with cosine learning rate schedule

Hyperparameters

# Optimization
Learning Rate: 1.5e-5
Warmup Steps: 1000
LR Scheduler: Cosine with warmup
Weight Decay: 0.01
Max Gradient Norm: 1.0

# Batch Configuration
Per-Device Batch Size: 8
Gradient Accumulation Steps: 16
Effective Batch Size: 128
Number of Epochs: 6

# Memory Optimization
Mixed Precision: BF16
Gradient Checkpointing: Enabled
Teacher Model Location: CPU (offloaded)

# Generation Settings
Max Length: 512 tokens
Beam Search: 4 beams
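
As a hedged sketch, the settings above map onto Hugging Face Seq2SeqTrainingArguments roughly as follows (output_dir is hypothetical; the actual training script is not published here):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="sw-en-general-expanded",  # hypothetical output path
    learning_rate=1.5e-5,
    warmup_steps=1000,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=1.0,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,       # effective batch size: 8 x 16 = 128
    num_train_epochs=6,
    bf16=True,                            # mixed precision
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=512,
    generation_num_beams=4,
)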

Knowledge Distillation Strategy

The model uses CPU-offloaded knowledge distillation to learn from a specialized helpline model:

Total Loss = (1 - Ξ±) Γ— Standard Loss + Ξ± Γ— Distillation Loss

Parameters:

  • Distillation Alpha (Ξ±): 0.3-0.5
  • Temperature (T): 2.0
  • Method: KL divergence with soft targets
  • Teacher Location: CPU (moved to GPU only during forward pass)

Memory Savings:

  • Approximately 3.5GB GPU memory saved through CPU offloading
  • 30-40% memory reduction with gradient checkpointing
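
A minimal PyTorch sketch of the combined loss above (reduction details and names are assumptions; the card does not include the training code):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.3, temperature=2.0):
    # Standard cross-entropy against the reference translation
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # ignore padding positions
    )
    # KL divergence between temperature-softened student and teacher distributions
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # conventional T^2 scaling for soft targets
    return (1 - alpha) * ce + alpha * kd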

Domain-Aware Training

Each training sample is tagged with its domain:

# Helpline domain
Input:  "<HELPLINE> Ninahitaji msaada wa haraka"
Output: "I need urgent help"

# General domain
Input:  "<GENERAL> Habari za asubuhi"
Output: "Good morning"

Domain Tag Benefits:

  • Explicit context signaling
  • Prevents catastrophic forgetting
  • Enables domain-specific optimization
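
For the tags to act as single units rather than being split into subwords, they would typically be registered as special tokens; a hedged sketch (the exact procedure used in training may differ):

from transformers import MarianMTModel, MarianTokenizer

base = "openchs/sw-en-opus-mt-mul-en-v1"
tokenizer = MarianTokenizer.from_pretrained(base)
model = MarianMTModel.from_pretrained(base)

# Register the domain tags so the tokenizer treats them as atomic tokens
tokenizer.add_special_tokens({"additional_special_tokens": ["<HELPLINE>", "<GENERAL>"]})
model.resize_token_embeddings(len(tokenizer))  # grow embeddings for the new tags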

Evaluation Strategy

Dual-Domain Evaluation (every 2000 steps):

Test Set          Samples   Metrics
Helpline Domain   500       BLEU, chrF, Keyword Preservation
General Domain    2000      BLEU, chrF

Evaluation Metrics:

  • BLEU Score: Primary translation quality metric
  • chrF Score: Character-level evaluation
  • Keyword Preservation: Critical term accuracy (helpline only)
  • Domain Retention Rate: Helpline performance vs. baseline
  • Domain Improvement Rate: General performance vs. baseline

Performance

Baseline vs. Final Results

Domain     Baseline BLEU   Final BLEU   Change
Helpline   X.XXXX          X.XXXX       +X.X% (XX.X% retention)
General    X.XXXX          X.XXXX       +XX.X% (XXX.X% improvement)

(Placeholder values; replace with actual metrics from the training run.)

Production Readiness Criteria

Production Status: READY

  • Helpline Retention: Greater than or equal to 96% of baseline
  • General Improvement: Greater than or equal to 120% of baseline
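
Expressed as a simple gate over the two rates (an illustrative check, not code from the release pipeline):

def production_ready(helpline_retention_pct, general_improvement_pct):
    return helpline_retention_pct >= 96.0 and general_improvement_pct >= 120.0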

Sample Translations

General Domain:

SW: Habari za asubuhi, ninatumaini uko vizuri
EN: Good morning, I hope you are well

SW: Nina furaha kukuona tena
EN: I'm happy to see you again

Helpline Domain:

SW: Ninahitaji msaada wa haraka
EN: I need urgent help

SW: Mtoto wangu yupo hatarini
EN: My child is in danger

Usage

Basic Translation

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "brendaogutu/sw-en-opus-mt-general-expanded"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# For general translations
text = "<GENERAL> Habari za asubuhi"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # "Good morning"

# For helpline/crisis translations
text = "<HELPLINE> Ninahitaji msaada wa haraka"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # "I need urgent help"

Batch Translation

# Translate multiple sentences
texts = [
    "<GENERAL> Asante sana kwa msaada",
    "<HELPLINE> Mtoto anaumia",
    "<GENERAL> Tutaonana kesho"
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for src, tgt in zip(texts, translations):
    print(f"{src} β†’ {tgt}")

Without Domain Tags

# The model will default to GENERAL behavior if no tag is provided
text = "Habari za asubuhi"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)

Training Infrastructure

Compute Requirements

  • Hardware Used: Single NVIDIA A100 40GB or V100 32GB GPU, with CPU handling the offloaded teacher model
  • Training Time: Approximately 22 hours (6 epochs on ~240k samples)
  • Peak Memory Usage: ~35GB GPU + 16GB CPU (with optimizations)
  • Storage Required: ~50GB (datasets and checkpoints)

Memory Optimization Techniques

  1. Gradient Checkpointing: Enabled (30-40% memory reduction)
  2. CPU Teacher Offloading: Teacher model on CPU during distillation
  3. Mixed Precision Training: BF16 format
  4. Efficient Data Loading: 8 workers with memory pinning
  5. Reduced Batch Size: 8 per device with 16 gradient accumulation steps

Checkpoint Strategy

  • Save Frequency: Every 2000 steps
  • Evaluation Frequency: Every 2000 steps
  • Best Model Selection: Based on validation BLEU score
  • Checkpoints Kept: Best 3 models
  • Early Stopping: Patience of 10 evaluations, threshold 0.0001

Training Callbacks

  • Early Stopping: Prevents overfitting
  • Domain-Aware Evaluation: Monitors both domains during training
  • MLflow Tracking: Experiment tracking and model versioning
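
A hedged sketch of the early-stopping configuration above (patience 10, threshold 0.0001) using the stock Transformers callback:

from transformers import EarlyStoppingCallback

early_stopping = EarlyStoppingCallback(
    early_stopping_patience=10,       # evaluations without improvement before stopping
    early_stopping_threshold=0.0001,  # minimum metric gain that counts as improvement
)
# Passed to the trainer via callbacks=[early_stopping]; selection metric is validation BLEU.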

Limitations and Considerations

Known Limitations

  • Unidirectional: Optimized for Swahili β†’ English only (not bidirectional)
  • Domain Tags Required: Best performance when using appropriate <HELPLINE> or <GENERAL> tags
  • Specialized Vocabulary: May struggle with highly technical terms outside training domains
  • Context Length: Maximum 512 tokens (typical for MarianMT)
  • Informal Language: Performance may vary on heavy slang or very informal text

Recommended Use Cases

  • General Swahili-English translation
  • Crisis hotline and helpline support
  • Child protection conversations
  • Educational content
  • News and media translation

Not Recommended For

  • English β†’ Swahili translation (use dedicated model)
  • Medical/legal documents requiring 100% accuracy
  • Real-time interpretation without human oversight
  • Highly technical scientific papers
  • Documents exceeding 512 tokens without chunking
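
For inputs longer than 512 tokens, a simple sentence-level chunking wrapper is one workaround; an illustrative sketch (a proper sentence segmenter would be preferable to the regex used here):

import re

def translate_long(text, tokenizer, model, tag="<GENERAL>", max_tokens=512):
    # Naive sentence split on terminal punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    outputs = []
    for sentence in sentences:
        inputs = tokenizer(f"{tag} {sentence}", return_tensors="pt",
                           truncation=True, max_length=max_tokens)
        ids = model.generate(**inputs, max_length=max_tokens, num_beams=4)
        outputs.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return " ".join(outputs)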

Ethical Considerations

Intended Use

This model is designed to support:

  • Helpline operators translating crisis communications
  • Child protection services handling multilingual cases
  • General translation needs in Swahili-speaking regions

Potential Risks

  • Translation Errors: May produce incorrect translations; human review recommended for critical applications
  • Bias: May reflect biases present in training data
  • Crisis Situations: Should not replace trained human operators in life-threatening emergencies
  • Privacy: Ensure compliance with data protection regulations when processing sensitive content

Responsible Use Guidelines

  1. Always have human oversight for crisis/emergency translations
  2. Do not rely solely on automated translation for legal or medical decisions
  3. Be aware of cultural context that may not be captured in direct translation
  4. Regularly evaluate performance on your specific use case
  5. Implement appropriate safeguards for sensitive content

Training Pipeline Details

Dataset Preparation Flow

Raw Data β†’ Token Filtering β†’ Deduplication β†’ Domain Tagging β†’ 
Tokenization β†’ Train/Val Split β†’ Training

Training Flow

Load Base Model β†’ Add Domain Tags β†’ Load Datasets β†’ 
Apply Filtering β†’ Baseline Evaluation β†’ Training Loop β†’ 
Domain Evaluation (every 2000 steps) β†’ Final Evaluation β†’ 
Save and Register Model

Quality Filters Applied

  • Minimum length: 3 tokens
  • Maximum length: 512 tokens
  • Maximum length ratio: 3.5:1
  • Duplicate removal
  • Encoding validation

Reproducibility

Experiment Tracking

All training runs tracked with:

  • MLflow experiment tracking
  • Versioned configuration files
  • Dataset composition statistics
  • Training metrics logging
  • Model checkpoints and metadata

Random Seeds

  • Data shuffling seed: 42
  • Train/test split seed: 42
  • Deterministic training where possible
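
In Transformers this typically reduces to a single call (a sketch; the actual scripts may seed components individually):

from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and PyTorch RNGs in one call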

Configuration

Complete training configuration available in repository:

  • configs/swahili_v1.json: Full hyperparameters
  • Training scripts with all optimization flags
  • Dataset preparation pipeline

Citation

If you use this model in your research or applications, please cite:

@misc{ogutu2025swahili-en-general-expanded,
  author = {Ogutu, Brenda},
  title = {Swahili-English General Domain Translation Model with Helpline Specialization},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/brendaogutu/sw-en-opus-mt-general-expanded}},
  note = {Fine-tuned with domain-aware training and knowledge distillation}
}

License

This model inherits the Apache 2.0 license from Helsinki-NLP/opus-mt-mul-en.

Acknowledgments

  • Base Model: Helsinki-NLP for the opus-mt architecture
  • Training Data: CCAligned corpus for general translations
  • Helpline Data: OpenCHs helpline conversation dataset
  • Framework: Hugging Face Transformers, PyTorch
  • Experiment Tracking: MLflow

Contact and Support

  • Issues: Open an issue on the model repository
  • Questions: Contact via Hugging Face discussions
  • Updates: Follow the model page for new versions

Version History

  • v2.0 (Current): General domain expansion with knowledge distillation
  • v1.0: Initial helpline-specialized model (openchs/sw-en-opus-mt-mul-en-v1)

Last Updated: December 2024

Model Card Authors: Brenda Ogutu (OpenCHs)
