---
language:
- en
- nso
tags:
- translation
- african-languages
- scientific-translation
- afriscience-mt
- m2m100
license: apache-2.0
base_model: facebook/m2m100_1.2B
datasets:
- afriscience-mt
pipeline_tag: translation
model-index:
- name: m2m100_1.2b-eng-nso
  results:
  - task:
      type: translation
    metrics:
    - name: BLEU (test)
      type: bleu
      value: 40.36
    - name: chrF (test)
      type: chrf
      value: 62.37
    - name: SSA-COMET (test)
      type: comet
      value: 67.12
---
# m2m100_1.2b-eng-nso
This model is part of the AfriScience-MT project, focused on machine translation of scientific texts for African languages.
## Model Description
| Property | Value |
|---|---|
| Model Type | Seq2Seq Translation |
| Translation Direction | English → Northern Sotho |
| Base Model | facebook/m2m100_1.2B |
| Domain | Scientific/Academic texts |
| Training | Full fine-tuning on AfriScience-MT dataset |
## Evaluation Results

Performance on the AfriScience-MT validation and test sets:
| Split | BLEU | chrF | SSA-COMET |
|---|---|---|---|
| Validation | 43.69 | 64.81 | 68.35 |
| Test | 40.36 | 62.37 | 67.12 |
**Metrics explanation:**

- **BLEU**: Measures n-gram overlap with reference translations (0-100, higher is better)
- **chrF**: Character-level F-score, robust for morphologically rich languages (0-100, higher is better)
- **SSA-COMET** (`McGill-NLP/ssa-comet-stl`): Neural metric trained for Sub-Saharan African languages, shown as a percentage (0-100, higher is better)
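To illustrate why chrF suits morphologically rich languages, here is a minimal single-order character n-gram F-score sketch. Note this is a toy simplification for intuition only: the actual chrF metric (e.g. sacreBLEU's implementation) averages over character n-gram orders up to n=6 with β=2.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of the string, with whitespace runs collapsed."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_sketch(hypothesis, reference, n=2, beta=2.0):
    """Toy single-order chrF: F-score over character n-gram overlap."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    if not hyp or not ref:
        return 0.0
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    # Recall-weighted F-score (beta=2 weights recall higher, as in chrF)
    return 100 * (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(chrf_sketch("go fetolela polelo", "go fetolela polelo"))  # → 100.0
```

Because matching happens at the character level, partial credit is given for shared stems and affixes even when whole-word tokens differ, which matters for agglutinative morphology.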
## Usage

### Quick Start
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "AfriScience-MT/m2m100_1.2b-eng-nso"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set source language (English)
tokenizer.src_lang = "en"

# Tokenize the input
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate, forcing the target language (Northern Sotho) as the first token
forced_bos_token_id = tokenizer.get_lang_id("ns")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=256,
    num_beams=5,
)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```
### Batch Translation
```python
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development.",
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")
```
## Training Details

### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 2 |
| Learning Rate | 2e-05 |
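For reference, these hyperparameters map onto `transformers` `Seq2SeqTrainingArguments` roughly as follows. This is a sketch only, not the project's actual training configuration; the exact arguments used by the training script may differ.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a configuration matching the hyperparameter table above
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    predict_with_generate=True,  # use generate() for eval, needed for BLEU/chrF
)
```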
### Training Data
- Dataset: AfriScience-MT
- Domain: Scientific abstracts and papers
- Languages: English and 6 African languages (Amharic, Hausa, Luganda, Northern Sotho, Yoruba, isiZulu)
## Reproducibility

To reproduce this model:
```bash
# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang eng \
    --target_lang nso \
    --model_name facebook/m2m100_1.2B \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5
```
## Limitations

- **Domain Specificity**: This model is optimized for scientific/academic text and may perform poorly on colloquial or informal language.
- **Language Coverage**: Only the English → Northern Sotho direction is supported.
- **Input Length**: Maximum input length is 256 tokens; longer texts should be split into segments before translation.
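Since inputs are truncated at 256 tokens, long documents need to be segmented before translation. A minimal sketch of greedy sentence packing follows; the character budget is only a rough proxy for the token limit, and the regex sentence splitter is naive and purely illustrative.

```python
import re

def split_into_segments(text, max_chars=800):
    """Split text on sentence-final punctuation, then greedily pack
    consecutive sentences into segments under a character budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)  # budget exceeded: close this segment
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments

paragraph = "First finding. Second finding! Third question? Fourth point."
print(split_into_segments(paragraph, max_chars=30))
# → ['First finding. Second finding!', 'Third question? Fourth point.']
```

Each segment can then be translated independently with the Quick Start code, and the outputs concatenated. A tighter approach would count actual tokens with the model's tokenizer rather than characters.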
## Citation
If you use this model, please cite the AfriScience-MT project:
```bibtex
@inproceedings{afriscience-mt-2025,
  title={AfriScience-MT: Machine Translation for African Scientific Literature},
  author={AfriScience-MT Team},
  year={2025},
  url={https://github.com/afriscience-mt/afriscience-mt}
}
```
## License
This model is released under the Apache 2.0 License.
## Acknowledgments

- Built on top of [facebook/m2m100_1.2B](https://huggingface.co/facebook/m2m100_1.2B)
- Evaluation using SSA-COMET for African language assessment