---
language:
- en
- nso
tags:
- translation
- african-languages
- scientific-translation
- afriscience-mt
- m2m100
license: apache-2.0
base_model: facebook/m2m100_1.2B
datasets:
- afriscience-mt
pipeline_tag: translation
model-index:
- name: m2m100_1.2b-eng-nso
  results:
  - task:
      type: translation
    metrics:
    - name: BLEU (test)
      type: bleu
      value: 40.36
    - name: chrF (test)
      type: chrf
      value: 62.37
    - name: SSA-COMET (test)
      type: comet
      value: 67.12
---
# m2m100_1.2b-eng-nso
This model is part of the AfriScience-MT project, focused on machine translation of scientific texts for African languages.
## Model Description
| Property | Value |
|---|---|
| Model Type | Seq2Seq Translation |
| Translation Direction | English → Northern Sotho |
| Base Model | facebook/m2m100_1.2B |
| Domain | Scientific/Academic texts |
| Training | Full fine-tuning on AfriScience-MT dataset |
## Evaluation Results

Performance on the AfriScience-MT validation and test sets:
| Split | BLEU | chrF | SSA-COMET |
|---|---|---|---|
| Validation | 43.69 | 64.81 | 68.35 |
| Test | 40.36 | 62.37 | 67.12 |
**Metrics explanation:**

- **BLEU**: Measures n-gram overlap with reference translations (0-100, higher is better)
- **chrF**: Character-level F-score, robust for morphologically rich languages (0-100, higher is better)
- **SSA-COMET** (`McGill-NLP/ssa-comet-stl`): Neural metric trained for Sub-Saharan African languages, shown as a percentage (0-100, higher is better)
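To illustrate why chrF suits morphologically rich languages, here is a minimal single-order character n-gram F-score sketch. Note this is a toy simplification for intuition only: the actual chrF metric (e.g. sacreBLEU's implementation) averages over character n-gram orders up to n=6 with β=2.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of the string, with whitespace runs collapsed."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_sketch(hypothesis, reference, n=2, beta=2.0):
    """Toy single-order chrF: F-score over character n-gram overlap."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    if not hyp or not ref:
        return 0.0
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    # Recall-weighted F-score (beta=2 weights recall higher, as in chrF)
    return 100 * (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(chrf_sketch("go fetolela polelo", "go fetolela polelo"))  # → 100.0
```

Because matching happens at the character level, partial credit is given for shared stems and affixes even when whole-word tokens differ, which matters for agglutinative morphology.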
## Usage

### Quick Start
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "AfriScience-MT/m2m100_1.2b-eng-nso"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set source language (English)
tokenizer.src_lang = "en"

# Tokenize the input
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate, forcing the target language (Northern Sotho) as the first token
forced_bos_token_id = tokenizer.get_lang_id("ns")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=256,
    num_beams=5,
)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```
### Batch Translation
```python
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development.",
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")
```
## Training Details

### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 2 |
| Learning Rate | 2e-05 |
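For reference, these hyperparameters map onto `transformers` `Seq2SeqTrainingArguments` roughly as follows. This is a sketch only, not the project's actual training configuration; the exact arguments used by the training script may differ.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of a configuration matching the hyperparameter table above
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    predict_with_generate=True,  # use generate() for eval, needed for BLEU/chrF
)
```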
### Training Data
- Dataset: AfriScience-MT
- Domain: Scientific abstracts and papers
- Languages: English and 6 African languages (Amharic, Hausa, Luganda, Northern Sotho, Yoruba, isiZulu)
## Reproducibility

To reproduce this model:
```bash
# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang eng \
    --target_lang nso \
    --model_name facebook/m2m100_1.2B \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5
```
## Limitations

- **Domain Specificity**: This model is optimized for scientific/academic text and may perform poorly on colloquial or informal language.
- **Language Coverage**: Only the English → Northern Sotho direction is supported.
- **Input Length**: Maximum input length is 256 tokens; longer texts should be split into segments before translation.
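Since inputs are truncated at 256 tokens, long documents need to be segmented before translation. A minimal sketch of greedy sentence packing follows; the character budget is only a rough proxy for the token limit, and the regex sentence splitter is naive and purely illustrative.

```python
import re

def split_into_segments(text, max_chars=800):
    """Split text on sentence-final punctuation, then greedily pack
    consecutive sentences into segments under a character budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)  # budget exceeded: close this segment
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments

paragraph = "First finding. Second finding! Third question? Fourth point."
print(split_into_segments(paragraph, max_chars=30))
# → ['First finding. Second finding!', 'Third question? Fourth point.']
```

Each segment can then be translated independently with the Quick Start code, and the outputs concatenated. A tighter approach would count actual tokens with the model's tokenizer rather than characters.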
## Citation
If you use this model, please cite the AfriScience-MT project:
```bibtex
@inproceedings{afriscience-mt-2025,
  title={AfriScience-MT: Machine Translation for African Scientific Literature},
  author={AfriScience-MT Team},
  year={2025},
  url={https://github.com/afriscience-mt/afriscience-mt}
}
```
## License
This model is released under the Apache 2.0 License.
## Acknowledgments

- Built on top of [facebook/m2m100_1.2B](https://huggingface.co/facebook/m2m100_1.2B)
- Evaluation using SSA-COMET for African language assessment