AfriE5-Large-instruct

AfriE5-Large-instruct is a text embedding model adapted from multilingual-e5-large-instruct to better support African languages. It was trained with cross-lingual contrastive learning and knowledge distillation, targeting 9 African languages while generalizing well to the 59 languages covered by the AfriMTEB benchmark.

Model Details

  • Model Name: AfriE5-Large-instruct
  • Base Model: intfloat/multilingual-e5-large-instruct
  • Architecture: XLM-RoBERTa-large based (24 layers, 1024 hidden size)
  • Training Method: Cross-lingual contrastive learning + Knowledge Distillation (Teacher: BGE Reranker v2 m3)
  • Training Data: NLI datasets (MNLI, SNLI) translated into 9 African languages using NLLB-200-3.3B, filtered by SSA-COMET.
  • Supported Languages:
    • Targeted (Training): Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu.
    • Evaluated (AfriMTEB): 59 languages, including the 9 targeted languages plus others such as Afrikaans, Somali, and Twi.

Usage

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('McGill-NLP/AfriE5-Large-instruct')

# Define queries and documents
# IMPORTANT: Queries require a specific instruction prefix. 
# Documents do not strictly need a prefix, but usage should mirror mE5 conventions.
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "

queries = [
    "What are the key features of AfriMTEB?",
    "Hali ya hewa ikoje leo?" # Swahili: How is the weather today?
]

documents = [
    "AfriMTEB is a benchmark for evaluating text embeddings in African languages.",
    "Leo kuna jua kali sana." # Swahili: Today it is very sunny.
]

# Add prefix to queries
formatted_queries = [query_instruction + q for q in queries]

# Encode
query_embeddings = model.encode(formatted_queries, normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Compute similarity
scores = (query_embeddings @ doc_embeddings.T) * 100
print(scores)
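Because `normalize_embeddings=True` L2-normalizes the vectors, the matrix product above is exactly cosine similarity (scaled by 100). A minimal numpy sketch of that identity, using made-up vectors in place of real model outputs:

```python
import numpy as np

# Two made-up embedding vectors (stand-ins for model outputs).
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# L2-normalize, as normalize_embeddings=True does.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# Dot product of unit vectors equals cosine similarity.
dot = float(a_n @ b_n)
cosine = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, cosine)  # identical up to floating-point error
```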

Using Hugging Face Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('McGill-NLP/AfriE5-Large-instruct')
model = AutoModel.from_pretrained('McGill-NLP/AfriE5-Large-instruct')

# Define input texts
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "
input_texts = [
    query_instruction + "What is the capital of Nigeria?",
    "Abuja is the capital city of Nigeria.",
    "Lagos is the largest city in Nigeria."
]

# Tokenize
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get embeddings
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute cosine similarity
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores)

Benchmark Results

AfriE5-Large-instruct was evaluated on AfriMTEB, a comprehensive benchmark for African languages.

AfriMTEB-Lite (9 Languages)

Average performance across 12 tasks on 9 target African languages.

Model                   Average Score
AfriE5-Large-instruct   63.7
Gemini Embedding-001    63.1
mE5-Large-instruct      62.0
BGE-M3                  55.0

AfriMTEB-Full (59 Languages)

Macro-average across 38 datasets and 59 languages.

Model                   Average Score
AfriE5-Large-instruct   62.4
mE5-Large-instruct      61.3
Gemini Embedding-001    60.6
BGE-M3                  55.8

Note: AfriE5 outperforms strong baselines despite being trained on only 9 languages, demonstrating effective cross-lingual generalization.
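For reference, a macro-average weights each dataset equally regardless of its size or language. A small sketch with hypothetical (not AfriMTEB) scores:

```python
# Hypothetical per-dataset scores, not the real AfriMTEB numbers.
scores = {"dataset_a": 70.0, "dataset_b": 55.0, "dataset_c": 62.0}

# Macro-average: unweighted mean over datasets.
macro_avg = sum(scores.values()) / len(scores)
print(round(macro_avg, 1))  # 62.3
```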

Training Details

  • Source Data: MNLI and SNLI (English).
  • Translation: Translated into 9 African languages (Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu) using facebook/nllb-200-3.3B.
  • Quality Control: Filtered using SSA-COMET (threshold 0.75) to ensure high-quality training pairs.
  • Data Augmentation: Expanded with cross-lingual pairs (e.g., Target Premise - Source Hypothesis) and hard negatives mined using mE5.
  • Objective: Contrastive loss + KL-divergence distillation from BAAI/bge-reranker-v2-m3.
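The objective above pairs a contrastive (InfoNCE-style) loss with KL-divergence distillation from the teacher reranker's scores. A dependency-free sketch of that combination; the temperatures and scores below are illustrative assumptions, not the model's actual hyperparameters:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(sims, pos_idx, temperature=0.05):
    """InfoNCE: cross-entropy over scaled similarities; the positive sits at pos_idx."""
    probs = softmax([s / temperature for s in sims])
    return -math.log(probs[pos_idx])

def kl_distillation(teacher_scores, student_sims):
    """KL(teacher || student) between the two candidate distributions."""
    p = softmax(teacher_scores)
    q = softmax(student_sims)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Illustrative numbers: one positive document followed by two hard negatives.
student_sims = [0.90, 0.30, 0.10]   # student cosine similarities
teacher_scores = [9.0, 2.0, 0.5]    # teacher (reranker) relevance scores

loss = contrastive_loss(student_sims, pos_idx=0) + kl_distillation(teacher_scores, student_sims)
```

The contrastive term pulls the query toward its positive and away from the negatives; the KL term additionally nudges the student's score distribution toward the teacher's.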

Citation

If you use this model or the AfriMTEB benchmark, please cite:

@article{uemura2025afrimteb,
  title={AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages},
  author={Uemura, Kosei and Zhang, Miaoran and Adelani, David Ifeoluwa},
  journal={arXiv preprint},
  year={2025}
}

Acknowledgments

This work adapts the FlagEmbedding library. We thank the BAAI team for their open-source contributions.
