AfriE5-Large-instruct
AfriE5-Large-instruct is a text embedding model adapted from multilingual-e5-large-instruct to better support African languages. It was trained with cross-lingual contrastive learning and knowledge distillation on 9 African languages, and it generalizes well to the 59 languages covered by the AfriMTEB benchmark.
Model Details
- Model Name: AfriE5-Large-instruct
- Base Model: intfloat/multilingual-e5-large-instruct
- Architecture: XLM-RoBERTa-large based (24 layers, 1024 hidden size)
- Training Method: Cross-lingual contrastive learning + knowledge distillation (teacher: BAAI/bge-reranker-v2-m3)
- Training Data: NLI datasets (MNLI, SNLI) translated into 9 African languages using NLLB-200-3.3B, filtered by SSA-COMET.
- Supported Languages:
  - Targeted (Training): Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu.
  - Evaluated (AfriMTEB): 59 languages, including the 9 targeted ones and others such as Afrikaans, Somali, and Twi.
Usage
Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('McGill-NLP/AfriE5-Large-instruct')
# Define queries and documents
# IMPORTANT: Queries require a specific instruction prefix.
# Documents do not strictly need a prefix, but usage should mirror mE5 conventions.
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "
queries = [
"What are the key features of AfriMTEB?",
"Hali ya hewa ikoje leo?" # Swahili: How is the weather today?
]
documents = [
"AfriMTEB is a benchmark for evaluating text embeddings in African languages.",
"Leo kuna jua kali sana." # Swahili: Today it is very sunny.
]
# Add prefix to queries
formatted_queries = [query_instruction + q for q in queries]
# Encode
query_embeddings = model.encode(formatted_queries, normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# Compute similarity
scores = (query_embeddings @ doc_embeddings.T) * 100
print(scores)
```
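The score matrix has one row per query and one column per document, so for retrieval you can pick each query's best match with an argmax over the document axis. A minimal follow-on sketch, assuming the snippet above has already run:

```python
# scores[i, j] is the scaled cosine similarity between query i and document j;
# the argmax over each row selects the best-matching document per query.
best = scores.argmax(axis=1)
for query, idx in zip(queries, best):
    print(f"{query!r} -> {documents[idx]!r}")
```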
Using Hugging Face Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Mean-pool the token embeddings, masking out padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('McGill-NLP/AfriE5-Large-instruct')
model = AutoModel.from_pretrained('McGill-NLP/AfriE5-Large-instruct')
# Define input texts
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "
input_texts = [
query_instruction + "What is the capital of Nigeria?",
"Abuja is the capital city of Nigeria.",
"Lagos is the largest city in Nigeria."
]
# Tokenize
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
# Get embeddings
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Compute cosine similarity
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores)
```
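For inference-only use, gradients aren't needed; a small variant of the encoding step above (same names, assuming the block above has run) saves memory and time:

```python
# Disable gradient tracking for faster, lower-memory inference.
model.eval()
with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
```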
Benchmark Results
AfriE5-Large-instruct was evaluated on AfriMTEB, a comprehensive benchmark for African languages.
AfriMTEB-Lite (9 Languages)
Average performance across 12 tasks on 9 target African languages.
| Model | Average Score |
|---|---|
| AfriE5-Large-instruct | 63.7 |
| Gemini Embedding-001 | 63.1 |
| mE5-Large-instruct | 62.0 |
| BGE-M3 | 55.0 |
AfriMTEB-Full (59 Languages)
Macro-average across 38 datasets and 59 languages.
| Model | Average Score |
|---|---|
| AfriE5-Large-instruct | 62.4 |
| mE5-Large-instruct | 61.3 |
| Gemini Embedding-001 | 60.6 |
| BGE-M3 | 55.8 |
Note: AfriE5 outperforms strong baselines despite being trained on only 9 languages, demonstrating effective cross-lingual generalization.
Training Details
- Source Data: MNLI and SNLI (English).
- Translation: Translated into 9 African languages (Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu) using facebook/nllb-200-3.3B.
- Quality Control: Filtered with SSA-COMET (threshold 0.75) to keep only high-quality translation pairs; see the filtering sketch below.
- Data Augmentation: Expanded with cross-lingual pairs (e.g., Target Premise - Source Hypothesis) and hard negatives mined using mE5.
- Objective: Contrastive loss + KL-divergence distillation from BAAI/bge-reranker-v2-m3; a minimal sketch of this combined objective follows this list.
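SSA-COMET's exact API is not shown here; assuming a COMET-style scorer that returns one quality score per (source, translation) pair, the filtering step reduces to a threshold check. A minimal sketch with a hypothetical `score_fn`:

```python
def filter_translations(pairs, score_fn, threshold=0.75):
    """Keep (source, translation) pairs whose estimated quality clears the
    threshold; score_fn stands in for an SSA-COMET-style quality estimator."""
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]
```

The paper's exact objective is not reproduced here; the sketch below illustrates the general recipe named above (an in-batch contrastive term plus KL distillation toward a cross-encoder teacher's score distribution), with the tensor names, `temperature`, and the `alpha` weight all hypothetical:

```python
import torch
import torch.nn.functional as F

def train_loss(q_emb, d_emb, teacher_scores, temperature=0.05, alpha=1.0):
    # Cosine-similarity logits between every query and every in-batch document.
    logits = (F.normalize(q_emb, dim=-1) @ F.normalize(d_emb, dim=-1).T) / temperature
    # Contrastive term: document i is the positive for query i,
    # all other in-batch documents act as negatives.
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, labels)
    # Distillation term: match the student's score distribution
    # to the teacher reranker's distribution over the same documents.
    distill = F.kl_div(F.log_softmax(logits, dim=-1),
                       F.softmax(teacher_scores / temperature, dim=-1),
                       reduction='batchmean')
    return contrastive + alpha * distill
```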
Citation
If you use this model or the AfriMTEB benchmark, please cite:
```bibtex
@article{uemura2025afrimteb,
  title={AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages},
  author={Uemura, Kosei and Zhang, Miaoran and Adelani, David Ifeoluwa},
  journal={arXiv preprint},
  year={2025}
}
```
Acknowledgments
This work adapts the FlagEmbedding library. We thank the BAAI team for their open-source contributions.