---
language:
- en
- am
- om
- ig
- yo
- ha
- sw
- rw
- xh
- zu
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- transformers
license: mit
base_model: intfloat/multilingual-e5-large-instruct
datasets:
- mnli
- snli
metrics:
- spearmanr
- ndcg_at_10
---

# AfriE5-Large-instruct

**AfriE5-Large-instruct** is a text embedding model adapted from [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) to better support African languages. It was trained with cross-lingual contrastive learning and knowledge distillation on 9 African languages, and it generalizes well to the 59 languages covered by the [AfriMTEB benchmark](https://arxiv.org/abs/2510.23896).

## Model Details

- **Model Name:** AfriE5-Large-instruct
- **Base Model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- **Architecture:** XLM-RoBERTa-large (24 layers, 1024 hidden size)
- **Training Method:** Cross-lingual contrastive learning + knowledge distillation (teacher: [BGE Reranker v2 m3](https://huggingface.co/BAAI/bge-reranker-v2-m3))
- **Training Data:** NLI datasets (MNLI, SNLI) translated into 9 African languages using NLLB-200-3.3B and filtered with SSA-COMET
- **Supported Languages:**
  - **Targeted (Training):** Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu
  - **Evaluated (AfriMTEB):** 59 languages, including the targeted ones as well as others such as Afrikaans, Somali, and Twi

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('McGill-NLP/AfriE5-Large-instruct')

# Define queries and documents.
# IMPORTANT: Queries require a specific instruction prefix.
# Documents do not strictly need a prefix, but usage should mirror mE5 conventions.
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "

queries = [
    "What are the key features of AfriMTEB?",
    "Hali ya hewa ikoje leo?"  # Swahili: How is the weather today?
]

documents = [
    "AfriMTEB is a benchmark for evaluating text embeddings in African languages.",
    "Leo kuna jua kali sana."  # Swahili: Today it is very sunny.
]

# Add the instruction prefix to the queries only
formatted_queries = [query_instruction + q for q in queries]

# Encode with L2-normalized embeddings
query_embeddings = model.encode(formatted_queries, normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Cosine similarity (embeddings are normalized), scaled by 100
scores = (query_embeddings @ doc_embeddings.T) * 100
print(scores)
```

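For retrieval over a larger document collection, the `semantic_search` utility in sentence-transformers ranks documents per query. A minimal sketch continuing from the variables above:

```python
from sentence_transformers import util

# Rank documents for each query (top-1 shown here)
hits = util.semantic_search(query_embeddings, doc_embeddings, top_k=1)
for query, query_hits in zip(queries, hits):
    best = query_hits[0]
    print(f"{query!r} -> {documents[best['corpus_id']]!r} (score: {best['score']:.3f})")
```
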
### Using Hugging Face Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then average over the sequence dimension
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('McGill-NLP/AfriE5-Large-instruct')
model = AutoModel.from_pretrained('McGill-NLP/AfriE5-Large-instruct')

# Define input texts: the first is a query (with instruction prefix), the rest are documents
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "
input_texts = [
    query_instruction + "What is the capital of Nigeria?",
    "Abuja is the capital city of Nigeria.",
    "Lagos is the largest city in Nigeria."
]

# Tokenize
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get embeddings via average pooling over the last hidden state
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the query and each document, scaled by 100
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores)
```

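As in the base mE5 model, embeddings are produced by average pooling over the last hidden state and then L2-normalized, so the dot product above is a cosine similarity.
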
## Benchmark Results

AfriE5-Large-instruct was evaluated on **AfriMTEB**, a comprehensive text embedding benchmark for African languages.

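The model can be plugged into the `mteb` evaluation harness to reproduce benchmark numbers. A minimal sketch, assuming the AfriMTEB tasks are registered in your installed `mteb` version; the task name below is a placeholder to replace with the actual AfriMTEB task identifiers:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("McGill-NLP/AfriE5-Large-instruct")

# Placeholder task name: substitute the AfriMTEB task list
tasks = mteb.get_tasks(tasks=["<afrimteb-task-name>"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/afrie5")
```
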
### AfriMTEB-Lite (9 Languages)

*Average performance across 12 tasks on the 9 target African languages.*

| Model | Average Score |
| :--- | :---: |
| **AfriE5-Large-instruct** | **63.7** |
| Gemini Embedding-001 | 63.1 |
| mE5-Large-instruct | 62.0 |
| BGE-M3 | 55.0 |

### AfriMTEB-Full (59 Languages)

*Macro-average across 38 datasets and 59 languages.*

| Model | Average Score |
| :--- | :---: |
| **AfriE5-Large-instruct** | **62.4** |
| mE5-Large-instruct | 61.3 |
| Gemini Embedding-001 | 60.6 |
| BGE-M3 | 55.8 |

*Note: AfriE5 outperforms strong baselines despite being trained on only 9 languages, demonstrating effective cross-lingual generalization.*

## Training Details

- **Source Data:** MNLI and SNLI (English).
- **Translation:** Translated into 9 African languages (Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu) using `facebook/nllb-200-3.3B`.
- **Quality Control:** Filtered with **SSA-COMET** (threshold 0.75) to keep only high-quality translation pairs.
- **Data Augmentation:** Expanded with cross-lingual pairs (e.g., target-language premise paired with source-language hypothesis) and hard negatives mined with mE5.
- **Objective:** Contrastive loss + KL-divergence distillation from `BAAI/bge-reranker-v2-m3` (see the sketch below).

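For illustration, the sketch below shows one common way to combine an InfoNCE-style contrastive loss with KL-divergence distillation from a cross-encoder teacher. It is a minimal sketch under stated assumptions, not the exact AfriE5 recipe: the temperature, loss weighting, and score normalization are assumptions, and `teacher_scores` stands in for relevance scores produced by `BAAI/bge-reranker-v2-m3`.

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(q_emb, pos_emb, neg_emb, teacher_scores,
                             temperature=0.05, alpha=1.0):
    """InfoNCE contrastive loss plus KL distillation (illustrative only).

    q_emb:          (B, D) anchor embeddings (e.g., premises)
    pos_emb:        (B, D) positive embeddings (e.g., entailed hypotheses)
    neg_emb:        (B, N, D) hard-negative embeddings (e.g., mined with mE5)
    teacher_scores: (B, 1 + N) teacher relevance scores for [positive, negatives]
    """
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    # Student cosine similarities against the positive and each hard negative
    s_pos = (q * pos).sum(dim=-1, keepdim=True)      # (B, 1)
    s_neg = torch.einsum("bd,bnd->bn", q, neg)       # (B, N)
    logits = torch.cat([s_pos, s_neg], dim=1) / temperature

    # Contrastive term: the positive (index 0) should score highest
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    loss_contrastive = F.cross_entropy(logits, labels)

    # Distillation term: match the teacher's score distribution
    loss_kl = F.kl_div(F.log_softmax(logits, dim=1),
                       F.softmax(teacher_scores, dim=1),
                       reduction="batchmean")
    return loss_contrastive + alpha * loss_kl
```
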
## Citation

If you use this model or the AfriMTEB benchmark, please cite:

```bibtex
@article{uemura2025afrimteb,
  title={AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages},
  author={Uemura, Kosei and Zhang, Miaoran and Adelani, David Ifeoluwa},
  journal={arXiv preprint arXiv:2510.23896},
  year={2025}
}
```

## Acknowledgments

This work adapts the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) library. We thank the BAAI team for their open-source contributions.