---
language:
- en
- am
- om
- ig
- yo
- ha
- sw
- rw
- xh
- zu
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- transformers
license: mit
base_model: intfloat/multilingual-e5-large-instruct
datasets:
- mnli
- snli
metrics:
- spearmanr
- ndcg_at_10
---

# AfriE5-Large-instruct

**AfriE5-Large-instruct** is a text embedding model adapted from [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) to better support African languages. It was trained with cross-lingual contrastive learning and knowledge distillation, specifically targeting 9 African languages while generalizing well to the 59 languages covered by the [AfriMTEB benchmark](https://arxiv.org/abs/2510.23896).

## Model Details

- **Model Name:** AfriE5-Large-instruct
- **Base Model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- **Architecture:** XLM-RoBERTa-large based (24 layers, 1024 hidden size)
- **Training Method:** Cross-lingual contrastive learning + knowledge distillation (teacher: [BGE Reranker v2 m3](https://huggingface.co/BAAI/bge-reranker-v2-m3))
- **Training Data:** NLI datasets (MNLI, SNLI) translated into 9 African languages using NLLB-200-3.3B, filtered by SSA-COMET.
- **Supported Languages:**
  - **Targeted (Training):** Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu.
  - **Evaluated (AfriMTEB):** Covers 59 languages, including the targeted ones plus others such as Afrikaans, Somali, and Twi.

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('McGill-NLP/AfriE5-Large-instruct')

# Define queries and documents.
# IMPORTANT: queries require a specific instruction prefix;
# documents are encoded as-is, mirroring mE5 conventions.
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "

queries = [
    "What are the key features of AfriMTEB?",
    "Hali ya hewa ikoje leo?"  # Swahili: How is the weather today?
]

documents = [
    "AfriMTEB is a benchmark for evaluating text embeddings in African languages.",
    "Leo kuna jua kali sana."  # Swahili: Today it is very sunny.
]

# Add the instruction prefix to queries only
formatted_queries = [query_instruction + q for q in queries]

# Encode with L2-normalized embeddings
query_embeddings = model.encode(formatted_queries, normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Cosine similarity (scaled by 100)
scores = (query_embeddings @ doc_embeddings.T) * 100
print(scores)
```

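The instruction prefix can also be built with a small helper in the style of the base mE5-instruct model card (a sketch; the exact task description is adjustable):

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Build an E5-style instruction-prefixed query
    return f'Instruct: {task_description}\nQuery: {query}'

task = 'Retrieve sentences that are semantically consistent with the input.'
print(get_detailed_instruct(task, 'Hali ya hewa ikoje leo?'))
```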
### Using Hugging Face Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Mask out padding tokens, then mean-pool over the sequence
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('McGill-NLP/AfriE5-Large-instruct')
model = AutoModel.from_pretrained('McGill-NLP/AfriE5-Large-instruct')

# Define input texts: one prefixed query followed by two documents
query_instruction = "Instruct: Retrieve sentences that are semantically consistent with the input.\nQuery: "
input_texts = [
    query_instruction + "What is the capital of Nigeria?",
    "Abuja is the capital city of Nigeria.",
    "Lagos is the largest city in Nigeria."
]

# Tokenize
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get embeddings via average pooling
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the query and each document (scaled by 100)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores)
```

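As a sanity check, `average_pool` ignores padded positions entirely; a standalone toy example (no model download needed) shows the padding token not affecting the mean:

```python
import torch

def average_pool(last_hidden_states, attention_mask):
    # Zero out padded positions, then average over real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Batch of one sequence: two real tokens, one padding token with garbage values
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
mask = torch.tensor([[1, 1, 0]])
print(average_pool(hidden, mask))  # tensor([[2., 3.]])
```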
## Benchmark Results

AfriE5-Large-instruct was evaluated on **AfriMTEB**, a comprehensive benchmark for African languages.

### AfriMTEB-Lite (9 Languages)

*Average performance across 12 tasks on the 9 target African languages.*

| Model | Average Score |
| :--- | :---: |
| **AfriE5-Large-instruct** | **63.7** |
| Gemini Embedding-001 | 63.1 |
| mE5-Large-instruct | 62.0 |
| BGE-M3 | 55.0 |

### AfriMTEB-Full (59 Languages)

*Macro-average across 38 datasets and 59 languages.*

| Model | Average Score |
| :--- | :---: |
| **AfriE5-Large-instruct** | **62.4** |
| mE5-Large-instruct | 61.3 |
| Gemini Embedding-001 | 60.6 |
| BGE-M3 | 55.8 |

*Note: AfriE5 outperforms strong baselines despite being trained on only 9 languages, demonstrating effective cross-lingual generalization.*

## Training Details

- **Source Data:** MNLI and SNLI (English).
- **Translation:** Translated into 9 African languages (Amharic, Oromo, Hausa, Igbo, Kinyarwanda, Swahili, Xhosa, Yoruba, Zulu) using `facebook/nllb-200-3.3B`.
- **Quality Control:** Filtered using **SSA-COMET** (threshold 0.75) to ensure high-quality training pairs.
- **Data Augmentation:** Expanded with cross-lingual pairs (e.g., a target-language premise paired with the source-language hypothesis) and hard negatives mined using mE5.
- **Objective:** Contrastive loss + KL-divergence distillation from `BAAI/bge-reranker-v2-m3`.

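The combined objective can be sketched as follows. This is a minimal illustration, not the released training code: the tensor shapes, temperature, and mixing weight `alpha` are assumptions, and `contrastive_kd_loss` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(q, pos, negs, teacher_scores, temperature=0.05, alpha=1.0):
    """Sketch: InfoNCE over (positive + hard negatives) plus KL distillation
    toward a reranker teacher's relevance scores.

    q:              (B, D) query embeddings
    pos:            (B, D) positive document embeddings
    negs:           (B, N, D) hard-negative embeddings
    teacher_scores: (B, 1 + N) teacher scores for [positive] + negatives
    """
    q = F.normalize(q, dim=-1)
    cands = F.normalize(torch.cat([pos.unsqueeze(1), negs], dim=1), dim=-1)  # (B, 1+N, D)
    logits = torch.einsum('bd,bkd->bk', q, cands) / temperature              # (B, 1+N)

    # InfoNCE: the positive always sits at candidate index 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    loss_contrastive = F.cross_entropy(logits, labels)

    # Distillation: match the student's candidate distribution to the teacher's
    loss_kd = F.kl_div(F.log_softmax(logits, dim=-1),
                       F.softmax(teacher_scores, dim=-1),
                       reduction='batchmean')
    return loss_contrastive + alpha * loss_kd

# Toy shapes only; real training uses mined negatives and BGE reranker scores
B, N, D = 4, 3, 8
loss = contrastive_kd_loss(torch.randn(B, D), torch.randn(B, D),
                           torch.randn(B, N, D), torch.randn(B, 1 + N))
print(float(loss))
```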
## Citation

If you use this model or the AfriMTEB benchmark, please cite:

```bibtex
@article{uemura2025afrimteb,
  title={AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages},
  author={Uemura, Kosei and Zhang, Miaoran and Adelani, David Ifeoluwa},
  journal={arXiv preprint},
  year={2025}
}
```

## Acknowledgments

This work adapts the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) library. We thank the BAAI team for their open-source contributions.