Update README.md

README.md (changed):

@@ -177,7 +177,7 @@ The below numbers are with the mDPR model, but miniDense_arabic_v1 should give an eve…

*Note: the MIRACL paper shows a different (higher) value for BM25 Arabic, so we take that value from the BGE-M3 paper; all the rest are from the MIRACL paper.*

-# MTEB numbers:
+# MTEB Retrieval numbers:
MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but miniDense models (like BGE-M3) are predominantly tuned for retrieval tasks aimed at search & IR use cases.
So it makes sense to evaluate our models on the retrieval slice of the MTEB benchmark.

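For anyone reproducing the retrieval slice, here is a minimal sketch using the open-source `mteb` harness. The checkpoint id below is a placeholder (not necessarily a released model name), and the task-selection API differs slightly across `mteb` versions.

```python
# Sketch: run only Retrieval-type MTEB tasks that cover Arabic.
# Assumes a SentenceTransformer-compatible checkpoint; the id is a placeholder.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-org/miniDense_arabic_v1")  # placeholder id

tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["ara"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/miniDense_arabic_v1")

for res in results:
    print(res.task_name, res.scores)  # nDCG@10 is the headline retrieval metric
```
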
@@ -185,13 +185,22 @@ So it makes sense to evaluate our models on the retrieval slice of the MTEB benchmar…

Refer to the tables above.

+#### Sadeem Question Retrieval
+
+<center>
+<img src="./ar_metrics_6.png" width=150%/>
+<b><p>Table 3: Detailed Arabic retrieval performance on the SadeemQA eval set (measured by nDCG@10)</p></b>
+</center>
+
+
+
#### Long Document Retrieval

This is a very ambitious eval because we have not trained for long context: max_len was 512 for all the models below except BGE-M3, which had an 8192 context and was finetuned for long documents.

<center>
<img src="./ar_metrics_4.png" width=150%/>
-<b><p>Table …
+<b><p>Table 4: Detailed Arabic retrieval performance on the MultiLongDoc dev set (measured by nDCG@10)</p></b>
</center>


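With a 512-token limit, long documents have to be truncated or chunked at encode time. Below is a minimal sketch of one common workaround (embed fixed-size chunks, then max-pool the chunk scores); it illustrates the general technique only, not necessarily how the numbers in Table 4 were produced, and the model id is a placeholder.

```python
# Sketch: apply a 512-token dual encoder to long documents by chunking.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-org/miniDense_arabic_v1")  # placeholder id
CHUNK_WORDS = 300  # rough proxy to keep each chunk under 512 tokens

def chunk(text: str, size: int = CHUNK_WORDS) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)] or [""]

def score(query: str, document: str) -> float:
    q = model.encode([query], normalize_embeddings=True)[0]
    c = model.encode(chunk(document), normalize_embeddings=True)
    return float(np.max(c @ q))  # relevance = best-matching chunk
```
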
@@ -202,7 +211,7 @@ This explains its overall competitive performance when compared to models that…

<center>
<img src="./ar_metrics_5.png" width=120%/>
-<b><p>Table …
+<b><p>Table 5: Detailed Arabic retrieval performance on the 3 cross-lingual test sets (measured by nDCG@10)</p></b>
</center>
<br/>
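
All tables above report nDCG@10. For reference, here is a self-contained sketch of the metric on a hypothetical binary-relevance ranking:

```python
# Sketch: nDCG@10 for a single query with (binary or graded) relevance labels.
import math

def dcg_at_k(rels, k=10):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: relevant docs returned at ranks 1, 3 and 7.
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 1, 0, 0, 0]))  # ~0.86
```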
|