Update README.md
README.md
CHANGED
@@ -215,11 +215,6 @@ print(model.compute_score(sentence_pairs,
 
 
 We compare BGE-M3 with some popular methods, including BM25, openAI embedding, etc.
-We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
-To make the BM25 and BGE-M3 more comparable, in the experiment,
-BM25 used the same tokenizer as BGE-M3 (i.e., the tokenizer of XLM-Roberta).
-Using the same vocabulary can also ensure that both approaches have the same retrieval latency.
-
 
 - Multilingual (Miracl dataset)
 
@@ -242,6 +237,12 @@ Using the same vocabulary can also ensure that both approaches have the same ret
 - NarritiveQA:
 
 
+- BM25
+
+We utilized Pyserini to implement BM25, and the test results can be reproduced by this [script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#bm25-baseline).
+
+
+
 
 ## Training
 - Self-knowledge Distillation: combining multiple outputs from different
@@ -259,7 +260,7 @@ Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.
 ## Acknowledgement
 
 Thanks the authors of open-sourced datasets, including Miracl, MKQA, NarritiveQA, etc.
-Thanks the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [
+Thanks the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [Pyserini](https://github.com/castorini/pyserini).
 
 
 