Update README.md
README.md
@@ -88,13 +88,42 @@ print(similarities.shape)
# [3, 3]
```

### Direct Usage (Transformers)

```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = [
    "query: 북한가족법 몇 차 개정에서 이혼판결 확정 후 3개월 내에 등록시에만 유효하다는 조항을 확실히 했을까?",
    "passage: ...",  # a long Korean passage on the history of the North Korean Family Law amendments
    "passage: ...",  # a long Korean passage on the Korean eco-label (환경마크) certification scheme
]

tokenizer = AutoTokenizer.from_pretrained('dragonkue/multilingual-e5-small-ko')
model = AutoModel.from_pretrained('dragonkue/multilingual-e5-small-ko')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T)
print(scores.tolist())
```

<!--
### Downstream Usage (Sentence Transformers)

@@ -114,15 +143,26 @@ You can finetune this model on your own dataset.

## Evaluation

- This evaluation references the [KURE GitHub repository](https://github.com/nlpai-lab/KURE).
- We evaluated the model on all **Korean retrieval benchmarks** registered in [MTEB](https://github.com/embeddings-benchmark/mteb); a sketch of running these tasks with the `mteb` package follows the benchmark list below.

### Korean Retrieval Benchmarks

- [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): A Korean **ODQA multi-hop retrieval dataset**, translated from StrategyQA.
- [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): A **Korean document retrieval dataset** constructed by parsing PDFs from five domains: **finance, public, medical, legal, and commerce**.
- [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): A **Korean document retrieval dataset** based on Wikipedia.
- [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): A **retrieval dataset** focused on **medical and public health domains** in Korean.
- [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): A **Korean document retrieval dataset** based on FLORES-200.
- [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): A **Wikipedia-based Korean document retrieval dataset**.
- [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR): A **long-document retrieval dataset** covering various domains in Korean.
- [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): A **cross-domain Korean document retrieval dataset**.

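The snippet below is a minimal, illustrative sketch of running such an evaluation with the [`mteb`](https://github.com/embeddings-benchmark/mteb) package. It is not the exact script behind the reported numbers, and the task identifiers are assumed to match the MTEB registry.

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Two of the benchmarks listed above; task names are assumptions, adjust if they differ.
tasks = mteb.get_tasks(tasks=["Ko-StrategyQA", "XPQARetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/multilingual-e5-small-ko")
```
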
### Metrics

* Standard metric: NDCG@10

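For reference, the sketch below shows how NDCG@10 can be computed for a single query with binary relevance judgments. It is illustrative only and is not the MTEB implementation used for the reported scores.

```python
import math

def ndcg_at_10(ranked_doc_ids, relevant_doc_ids):
    """Illustrative NDCG@10 with binary relevance (not the MTEB implementation)."""
    gains = [1.0 if doc_id in relevant_doc_ids else 0.0 for doc_id in ranked_doc_ids[:10]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal_gains = [1.0] * min(len(relevant_doc_ids), 10)
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the only relevant document is ranked 2nd among the retrieved results.
print(ndcg_at_10(["doc7", "doc3", "doc9"], {"doc3"}))  # ~0.63
```
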
#### Information Retrieval
<!--
### Recommendations

@@ -133,7 +173,8 @@ You can finetune this model on your own dataset.

## Training Details

### Training Datasets

This model was fine-tuned on the same dataset used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, which consists of Korean query-passage pairs. The training objective was to improve retrieval performance specifically for Korean-language tasks.

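The snippet below is a rough, illustrative sketch of contrastive fine-tuning on (query, passage) pairs with `sentence-transformers`. It is not the authors' training code: the example pairs, batch size, and loss settings are assumptions (the `scale=100.0` merely mirrors the 0.01 InfoNCE temperature mentioned in the FAQ).

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the multilingual E5 base model (this card's base model).
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Hypothetical Korean query-passage pairs; real training data would be far larger.
train_examples = [
    InputExample(texts=["query: 한국의 수도는 어디인가요?", "passage: 대한민국의 수도는 서울이다."]),
    InputExample(texts=["query: 환경마크 제도는 무엇인가요?", "passage: 환경마크 제도는 환경성을 개선한 제품에 인증을 부여하는 제도이다."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives; scale is the inverse temperature (100.0 corresponds to 0.01).
train_loss = losses.MultipleNegativesRankingLoss(model, scale=100.0)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```
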
### Training Hyperparameters
#### Non-Default Hyperparameters

@@ -278,6 +319,25 @@
- Datasets: 3.5.1
- Tokenizers: 0.21.1

## FAQ

1. Do I need to add the prefixes "query: " and "passage: " to input texts?

Yes, this is how the model was trained; otherwise you will see a performance degradation.

Here are some rules of thumb (a short usage sketch follows this list):

- Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA or ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
- Use the "query: " prefix if you want to use embeddings as features, e.g. for linear-probing classification or clustering.
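
A minimal illustration of applying these rules before encoding (the Korean example sentences are hypothetical):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Asymmetric task (retrieval): "query: " for questions, "passage: " for documents.
query_embeddings = model.encode(["query: 한국의 수도는 어디인가요?"])
passage_embeddings = model.encode(["passage: 대한민국의 수도는 서울이다."])

# Symmetric task (semantic similarity) or embeddings-as-features: "query: " on every text.
feature_embeddings = model.encode(["query: 서울은 대한민국의 수도이다.", "query: 한국의 수도는 서울이다."])
```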

2. Why do the cosine similarity scores mostly fall between 0.7 and 1.0?

This is known and expected behavior, since a low temperature of 0.01 is used for the InfoNCE contrastive loss.

For text embedding tasks such as retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.

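For intuition, the sketch below shows an InfoNCE-style objective with temperature 0.01 (illustrative only, not the actual training code). Dividing cosine similarities by such a small temperature makes even modest similarity gaps dominate the softmax, so the model can separate positives from negatives while absolute similarities stay in a narrow, high band.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, passage_embs: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """Illustrative InfoNCE loss: passage_embs[0] is the positive, the rest are negatives."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), passage_embs, dim=-1)  # [num_passages]
    logits = sims / temperature  # the low temperature sharpens the softmax
    target = torch.zeros(1, dtype=torch.long)  # the positive passage sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

# Random 384-dimensional embeddings (the model's embedding size), just to show the call.
query = F.normalize(torch.randn(384), dim=-1)
passages = F.normalize(torch.randn(4, 384), dim=-1)
print(info_nce(query, passages))
```
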
## Citation
### BibTeX

@@ -295,6 +355,20 @@
}
```
#### Base Model
```bibtex
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```

## Limitations

Long texts will be truncated to at most 512 tokens.
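
To check whether a given text will be truncated, you can count tokens with the model's tokenizer (a quick, illustrative check):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dragonkue/multilingual-e5-small-ko")

text = "passage: ..."  # replace with your document text
num_tokens = len(tokenizer(text)["input_ids"])
if num_tokens > 512:
    print(f"{num_tokens} tokens: everything beyond the first 512 will be truncated")
```
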
<!--
## Glossary