---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:1879136
- loss:CachedGISTEmbedLoss
license: mit
metrics:
- recall
- precision
- f1
base_model:
- nlpai-lab/KURE-v1
library_name: sentence-transformers
---
# 🔎 KURE-v1
Introducing KURE-v1, the Korea University Retrieval Embedding model.
It shows remarkable performance on Korean text retrieval, specifically outperforming most multilingual embedding models.
To our knowledge, it is one of the best publicly available Korean retrieval models.
For details, visit the [KURE repository](https://github.com/nlpai-lab/KURE).
---
## Model Versions
| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| [KURE-v1](https://huggingface.co/nlpai-lab/KURE-v1) | 1024 | 8192 | Fine-tuned [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) with Korean data via [CachedGISTEmbedLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedgistembedloss)
| [KoE5](https://huggingface.co/nlpai-lab/KoE5) | 1024 | 512 | Fine-tuned [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) with [ko-triplet-v1.0](https://huggingface.co/datasets/nlpai-lab/ko-triplet-v1.0) via [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) |
## Model Description
This is the model card of a 🤗 transformers model that has been pushed to the Hub.
- **Developed by:** [NLP&AI Lab](http://nlp.korea.ac.kr/)
- **Language(s) (NLP):** Korean, English
- **License:** MIT
- **Finetuned from model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
## Example code
### Install Dependencies
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
### Python code
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("nlpai-lab/KURE-v1")
# Run inference
sentences = [
'ํ—Œ๋ฒ•๊ณผ ๋ฒ•์›์กฐ์ง๋ฒ•์€ ์–ด๋–ค ๋ฐฉ์‹์„ ํ†ตํ•ด ๊ธฐ๋ณธ๊ถŒ ๋ณด์žฅ ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ๋ฒ•์  ๋ชจ์ƒ‰์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ์–ด',
'4. ์‹œ์‚ฌ์ ๊ณผ ๊ฐœ์„ ๋ฐฉํ–ฅ ์•ž์„œ ์‚ดํŽด๋ณธ ๋ฐ”์™€ ๊ฐ™์ด ์šฐ๋ฆฌ ํ—Œ๋ฒ•๊ณผ ๏ฝข๋ฒ•์›์กฐ์ง ๋ฒ•๏ฝฃ์€ ๋Œ€๋ฒ•์› ๊ตฌ์„ฑ์„ ๋‹ค์–‘ํ™”ํ•˜์—ฌ ๊ธฐ๋ณธ๊ถŒ ๋ณด์žฅ๊ณผ ๋ฏผ์ฃผ์ฃผ์˜ ํ™•๋ฆฝ์— ์žˆ์–ด ๋‹ค๊ฐ์ ์ธ ๋ฒ•์  ๋ชจ์ƒ‰์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ์„ ๊ทผ๋ณธ ๊ทœ๋ฒ”์œผ๋กœ ํ•˜๊ณ  ์žˆ๋‹ค. ๋”์šฑ์ด ํ•ฉ์˜์ฒด๋กœ์„œ์˜ ๋Œ€๋ฒ•์› ์›๋ฆฌ๋ฅผ ์ฑ„ํƒํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ ์—ญ์‹œ ๊ทธ ๊ตฌ์„ฑ์˜ ๋‹ค์–‘์„ฑ์„ ์š”์ฒญํ•˜๋Š” ๊ฒƒ์œผ๋กœ ํ•ด์„๋œ๋‹ค. ์ด์™€ ๊ฐ™์€ ๊ด€์ ์—์„œ ๋ณผ ๋•Œ ํ˜„์ง ๋ฒ•์›์žฅ๊ธ‰ ๊ณ ์œ„๋ฒ•๊ด€์„ ์ค‘์‹ฌ์œผ๋กœ ๋Œ€๋ฒ•์›์„ ๊ตฌ์„ฑํ•˜๋Š” ๊ด€ํ–‰์€ ๊ฐœ์„ ํ•  ํ•„์š”๊ฐ€ ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.',
'์—ฐ๋ฐฉํ—Œ๋ฒ•์žฌํŒ์†Œ๋Š” 2001๋…„ 1์›” 24์ผ 5:3์˜ ๋‹ค์ˆ˜๊ฒฌํ•ด๋กœ ใ€Œ๋ฒ•์›์กฐ์ง๋ฒ•ใ€ ์ œ169์กฐ ์ œ2๋ฌธ์ด ํ—Œ๋ฒ•์— ํ•ฉ์น˜๋œ๋‹ค๋Š” ํŒ๊ฒฐ์„ ๋‚ด๋ ธ์Œ โ—‹ 5์ธ์˜ ๋‹ค์ˆ˜ ์žฌํŒ๊ด€์€ ์†Œ์†ก๊ด€๊ณ„์ธ์˜ ์ธ๊ฒฉ๊ถŒ ๋ณดํ˜ธ, ๊ณต์ •ํ•œ ์ ˆ์ฐจ์˜ ๋ณด์žฅ๊ณผ ๋ฐฉํ•ด๋ฐ›์ง€ ์•Š๋Š” ๋ฒ•๊ณผ ์ง„์‹ค ๋ฐœ๊ฒฌ ๋“ฑ์„ ๊ทผ๊ฑฐ๋กœ ํ•˜์—ฌ ํ…”๋ ˆ๋น„์ „ ์ดฌ์˜์— ๋Œ€ํ•œ ์ ˆ๋Œ€์ ์ธ ๊ธˆ์ง€๋ฅผ ํ—Œ๋ฒ•์— ํ•ฉ์น˜ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์•˜์Œ โ—‹ ๊ทธ๋Ÿฌ๋‚˜ ๋‚˜๋จธ์ง€ 3์ธ์˜ ์žฌํŒ๊ด€์€ ํ–‰์ •๋ฒ•์›์˜ ์†Œ์†ก์ ˆ์ฐจ๋Š” ํŠน๋ณ„ํ•œ ์ธ๊ฒฉ๊ถŒ ๋ณดํ˜ธ์˜ ์ด์ต๋„ ์—†์œผ๋ฉฐ, ํ…”๋ ˆ๋น„์ „ ๊ณต๊ฐœ์ฃผ์˜๋กœ ์ธํ•ด ๋ฒ•๊ณผ ์ง„์‹ค ๋ฐœ๊ฒฌ์˜ ๊ณผ์ •์ด ์–ธ์ œ๋‚˜ ์œ„ํƒœ๋กญ๊ฒŒ ๋˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋ผ๋ฉด์„œ ๋ฐ˜๋Œ€์˜๊ฒฌ์„ ์ œ์‹œํ•จ โ—‹ ์™œ๋ƒํ•˜๋ฉด ํ–‰์ •๋ฒ•์›์˜ ์†Œ์†ก์ ˆ์ฐจ์—์„œ๋Š” ์†Œ์†ก๋‹น์‚ฌ์ž๊ฐ€ ๊ฐœ์ธ์ ์œผ๋กœ ์ง์ ‘ ์‹ฌ๋ฆฌ์— ์ฐธ์„ํ•˜๊ธฐ๋ณด๋‹ค๋Š” ๋ณ€ํ˜ธ์‚ฌ๊ฐ€ ์ฐธ์„ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์œผ๋ฉฐ, ์‹ฌ๋ฆฌ๋Œ€์ƒ๋„ ์‚ฌ์‹ค๋ฌธ์ œ๊ฐ€ ์•„๋‹Œ ๋ฒ•๋ฅ ๋ฌธ์ œ๊ฐ€ ๋Œ€๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๋Š” ๊ฒƒ์ž„ โ–ก ํ•œํŽธ, ์—ฐ๋ฐฉํ—Œ๋ฒ•์žฌํŒ์†Œ๋Š” ใ€Œ์—ฐ๋ฐฉํ—Œ๋ฒ•์žฌํŒ์†Œ๋ฒ•ใ€(Bundesverfassungsgerichtsgesetz: BVerfGG) ์ œ17a์กฐ์— ๋”ฐ๋ผ ์ œํ•œ์ ์ด๋‚˜๋งˆ ์žฌํŒ์— ๋Œ€ํ•œ ๋ฐฉ์†ก์„ ํ—ˆ์šฉํ•˜๊ณ  ์žˆ์Œ โ—‹ ใ€Œ์—ฐ๋ฐฉํ—Œ๋ฒ•์žฌํŒ์†Œ๋ฒ•ใ€ ์ œ17์กฐ์—์„œ ใ€Œ๋ฒ•์›์กฐ์ง๋ฒ•ใ€ ์ œ14์ ˆ ๋‚ด์ง€ ์ œ16์ ˆ์˜ ๊ทœ์ •์„ ์ค€์šฉํ•˜๋„๋ก ํ•˜๊ณ  ์žˆ์ง€๋งŒ, ๋…น์Œ์ด๋‚˜ ์ดฌ์˜์„ ํ†ตํ•œ ์žฌํŒ๊ณต๊ฐœ์™€ ๊ด€๋ จํ•˜์—ฌ์„œ๋Š” ใ€Œ๋ฒ•์›์กฐ์ง๋ฒ•ใ€๊ณผ ๋‹ค๋ฅธ ๋‚ด์šฉ์„ ๊ทœ์ •ํ•˜๊ณ  ์žˆ์Œ',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# Results for KURE-v1
# tensor([[1.0000, 0.6967, 0.5306],
# [0.6967, 1.0000, 0.4427],
# [0.5306, 0.4427, 1.0000]])
```
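For retrieval, the similarity matrix above is usually reduced to a per-query ranking over a document collection. The sketch below shows that top-k step in plain NumPy, with toy vectors standing in for `model.encode(...)` output; nothing in it is specific to KURE-v1.

```python
import numpy as np

def top_k_by_cosine(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3):
    """Rank documents by cosine similarity to the query; return (indices, scores)."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # highest-similarity documents first
    return order, scores[order]

# Toy 4-dimensional embeddings standing in for real model output.
query = np.array([1.0, 0.0, 0.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],  # most similar to the query
    [0.0, 1.0, 0.0, 0.0],  # orthogonal to the query
    [0.7, 0.7, 0.0, 0.0],  # partially similar
])
idx, scores = top_k_by_cosine(query, docs, k=2)
print(idx)  # doc 0 ranks first, then doc 2
```

With real embeddings, `query` and `docs` would come from `model.encode`; Sentence Transformers also provides equivalent built-in utilities for this ranking step.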
## Training Details
### Training Data
#### KURE-v1
- Korean query–document pairs, each with 5 hard negatives
- 2,000,000 examples
### Training Procedure
- **Loss:** [CachedGISTEmbedLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedgistembedloss) from Sentence Transformers
- **batch size:** 4096
- **learning rate:** 2e-05
- **epochs:** 1
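CachedGISTEmbedLoss combines the GISTEmbed idea (a guide model filters in-batch negatives) with gradient caching, which is what makes the 4096 batch size feasible. The following is only an illustrative NumPy sketch of the filtering idea, on synthetic similarity matrices, not the actual sentence-transformers implementation: any in-batch "negative" that the guide model scores higher than the true positive is treated as a likely false negative and masked out of the contrastive softmax.

```python
import numpy as np

def gist_mask(model_sims: np.ndarray, guide_sims: np.ndarray) -> np.ndarray:
    """Illustrative in-batch negative filtering in the spirit of GISTEmbedLoss.

    Both inputs are (batch, batch) anchor-vs-document similarity matrices,
    with each anchor's positive pair on the diagonal.
    """
    positives = np.diag(guide_sims)[:, None]   # guide score of each true pair
    false_negative = guide_sims > positives    # guide prefers this "negative"
    np.fill_diagonal(false_negative, False)    # never mask the positive itself
    masked = model_sims.copy()
    masked[false_negative] = -np.inf           # drop from the contrastive softmax
    return masked

# Toy 3x3 similarities: for anchor 0, document 2 looks like a false negative.
guide = np.array([[0.8, 0.1, 0.9],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.6]])
model = np.full((3, 3), 0.5)
out = gist_mask(model, guide)
print(np.isinf(out[0, 2]))  # True: pair (0, 2) is excluded from the loss
```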
## Evaluation
### Metrics
- Recall, Precision, NDCG, F1
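For reference, these metrics can be computed per query as in the minimal single-query sketch below (binary relevance; a hypothetical helper, not the evaluation code behind the tables in this card).

```python
import math

def retrieval_metrics(retrieved: list, relevant: set, k: int) -> dict:
    """Recall/Precision/F1/NDCG at k for one query with binary relevance."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    recall = hits / len(relevant)
    precision = hits / k
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    # Binary-relevance NDCG: DCG of the top-k vs. the ideal DCG.
    dcg = sum(1 / math.log2(i + 2) for i, doc in enumerate(top_k) if doc in relevant)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return {"recall": recall, "precision": precision, "f1": f1, "ndcg": dcg / idcg}

# 2 of the 2 relevant documents appear in the top 3, at ranks 1 and 3.
print(retrieval_metrics(["d1", "d5", "d3"], {"d1", "d3"}, k=3))
```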
### Benchmark Datasets
- [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): Korean ODQA multi-hop retrieval dataset (translated from StrategyQA)
- [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): Korean document retrieval dataset built by parsing PDFs from five domains: finance, public sector, healthcare, law, and commerce
- [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): Korean document retrieval dataset based on Wikipedia
- [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): Korean document retrieval dataset for the medical and public health domains
- [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): Korean document retrieval dataset based on FLORES-200
- [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): Korean document retrieval dataset based on Wikipedia
- [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR): Korean long-document retrieval dataset across various domains
- [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): Korean document retrieval dataset across various domains
## Results
์•„๋ž˜๋Š” ๋ชจ๋“  ๋ชจ๋ธ์˜, ๋ชจ๋“  ๋ฒค์น˜๋งˆํฌ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ ํ‰๊ท  ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.
์ž์„ธํ•œ ๊ฒฐ๊ณผ๋Š” [KURE Github](https://github.com/nlpai-lab/KURE/tree/main/eval/results)์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
### Top-k 1
| Model | Average Recall_top1 | Average Precision_top1 | Average NDCG_top1 | Average F1_top1 |
|-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
| **nlpai-lab/KURE-v1** | **0.52640** | **0.60551** | **0.60551** | **0.55784** |
| dragonkue/BGE-m3-ko | 0.52361 | 0.60394 | 0.60394 | 0.55535 |
| BAAI/bge-m3 | 0.51778 | 0.59846 | 0.59846 | 0.54998 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.51246 | 0.59384 | 0.59384 | 0.54489 |
| nlpai-lab/KoE5 | 0.50157 | 0.57790 | 0.57790 | 0.53178 |
| intfloat/multilingual-e5-large | 0.50052 | 0.57727 | 0.57727 | 0.53122 |
| jinaai/jina-embeddings-v3 | 0.48287 | 0.56068 | 0.56068 | 0.51361 |
| BAAI/bge-multilingual-gemma2 | 0.47904 | 0.55472 | 0.55472 | 0.50916 |
| intfloat/multilingual-e5-large-instruct | 0.47842 | 0.55435 | 0.55435 | 0.50826 |
| intfloat/multilingual-e5-base | 0.46950 | 0.54490 | 0.54490 | 0.49947 |
| intfloat/e5-mistral-7b-instruct | 0.46772 | 0.54394 | 0.54394 | 0.49781 |
| Alibaba-NLP/gte-multilingual-base | 0.46469 | 0.53744 | 0.53744 | 0.49353 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46633 | 0.53625 | 0.53625 | 0.49429 |
| openai/text-embedding-3-large | 0.44884 | 0.51688 | 0.51688 | 0.47572 |
| Salesforce/SFR-Embedding-2_R | 0.43748 | 0.50815 | 0.50815 | 0.46504 |
| upskyy/bge-m3-korean | 0.43125 | 0.50245 | 0.50245 | 0.45945 |
| jhgan/ko-sroberta-multitask | 0.33788 | 0.38497 | 0.38497 | 0.35678 |
### Top-k 3
| Model | Average Recall_top3 | Average Precision_top3 | Average NDCG_top3 | Average F1_top3 |
|-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
| **nlpai-lab/KURE-v1** | **0.68678** | **0.28711** | **0.65538** | **0.39835** |
| dragonkue/BGE-m3-ko | 0.67834 | 0.28385 | 0.64950 | 0.39378 |
| BAAI/bge-m3 | 0.67526 | 0.28374 | 0.64556 | 0.39291 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.67128 | 0.28193 | 0.64042 | 0.39072 |
| intfloat/multilingual-e5-large | 0.65807 | 0.27777 | 0.62822 | 0.38423 |
| nlpai-lab/KoE5 | 0.65174 | 0.27329 | 0.62369 | 0.37882 |
| BAAI/bge-multilingual-gemma2 | 0.64415 | 0.27416 | 0.61105 | 0.37782 |
| jinaai/jina-embeddings-v3 | 0.64116 | 0.27165 | 0.60954 | 0.37511 |
| intfloat/multilingual-e5-large-instruct | 0.64353 | 0.27040 | 0.60790 | 0.37453 |
| Alibaba-NLP/gte-multilingual-base | 0.63744 | 0.26404 | 0.59695 | 0.36764 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.63163 | 0.25937 | 0.59237 | 0.36263 |
| intfloat/multilingual-e5-base | 0.62099 | 0.26144 | 0.59179 | 0.36203 |
| intfloat/e5-mistral-7b-instruct | 0.62087 | 0.26144 | 0.58917 | 0.36188 |
| openai/text-embedding-3-large | 0.61035 | 0.25356 | 0.57329 | 0.35270 |
| Salesforce/SFR-Embedding-2_R | 0.60001 | 0.25253 | 0.56346 | 0.34952 |
| upskyy/bge-m3-korean | 0.59215 | 0.25076 | 0.55722 | 0.34623 |
| jhgan/ko-sroberta-multitask | 0.46930 | 0.18994 | 0.43293 | 0.26696 |
### Top-k 5
| Model | Average Recall_top5 | Average Precision_top5 | Average NDCG_top5 | Average F1_top5 |
|-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
| **nlpai-lab/KURE-v1** | **0.73851** | **0.19130** | **0.67479** | **0.29903** |
| dragonkue/BGE-m3-ko | 0.72517 | 0.18799 | 0.66692 | 0.29401 |
| BAAI/bge-m3 | 0.72954 | 0.18975 | 0.66615 | 0.29632 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.72962 | 0.18875 | 0.66236 | 0.29542 |
| nlpai-lab/KoE5 | 0.70820 | 0.18287 | 0.64499 | 0.28628 |
| intfloat/multilingual-e5-large | 0.70124 | 0.18316 | 0.64402 | 0.28588 |
| BAAI/bge-multilingual-gemma2 | 0.70258 | 0.18556 | 0.63338 | 0.28851 |
| jinaai/jina-embeddings-v3 | 0.69933 | 0.18256 | 0.63133 | 0.28505 |
| intfloat/multilingual-e5-large-instruct | 0.69018 | 0.17838 | 0.62486 | 0.27933 |
| Alibaba-NLP/gte-multilingual-base | 0.69365 | 0.17789 | 0.61896 | 0.27879 |
| intfloat/multilingual-e5-base | 0.67250 | 0.17406 | 0.61119 | 0.27247 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.67447 | 0.17114 | 0.60952 | 0.26943 |
| intfloat/e5-mistral-7b-instruct | 0.67449 | 0.17484 | 0.60935 | 0.27349 |
| openai/text-embedding-3-large | 0.66365 | 0.17004 | 0.59389 | 0.26677 |
| Salesforce/SFR-Embedding-2_R | 0.65622 | 0.17018 | 0.58494 | 0.26612 |
| upskyy/bge-m3-korean | 0.65477 | 0.17015 | 0.58073 | 0.26589 |
| jhgan/ko-sroberta-multitask | 0.53136 | 0.13264 | 0.45879 | 0.20976 |
### Top-k 10
| Model | Average Recall_top10 | Average Precision_top10 | Average NDCG_top10 | Average F1_top10 |
|-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
| **nlpai-lab/KURE-v1** | **0.79682** | **0.10624** | **0.69473** | **0.18524** |
| dragonkue/BGE-m3-ko | 0.78450 | 0.10492 | 0.68748 | 0.18288 |
| BAAI/bge-m3 | 0.79195 | 0.10592 | 0.68723 | 0.18456 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.78669 | 0.10462 | 0.68189 | 0.18260 |
| intfloat/multilingual-e5-large | 0.75902 | 0.10147 | 0.66370 | 0.17693 |
| nlpai-lab/KoE5 | 0.75296 | 0.09937 | 0.66012 | 0.17369 |
| BAAI/bge-multilingual-gemma2 | 0.76153 | 0.10364 | 0.65330 | 0.18003 |
| jinaai/jina-embeddings-v3 | 0.76277 | 0.10240 | 0.65290 | 0.17843 |
| intfloat/multilingual-e5-large-instruct | 0.74851 | 0.09888 | 0.64451 | 0.17283 |
| Alibaba-NLP/gte-multilingual-base | 0.75631 | 0.09938 | 0.64025 | 0.17363 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 0.74092 | 0.09607 | 0.63258 | 0.16847 |
| intfloat/multilingual-e5-base | 0.73512 | 0.09717 | 0.63216 | 0.16977 |
| intfloat/e5-mistral-7b-instruct | 0.73795 | 0.09777 | 0.63076 | 0.17078 |
| openai/text-embedding-3-large | 0.72946 | 0.09571 | 0.61670 | 0.16739 |
| Salesforce/SFR-Embedding-2_R | 0.71662 | 0.09546 | 0.60589 | 0.16651 |
| upskyy/bge-m3-korean | 0.71895 | 0.09583 | 0.60258 | 0.16712 |
| jhgan/ko-sroberta-multitask | 0.61225 | 0.07826 | 0.48687 | 0.13757 |
<br/>
## Citation
If you find our paper or models helpful, please consider citing them as follows:
```text
@misc{KURE,
  author = {Youngjoon Jang, Junyoung Son, Taemin Lee},
  year = {2024},
  url = {https://github.com/nlpai-lab/KURE}
}

@misc{KoE5,
  author = {NLP & AI Lab and Human-Inspired AI research},
  title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},
  year = {2024},
  publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/nlpai-lab/KoE5}},
}
```