| | --- |
| | tags: |
| | - sentence-transformers |
| | - sentence-similarity |
| | - feature-extraction |
| | - generated_from_trainer |
| | - dataset_size:1879136 |
| | - loss:CachedGISTEmbedLoss |
| | license: mit |
| | metrics: |
| | - recall |
| | - precision |
| | - f1 |
| | base_model: |
| | - nlpai-lab/KURE-v1 |
| | library_name: sentence-transformers |
| | --- |
| | |
| | # KURE-v1
| |
|
| | Introducing KURE-v1, the Korea University Retrieval Embedding model.
| | It performs strongly on Korean text retrieval, outperforming most multilingual embedding models on the benchmarks reported below.
| | To our knowledge, it is one of the best publicly available Korean retrieval models.
| |
|
| | For details, visit the [KURE repository](https://github.com/nlpai-lab/KURE) |
| |
|
| | --- |
| |
|
| | ## Model Versions |
| | | Model Name | Dimension | Sequence Length | Introduction | |
| | |:----:|:---:|:---:|:---:| |
| | | [KURE-v1](https://huggingface.co/nlpai-lab/KURE-v1) | 1024 | 8192 | Fine-tuned [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) with Korean data via [CachedGISTEmbedLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedgistembedloss) |
| | | [KoE5](https://huggingface.co/nlpai-lab/KoE5) | 1024 | 512 | Fine-tuned [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) with [ko-triplet-v1.0](https://huggingface.co/datasets/nlpai-lab/ko-triplet-v1.0) via [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) | |
| |
|
| | ## Model Description |
| |
|
| | This is the model card of a 🤗 transformers model that has been pushed to the Hub.
| |
|
| | - **Developed by:** [NLP&AI Lab](http://nlp.korea.ac.kr/) |
| | - **Language(s) (NLP):** Korean, English |
| | - **License:** MIT |
| | - **Finetuned from model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) |
| |
|
| | ## Example code |
| | ### Install Dependencies |
| | First install the Sentence Transformers library: |
| |
|
| | ```bash |
| | pip install -U sentence-transformers |
| | ``` |
| | ### Python code |
| | Then you can load this model and run inference. |
| | ```python |
| | from sentence_transformers import SentenceTransformer |
| | |
| | # Download from the 🤗 Hub
| | model = SentenceTransformer("nlpai-lab/KURE-v1") |
| | |
| | # Run inference |
| | sentences = [
| |     # Query: how the Constitution and the Court Organization Act enable diverse legal approaches
| |     '헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',
| |     # Passage 1: implications and directions for improving the composition of the Supreme Court
| |     '4. 시사점과 개선방향 앞서 살펴본 바와 같이 우리 헌법과 ｢법원조직 법｣은 대법원 구성을 다양화하여 기본권 보장과 민주주의 확립에 있어 다각적인 법적 모색을 가능하게 하는 것을 근본 규범으로 하고 있다. 더욱이 합의체로서의 대법원 원리를 채택하고 있는 것 역시 그 구성의 다양성을 요청하는 것으로 해석된다. 이와 같은 관점에서 볼 때 현직 법원장급 고위법관을 중심으로 대법원을 구성하는 관행은 개선할 필요가 있는 것으로 보인다.',
| |     # Passage 2: German Federal Constitutional Court rulings on broadcasting court proceedings
| |     '연방헌법재판소는 2001년 1월 24일 5:3의 다수견해로 「법원조직법」 제169조 제2문이 헌법에 합치된다는 판결을 내렸음 ○ 5인의 다수 재판관은 소송관계인의 인격권 보호, 공정한 절차의 보장과 방해받지 않는 법과 진실 발견 등을 근거로 하여 텔레비전 촬영에 대한 절대적인 금지를 헌법에 합치하는 것으로 보았음 ○ 그러나 나머지 3인의 재판관은 행정법원의 소송절차는 특별한 인격권 보호의 이익도 없으며, 텔레비전 공개주의로 인해 법과 진실 발견의 과정이 언제나 위태롭게 되는 것은 아니라면서 반대의견을 제시함 ○ 왜냐하면 행정법원의 소송절차에서는 소송당사자가 개인적으로 직접 심리에 참석하기보다는 변호사가 참석하는 경우가 많으며, 심리대상도 사실문제가 아닌 법률문제가 대부분이기 때문이라는 것임 □ 한편, 연방헌법재판소는 「연방헌법재판소법」(Bundesverfassungsgerichtsgesetz: BVerfGG) 제17a조에 따라 제한적이나마 재판에 대한 방송을 허용하고 있음 ○ 「연방헌법재판소법」 제17조에서 「법원조직법」 제14절 내지 제16절의 규정을 준용하도록 하고 있지만, 녹음이나 촬영을 통한 재판공개와 관련하여서는 「법원조직법」과 다른 내용을 규정하고 있음',
| | ]
| | embeddings = model.encode(sentences) |
| | print(embeddings.shape) |
| | # (3, 1024)
| | |
| | # Get the similarity scores for the embeddings |
| | similarities = model.similarity(embeddings, embeddings) |
| | print(similarities) |
| | # Results for KURE-v1 |
| | # tensor([[1.0000, 0.6967, 0.5306], |
| | # [0.6967, 1.0000, 0.4427], |
| | # [0.5306, 0.4427, 1.0000]]) |
| | ``` |
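The similarity matrix printed above can be read as a small retrieval result: row 0 holds the query's score against every sentence. The following is a minimal sketch of how the two passages would be ranked for that query. It assumes the scores are copied by hand from the printed output, and `rank_passages` is a hypothetical helper written for illustration, not part of sentence-transformers.

```python
# Similarity scores copied from the printed KURE-v1 output above:
# index 0 is the query itself, indices 1 and 2 are the two passages.
query_scores = [1.0000, 0.6967, 0.5306]


def rank_passages(scores, query_index=0):
    """Return (passage_index, score) pairs sorted by descending similarity,
    skipping the query's similarity with itself."""
    candidates = [(i, s) for i, s in enumerate(scores) if i != query_index]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


ranking = rank_passages(query_scores)
print(ranking)
# [(1, 0.6967), (2, 0.5306)]
```

In this example, the passage on the composition of the Supreme Court (index 1) ranks above the passage on the German Federal Constitutional Court (index 2).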
| |
|
| | ## Training Details |
| |
|
| | ### Training Data |
| |
|
| | #### KURE-v1 |
| | - Korean (query, document, hard negative) data, with 5 hard negatives per example
| | - About 2,000,000 examples (the card metadata lists `dataset_size:1879136`)
| | |
| | ### Training Procedure |
| | - **loss:** **[CachedGISTEmbedLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedgistembedloss)** from sentence-transformers
| | - **batch size:** 4096 |
| | - **learning rate:** 2e-05 |
| | - **epochs:** 1 |
| | |
| | ## Evaluation |
| | ### Metrics |
| | - Recall, Precision, NDCG, F1 |
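For readers unfamiliar with these metrics, the sketch below shows one common way to compute them at a cutoff k for a single query. It assumes binary relevance and is only an illustration; the actual evaluation code is in the KURE repository and may differ in detail. The function name `metrics_at_k` and the document IDs are hypothetical.

```python
import math


def metrics_at_k(ranked_ids, relevant_ids, k):
    """Recall, precision, F1 and NDCG at cutoff k for one query (binary relevance)."""
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    hits = [1 if doc_id in relevant else 0 for doc_id in top_k]
    num_hits = sum(hits)

    recall = num_hits / len(relevant) if relevant else 0.0
    precision = num_hits / k
    f1 = 2 * precision * recall / (precision + recall) if num_hits else 0.0

    # DCG discounts each hit by log2(position + 1), positions starting at 1;
    # the ideal DCG places all relevant documents at the top of the ranking.
    dcg = sum(hit / math.log2(rank + 2) for rank, hit in enumerate(hits))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    ndcg = dcg / idcg if idcg else 0.0

    return {"recall": recall, "precision": precision, "f1": f1, "ndcg": ndcg}


# Hypothetical ranking for one query with two relevant documents.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = ["d1", "d2"]
for k in (1, 3, 5):
    print(k, metrics_at_k(ranked, relevant, k))
```

Under binary relevance, precision and NDCG coincide at k = 1, which is consistent with the identical Precision_top1 and NDCG_top1 columns in the Top-k 1 table.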
| | ### Benchmark Datasets |
| | - [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): Korean ODQA multi-hop retrieval dataset (translated from StrategyQA)
| | - [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): Korean document retrieval dataset built by parsing PDFs from five domains: finance, public sector, healthcare, legal, and commerce
| | - [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): Wikipedia-based Korean document retrieval dataset
| | - [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): Korean document retrieval dataset for the medical and public health domain
| | - [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): FLORES-200-based Korean document retrieval dataset
| | - [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): Wikipedia-based Korean document retrieval dataset
| | - [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR): Korean long-document retrieval dataset covering various domains
| | - [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): Korean document retrieval dataset covering various domains
| | |
| | ## Results |
| | |
| | The tables below report results for every model, averaged over all benchmark datasets.
| | Detailed results are available on [KURE Github](https://github.com/nlpai-lab/KURE/tree/main/eval/results).
| | ### Top-k 1 |
| | | Model | Average Recall_top1 | Average Precision_top1 | Average NDCG_top1 | Average F1_top1 | |
| | |-----------------------------------------|----------------------|------------------------|-------------------|-----------------| |
| | | **nlpai-lab/KURE-v1** | **0.52640** | **0.60551** | **0.60551** | **0.55784** | |
| | | dragonkue/BGE-m3-ko | 0.52361 | 0.60394 | 0.60394 | 0.55535 | |
| | | BAAI/bge-m3 | 0.51778 | 0.59846 | 0.59846 | 0.54998 | |
| | | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.51246 | 0.59384 | 0.59384 | 0.54489 | |
| | | nlpai-lab/KoE5 | 0.50157 | 0.57790 | 0.57790 | 0.53178 | |
| | | intfloat/multilingual-e5-large | 0.50052 | 0.57727 | 0.57727 | 0.53122 | |
| | | jinaai/jina-embeddings-v3 | 0.48287 | 0.56068 | 0.56068 | 0.51361 | |
| | | BAAI/bge-multilingual-gemma2 | 0.47904 | 0.55472 | 0.55472 | 0.50916 | |
| | | intfloat/multilingual-e5-large-instruct | 0.47842 | 0.55435 | 0.55435 | 0.50826 | |
| | | intfloat/multilingual-e5-base | 0.46950 | 0.54490 | 0.54490 | 0.49947 | |
| | | intfloat/e5-mistral-7b-instruct | 0.46772 | 0.54394 | 0.54394 | 0.49781 | |
| | | Alibaba-NLP/gte-multilingual-base | 0.46469 | 0.53744 | 0.53744 | 0.49353 | |
| | | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46633 | 0.53625 | 0.53625 | 0.49429 | |
| | | openai/text-embedding-3-large | 0.44884 | 0.51688 | 0.51688 | 0.47572 | |
| | | Salesforce/SFR-Embedding-2_R | 0.43748 | 0.50815 | 0.50815 | 0.46504 | |
| | | upskyy/bge-m3-korean | 0.43125 | 0.50245 | 0.50245 | 0.45945 | |
| | | jhgan/ko-sroberta-multitask | 0.33788 | 0.38497 | 0.38497 | 0.35678 | |
| |
|
| | ### Top-k 3 |
| | | Model | Average Recall_top3 | Average Precision_top3 | Average NDCG_top3 | Average F1_top3 |
| | |-----------------------------------------|----------------------|------------------------|-------------------|-----------------| |
| | | **nlpai-lab/KURE-v1** | **0.68678** | **0.28711** | **0.65538** | **0.39835** | |
| | | dragonkue/BGE-m3-ko | 0.67834 | 0.28385 | 0.64950 | 0.39378 | |
| | | BAAI/bge-m3 | 0.67526 | 0.28374 | 0.64556 | 0.39291 | |
| | | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.67128 | 0.28193 | 0.64042 | 0.39072 | |
| | | intfloat/multilingual-e5-large | 0.65807 | 0.27777 | 0.62822 | 0.38423 | |
| | | nlpai-lab/KoE5 | 0.65174 | 0.27329 | 0.62369 | 0.37882 | |
| | | BAAI/bge-multilingual-gemma2 | 0.64415 | 0.27416 | 0.61105 | 0.37782 | |
| | | jinaai/jina-embeddings-v3 | 0.64116 | 0.27165 | 0.60954 | 0.37511 | |
| | | intfloat/multilingual-e5-large-instruct | 0.64353 | 0.27040 | 0.60790 | 0.37453 | |
| | | Alibaba-NLP/gte-multilingual-base | 0.63744 | 0.26404 | 0.59695 | 0.36764 | |
| | | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.63163 | 0.25937 | 0.59237 | 0.36263 | |
| | | intfloat/multilingual-e5-base | 0.62099 | 0.26144 | 0.59179 | 0.36203 | |
| | | intfloat/e5-mistral-7b-instruct | 0.62087 | 0.26144 | 0.58917 | 0.36188 | |
| | | openai/text-embedding-3-large | 0.61035 | 0.25356 | 0.57329 | 0.35270 | |
| | | Salesforce/SFR-Embedding-2_R | 0.60001 | 0.25253 | 0.56346 | 0.34952 | |
| | | upskyy/bge-m3-korean | 0.59215 | 0.25076 | 0.55722 | 0.34623 | |
| | | jhgan/ko-sroberta-multitask | 0.46930 | 0.18994 | 0.43293 | 0.26696 | |
| | |
| | ### Top-k 5 |
| | | Model | Average Recall_top5 | Average Precision_top5 | Average NDCG_top5 | Average F1_top5 |
| | |-----------------------------------------|----------------------|------------------------|-------------------|-----------------| |
| | | **nlpai-lab/KURE-v1** | **0.73851** | **0.19130** | **0.67479** | **0.29903** | |
| | | dragonkue/BGE-m3-ko | 0.72517 | 0.18799 | 0.66692 | 0.29401 | |
| | | BAAI/bge-m3 | 0.72954 | 0.18975 | 0.66615 | 0.29632 | |
| | | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.72962 | 0.18875 | 0.66236 | 0.29542 | |
| | | nlpai-lab/KoE5 | 0.70820 | 0.18287 | 0.64499 | 0.28628 | |
| | | intfloat/multilingual-e5-large | 0.70124 | 0.18316 | 0.64402 | 0.28588 | |
| | | BAAI/bge-multilingual-gemma2 | 0.70258 | 0.18556 | 0.63338 | 0.28851 | |
| | | jinaai/jina-embeddings-v3 | 0.69933 | 0.18256 | 0.63133 | 0.28505 | |
| | | intfloat/multilingual-e5-large-instruct | 0.69018 | 0.17838 | 0.62486 | 0.27933 | |
| | | Alibaba-NLP/gte-multilingual-base | 0.69365 | 0.17789 | 0.61896 | 0.27879 | |
| | | intfloat/multilingual-e5-base | 0.67250 | 0.17406 | 0.61119 | 0.27247 | |
| | | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.67447 | 0.17114 | 0.60952 | 0.26943 | |
| | | intfloat/e5-mistral-7b-instruct | 0.67449 | 0.17484 | 0.60935 | 0.27349 | |
| | | openai/text-embedding-3-large | 0.66365 | 0.17004 | 0.59389 | 0.26677 | |
| | | Salesforce/SFR-Embedding-2_R | 0.65622 | 0.17018 | 0.58494 | 0.26612 | |
| | | upskyy/bge-m3-korean | 0.65477 | 0.17015 | 0.58073 | 0.26589 | |
| | | jhgan/ko-sroberta-multitask | 0.53136 | 0.13264 | 0.45879 | 0.20976 | |
| |
|
| | ### Top-k 10 |
| | | Model | Average Recall_top10 | Average Precision_top10 | Average NDCG_top10 | Average F1_top10 |
| | |-----------------------------------------|----------------------|------------------------|-------------------|-----------------| |
| | | **nlpai-lab/KURE-v1** | **0.79682** | **0.10624** | **0.69473** | **0.18524** | |
| | | dragonkue/BGE-m3-ko | 0.78450 | 0.10492 | 0.68748 | 0.18288 | |
| | | BAAI/bge-m3 | 0.79195 | 0.10592 | 0.68723 | 0.18456 | |
| | | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.78669 | 0.10462 | 0.68189 | 0.18260 | |
| | | intfloat/multilingual-e5-large | 0.75902 | 0.10147 | 0.66370 | 0.17693 | |
| | | nlpai-lab/KoE5 | 0.75296 | 0.09937 | 0.66012 | 0.17369 | |
| | | BAAI/bge-multilingual-gemma2 | 0.76153 | 0.10364 | 0.65330 | 0.18003 | |
| | | jinaai/jina-embeddings-v3 | 0.76277 | 0.10240 | 0.65290 | 0.17843 | |
| | | intfloat/multilingual-e5-large-instruct | 0.74851 | 0.09888 | 0.64451 | 0.17283 | |
| | | Alibaba-NLP/gte-multilingual-base | 0.75631 | 0.09938 | 0.64025 | 0.17363 | |
| | | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.74092 | 0.09607 | 0.63258 | 0.16847 | |
| | | intfloat/multilingual-e5-base | 0.73512 | 0.09717 | 0.63216 | 0.16977 | |
| | | intfloat/e5-mistral-7b-instruct | 0.73795 | 0.09777 | 0.63076 | 0.17078 | |
| | | openai/text-embedding-3-large | 0.72946 | 0.09571 | 0.61670 | 0.16739 | |
| | | Salesforce/SFR-Embedding-2_R | 0.71662 | 0.09546 | 0.60589 | 0.16651 | |
| | | upskyy/bge-m3-korean | 0.71895 | 0.09583 | 0.60258 | 0.16712 | |
| | | jhgan/ko-sroberta-multitask | 0.61225 | 0.07826 | 0.48687 | 0.13757 | |
| | <br/> |
| | |
| | ## Citation |
| | |
| | If you find our paper or models helpful, please consider citing them as follows:
| | ```text |
| | @misc{KURE, |
| | publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee}, |
| | year = {2024}, |
| | url = {https://github.com/nlpai-lab/KURE} |
| | }
| | |
| | @misc{KoE5, |
| | author = {NLP & AI Lab and Human-Inspired AI research}, |
| | title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance}, |
| | year = {2024}, |
| | publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee}, |
| | journal = {GitHub repository}, |
| | howpublished = {\url{https://github.com/nlpai-lab/KoE5}}, |
| | } |
| | ``` |