---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
base_model: intfloat/multilingual-e5-small
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
language:
- ko
- en
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/9uN5ypGY-GRGgakLs_s1o.png" width="600">

# ✨ New Version Available

We've released a new and improved version of this model!

[dragonkue/multilingual-e5-small-ko-v2](https://huggingface.co/dragonkue/multilingual-e5-small-ko-v2)

# SentenceTransformer based on intfloat/multilingual-e5-small

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) on datasets that include Korean query-passage pairs for improved performance on Korean retrieval tasks. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

This model is a lightweight Korean retriever, designed for ease of use and strong performance in practical retrieval tasks.
It is ideal for running demos or lightweight applications, offering a good balance between speed and accuracy.

For even higher retrieval performance, we recommend combining it with a reranker.
Suggested reranker models:

- dragonkue/bge-reranker-v2-m3-ko

- BAAI/bge-reranker-v2-m3

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) <!-- at revision c007d7ef6fd86656326059b28395a7a03a7c5846 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 384 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Datasets:** Korean query-passage pairs (the same data used for [dragonkue/snowflake-arctic-embed-l-v2.0-ko](https://huggingface.co/dragonkue/snowflake-arctic-embed-l-v2.0-ko); see [Training Details](#training-details))
- **Languages:** Korean, English
- **License:** apache-2.0

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

**🪶 Lightweight Version Available**

We also introduce a lightweight variant of this model:
[`exp-models/dragonkue-KoEn-E5-Tiny`](https://huggingface.co/exp-models/dragonkue-KoEn-E5-Tiny),
which removes all tokens **except Korean and English** to reduce model size while maintaining performance.

The repository also includes a **GGUF-quantized version**, making it suitable for efficient local or on-device embedding model serving.

> 🔧 For practical deployment, we highly recommend pairing this **lightweight retriever** with a **reranker** model: together they form a powerful and resource-efficient retrieval setup.
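
A minimal sketch of such a two-stage setup, using the reranker suggested above (the placeholder texts and `top_k` cutoff are illustrative):

```python
from sentence_transformers import CrossEncoder, SentenceTransformer

# Stage 1: this lightweight retriever; Stage 2: a cross-encoder reranker.
retriever = SentenceTransformer("dragonkue/multilingual-e5-small-ko")
reranker = CrossEncoder("dragonkue/bge-reranker-v2-m3-ko")

query = "query: your question here"           # E5 models expect the "query: " prefix
passages = ["passage: candidate document 1",   # ...and the "passage: " prefix
            "passage: candidate document 2"]

# Retrieve: rank all passages by cosine similarity to the query.
q_emb = retriever.encode([query])
p_embs = retriever.encode(passages)
scores = retriever.similarity(q_emb, p_embs)[0]
top_k = scores.argsort(descending=True)[:10].tolist()

# Rerank: rescore only the shortlisted passages (rerankers take raw text, no prefixes).
pairs = [[query.removeprefix("query: "), passages[i].removeprefix("passage: ")] for i in top_k]
reranked_scores = reranker.predict(pairs)
```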


### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")
# Run inference
sentences = [
    'query: λΆν•œκ°€μ‘±λ²• λͺ‡ μ°¨ κ°œμ •μ—μ„œ 이혼판결 ν™•μ • ν›„ 3κ°œμ›” 내에 λ“±λ‘μ‹œμ—λ§Œ μœ νš¨ν•˜λ‹€λŠ” 쑰항을 ν™•μ‹€νžˆ ν–ˆμ„κΉŒ?',
    'passage: 1990년에 μ œμ •λœ λΆν•œ 가쑱법은 μ§€κΈˆκΉŒμ§€ 4μ°¨λ‘€ κ°œμ •λ˜μ–΄ ν˜„μž¬μ— 이λ₯΄κ³  μžˆλ‹€. 1993년에 이루어진 제1μ°¨ κ°œμ •μ€ 주둜 κ·œμ •μ˜ 정확성을 κΈ°ν•˜κΈ° μœ„ν•˜μ—¬ λͺ‡λͺ‡ 쑰문을 μˆ˜μ •ν•œ 것이며, 싀체적인 λ‚΄μš©μ„ λ³΄μ™„ν•œ 것은 μƒμ†μ˜ 승인과 포기기간을 μ„€μ •ν•œ 제52μ‘° 정도라고 ν•  수 μžˆλ‹€. 2004년에 이루어진 제2차에 κ°œμ •μ—μ„œλŠ” 제20쑰제3항을 μ‹ μ„€ν•˜μ—¬ μž¬νŒμƒ ν™•μ •λœ μ΄ν˜ΌνŒκ²°μ„ 3κ°œμ›” 내에 등둝해야 이혼의 효λ ₯이 λ°œμƒν•œλ‹€λŠ” 것을 λͺ…ν™•ν•˜κ²Œ ν•˜μ˜€λ‹€. 2007년에 이루어진 제3μ°¨ κ°œμ •μ—μ„œλŠ” λΆ€λͺ¨μ™€ μžλ…€ 관계 λ˜ν•œ 신뢄등둝기관에 λ“±λ‘ν•œ λ•ŒλΆ€ν„° 법적 효λ ₯이 λ°œμƒν•œλ‹€λŠ” 것을 μ‹ μ„€(제25쑰제2ν•­)ν•˜μ˜€λ‹€. λ˜ν•œ λ―Έμ„±λ…„μž, 노동λŠ₯λ ₯ μ—†λŠ” 자의 λΆ€μ–‘κ³Ό κ΄€λ ¨(제37쑰제2ν•­)ν•˜μ—¬ κΈ°μ‘΄μ—λŠ” β€œλΆ€μ–‘λŠ₯λ ₯이 μžˆλŠ” 가정성원이 없을 κ²½μš°μ—λŠ” λ”°λ‘œ μ‚¬λŠ” λΆ€λͺ¨λ‚˜ μžλ…€, μ‘°λΆ€λͺ¨λ‚˜ μ†μžλ…€, ν˜•μ œμžλ§€κ°€ λΆ€μ–‘ν•œλ‹€β€κ³  κ·œμ •ν•˜κ³  μžˆμ—ˆλ˜ 것을 β€œλΆ€μ–‘λŠ₯λ ₯이 μžˆλŠ” 가정성원이 없을 κ²½μš°μ—λŠ” λ”°λ‘œ μ‚¬λŠ” λΆ€λͺ¨λ‚˜ μžλ…€κ°€ λΆ€μ–‘ν•˜λ©° 그듀이 없을 κ²½μš°μ—λŠ” μ‘°λΆ€λͺ¨λ‚˜ μ†μžλ…€, ν˜•μ œμžλ§€κ°€ λΆ€μ–‘ν•œλ‹€β€λ‘œ κ°œμ •ν•˜μ˜€λ‹€.',
    'passage: ν™˜κ²½λ§ˆν¬ μ œλ„, 인증기쀀 λ³€κ²½μœΌλ‘œ κΈ°μ—…λΆ€λ‹΄ 쀄인닀\nν™˜κ²½λ§ˆν¬ μ œλ„ μ†Œκ°œ\nβ–‘ κ°œμš”\nβ—‹ 동일 μš©λ„μ˜ λ‹€λ₯Έ μ œν’ˆμ— λΉ„ν•΄ β€˜μ œν’ˆμ˜ ν™˜κ²½μ„±*’을 κ°œμ„ ν•œ μ œν’ˆμ— λ‘œκ³ μ™€ μ„€λͺ…을 ν‘œμ‹œν•  수 μžˆλ„λ‘ν•˜λŠ” 인증 μ œλ„\nβ€» μ œν’ˆμ˜ ν™˜κ²½μ„± : μž¬λ£Œμ™€ μ œν’ˆμ„ μ œμ‘°β€€μ†ŒλΉ„ νκΈ°ν•˜λŠ” μ „κ³Όμ •μ—μ„œ μ˜€μ—Όλ¬Όμ§ˆμ΄λ‚˜ μ˜¨μ‹€κ°€μŠ€ 등을 λ°°μΆœν•˜λŠ” 정도 및 μžμ›κ³Ό μ—λ„ˆμ§€λ₯Ό μ†ŒλΉ„ν•˜λŠ” 정도 λ“± ν™˜κ²½μ— λ―ΈμΉ˜λŠ” 영ν–₯λ ₯의 정도(γ€Œν™˜κ²½κΈ°μˆ  및 ν™˜κ²½μ‚°μ—… μ§€μ›λ²•γ€μ œ2쑰제5호)\nβ–‘ 법적근거\nβ—‹ γ€Œν™˜κ²½κΈ°μˆ  및 ν™˜κ²½μ‚°μ—… μ§€μ›λ²•γ€μ œ17μ‘°(ν™˜κ²½ν‘œμ§€μ˜ 인증)\nβ–‘ κ΄€λ ¨ κ΅­μ œν‘œμ€€\nβ—‹ ISO 14024(제1μœ ν˜• ν™˜κ²½λΌλ²¨λ§)\nβ–‘ μ μš©λŒ€μƒ\nβ—‹ 사무기기, κ°€μ „μ œν’ˆ, μƒν™œμš©ν’ˆ, κ±΄μΆ•μžμž¬ λ“± 156개 λŒ€μƒμ œν’ˆκ΅°\nβ–‘ μΈμ¦ν˜„ν™©\nβ—‹ 2,737개 κΈ°μ—…μ˜ 16,647개 μ œν’ˆ(2015.12월말 κΈ°μ€€)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

### Direct Usage (Transformers)

You can also use this model directly with the `transformers` library:

```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ["query: λΆν•œκ°€μ‘±λ²• λͺ‡ μ°¨ κ°œμ •μ—μ„œ 이혼판결 ν™•μ • ν›„ 3κ°œμ›” 내에 λ“±λ‘μ‹œμ—λ§Œ μœ νš¨ν•˜λ‹€λŠ” 쑰항을 ν™•μ‹€νžˆ ν–ˆμ„κΉŒ?",
               "passage: 1990년에 μ œμ •λœ λΆν•œ 가쑱법은 μ§€κΈˆκΉŒμ§€ 4μ°¨λ‘€ κ°œμ •λ˜μ–΄ ν˜„μž¬μ— 이λ₯΄κ³  μžˆλ‹€. 1993년에 이루어진 제1μ°¨ κ°œμ •μ€ 주둜 κ·œμ •μ˜ 정확성을 κΈ°ν•˜κΈ° μœ„ν•˜μ—¬ λͺ‡λͺ‡ 쑰문을 μˆ˜μ •ν•œ 것이며, 싀체적인 λ‚΄μš©μ„ λ³΄μ™„ν•œ 것은 μƒμ†μ˜ 승인과 포기기간을 μ„€μ •ν•œ 제52μ‘° 정도라고 ν•  수 μžˆλ‹€. 2004년에 이루어진 제2차에 κ°œμ •μ—μ„œλŠ” 제20쑰제3항을 μ‹ μ„€ν•˜μ—¬ μž¬νŒμƒ ν™•μ •λœ μ΄ν˜ΌνŒκ²°μ„ 3κ°œμ›” 내에 등둝해야 이혼의 효λ ₯이 λ°œμƒν•œλ‹€λŠ” 것을 λͺ…ν™•ν•˜κ²Œ ν•˜μ˜€λ‹€. 2007년에 이루어진 제3μ°¨ κ°œμ •μ—μ„œλŠ” λΆ€λͺ¨μ™€ μžλ…€ 관계 λ˜ν•œ 신뢄등둝기관에 λ“±λ‘ν•œ λ•ŒλΆ€ν„° 법적 효λ ₯이 λ°œμƒν•œλ‹€λŠ” 것을 μ‹ μ„€(제25쑰제2ν•­)ν•˜μ˜€λ‹€. λ˜ν•œ λ―Έμ„±λ…„μž, 노동λŠ₯λ ₯ μ—†λŠ” 자의 λΆ€μ–‘κ³Ό κ΄€λ ¨(제37쑰제2ν•­)ν•˜μ—¬ κΈ°μ‘΄μ—λŠ” β€œλΆ€μ–‘λŠ₯λ ₯이 μžˆλŠ” 가정성원이 없을 κ²½μš°μ—λŠ” λ”°λ‘œ μ‚¬λŠ” λΆ€λͺ¨λ‚˜ μžλ…€, μ‘°λΆ€λͺ¨λ‚˜ μ†μžλ…€, ν˜•μ œμžλ§€κ°€ λΆ€μ–‘ν•œλ‹€β€κ³  κ·œμ •ν•˜κ³  μžˆμ—ˆλ˜ 것을 β€œλΆ€μ–‘λŠ₯λ ₯이 μžˆλŠ” 가정성원이 없을 κ²½μš°μ—λŠ” λ”°λ‘œ μ‚¬λŠ” λΆ€λͺ¨λ‚˜ μžλ…€κ°€ λΆ€μ–‘ν•˜λ©° 그듀이 없을 κ²½μš°μ—λŠ” μ‘°λΆ€λͺ¨λ‚˜ μ†μžλ…€, ν˜•μ œμžλ§€κ°€ λΆ€μ–‘ν•œλ‹€β€λ‘œ κ°œμ •ν•˜μ˜€λ‹€.",
               "passage: ν™˜κ²½λ§ˆν¬ μ œλ„, 인증기쀀 λ³€κ²½μœΌλ‘œ κΈ°μ—…λΆ€λ‹΄ 쀄인닀\nν™˜κ²½λ§ˆν¬ μ œλ„ μ†Œκ°œ\nβ–‘ κ°œμš”\nβ—‹ 동일 μš©λ„μ˜ λ‹€λ₯Έ μ œν’ˆμ— λΉ„ν•΄ β€˜μ œν’ˆμ˜ ν™˜κ²½μ„±*’을 κ°œμ„ ν•œ μ œν’ˆμ— λ‘œκ³ μ™€ μ„€λͺ…을 ν‘œμ‹œν•  수 μžˆλ„λ‘ν•˜λŠ” 인증 μ œλ„\nβ€» μ œν’ˆμ˜ ν™˜κ²½μ„± : μž¬λ£Œμ™€ μ œν’ˆμ„ μ œμ‘°β€€μ†ŒλΉ„ νκΈ°ν•˜λŠ” μ „κ³Όμ •μ—μ„œ μ˜€μ—Όλ¬Όμ§ˆμ΄λ‚˜ μ˜¨μ‹€κ°€μŠ€ 등을 λ°°μΆœν•˜λŠ” 정도 및 μžμ›κ³Ό μ—λ„ˆμ§€λ₯Ό μ†ŒλΉ„ν•˜λŠ” 정도 λ“± ν™˜κ²½μ— λ―ΈμΉ˜λŠ” 영ν–₯λ ₯의 정도(γ€Œν™˜κ²½κΈ°μˆ  및 ν™˜κ²½μ‚°μ—… μ§€μ›λ²•γ€μ œ2쑰제5호)\nβ–‘ 법적근거\nβ—‹ γ€Œν™˜κ²½κΈ°μˆ  및 ν™˜κ²½μ‚°μ—… μ§€μ›λ²•γ€μ œ17μ‘°(ν™˜κ²½ν‘œμ§€μ˜ 인증)\nβ–‘ κ΄€λ ¨ κ΅­μ œν‘œμ€€\nβ—‹ ISO 14024(제1μœ ν˜• ν™˜κ²½λΌλ²¨λ§)\nβ–‘ μ μš©λŒ€μƒ\nβ—‹ 사무기기, κ°€μ „μ œν’ˆ, μƒν™œμš©ν’ˆ, κ±΄μΆ•μžμž¬ λ“± 156개 λŒ€μƒμ œν’ˆκ΅°\nβ–‘ μΈμ¦ν˜„ν™©\nβ—‹ 2,737개 κΈ°μ—…μ˜ 16,647개 μ œν’ˆ(2015.12월말 κΈ°μ€€)"]

tokenizer = AutoTokenizer.from_pretrained('dragonkue/multilingual-e5-small-ko')
model = AutoModel.from_pretrained('dragonkue/multilingual-e5-small-ko')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T)
print(scores.tolist())
```


<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

- This evaluation follows the setup of the KURE GitHub repository (https://github.com/nlpai-lab/KURE).
- We evaluated on all **Korean retrieval benchmarks** registered in [MTEB](https://github.com/embeddings-benchmark/mteb).

### Korean Retrieval Benchmark
- [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): A Korean **ODQA multi-hop retrieval dataset**, translated from StrategyQA.
- [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): A **Korean document retrieval dataset** constructed by parsing PDFs from five domains: **finance, public, medical, legal, and commerce**.
- [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): A **Korean document retrieval dataset** based on Wikipedia.
- [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): A **retrieval dataset** focused on **medical and public health domains** in Korean.
- [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): A **Korean document retrieval dataset** based on FLORES-200.
- [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): A **Wikipedia-based Korean document retrieval dataset**.
- [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): A **cross-domain Korean document retrieval dataset**.
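
All of these are registered as MTEB tasks, so the evaluation can be reproduced with the `mteb` package. A sketch follows (task names come from MTEB's registry and may vary across `mteb` versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Select the Korean splits of the seven retrieval benchmarks listed above.
tasks = mteb.get_tasks(
    tasks=["Ko-StrategyQA", "AutoRAGRetrieval", "MIRACLRetrieval", "PublicHealthQA",
           "BelebeleRetrieval", "MrTidyRetrieval", "XPQARetrieval"],
    languages=["kor"],
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")  # NDCG@10 is reported per task
```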

### Metrics

* Standard metric: NDCG@10

#### Information Retrieval
| Model                                                       |   Size (M params) |   Average |   XPQARetrieval |   PublicHealthQA |   MIRACLRetrieval |   Ko-StrategyQA |   BelebeleRetrieval |   AutoRAGRetrieval |   MrTidyRetrieval |
|:------------------------------------------------------------|----------:|----------:|----------------:|-----------------:|------------------:|----------------:|--------------------:|-------------------:|------------------:|
| BAAI/bge-m3                                                 |       560 |  0.724169 |         0.36075 |          0.80412 |           0.70146 |         0.79405 |             0.93164 |            0.83008 |           0.64708 |
| Snowflake/snowflake-arctic-embed-l-v2.0                     |       560 |  0.724104 |         0.43018 |          0.81679 |           0.66077 |         0.80455 |             0.9271  |            0.83863 |           0.59071 |
| intfloat/multilingual-e5-large                              |       560 |  0.721607 |         0.3571  |          0.82534 |           0.66486 |         0.80348 |             0.94499 |            0.81337 |           0.64211 |
| intfloat/multilingual-e5-base                               |       278 |  0.689429 |         0.3607  |          0.77203 |           0.6227  |         0.76355 |             0.92868 |            0.79752 |           0.58082 |
| **dragonkue/multilingual-e5-small-ko**                      |       118 |  0.688819 |         0.34871 |          0.79729 |           0.61113 |         0.76173 |             0.9297  |            0.86184 |           0.51133 |
| **exp-models/dragonkue-KoEn-E5-Tiny**                       |        37 |  0.687496 |         0.34735 |          0.7925  |           0.6143  |         0.75978 |             0.93018 |            0.86503 |           0.50333 |
| intfloat/multilingual-e5-small                              |       118 |  0.670906 |         0.33003 |          0.73668 |           0.61238 |         0.75157 |             0.90531 |            0.80068 |           0.55969 |
| ibm-granite/granite-embedding-278m-multilingual             |       278 |  0.616466 |         0.23058 |          0.77668 |           0.59216 |         0.71762 |             0.83231 |            0.70226 |           0.46365 |
| ibm-granite/granite-embedding-107m-multilingual             |       107 |  0.599759 |         0.23058 |          0.73209 |           0.58413 |         0.70531 |             0.82063 |            0.68243 |           0.44314 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |       118 |  0.409766 |         0.21345 |          0.67409 |           0.25676 |         0.45903 |             0.71491 |            0.42296 |           0.12716 |

#### Performance Comparison by Model Size (Based on Average NDCG@10)
<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/Utunk7FbZsTDEVsOVUms1.png" width="1000"/>

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Datasets
This model was fine-tuned on the same dataset used for [dragonkue/snowflake-arctic-embed-l-v2.0-ko](https://huggingface.co/dragonkue/snowflake-arctic-embed-l-v2.0-ko), which consists of Korean query-passage pairs.
The training objective was to improve retrieval performance specifically for Korean-language tasks.

### Training Methods

Following the training approach used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, this model constructs in-batch negatives based on clustered passages. In addition, we introduce GISTEmbedLoss with a configurable margin.

**📈 Margin-based Training Results**
- Using the standard MNR (Multiple Negatives Ranking) loss alone resulted in decreased performance.

- The original GISTEmbedLoss (without margin) yielded modest improvements of around +0.8 NDCG@10.

- Applying a margin led to performance gains of up to +1.5 NDCG@10.

- In other words, tuning the margin alone nearly doubled the gain of the margin-free loss (+0.8 → +1.5 NDCG@10), showing how sensitive and effective margin scaling is.

This margin-based approach extends the idea proposed in the NV-Retriever paper, which originally filtered false negatives during hard negative sampling.
We adapt this to in-batch negatives, treating false negatives as dynamic samples guided by margin-based filtering.

<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/IpDDTshuZ5noxPOdm6gVk.png" width="800"/>

The sentence-transformers library now supports GISTEmbedLoss with margin configuration, making it easy to integrate into any training pipeline.

You can install the latest version with:

```bash
pip install -U sentence-transformers
```
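
Once installed, the margin is configured directly on the loss. A minimal sketch, assuming the `margin_strategy`/`margin` arguments available as of sentence-transformers 4.1 (the model, guide, and margin value below are illustrative, not the exact training configuration):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import GISTEmbedLoss

model = SentenceTransformer("intfloat/multilingual-e5-small")
# The guide model scores in-batch pairs so that likely false negatives can be masked out.
guide = SentenceTransformer("intfloat/multilingual-e5-small")

loss = GISTEmbedLoss(
    model=model,
    guide=guide,
    temperature=0.01,
    margin_strategy="absolute",  # "absolute" or "relative" margin filtering
    margin=0.1,                  # illustrative value; tune per the results above
)
```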


### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 20000
- `per_device_eval_batch_size`: 4096
- `learning_rate`: 0.00025
- `num_train_epochs`: 3
- `warmup_ratio`: 0.05
- `fp16`: True
- `dataloader_drop_last`: True
- `batch_sampler`: no_duplicates

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 20000
- `per_device_eval_batch_size`: 4096
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 0.00025
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 3
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.05
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: True
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional

</details>


### Framework Versions
- Python: 3.11.10
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1

## FAQ
**1. Do I need to add the prefix "query: " and "passage: " to input texts?**

Yes, this is how the model was trained; otherwise you will see a performance degradation.

Here are some rules of thumb:

- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
- Use the "query: " prefix if you want to use embeddings as features, such as for linear probing classification or clustering.

**2. Why are the cosine similarity scores distributed between 0.7 and 1.0?**

This is known and expected behavior, as we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this should not be an issue.
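
To see why, note how sharply a temperature of 0.01 scales the softmax over similarities during training: even small gaps within the 0.7 to 1.0 range dominate the distribution. A quick illustration:

```python
import torch
import torch.nn.functional as F

# Cosine similarities in the typical 0.7-1.0 range...
sims = torch.tensor([0.95, 0.80, 0.75])
# ...become an almost one-hot distribution at temperature 0.01,
# so only the relative ordering of the scores matters.
print(F.softmax(sims / 0.01, dim=0))  # ~[1.0, 3.1e-07, 2.1e-09]
```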

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### Base Model
```bibtex
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```
#### NV-Retriever
```bibtex
@article{moreira2024nvretriever,
  title     = {NV-Retriever: Improving text embedding models with effective hard-negative mining},
  author    = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
  journal   = {arXiv preprint arXiv:2407.15831},
  year      = {2024},
  url       = {https://arxiv.org/abs/2407.15831},
  doi       = {10.48550/arXiv.2407.15831}
}
```

#### KURE
```bibtex
@misc{KURE,
  author = {Jang, Youngjoon and Son, Junyoung and Lee, Taemin},
  year = {2024},
  url = {https://github.com/nlpai-lab/KURE}
}
```

## Limitations

Long texts will be truncated to at most 512 tokens.

## Acknowledgements
Special thanks to lemon-mint for their valuable contributions to optimizing and compressing this model.

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->