---
language:
- ko
- en
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
base_model:
- klue/roberta-large
---

# Frony Embed V2 (medium)
This is an efficient embedding model designed specifically for the Korean language.
It has been trained on a diverse set of data sources, including AI Hub (AI 허브), to ensure robust performance across a wide range of retrieval tasks.
The model demonstrates strong retrieval capabilities across:

* Korean–Korean
* Korean–English
* English–Korean

To support resource-constrained environments, the model is also compatible with Matryoshka embeddings, enabling retrieval at reduced dimensions **(e.g., half of the original size)** without significant performance loss.
All training and data preprocessing were performed on **a single GPU (46 GB VRAM)**, showcasing not only the model's effectiveness but also its efficiency.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base Model:** klue/roberta-large
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions (512 with Matryoshka truncation)
- **Similarity Function:** Cosine Similarity
- **Languages:** ko, en
- **License:** apache-2.0

### Datasets
This model was trained on data from many sources, including **AI Hub**.<br>
In total, 500,000 query–document pairs were used for training.<br>

### Training Details
The overall training process was conducted with reference to snowflake-arctic-embed-2.0.<br>
**In V2, a three-stage training process was introduced as a key component of the overall learning strategy:** adaptation-training, pre-training, and post-training.

* In the adaptation-training stage, preliminary experiments showed that multi-vector retrieval consistently outperformed standard dense retrieval, so we first trained the model with a multi-vector retrieval objective.
* In the pre-training stage, we introduced knowledge distillation, **where the multi-vector retrieval scores were distilled into the dense retrieval scores**. This allowed the model to capture fine-grained token-level similarity signals while being trained with in-batch negatives.
* In the post-training stage, we used the multilingual-e5-large model to mine hard negatives (specifically, the top 4 candidates with a similarity score below a 99% threshold) and fine-tuned the model further on these examples; a sketch of this mining step follows the list.
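
Below is a minimal sketch of that mining step. It assumes a small in-memory corpus, reads the 99% rule as "keep candidates scoring below 99% of the gold passage's score", and uses illustrative toy texts; none of this is taken from the actual training pipeline.

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Teacher retriever used for mining, as described above.
# multilingual-e5-large expects "query: " / "passage: " prefixes.
miner = SentenceTransformer("intfloat/multilingual-e5-large")

queries = ["query: 경기 시간은 얼마나 되나요?"]            # toy query (illustrative)
gold_idx = [0]                                             # index of each query's gold passage
corpus = [
    "passage: 경기는 전후반 각 45분으로 진행된다.",        # gold passage
    "passage: 하프타임 휴식은 15분을 넘지 않는다.",        # plausible distractor
    "passage: 심판은 경기 시작 전에 장비를 점검한다.",
]

q_emb = miner.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
c_emb = miner.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(q_emb, c_emb)  # (num_queries, corpus_size)

hard_negatives = []
for qi, gold in enumerate(gold_idx):
    ranked = torch.argsort(scores[qi], descending=True)
    pos_score = scores[qi, gold]
    # Keep the top 4 non-gold candidates scoring below 99% of the positive score,
    # so near-duplicates of the gold passage are not mislabeled as negatives.
    negs = [int(i) for i in ranked
            if int(i) != gold and scores[qi, i] < 0.99 * pos_score][:4]
    hard_negatives.append(negs)

print(hard_negatives)
```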

Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
The types of data augmentation applied are as follows (a style-transfer sketch follows the table):

| Augmentation* | Description |
|-----------|-------------|
| Pair concatenation | Multi-query & multi-passage |
| Language transfer | Korean to English on queries & passages |
| Style transfer | Plain sentences to Markdown descriptions |

*Augmentation was carried out using Gemma-3-12B.*
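
A minimal sketch of the Markdown style transfer is shown below. It uses the smaller google/gemma-3-1b-it checkpoint so the example stays runnable on modest hardware (the authors used Gemma-3-12B), and the prompt wording and example passage are our assumptions, not the authors' pipeline.

```python
from transformers import pipeline

# Stand-in checkpoint for the Gemma-3-12B used by the authors (assumption).
generator = pipeline("text-generation", model="google/gemma-3-1b-it", device_map="auto")

passage = "경기는 전후반 각 45분으로 진행되며, 하프타임 휴식은 15분을 넘지 않는다."
messages = [{
    "role": "user",
    "content": "Rewrite the following Korean sentences as a Markdown document "
               "with a heading and a bullet list, keeping the language Korean:\n" + passage,
}]

out = generator(messages, max_new_tokens=256)
# With chat-style input, the pipeline returns the full conversation;
# the last message is the model's Markdown rewrite.
print(out[0]["generated_text"][-1]["content"])
```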

### Evaluation
The evaluation consists of five dataset groups; the table below reports the average retrieval performance across them.
Three groups are subsets extracted from AI Hub datasets.
One group is based on a specific sports regulation PDF, for which synthetic query and **Markdown-style passage** pairs were generated using GPT-4o-mini.
The final group is a concatenation of the four aforementioned groups, providing a comprehensive mixed set.

| Architecture                                           | Open/Closed | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
|--------------------------------------------------------|-----------|-----------|-----------|-----------|------------|
| upstage-large                                          | Closed | 0.6323    | 0.8522    | 0.9068    | 0.9459     |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko             | Open   | 0.6612    | 0.8396    | 0.8931    | 0.9390     |
| **FronyAI/frony-embed-medium-ko-v2**                       | Open   | **0.6805**    | **0.8375**    | 0.8819    | 0.9206     |
| FronyAI/frony-embed-medium-arctic-ko-v2.5              | Open   | 0.6942    | 0.8361    | 0.8807    | 0.9197     |
| FronyAI/frony-embed-medium-arctic-ko-v2.5 (half dim)   | Open   | 0.6778    | 0.8277    | 0.8726    | 0.9129     |
| **FronyAI/frony-embed-medium-ko-v2 (half dim)**            | Open   | **0.6722**    | **0.8274**    | 0.8712    | 0.9157     |
| nlpai-lab/KURE-v1                                      | Open   | 0.6434    | 0.8240    | 0.8788    | 0.9285     |
| FronyAI/frony-embed-medium-ko-v1                       | Open   | 0.6649    | 0.8040    | 0.8458    | 0.8876     |
| FronyAI/frony-embed-medium-ko-v1 (half dim)            | Open   | 0.6520    | 0.7923    | 0.8361    | 0.8796     |
| BAAI/bge-m3                                            | Open   | 0.5849    | 0.7763    | 0.8420    | 0.8985     |
| intfloat/multilingual-e5-large                         | Open   | 0.5764    | 0.7630    | 0.8267    | 0.8891     |
| Snowflake/snowflake-arctic-embed-l-v2.0                | Open   | 0.5726    | 0.7591    | 0.8232    | 0.8917     |
| jinaai/jina-embeddings-v3                              | Open   | 0.5270    | 0.7242    | 0.7953    | 0.8644     |
| openai-text-embedding-3-large                          | Closed | 0.4903    | 0.6621    | 0.7316    | 0.8149     |
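
Here Accuracy@k is the fraction of queries whose gold passage appears among the top-k retrieved passages. A minimal sketch of the metric, using a toy corpus in place of the evaluation sets (the example texts and the one-to-one gold mapping are illustrative):

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")

# Toy setup: queries[i] is paired with passages[i] as its gold answer.
queries = ["<Q>경기 시간은 얼마나 되나요?", "<Q>하프타임은 몇 분인가요?"]
passages = ["<P>경기는 전후반 각 45분으로 진행된다.", "<P>하프타임 휴식은 15분을 넘지 않는다."]

q_emb = model.encode(queries, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)

scores = util.cos_sim(q_emb, p_emb)   # (num_queries, num_passages)
gold = torch.arange(len(queries))     # gold passage index for each query

def accuracy_at_k(scores: torch.Tensor, gold: torch.Tensor, k: int) -> float:
    # A query counts as a hit if its gold passage is among the top-k scores.
    topk = scores.topk(k=min(k, scores.size(1)), dim=1).indices
    hits = (topk == gold.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

for k in (1, 3, 5, 10):
    print(f"Accuracy@{k}: {accuracy_at_k(scores, gold, k):.4f}")
```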

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model and run inference:
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")

# Run inference
# '<Q>' is the special token prepended to queries.
queries = [
    '<Q>안녕하세요',
]
query_embeddings = model.encode(queries)

# '<P>' is the special token prepended to passages.
passages = [
    '<P>반갑습니다',
]
passage_embeddings = model.encode(passages)

# Matryoshka embeddings (half of the original dimension):
# truncate to the first 512 dimensions, then re-normalize.
embeddings = model.encode(queries, normalize_embeddings=False, convert_to_tensor=True)[:, :512]
embeddings = F.normalize(embeddings, p=2, dim=-1)
```
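
For retrieval, query and passage embeddings are compared with cosine similarity; a short example using `sentence_transformers.util.cos_sim` (the second passage is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")

query_embeddings = model.encode(["<Q>안녕하세요"], convert_to_tensor=True)
passage_embeddings = model.encode(["<P>반갑습니다", "<P>내일 날씨는 맑습니다"], convert_to_tensor=True)

# One row per query, one column per passage; higher means more relevant.
scores = util.cos_sim(query_embeddings, passage_embeddings)
print(scores)
```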

## Contact
Feel free to open an issue or pull request if you have any questions or suggestions about this project.
You can also reach out by email (flash659@gmail.com).