--- |
|
|
language: |
|
|
- ko |
|
|
- en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- sentence-similarity |
|
|
- feature-extraction |
|
|
pipeline_tag: sentence-similarity |
|
|
library_name: sentence-transformers |
|
|
base_model: |
|
|
- klue/roberta-large |
|
|
--- |
|
|
|
|
|
# Frony Embed V2 (medium) |
|
|
This is an efficient embedding model designed specifically for the Korean language. |
|
|
It has been trained on a diverse set of data sources, including AI Hub (AI 허브), to ensure robust performance across a wide range of retrieval tasks.
|
|
The model demonstrates strong retrieval capabilities across:
|
|
|
|
|
* Korean–Korean |
|
|
* Korean–English |
|
|
* English–Korean |
|
|
|
|
|
To support resource-constrained environments, the model is also compatible with Matryoshka embeddings, enabling retrieval at reduced dimensions **(e.g., half of the original size)** without significant performance loss.
|
|
All training and data preprocessing were performed on **a single GPU (46 GB VRAM)**, showcasing not only the model’s effectiveness but also its efficiency.
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
- **Model Type:** Sentence Transformer |
|
|
- **Base Model:** klue/roberta-large |
|
|
- **Maximum Sequence Length:** 512 tokens |
|
|
- **Output Dimensionality:** 1024 dimensions (512 when truncated via Matryoshka)
|
|
- **Similarity Function:** Cosine Similarity |
|
|
- **Languages:** ko, en |
|
|
- **License:** apache-2.0 |
|
|
|
|
|
### Datasets |
|
|
This model was trained on data from many sources, including **AI Hub (AI 허브)**.<br>


The training set contains 500,000 query–document pairs in total.<br>
|
|
|
|
|
### Training Details |
|
|
The overall training process was conducted with reference to snowflake-arctic-embed-2.0.<br>
|
|
**In V2, a three-stage training process was introduced as a key component of the overall learning strategy:** adaptation training, pre-training, and post-training.
|
|
|
|
|
* In the adaptation-training stage, preliminary experiments showed that multi-vector retrieval consistently outperformed standard dense retrieval, so we first trained the model with a multi-vector retrieval objective.
|
|
* In the pre-training stage, we introduced knowledge distillation, **where the multi-vector retrieval loss was distilled into the dense retrieval loss**. This allowed the model to capture fine-grained token-level similarity signals while being trained with in-batch negatives. |
|
|
* In the post-training stage, we used the multilingual-e5-large model to mine hard negatives (specifically, the top 4 samples with a similarity score below a 99% threshold) and fine-tuned the model further on these examples (see the sketch after this list).
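
A minimal sketch of this mining step is shown below. It is illustrative rather than the exact training script: the in-memory data layout is assumed, and the 99% threshold is interpreted as a fraction of the gold passage's similarity score, which is our reading of the description above.

```python
from sentence_transformers import SentenceTransformer, util

# Mining model used in the post-training stage.
miner = SentenceTransformer("intfloat/multilingual-e5-large")

def mine_hard_negatives(query, positive, candidates, top_k=4, threshold=0.99):
    """Keep the top_k candidates scoring below `threshold * positive_score`."""
    # E5 models expect "query: " / "passage: " prefixes.
    q_emb = miner.encode("query: " + query, convert_to_tensor=True)
    p_embs = miner.encode(
        ["passage: " + p for p in [positive] + candidates],
        convert_to_tensor=True,
    )
    scores = util.cos_sim(q_emb, p_embs)[0]  # shape: (1 + len(candidates),)
    pos_score, cand_scores = scores[0], scores[1:]

    # Discard candidates nearly as similar as the gold passage (likely false
    # negatives), then take the top_k remaining as hard negatives.
    kept = [
        (score.item(), cand)
        for score, cand in zip(cand_scores, candidates)
        if score < threshold * pos_score
    ]
    kept.sort(key=lambda x: x[0], reverse=True)
    return [cand for _, cand in kept[:top_k]]
```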
|
|
|
|
|
Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br> |
|
|
The types of data augmentation applied are as follows: |
|
|
|
|
|
| Augmentation* | Description | |
|
|
|-----------|-------------|
|
|
| Pair concatenation | Multi-query & Multi-passage | |
|
|
| Language transfer | Korean to English on query & passage | |
|
|
| Style transfer | Plain sentences to Markdown description | |
|
|
*\*Augmentation was carried out using Gemma-3-12B.*
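
To illustrate the style-transfer augmentation, here is a minimal sketch of rewriting a plain passage into Markdown with a chat pipeline. The checkpoint name (google/gemma-3-12b-it), pipeline task, and prompt are illustrative assumptions; the card only states that Gemma-3-12B was used.

```python
from transformers import pipeline

# Illustrative setup: the exact checkpoint, pipeline task, and prompt are
# assumptions (multimodal Gemma 3 checkpoints may require a different task).
generator = pipeline("text-generation", model="google/gemma-3-12b-it")

def to_markdown_passage(plain_text: str) -> str:
    # Ask the model to rewrite a plain passage in Markdown style.
    messages = [{
        "role": "user",
        "content": (
            "Rewrite the following passage as a Markdown document with a "
            "heading and bullet points, keeping the content unchanged:\n\n"
            + plain_text
        ),
    }]
    out = generator(messages, max_new_tokens=512)
    # Chat pipelines return the full conversation; the last message is the reply.
    return out[0]["generated_text"][-1]["content"]
```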
|
|
|
|
|
### Evaluation |
|
|
The evaluation consists of five dataset groups; the table below reports the average retrieval performance across them.


Three groups are subsets extracted from AI Hub (AI 허브) datasets.


One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.


The final group concatenates the four groups above, providing a comprehensive mixed set.
|
|
|
|
|
| Model | Open/Closed | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
|
|
|--------------------------------------------------------|-----------|-----------|-----------|-----------|------------| |
|
|
| upstage-large | Closed | 0.6323 | 0.8522 | 0.9068 | 0.9459 | |
|
|
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | Open | 0.6612 | 0.8396 | 0.8931 | 0.9390 | |
|
|
| **FronyAI/frony-embed-medium-ko-v2** | Open | **0.6805** | **0.8375** | 0.8819 | 0.9206 | |
|
|
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | Open | 0.6942 | 0.8361 | 0.8807 | 0.9197 | |
|
|
| FronyAI/frony-embed-medium-arctic-ko-v2.5 (half dim) | Open | 0.6778 | 0.8277 | 0.8726 | 0.9129 | |
|
|
| **FronyAI/frony-embed-medium-ko-v2 (half dim)** | Open | **0.6722** | **0.8274** | 0.8712 | 0.9157 | |
|
|
| nlpai-lab/KURE-v1 | Open | 0.6434 | 0.8240 | 0.8788 | 0.9285 | |
|
|
| FronyAI/frony-embed-medium-ko-v1 | Open | 0.6649 | 0.8040 | 0.8458 | 0.8876 | |
|
|
| FronyAI/frony-embed-medium-ko-v1 (half dim) | Open | 0.6520 | 0.7923 | 0.8361 | 0.8796 | |
|
|
| BAAI/bge-m3 | Open | 0.5849 | 0.7763 | 0.8420 | 0.8985 | |
|
|
| intfloat/multilingual-e5-large | Open | 0.5764 | 0.7630 | 0.8267 | 0.8891 | |
|
|
| Snowflake/snowflake-arctic-embed-l-v2.0 | Open | 0.5726 | 0.7591 | 0.8232 | 0.8917 | |
|
|
| jinaai/jina-embeddings-v3 | Open | 0.5270 | 0.7242 | 0.7953 | 0.8644 | |
|
|
| openai-text-embedding-3-large | Closed | 0.4903 | 0.6621 | 0.7316 | 0.8149 | |
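
Accuracy@k here denotes the fraction of queries whose gold passage appears among the top k retrieved results. A minimal sketch of such a computation (the evaluation script itself is not included in this card):

```python
import numpy as np

def accuracy_at_k(scores: np.ndarray, gold: np.ndarray, k: int) -> float:
    """scores: (n_queries, n_passages) similarity matrix.
    gold:   (n_queries,) index of the gold passage for each query."""
    # Indices of the k highest-scoring passages per query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == gold[:, None]).any(axis=1)
    return float(hits.mean())
```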
|
|
|
|
|
## Usage |
|
|
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
|
|
First install the Sentence Transformers library: |
|
|
|
|
|
```bash |
|
|
pip install -U sentence-transformers |
|
|
``` |
|
|
|
|
|
Then you can load this model and run inference. |
|
|
```python |
|
|
import torch.nn.functional as F

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")

# Run inference
# '<Q>' is the special token prepended to queries.
queries = [
    '<Q>안녕하세요',
]
query_embeddings = model.encode(queries)

# '<P>' is the special token prepended to passages.
passages = [
    '<P>반갑습니다',
]
passage_embeddings = model.encode(passages)

# Matryoshka embeddings (half of the original dimension):
# truncate to the first 512 dimensions, then L2-normalize again.
embeddings = model.encode(queries, normalize_embeddings=False, convert_to_tensor=True)[:, :512]
embeddings = F.normalize(embeddings, p=2, dim=-1)
|
|
``` |
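
Since the model uses cosine similarity, retrieval scores can be computed directly from the query and passage embeddings above, e.g. with the `util.cos_sim` helper from Sentence Transformers:

```python
from sentence_transformers import util

# Cosine similarity between each query and each passage embedding from above.
scores = util.cos_sim(query_embeddings, passage_embeddings)
print(scores)  # higher score = more relevant passage
```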
|
|
|
|
|
## Contact |
|
|
Feel free to open an issue or pull request if you have any questions or suggestions about this project. |
|
|
You can also reach us by email (flash659@gmail.com).