---
# FronyAI Embedding (tiny)

This is a lightweight and efficient embedding model designed specifically for the Korean language.
It has been trained on a diverse set of data sources, including AI Hub, to ensure robust performance in a wide range of retrieval tasks.
The model demonstrates strong retrieval capabilities across:<br>

* Korean–Korean
* Korean–English
* English–Korean

To support resource-constrained environments, the model is also compatible with Matryoshka Embeddings, enabling retrieval even at reduced dimensions **(e.g., half of the original size)** without significant performance loss.
All training and data preprocessing were performed on **a single GPU (46GB VRAM)**, showcasing not only the model's effectiveness but also its efficiency.
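
Below is a minimal sketch of Matryoshka-style truncation with sentence-transformers; the model ID is a placeholder (substitute the actual repository name), and the half-dimension cut follows the example above.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Placeholder repository ID; substitute the actual model name.
model = SentenceTransformer("frony-ai/frony-embedding-tiny")

docs = ["Seoul is the capital of Korea.", "Matryoshka embeddings can be truncated."]
full = model.encode(docs, normalize_embeddings=True)

# Keep the first half of each embedding, then re-normalize so cosine
# similarity remains meaningful at the reduced dimensionality.
half = full[:, : full.shape[1] // 2]
half = half / np.linalg.norm(half, axis=1, keepdims=True)
```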

## Model Details

The total number of trained query–document pairs is 100,000.<br>
### Training Details
The overall training process was conducted with reference to **snowflake-arctic-2.0**.<br>
Training was divided into two stages: pre-training and post-training.

* In the pre-training stage, the model was trained using in-batch negatives.
* In the post-training stage, we used the multilingual-e5-large model to mine hard negatives: specifically, the top 4 candidates with a similarity score below a **99% threshold** (see the sketch after this list).
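
A minimal sketch of this mining step, assuming the **99% threshold** is a positive-aware margin (a candidate counts as a hard negative only if its score is below 99% of the positive passage's score); the helper below is illustrative, not the actual training code:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# multilingual-e5-large expects "query: " / "passage: " input prefixes.
teacher = SentenceTransformer("intfloat/multilingual-e5-large")

def mine_hard_negatives(query, positive, candidates, k=4, margin=0.99):
    embs = teacher.encode(
        [f"query: {query}", f"passage: {positive}"]
        + [f"passage: {c}" for c in candidates],
        normalize_embeddings=True,
    )
    q, pos, cands = embs[0], embs[1], embs[2:]
    scores = cands @ q                  # cosine scores against the query
    order = np.argsort(-scores)         # best-scoring candidates first
    limit = margin * float(q @ pos)     # 99% of the positive's score
    picked = [i for i in order if scores[i] < limit][:k]
    return [candidates[i] for i in picked]
```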

Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
The types of data augmentation applied are as follows:

| Augmentation* | Description |
|-----------|-----------|
| Pair concatenation | Multi-query & Multi-passage |

\*Augmentation was carried out using **Gemma-3-12B**.
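
As an illustration of the pair-concatenation row above, here is a hypothetical helper that merges two (query, passage) pairs into a single multi-query, multi-passage pair; the exact joining format is an assumption:

```python
# Hypothetical sketch: merge two training pairs into one multi-query,
# multi-passage example. The separators are illustrative assumptions.
def concat_pairs(pair_a, pair_b):
    (query_a, passage_a), (query_b, passage_b) = pair_a, pair_b
    return (query_a + " " + query_b, passage_a + "\n\n" + passage_b)
```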

### Evaluation
The evaluation consists of five dataset groups.
Three groups are subsets extracted from AI Hub datasets.
One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.
The final group is a concatenation of all four aforementioned groups, providing a comprehensive mixed set.<br>
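
For reference, a minimal sketch of the Accuracy@k metric reported in the table below (the fraction of queries whose relevant passage appears among the top-k retrieved results); the function and argument names are illustrative:

```python
import numpy as np

def accuracy_at_k(query_embs, doc_embs, relevant_idx, k):
    # Cosine scores, assuming both embedding matrices are L2-normalized.
    scores = query_embs @ doc_embs.T
    topk = np.argsort(-scores, axis=1)[:, :k]   # top-k doc indices per query
    hits = [rel in set(row) for rel, row in zip(relevant_idx, topk)]
    return float(np.mean(hits))
```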
The following table presents the average retrieval performance across the five dataset groups.

| Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
|--------------|-----------|-----------|-----------|------------|------------|-------------|