FronyAI committed
Commit c95a4db · verified · 1 Parent(s): 4ed1d72

Update README.md

Files changed (1): README.md (+10 −10)
README.md CHANGED
@@ -14,16 +14,16 @@ base_model:
 ---
 
 # FronyAI Embedding (tiny)
-This is a lightweight and efficient embedding model designed specifically for the Korean language.<br>
-It has been trained on a diverse set of data sources, including AI 허브, to ensure robust performance in a wide range of retrieval tasks.<br>
+This is a lightweight and efficient embedding model designed specifically for the Korean language.
+It has been trained on a diverse set of data sources, including AI 허브, to ensure robust performance in a wide range of retrieval tasks.
 The model demonstrates strong retrieval capabilities across:<br>
 
 * Korean–Korean
 * Korean–English
 * English–Korean
 
-To support resource-constrained environments, the model also provides compatibility with Matryoshka Embeddings, enabling retrieval even at reduced dimensions **(e.g., half of the original size)** without significant performance loss.<br>
-All training and data preprocessing were performed on **a single GPU (46VRAM)**, showcasing not only the model’s effectiveness but also its efficiency.<br>
+To support resource-constrained environments, the model also provides compatibility with Matryoshka Embeddings, enabling retrieval even at reduced dimensions **(e.g., half of the original size)** without significant performance loss.
+All training and data preprocessing were performed on **a single GPU (46VRAM)**, showcasing not only the model’s effectiveness but also its efficiency.
 
 ## Model Details
 
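The Matryoshka usage described in this hunk (retrieval at reduced dimensions, e.g. half the original size) can be sketched as follows. This is a minimal NumPy illustration of the truncate-then-renormalize convention, not the model's actual code; the function name and toy vectors are invented for the example.

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of each embedding, then L2-normalize
    so cosine similarity remains meaningful at the reduced size."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Toy full-size embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 384))

half = truncate_and_renormalize(full, 192)  # half of the original size
assert half.shape == (4, 192)
assert np.allclose(np.linalg.norm(half, axis=1), 1.0)
```

Matryoshka-trained models pack the most informative components into the leading dimensions, which is why simple prefix truncation loses little retrieval quality.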
@@ -42,13 +42,13 @@ Total trained query and document pair is 100,000.<br>
 
 ### Training Details
 The overall training process was conducted with reference to **snowflake-arctic-2.0**.<br>
-Training was divided into two stages: Pre-training and Post-training.<br>
+Training was divided into two stages: Pre-training and Post-training.
 
 * In the pre-training stage, the model was trained using in-batch negatives.
 * In the post-training stage, we utilized the multilingual-e5-large model to identify hard negatives—specifically, the top 4 samples with a similarity score below a **99% threshold**.
 
 Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
-The types of data augmentation applied are as follows:<br>
+The types of data augmentation applied are as follows:
 | Augmentation* | Description |
 -----------|-----------|
 | Pair concatenation | Multi-query & Multi-passage |
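The hard-negative selection in this hunk (top 4 candidates with similarity below a 99% threshold) can be sketched as below. This assumes the threshold is a cutoff on the teacher model's similarity score, which filters out near-duplicates of the positive; the function name and toy scores are illustrative, not the authors' pipeline.

```python
import numpy as np

def pick_hard_negatives(scores: np.ndarray, threshold: float = 0.99, k: int = 4) -> list[int]:
    """From one query's similarity scores over candidate passages, drop
    candidates scoring at or above `threshold` (likely paraphrases of the
    positive), then return the indices of the `k` highest-scoring survivors."""
    order = np.argsort(scores)[::-1]  # most similar first
    kept = [int(i) for i in order if scores[i] < threshold]
    return kept[:k]

# Toy scores from a hypothetical teacher model (e.g. multilingual-e5-large).
scores = np.array([0.995, 0.91, 0.88, 0.97, 0.42, 0.86])
print(pick_hard_negatives(scores))  # → [3, 1, 2, 5]
```

Candidate 0 scores 0.995, above the cutoff, so it is excluded as a probable duplicate of the positive; the next four most-similar candidates become the hard negatives.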
@@ -57,11 +57,11 @@ The types of data augmentation applied are as follows:<br>
 **Augmentation was carried out using the Gemma-3-12B*
 
 ### Evaluation
-The evaluation consists of five dataset groups, and the results in the table represent the average retrieval performance across these five groups.<br>
-Three groups are subsets extracted from AI 허브 datasets.<br>
-One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.<br>
+The evaluation consists of five dataset groups, and the results in the table represent the average retrieval performance across these five groups.
+Three groups are subsets extracted from AI 허브 datasets.
+One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.
 The final group is a concatenation of all four aforementioned groups, providing a comprehensive mixed set.<br>
-The following table presents the average retrieval performance across five dataset groups.<br>
+The following table presents the average retrieval performance across five dataset groups.
 
 | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
 |--------------|-----------|-----------|-----------|------------|------------|-------------|
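The Accuracy@k metric reported in this hunk's table can be sketched as follows: the fraction of queries whose gold passage appears in the top-k retrieved results. This is a standard definition written from scratch for illustration; the toy IDs are invented and the function is not the authors' evaluation code.

```python
def accuracy_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold passage appears in the top-k retrieved list."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

# Toy retrieval results for three queries (most relevant first).
ranked = [["p1", "p7", "p3"], ["p9", "p2", "p4"], ["p5", "p6", "p8"]]
gold = ["p1", "p4", "p6"]

print(accuracy_at_k(ranked, gold, 1))  # only the first query hits at k=1
print(accuracy_at_k(ranked, gold, 3))  # → 1.0
```

Accuracy@1 is the strictest column (the gold passage must rank first), while Accuracy@10 only requires it to appear anywhere in the top ten.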
 