FronyAI committed · Commit fc4c4c6 · verified · 1 Parent(s): 10ef023

Update README.md

Files changed (1)
1. README.md +65 -112
README.md CHANGED
@@ -1,41 +1,65 @@
- ---
- language:
- - ko
- - en
- license: apache-2.0
- tags:
- - sentence-transformers
- - sentence-similarity
- - feature-extraction
- pipeline_tag: sentence-similarity
- library_name: sentence-transformers
- ---
  # FronyAI Embedding (tiny)

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
  - **Base Model:** microsoft/Multilingual-MiniLM-L12-H384
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
  - **Maximum Sequence Length:** 512 tokens
  - **Output Dimensionality:** 384 / 192 dimensions
  - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
  - **Languages:** ko, en
  - **License:** apache-2.0

  ### Datasets
- This model is trained from many sources data including **AI 허브**.
- Total trained query and document pair is 100,000.

  ### Evaluation
- The evaluation consists of five dataset groups, and the results in the table represent the average retrieval performance across these five groups.
- Three groups are subsets extracted from **AI 허브** datasets.
- One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.
- The final group is a concatenation of all four aforementioned groups, providing a comprehensive mixed set.
- The following table presents the average retrieval performance across five dataset groups.

  | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
  |--------------|-----------|-----------|-----------|------------|------------|-------------|
@@ -43,16 +67,15 @@ The following table presents the average retrieval performance across five datas
  | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
  | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
  | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
- | frony-embed-tiny | **Open** | 0.5084 | **0.6757** | 0.7278 | 0.7845 |
- | frony-embed-tiny (half dim) | Open | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
- | bge-m3 | **Open** | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
- | multilingual-e5-large | Open | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
- | snowflake-arctic-embed-l-v2.0 | Open | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
- | jina-embeddings-v3 | Open | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
- | upstage-large | **Closed** | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
- | openai-text-embedding-3-large | Closed | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
-
- ## Training

  ## Usage

@@ -71,86 +94,16 @@ from sentence_transformers import SentenceTransformer
  # Download from the 🤗 Hub
  model = SentenceTransformer("FronyAI/frony-embed-tiny-ko-v1")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 384]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.16
- - Sentence Transformers: 4.0.2
- - Transformers: 4.47.1
- - PyTorch: 2.5.1+cu121
- - Accelerate: 1.2.1
- - Datasets: 2.21.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->

+ ---
+ language:
+ - ko
+ - en
+ license: apache-2.0
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ base_model:
+ - microsoft/Multilingual-MiniLM-L12-H384
+ ---

  # FronyAI Embedding (tiny)
+ This is a lightweight and efficient embedding model designed specifically for the Korean language.<br>
+ It has been trained on a diverse set of data sources, including **AI 허브**, to ensure robust performance in a wide range of retrieval tasks.<br>
+ The model demonstrates strong retrieval capabilities across:<br>
+
+ * Korean–Korean
+ * Korean–English
+ * English–Korean
+
+ To support resource-constrained environments, the model is also compatible with Matryoshka embeddings, enabling retrieval even at reduced dimensions **(e.g., half of the original size)** without significant performance loss (see the sketch below).<br>
+ All training and data preprocessing were performed on **a single GPU (46 GB VRAM)**, showcasing not only the model's effectiveness but also its efficiency.<br>
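+
+ A minimal sketch of half-dimension retrieval, assuming the usual Matryoshka recipe of truncating to the first 192 dimensions and re-normalizing (recent sentence-transformers releases also expose a `truncate_dim` argument for this):
+
+ ```python
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("FronyAI/frony-embed-tiny-ko-v1")
+
+ emb = model.encode(['<Q>안녕하세요'])                      # shape (1, 384)
+ half = emb[:, :192]                                        # keep the leading 192 dims
+ half = half / np.linalg.norm(half, axis=1, keepdims=True)  # re-normalize for cosine similarity
+ ```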
  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
  - **Base Model:** microsoft/Multilingual-MiniLM-L12-H384
  - **Maximum Sequence Length:** 512 tokens
  - **Output Dimensionality:** 384 / 192 dimensions
  - **Similarity Function:** Cosine Similarity
  - **Languages:** ko, en
  - **License:** apache-2.0

  ### Datasets
+ This model was trained on data from many sources, including **AI 허브**.<br>
+ In total, 100,000 query–document pairs were used for training.<br>
+
+ ### Training Details
+ The overall training process was conducted with reference to **snowflake-arctic-embed-2.0**.<br>
+ Training was divided into two stages: pre-training and post-training.<br>
+ In the pre-training stage, the model was trained using in-batch negatives.<br>
+ In the post-training stage, we used the multilingual-e5-large model to identify hard negatives, specifically the top 4 samples with a similarity score below a **99% threshold** (a mining sketch follows the table below).<br>
+ Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
+ The types of data augmentation applied are as follows:<br>
+
+ | Augmentation* | Description |
+ |-----------|-----------|
+ | Pair concatenation | Multi-query & multi-passage |
+ | Language transfer | Korean to English on query & passage |
+ | Style transfer | Plain sentences to Markdown description |
+
+ \*Augmentation was carried out using Gemma-3-12B.
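+
+ The card does not spell out the mining step in full, so the following is a minimal sketch under one plausible reading: for each query, candidates are ranked by multilingual-e5-large and the top 4 scoring below 99% of the positive pair's similarity are kept as hard negatives. All names are illustrative.
+
+ ```python
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ miner = SentenceTransformer("intfloat/multilingual-e5-large")
+
+ def mine_hard_negatives(query, positive, candidates, k=4, threshold=0.99):
+     # multilingual-e5 expects "query: " / "passage: " prefixes
+     q = miner.encode([f"query: {query}"], normalize_embeddings=True)
+     docs = miner.encode([f"passage: {d}" for d in [positive] + candidates],
+                         normalize_embeddings=True)
+     sims = (q @ docs.T)[0]                  # cosine similarities
+     pos_sim, cand_sims = sims[0], sims[1:]
+     order = np.argsort(-cand_sims)          # rank candidates, best first
+     hard = [candidates[i] for i in order if cand_sims[i] < threshold * pos_sim]
+     return hard[:k]                         # up to 4 hard negatives per query
+ ```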

  ### Evaluation
+ The evaluation consists of five dataset groups; the table below reports the average retrieval performance across them.<br>
+ Three groups are subsets extracted from **AI 허브** datasets.<br>
+ One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.<br>
+ The final group is a concatenation of the four aforementioned groups, providing a comprehensive mixed set.<br>

  | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
  |--------------|-----------|-----------|-----------|------------|------------|-------------|

  | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
  | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
  | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
+ | frony-embed-tiny | Open | 21M* | 0.5084 | 0.6757 | 0.7278 | 0.7845 |
+ | frony-embed-tiny (half dim) | Open | 21M* | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
+ | bge-m3 | Open | 560M | 0.5852 | 0.7763 | 0.8418 | 0.8987 |
+ | multilingual-e5-large | Open | 560M | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
+ | snowflake-arctic-embed-l-v2.0 | Open | 568M | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
+ | jina-embeddings-v3 | Open | 572M | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
+ | upstage-large | Closed | - | 0.6334 | 0.8527 | 0.9065 | 0.9478 |
+ | openai-text-embedding-3-large | Closed | - | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
+
+ \*Parameter count covers the Transformer blocks only.
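+
+ For reference, a small sketch of how Accuracy@k is typically computed for this kind of retrieval benchmark (our reading; the exact evaluation script is not included in the card):
+
+ ```python
+ import numpy as np
+
+ def accuracy_at_k(query_emb, doc_emb, gold_idx, k):
+     """gold_idx[i] is the index of the relevant document for query i."""
+     sims = query_emb @ doc_emb.T             # cosine sims (embeddings pre-normalized)
+     topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best documents
+     hits = [g in row for g, row in zip(gold_idx, topk)]
+     return float(np.mean(hits))
+ ```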
 

  ## Usage

  # Download from the 🤗 Hub
  model = SentenceTransformer("FronyAI/frony-embed-tiny-ko-v1")
  # Run inference
+
+ # '<Q>' is the special token that marks a query.
+ queries = [
+     '<Q>안녕하세요',  # "Hello"
+ ]
+ query_embeddings = model.encode(queries)
+
+ # '<P>' is the special token that marks a passage.
+ passages = [
+     '<P>반갑습니다',  # "Nice to meet you"
+ ]
+ passage_embeddings = model.encode(passages)
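+
+ # Score query-passage pairs with cosine similarity; model.similarity is the
+ # sentence-transformers helper shown in earlier versions of this card.
+ scores = model.similarity(query_embeddings, passage_embeddings)
+ print(scores)  # shape [len(queries), len(passages)]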
+ ```