FronyAI committed
Commit 73dce49 · verified · 1 Parent(s): d649a20

Update README.md

Files changed (1): README.md +78 -102
README.md CHANGED
@@ -1,4 +1,7 @@
  ---
  license: apache-2.0
  tags:
  - sentence-transformers
@@ -6,39 +9,78 @@ tags:
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

- # FronyAI/frony-embed-medium-ko-v2

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
  - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 1024 dimensions
  - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
  - **License:** apache-2.0

- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```

  ## Usage
 
@@ -55,88 +97,22 @@ Then you can load this model and run inference.
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.16
- - Sentence Transformers: 4.0.2
- - Transformers: 4.47.1
- - PyTorch: 2.5.1+cu121
- - Accelerate: 1.2.1
- - Datasets: 2.21.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
+ language:
+ - ko
+ - en
  license: apache-2.0
  tags:
  - sentence-transformers
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
+ base_model:
+ - klue/roberta-large
  ---

+ # Frony Embed V2 (medium)
+ This is an efficient embedding model designed specifically for the Korean language.
+ It has been trained on a diverse set of data sources, including AI 허브, to ensure robust performance across a wide range of retrieval tasks.
+ The model demonstrates strong retrieval capabilities across:<br>
+
+ * Korean–Korean
+ * Korean–English
+ * English–Korean
+
+ To support resource-constrained environments, the model is also compatible with Matryoshka Embeddings, enabling retrieval at reduced dimensions **(e.g., half of the original size)** without significant performance loss.
+ All training and data preprocessing were performed on **a single GPU (46 GB VRAM)**, showcasing not only the model's effectiveness but also its efficiency.
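With Matryoshka-style embeddings, reduced-dimension retrieval just keeps the leading components of each vector and re-normalizes. A minimal NumPy sketch of the 1024 → 512 reduction described above (the `truncate_matryoshka` helper name is illustrative, not part of this repo; real vectors come from `model.encode`):

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 512) -> np.ndarray:
    """Keep the first `dim` components and re-normalize each row to unit
    length, so cosine similarity remains meaningful at the reduced size."""
    reduced = embeddings[:, :dim]
    return reduced / np.linalg.norm(reduced, axis=1, keepdims=True)

# Random vectors stand in for real 1024-dim model outputs.
full = np.random.default_rng(0).normal(size=(3, 1024))
half = truncate_matryoshka(full, dim=512)
print(half.shape)  # (3, 512)
```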

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
+ - **Base Model:** klue/roberta-large
  - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 / 512 dimensions
  - **Similarity Function:** Cosine Similarity
+ - **Languages:** ko, en
  - **License:** apache-2.0

+ ### Datasets
+ This model was trained on data from many sources, including **AI 허브**.<br>
+ In total, 500,000 query–document pairs were used for training.<br>
+
+ ### Training Details
+ The overall training process was designed with reference to snowflake-arctic-embed 2.0.<br>
+ In V2, a three-stage training process was introduced as a key component of the overall learning strategy: Adaptation-training, Pre-training, and Post-training.
+
+ * In the adaptation-training stage, preliminary experiments showed that multi-vector retrieval consistently outperformed standard dense retrieval, so we first trained the model with a multi-vector retrieval objective.
+ * In the pre-training stage, we introduced knowledge distillation, **where the multi-vector retrieval scores were distilled into the dense retrieval scores**. This allowed the model to capture fine-grained token-level similarity signals while being trained with in-batch negatives.
+ * In the post-training stage, we used the multilingual-e5-large model to mine hard negatives (specifically, the top 4 samples with a similarity score below a 99% threshold) and fine-tuned the model further on these examples.
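The post-training mining step can be sketched as follows, under one plausible reading of the description: for each query, take the top 4 non-positive passages whose teacher similarity stays below the 0.99 threshold. The similarity matrix is a toy stand-in for multilingual-e5-large scores, positives are assumed to sit on the diagonal, and all names are illustrative:

```python
import numpy as np

def mine_hard_negatives(sim: np.ndarray, top_k: int = 4, threshold: float = 0.99):
    """For each query (row), return the top_k highest-scoring non-positive
    passages whose teacher similarity is below `threshold` (this filters
    near-duplicates of the positive). Positives sit on the diagonal."""
    negatives = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row)
                      if j != i and score < threshold]
        candidates.sort(reverse=True)  # hardest (highest-scoring) first
        negatives.append([j for _, j in candidates[:top_k]])
    return negatives

# Toy teacher scores: 3 queries x 5 passages, positives on the diagonal.
sim = np.array([
    [0.95, 0.90, 0.60, 0.94, 0.30],
    [0.40, 0.92, 0.91, 0.20, 0.50],
    [0.10, 0.30, 0.88, 0.87, 0.86],
])
print(mine_hard_negatives(sim, top_k=2))  # [[3, 1], [2, 4], [3, 4]]
```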
+
+ Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
+ The types of data augmentation applied are as follows:
+
+ | Augmentation* | Description |
+ |-----------|-----------|
+ | Pair concatenation | Multi-query & multi-passage |
+ | Language transfer | Korean to English on query & passage |
+ | Style transfer | Plain sentences to Markdown description |
+
+ **Augmentation was carried out using Gemma-3-12B*
+
+ ### Evaluation
+ The evaluation consists of five dataset groups, and the table below reports the average retrieval performance across them.
+ Three groups are subsets extracted from AI 허브 datasets.
+ One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.
+ The final group is a concatenation of the four aforementioned groups, providing a comprehensive mixed set.<br>
+
+ | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
+ |--------------|-----------|-----------|-----------|------------|------------|-------------|
+ | frony-embed-medium | **Open** | 337M | 0.6649 | **0.8040** | 0.8458 | 0.8876 |
+ | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
+ | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
+ | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
+ | frony-embed-tiny | Open | 21M* | 0.5084 | 0.6757 | 0.7278 | 0.7845 |
+ | frony-embed-tiny (half dim) | Open | 21M* | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
+ | bge-m3 | **Open** | 560M | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
+ | multilingual-e5-large | Open | 560M | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
+ | snowflake-arctic-embed-l-v2.0 | Open | 568M | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
+ | jina-embeddings-v3 | Open | 572M | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
+ | upstage-large | **Closed** | - | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
+ | openai-text-embedding-3-large | Closed | - | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
+ **Transformer blocks only*
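Accuracy@k, as used in the table, is the fraction of queries whose gold passage appears among the top-k retrieved candidates. A minimal sketch with toy scores (function and variable names are illustrative):

```python
import numpy as np

def accuracy_at_k(scores: np.ndarray, gold: np.ndarray, k: int) -> float:
    """scores: (n_queries, n_passages) similarity matrix.
    gold: index of the correct passage for each query.
    Returns the fraction of queries whose gold passage ranks in the top k."""
    topk = np.argsort(-scores, axis=1)[:, :k]       # indices of k best passages
    hits = (topk == gold[:, None]).any(axis=1)      # gold found among them?
    return float(hits.mean())

scores = np.array([
    [0.9, 0.2, 0.8],
    [0.1, 0.3, 0.7],
])
gold = np.array([0, 1])
print(accuracy_at_k(scores, gold, k=1))  # 0.5
print(accuracy_at_k(scores, gold, k=2))  # 1.0
```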

  ## Usage

  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")
  # Run inference

+ # '<Q>' is the special token for queries.
+ queries = [
+     '<Q>안녕하세요',
+ ]
+ query_embeddings = model.encode(queries)

+ # '<P>' is the special token for passages.
+ passages = [
+     '<P>반갑습니다',
+ ]
+ passage_embeddings = model.encode(passages)
+ ```
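Given the cosine similarity function listed in the model description, query–passage relevance reduces to normalized dot products of the encoded vectors. A NumPy sketch with toy vectors standing in for the encoded queries and passages above (real outputs are 1024-dimensional; the helper name is illustrative):

```python
import numpy as np

def cosine_similarity(queries: np.ndarray, passages: np.ndarray) -> np.ndarray:
    """Cosine similarity matrix between query and passage embeddings."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    return q @ p.T

# Toy stand-ins for the encoded queries and passages.
q_emb = np.array([[1.0, 0.0, 1.0]])
p_emb = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
print(cosine_similarity(q_emb, p_emb))  # [[1. 0.]]
```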

+ ## Contact
+ Feel free to open an issue or pull request if you have any questions or suggestions about this project.
+ You can also email flash659@gmail.com.