FronyAI committed on
Commit ee4738e · verified · 1 Parent(s): 92f8f4f

Update README.md

Files changed (1)
  1. README.md +66 -102
README.md CHANGED
@@ -9,39 +9,73 @@ tags:
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

- # FronyAI/frony-embed-medium-ko-v1

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for Retrieval.

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
  - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 1024 dimensions
  - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
  - **Languages:** ko, en
  - **License:** apache-2.0

- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```

  ## Usage
@@ -58,88 +92,18 @@ Then you can load this model and run inference.
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v1")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.16
- - Sentence Transformers: 4.0.2
- - Transformers: 4.47.1
- - PyTorch: 2.5.1+cu121
- - Accelerate: 1.2.1
- - Datasets: 2.21.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
+ base_model:
+ - klue/roberta-large
  ---

+ # FronyAI Embedding (medium)
+ This is a lightweight and efficient embedding model designed specifically for Korean.<br>
+ It was trained on a diverse set of data sources, including **AI 허브** (AI Hub), to ensure robust performance across a wide range of retrieval tasks.<br>
+ The model demonstrates strong retrieval capability in three directions:<br>

+ * Korean–Korean
+ * Korean–English
+ * English–Korean
+
+ To support resource-constrained environments, the model is also compatible with Matryoshka embeddings, enabling retrieval at reduced dimensions **(e.g., half the original size)** without significant performance loss; a usage sketch follows below.<br>
+ All training and data preprocessing were performed on **a single GPU (46GB of VRAM)**, showcasing the model's efficiency as well as its effectiveness.<br>
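A minimal sketch of the half-dimension retrieval mentioned above, assuming only sentence-transformers' standard `truncate_dim` loading option; the snippet is illustrative, not from the card:

```python
from sentence_transformers import SentenceTransformer

# Matryoshka-style truncation: keep only the first 512 of 1024 dimensions.
# `truncate_dim` is a standard sentence-transformers loading option (v2.7+).
model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v1", truncate_dim=512)

embeddings = model.encode(["<Q>안녕하세요"])  # "Hello"
print(embeddings.shape)  # (1, 512)
```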

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
+ - **Base Model:** klue/roberta-large
  - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 / 512 dimensions
  - **Similarity Function:** Cosine Similarity
  - **Languages:** ko, en
  - **License:** apache-2.0

+ ### Datasets
+ This model was trained on data from many sources, including **AI 허브** (AI Hub).<br>
+ In total, 100,000 query–document pairs were used for training.<br>
+
+ ### Training Details
+ The overall training process was designed with reference to **snowflake-arctic-embed 2.0**.<br>
+ Training was divided into two stages: pre-training and post-training.<br>
+ In the pre-training stage, the model was trained with in-batch negatives.<br>
+ In the post-training stage, we used the multilingual-e5-large model to mine hard negatives: for each query, the top 4 candidates whose similarity score fell below a **99% threshold**, which filters out likely false negatives.<br>
+ Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
+ The types of data augmentation applied are as follows (sketches of both training stages appear after the table):<br>
+
+ | Augmentation* | Description |
+ |---------------|-------------|
+ | Pair concatenation | Multi-query & multi-passage |
+ | Language transfer | Korean to English on query & passage |
+ | Style transfer | Plain sentences to Markdown description |
+
+ *Augmentation was carried out using Gemma-3-12B.
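A minimal sketch of the pre-training stage described above, assuming the standard in-batch-negatives objective in sentence-transformers (MultipleNegativesRankingLoss); the training pairs and hyperparameters here are invented for illustration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Base model named on the card; sentence-transformers wraps a plain
# Hugging Face checkpoint with mean pooling automatically.
model = SentenceTransformer("klue/roberta-large")

# Illustrative (query, positive passage) pairs using the card's prefixes.
train_examples = [
    InputExample(texts=["<Q>대한민국의 수도는?", "<P>서울은 대한민국의 수도이다."]),
    InputExample(texts=["<Q>물의 끓는점은?", "<P>물은 100도에서 끓는다."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch acts as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```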
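And a sketch of the post-training hard-negative mining step, under my reading that "top 4 below a 99% threshold" keeps, per query, the four highest-scoring non-gold passages scoring below 0.99 of the gold passage's score; the data is illustrative:

```python
import torch
from sentence_transformers import SentenceTransformer

# Mining model named on the card; "query:"/"passage:" prefixes are e5's convention.
miner = SentenceTransformer("intfloat/multilingual-e5-large")

query = ["query: 대한민국의 수도는?"]
passages = [
    "passage: 서울은 대한민국의 수도이다.",   # gold passage (index 0)
    "passage: 부산은 대한민국의 항구 도시이다.",
    "passage: 한강은 서울을 가로지른다.",
]

q = miner.encode(query, convert_to_tensor=True, normalize_embeddings=True)
p = miner.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
scores = (q @ p.T).squeeze(0)  # cosine similarity of the query to each passage

keep = scores < 0.99 * scores[0]  # drop likely false negatives near the gold score
keep[0] = False                   # never sample the gold passage itself
ranked = scores.masked_fill(~keep, float("-inf"))
k = min(4, int(keep.sum()))       # top 4 remaining candidates become hard negatives
hard_negative_ids = ranked.topk(k).indices.tolist()
```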
+ ### Evaluation
+ The evaluation consists of five dataset groups; the table below reports the average retrieval performance across them.<br>
+ Three groups are subsets extracted from **AI 허브** (AI Hub) datasets.<br>
+ One group is based on a specific sports-regulation PDF, for which synthetic query and **Markdown-style passage** pairs were generated using GPT-4o-mini.<br>
+ The final group is a concatenation of the four aforementioned groups, providing a comprehensive mixed set.<br>
+
+ | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
+ |--------|-------------|------|------------|------------|------------|-------------|
+ | frony-embed-medium | **Open** | 337M | 0.6649 | **0.8040** | 0.8458 | 0.8876 |
+ | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
+ | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
+ | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
+ | frony-embed-tiny | Open | 21M* | 0.5084 | 0.6757 | 0.7278 | 0.7845 |
+ | frony-embed-tiny (half dim) | Open | 21M* | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
+ | bge-m3 | **Open** | 560M | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
+ | multilingual-e5-large | Open | 560M | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
+ | snowflake-arctic-embed-l-v2.0 | Open | 568M | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
+ | jina-embeddings-v3 | Open | 572M | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
+ | upstage-large | **Closed** | - | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
+ | openai-text-embedding-3-large | Closed | - | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
+
+ *Transformer blocks only.
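For reference, Accuracy@k is read here as the usual top-k hit rate: the fraction of queries whose gold passage appears among the k highest-ranked results. A toy illustration under that assumed definition, not the card's evaluation code:

```python
import numpy as np

def accuracy_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold passage appears in the top-k results."""
    hits = [gold in ranking[:k] for ranking, gold in zip(ranked_ids, gold_ids)]
    return float(np.mean(hits))

# ranked_ids[i] lists passage ids sorted by similarity for query i (toy data).
ranked_ids = [[3, 1, 7], [2, 5, 0]]
gold_ids = [1, 9]
print(accuracy_at_k(ranked_ids, gold_ids, k=3))  # 0.5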

  ## Usage

  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v1")
  # Run inference

+ # '<Q>' is the special prefix token for queries.
+ queries = [
+     '<Q>안녕하세요',  # "Hello"
+ ]
+ query_embeddings = model.encode(queries)

+ # '<P>' is the special prefix token for passages.
+ passages = [
+     '<P>반갑습니다',  # "Nice to meet you"
+ ]
+ passage_embeddings = model.encode(passages)
+ ```
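To score the passage against the query, one can use the model's built-in similarity method (cosine, per the Model Description above); a minimal follow-on sketch, not part of the commit:

```python
# Cosine similarity between query and passage embeddings.
scores = model.similarity(query_embeddings, passage_embeddings)
print(scores)  # tensor of shape (1, 1); higher means more relevant
```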