Commit 4f4a4d4 · verified · 1 Parent(s): 1a3d541
Committed by kekeappa

Add Matryoshka family table (64/128/256/512)

Files changed (1)
  1. README.md +50 -430
README.md CHANGED
@@ -1,455 +1,75 @@
1
  ---
 
 
 
 
 
2
  tags:
3
  - sentence-transformers
4
  - sentence-similarity
5
  - feature-extraction
6
- - generated_from_trainer
7
- - dataset_size:277826
8
- - loss:MultipleNegativesRankingLoss
9
- - loss:MatryoshkaLoss
10
- - loss:CosineSimilarityLoss
11
- widget:
12
- - source_sentence: 아무도 일어나 앉을 필요가 없다.
13
- sentences:
14
- - 앉으세요.
15
- - 푸아로는 신시아가 일하는 병원에 있는 약국을 방문하고 싶었다.
16
- - 일어나 앉지 마세요.
17
- - source_sentence: 내 말은, 그게 네가 아는 사람들이 그녀가 간직하고 있는 그런 종류의 것들이야. 그리고 이것이 시간낭비라는 것을
18
- 보여주는 거야. 그건 종이 낭비였어. 네가 아는 그런 것을 바꾸는 건 낭비였어.
19
- sentences:
20
- - 그녀는 시간과 종이를 낭비했다.
21
- - 돈은 계산을 통해 시간이 지남에 따라 성장할 수 있다.
22
- - 그녀가 한 모든 일은 생산적이고 가치 있는 일이었다.
23
- - source_sentence: 그리고, 아마 아시겠지만, 우체국은 미래의 필요에 대한 추정치를 제시할 때 1998 회계연도의 금융 및 운영 데이터를
24
- 벤치마크로 사용했다.
25
- sentences:
26
- - 우체국은 재정 및 운영 데이터를 찾을 수 없었다.
27
- - 나는 차가 기다리고 있을 때 차를 몰고 돌아왔다.
28
- - 1998 회계연도는 우체국의 미래 요구를 제시하기 위한 벤치마크로 사용되었다.
29
- - source_sentence: 이 회사에 따르면 유권자들은 생년월일과 디지털 서명을 통해 자신의 신원을 확인할 수 있을 것이라고 한다.
30
- sentences:
31
- - 나는 부럽다.
32
- - 유권자들은 그들의 정체성을 확인할 필요가 없다.
33
- - 유권자들은 생일과 디지털 서명으로 자신의 신원을 확인할 수 있을 것이다.
34
- - source_sentence: 내 워커가 고장나서 지금 화가 났어. 스테레오를 정말 크게 틀어야 해.
35
- sentences:
36
- - 내 워크맨은 여전히 항상 그랬던 것처럼 잘 작동한다.
37
- - 나는 내 워크맨이 고장나서 화가 나서 이제 스테레오를 정말 크게 틀어야 한다.
38
- - 농업에 미치는 두 가지 영향은 우리의 분석에서 정량적으로 추정된다.
39
- pipeline_tag: sentence-similarity
40
- library_name: sentence-transformers
41
- metrics:
42
- - pearson_cosine
43
- - spearman_cosine
44
- model-index:
45
- - name: SentenceTransformer
46
- results:
47
- - task:
48
- type: semantic-similarity
49
- name: Semantic Similarity
50
- dataset:
51
- name: korsts valid
52
- type: korsts-valid
53
- metrics:
54
- - type: pearson_cosine
55
- value: 0.8325143434234821
56
- name: Pearson Cosine
57
- - type: spearman_cosine
58
- value: 0.833013169247792
59
- name: Spearman Cosine
60
  ---
61
 
62
- # SentenceTransformer
63
-
64
- This is a trained [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
65
-
66
- ## Model Details
67
 
68
- ### Model Description
69
- - **Model Type:** Sentence Transformer
70
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
71
- - **Maximum Sequence Length:** inf tokens
72
- - **Output Dimensionality:** 512 dimensions
73
- - **Similarity Function:** Cosine Similarity
74
- <!-- - **Training Dataset:** Unknown -->
75
- <!-- - **Language:** Unknown -->
76
- <!-- - **License:** Unknown -->
77
 
78
- ### Model Sources
79
 
80
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
81
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
82
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
 
 
 
83
 
84
- ### Full Model Architecture
85
 
86
- ```
87
- SentenceTransformer(
88
- (0): StaticEmbedding(
89
- (embedding): EmbeddingBag(32000, 512, mode='mean')
90
- )
91
- )
92
- ```
93
-
94
- ## Usage
95
 
96
- ### Direct Usage (Sentence Transformers)
97
-
98
- First install the Sentence Transformers library:
99
-
100
- ```bash
101
- pip install -U sentence-transformers
102
- ```
103
 
104
- Then you can load this model and run inference.
105
  ```python
106
  from sentence_transformers import SentenceTransformer
107
 
108
- # Download from the 🤗 Hub
109
- model = SentenceTransformer("sentence_transformers_model_id")
110
- # Run inference
111
- sentences = [
112
- '내 워커가 고장나서 지금 화가 났어. 스테레오를 정말 크게 틀어야 해.',
113
- '나는 내 워크맨이 고장나서 화가 나서 이제 스테레오를 정말 크게 틀어야 한다.',
114
- '내 워크맨은 여전히 항상 그랬던 것처럼 잘 작동한다.',
115
- ]
116
- embeddings = model.encode(sentences)
117
- print(embeddings.shape)
118
- # [3, 512]
119
-
120
- # Get the similarity scores for the embeddings
121
- similarities = model.similarity(embeddings, embeddings)
122
- print(similarities.shape)
123
- # [3, 3]
124
- ```
125
-
126
- <!--
127
- ### Direct Usage (Transformers)
128
-
129
- <details><summary>Click to see the direct usage in Transformers</summary>
130
-
131
- </details>
132
- -->
133
-
134
- <!--
135
- ### Downstream Usage (Sentence Transformers)
136
-
137
- You can finetune this model on your own dataset.
138
-
139
- <details><summary>Click to expand</summary>
140
-
141
- </details>
142
- -->
143
-
144
- <!--
145
- ### Out-of-Scope Use
146
-
147
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
148
- -->
149
-
150
- ## Evaluation
151
-
152
- ### Metrics
153
-
154
- #### Semantic Similarity
155
-
156
- * Dataset: `korsts-valid`
157
- * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
158
-
159
- | Metric | Value |
160
- |:--------------------|:----------|
161
- | pearson_cosine | 0.8325 |
162
- | **spearman_cosine** | **0.833** |
163
-
164
- <!--
165
- ## Bias, Risks and Limitations
166
-
167
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
168
- -->
169
-
170
- <!--
171
- ### Recommendations
172
-
173
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
174
- -->
175
-
176
- ## Training Details
177
-
178
- ### Training Dataset
179
-
180
- #### Unnamed Dataset
181
-
182
- * Size: 277,826 training samples
183
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
184
- * Approximate statistics based on the first 1000 samples:
185
- | | sentence1 | sentence2 | score |
186
- |:--------|:----------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------|:---------------------------------------------------------------|
187
- | type | string | string | float |
188
- | details | <ul><li>min: 9 characters</li><li>mean: 18.19 characters</li><li>max: 53 characters</li></ul> | <ul><li>min: 9 characters</li><li>mean: 17.97 characters</li><li>max: 44 characters</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.45</li><li>max: 1.0</li></ul> |
189
- * Samples:
190
- | sentence1 | sentence2 | score |
191
- |:------------------------------------|:------------------------------------------|:------------------|
192
- | <code>비행기가 이륙하고 있다.</code> | <code>비행기가 이륙하고 있다.</code> | <code>1.0</code> |
193
- | <code>한 남자가 큰 플루트를 연주하고 있다.</code> | <code>남자가 플루트를 연주하고 있다.</code> | <code>0.76</code> |
194
- | <code>한 남자가 피자에 치즈를 뿌려놓고 있다.</code> | <code>한 남자가 구운 피자에 치즈 조각을 뿌려놓고 있다.</code> | <code>0.76</code> |
195
- * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
196
- ```json
197
- {
198
- "loss": "CosineSimilarityLoss",
199
- "matryoshka_dims": [
200
- 512,
201
- 256,
202
- 128,
203
- 64
204
- ],
205
- "matryoshka_weights": [
206
- 1,
207
- 1,
208
- 1,
209
- 1
210
- ],
211
- "n_dims_per_step": -1
212
- }
213
- ```
214
-
215
- ### Training Hyperparameters
216
- #### Non-Default Hyperparameters
217
-
218
- - `eval_strategy`: epoch
219
- - `per_device_train_batch_size`: 64
220
- - `learning_rate`: 0.02
221
- - `num_train_epochs`: 5
222
- - `warmup_ratio`: 0.1
223
- - `bf16`: True
224
- - `load_best_model_at_end`: True
225
-
226
- #### All Hyperparameters
227
- <details><summary>Click to expand</summary>
228
-
229
- - `overwrite_output_dir`: False
230
- - `do_predict`: False
231
- - `eval_strategy`: epoch
232
- - `prediction_loss_only`: True
233
- - `per_device_train_batch_size`: 64
234
- - `per_device_eval_batch_size`: 8
235
- - `per_gpu_train_batch_size`: None
236
- - `per_gpu_eval_batch_size`: None
237
- - `gradient_accumulation_steps`: 1
238
- - `eval_accumulation_steps`: None
239
- - `torch_empty_cache_steps`: None
240
- - `learning_rate`: 0.02
241
- - `weight_decay`: 0.0
242
- - `adam_beta1`: 0.9
243
- - `adam_beta2`: 0.999
244
- - `adam_epsilon`: 1e-08
245
- - `max_grad_norm`: 1.0
246
- - `num_train_epochs`: 5
247
- - `max_steps`: -1
248
- - `lr_scheduler_type`: linear
249
- - `lr_scheduler_kwargs`: {}
250
- - `warmup_ratio`: 0.1
251
- - `warmup_steps`: 0
252
- - `log_level`: passive
253
- - `log_level_replica`: warning
254
- - `log_on_each_node`: True
255
- - `logging_nan_inf_filter`: True
256
- - `save_safetensors`: True
257
- - `save_on_each_node`: False
258
- - `save_only_model`: False
259
- - `restore_callback_states_from_checkpoint`: False
260
- - `no_cuda`: False
261
- - `use_cpu`: False
262
- - `use_mps_device`: False
263
- - `seed`: 42
264
- - `data_seed`: None
265
- - `jit_mode_eval`: False
266
- - `use_ipex`: False
267
- - `bf16`: True
268
- - `fp16`: False
269
- - `fp16_opt_level`: O1
270
- - `half_precision_backend`: auto
271
- - `bf16_full_eval`: False
272
- - `fp16_full_eval`: False
273
- - `tf32`: None
274
- - `local_rank`: 0
275
- - `ddp_backend`: None
276
- - `tpu_num_cores`: None
277
- - `tpu_metrics_debug`: False
278
- - `debug`: []
279
- - `dataloader_drop_last`: False
280
- - `dataloader_num_workers`: 0
281
- - `dataloader_prefetch_factor`: None
282
- - `past_index`: -1
283
- - `disable_tqdm`: False
284
- - `remove_unused_columns`: True
285
- - `label_names`: None
286
- - `load_best_model_at_end`: True
287
- - `ignore_data_skip`: False
288
- - `fsdp`: []
289
- - `fsdp_min_num_params`: 0
290
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
291
- - `fsdp_transformer_layer_cls_to_wrap`: None
292
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
293
- - `deepspeed`: None
294
- - `label_smoothing_factor`: 0.0
295
- - `optim`: adamw_torch
296
- - `optim_args`: None
297
- - `adafactor`: False
298
- - `group_by_length`: False
299
- - `length_column_name`: length
300
- - `ddp_find_unused_parameters`: None
301
- - `ddp_bucket_cap_mb`: None
302
- - `ddp_broadcast_buffers`: False
303
- - `dataloader_pin_memory`: True
304
- - `dataloader_persistent_workers`: False
305
- - `skip_memory_metrics`: True
306
- - `use_legacy_prediction_loop`: False
307
- - `push_to_hub`: False
308
- - `resume_from_checkpoint`: None
309
- - `hub_model_id`: None
310
- - `hub_strategy`: every_save
311
- - `hub_private_repo`: False
312
- - `hub_always_push`: False
313
- - `gradient_checkpointing`: False
314
- - `gradient_checkpointing_kwargs`: None
315
- - `include_inputs_for_metrics`: False
316
- - `include_for_metrics`: []
317
- - `eval_do_concat_batches`: True
318
- - `fp16_backend`: auto
319
- - `push_to_hub_model_id`: None
320
- - `push_to_hub_organization`: None
321
- - `mp_parameters`:
322
- - `auto_find_batch_size`: False
323
- - `full_determinism`: False
324
- - `torchdynamo`: None
325
- - `ray_scope`: last
326
- - `ddp_timeout`: 1800
327
- - `torch_compile`: False
328
- - `torch_compile_backend`: None
329
- - `torch_compile_mode`: None
330
- - `dispatch_batches`: None
331
- - `split_batches`: None
332
- - `include_tokens_per_second`: False
333
- - `include_num_input_tokens_seen`: False
334
- - `neftune_noise_alpha`: None
335
- - `optim_target_modules`: None
336
- - `batch_eval_metrics`: False
337
- - `eval_on_start`: False
338
- - `use_liger_kernel`: False
339
- - `eval_use_gather_object`: False
340
- - `average_tokens_across_devices`: False
341
- - `prompts`: None
342
- - `batch_sampler`: batch_sampler
343
- - `multi_dataset_batch_sampler`: proportional
344
-
345
- </details>
346
-
347
- ### Training Logs
348
- | Epoch | Step | Training Loss | korsts-valid_spearman_cosine |
349
- |:------:|:----:|:-------------:|:----------------------------:|
350
- | -1 | -1 | - | 0.5714 |
351
- | 0.3676 | 50 | 2.6082 | - |
352
- | 0.7353 | 100 | 1.9692 | - |
353
- | -1 | -1 | - | 0.8077 |
354
- | 0.2604 | 50 | 5.0909 | - |
355
- | 0.5208 | 100 | 3.4769 | - |
356
- | 0.7812 | 150 | 3.0821 | - |
357
- | -1 | -1 | - | 0.7796 |
358
- | 0.1381 | 50 | 0.1676 | - |
359
- | 0.2762 | 100 | 0.1483 | - |
360
- | 0.4144 | 150 | 0.1283 | - |
361
- | 0.5525 | 200 | 0.1186 | - |
362
- | 0.6906 | 250 | 0.1183 | - |
363
- | 0.8287 | 300 | 0.1019 | - |
364
- | 0.9669 | 350 | 0.0938 | - |
365
- | 1.0 | 362 | - | 0.8262 |
366
- | 1.1050 | 400 | 0.0593 | - |
367
- | 1.2431 | 450 | 0.0463 | - |
368
- | 1.3812 | 500 | 0.0443 | - |
369
- | 1.5193 | 550 | 0.0419 | - |
370
- | 1.6575 | 600 | 0.0419 | - |
371
- | 1.7956 | 650 | 0.0436 | - |
372
- | 1.9337 | 700 | 0.0406 | - |
373
- | 2.0 | 724 | - | 0.8307 |
374
- | 2.0718 | 750 | 0.0331 | - |
375
- | 2.2099 | 800 | 0.0229 | - |
376
- | 2.3481 | 850 | 0.0249 | - |
377
- | 2.4862 | 900 | 0.0231 | - |
378
- | 2.6243 | 950 | 0.0225 | - |
379
- | 2.7624 | 1000 | 0.023 | - |
380
- | 2.9006 | 1050 | 0.0241 | - |
381
- | 3.0 | 1086 | - | 0.8325 |
382
- | 3.0387 | 1100 | 0.0197 | - |
383
- | 3.1768 | 1150 | 0.012 | - |
384
- | 3.3149 | 1200 | 0.0115 | - |
385
- | 3.4530 | 1250 | 0.0117 | - |
386
- | 3.5912 | 1300 | 0.012 | - |
387
- | 3.7293 | 1350 | 0.0107 | - |
388
- | 3.8674 | 1400 | 0.011 | - |
389
- | 4.0 | 1448 | - | 0.8335 |
390
- | 4.0055 | 1450 | 0.0118 | - |
391
- | 4.1436 | 1500 | 0.0056 | - |
392
- | 4.2818 | 1550 | 0.0069 | - |
393
- | 4.4199 | 1600 | 0.006 | - |
394
- | 4.5580 | 1650 | 0.0057 | - |
395
- | 4.6961 | 1700 | 0.0055 | - |
396
- | 4.8343 | 1750 | 0.0078 | - |
397
- | 4.9724 | 1800 | 0.0072 | - |
398
- | 5.0 | 1810 | - | 0.8330 |
399
-
400
-
401
- ### Framework Versions
402
- - Python: 3.11.10
403
- - Sentence Transformers: 3.4.1
404
- - Transformers: 4.46.3
405
- - PyTorch: 2.4.1+cu124
406
- - Accelerate: 1.13.0
407
- - Datasets: 4.8.5
408
- - Tokenizers: 0.20.3
409
-
410
- ## Citation
411
-
412
- ### BibTeX
413
-
414
- #### Sentence Transformers
415
- ```bibtex
416
- @inproceedings{reimers-2019-sentence-bert,
417
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
418
- author = "Reimers, Nils and Gurevych, Iryna",
419
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
420
- month = "11",
421
- year = "2019",
422
- publisher = "Association for Computational Linguistics",
423
- url = "https://arxiv.org/abs/1908.10084",
424
- }
425
  ```
426
 
427
- #### MatryoshkaLoss
428
- ```bibtex
429
- @misc{kusupati2024matryoshka,
430
- title={Matryoshka Representation Learning},
431
- author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
432
- year={2024},
433
- eprint={2205.13147},
434
- archivePrefix={arXiv},
435
- primaryClass={cs.LG}
436
- }
437
- ```
438
 
439
- <!--
440
- ## Glossary
 
 
441
 
442
- *Clearly define terms in order to be accessible across audiences.*
443
- -->
444
 
445
- <!--
446
- ## Model Card Authors
 
 
 
 
447
 
448
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
449
- -->
450
 
451
- <!--
452
- ## Model Card Contact
453
 
454
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
455
- -->
 
1
  ---
2
+ language:
3
+ - ko
4
+ license: apache-2.0
5
+ library_name: sentence-transformers
6
+ pipeline_tag: sentence-similarity
7
  tags:
8
  - sentence-transformers
9
  - sentence-similarity
10
  - feature-extraction
11
+ - static-embedding
12
+ - model2vec
13
+ - korean
14
+ - ko
15
+ - matryoshka
16
+ datasets:
17
+ - kakaobrain/kor_nli
18
+ - mteb/KorSTS
19
+ - klue/klue
20
+ - Helsinki-NLP/opus-100
21
+ base_model: klue/roberta-base
22
  ---
23
 
24
+ # kor-static-embedding-512
 
 
 
 
25
 
26
+ An **ultra-lightweight static embedding** model specialized for Korean: **68MB**, **512 dimensions**.
 
 
 
 
 
 
 
 
27
 
28
+ A variant of [kekeappa/kor-static-embedding-512](https://huggingface.co/kekeappa/kor-static-embedding-512) built with Matryoshka training and **truncated to 512 dimensions**. The same model family comes in four dimensions; pick the one that fits your use case:
29
 
30
+ | Dimensions | Size | Use case |
31
+ |---:|---:|---|
32
+ | **[64](https://huggingface.co/kekeappa/kor-static-embedding-64)** | 9MB | 🌐 Browser · mobile · edge |
33
+ | **[128](https://huggingface.co/kekeappa/kor-static-embedding-128)** | 17MB | ⚡ Lightweight search and classification |
34
+ | **[256](https://huggingface.co/kekeappa/kor-static-embedding-256)** | 34MB | ⚖️ Best cost-performance balance |
35
+ | **[512](https://huggingface.co/kekeappa/kor-static-embedding-512)** | 68MB | 🎯 Highest accuracy |
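
The description above says the variants come from Matryoshka training followed by truncation. A minimal sketch of what truncation means for a single vector, assuming the smaller sizes correspond to the leading components of a larger embedding (the helper names and the toy 4-dimensional vectors are illustrative, not taken from the released code):

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for two full-size embeddings (hypothetical values).
full_a = [0.6, 0.8, 0.0, 0.0]
full_b = [0.6, 0.0, 0.8, 0.0]

small_a = truncate_and_normalize(full_a, 2)
small_b = truncate_and_normalize(full_b, 2)

print(round(cosine(full_a, full_b), 2))    # 0.36
print(round(cosine(small_a, small_b), 2))  # 0.6
```

Similarity scores generally shift after truncation, as the toy numbers show, so it is worth re-checking any similarity threshold when switching between family members.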
36
 
37
+ ## Performance (KorSTS / KLUE-STS)
38
 
39
+ | Benchmark | Pearson | **Spearman** |
40
+ |---|---:|---:|
41
+ | KorSTS-test | 0.7760 | **0.7718** |
42
+ | KorSTS-valid | - | **0.8330** |
43
+ | KLUE-STS-val | - | **0.7033** |
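
For readers unfamiliar with the two columns: Pearson measures linear agreement between predicted similarities and gold scores, while Spearman compares only their rank order. The following is a small self-contained sketch of both, written from the textbook definitions rather than from the evaluator used for the table; the gold and predicted values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical gold STS scores (0-5) and model cosine similarities.
gold = [0.0, 1.0, 2.5, 4.0, 5.0]
predicted = [0.10, 0.20, 0.50, 0.60, 0.95]

print(spearman(gold, predicted))  # ordering is identical, so this is 1.0
print(pearson(gold, predicted))   # below 1.0 because the relation is not linear
```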
 
 
 
 
44
 
45
+ ## Usage
 
 
 
 
 
 
46
 
 
47
  ```python
48
  from sentence_transformers import SentenceTransformer
49
 
50
+ model = SentenceTransformer("kekeappa/kor-static-embedding-512")
51
+ emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
52
+ print(emb.shape) # (2, 512)
53
  ```
54
 
55
+ ## Features
 
 
 
 
 
 
 
 
 
 
56
 
57
+ - **Architecture**: StaticEmbedding (model2vec family), with no transformer attention
58
+ - **Inference**: optimized for CPU; no GPU required
59
+ - **Speed**: under 1 ms per single query (fast even in the browser)
60
+ - **Korean-English compatibility**: trained cross-lingually, so Korean queries can retrieve English documents
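
To make "no transformer attention" concrete: the previous README listed the architecture as `EmbeddingBag(32000, 512, mode='mean')`, meaning a sentence vector is the mean of per-token vectors looked up in a fixed table. A toy sketch of that lookup-and-average step, assuming a hypothetical three-token vocabulary with 3-dimensional vectors and pre-tokenized input (the real model uses its own tokenizer and a 32000 x 512 table):

```python
# Hypothetical miniature embedding table: token -> vector.
TOY_TABLE = {
    "한국어": [1.0, 0.0, 0.0],
    "문장": [0.0, 1.0, 0.0],
    "임베딩": [0.0, 0.0, 1.0],
}


def embed(tokens, table):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    dim = len(next(iter(table.values())))
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


print(embed(["한국어", "문장"], TOY_TABLE))  # [0.5, 0.5, 0.0]
```

Because the work per sentence is a table lookup plus an average, the cost grows only with the number of tokens, which is consistent with the CPU-friendly behaviour listed above.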
61
 
62
+ ## Training method
 
63
 
64
+ Four-stage training:
65
+ 1. **Distillation initialization**: vocabulary embeddings from the `BM-K/KoSimCSE-roberta-multitask` teacher → PCA + Zipf weighting
66
+ 2. **KorNLI MNRL**: `kakaobrain/kor_nli` (multi_nli + snli), 277K triplets
67
+ 3. **Cross-lingual MNRL**: OPUS-100 ko-en parallel data, 200K pairs
68
+ 4. **Matryoshka regression**: KorSTS + KLUE-STS + English STS-B translated with NLLB
69
+ - 64/128/256/512 dimensions optimized jointly (`MatryoshkaLoss`)
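
Stage 4 can be pictured as one regression loss evaluated at several prefix lengths and summed. The sketch below follows the configuration shown in the previous README (`CosineSimilarityLoss` wrapped in `MatryoshkaLoss`, equal weights) but is a plain-Python illustration, not the sentence-transformers implementation; the function names, the toy 4-dimensional vectors, and the dimensions `[4, 2]` are assumptions chosen to keep the arithmetic visible.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(pairs, scores, dim):
    """Mean squared error between gold scores and cosine similarity
    of the embeddings truncated to their first `dim` components."""
    errors = [
        (cosine(a[:dim], b[:dim]) - score) ** 2
        for (a, b), score in zip(pairs, scores)
    ]
    return sum(errors) / len(errors)


def matryoshka_loss(pairs, scores, dims, weights=None):
    """Weighted sum of the base loss over every truncation length."""
    if weights is None:
        weights = [1.0] * len(dims)
    return sum(
        w * cosine_similarity_loss(pairs, scores, d)
        for d, w in zip(dims, weights)
    )


# Two toy sentence pairs with gold similarity scores in [0, 1].
pairs = [
    ([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]),
    ([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, -1.0, 0.0]),
]
scores = [1.0, 0.0]

print(matryoshka_loss(pairs, scores, dims=[4, 2]))  # 0.5
```

In this toy case the full vectors fit the gold scores exactly, so the whole loss comes from the 2-dimensional prefix. That is the pressure that pushes useful information into the leading components.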
70
 
71
+ Training code: https://github.com/johunsang/kor-static-embedding-512
 
72
 
73
+ ## License
 
74
 
75
+ Apache 2.0