BM-K committed on
Commit 8f0abf3 · verified · 1 Parent(s): 6c5516e

Update README.md

Files changed (1):
  1. README.md +235 -362

README.md CHANGED
@@ -4,59 +4,85 @@ tags:
4
  - sparse-encoder
5
  - sparse
6
  - splade
7
- - generated_from_trainer
8
- - dataset_size:1112040
9
- - loss:SpladeLoss
10
- - loss:SparseMultipleNegativesRankingLoss
11
- - loss:FlopsLoss
12
- widget:
- - text: 매크로 (명사). 복잡한 입력을 컴퓨터 프로그램에 대해 비교적 인간 친화적으로 줄인 표현. 전처리기는 컴파일되기 전에 모든 내장된 매크로를
-   소스 코드로 확장한다.
- - text: "브레네 호수 \n브레네 호수는 스위스 보주주 조 계곡에 위치한 호수입니다. 이 호수는 조 호수의 북쪽에 있으며, 단 200미터 떨어져 있습니다. 해발 1002미터로 조 호수보다 2미터 낮습니다."
- - text: 그 앨범 "Making Lite of Myself"를 만든 코미디언의 국적은 무엇인가요?
- - text: 비어 있음의 의미는 무엇인가요?
- - text: '파트라데비(콘카니어: 포트라데오)는 고아의 페르넴 탈루크에 위치한 마을로, 고아와 마하라슈트라 경계에 있습니다. 이 마을에는 파트라데비 검문소가 위치해 있습니다.'
21
  pipeline_tag: feature-extraction
22
  library_name: sentence-transformers
 
23
  ---
24
-
25
- # SPLADE Sparse Encoder
26
-
27
- This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model trained on the json dataset using the [sentence-transformers](https://www.SBERT.net) library. It maps sentences & paragraphs to a 50000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
28
- ## Model Details
29
-
30
- ### Model Description
31
  - **Model Type:** SPLADE Sparse Encoder
32
  <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
33
- - **Maximum Sequence Length:** 3072 tokens
34
  - **Output Dimensionality:** 50000 dimensions
35
  - **Similarity Function:** Dot Product
36
- - **Training Dataset:**
37
- - json
38
- <!-- - **Language:** Unknown -->
39
- <!-- - **License:** Unknown -->
40
-
41
- ### Model Sources
42
-
43
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
44
- - **Documentation:** [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
45
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
46
- - **Hugging Face:** [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)
47
 
48
  ### Full Model Architecture
49
 
50
  ```
51
  SparseEncoder(
52
- (0): MLMTransformer({'max_seq_length': 3072, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
53
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
54
  )
55
  ```
56
 
57
- ## Usage
58
-
59
- ### Direct Usage (Sentence Transformers)
60
 
61
  First install the Sentence Transformers library:
62
 
@@ -66,341 +92,188 @@ pip install -U sentence-transformers
66
 
67
  Then you can load this model and run inference.
68
  ```python

  from sentence_transformers import SparseEncoder

- # Download from the 🤗 Hub
- model = SparseEncoder("sparse_encoder_model_id")
- # Run inference
- sentences = [
-     '파트라데비는 고아의 페르넴 타룩에 위치한 마을로, 고아는 어느 나라에 있는 주인가요?',
-     '파트라데비(콘카니어: 포트라데오)는 고아의 페르넴 탈루크에 위치한 마을로, 고아와 마하라슈트라 경계에 있습니다. 이 마을에는 파트라데비 검문소가 위치해 있습니다.',
-     '콘디바데 A.m 콘디바데 A.m은 인도의 한 마을입니다. 이 마을은 마하라슈트라 주의 푸네 지구 마왈 탈루카에 위치해 있습니다.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 50000]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities)
- # tensor([[25.1626, 27.0573,  7.1256],
- #         [27.0573, 84.2966, 31.7376],
- #         [ 7.1256, 31.7376, 74.3025]])
- ```
90
 
91
- <!--
92
- ### Direct Usage (Transformers)
93
-
94
- <details><summary>Click to see the direct usage in Transformers</summary>
95
-
96
- </details>
97
- -->
98
-
99
- <!--
100
- ### Downstream Usage (Sentence Transformers)
101
-
102
- You can finetune this model on your own dataset.
103
-
104
- <details><summary>Click to expand</summary>
105
-
106
- </details>
107
- -->
108
-
109
- <!--
110
- ### Out-of-Scope Use
111
-
112
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
113
- -->
114
-
115
- <!--
116
- ## Bias, Risks and Limitations
117
-
118
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
119
- -->
120
-
121
- <!--
122
- ### Recommendations
123
-
124
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
125
- -->
126
-
127
- ## Training Details
128
-
129
- ### Training Dataset
130
-
131
- #### json
132
-
133
- * Dataset: json
134
- * Size: 1,112,040 training samples
135
- * Columns: <code>anchor</code>, <code>positive</code>, <code>negative_1</code>, <code>negative_2</code>, and <code>negative_3</code>
136
- * Approximate statistics based on the first 1000 samples:
137
- | | anchor | positive | negative_1 | negative_2 | negative_3 |
138
- |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
139
- | type | string | string | string | string | string |
140
- | details | <ul><li>min: 3 tokens</li><li>mean: 18.8 tokens</li><li>max: 126 tokens</li></ul> | <ul><li>min: 15 tokens</li><li>mean: 50.36 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 47.98 tokens</li><li>max: 73 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 47.69 tokens</li><li>max: 79 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 47.96 tokens</li><li>max: 78 tokens</li></ul> |
141
- * Samples:
142
- | anchor | positive | negative_1 | negative_2 | negative_3 |
143
- |:-------------------------------------------------------------------|:-------------------------------------------------------|:--------------------------------------------------------------|:---------------------------------------------------------------|:----------------------------------------------------------------|
144
- | <code>난촨구와 둥촨구는 어느 나라에 위치해 있습니까?</code> | <code>난촨구(南川区)는 중국 충칭의 구이자 이전의 현이다.</code> | <code>남풍현(南丰县)은 중국 장시성(江西省) 푸저우(福州)에 위치한 군이다.</code> | <code>도교, 광둥 도교(道滘)는 중국 남부 광둥성 동관 시의 관할 하에 있는 도시입니다.</code> | <code>동포구 동포구는 중국 쓰촨성의 구역입니다. 이곳은 메이산시의 관할 하에 있습니다.</code> |
145
- | <code>가짜대나무(Pseudosasa)와 별꽃(Cerastium)은 모두 자생 식물과 관련이 있습니까?</code> | <code>가짜사사(Pseudosasa)는 풀과에 속하는 동아시아 대나무의 속입니다.</code> | <code>세팔로소루스(Cephalosorus)는 데이지 과에 속하는 꽃이 피는 식물의 속입니다.</code> | <code>가짜기생충속(Pseudoparasitus)은 라엘라피다에 속하는 진드기의 속입니다.</code> | <code>페리타사(Peritassa)는 노박덩굴과(Celastraceae) 식물의 속입니다.</code> |
146
- | <code>그저우와 헤이룽장성 동닝은 어떤 나라와 접경하고 있습니까?</code> | <code>허주(贺州)는 중화인민공화국 광시 좡족 자치구 북동부에 위치한 지급시이다.</code> | <code>지관구(지관구)는 중국 인민공화국 헤이룽장성 지시시의 구이자 시청 소재지입니다.</code> | <code>헤동 가도(河东街道)는 중국 광시(广西) 리우저우(柳州) 청중 구(城中区)의 가도입니다.</code> | <code>화닝현 (华宁县; 병음: Huáníng Xiàn)은 중국 윈난성 유시시에 위치해 있습니다.</code> |
147
- * Loss: [<code>SpladeLoss</code>](https://sbert.net/docs/package_reference/sparse_encoder/losses.html#spladeloss) with these parameters:
148
- ```json
149
- {
150
- "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
151
- "document_regularizer_weight": 3e-05,
152
- "query_regularizer_weight": 5e-05
153
- }
154
- ```
155
-
156
- ### Training Hyperparameters
157
- #### Non-Default Hyperparameters
158
-
159
- - `per_device_train_batch_size`: 6
160
- - `gradient_accumulation_steps`: 4
161
- - `learning_rate`: 2e-06
162
- - `warmup_ratio`: 0.1
163
- - `bf16`: True
164
- - `ddp_find_unused_parameters`: True
165
- - `ddp_timeout`: 7200
166
- - `batch_sampler`: no_duplicates
167
-
168
- #### All Hyperparameters
169
- <details><summary>Click to expand</summary>
170
-
171
- - `overwrite_output_dir`: False
172
- - `do_predict`: False
173
- - `eval_strategy`: no
174
- - `prediction_loss_only`: True
175
- - `per_device_train_batch_size`: 6
176
- - `per_device_eval_batch_size`: 8
177
- - `per_gpu_train_batch_size`: None
178
- - `per_gpu_eval_batch_size`: None
179
- - `gradient_accumulation_steps`: 4
180
- - `eval_accumulation_steps`: None
181
- - `torch_empty_cache_steps`: None
182
- - `learning_rate`: 2e-06
183
- - `weight_decay`: 0.0
184
- - `adam_beta1`: 0.9
185
- - `adam_beta2`: 0.999
186
- - `adam_epsilon`: 1e-08
187
- - `max_grad_norm`: 1.0
188
- - `num_train_epochs`: 3
189
- - `max_steps`: -1
190
- - `lr_scheduler_type`: linear
191
- - `lr_scheduler_kwargs`: {}
192
- - `warmup_ratio`: 0.1
193
- - `warmup_steps`: 0
194
- - `log_level`: passive
195
- - `log_level_replica`: warning
196
- - `log_on_each_node`: True
197
- - `logging_nan_inf_filter`: True
198
- - `save_safetensors`: True
199
- - `save_on_each_node`: False
200
- - `save_only_model`: False
201
- - `restore_callback_states_from_checkpoint`: False
202
- - `no_cuda`: False
203
- - `use_cpu`: False
204
- - `use_mps_device`: False
205
- - `seed`: 42
206
- - `data_seed`: None
207
- - `jit_mode_eval`: False
208
- - `use_ipex`: False
209
- - `bf16`: True
210
- - `fp16`: False
211
- - `fp16_opt_level`: O1
212
- - `half_precision_backend`: auto
213
- - `bf16_full_eval`: False
214
- - `fp16_full_eval`: False
215
- - `tf32`: None
216
- - `local_rank`: 2
217
- - `ddp_backend`: None
218
- - `tpu_num_cores`: None
219
- - `tpu_metrics_debug`: False
220
- - `debug`: []
221
- - `dataloader_drop_last`: True
222
- - `dataloader_num_workers`: 0
223
- - `dataloader_prefetch_factor`: None
224
- - `past_index`: -1
225
- - `disable_tqdm`: False
226
- - `remove_unused_columns`: True
227
- - `label_names`: None
228
- - `load_best_model_at_end`: False
229
- - `ignore_data_skip`: False
230
- - `fsdp`: []
231
- - `fsdp_min_num_params`: 0
232
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
233
- - `tp_size`: 0
234
- - `fsdp_transformer_layer_cls_to_wrap`: None
235
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
236
- - `deepspeed`: None
237
- - `label_smoothing_factor`: 0.0
238
- - `optim`: adamw_torch
239
- - `optim_args`: None
240
- - `adafactor`: False
241
- - `group_by_length`: False
242
- - `length_column_name`: length
243
- - `ddp_find_unused_parameters`: True
244
- - `ddp_bucket_cap_mb`: None
245
- - `ddp_broadcast_buffers`: False
246
- - `dataloader_pin_memory`: True
247
- - `dataloader_persistent_workers`: False
248
- - `skip_memory_metrics`: True
249
- - `use_legacy_prediction_loop`: False
250
- - `push_to_hub`: False
251
- - `resume_from_checkpoint`: None
252
- - `hub_model_id`: None
253
- - `hub_strategy`: every_save
254
- - `hub_private_repo`: None
255
- - `hub_always_push`: False
256
- - `gradient_checkpointing`: False
257
- - `gradient_checkpointing_kwargs`: None
258
- - `include_inputs_for_metrics`: False
259
- - `include_for_metrics`: []
260
- - `eval_do_concat_batches`: True
261
- - `fp16_backend`: auto
262
- - `push_to_hub_model_id`: None
263
- - `push_to_hub_organization`: None
264
- - `mp_parameters`:
265
- - `auto_find_batch_size`: False
266
- - `full_determinism`: False
267
- - `torchdynamo`: None
268
- - `ray_scope`: last
269
- - `ddp_timeout`: 7200
270
- - `torch_compile`: False
271
- - `torch_compile_backend`: None
272
- - `torch_compile_mode`: None
273
- - `include_tokens_per_second`: False
274
- - `include_num_input_tokens_seen`: False
275
- - `neftune_noise_alpha`: None
276
- - `optim_target_modules`: None
277
- - `batch_eval_metrics`: False
278
- - `eval_on_start`: False
279
- - `use_liger_kernel`: False
280
- - `eval_use_gather_object`: False
281
- - `average_tokens_across_devices`: False
282
- - `prompts`: None
283
- - `batch_sampler`: no_duplicates
284
- - `multi_dataset_batch_sampler`: proportional
285
- - `router_mapping`: {}
286
- - `learning_rate_mapping`: {}
287
-
288
- </details>
289
-
290
- ### Training Logs
291
- | Epoch | Step | Training Loss |
292
- |:------:|:-----:|:-------------:|
293
- | 0.0863 | 1000 | 4.8919 |
294
- | 0.1727 | 2000 | 3.4433 |
295
- | 0.2590 | 3000 | 3.1294 |
296
- | 0.3453 | 4000 | 2.9256 |
297
- | 0.4316 | 5000 | 2.8705 |
298
- | 0.5180 | 6000 | 2.2949 |
299
- | 0.6043 | 7000 | 1.451 |
300
- | 0.6906 | 8000 | 1.1573 |
301
- | 0.7770 | 9000 | 1.0298 |
302
- | 0.8633 | 10000 | 1.1008 |
303
- | 0.9496 | 11000 | 1.3943 |
304
- | 1.0360 | 12000 | 2.1922 |
305
- | 1.1223 | 13000 | 2.6991 |
306
- | 1.2087 | 14000 | 2.4977 |
307
- | 1.2950 | 15000 | 2.448 |
308
- | 1.3813 | 16000 | 2.4044 |
309
- | 1.4676 | 17000 | 2.3224 |
310
- | 1.5540 | 18000 | 1.4636 |
311
- | 1.6403 | 19000 | 1.0056 |
312
- | 1.7266 | 20000 | 0.8397 |
313
- | 1.8129 | 21000 | 0.8211 |
314
- | 1.8993 | 22000 | 0.9905 |
315
- | 1.9856 | 23000 | 1.3015 |
316
- | 2.0720 | 24000 | 2.3987 |
317
- | 2.1583 | 25000 | 2.3067 |
318
- | 2.2447 | 26000 | 2.2579 |
319
- | 2.3310 | 27000 | 2.2134 |
320
- | 2.4173 | 28000 | 2.2357 |
321
- | 2.5036 | 29000 | 1.867 |
322
- | 2.5900 | 30000 | 1.0632 |
323
- | 2.6763 | 31000 | 0.8168 |
324
- | 2.7626 | 32000 | 0.7357 |
325
- | 2.8489 | 33000 | 0.7851 |
326
- | 2.9353 | 34000 | 1.0681 |
327
-
328
-
329
- ### Framework Versions
330
- - Python: 3.11.12
331
- - Sentence Transformers: 5.0.0
332
- - Transformers: 4.51.3
333
- - PyTorch: 2.7.0+cu128
334
- - Accelerate: 1.5.2
335
- - Datasets: 2.21.0
336
- - Tokenizers: 0.21.1
337
 
338
- ## Citation
339
 
340
- ### BibTeX
341
-
342
- #### Sentence Transformers
343
- ```bibtex
344
- @inproceedings{reimers-2019-sentence-bert,
345
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
346
- author = "Reimers, Nils and Gurevych, Iryna",
347
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
348
- month = "11",
349
- year = "2019",
350
- publisher = "Association for Computational Linguistics",
351
- url = "https://arxiv.org/abs/1908.10084",
352
- }
353
  ```
354
 
355
- #### SpladeLoss
356
- ```bibtex
357
- @misc{formal2022distillationhardnegativesampling,
358
- title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
359
- author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
360
- year={2022},
361
- eprint={2205.04733},
362
- archivePrefix={arXiv},
363
- primaryClass={cs.IR},
364
- url={https://arxiv.org/abs/2205.04733},
365
- }
366
- ```
367
 
368
- #### SparseMultipleNegativesRankingLoss
369
- ```bibtex
370
- @misc{henderson2017efficient,
371
- title={Efficient Natural Language Response Suggestion for Smart Reply},
372
- author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
373
- year={2017},
374
- eprint={1705.00652},
375
- archivePrefix={arXiv},
376
- primaryClass={cs.CL}
377
- }
378
  ```
379
-
380
- #### FlopsLoss
381
- ```bibtex
382
- @article{paria2020minimizing,
383
- title={Minimizing flops to learn efficient sparse representations},
384
- author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
385
- journal={arXiv preprint arXiv:2004.05665},
386
- year={2020}
387
  }
388
  ```
389
 
390
- <!--
391
- ## Glossary
392
-
393
- *Clearly define terms in order to be accessible across audiences.*
394
- -->
395
-
396
- <!--
397
- ## Model Card Authors
398
-
399
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
400
- -->
401
-
402
- <!--
403
- ## Model Card Contact
404
 
405
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
406
- -->
 
4
  - sparse-encoder
5
  - sparse
6
  - splade
7
  pipeline_tag: feature-extraction
8
  library_name: sentence-transformers
9
+ license: apache-2.0
10
  ---
11
+ <p align="center">
12
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/61d6f4a4d49065ee28a1ee7e/V8n2En7BlMNHoi1YXVv8Q.png" width="400"/>
13
+ </p>
14
+
15
+ # PIXIE-Splade-Preview
16
+ **PIXIE-Splade-Preview** is a Korean-only [SPLADE](https://arxiv.org/abs/2403.06789) (Sparse Lexical and Expansion) retriever, developed by [TelePIX Co., Ltd](https://telepix.net/).
17
+ **PIXIE** stands for Tele**PIX** **I**ntelligent **E**mbedding, representing TelePIX's high-performance embedding technology.
18
+ This model is trained exclusively on Korean data and outputs sparse lexical vectors that are directly
19
+ compatible with inverted indexing (e.g., Lucene/Elasticsearch).
20
+ Because each non-zero weight corresponds to a Korean subword/token,
21
+ interpretability is built-in: you can inspect which tokens drive retrieval.
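As a toy illustration of that interpretability (made-up token weights, not actual model output), a sparse embedding can be read as a token-to-weight map, and the tokens active on both the query and document side directly explain the dot-product score:

```python
# Hypothetical SPLADE-style sparse embeddings as token -> weight maps.
# (Illustrative weights only; real values come from the model.)
query_vec = {"위성": 1.8, "데이터": 1.2, "활용": 0.9, "산업": 0.4}
doc_vec = {"위성": 1.5, "데이터": 1.1, "분석": 1.3, "서비스": 0.7}

# Dot-product similarity only involves tokens active on both sides,
# so the overlapping tokens directly explain the score.
overlap = {t: query_vec[t] * doc_vec[t] for t in query_vec.keys() & doc_vec.keys()}
score = sum(overlap.values())

top = sorted(overlap.items(), key=lambda kv: kv[1], reverse=True)
print(round(score, 2))  # 4.02
print(top[0][0])        # "위성" contributes the most
```

Reading off the largest products in `overlap` is exactly the kind of explanation the usage example below prints for real queries.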
22
+
23
+ ## Why SPLADE for Korean Search?
24
+ - **Inverted Index Ready**: Directly index weighted tokens in standard IR stacks (Lucene/Elasticsearch).
25
+ - **Interpretable by Design**: Top-k contributing tokens per query/document explain *why* a hit matched.
26
+ - **Production-Friendly**: Fast candidate generation at web scale; memory/latency tunable via sparsity thresholds.
27
+ - **Hybrid-Retrieval Friendly**: Combine with dense retrievers via score fusion.
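The score-fusion point can be sketched as a weighted min-max fusion of per-document scores (illustrative numbers and a hypothetical `alpha` weighting, not a prescribed recipe):

```python
def min_max(scores):
    """Normalize a retriever's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    rng = (hi - lo) or 1.0
    return {d: (s - lo) / rng for d, s in scores.items()}

# Illustrative raw scores from a sparse and a dense retriever.
sparse_scores = {"doc1": 12.4, "doc2": 8.1, "doc3": 3.0}
dense_scores = {"doc1": 0.62, "doc2": 0.71, "doc3": 0.40}

alpha = 0.5  # hypothetical weight on the sparse side
ns, nd = min_max(sparse_scores), min_max(dense_scores)
fused = {d: alpha * ns[d] + (1 - alpha) * nd[d] for d in ns}

ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # doc1 ranks first: strong in sparse, decent in dense
```

Normalizing before mixing matters because raw sparse dot products and dense cosine scores live on very different scales.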
28
+
29
+ ## Model Description
30
  - **Model Type:** SPLADE Sparse Encoder
31
  <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
32
+ - **Maximum Sequence Length:** 8192 tokens
33
  - **Output Dimensionality:** 50000 dimensions
34
  - **Similarity Function:** Dot Product
35
+ - **Language:** Korean
36
+ - **License:** apache-2.0
37
 
38
  ### Full Model Architecture
39
 
40
  ```
41
  SparseEncoder(
42
+ (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
43
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
44
  )
45
  ```
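As a reference for what the SpladePooling stage computes, the standard SPLADE formulation applies log(1 + ReLU(·)) to the MLM logits and max-pools over sequence positions. A minimal numpy sketch of that formulation (illustrative, not the library's internal implementation):

```python
import numpy as np

def splade_pool(logits: np.ndarray) -> np.ndarray:
    """logits: (seq_len, vocab_size) MLM logits for a single text."""
    activations = np.log1p(np.maximum(logits, 0.0))  # log(1 + relu(x))
    return activations.max(axis=0)                   # max over sequence positions

# Two token positions, a three-word toy vocabulary.
logits = np.array([[1.0, -2.0, 0.5],
                   [0.2,  3.0, -1.0]])
vec = splade_pool(logits)
print(vec.round(4))  # negative logits contribute nothing, keeping the vector sparse
```

The log saturation damps very large logits, which keeps a handful of tokens from dominating every score.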
46
 
47
+ ## Quality Benchmarks
48
+ **PIXIE-Splade-Preview** delivers consistently strong performance across a diverse set of domain-specific and open-domain benchmarks in Korean, demonstrating its effectiveness in real-world search applications.
49
+ The table below presents the retrieval performance of several embedding models evaluated on a range of Korean MTEB benchmarks.
50
+ We report Normalized Discounted Cumulative Gain (NDCG) scores, which measure how well a ranked list of documents aligns with ground truth relevance. Higher values indicate better retrieval quality.
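For reference, NDCG@k for a single query can be computed as follows (the textbook definition, with the ideal ranking taken over the retrieved list; the MTEB harness may differ in details):

```python
import math

def ndcg_at_k(ranked_rels, k):
    """ranked_rels: graded relevance of the retrieved docs, in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0 -- already in ideal order
print(ndcg_at_k([3, 2, 0, 1], k=4))  # < 1.0 -- a relevant doc ranked too low
```

The log2 discount is why mistakes near the top of the ranking cost far more than mistakes further down.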
51
+
52
+ ### 7 Datasets of MTEB (Korean)
53
+ Our model, **telepix/PIXIE-Splade-Preview**, achieves competitive retrieval quality with only 0.1B parameters,
+ demonstrating solid generalization across domains such as multi-hop QA, long-document retrieval, public health, and e-commerce.
55
+
56
+ | Model Name | # params | Avg. NDCG | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
57
+ |------|:---:|:---:|:---:|:---:|:---:|:---:|
58
+ | telepix/PIXIE-Rune-Preview | 0.5B | 0.6905 | 0.6461 | 0.6859 | 0.7063 | 0.7238 |
59
+ | telepix/PIXIE-Splade-Preview | 0.1B | **0.6677** | **0.6238** | **0.6628** | **0.6831** | **0.7009** |
60
+ | | | | | | | |
61
+ | nlpai-lab/KURE-v1 | 0.5B | 0.6751 | 0.6277 | 0.6725 | 0.6907 | 0.7095 |
62
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.5B | 0.6592 | 0.6118 | 0.6542 | 0.6759 | 0.6949 |
63
+ | BAAI/bge-m3 | 0.5B | 0.6573 | 0.6099 | 0.6533 | 0.6732 | 0.6930 |
64
+ | Qwen/Qwen3-Embedding-0.6B | 0.6B | 0.6321 | 0.5894 | 0.6274 | 0.6455 | 0.6662 |
65
+ | jinaai/jina-embeddings-v3 | 0.6B | 0.6293 | 0.5800 | 0.6254 | 0.6456 | 0.6665 |
66
+ | Alibaba-NLP/gte-multilingual-base | 0.3B | 0.6111 | 0.5542 | 0.6089 | 0.6302 | 0.6511 |
67
+ | openai/text-embedding-3-large | N/A | 0.6015 | 0.5466 | 0.5999 | 0.6187 | 0.6409 |
68
+
69
+ Descriptions of the benchmark datasets used for evaluation are as follows:
70
+ - **Ko-StrategyQA**
71
+ A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents.
72
+ - **AutoRAGRetrieval**
73
+ A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors.
74
+ - **MIRACLRetrieval**
75
+ A document retrieval benchmark built on Korean Wikipedia articles.
76
+ - **PublicHealthQA**
77
+ A retrieval dataset focused on medical and public health topics.
78
+ - **BelebeleRetrieval**
79
+ A dataset for retrieving relevant content from web and news articles in Korean.
80
+ - **MultiLongDocRetrieval**
81
+ A long-document retrieval benchmark based on Korean Wikipedia and mC4 corpus.
82
+ - **XPQARetrieval**
83
+ A real-world dataset constructed from user queries and relevant product documents in a Korean e-commerce platform.
84
+
85
+ ## Direct Usage (Inverted index retrieval)
86
 
87
  First install the Sentence Transformers library:
88
  ```
  pip install -U sentence-transformers
  ```
92
 
93
  Then you can load this model and run inference.
94
  ```python
+ import torch
+ import numpy as np
+ from collections import defaultdict
+ from typing import Dict, List, Tuple
+ from transformers import AutoTokenizer
  from sentence_transformers import SparseEncoder
+
+ MODEL_NAME = "telepix/PIXIE-Splade-Preview"
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+ def _to_dense_numpy(x) -> np.ndarray:
+     """Safely convert a tensor returned by SparseEncoder to a dense numpy array."""
+     if hasattr(x, "to_dense"):
+         return x.to_dense().float().cpu().numpy()
+     # Already a numpy array or a dense tensor
+     if isinstance(x, torch.Tensor):
+         return x.float().cpu().numpy()
+     return np.asarray(x)
+
+ def _filter_special_ids(ids: List[int], tokenizer) -> List[int]:
+     """Filter special token IDs out of a list of token IDs."""
+     special = set(getattr(tokenizer, "all_special_ids", []) or [])
+     return [i for i in ids if i not in special]
+
+ def build_inverted_index(
+     model: SparseEncoder,
+     tokenizer,
+     documents: List[str],
+     batch_size: int = 8,
+     min_weight: float = 0.0,
+ ) -> Dict[int, List[Tuple[int, float]]]:
+     """
+     Encode the documents and construct an inverted index that maps
+     token_id to a postings list of (doc_idx, weight) tuples:
+     index[token_id] = [(doc_idx, weight), ...]
+     """
+     with torch.no_grad():
+         doc_emb = model.encode_document(documents, batch_size=batch_size)
+     doc_dense = _to_dense_numpy(doc_emb)
+
+     index: Dict[int, List[Tuple[int, float]]] = defaultdict(list)
+
+     for doc_idx, vec in enumerate(doc_dense):
+         # Keep only active tokens (those with weight above the threshold)
+         nz = np.flatnonzero(vec > min_weight)
+         # Optionally drop special tokens
+         nz = _filter_special_ids(nz.tolist(), tokenizer)
+
+         for token_id in nz:
+             index[token_id].append((doc_idx, float(vec[token_id])))
+
+     return index
+
+ # -------------------------
+ # Search + Token Overlap Explanation
+ # -------------------------
+ def splade_token_overlap_inverted(
+     model: SparseEncoder,
+     tokenizer,
+     inverted_index: Dict[int, List[Tuple[int, float]]],
+     documents: List[str],
+     queries: List[str],
+     top_k_docs: int = 3,
+     top_k_tokens: int = 10,
+     min_weight: float = 0.0,
+ ):
+     """
+     Score documents against each query via the inverted index and, for each
+     top-ranked document, show the contribution (qw * dw) of the top_k_tokens
+     overlapping tokens.
+     """
+     for qi, qtext in enumerate(queries):
+         with torch.no_grad():
+             q_vec = model.encode_query(qtext)
+         q_vec = _to_dense_numpy(q_vec).ravel()
+
+         # Active query tokens
+         q_nz = np.flatnonzero(q_vec > min_weight).tolist()
+         q_nz = _filter_special_ids(q_nz, tokenizer)
+
+         scores: Dict[int, float] = defaultdict(float)
+         # Per-document token contributions: token_id -> (qw, dw, qw * dw)
+         per_doc_contrib: Dict[int, Dict[int, Tuple[float, float, float]]] = defaultdict(dict)
+
+         for tid in q_nz:
+             qw = float(q_vec[tid])
+             for doc_idx, dw in inverted_index.get(tid, []):
+                 prod = qw * dw
+                 scores[doc_idx] += prod
+                 per_doc_contrib[doc_idx][tid] = (qw, dw, prod)
+
+         ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k_docs]
+
+         print("\n============================")
+         print(f"[Query {qi}] {qtext}")
+         print("============================")
+
+         if not ranked:
+             print("→ 일치 토큰이 없어 문서 스코어가 생성되지 않았습니다.")
+             continue
+
+         for rank, (doc_idx, score) in enumerate(ranked, start=1):
+             doc = documents[doc_idx]
+             print(f"\n→ Rank {rank} | Document {doc_idx}: {doc}")
+             print(f"   [Similarity Score ({score:.6f})]")
+
+             contrib = per_doc_contrib[doc_idx]
+             if not contrib:
+                 print("(겹치는 토큰이 없습니다.)")
+                 continue
+
+             # Extract the top K contributing tokens
+             top = sorted(contrib.items(), key=lambda kv: kv[1][2], reverse=True)[:top_k_tokens]
+             token_ids = [tid for tid, _ in top]
+             tokens = tokenizer.convert_ids_to_tokens(token_ids)
+
+             print("   [Top Contributing Tokens]")
+             for (tid, (qw, dw, prod)), tok in zip(top, tokens):
+                 print(f"   {tok:20} {prod:.6f}")
+
+ if __name__ == "__main__":
+     # 1) Load model and tokenizer
+     model = SparseEncoder(MODEL_NAME).to(DEVICE)
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+     # 2) Example data
+     queries = [
+         "텔레픽스는 어떤 산업 분야에서 위성 데이터를 활용하나요?",
+         "국방 분야에 어떤 위성 서비스가 제공되나요?",
+         "텔레픽스의 기술 수준은 어느 정도인가요?",
+     ]
+     documents = [
+         "텔레픽스는 해양, 자원, 농업 등 다양한 분야에서 위성 데이터를 분석하여 서비스를 제공합니다.",
+         "정찰 및 감시 목적의 위성 영상을 통해 국방 관련 정밀 분석 서비스를 제공합니다.",
+         "TelePIX의 광학 탑재체 및 AI 분석 기술은 Global standard를 상회하는 수준으로 평가받고 있습니다.",
+         "텔레픽스는 우주에서 수집한 정보를 분석하여 '우주 경제(Space Economy)'라는 새로운 가치를 창출하고 있습니다.",
+         "텔레픽스는 위성 영상 획득부터 분석, 서비스 제공까지 전 주기를 아우르는 솔루션을 제공합니다.",
+     ]
+
+     # 3) Build the document-side inverted index
+     inverted_index = build_inverted_index(
+         model=model,
+         tokenizer=tokenizer,
+         documents=documents,
+         batch_size=8,
+         min_weight=0.0,  # Raise to ~1e-6 - 1e-4 to filter out very small noise
+     )
+
+     # 4) Search and explain token overlap
+     splade_token_overlap_inverted(
+         model=model,
+         tokenizer=tokenizer,
+         inverted_index=inverted_index,
+         documents=documents,
+         queries=queries,
+         top_k_docs=2,    # Print only the top 2 documents
+         top_k_tokens=5,  # Top 5 contributing tokens for each document
+         min_weight=0.0,
+     )
  ```
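The `min_weight` threshold used above trades index size for score fidelity. A quick sanity check with toy vectors (synthetic data, not model output) shows that pruning tiny weights shrinks the postings lists while leaving dot-product scores essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50_000

# Toy sparse document vector: a few strong weights plus tiny noise.
doc = np.zeros(dim)
doc[rng.choice(dim, 20, replace=False)] = rng.uniform(0.5, 2.0, 20)
noise_ids = rng.choice(dim, 200, replace=False)
doc[noise_ids] += rng.uniform(0.0, 1e-4, 200)

# Toy query vector.
query = np.zeros(dim)
query[rng.choice(dim, 10, replace=False)] = rng.uniform(0.5, 2.0, 10)

# Prune postings below the threshold, as min_weight would.
pruned = np.where(doc > 1e-4, doc, 0.0)
print(np.count_nonzero(doc), np.count_nonzero(pruned))   # many -> few postings
print(abs(float(doc @ query) - float(pruned @ query)))   # negligible score change
```

In a real deployment the threshold is worth tuning on held-out queries, since it directly controls index memory and candidate-generation latency.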
263
 
264
+ ## License
265
+ The PIXIE-Splade-Preview model is licensed under Apache License 2.0.
266
 
267
+ ## Citation
268
  ```
269
+ @software{TelePIX-PIXIE-Splade-Preview,
270
+ title={PIXIE-Splade-Preview},
271
+ author={TelePIX AI Research Team},
272
+ year={2025},
273
+ url={https://huggingface.co/telepix/PIXIE-Splade-Preview}
274
  }
275
  ```
276
 
277
+ ## Contact
278
 
279
+ If you have any suggestions or questions about PIXIE, please reach out to the authors at bmkim@telepix.net.