Update README.md

README.md
CHANGED

@@ -4,59 +4,85 @@ tags:
- sparse-encoder
- sparse
- splade
- generated_from_trainer
- dataset_size:1112040
- loss:SpladeLoss
- loss:SparseMultipleNegativesRankingLoss
- loss:FlopsLoss
widget:
- text: ๋งคํฌ๋ก (๋ช์ฌ). ๋ณต์กํ ์๋ ฅ์ ์ปดํจํฐ ํ๋ก๊ทธ๋จ์ ๋ํด ๋น๊ต์ ์ธ๊ฐ ์นํ์ ์ผ๋ก ์ค์ธ ํํ. ์ ์ฒ๋ฆฌ๊ธฐ๋ ์ปดํ์ผ๋๊ธฐ ์ ์ ๋ชจ๋ ๋ด์ฅ๋ ๋งคํฌ๋ก๋ฅผ ์์ค ์ฝ๋๋ก ํ์ฅํ๋ค.
- text: "๋ธ๋ ๋ค ํธ์ \n๋ธ๋ ๋ค ํธ์๋ ์ค์์ค ๋ณด์ฃผ์ฃผ ์กฐ ๊ณ๊ณก์ ์์นํ ํธ์์๋๋ค. ์ด ํธ์๋ ์กฐ ํธ์์ ๋ถ์ชฝ์ ์์ผ๋ฉฐ, ๋จ 200๋ฏธํฐ ๋จ์ด์ ธ\
  \ ์์ต๋๋ค. ํด๋ฐ 1002๋ฏธํฐ๋ก ์กฐ ํธ์๋ณด๋ค 2๋ฏธํฐ ๋ฎ์ต๋๋ค."
- text: ๊ทธ ์จ๋ฒ "Making Lite of Myself"๋ฅผ ๋ง๋ ์ฝ๋ฏธ๋์ธ์ ๊ตญ์ ์ ๋ฌด์์ธ๊ฐ์?
- text: ๋น์ด ์์์ ์๋ฏธ๋ ๋ฌด์์ธ๊ฐ์?
- text: 'ํํธ๋ผ๋ฐ๋น(์ฝ์นด๋์ด: ํฌํธ๋ผ๋ฐ์ค)๋ ๊ณ ์์ ํ๋ฅด๋ด ํ๋ฃจํฌ์ ์์นํ ๋ง์๋ก, ๊ณ ์์ ๋งํ๋ผ์ํธ๋ผ ๊ฒฝ๊ณ์ ์์ต๋๋ค. ์ด ๋ง์์๋ ํํธ๋ผ๋ฐ๋น ๊ฒ๋ฌธ์๊ฐ ์์นํด ์์ต๋๋ค.'
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

- **Model Type:** SPLADE Sparse Encoder
<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- **Maximum Sequence Length:**
- **Output Dimensionality:** 50000 dimensions
- **Similarity Function:** Dot Product
- **
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Documentation:** [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)

### Full Model Architecture

```
SparseEncoder(
  (0): MLMTransformer({'max_seq_length':
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
)
```

##

First install the Sentence Transformers library:

@@ -66,341 +92,188 @@ pip install -U sentence-transformers

Then you can load this model and run inference.

```python
from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("sparse_encoder_model_id")
# Run inference
sentences = [
    'ํํธ๋ผ๋ฐ๋น๋ ๊ณ ์์ ํ๋ฅด๋ด ํ๋ฃฉ์ ์์นํ ๋ง์๋ก, ๊ณ ์๋ ์ด๋ ๋๋ผ์ ์๋ ์ฃผ์ธ๊ฐ์?',
    'ํํธ๋ผ๋ฐ๋น(์ฝ์นด๋์ด: ํฌํธ๋ผ๋ฐ์ค)๋ ๊ณ ์์ ํ๋ฅด๋ด ํ๋ฃจํฌ์ ์์นํ ๋ง์๋ก, ๊ณ ์์ ๋งํ๋ผ์ํธ๋ผ ๊ฒฝ๊ณ์ ์์ต๋๋ค. ์ด ๋ง์์๋ ํํธ๋ผ๋ฐ๋น ๊ฒ๋ฌธ์๊ฐ ์์นํด ์์ต๋๋ค.',
    '์ฝ๋๋ฐ๋ฐ A.m ์ฝ๋๋ฐ๋ฐ A.m์ ์ธ๋์ ํ ๋ง์์๋๋ค. ์ด ๋ง์์ ๋งํ๋ผ์ํธ๋ผ ์ฃผ์ ํธ๋ค ์ง๊ตฌ ๋ง์ ํ๋ฃจ์นด์ ์์นํด ์์ต๋๋ค.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 50000]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[25.1626, 27.0573,  7.1256],
#         [27.0573, 84.2966, 31.7376],
#         [ 7.1256, 31.7376, 74.3025]])
```

- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 7200
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>

### Training Logs

| Epoch  | Step  | Training Loss |
|:------:|:-----:|:-------------:|
| 0.0863 | 1000  | 4.8919 |
| 0.1727 | 2000  | 3.4433 |
| 0.2590 | 3000  | 3.1294 |
| 0.3453 | 4000  | 2.9256 |
| 0.4316 | 5000  | 2.8705 |
| 0.5180 | 6000  | 2.2949 |
| 0.6043 | 7000  | 1.451  |
| 0.6906 | 8000  | 1.1573 |
| 0.7770 | 9000  | 1.0298 |
| 0.8633 | 10000 | 1.1008 |
| 0.9496 | 11000 | 1.3943 |
| 1.0360 | 12000 | 2.1922 |
| 1.1223 | 13000 | 2.6991 |
| 1.2087 | 14000 | 2.4977 |
| 1.2950 | 15000 | 2.448  |
| 1.3813 | 16000 | 2.4044 |
| 1.4676 | 17000 | 2.3224 |
| 1.5540 | 18000 | 1.4636 |
| 1.6403 | 19000 | 1.0056 |
| 1.7266 | 20000 | 0.8397 |
| 1.8129 | 21000 | 0.8211 |
| 1.8993 | 22000 | 0.9905 |
| 1.9856 | 23000 | 1.3015 |
| 2.0720 | 24000 | 2.3987 |
| 2.1583 | 25000 | 2.3067 |
| 2.2447 | 26000 | 2.2579 |
| 2.3310 | 27000 | 2.2134 |
| 2.4173 | 28000 | 2.2357 |
| 2.5036 | 29000 | 1.867  |
| 2.5900 | 30000 | 1.0632 |
| 2.6763 | 31000 | 0.8168 |
| 2.7626 | 32000 | 0.7357 |
| 2.8489 | 33000 | 0.7851 |
| 2.9353 | 34000 | 1.0681 |

### Framework Versions

- Python: 3.11.12
- Sentence Transformers: 5.0.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu128
- Accelerate: 1.5.2
- Datasets: 2.21.0
- Tokenizers: 0.21.1

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

```bibtex
@misc{formal2022distillationhardnegativesampling,
    title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
    author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
    year={2022},
    eprint={2205.04733},
    archivePrefix={arXiv},
    primaryClass={cs.IR},
    url={https://arxiv.org/abs/2205.04733},
}
```

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}
```

## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact
-->

- sparse-encoder
- sparse
- splade
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: apache-2.0
---

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/61d6f4a4d49065ee28a1ee7e/V8n2En7BlMNHoi1YXVv8Q.png" width="400"/>
</p>

# PIXIE-Splade-Preview
+
**PIXIE-Splade-Preview** is a Korean-only [SPLADE](https://arxiv.org/abs/2403.06789) (Sparse Lexical and Expansion) retriever, developed by [TelePIX Co., Ltd](https://telepix.net/).
|
| 17 |
+
**PIXIE** stands for Tele**PIX** **I**ntelligent **E**mbedding, representing TelePIXโs high-performance embedding technology.
|
| 18 |
+
This model is trained exclusively on Korean data and outputs sparse lexical vectors that are directly
|
| 19 |
+
compatible with inverted indexing (e.g., Lucene/Elasticsearch).
|
| 20 |
+
Because each non-zero weight corresponds to a Korean subword/token,
|
| 21 |
+
interpretability is built-in: you can inspect which tokens drive retrieval.
|
| 22 |
+
|
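
As a quick illustration of that interpretability, here is a minimal sketch that encodes a single query and prints its highest-weighted tokens. The query string is a hypothetical example, and the sketch assumes the `decode` convenience method available on `SparseEncoder` in recent Sentence Transformers releases:

```python
from sentence_transformers import SparseEncoder

# Load this model from the Hugging Face Hub.
model = SparseEncoder("telepix/PIXIE-Splade-Preview")

# Encode a hypothetical Korean query ("satellite data analysis service") into a sparse
# vector over the 50,000-token vocabulary.
query_embedding = model.encode_query(["위성 데이터 분석 서비스"])

# Inspect the highest-weighted tokens, i.e. the terms that will drive retrieval.
for token, weight in model.decode(query_embedding, top_k=10)[0]:
    print(f"{token}\t{weight:.4f}")
```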

## Why SPLADE for Korean Search?

- **Inverted Index Ready**: Directly index weighted tokens in standard IR stacks (Lucene/Elasticsearch).
- **Interpretable by Design**: Top-k contributing tokens per query/document explain *why* a hit matched.
- **Production-Friendly**: Fast candidate generation at web scale; memory/latency tunable via sparsity thresholds.
- **Hybrid-Retrieval Friendly**: Combine with dense retrievers via score fusion (see the sketch below).
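
The fusion step itself is not prescribed by this card; one common, score-scale-free choice is reciprocal rank fusion (RRF), sketched here with hypothetical document ids:

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids (best first) into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# sparse_ranking would come from this SPLADE model, dense_ranking from any dense retriever.
sparse_ranking = ["doc2", "doc0", "doc1"]
dense_ranking = ["doc2", "doc3", "doc0"]
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))
# ['doc2', 'doc0', 'doc3', 'doc1']
```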
+
## Model Description
|
| 30 |
- **Model Type:** SPLADE Sparse Encoder
|
| 31 |
<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
|
| 32 |
+
- **Maximum Sequence Length:** 8192 tokens
|
| 33 |
- **Output Dimensionality:** 50000 dimensions
|
| 34 |
- **Similarity Function:** Dot Product
|
| 35 |
+
- **Language:** Korean
|
| 36 |
+
- **License:** apache-2.0
|

### Full Model Architecture

```
SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
)
```
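
Conceptually, the `SpladePooling` stage turns the per-position MLM logits into one sparse, vocabulary-sized vector per text. The rough sketch below mirrors the configuration shown above (`pooling_strategy: 'max'`, `activation_function: 'relu'`); the actual implementation lives in Sentence Transformers and may differ in details:

```python
import torch


def splade_max_pool(mlm_logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Max-pool ReLU-activated MLM logits over sequence positions.

    mlm_logits: (batch, seq_len, vocab_size) output of the MLM head.
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding.
    Returns one non-negative vector of size vocab_size per input text.
    """
    activated = torch.relu(mlm_logits)                    # negative logits -> 0
    activated = activated * attention_mask.unsqueeze(-1)  # ignore padding positions
    return activated.max(dim=1).values                    # (batch, vocab_size)
```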

## Quality Benchmarks

**PIXIE-Splade-Preview** delivers consistently strong performance across a diverse set of domain-specific and open-domain benchmarks in Korean, demonstrating its effectiveness in real-world search applications.
The table below presents the retrieval performance of several embedding models evaluated on a variety of Korean MTEB benchmarks.
We report Normalized Discounted Cumulative Gain (NDCG) scores, which measure how well a ranked list of documents aligns with ground-truth relevance. Higher values indicate better retrieval quality.
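
As a reference for how these numbers are computed per query, here is a minimal, illustrative NDCG@k function (not the exact MTEB evaluation code):

```python
import math
from typing import Sequence


def ndcg_at_k(ranked_relevance: Sequence[float], k: int) -> float:
    """NDCG@k for one query; ranked_relevance holds relevance grades in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


print(round(ndcg_at_k([1, 0, 1, 0], k=3), 4))  # 0.9197 (relevant docs at ranks 1 and 3)
```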

### 7 Datasets of MTEB (Korean)

Our model, **telepix/PIXIE-Splade-Preview**, achieves strong performance across most metrics and benchmarks, demonstrating solid generalization across domains such as multi-hop QA, long-document retrieval, public health, and e-commerce.

| Model Name | # params | Avg. NDCG | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
|------|:---:|:---:|:---:|:---:|:---:|:---:|
| telepix/PIXIE-Rune-Preview | 0.5B | 0.6905 | 0.6461 | 0.6859 | 0.7063 | 0.7238 |
| telepix/PIXIE-Splade-Preview | 0.1B | **0.6677** | **0.6238** | **0.6628** | **0.6831** | **0.7009** |
| | | | | | | |
| nlpai-lab/KURE-v1 | 0.5B | 0.6751 | 0.6277 | 0.6725 | 0.6907 | 0.7095 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.5B | 0.6592 | 0.6118 | 0.6542 | 0.6759 | 0.6949 |
| BAAI/bge-m3 | 0.5B | 0.6573 | 0.6099 | 0.6533 | 0.6732 | 0.6930 |
| Qwen/Qwen3-Embedding-0.6B | 0.6B | 0.6321 | 0.5894 | 0.6274 | 0.6455 | 0.6662 |
| jinaai/jina-embeddings-v3 | 0.6B | 0.6293 | 0.5800 | 0.6254 | 0.6456 | 0.6665 |
| Alibaba-NLP/gte-multilingual-base | 0.3B | 0.6111 | 0.5542 | 0.6089 | 0.6302 | 0.6511 |
| openai/text-embedding-3-large | N/A | 0.6015 | 0.5466 | 0.5999 | 0.6187 | 0.6409 |

Descriptions of the benchmark datasets used for evaluation are as follows:

- **Ko-StrategyQA**: A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents.
- **AutoRAGRetrieval**: A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors.
- **MIRACLRetrieval**: A document retrieval benchmark built on Korean Wikipedia articles.
- **PublicHealthQA**: A retrieval dataset focused on medical and public health topics.
- **BelebeleRetrieval**: A dataset for retrieving relevant content from web and news articles in Korean.
- **MultiLongDocRetrieval**: A long-document retrieval benchmark based on Korean Wikipedia and the mC4 corpus.
- **XPQARetrieval**: A real-world dataset constructed from user queries and relevant product documents from a Korean e-commerce platform.

## Direct Usage (Inverted index retrieval)

First install the Sentence Transformers library:

Then you can load this model and run inference.

```python
import torch
import numpy as np
from collections import defaultdict
from typing import Dict, List, Tuple
from transformers import AutoTokenizer
from sentence_transformers import SparseEncoder

MODEL_NAME = "telepix/PIXIE-Splade-Preview"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


def _to_dense_numpy(x) -> np.ndarray:
    """Safely converts a tensor returned by SparseEncoder to a dense numpy array."""
    if hasattr(x, "to_dense"):
        return x.to_dense().float().cpu().numpy()
    # If it's already a numpy array or a dense tensor
    if isinstance(x, torch.Tensor):
        return x.float().cpu().numpy()
    return np.asarray(x)


def _filter_special_ids(ids: List[int], tokenizer) -> List[int]:
    """Filters out special token IDs from a list of token IDs."""
    special = set(getattr(tokenizer, "all_special_ids", []) or [])
    return [i for i in ids if i not in special]


def build_inverted_index(
    model: SparseEncoder,
    tokenizer,
    documents: List[str],
    batch_size: int = 8,
    min_weight: float = 0.0,
) -> Dict[int, List[Tuple[int, float]]]:
    """
    Generates document embeddings and constructs an inverted index.
    The index maps token_id to a list of (doc_idx, weight) tuples:
    index[token_id] = [(doc_idx, weight), ...]
    """
    with torch.no_grad():
        doc_emb = model.encode_document(documents, batch_size=batch_size)
    doc_dense = _to_dense_numpy(doc_emb)

    index: Dict[int, List[Tuple[int, float]]] = defaultdict(list)

    for doc_idx, vec in enumerate(doc_dense):
        # Extract only active tokens (those with weight above the threshold)
        nz = np.flatnonzero(vec > min_weight)
        # Optionally, remove special tokens
        nz = _filter_special_ids(nz.tolist(), tokenizer)

        for token_id in nz:
            index[token_id].append((doc_idx, float(vec[token_id])))

    return index


# -------------------------
# Search + Token Overlap Explanation
# -------------------------
def splade_token_overlap_inverted(
    model: SparseEncoder,
    tokenizer,
    inverted_index: Dict[int, List[Tuple[int, float]]],
    documents: List[str],
    queries: List[str],
    top_k_docs: int = 3,
    top_k_tokens: int = 10,
    min_weight: float = 0.0,
):
    """
    Calculates SPLADE similarity using an inverted index and shows the contribution
    (qw*dw) of the top_k_tokens overlapping tokens for each top-ranked document.
    """
    for qi, qtext in enumerate(queries):
        with torch.no_grad():
            q_vec = model.encode_query(qtext)
        q_vec = _to_dense_numpy(q_vec).ravel()

        # Active query tokens
        q_nz = np.flatnonzero(q_vec > min_weight).tolist()
        q_nz = _filter_special_ids(q_nz, tokenizer)

        scores: Dict[int, float] = defaultdict(float)
        # Token contribution per document: token_id -> (qw, dw, qw*dw)
        per_doc_contrib: Dict[int, Dict[int, Tuple[float, float, float]]] = defaultdict(dict)

        for tid in q_nz:
            qw = float(q_vec[tid])
            postings = inverted_index.get(tid, [])
            for doc_idx, dw in postings:
                prod = qw * dw
                scores[doc_idx] += prod
                # Store per-token contribution (can be summed if needed)
                per_doc_contrib[doc_idx][tid] = (qw, dw, prod)

        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k_docs]

        print("\n============================")
        print(f"[Query {qi}] {qtext}")
        print("============================")

        if not ranked:
            print("No overlapping query tokens were found, so no document scores were produced.")
            continue

        for rank, (doc_idx, score) in enumerate(ranked, start=1):
            doc = documents[doc_idx]
            print(f"\n- Rank {rank} | Document {doc_idx}: {doc}")
            print(f"  [Similarity Score ({score:.6f})]")

            contrib = per_doc_contrib[doc_idx]
            if not contrib:
                print("  (No overlapping tokens.)")
                continue

            # Extract top K contributing tokens
            top = sorted(contrib.items(), key=lambda kv: kv[1][2], reverse=True)[:top_k_tokens]
            token_ids = [tid for tid, _ in top]
            tokens = tokenizer.convert_ids_to_tokens(token_ids)

            print("  [Top Contributing Tokens]")
            for (tid, (qw, dw, prod)), tok in zip(top, tokens):
                print(f"    {tok:20} {prod:.6f}")


if __name__ == "__main__":
    # 1) Load model and tokenizer
    model = SparseEncoder(MODEL_NAME).to(DEVICE)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # 2) Example data (Korean queries and documents)
    queries = [
        "ํ๋ ํฝ์ค๋ ์ด๋ค ์ฐ์ ๋ถ์ผ์์ ์์ฑ ๋ฐ์ดํฐ๋ฅผ ํ์ฉํ๋์?",
        "๊ตญ๋ฐฉ ๋ถ์ผ์ ์ด๋ค ์์ฑ ์๋น์ค๊ฐ ์ ๊ณต๋๋์?",
        "ํ๋ ํฝ์ค์ ๊ธฐ์ ์์ค์ ์ด๋ ์ ๋์ธ๊ฐ์?",
    ]
    documents = [
        "ํ๋ ํฝ์ค๋ ํด์, ์์, ๋์ ๋ฑ ๋ค์ํ ๋ถ์ผ์์ ์์ฑ ๋ฐ์ดํฐ๋ฅผ ๋ถ์ํ์ฌ ์๋น์ค๋ฅผ ์ ๊ณตํฉ๋๋ค.",
        "์ ์ฐฐ ๋ฐ ๊ฐ์ ๋ชฉ์ ์ ์์ฑ ์์์ ํตํด ๊ตญ๋ฐฉ ๊ด๋ จ ์ ๋ฐ ๋ถ์ ์๋น์ค๋ฅผ ์ ๊ณตํฉ๋๋ค.",
        "TelePIX์ ๊ดํ ํ์ฌ์ฒด ๋ฐ AI ๋ถ์ ๊ธฐ์ ์ Global standard๋ฅผ ์ํํ๋ ์์ค์ผ๋ก ํ๊ฐ๋ฐ๊ณ ์์ต๋๋ค.",
        "ํ๋ ํฝ์ค๋ ์ฐ์ฃผ์์ ์์งํ ์ ๋ณด๋ฅผ ๋ถ์ํ์ฌ '์ฐ์ฃผ ๊ฒฝ์ (Space Economy)'๋ผ๋ ์๋ก์ด ๊ฐ์น๋ฅผ ์ฐฝ์ถํ๊ณ ์์ต๋๋ค.",
        "ํ๋ ํฝ์ค๋ ์์ฑ ์์ ํ๋๋ถํฐ ๋ถ์, ์๋น์ค ์ ๊ณต๊น์ง ์ ์ฃผ๊ธฐ๋ฅผ ์์ฐ๋ฅด๋ ์๋ฃจ์์ ์ ๊ณตํฉ๋๋ค.",
    ]

    # 3) Build document index (inverted index)
    inverted_index = build_inverted_index(
        model=model,
        tokenizer=tokenizer,
        documents=documents,
        batch_size=8,
        min_weight=0.0,  # Adjust to 1e-6 ~ 1e-4 to filter out very small noise
    )

    # 4) Search and explain token overlap
    splade_token_overlap_inverted(
        model=model,
        tokenizer=tokenizer,
        inverted_index=inverted_index,
        documents=documents,
        queries=queries,
        top_k_docs=2,     # Print only the top 2 documents
        top_k_tokens=5,   # Top 5 contributing tokens for each document
        min_weight=0.0,
    )
```
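
In this sketch the `min_weight` threshold is the main sparsity knob: raising it drops low-weight tokens, which shrinks the inverted index and speeds up lookups at some cost in recall, matching the memory/latency trade-off noted above.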

## License
The PIXIE-Splade-Preview model is licensed under the Apache License 2.0.

## Citation
```bibtex
@software{TelePIX-PIXIE-Splade-Preview,
  title={PIXIE-Splade-Preview},
  author={TelePIX AI Research Team},
  year={2025},
  url={https://huggingface.co/telepix/PIXIE-Splade-Preview}
}
```

## Contact
If you have any suggestions or questions about PIXIE, please reach out to the authors at bmkim@telepix.net.