BM-K committed on
Commit 8f0abf3 · verified · 1 Parent(s): 6c5516e

Update README.md

Files changed (1):
  1. README.md +235 -362

README.md CHANGED
@@ -4,59 +4,85 @@ tags:
4
  - sparse-encoder
5
  - sparse
6
  - splade
7
- - generated_from_trainer
8
- - dataset_size:1112040
9
- - loss:SpladeLoss
10
- - loss:SparseMultipleNegativesRankingLoss
11
- - loss:FlopsLoss
12
- widget:
- - text: 매크로 (명사). 복잡한 입력을 컴퓨터 프로그램에 대해 비교적 인간 친화적으로 줄인 표현. 전처리기는 컴파일되기 전에 모든 내장된 매크로를
-   소스 코드로 확장한다.
- - text: "브레네 호수 \n브레네 호수는 스위스 보주주 조 계곡에 위치한 호수입니다. 이 호수는 조 호수의 북쪽에 있으며, 단 200미터 떨어져 있습니다. 해발 1002미터로 조 호수보다 2미터 낮습니다."
- - text: 그 앨범 "Making Lite of Myself"를 만든 코미디언의 국적은 무엇인가요?
- - text: 비어 있음의 의미는 무엇인가요?
- - text: '파트라데비(콘카니어: 포트라데오)는 고아의 페르넴 탈루크에 위치한 마을로, 고아와 마하라슈트라 경계에 있습니다. 이 마을에는 파트라데비 검문소가 위치해 있습니다.'
21
  pipeline_tag: feature-extraction
22
  library_name: sentence-transformers
 
23
  ---
24
-
25
- # SPLADE Sparse Encoder
26
-
27
- This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model trained on the json dataset using the [sentence-transformers](https://www.SBERT.net) library. It maps sentences & paragraphs to a 50000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
28
- ## Model Details
29
-
30
- ### Model Description
31
  - **Model Type:** SPLADE Sparse Encoder
32
  <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
33
- - **Maximum Sequence Length:** 3072 tokens
34
  - **Output Dimensionality:** 50000 dimensions
35
  - **Similarity Function:** Dot Product
36
- - **Training Dataset:**
37
- - json
38
- <!-- - **Language:** Unknown -->
39
- <!-- - **License:** Unknown -->
40
-
41
- ### Model Sources
42
-
43
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
44
- - **Documentation:** [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
45
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
46
- - **Hugging Face:** [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)
47
 
48
  ### Full Model Architecture
49
 
50
  ```
51
  SparseEncoder(
52
- (0): MLMTransformer({'max_seq_length': 3072, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
53
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
54
  )
55
  ```
56
 
57
- ## Usage
58
-
59
- ### Direct Usage (Sentence Transformers)
60
 
61
  First install the Sentence Transformers library:
62
 
@@ -66,341 +92,188 @@ pip install -U sentence-transformers
66
 
67
  Then you can load this model and run inference.
68
  ```python

  from sentence_transformers import SparseEncoder

- # Download from the 🤗 Hub
- model = SparseEncoder("sparse_encoder_model_id")
- # Run inference
- sentences = [
-     '파트라데비는 고아의 페르넴 타룩에 위치한 마을로, 고아는 어느 나라에 있는 주인가요?',
-     '파트라데비(콘카니어: 포트라데오)는 고아의 페르넴 탈루크에 위치한 마을로, 고아와 마하라슈트라 경계에 있습니다. 이 마을에는 파트라데비 검문소가 위치해 있습니다.',
-     '콘디바데 A.m 콘디바데 A.m은 인도의 한 마을입니다. 이 마을은 마하라슈트라 주의 푸네 지구 마왈 탈루카에 위치해 있습니다.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 50000]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities)
- # tensor([[25.1626, 27.0573,  7.1256],
- #         [27.0573, 84.2966, 31.7376],
- #         [ 7.1256, 31.7376, 74.3025]])
- ```
90
 
91
- <!--
92
- ### Direct Usage (Transformers)
93
-
94
- <details><summary>Click to see the direct usage in Transformers</summary>
95
-
96
- </details>
97
- -->
98
-
99
- <!--
100
- ### Downstream Usage (Sentence Transformers)
101
-
102
- You can finetune this model on your own dataset.
103
-
104
- <details><summary>Click to expand</summary>
105
-
106
- </details>
107
- -->
108
-
109
- <!--
110
- ### Out-of-Scope Use
111
-
112
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
113
- -->
114
-
115
- <!--
116
- ## Bias, Risks and Limitations
117
-
118
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
119
- -->
120
-
121
- <!--
122
- ### Recommendations
123
-
124
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
125
- -->
126
-
127
- ## Training Details
128
-
129
- ### Training Dataset
130
-
131
- #### json
132
-
133
- * Dataset: json
134
- * Size: 1,112,040 training samples
135
- * Columns: <code>anchor</code>, <code>positive</code>, <code>negative_1</code>, <code>negative_2</code>, and <code>negative_3</code>
136
- * Approximate statistics based on the first 1000 samples:
137
- | | anchor | positive | negative_1 | negative_2 | negative_3 |
138
- |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
139
- | type | string | string | string | string | string |
140
- | details | <ul><li>min: 3 tokens</li><li>mean: 18.8 tokens</li><li>max: 126 tokens</li></ul> | <ul><li>min: 15 tokens</li><li>mean: 50.36 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 47.98 tokens</li><li>max: 73 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 47.69 tokens</li><li>max: 79 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 47.96 tokens</li><li>max: 78 tokens</li></ul> |
141
- * Samples:
142
- | anchor | positive | negative_1 | negative_2 | negative_3 |
143
- |:-------------------------------------------------------------------|:-------------------------------------------------------|:--------------------------------------------------------------|:---------------------------------------------------------------|:----------------------------------------------------------------|
144
- | <code>난촨구와 둥촨구는 어느 나라에 위치해 있습니까?</code> | <code>난촨구(南川区)는 중국 충칭의 구이자 이전의 현이다.</code> | <code>남풍현(南丰县)은 중국 장시성(江西省) 푸저우(福州)에 위치한 군이다.</code> | <code>도교, 광둥 도교(道滘)는 중국 남부 광둥성 동관 시의 관할 하에 있는 도시입니다.</code> | <code>동포구 동포구는 중국 쓰촨성의 구역입니다. 이곳은 메이산시의 관할 하에 있습니다.</code> |
145
- | <code>가짜대나무(Pseudosasa)와 별꽃(Cerastium)은 모두 자생 식물과 관련이 있습니까?</code> | <code>가짜사사(Pseudosasa)는 풀과에 속하는 동아시아 대나무의 속입니다.</code> | <code>세팔로소루스(Cephalosorus)는 데이지 과에 속하는 꽃이 피는 식물의 속입니다.</code> | <code>가짜기생충속(Pseudoparasitus)은 라엘라피다에 속하는 진드기의 속입니다.</code> | <code>페리타사(Peritassa)는 노박덩굴과(Celastraceae) 식물의 속입니다.</code> |
146
- | <code>그저우와 헤이룽장성 동닝은 어떤 나라와 접경하고 있습니까?</code> | <code>허주(贺州)는 중화인민공화국 광시 좡족 자치구 북동부에 위치한 지급시이다.</code> | <code>지관구(지관구)는 중국 인민공화국 헤이룽장성 지시시의 구이자 시청 소재지입니다.</code> | <code>헤동 가도(河东街道)는 중국 광시(广西) 리우저우(柳州) 청중 구(城中区)의 가도입니다.</code> | <code>화닝현 (华宁县; 병음: Huáníng Xiàn)은 중국 윈난성 유시시에 위치해 있습니다.</code> |
147
- * Loss: [<code>SpladeLoss</code>](https://sbert.net/docs/package_reference/sparse_encoder/losses.html#spladeloss) with these parameters:
148
- ```json
149
- {
150
- "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
151
- "document_regularizer_weight": 3e-05,
152
- "query_regularizer_weight": 5e-05
153
- }
154
- ```
155
-
156
- ### Training Hyperparameters
157
- #### Non-Default Hyperparameters
158
-
159
- - `per_device_train_batch_size`: 6
160
- - `gradient_accumulation_steps`: 4
161
- - `learning_rate`: 2e-06
162
- - `warmup_ratio`: 0.1
163
- - `bf16`: True
164
- - `ddp_find_unused_parameters`: True
165
- - `ddp_timeout`: 7200
166
- - `batch_sampler`: no_duplicates
167
-
168
- #### All Hyperparameters
169
- <details><summary>Click to expand</summary>
170
-
171
- - `overwrite_output_dir`: False
172
- - `do_predict`: False
173
- - `eval_strategy`: no
174
- - `prediction_loss_only`: True
175
- - `per_device_train_batch_size`: 6
176
- - `per_device_eval_batch_size`: 8
177
- - `per_gpu_train_batch_size`: None
178
- - `per_gpu_eval_batch_size`: None
179
- - `gradient_accumulation_steps`: 4
180
- - `eval_accumulation_steps`: None
181
- - `torch_empty_cache_steps`: None
182
- - `learning_rate`: 2e-06
183
- - `weight_decay`: 0.0
184
- - `adam_beta1`: 0.9
185
- - `adam_beta2`: 0.999
186
- - `adam_epsilon`: 1e-08
187
- - `max_grad_norm`: 1.0
188
- - `num_train_epochs`: 3
189
- - `max_steps`: -1
190
- - `lr_scheduler_type`: linear
191
- - `lr_scheduler_kwargs`: {}
192
- - `warmup_ratio`: 0.1
193
- - `warmup_steps`: 0
194
- - `log_level`: passive
195
- - `log_level_replica`: warning
196
- - `log_on_each_node`: True
197
- - `logging_nan_inf_filter`: True
198
- - `save_safetensors`: True
199
- - `save_on_each_node`: False
200
- - `save_only_model`: False
201
- - `restore_callback_states_from_checkpoint`: False
202
- - `no_cuda`: False
203
- - `use_cpu`: False
204
- - `use_mps_device`: False
205
- - `seed`: 42
206
- - `data_seed`: None
207
- - `jit_mode_eval`: False
208
- - `use_ipex`: False
209
- - `bf16`: True
210
- - `fp16`: False
211
- - `fp16_opt_level`: O1
212
- - `half_precision_backend`: auto
213
- - `bf16_full_eval`: False
214
- - `fp16_full_eval`: False
215
- - `tf32`: None
216
- - `local_rank`: 2
217
- - `ddp_backend`: None
218
- - `tpu_num_cores`: None
219
- - `tpu_metrics_debug`: False
220
- - `debug`: []
221
- - `dataloader_drop_last`: True
222
- - `dataloader_num_workers`: 0
223
- - `dataloader_prefetch_factor`: None
224
- - `past_index`: -1
225
- - `disable_tqdm`: False
226
- - `remove_unused_columns`: True
227
- - `label_names`: None
228
- - `load_best_model_at_end`: False
229
- - `ignore_data_skip`: False
230
- - `fsdp`: []
231
- - `fsdp_min_num_params`: 0
232
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
233
- - `tp_size`: 0
234
- - `fsdp_transformer_layer_cls_to_wrap`: None
235
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
236
- - `deepspeed`: None
237
- - `label_smoothing_factor`: 0.0
238
- - `optim`: adamw_torch
239
- - `optim_args`: None
240
- - `adafactor`: False
241
- - `group_by_length`: False
242
- - `length_column_name`: length
243
- - `ddp_find_unused_parameters`: True
244
- - `ddp_bucket_cap_mb`: None
245
- - `ddp_broadcast_buffers`: False
246
- - `dataloader_pin_memory`: True
247
- - `dataloader_persistent_workers`: False
248
- - `skip_memory_metrics`: True
249
- - `use_legacy_prediction_loop`: False
250
- - `push_to_hub`: False
251
- - `resume_from_checkpoint`: None
252
- - `hub_model_id`: None
253
- - `hub_strategy`: every_save
254
- - `hub_private_repo`: None
255
- - `hub_always_push`: False
256
- - `gradient_checkpointing`: False
257
- - `gradient_checkpointing_kwargs`: None
258
- - `include_inputs_for_metrics`: False
259
- - `include_for_metrics`: []
260
- - `eval_do_concat_batches`: True
261
- - `fp16_backend`: auto
262
- - `push_to_hub_model_id`: None
263
- - `push_to_hub_organization`: None
264
- - `mp_parameters`:
265
- - `auto_find_batch_size`: False
266
- - `full_determinism`: False
267
- - `torchdynamo`: None
268
- - `ray_scope`: last
269
- - `ddp_timeout`: 7200
270
- - `torch_compile`: False
271
- - `torch_compile_backend`: None
272
- - `torch_compile_mode`: None
273
- - `include_tokens_per_second`: False
274
- - `include_num_input_tokens_seen`: False
275
- - `neftune_noise_alpha`: None
276
- - `optim_target_modules`: None
277
- - `batch_eval_metrics`: False
278
- - `eval_on_start`: False
279
- - `use_liger_kernel`: False
280
- - `eval_use_gather_object`: False
281
- - `average_tokens_across_devices`: False
282
- - `prompts`: None
283
- - `batch_sampler`: no_duplicates
284
- - `multi_dataset_batch_sampler`: proportional
285
- - `router_mapping`: {}
286
- - `learning_rate_mapping`: {}
287
-
288
- </details>
289
-
290
- ### Training Logs
291
- | Epoch | Step | Training Loss |
292
- |:------:|:-----:|:-------------:|
293
- | 0.0863 | 1000 | 4.8919 |
294
- | 0.1727 | 2000 | 3.4433 |
295
- | 0.2590 | 3000 | 3.1294 |
296
- | 0.3453 | 4000 | 2.9256 |
297
- | 0.4316 | 5000 | 2.8705 |
298
- | 0.5180 | 6000 | 2.2949 |
299
- | 0.6043 | 7000 | 1.451 |
300
- | 0.6906 | 8000 | 1.1573 |
301
- | 0.7770 | 9000 | 1.0298 |
302
- | 0.8633 | 10000 | 1.1008 |
303
- | 0.9496 | 11000 | 1.3943 |
304
- | 1.0360 | 12000 | 2.1922 |
305
- | 1.1223 | 13000 | 2.6991 |
306
- | 1.2087 | 14000 | 2.4977 |
307
- | 1.2950 | 15000 | 2.448 |
308
- | 1.3813 | 16000 | 2.4044 |
309
- | 1.4676 | 17000 | 2.3224 |
310
- | 1.5540 | 18000 | 1.4636 |
311
- | 1.6403 | 19000 | 1.0056 |
312
- | 1.7266 | 20000 | 0.8397 |
313
- | 1.8129 | 21000 | 0.8211 |
314
- | 1.8993 | 22000 | 0.9905 |
315
- | 1.9856 | 23000 | 1.3015 |
316
- | 2.0720 | 24000 | 2.3987 |
317
- | 2.1583 | 25000 | 2.3067 |
318
- | 2.2447 | 26000 | 2.2579 |
319
- | 2.3310 | 27000 | 2.2134 |
320
- | 2.4173 | 28000 | 2.2357 |
321
- | 2.5036 | 29000 | 1.867 |
322
- | 2.5900 | 30000 | 1.0632 |
323
- | 2.6763 | 31000 | 0.8168 |
324
- | 2.7626 | 32000 | 0.7357 |
325
- | 2.8489 | 33000 | 0.7851 |
326
- | 2.9353 | 34000 | 1.0681 |
327
-
328
-
329
- ### Framework Versions
330
- - Python: 3.11.12
331
- - Sentence Transformers: 5.0.0
332
- - Transformers: 4.51.3
333
- - PyTorch: 2.7.0+cu128
334
- - Accelerate: 1.5.2
335
- - Datasets: 2.21.0
336
- - Tokenizers: 0.21.1
337
 
338
- ## Citation
339
 
340
- ### BibTeX
341
-
342
- #### Sentence Transformers
343
- ```bibtex
344
- @inproceedings{reimers-2019-sentence-bert,
345
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
346
- author = "Reimers, Nils and Gurevych, Iryna",
347
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
348
- month = "11",
349
- year = "2019",
350
- publisher = "Association for Computational Linguistics",
351
- url = "https://arxiv.org/abs/1908.10084",
352
- }
353
  ```
354
 
355
- #### SpladeLoss
356
- ```bibtex
357
- @misc{formal2022distillationhardnegativesampling,
358
- title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
359
- author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
360
- year={2022},
361
- eprint={2205.04733},
362
- archivePrefix={arXiv},
363
- primaryClass={cs.IR},
364
- url={https://arxiv.org/abs/2205.04733},
365
- }
366
- ```
367
 
368
- #### SparseMultipleNegativesRankingLoss
369
- ```bibtex
370
- @misc{henderson2017efficient,
371
- title={Efficient Natural Language Response Suggestion for Smart Reply},
372
- author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
373
- year={2017},
374
- eprint={1705.00652},
375
- archivePrefix={arXiv},
376
- primaryClass={cs.CL}
377
- }
378
  ```
379
-
380
- #### FlopsLoss
381
- ```bibtex
382
- @article{paria2020minimizing,
383
- title={Minimizing flops to learn efficient sparse representations},
384
- author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
385
- journal={arXiv preprint arXiv:2004.05665},
386
- year={2020}
387
  }
388
  ```
389
 
390
- <!--
391
- ## Glossary
392
-
393
- *Clearly define terms in order to be accessible across audiences.*
394
- -->
395
-
396
- <!--
397
- ## Model Card Authors
398
-
399
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
400
- -->
401
-
402
- <!--
403
- ## Model Card Contact
404
 
405
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
406
- -->
 
4
  - sparse-encoder
5
  - sparse
6
  - splade
7
  pipeline_tag: feature-extraction
8
  library_name: sentence-transformers
9
+ license: apache-2.0
10
  ---
11
+ <p align="center">
12
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/61d6f4a4d49065ee28a1ee7e/V8n2En7BlMNHoi1YXVv8Q.png" width="400"/>
13
+ </p>
14
+
15
+ # PIXIE-Splade-Preview
16
+ **PIXIE-Splade-Preview** is a Korean-only [SPLADE](https://arxiv.org/abs/2403.06789) (Sparse Lexical and Expansion) retriever, developed by [TelePIX Co., Ltd](https://telepix.net/).
17
+ **PIXIE** stands for Tele**PIX** **I**ntelligent **E**mbedding, representing TelePIX's high-performance embedding technology.
18
+ This model is trained exclusively on Korean data and outputs sparse lexical vectors that are directly
19
+ compatible with inverted indexing (e.g., Lucene/Elasticsearch).
20
+ Because each non-zero weight corresponds to a Korean subword/token,
21
+ interpretability is built-in: you can inspect which tokens drive retrieval.
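As a toy illustration of that interpretability (made-up token weights, not actual model output), a sparse embedding can be read as a token-to-weight map, and the tokens active on both the query and document side directly explain the dot-product score:

```python
# Hypothetical SPLADE-style sparse embeddings as token -> weight maps.
# (Illustrative weights only; real values come from the model.)
query_vec = {"위성": 1.8, "데이터": 1.2, "활용": 0.9, "산업": 0.4}
doc_vec = {"위성": 1.5, "데이터": 1.1, "분석": 1.3, "서비스": 0.7}

# Dot-product similarity only involves tokens active on both sides,
# so the overlapping tokens directly explain the score.
overlap = {t: query_vec[t] * doc_vec[t] for t in query_vec.keys() & doc_vec.keys()}
score = sum(overlap.values())

top = sorted(overlap.items(), key=lambda kv: kv[1], reverse=True)
print(round(score, 2))  # 4.02
print(top[0][0])        # "위성" contributes the most
```

Reading off the largest products in `overlap` is exactly the kind of explanation the usage example below prints for real queries.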
22
+
23
+ ## Why SPLADE for Korean Search?
24
+ - **Inverted Index Ready**: Directly index weighted tokens in standard IR stacks (Lucene/Elasticsearch).
25
+ - **Interpretable by Design**: Top-k contributing tokens per query/document explain *why* a hit matched.
26
+ - **Production-Friendly**: Fast candidate generation at web scale; memory/latency tunable via sparsity thresholds.
27
+ - **Hybrid-Retrieval Friendly**: Combine with dense retrievers via score fusion.
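The score-fusion point can be sketched as a weighted min-max fusion of per-document scores (illustrative numbers and a hypothetical `alpha` weighting, not a prescribed recipe):

```python
def min_max(scores):
    """Normalize a retriever's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    rng = (hi - lo) or 1.0
    return {d: (s - lo) / rng for d, s in scores.items()}

# Illustrative raw scores from a sparse and a dense retriever.
sparse_scores = {"doc1": 12.4, "doc2": 8.1, "doc3": 3.0}
dense_scores = {"doc1": 0.62, "doc2": 0.71, "doc3": 0.40}

alpha = 0.5  # hypothetical weight on the sparse side
ns, nd = min_max(sparse_scores), min_max(dense_scores)
fused = {d: alpha * ns[d] + (1 - alpha) * nd[d] for d in ns}

ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # doc1 ranks first: strong in sparse, decent in dense
```

Normalizing before mixing matters because raw sparse dot products and dense cosine scores live on very different scales.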
28
+
29
+ ## Model Description
30
  - **Model Type:** SPLADE Sparse Encoder
31
  <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
32
+ - **Maximum Sequence Length:** 8192 tokens
33
  - **Output Dimensionality:** 50000 dimensions
34
  - **Similarity Function:** Dot Product
35
+ - **Language:** Korean
36
+ - **License:** apache-2.0
37
 
38
  ### Full Model Architecture
39
 
40
  ```
41
  SparseEncoder(
42
+ (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
43
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
44
  )
45
  ```
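As a reference for what the SpladePooling stage computes, the standard SPLADE formulation applies log(1 + ReLU(·)) to the MLM logits and max-pools over sequence positions. A minimal numpy sketch of that formulation (illustrative, not the library's internal implementation):

```python
import numpy as np

def splade_pool(logits: np.ndarray) -> np.ndarray:
    """logits: (seq_len, vocab_size) MLM logits for a single text."""
    activations = np.log1p(np.maximum(logits, 0.0))  # log(1 + relu(x))
    return activations.max(axis=0)                   # max over sequence positions

# Two token positions, a three-word toy vocabulary.
logits = np.array([[1.0, -2.0, 0.5],
                   [0.2,  3.0, -1.0]])
vec = splade_pool(logits)
print(vec.round(4))  # negative logits contribute nothing, keeping the vector sparse
```

The log saturation damps very large logits, which keeps a handful of tokens from dominating every score.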
46
 
47
+ ## Quality Benchmarks
48
+ **PIXIE-Splade-Preview** delivers consistently strong performance across a diverse set of domain-specific and open-domain benchmarks in Korean, demonstrating its effectiveness in real-world search applications.
49
+ The table below presents the retrieval performance of several embedding models evaluated on a range of Korean MTEB benchmarks.
50
+ We report Normalized Discounted Cumulative Gain (NDCG) scores, which measure how well a ranked list of documents aligns with ground truth relevance. Higher values indicate better retrieval quality.
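For reference, NDCG@k for a single query can be computed as follows (the textbook definition, with the ideal ranking taken over the retrieved list; the MTEB harness may differ in details):

```python
import math

def ndcg_at_k(ranked_rels, k):
    """ranked_rels: graded relevance of the retrieved docs, in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0 -- already in ideal order
print(ndcg_at_k([3, 2, 0, 1], k=4))  # < 1.0 -- a relevant doc ranked too low
```

The log2 discount is why mistakes near the top of the ranking cost far more than mistakes further down.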
51
+
52
+ ### 7 Datasets of MTEB (Korean)
53
+ Our model, **telepix/PIXIE-Splade-Preview**, achieves competitive retrieval quality with only 0.1B parameters,
+ demonstrating solid generalization across domains such as multi-hop QA, long-document retrieval, public health, and e-commerce.
55
+
56
+ | Model Name | # params | Avg. NDCG | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
57
+ |------|:---:|:---:|:---:|:---:|:---:|:---:|
58
+ | telepix/PIXIE-Rune-Preview | 0.5B | 0.6905 | 0.6461 | 0.6859 | 0.7063 | 0.7238 |
59
+ | telepix/PIXIE-Splade-Preview | 0.1B | **0.6677** | **0.6238** | **0.6628** | **0.6831** | **0.7009** |
60
+ | | | | | | | |
61
+ | nlpai-lab/KURE-v1 | 0.5B | 0.6751 | 0.6277 | 0.6725 | 0.6907 | 0.7095 |
62
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.5B | 0.6592 | 0.6118 | 0.6542 | 0.6759 | 0.6949 |
63
+ | BAAI/bge-m3 | 0.5B | 0.6573 | 0.6099 | 0.6533 | 0.6732 | 0.6930 |
64
+ | Qwen/Qwen3-Embedding-0.6B | 0.6B | 0.6321 | 0.5894 | 0.6274 | 0.6455 | 0.6662 |
65
+ | jinaai/jina-embeddings-v3 | 0.6B | 0.6293 | 0.5800 | 0.6254 | 0.6456 | 0.6665 |
66
+ | Alibaba-NLP/gte-multilingual-base | 0.3B | 0.6111 | 0.5542 | 0.6089 | 0.6302 | 0.6511 |
67
+ | openai/text-embedding-3-large | N/A | 0.6015 | 0.5466 | 0.5999 | 0.6187 | 0.6409 |
68
+
69
+ Descriptions of the benchmark datasets used for evaluation are as follows:
70
+ - **Ko-StrategyQA**
71
+ A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents.
72
+ - **AutoRAGRetrieval**
73
+ A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors.
74
+ - **MIRACLRetrieval**
75
+ A document retrieval benchmark built on Korean Wikipedia articles.
76
+ - **PublicHealthQA**
77
+ A retrieval dataset focused on medical and public health topics.
78
+ - **BelebeleRetrieval**
79
+ A dataset for retrieving relevant content from web and news articles in Korean.
80
+ - **MultiLongDocRetrieval**
81
+ A long-document retrieval benchmark based on Korean Wikipedia and mC4 corpus.
82
+ - **XPQARetrieval**
83
+ A real-world dataset constructed from user queries and relevant product documents in a Korean e-commerce platform.
84
+
85
+ ## Direct Usage (Inverted index retrieval)
86
 
87
  First install the Sentence Transformers library:
88
  ```
  pip install -U sentence-transformers
  ```
92
 
93
  Then you can load this model and run inference.
94
  ```python
+ import torch
+ import numpy as np
+ from collections import defaultdict
+ from typing import Dict, List, Tuple
+ from transformers import AutoTokenizer
  from sentence_transformers import SparseEncoder
+
+ MODEL_NAME = "telepix/PIXIE-Splade-Preview"
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+ def _to_dense_numpy(x) -> np.ndarray:
+     """Safely convert a tensor returned by SparseEncoder to a dense numpy array."""
+     if hasattr(x, "to_dense"):
+         return x.to_dense().float().cpu().numpy()
+     # Already a numpy array or a dense tensor
+     if isinstance(x, torch.Tensor):
+         return x.float().cpu().numpy()
+     return np.asarray(x)
+
+ def _filter_special_ids(ids: List[int], tokenizer) -> List[int]:
+     """Filter special token IDs out of a list of token IDs."""
+     special = set(getattr(tokenizer, "all_special_ids", []) or [])
+     return [i for i in ids if i not in special]
+
+ def build_inverted_index(
+     model: SparseEncoder,
+     tokenizer,
+     documents: List[str],
+     batch_size: int = 8,
+     min_weight: float = 0.0,
+ ) -> Dict[int, List[Tuple[int, float]]]:
+     """
+     Encode the documents and construct an inverted index that maps
+     token_id to a postings list of (doc_idx, weight) tuples:
+     index[token_id] = [(doc_idx, weight), ...]
+     """
+     with torch.no_grad():
+         doc_emb = model.encode_document(documents, batch_size=batch_size)
+     doc_dense = _to_dense_numpy(doc_emb)
+
+     index: Dict[int, List[Tuple[int, float]]] = defaultdict(list)
+
+     for doc_idx, vec in enumerate(doc_dense):
+         # Keep only active tokens (those with weight above the threshold)
+         nz = np.flatnonzero(vec > min_weight)
+         # Optionally drop special tokens
+         nz = _filter_special_ids(nz.tolist(), tokenizer)
+
+         for token_id in nz:
+             index[token_id].append((doc_idx, float(vec[token_id])))
+
+     return index
+
+ # -------------------------
+ # Search + Token Overlap Explanation
+ # -------------------------
+ def splade_token_overlap_inverted(
+     model: SparseEncoder,
+     tokenizer,
+     inverted_index: Dict[int, List[Tuple[int, float]]],
+     documents: List[str],
+     queries: List[str],
+     top_k_docs: int = 3,
+     top_k_tokens: int = 10,
+     min_weight: float = 0.0,
+ ):
+     """
+     Score documents against each query via the inverted index and, for each
+     top-ranked document, show the contribution (qw * dw) of the top_k_tokens
+     overlapping tokens.
+     """
+     for qi, qtext in enumerate(queries):
+         with torch.no_grad():
+             q_vec = model.encode_query(qtext)
+         q_vec = _to_dense_numpy(q_vec).ravel()
+
+         # Active query tokens
+         q_nz = np.flatnonzero(q_vec > min_weight).tolist()
+         q_nz = _filter_special_ids(q_nz, tokenizer)
+
+         scores: Dict[int, float] = defaultdict(float)
+         # Per-document token contributions: token_id -> (qw, dw, qw * dw)
+         per_doc_contrib: Dict[int, Dict[int, Tuple[float, float, float]]] = defaultdict(dict)
+
+         for tid in q_nz:
+             qw = float(q_vec[tid])
+             for doc_idx, dw in inverted_index.get(tid, []):
+                 prod = qw * dw
+                 scores[doc_idx] += prod
+                 per_doc_contrib[doc_idx][tid] = (qw, dw, prod)
+
+         ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k_docs]
+
+         print("\n============================")
+         print(f"[Query {qi}] {qtext}")
+         print("============================")
+
+         if not ranked:
+             print("→ 일치 토큰이 없어 문서 스코어가 생성되지 않았습니다.")
+             continue
+
+         for rank, (doc_idx, score) in enumerate(ranked, start=1):
+             doc = documents[doc_idx]
+             print(f"\n→ Rank {rank} | Document {doc_idx}: {doc}")
+             print(f"   [Similarity Score ({score:.6f})]")
+
+             contrib = per_doc_contrib[doc_idx]
+             if not contrib:
+                 print("(겹치는 토큰이 없습니다.)")
+                 continue
+
+             # Extract the top K contributing tokens
+             top = sorted(contrib.items(), key=lambda kv: kv[1][2], reverse=True)[:top_k_tokens]
+             token_ids = [tid for tid, _ in top]
+             tokens = tokenizer.convert_ids_to_tokens(token_ids)
+
+             print("   [Top Contributing Tokens]")
+             for (tid, (qw, dw, prod)), tok in zip(top, tokens):
+                 print(f"   {tok:20} {prod:.6f}")
+
+ if __name__ == "__main__":
+     # 1) Load model and tokenizer
+     model = SparseEncoder(MODEL_NAME).to(DEVICE)
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+     # 2) Example data
+     queries = [
+         "텔레픽스는 어떤 산업 분야에서 위성 데이터를 활용하나요?",
+         "국방 분야에 어떤 위성 서비스가 제공되나요?",
+         "텔레픽스의 기술 수준은 어느 정도인가요?",
+     ]
+     documents = [
+         "텔레픽스는 해양, 자원, 농업 등 다양한 분야에서 위성 데이터를 분석하여 서비스를 제공합니다.",
+         "정찰 및 감시 목적의 위성 영상을 통해 국방 관련 정밀 분석 서비스를 제공합니다.",
+         "TelePIX의 광학 탑재체 및 AI 분석 기술은 Global standard를 상회하는 수준으로 평가받고 있습니다.",
+         "텔레픽스는 우주에서 수집한 정보를 분석하여 '우주 경제(Space Economy)'라는 새로운 가치를 창출하고 있습니다.",
+         "텔레픽스는 위성 영상 획득부터 분석, 서비스 제공까지 전 주기를 아우르는 솔루션을 제공합니다.",
+     ]
+
+     # 3) Build the document-side inverted index
+     inverted_index = build_inverted_index(
+         model=model,
+         tokenizer=tokenizer,
+         documents=documents,
+         batch_size=8,
+         min_weight=0.0,  # Raise to ~1e-6 - 1e-4 to filter out very small noise
+     )
+
+     # 4) Search and explain token overlap
+     splade_token_overlap_inverted(
+         model=model,
+         tokenizer=tokenizer,
+         inverted_index=inverted_index,
+         documents=documents,
+         queries=queries,
+         top_k_docs=2,    # Print only the top 2 documents
+         top_k_tokens=5,  # Top 5 contributing tokens for each document
+         min_weight=0.0,
+     )
  ```
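The `min_weight` threshold used above trades index size for score fidelity. A quick sanity check with toy vectors (synthetic data, not model output) shows that pruning tiny weights shrinks the postings lists while leaving dot-product scores essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50_000

# Toy sparse document vector: a few strong weights plus tiny noise.
doc = np.zeros(dim)
doc[rng.choice(dim, 20, replace=False)] = rng.uniform(0.5, 2.0, 20)
noise_ids = rng.choice(dim, 200, replace=False)
doc[noise_ids] += rng.uniform(0.0, 1e-4, 200)

# Toy query vector.
query = np.zeros(dim)
query[rng.choice(dim, 10, replace=False)] = rng.uniform(0.5, 2.0, 10)

# Prune postings below the threshold, as min_weight would.
pruned = np.where(doc > 1e-4, doc, 0.0)
print(np.count_nonzero(doc), np.count_nonzero(pruned))   # many -> few postings
print(abs(float(doc @ query) - float(pruned @ query)))   # negligible score change
```

In a real deployment the threshold is worth tuning on held-out queries, since it directly controls index memory and candidate-generation latency.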
263
 
264
+ ## License
265
+ The PIXIE-Splade-Preview model is licensed under Apache License 2.0.
266
 
267
+ ## Citation
268
  ```
269
+ @software{TelePIX-PIXIE-Splade-Preview,
270
+ title={PIXIE-Splade-Preview},
271
+ author={TelePIX AI Research Team},
272
+ year={2025},
273
+ url={https://huggingface.co/telepix/PIXIE-Splade-Preview}
274
  }
275
  ```
276
 
277
+ ## Contact
278
 
279
+ If you have any suggestions or questions about PIXIE, please reach out to the authors at bmkim@telepix.net.