Feature Extraction
sentence-transformers
Safetensors
English
Korean
modernbert
sentence-similarity
sparse-encoder
sparse
splade
retrieval
multimodal
multi-modal
crossmodal
cross-modal
aerospace
telepix
text-embeddings-inference
Instructions to use telepix/PIXIE-Splade-v1.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use telepix/PIXIE-Splade-v1.5 with sentence-transformers:
from sentence_transformers import SentenceTransformer model = SentenceTransformer("telepix/PIXIE-Splade-v1.5") sentences = [ "The weather is lovely today.", "It's so sunny outside!", "He drove to the stadium." ] embeddings = model.encode(sentences) similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [3, 3] - Notebooks
- Google Colab
- Kaggle
| language: | |
| - en | |
| - ko | |
| tags: | |
| - sentence-transformers | |
| - sentence-similarity | |
| - sparse-encoder | |
| - sparse | |
| - splade | |
| - retrieval | |
| - multimodal | |
| - multi-modal | |
| - crossmodal | |
| - cross-modal | |
| - feature-extraction | |
| - aerospace | |
| - telepix | |
| pipeline_tag: feature-extraction | |
| library_name: sentence-transformers | |
| license: apache-2.0 | |
| <p align="center"> | |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/61d6f4a4d49065ee28a1ee7e/V8n2En7BlMNHoi1YXVv8Q.png" width="400"/> | |
| <p> | |
| # PIXIE-Splade-v1.5 | |
| **PIXIE-Splade-v1.5** is a **bilingual (ko, en)** [SPLADE](https://arxiv.org/abs/2403.06789) retriever, developed by [TelePIX Co., Ltd](https://telepix.net/). | |
| **PIXIE** stands for Tele**PIX** **I**ntelligent **E**mbedding, representing TelePIXβs high-performance embedding technology. | |
| This model is specifically optimized for retrieval tasks in Korean and English, and demonstrates strong performance in aerospace domain. Through extensive fine-tuning and domain-specific evaluation, PIXIE shows robust retrieval quality for real-world use cases such as document understanding, technical QA, and information retrieval in aerospace and related high-precision fields. | |
| PIXIE-Splade-v1.5 outputs sparse lexical vectors that are directly | |
| compatible with inverted indexing (e.g., Lucene/Elasticsearch). | |
| Because each non-zero weight corresponds to a Ko-En subword/token, | |
| interpretability is built-in: you can inspect which tokens drive retrieval. | |
| ## Why SPLADE for Search? | |
| - **Inverted Index Ready**: Directly index weighted tokens in standard IR stacks (Lucene/Elasticsearch). | |
| - **Interpretable by Design**: Top-k contributing tokens per query/document explain *why* a hit matched. | |
| - **Production-Friendly**: Fast candidate generation at web scale; memory/latency tunable via sparsity thresholds. | |
| - **Hybrid-Retrieval Friendly**: Combine with dense retrievers via score fusion. | |
| ## Model Description | |
| - **Model Type:** SPLADE Sparse Encoder | |
| <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) --> | |
| - **Maximum Sequence Length:** 5632 tokens | |
| - **Output Dimensionality:** 50000 dimensions | |
| - **Similarity Function:** Dot Product | |
| - **Language:** Bilingual β Korean and English | |
| - **Domain Specialization:** Aerospace Information Retrieval | |
| - **License:** apache-2.0 | |
| ### Full Model Architecture | |
| ``` | |
| SparseEncoder( | |
| (0): MLMTransformer({'max_seq_length': 5632, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'}) | |
| (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000}) | |
| ) | |
| ``` | |
| ## Quality Benchmarks | |
| **PIXIE-Splade-v1.5** is a bilingual embedding model specialized for Korean and English retrieval tasks. | |
| It delivers consistently strong performance across a diverse set of domain-specific and open-domain benchmarks in both languages, demonstrating its effectiveness in real-world search applications. | |
| The table below presents the retrieval performance of several sparse embedding models evaluated on a variety of Korean and English benchmarks. | |
| We report **Normalized Discounted Cumulative Gain (nDCG@10)** scores, which measure how well a ranked list of documents aligns with ground truth relevance. Higher values indicate better retrieval quality. | |
| All evaluations were conducted using the open-source **[Korean-MTEB-Retrieval-Evaluators](https://github.com/BM-K/Korean-MTEB-Retrieval-Evaluators)** codebase to ensure consistent dataset handling, indexing, retrieval, and nDCG@10 computation across models. | |
| ### Benchmark Overview and Dataset Descriptions | |
| | Model Name | # params | STELLA (ko-en) | STELLA (en-en) | MTEB (ko) | RTEB (en) | | |
| |------|:---:|:---:|:---:|:---:|:---:| | |
| | telepix/PIXIE-Rune-v1.0 (dense baseline) | 0.5B | 0.5972 | 0.7627 | 0.7603 | 0.5439 | | |
| | **telepix/PIXIE-Splade-v1.5** | **0.1B** | **0.4821** | **0.5824** | **0.7671** | **0.4401** | | |
| | telepix/PIXIE-Splade-v1.0 | 0.1B | 0.4148 | 0.6741 | 0.7025 | 0.3893 | | |
| | telepix/PIXIE-Splade-Preview | 0.1B | N/A | N/A | 0.7579 | N/A | | |
| | | | | | | | | |
| | opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 | 0.2B | 0.2618 | 0.7055 | 0.5358 | 0.4376 | | |
| | naver/splade-v3 | 0.1B | N/A | 0.7836 | 0.0685 | 0.4859 | | |
| | BM25 | N/A | N/A | 0.6589 | 0.5071 | N/A | | |
| To better interpret the evaluation results above, we briefly describe the characteristics and evaluation intent of each benchmark suite used in this comparison. | |
| Each benchmark is designed to assess different aspects of retrieval capability, ranging from domain-specific technical understanding to open-domain and multilingual generalization. | |
| #### STELLA | |
| [STELLA](https://arxiv.org/abs/2601.03496) is an aerospace-domain Information Retrieval (IR) benchmark constructed from NASA Technical Reports Server (NTRS) documents. It is designed to evaluate both: | |
| - **Lexical matching** ability (does the retriever benefit from exact technical terms? | TCQ) | |
| - **Semantic matching** ability (can the retriever match concepts even when technical terms are not explicitly used? | TAQ). | |
| STELLA provides **dual-type synthetic queries** and a **cross-lingual extension** for multilingual evaluation while keeping the corpus in English. | |
| #### 6 Datasets of MTEB (Korean) | |
| Descriptions of the benchmark datasets used for evaluation are as follows: | |
| - **Ko-StrategyQA** | |
| A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents. | |
| - **AutoRAGRetrieval** | |
| A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors. | |
| - **MIRACLRetrieval** | |
| A document retrieval benchmark built on Korean Wikipedia articles. | |
| - **PublicHealthQA** | |
| A retrieval dataset focused on medical and public health topics. | |
| - **BelebeleRetrieval** | |
| A dataset for retrieving relevant content from web and news articles in Korean. | |
| - **MultiLongDocRetrieval** | |
| A long-document retrieval benchmark based on Korean Wikipedia and mC4 corpus. | |
| #### RTEB (English) | |
| Retrieval Embedding Benchmark ([RTEB](https://huggingface.co/blog/rteb)), a new benchmark designed to reliably evaluate the retrieval accuracy of embedding models for real-world applications. Existing benchmarks struggle to measure true generalization, while RTEB addresses this with a hybrid strategy of open and private datasets. Its goal is simple: to create a fair, transparent, and application-focused standard for measuring how models perform on data they havenβt seen before. | |
| ## Direct Use (Inverted-Index Retrieval) | |
| ```python | |
| import torch | |
| import numpy as np | |
| from collections import defaultdict | |
| from typing import Dict, List, Tuple | |
| from transformers import AutoTokenizer | |
| from sentence_transformers import SparseEncoder | |
| model_name= 'telepix/PIXIE-Splade-v1.5' | |
| device = "cuda" if torch.cuda.is_available() else "cpu" | |
| def _to_dense_numpy(x) -> np.ndarray: | |
| if hasattr(x, "to_dense"): | |
| return x.to_dense().float().cpu().numpy() | |
| if isinstance(x, torch.Tensor): | |
| return x.float().cpu().numpy() | |
| return np.asarray(x) | |
| def _filter_special_ids(ids: List[int], tokenizer) -> List[int]: | |
| special = set(getattr(tokenizer, "all_special_ids", []) or []) | |
| return [i for i in ids if i not in special] | |
| def build_inverted_index( | |
| model: SparseEncoder, | |
| tokenizer, | |
| documents: List[str], | |
| batch_size: int = 8, | |
| min_weight: float = 0.0, | |
| ) -> Tuple[Dict[int, List[Tuple[int, float]]], List[str]]: | |
| with torch.no_grad(): | |
| doc_emb = model.encode_document(documents, batch_size=batch_size) | |
| doc_dense = _to_dense_numpy(doc_emb) | |
| index: Dict[int, List[Tuple[int, float]]] = defaultdict(list) | |
| for doc_idx, vec in enumerate(doc_dense): | |
| nz = np.flatnonzero(vec > min_weight) | |
| nz = _filter_special_ids(nz.tolist(), tokenizer) | |
| for token_id in nz: | |
| index[token_id].append((doc_idx, float(vec[token_id]))) | |
| return index | |
| def splade_token_overlap_inverted( | |
| model: SparseEncoder, | |
| tokenizer, | |
| inverted_index: Dict[int, List[Tuple[int, float]]], | |
| documents: List[str], | |
| queries: List[str], | |
| top_k_docs: int = 3, | |
| top_k_tokens: int = 5, | |
| min_weight: float = 0.0, | |
| ): | |
| for qi, qtext in enumerate(queries): | |
| with torch.no_grad(): | |
| q_vec = model.encode_query(qtext) | |
| q_vec = _to_dense_numpy(q_vec).ravel() | |
| q_nz = np.flatnonzero(q_vec > min_weight).tolist() | |
| q_nz = _filter_special_ids(q_nz, tokenizer) | |
| scores: Dict[int, float] = defaultdict(float) | |
| per_doc_contrib: Dict[int, Dict[int, Tuple[float, float, float]]] = defaultdict(dict) | |
| for tid in q_nz: | |
| qw = float(q_vec[tid]) | |
| postings = inverted_index.get(tid, []) | |
| for doc_idx, dw in postings: | |
| prod = qw * dw | |
| scores[doc_idx] += prod | |
| per_doc_contrib[doc_idx][tid] = (qw, dw, prod) | |
| ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k_docs] | |
| print("\n" + "="*60) | |
| print(f"[Query {qi + 1}] {qtext}") | |
| print("="*60) | |
| if not ranked: | |
| print("β No matching documents found.") | |
| continue | |
| for rank, (doc_idx, score) in enumerate(ranked, start=1): | |
| doc = documents[doc_idx] | |
| print(f"\nβ Rank {rank} | Score: {score:.4f}") | |
| print(f" Document: \"{doc}\"") | |
| contrib = per_doc_contrib[doc_idx] | |
| if not contrib: | |
| print(" (No overlapping tokens)") | |
| continue | |
| top = sorted(contrib.items(), key=lambda kv: kv[1][2], reverse=True)[:top_k_tokens] | |
| token_ids = [tid for tid, _ in top] | |
| tokens = tokenizer.convert_ids_to_tokens(token_ids) | |
| print(f" [Top {top_k_tokens} Contributing Tokens]") | |
| print(f" {'Token':<20} {'Score (qw*dw)':>15}") | |
| print(f" {'-'*35}") | |
| for (tid, (qw, dw, prod)), tok in zip(top, tokens): | |
| clean_tok = tok.replace("##", "") | |
| print(f" {clean_tok:<20} {prod:15.4f}") | |
| if __name__ == "__main__": | |
| print(f"Loading model: {model_name}...") | |
| model = SparseEncoder(model_name).to(device) | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| documents = [ | |
| "ν λ ν½μ€λ μμ± λ°μ΄ν°λ₯Ό λΆμνμ¬ ν΄μ, λμ λ± λ€μν λΆμΌμ μ루μ μ μ 곡ν©λλ€.", | |
| "κ³ ν΄μλ κ΄ν μμ± μμμ κ΅λ°© λ° μ μ°° λͺ©μ μΌλ‘ μ€μνκ² νμ©λ©λλ€.", | |
| "TelePIX provides advanced solutions by analyzing satellite data for ocean and agriculture.", | |
| "High-resolution optical satellite imagery is critical for defense and reconnaissance.", | |
| "Space economy creates new value through the utilization of space-based data." | |
| ] | |
| # Cross-lingual test queries :) | |
| queries = [ | |
| "ν λ ν½μ€λ μ΄λ€ μ°μ λΆμΌμμ μμ± λ°μ΄ν°λ₯Ό νμ©νλμ?", | |
| "Utilization of satellite imagery for defense", | |
| ] | |
| print("Building inverted index...") | |
| inverted_index = build_inverted_index( | |
| model=model, | |
| tokenizer=tokenizer, | |
| documents=documents, | |
| batch_size=4, | |
| min_weight=0.01, # λ Έμ΄μ¦ μ κ±°λ₯Ό μν΄ μ½κ°μ thresholdλ₯Ό μ€ μ μμ΅λλ€. | |
| ) | |
| splade_token_overlap_inverted( | |
| model=model, | |
| tokenizer=tokenizer, | |
| inverted_index=inverted_index, | |
| documents=documents, | |
| queries=queries, | |
| top_k_docs=2, | |
| top_k_tokens=5 | |
| ) | |
| ``` | |
| ## License | |
| The PIXIE-Splade-v1.5 model is licensed under Apache License 2.0. | |
| ## Citation | |
| ``` | |
| @misc{TelePIX-PIXIE-Splade-v1.5, | |
| title={PIXIE-Splade-v1.5}, | |
| author={TelePIX AI Research Team and Bongmin Kim}, | |
| year={2026}, | |
| url={https://huggingface.co/telepix/PIXIE-Splade-v1.5} | |
| } | |
| ``` | |
| ## Contact | |
| If you have any suggestions or questions about the PIXIE, please reach out to the authors at bmkim@telepix.net. |