---
tags:
- sentence-transformers
- sentence-similarity
- dense-encoder
- dense
- feature-extraction
- retrieval
- multimodal
- multi-modal
- crossmodal
- cross-modal
- aerospace
- telepix
language:
- af
- ar
- az
- be
- bg
- bn
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- ht
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ky
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- pa
- pl
- pt
- qu
- ro
- ru
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- yo
- zh
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: apache-2.0
---
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61d6f4a4d49065ee28a1ee7e/V8n2En7BlMNHoi1YXVv8Q.png" width="400"/>
</p>

# PIXIE-Rune-v1.0
**PIXIE-Rune-v1.0** is an encoder-based embedding model developed by [TelePIX Co., Ltd](https://telepix.net/) and trained on Korean and English information retrieval datasets.
**PIXIE** stands for Tele**PIX** **I**ntelligent **E**mbedding, representing TelePIX's high-performance embedding technology.
This model is optimized for semantic retrieval in Korean and English and performs strongly in the aerospace domain. Through fine-tuning and domain-specific evaluation, PIXIE shows robust retrieval quality for real-world use cases such as document understanding, technical QA, and semantic search in aerospace and related high-precision fields.
It also performs competitively across a wide range of open-domain Korean and English retrieval benchmarks, making it a versatile foundation for multilingual semantic search systems.

## Model Description
- **Model Type:** Sentence Transformer
<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- **Maximum Sequence Length:** 6144 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Multilingual, optimized for high performance in Korean and English
- **Domain Specialization:** Aerospace Information Retrieval
- **License:** apache-2.0

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 6144, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
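
The pooling and normalization stages listed above can be illustrated with a minimal NumPy sketch. It assumes the transformer output is available as an array of shape `(batch, seq_len, hidden)`; the function name `cls_pool_and_normalize` and the random toy input are hypothetical and only stand in for real model outputs.

```python
import numpy as np


def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Keep the first-token ([CLS]) vector of each sequence and L2-normalize it.

    token_embeddings: array of shape (batch, seq_len, hidden).
    Returns an array of shape (batch, hidden) whose rows have unit length.
    """
    # Corresponds to pooling_mode_cls_token=True: use only the first token.
    cls_vectors = token_embeddings[:, 0, :]
    # Corresponds to Normalize(): scale each vector to unit L2 norm.
    norms = np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    return cls_vectors / np.clip(norms, 1e-12, None)


# Toy input standing in for the transformer output (hypothetical values).
rng = np.random.default_rng(0)
toy_tokens = rng.normal(size=(2, 5, 1024))

embeddings = cls_pool_and_normalize(toy_tokens)
# For unit-length vectors, the dot product equals the cosine similarity.
cosine = embeddings @ embeddings.T
print(embeddings.shape, cosine.shape)
```

Because the embeddings are normalized, a plain dot product is enough to rank documents by cosine similarity.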

## Quality Benchmarks
**PIXIE-Rune-v1.0** is a multilingual embedding model specialized for Korean and English retrieval tasks.
It performs consistently well across domain-specific and open-domain benchmarks in both languages, which supports its use in real-world semantic search applications.
The table below presents the retrieval performance of several embedding models on a variety of Korean and English benchmarks.
We report **Normalized Discounted Cumulative Gain (nDCG@10)** scores, which measure how well the top 10 ranked documents align with ground-truth relevance. Higher values indicate better retrieval quality.

All evaluations were conducted using the open-source **[Korean-MTEB-Retrieval-Evaluators](https://github.com/BM-K/Korean-MTEB-Retrieval-Evaluators)** codebase to ensure consistent dataset handling, indexing, retrieval, and nDCG@10 computation across models.
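
For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@k for a single query. It assumes linear gain and a log2 rank discount, which is one common convention; the evaluation codebase may differ in details. The helper names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the ideal ranking is derived from the same relevance list, so it assumes every relevant document appears in that list.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance labels in the order the retriever returned them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevant documents retrieved at ranks 1 and 3 (toy labels).
score = ndcg_at_k([1, 0, 1, 0], k=10)
print(round(score, 4))
```

A benchmark score is then typically the mean of this value over all queries in the dataset.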

### Benchmark Overview and Dataset Descriptions
| Model Name | # params | STELLA (XL) | MTEB (ko) | BEIR (en) |
|------|:---:|:---:|:---:|:---:|
| **telepix/PIXIE-Rune-v1.0** | **0.5B** | **0.6345** | **0.7603** | **0.5872** |
| | | | | |
| nvidia/llama-embed-nemotron-8b | 8B | 0.7181 | 0.7813 | 0.6935 |
| Qwen/Qwen3-Embedding-8B | 8B | 0.6154 | 0.7839 | 0.6701 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.5B | 0.5448 | 0.7390 | 0.6006 |
| BAAI/bge-m3 | 0.5B | 0.5056 | 0.7483 | 0.5573 |
| Qwen/Qwen3-Embedding-0.6B | 0.6B | 0.4707 | 0.7017 | 0.5839 |
| Octen/Octen-Embedding-0.6B | 0.6B | 0.4683 | 0.7057 | 0.5769 |
| Salesforce/SFR-Embedding-Mistral | 7B | 0.4579 | N/A | N/A |
| Alibaba-NLP/gte-multilingual-base | 0.3B | 0.4097 | 0.7084 | 0.5746 |
| intfloat/multilingual-e5-large-instruct | 0.6B | 0.2384 | 0.7050 | N/A |
| jinaai/jina-embeddings-v3 | 0.5B | N/A | 0.7088 | 0.4861 |
| openai/text-embedding-3-large | N/A | N/A | 0.6646 | N/A |

To better interpret the evaluation results above, we briefly describe the characteristics and evaluation intent of each benchmark suite used in this comparison.
Each benchmark assesses a different aspect of retrieval capability, ranging from domain-specific technical understanding to open-domain and multilingual generalization.

#### STELLA
[STELLA](https://arxiv.org/abs/2601.03496) is an aerospace-domain Information Retrieval (IR) benchmark constructed from NASA Technical Reports Server (NTRS) documents. It is designed to evaluate both:

- **Lexical matching** ability (does the retriever benefit from exact technical terms? | TCQ)
- **Semantic matching** ability (can the retriever match concepts even when technical terms are not explicitly used? | TAQ)

STELLA provides **dual-type synthetic queries** and a **cross-lingual extension** for multilingual evaluation while keeping the corpus in English.

#### 6 Datasets of MTEB (Korean)
Descriptions of the benchmark datasets used for evaluation are as follows:
- **Ko-StrategyQA**
  A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents.
- **AutoRAGRetrieval**
  A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors.
- **MIRACLRetrieval**
  A document retrieval benchmark built on Korean Wikipedia articles.
- **PublicHealthQA**
  A retrieval dataset focused on medical and public health topics.
- **BelebeleRetrieval**
  A dataset for retrieving relevant content from web and news articles in Korean.
- **MultiLongDocRetrieval**
  A long-document retrieval benchmark based on Korean Wikipedia and the mC4 corpus.

#### 7 Datasets of BEIR (English)
Descriptions of the benchmark datasets used for evaluation are as follows:
- **ArguAna**
  A dataset for argument retrieval based on claim-counterclaim pairs from online debate forums.
- **FEVER**
  A fact verification dataset using Wikipedia for evidence-based claim validation.
- **FiQA-2018**
  A retrieval benchmark tailored to the finance domain with real-world questions and answers.
- **HotpotQA**
  A multi-hop open-domain QA dataset requiring reasoning across multiple documents.
- **MSMARCO**
  A large-scale benchmark using real Bing search queries and corresponding web documents.
- **NQ**
  A Google QA dataset where user questions are answered using Wikipedia articles.
- **SCIDOCS**
  A citation-based document retrieval dataset focused on scientific papers.

| | ## Direct Use (Semantic Search) |
| |
|
| | ```python |
| | from sentence_transformers import SentenceTransformer |
| | |
| | # Load the model |
| | model_name = 'telepix/PIXIE-Rune-v1.0' |
| | model = SentenceTransformer(model_name) |
| | |
| | # Define the queries and documents |
| | queries = [ |
| | "ํ
๋ ํฝ์ค๋ ์ด๋ค ์ฐ์
๋ถ์ผ์์ ์์ฑ ๋ฐ์ดํฐ๋ฅผ ํ์ฉํ๋์?", |
| | "๊ตญ๋ฐฉ ๋ถ์ผ์ ์ด๋ค ์์ฑ ์๋น์ค๊ฐ ์ ๊ณต๋๋์?", |
| | "ํ
๋ ํฝ์ค์ ๊ธฐ์ ์์ค์ ์ด๋ ์ ๋์ธ๊ฐ์?", |
| | ] |
| | documents = [ |
| | "ํ
๋ ํฝ์ค๋ ํด์, ์์, ๋์
๋ฑ ๋ค์ํ ๋ถ์ผ์์ ์์ฑ ๋ฐ์ดํฐ๋ฅผ ๋ถ์ํ์ฌ ์๋น์ค๋ฅผ ์ ๊ณตํฉ๋๋ค.", |
| | "์ ์ฐฐ ๋ฐ ๊ฐ์ ๋ชฉ์ ์ ์์ฑ ์์์ ํตํด ๊ตญ๋ฐฉ ๊ด๋ จ ์ ๋ฐ ๋ถ์ ์๋น์ค๋ฅผ ์ ๊ณตํฉ๋๋ค.", |
| | "TelePIX์ ๊ดํ ํ์ฌ์ฒด ๋ฐ AI ๋ถ์ ๊ธฐ์ ์ Global standard๋ฅผ ์ํํ๋ ์์ค์ผ๋ก ํ๊ฐ๋ฐ๊ณ ์์ต๋๋ค.", |
| | "ํ
๋ ํฝ์ค๋ ์ฐ์ฃผ์์ ์์งํ ์ ๋ณด๋ฅผ ๋ถ์ํ์ฌ '์ฐ์ฃผ ๊ฒฝ์ (Space Economy)'๋ผ๋ ์๋ก์ด ๊ฐ์น๋ฅผ ์ฐฝ์ถํ๊ณ ์์ต๋๋ค.", |
| | "ํ
๋ ํฝ์ค๋ ์์ฑ ์์ ํ๋๋ถํฐ ๋ถ์, ์๋น์ค ์ ๊ณต๊น์ง ์ ์ฃผ๊ธฐ๋ฅผ ์์ฐ๋ฅด๋ ์๋ฃจ์
์ ์ ๊ณตํฉ๋๋ค.", |
| | ] |
| | |
| | # Compute embeddings: use `prompt_name="query"` to encode queries! |
| | query_embeddings = model.encode(queries, prompt_name="query") |
| | document_embeddings = model.encode(documents) |
| | |
| | # Compute cosine similarity scores |
| | scores = model.similarity(query_embeddings, document_embeddings) |
| | |
| | # Output the results |
| | for query, query_scores in zip(queries, scores): |
| | doc_score_pairs = list(zip(documents, query_scores)) |
| | doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True) |
| | print("Query:", query) |
| | for document, score in doc_score_pairs: |
| | print(score, document) |
| | |
| | ``` |
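
For larger document sets, sorting every document per query is often unnecessary. The sketch below shows one way to keep only the top-k hits from a score matrix. It assumes the scores have been converted to a NumPy array of shape `(num_queries, num_documents)`, for example with `scores.cpu().numpy()`; the helper name `top_k_documents` and the toy score values are hypothetical.

```python
import numpy as np


def top_k_documents(scores, k=3):
    """Return, for each query, the k best (document_index, score) pairs.

    scores: array-like of shape (num_queries, num_documents).
    """
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.shape[1])
    # Negate the scores so argsort orders each row from highest to lowest.
    order = np.argsort(-scores, axis=1)[:, :k]
    return [
        [(int(j), float(scores[i, j])) for j in row]
        for i, row in enumerate(order)
    ]


# Toy similarity matrix: 2 queries x 3 documents (hypothetical values).
toy_scores = np.array([
    [0.10, 0.80, 0.30],
    [0.70, 0.20, 0.60],
])
results = top_k_documents(toy_scores, k=2)
for query_index, hits in enumerate(results):
    print(query_index, hits)
```

The returned indices can be used to look up the corresponding entries in `documents`.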

## License
The PIXIE-Rune-v1.0 model is licensed under the Apache License 2.0.

## Citation
```
@misc{TelePIX-PIXIE-Rune-v1.0,
  title={PIXIE-Rune-v1.0},
  author={TelePIX AI Research Team and Bongmin Kim},
  year={2026},
  url={https://huggingface.co/telepix/PIXIE-Rune-v1.0}
}
```

## Contact

If you have any suggestions or questions about PIXIE, please contact the authors at bmkim@telepix.net.