BM-K committed on
Commit 86ff0ee · verified · 1 Parent(s): dfa56fb

Update README.md

Files changed (1)
  1. README.md +248 -3
README.md CHANGED
@@ -1,3 +1,248 @@
- ---
- license: apache-2.0
- ---
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - dense-encoder
+ - dense
+ - feature-extraction
+ - retrieval
+ - multimodal
+ - multi-modal
+ - crossmodal
+ - cross-modal
+ - aerospace
+ - telepix
+ language:
+ - af
+ - ar
+ - az
+ - be
+ - bg
+ - bn
+ - ca
+ - ceb
+ - cs
+ - cy
+ - da
+ - de
+ - el
+ - en
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fr
+ - gl
+ - gu
+ - he
+ - hi
+ - hr
+ - ht
+ - hu
+ - hy
+ - id
+ - is
+ - it
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ky
+ - lo
+ - lt
+ - lv
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - my
+ - ne
+ - nl
+ - pa
+ - pl
+ - pt
+ - qu
+ - ro
+ - ru
+ - si
+ - sk
+ - sl
+ - so
+ - sq
+ - sr
+ - sv
+ - sw
+ - ta
+ - te
+ - th
+ - tl
+ - tr
+ - uk
+ - ur
+ - vi
+ - yo
+ - zh
+ pipeline_tag: feature-extraction
+ library_name: sentence-transformers
+ license: apache-2.0
+ ---
+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/61d6f4a4d49065ee28a1ee7e/V8n2En7BlMNHoi1YXVv8Q.png" width="400"/>
+ </p>
+
+ # PIXIE-Rune-v1.0
+ **PIXIE-Rune-v1.0** is an encoder-based embedding model trained on Korean and English information retrieval datasets,
+ developed by [TelePIX Co., Ltd](https://telepix.net/).
+ **PIXIE** stands for Tele**PIX** **I**ntelligent **E**mbedding, representing TelePIX's high-performance embedding technology.
+ This model is optimized for semantic retrieval in Korean and English and performs strongly in the aerospace domain. Through extensive fine-tuning and domain-specific evaluation, PIXIE shows robust retrieval quality for real-world use cases such as document understanding, technical QA, and semantic search in aerospace and related high-precision fields.
+ It also performs competitively across a wide range of open-domain Korean and English retrieval benchmarks, making it a versatile foundation for multilingual semantic search systems.
+
+
+ ## Model Description
+ - **Model Type:** Sentence Transformer
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
+ - **Maximum Sequence Length:** 6144 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Language:** Multilingual, optimized for high performance in Korean and English
+ - **Domain Specialization:** Aerospace Information Retrieval
+ - **License:** apache-2.0
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 6144, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
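The architecture listing describes three stages: a transformer encoder, CLS-token pooling, and L2 normalization. The following is a minimal sketch of the last two stages in plain Python, assuming toy 4-dimensional token vectors in place of real 1024-dimensional encoder outputs. The helper names `cls_pool`, `l2_normalize`, and `dot` are hypothetical and are not part of the sentence-transformers API. It illustrates why, once embeddings are unit length, cosine similarity reduces to a dot product.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: keep only the vector of the first token."""
    return token_embeddings[0]


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy per-token outputs (hypothetical values; row 0 plays the role of the CLS token).
query_tokens = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
doc_tokens = [[6.0, 8.0, 0.0, 0.0], [-1.0, 2.0, -3.0, 4.0]]

query_vec = l2_normalize(cls_pool(query_tokens))
doc_vec = l2_normalize(cls_pool(doc_tokens))

# The two CLS vectors point in the same direction, so the score is close to 1.
similarity = dot(query_vec, doc_vec)
print(similarity)
```

Only the first row of each token matrix influences the result, which is the practical consequence of `pooling_mode_cls_token: True` in the listing.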
+ ## Quality Benchmarks
+ **PIXIE-Rune-v1.0** is a multilingual embedding model specialized for Korean and English retrieval tasks.
+ It delivers consistently strong performance across a diverse set of domain-specific and open-domain benchmarks in both languages, demonstrating its effectiveness in real-world semantic search applications.
+ The table below presents the retrieval performance of several embedding models evaluated on a variety of Korean and English benchmarks.
+ We report **Normalized Discounted Cumulative Gain (nDCG@10)** scores, which measure how well a ranked list of documents aligns with ground-truth relevance. Higher values indicate better retrieval quality.
+
+ All evaluations were conducted using the open-source **[Korean-MTEB-Retrieval-Evaluators](https://github.com/BM-K/Korean-MTEB-Retrieval-Evaluators)** codebase to ensure consistent dataset handling, indexing, retrieval, and nDCG@10 computation across models.
+
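To make the nDCG@10 metric concrete, here is a minimal sketch of the computation for a single query. It assumes the common linear-gain formulation (gain equals the relevance label, discounted by log2 of the rank plus one); some implementations use an exponential gain instead, and this is not claimed to be the exact code of the evaluator codebase. The function names and the toy relevance judgments are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy judgments: two relevant documents for one query (hypothetical IDs).
qrels = {"d1": 1, "d2": 1}

perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)  # relevant documents ranked first
delayed = ndcg_at_k(["d3", "d1", "d2"], qrels)  # an irrelevant document ranked first
print(perfect, delayed)
```

A benchmark score is then typically the mean of this per-query value over all queries in the dataset.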
+ ### Benchmark Overview and Dataset Descriptions
+ | Model Name | # params | STELLA (XL) | MTEB (ko) | BEIR (en) |
+ |------|:---:|:---:|:---:|:---:|
+ | **telepix/PIXIE-Rune-v1.0** | **0.5B** | **0.6345** | **0.7603** | **0.5872** |
+ | | | | | |
+ | nvidia/llama-embed-nemotron-8b | 8B | 0.7181 | N/A | N/A |
+ | Qwen/Qwen3-Embedding-8B | 8B | 0.6154 | N/A | N/A |
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.5B | 0.5448 | 0.7390 | 0.6006 |
+ | BAAI/bge-m3 | 0.5B | 0.5056 | 0.7483 | 0.5573 |
+ | Salesforce/SFR-Embedding-Mistral | 7B | 0.4579 | N/A | N/A |
+ | Alibaba-NLP/gte-multilingual-base | 0.3B | 0.4097 | 0.7084 | 0.5746 |
+ | intfloat/multilingual-e5-large-instruct | 0.6B | 0.2384 | 0.7050 | N/A |
+ | jinaai/jina-embeddings-v3 | 0.5B | N/A | 0.7088 | 0.4861 |
+ | Qwen/Qwen3-Embedding-0.6B | 0.6B | N/A | 0.7017 | 0.5839 |
+ | openai/text-embedding-3-large | N/A | N/A | 0.6646 | N/A |
+
+ To better interpret the evaluation results above, we briefly describe the characteristics and evaluation intent of each benchmark suite used in this comparison.
+ Each benchmark is designed to assess different aspects of retrieval capability, ranging from domain-specific technical understanding to open-domain and multilingual generalization.
+
+ #### STELLA
+ [STELLA](https://arxiv.org/abs/2601.03496) is an aerospace-domain Information Retrieval (IR) benchmark constructed from NASA Technical Reports Server (NTRS) documents. It is designed to evaluate both:
+
+ - **Lexical matching** ability (TCQ): does the retriever benefit from exact technical terms?
+ - **Semantic matching** ability (TAQ): can the retriever match concepts even when technical terms are not explicitly used?
+
+ STELLA provides **dual-type synthetic queries** and a **cross-lingual extension** for multilingual evaluation while keeping the corpus in English.
+
+ #### 6 Datasets of MTEB (Korean)
+ The benchmark datasets used for evaluation are described below:
+ - **Ko-StrategyQA**
+   A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents.
+ - **AutoRAGRetrieval**
+   A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors.
+ - **MIRACLRetrieval**
+   A document retrieval benchmark built on Korean Wikipedia articles.
+ - **PublicHealthQA**
+   A retrieval dataset focused on medical and public health topics.
+ - **BelebeleRetrieval**
+   A dataset for retrieving relevant content from web and news articles in Korean.
+ - **MultiLongDocRetrieval**
+   A long-document retrieval benchmark based on Korean Wikipedia and the mC4 corpus.
+
+ #### 7 Datasets of BEIR (English)
+ The benchmark datasets used for evaluation are described below:
+ - **ArguAna**
+   A dataset for argument retrieval based on claim-counterclaim pairs from online debate forums.
+ - **FEVER**
+   A fact verification dataset using Wikipedia for evidence-based claim validation.
+ - **FiQA-2018**
+   A retrieval benchmark tailored to the finance domain with real-world questions and answers.
+ - **HotpotQA**
+   A multi-hop open-domain QA dataset requiring reasoning across multiple documents.
+ - **MSMARCO**
+   A large-scale benchmark using real Bing search queries and corresponding web documents.
+ - **NQ**
+   A Google QA dataset where user questions are answered using Wikipedia articles.
+ - **SCIDOCS**
+   A citation-based document retrieval dataset focused on scientific papers.
+
+ ## Direct Use (Semantic Search)
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Load the model
+ model_name = 'telepix/PIXIE-Rune-v1.0'
+ model = SentenceTransformer(model_name)
+
+ # Define the queries and documents (Korean, with English glosses in comments)
+ queries = [
+     # In which industries does TelePIX use satellite data?
+     "텔레픽스는 어떤 산업 분야에서 위성 데이터를 활용하나요?",
+     # What satellite services are provided for the defense sector?
+     "국방 분야에 어떤 위성 서비스가 제공되나요?",
+     # How advanced is TelePIX's technology?
+     "텔레픽스의 기술 수준은 어느 정도인가요?",
+ ]
+ documents = [
+     # TelePIX analyzes satellite data and provides services in fields such as ocean, resources, and agriculture.
+     "텔레픽스는 해양, 자원, 농업 등 다양한 분야에서 위성 데이터를 분석하여 서비스를 제공합니다.",
+     # It provides precision analysis services for defense using reconnaissance and surveillance satellite imagery.
+     "정찰 및 감시 목적의 위성 영상을 통해 국방 관련 정밀 분석 서비스를 제공합니다.",
+     # TelePIX's optical payload and AI analysis technology is rated as exceeding the global standard.
+     "TelePIX의 광학 탑재체 및 AI 분석 기술은 Global standard를 상회하는 수준으로 평가받고 있습니다.",
+     # TelePIX analyzes information collected from space to create new value, the 'Space Economy'.
+     "텔레픽스는 우주에서 수집한 정보를 분석하여 '우주 경제(Space Economy)'라는 새로운 가치를 창출하고 있습니다.",
+     # TelePIX provides solutions covering the full cycle from satellite image acquisition to analysis and service delivery.
+     "텔레픽스는 위성 영상 획득부터 분석, 서비스 제공까지 전 주기를 아우르는 솔루션을 제공합니다.",
+ ]
+
+ # Compute embeddings: use `prompt_name="query"` to encode queries!
+ query_embeddings = model.encode(queries, prompt_name="query")
+ document_embeddings = model.encode(documents)
+
+ # Compute cosine similarity scores (one row per query, one column per document)
+ scores = model.similarity(query_embeddings, document_embeddings)
+
+ # Output the results: for each query, print documents from most to least similar
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)
+ ```
+
+ ## License
+ The PIXIE-Rune-v1.0 model is licensed under the Apache License 2.0.
+
+ ## Citation
+ ```
+ @misc{TelePIX-PIXIE-Rune-v1.0,
+   title={PIXIE-Rune-v1.0},
+   author={TelePIX AI Research Team and Bongmin Kim},
+   year={2026},
+   url={https://huggingface.co/telepix/PIXIE-Rune-v1.0}
+ }
+ ```
+
+ ## Contact
+
+ If you have any suggestions or questions about PIXIE, please reach out to the authors at bmkim@telepix.net.