Update README.md
)
```

## Quality Benchmarks

**PIXIE-Rune-M-v1.0** is a multilingual embedding model specialized for Korean and English retrieval tasks.
It delivers consistently strong performance across a diverse set of domain-specific and open-domain benchmarks in both languages, demonstrating its effectiveness in real-world semantic search applications.
The table below presents the retrieval performance of several embedding models evaluated on a variety of Korean and English benchmarks.
We report **Normalized Discounted Cumulative Gain (NDCG)** scores, which measure how well a ranked list of documents aligns with ground-truth relevance. Higher values indicate better retrieval quality.

- **Avg. NDCG**: Average of NDCG@1, @3, @5, and @10 across all benchmark datasets.
- **NDCG@k**: Relevance quality of the top-*k* retrieved results.

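As a small illustration of the metric (not the evaluation harness used for these benchmarks), NDCG@k can be computed from a list of graded relevance labels in ranked order; the relevance values below are hypothetical:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(rank + 2)  # ranks are 0-indexed, so +2
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal (descending-sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ranking scores 1.0; demoting a relevant document lowers the score.
print(ndcg_at_k([1, 1, 0, 0], k=4))           # 1.0 (already ideally ordered)
print(round(ndcg_at_k([0, 1, 1, 0], k=4), 4)) # 0.6934
```

Averaging such per-query scores over a dataset gives the per-dataset NDCG@k values that the tables below aggregate.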
#### Korean Retrieval Benchmarks

Our model, **telepix/PIXIE-Rune-M-v1.0**, achieves state-of-the-art performance across most metrics and benchmarks, demonstrating strong generalization across domains such as multi-hop QA, long-document retrieval, public health, and e-commerce.

| Model Name | # params | Avg. NDCG | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
|------|:---:|:---:|:---:|:---:|:---:|:---:|
| **telepix/PIXIE-Rune-M-v1.0** | 385M | **0.6905** | **0.6461** | **0.6859** | **0.7063** | **0.7238** |
| nlpai-lab/KURE-v1 | 568M | 0.6751 | 0.6277 | 0.6725 | 0.6907 | 0.7095 |
| dragonkue/BGE-m3-ko | 568M | 0.6658 | 0.6225 | 0.6627 | 0.6795 | 0.6985 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.6592 | 0.6118 | 0.6542 | 0.6759 | 0.6949 |
| BAAI/bge-m3 | 568M | 0.6573 | 0.6099 | 0.6533 | 0.6732 | 0.6930 |
| Qwen/Qwen3-Embedding-0.6B | 595M | 0.6321 | 0.5894 | 0.6274 | 0.6455 | 0.6662 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 7711M | 0.6202 | 0.5698 | 0.6200 | 0.6349 | 0.6564 |
| openai/text-embedding-3-large | N/A | 0.6015 | 0.5466 | 0.5999 | 0.6187 | 0.6409 |
| Salesforce/SFR-Embedding-2_R | 7111M | 0.5979 | 0.5451 | 0.5959 | 0.6158 | 0.6348 |

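Because each NDCG@k column is itself an average over the benchmark datasets, a row's Avg. NDCG should equal the mean of its four @k values (assuming equal weighting per dataset, as the metric definition above implies). A quick check against the top row:

```python
# Reproduce Avg. NDCG for telepix/PIXIE-Rune-M-v1.0 from its
# NDCG@1, @3, @5, and @10 columns in the Korean benchmark table.
ndcg_columns = [0.6461, 0.6859, 0.7063, 0.7238]
avg_ndcg = sum(ndcg_columns) / len(ndcg_columns)
print(round(avg_ndcg, 4))  # 0.6905, matching the Avg. NDCG column
```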
Descriptions of the benchmark datasets used for evaluation are as follows:

- **Ko-StrategyQA**
  A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents.
- **AutoRAGRetrieval**
  A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors.
- **MIRACLRetrieval**
  A document retrieval benchmark built on Korean Wikipedia articles.
- **PublicHealthQA**
  A retrieval dataset focused on medical and public health topics.
- **BelebeleRetrieval**
  A dataset for retrieving relevant content from web and news articles in Korean.
- **MultiLongDocRetrieval**
  A long-document retrieval benchmark based on Korean Wikipedia and the mC4 corpus.
- **XPQARetrieval**
  A real-world dataset constructed from user queries and relevant product documents on a Korean e-commerce platform.

#### English Retrieval Benchmarks

Our model, **telepix/PIXIE-Rune-M-v1.0**, achieves strong performance on a wide range of tasks, including fact verification, multi-hop question answering, financial QA, and scientific document retrieval, demonstrating competitive generalization across diverse domains.

| Model Name | # params | Avg. NDCG | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
|------|:---:|:---:|:---:|:---:|:---:|:---:|
| **telepix/PIXIE-Rune-M-v1.0** | 385M | **123** | **123** | **123** | **123** | **123** |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.5812 | 0.5725 | 0.5705 | 0.5811 | 0.6006 |
| Qwen/Qwen3-Embedding-0.6B | 595M | 0.5558 | 0.5321 | 0.5451 | 0.5620 | 0.5839 |
| BAAI/bge-m3 | 568M | 0.5318 | 0.5078 | 0.5231 | 0.5389 | 0.5573 |
| dragonkue/BGE-m3-ko | 568M | 0.5307 | 0.5125 | 0.5174 | 0.5362 | 0.5566 |
| nlpai-lab/KURE-v1 | 568M | 0.5272 | 0.5017 | 0.5171 | 0.5353 | 0.5548 |

Descriptions of the benchmark datasets used for evaluation are as follows:

- **ArguAna**
  A dataset for argument retrieval based on claim-counterclaim pairs from online debate forums.
- **FEVER**
  A fact verification dataset using Wikipedia for evidence-based claim validation.
- **FiQA-2018**
  A retrieval benchmark tailored to the finance domain with real-world questions and answers.
- **HotpotQA**
  A multi-hop open-domain QA dataset requiring reasoning across multiple documents.
- **MSMARCO**
  A large-scale benchmark using real Bing search queries and corresponding web documents.
- **NQ**
  A Google QA dataset where user questions are answered using Wikipedia articles.
- **SCIDOCS**
  A citation-based document retrieval dataset focused on scientific papers.

## Usage

### Direct Usage (Sentence Transformers)