sewoong committed
Commit fa98e65 · verified · 1 Parent(s): 8dade64

docs: update README with V28 benchmark results

Files changed (1): README.md (+36 −221)
README.md CHANGED
@@ -1,245 +1,60 @@
  ---
- language:
- - ko
- - en
- - multilingual
- license: apache-2.0
  tags:
- - sparse-retrieval
- - splade
- - korean
- - opensearch
- - neural-search
- - neural-sparse
- - xlm-roberta
- - multilingual
  library_name: transformers
  pipeline_tag: feature-extraction
- base_model: xlm-roberta-base
- datasets:
- - wikipedia
- - klue
- - korquad
  ---

- # Korean Neural Sparse Encoder V26

- Korean multilingual neural sparse encoder for OpenSearch neural sparse search, based on XLM-RoBERTa with IDF-aware FLOPS loss and enhanced stopword suppression.

  ## Model Description

- This model is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and fine-tuned for Korean/multilingual term expansion in neural sparse retrieval tasks using SPLADE architecture with knowledge distillation from BGE-M3.
-
- ### Key Features
-
- - **Multilingual Support**: Based on XLM-RoBERTa, supports Korean and other languages
- - **IDF-Aware Training**: Uses document frequency-aware FLOPS loss for better term weighting
- - **Enhanced Stopword Suppression**: V26 improvements eliminate stopword dominance
- - **Knowledge Distillation**: Learns from BGE-M3 teacher model
- - **OpenSearch Compatible**: Designed for OpenSearch neural sparse search
-
- ## V26 Improvements
-
- V26 addresses the stopword dominance issue found in V25:
-
- | Parameter | V25 | V26 | Change |
- |-----------|-----|-----|--------|
- | lambda_flops | 0.002 | 0.010 | 5x increase |
- | stopword_penalty | 5.0 | 15.0 | 3x increase |
- | idf_alpha | 2.5 | 4.0 | Sharper curve |
- | special_token_penalty | - | 100.0 | NEW |
- | stopword_list | 163 | 242 | Extended |
-
- **Key Fix**: Special tokens (`<s>`, `</s>`) were excluded from IDF normalization to prevent range compression.
-
- ## Benchmark Results (2026-01-28)
-
- Evaluated on 1,000 Korean QA pairs:
-
- | Method | Recall@1 | Recall@5 | Recall@10 | MRR | nDCG@10 |
- |--------|----------|----------|-----------|-----|---------|
- | **Neural Sparse (V26)** | **40.7%** | **51.4%** | **56.1%** | **0.4555** | **0.4806** |
- | Semantic (BGE-M3) | 37.1% | 50.2% | 53.1% | 0.4307 | 0.4553 |
- | BM25 | 30.0% | 42.2% | 44.6% | 0.3541 | 0.3767 |
-
- ### Performance Comparison
-
- | Metric | V25 | V26 | Improvement |
- |--------|-----|-----|-------------|
- | Recall@1 | 28.2% | **40.7%** | **+44.3%** |
- | vs BM25 | -6% | **+35.7%** | ✅ Fixed |
- | vs Semantic | -24% | **+3.6pp** | ✅ Surpassed |
-
- **Statistical Significance**: All comparisons are statistically significant (p < 0.01)

  ## Training Details

- ### Architecture
-
- ```
- Input -> XLM-RoBERTa-base -> log(1 + ReLU(logits)) -> Max Pooling -> Sparse Vector
- ```
-
- ### Hyperparameters
-
- | Parameter | Value |
- |-----------|-------|
- | Base Model | xlm-roberta-base |
- | Parameters | 278M |
- | Learning Rate | 2e-5 |
- | Epochs | 25 |
- | Batch Size | 48 |
- | Max Length | 192 |
- | Lambda FLOPS | 0.010 |
- | Stopword Penalty | 15.0 |
- | IDF Alpha | 4.0 |
- | Special Token Penalty | 100.0 |
-
- ### Loss Function
-
- ```python
- L_total = L_infonce                    # Contrastive learning
-         + lambda_flops * L_flops_idf   # IDF-aware FLOPS regularization
-         + lambda_kd * L_kd             # Knowledge distillation from BGE-M3
-         + margin_loss                  # Triplet margin loss
- ```
-
- ## Usage
-
- ### With Transformers
-
- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
- import torch
-
- # Load model and tokenizer
- tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder-v26")
- model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder-v26")
-
- # Encode text
- text = "당뇨병 치료 방법"
- inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=192)
-
- with torch.no_grad():
-     outputs = model(**inputs)
-     logits = outputs.logits
-     # SPLADE transformation: log(1 + ReLU(logits))
-     sparse_repr = torch.log1p(torch.relu(logits))
-     # Max pooling over sequence
-     sparse_repr = sparse_repr.max(dim=1).values
-
- # Get top activated tokens
- top_k = 10
- top_values, top_indices = sparse_repr[0].topk(top_k)
- print("Top-10 activated tokens:")
- for idx, val in zip(top_indices.tolist(), top_values.tolist()):
-     if val > 0:
-         token = tokenizer.decode([idx]).strip()
-         print(f"  {token}: {val:.4f}")
- ```
-
- ### Example Output
-
- For the query "당뇨병 치료 방법" (diabetes treatment methods):
-
- ```
- Top-10 activated tokens:
-   병: 3.8709
-   당: 3.8478
-   치료: 3.8428
-   뇨: 3.8229
-   혈: 2.9696
-   방법: 2.7375
-   당뇨: 2.5123
-   혈당: 2.3456
-   의료: 2.1234
-   약: 2.0123
- ```
-
- **Note**: V26 now correctly activates semantic tokens (병, 당, 치료, 뇨) instead of stopwords (있습니다, 수, 하는) that dominated V25.
-
- ### With OpenSearch
-
- ```python
- from opensearchpy import OpenSearch
-
- # Create neural sparse index
- index_body = {
-     "settings": {
-         "index.knn": True
-     },
-     "mappings": {
-         "properties": {
-             "text": {"type": "text"},
-             "sparse_embedding": {
-                 "type": "rank_features"
-             }
-         }
-     }
- }
-
- # Index document with sparse embedding
- doc = {
-     "text": "당뇨병 치료 방법에 대한 안내",
-     "sparse_embedding": {
-         "병": 3.87, "당": 3.85, "치료": 3.84, "뇨": 3.82, "방법": 2.74
-     }
- }
-
- # Neural sparse search
- query = {
-     "query": {
-         "neural_sparse": {
-             "sparse_embedding": {
-                 "query_text": "당뇨병 치료",
-                 "model_id": "your-model-id"
-             }
-         }
-     }
- }
- ```
-
- ## Intended Use
-
- This model is designed for:
-
- - **OpenSearch Neural Sparse Search**: Term expansion for better recall
- - **Korean Document Search**: Finding relevant Korean documents
- - **Multilingual Search**: Supports XLM-RoBERTa's 100+ languages
- - **Medical/Legal Domain Search**: Optimized for specialized terminology
-
- ## Limitations
-
- - Best performance with max 192 tokens
- - Primary optimization for Korean, but supports multilingual
- - Requires SPLADE-style sparse vector extraction

  ## Version History

- | Version | Date | Recall@1 | Key Changes |
- |---------|------|----------|-------------|
- | V26 | 2026-01-28 | **40.7%** | IDF-aware FLOPS, enhanced stopword suppression |
- | V25 | 2026-01-22 | 28.2% | XLM-RoBERTa base, knowledge distillation |
- | V24 | 2026-01-15 | 25.1% | Curriculum learning |
-
- ## Citation
-
- ```bibtex
- @misc{korean-neural-sparse-encoder-v26,
-   title={Korean Neural Sparse Encoder V26: IDF-Aware FLOPS with Enhanced Stopword Suppression},
-   author={sewoong},
-   year={2026},
-   url={https://huggingface.co/sewoong/korean-neural-sparse-encoder-v26}
- }
- ```
-
- ## License
-
- Apache 2.0
-
- ## Acknowledgments
-
- - Base model: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
- - Teacher model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- - Architecture: [SPLADE](https://arxiv.org/abs/2107.05720)
- - Integration: [OpenSearch Neural Sparse Search](https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/)
  ---
+ language: ko
  tags:
+ - neural-sparse
+ - opensearch
+ - korean
+ - xlm-roberta
+ - sparse-retrieval
+ - information-retrieval
+ license: apache-2.0
  library_name: transformers
  pipeline_tag: feature-extraction
  ---

+ # Korean Neural Sparse Encoder V28

+ Korean-optimized neural sparse retrieval model based on XLM-RoBERTa with a Context Gate architecture.

  ## Model Description

+ - **Architecture**: SPLADEDocContextGated (XLM-RoBERTa-base + Context Gate)
+ - **Parameters**: 345M
+ - **Training Data**: 8M+ Korean text pairs (V29.0 dataset)
+ - **Training**: 25 epochs, 8x NVIDIA B200 GPUs (DDP), BF16
+ - **Teacher**: BAAI/bge-m3 (knowledge distillation)
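The card does not spell out how the Context Gate works. Purely as an illustration of the general idea, not the actual SPLADEDocContextGated code, one common formulation scales each token's SPLADE activation by a learned sigmoid gate computed from its contextual hidden state (all names and shapes below are hypothetical):

```python
import numpy as np

def context_gated_splade(logits, hidden, w_gate, b_gate):
    """Hypothetical context-gated SPLADE pooling (illustration only).

    logits : (seq_len, vocab) masked-LM logits
    hidden : (seq_len, dim) contextual hidden states
    w_gate, b_gate : assumed gate parameters producing one scalar per token
    """
    act = np.log1p(np.maximum(logits, 0.0))                   # SPLADE: log(1 + ReLU)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_gate + b_gate)))  # sigmoid gate in (0, 1)
    return (act * gate[:, None]).max(axis=0)                  # max-pool over tokens

rng = np.random.default_rng(0)
vec = context_gated_splade(rng.normal(size=(5, 32)),
                           rng.normal(size=(5, 8)),
                           rng.normal(size=8), 0.0)
```

A gate of this shape lets context downweight tokens (e.g. stopwords) before pooling, which is at least consistent with the stopword-suppression goals of the earlier versions.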

+ ## Ko-StrategyQA Benchmark (592 queries, 9,251 documents)

+ | Method | Recall@1 | Recall@5 | Recall@10 | MRR | P50 (ms) |
+ |--------|----------|----------|-----------|-----|----------|
+ | **semantic** (BGE-M3) | 73.5% | 87.3% | 89.4% | 0.795 | 16.1 |
+ | hybrid_linear_0.3 | 70.3% | 86.0% | 88.7% | 0.772 | 96.6 |
+ | bm25_semantic_rrf | 67.4% | 85.5% | 87.8% | 0.751 | 67.7 |
+ | bm25 | 53.7% | 75.3% | 81.9% | 0.626 | 15.2 |
+ | **neural_sparse** (this model) | 16.2% | 40.2% | 54.9% | 0.265 | 18.1 |
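For reference, the two hybrid rows fuse BM25 and dense (BGE-M3) results. A minimal sketch of both schemes, assuming min-max score normalization for the linear variant and the common k=60 constant for RRF; reading the 0.3 in `hybrid_linear_0.3` as the BM25 weight is also an assumption:

```python
def linear_fusion(bm25_scores, dense_scores, alpha=0.3):
    """alpha * bm25 + (1 - alpha) * dense, after min-max normalizing each side."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    nb, nd = norm(bm25_scores), norm(dense_scores)
    return {doc: alpha * nb.get(doc, 0.0) + (1 - alpha) * nd.get(doc, 0.0)
            for doc in set(nb) | set(nd)}

def rrf_fusion(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused
```

RRF only needs ranks, which is why it is a popular default when the two scorers' scales are incomparable; linear fusion needs the normalization step but exposes an explicit weight.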

+ ## Usage with OpenSearch
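This section is empty in this revision; until it is filled in, the wiring shown in the removed V26 README should still apply. A sketch of that pattern, where the index/field names and `model_id` are placeholders:

```python
# Placeholder sketch: rank_features mapping plus a neural_sparse query.
index_body = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "sparse_embedding": {"type": "rank_features"},  # sparse term weights
        }
    }
}

query = {
    "query": {
        "neural_sparse": {
            "sparse_embedding": {
                "query_text": "당뇨병 치료",   # "diabetes treatment"
                "model_id": "your-model-id",  # id of the deployed sparse encoder
            }
        }
    }
}

# With opensearch-py (running cluster required):
# client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
# client.indices.create(index="docs", body=index_body)
# client.search(index="docs", body=query)
```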

+ ## Usage with Transformers
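Also a stub in this revision; the encoding recipe from the removed V26 section should carry over: run the checkpoint as a masked-LM, then apply `log(1 + ReLU(logits))` with max-pooling over non-padding tokens. The pooling step, sketched in NumPy (the V28 repo id in the comments is an assumption):

```python
import numpy as np

def splade_pool(logits, attention_mask):
    """SPLADE pooling: log(1 + ReLU(logits)), then max over non-padding positions.

    logits : (seq_len, vocab_size) masked-LM logits for one text
    attention_mask : (seq_len,) array of 0/1
    """
    act = np.log1p(np.maximum(logits, 0.0))
    act = act * attention_mask[:, None]   # zero out padding positions
    return act.max(axis=0)                # (vocab_size,) sparse vector

# Intended use with transformers (repo id assumed):
# tok = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder-v28")
# mlm = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder-v28")
# then splade_pool(mlm(**tok(text, return_tensors="np")).logits[0], mask)
```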

  ## Training Details

+ - **Version**: V28 (Context-Gated SPLADE)
+ - **Base Model**: xlm-roberta-base
+ - **Loss**: InfoNCE + FLOPS + KD (BGE-M3) + Language Penalty
+ - **Curriculum**: 2-phase (Foundation -> Balanced with hard negatives)
+ - **Final Train Loss**: 1.8255
+ - **Final Val Loss**: 1.9558
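Of the four loss terms, FLOPS is the most SPLADE-specific: it penalizes the squared mean activation of each vocabulary term across the batch, pushing representations toward sparsity. A schematic, where the composite weighting shown in the comment is illustrative rather than the V28 value:

```python
import numpy as np

def flops_loss(batch_reprs):
    """FLOPS regularizer: sum over vocab of the squared mean activation.

    batch_reprs : (batch, vocab) nonnegative sparse representations
    """
    mean_act = batch_reprs.mean(axis=0)   # average activation per vocab term
    return float((mean_act ** 2).sum())

# Schematic composite (weights illustrative, not the trained values):
# loss = infonce + 0.01 * flops_loss(reprs) + kd_weight * kd_loss + language_penalty
```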

  ## Version History

+ | Version | Recall@1 | Architecture |
+ |---------|----------|--------------|
+ | V28 | 16.2% | SPLADEDocContextGated |
+ | V26 | 30.4% | SPLADEDocXLMR + IDF |
+ | V25 | 21.0% | SPLADEDocXLMR |