Korean neural sparse encoder: SPLADE-max with ModernBERT backbone

Files changed (6)

README.md +104 -38
config.json +33 -15
model.safetensors +2 -2
special_tokens_map.json +43 -7
tokenizer.json +2 -2
tokenizer_config.json +278 -11

README.md CHANGED

@@ -1,60 +1,126 @@
 ---
-language: ko
 tags:
   - neural-sparse
   - opensearch
   - korean
-  - xlm-roberta
-  - sparse-retrieval
   - information-retrieval
-license: apache-2.0
 library_name: transformers
 pipeline_tag: feature-extraction
 ---
-# Korean Neural Sparse Encoder V28
-Korean-optimized neural sparse retrieval model based on XLM-RoBERTa with Context Gate architecture.
 ## Model Description
-- **Architecture**: SPLADEDocContextGated (XLM-RoBERTa-base + Context Gate)
-- **Parameters**: 345M
-- **Training Data**: 8M+ Korean text pairs (V29.0 dataset)
-- **Training**: 25 epochs, 8x NVIDIA B200 GPUs (DDP), BF16
-- **Teacher**: BAAI/bge-m3 (knowledge distillation)
-## Ko-StrategyQA Benchmark (592 queries, 9,251 documents)
-| Method | Recall@1 | Recall@5 | Recall@10 | MRR | P50 (ms) |
-|--------|----------|----------|-----------|-----|----------|
-| **semantic** (BGE-M3) | 73.5% | 87.3% | 89.4% | 0.795 | 16.1 |
-| hybrid_linear_0.3 | 70.3% | 86.0% | 88.7% | 0.772 | 96.6 |
-| bm25_semantic_rrf | 67.4% | 85.5% | 87.8% | 0.751 | 67.7 |
-| bm25 | 53.7% | 75.3% | 81.9% | 0.626 | 15.2 |
-| **neural_sparse** (this model) | 16.2% | 40.2% | 54.9% | 0.265 | 18.1 |
 ## Usage with OpenSearch
-## Usage with Transformers
 ## Training Details
-- **Version**: V28 (Context-Gated SPLADE)
-- **Base Model**: xlm-roberta-base
-- **Loss**: InfoNCE + FLOPS + KD (BGE-M3) + Language Penalty
-- **Curriculum**: 2-phase (Foundation -> Balanced with hard negatives)
-- **Final Train Loss**: 1.8255
-- **Final Val Loss**: 1.9558
-## Version History
-| Version | Recall@1 | Architecture |
-|---------|----------|--------------|
-| V28 | 16.2% | SPLADEDocContextGated |
-| V26 | 30.4% | SPLADEDocXLMR + IDF |
-| V25 | 21.0% | SPLADEDocXLMR |

 ---
+language:
+  - ko
+license: apache-2.0
 tags:
   - neural-sparse
+  - splade
   - opensearch
   - korean
   - information-retrieval
 library_name: transformers
 pipeline_tag: feature-extraction
 ---
+# Korean Neural Sparse Encoder
+A SPLADE-max neural sparse encoder optimized for Korean text retrieval with OpenSearch.
 ## Model Description
+- **Architecture**: SPLADE-max (MLM head → log(1+ReLU) → max pooling)
+- **Base Model**: [skt/A.X-Encoder-base](https://huggingface.co/skt/A.X-Encoder-base) (ModernBERT)
+- **Vocabulary**: 50,000 tokens (48.4% Korean)
+- **Parameters**: 149M
+- **Hidden Size**: 768
+- **Layers**: 22
+- **Training**: InfoNCE + FLOPS regularization with quadratic lambda warmup
+- **Training Data**: 4.6M Korean triplets (query, positive, negative)
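The architecture line above (MLM head → log(1+ReLU) → max pooling) can be illustrated with a minimal pure-Python sketch. It assumes the MLM logits are already available as a list of per-token rows; the function names `splade_max` and `to_sparse` and the toy values are hypothetical and are not taken from the model's code.

```python
import math


def splade_max(logits, attention_mask):
    """Pool per-token MLM logits into one vocabulary-sized weight vector.

    logits: list of token rows, each a list of vocab-sized floats.
    attention_mask: list of 0/1 flags, one per token (0 = padding).
    """
    vocab_size = len(logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, value in enumerate(row):
            # log(1 + ReLU(x)): negatives become 0, large logits are damped
            weight = math.log1p(max(value, 0.0))
            if weight > pooled[j]:
                pooled[j] = weight  # max pooling over the sequence
    return pooled


def to_sparse(pooled):
    """Keep only the non-zero entries as {vocab_id: weight}."""
    return {j: w for j, w in enumerate(pooled) if w > 0.0}


# Toy input: 3 tokens (the last one is padding), vocabulary of 5 entries
toy_logits = [
    [2.0, -1.0, 0.0, 0.5, -3.0],
    [0.5, -2.0, 0.0, 1.5, -0.1],
    [9.0, 9.0, 9.0, 9.0, 9.0],  # padding row, masked out
]
sparse_vec = to_sparse(splade_max(toy_logits, [1, 1, 0]))
print(sparse_vec)
```

Only vocabulary entries with a positive logit at some unmasked position survive, which is what makes the output directly usable as term weights in an inverted index.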
+## Benchmark Results
+Evaluated on Korean retrieval benchmarks using OpenSearch `neural_sparse` search:
+| Benchmark | Queries | BM25 R@1 | Neural Sparse R@1 | Semantic (BGE-M3) R@1 |
+|-----------|---------|----------|-------------------|----------------------|
+| Ko-StrategyQA | 592 | 53.7% | **62.2%** | 73.5% |
+| MIRACL-ko | 213 | 44.1% | **62.0%** | 70.9% |
+| Mr.TyDi-ko | 421 | 55.6% | **73.4%** | 84.1% |
+Neural sparse outperforms BM25 on every reported metric across all three benchmarks (R@1 gains of 8.5 to 17.9 points) while maintaining sparse, interpretable representations.
+### Detailed Metrics
+| Benchmark | Method | R@1 | R@5 | R@10 | MRR | NDCG@10 |
+|-----------|--------|-----|-----|------|-----|---------|
+| Ko-StrategyQA | BM25 | 53.7% | 75.3% | 81.9% | 0.626 | 0.673 |
+| Ko-StrategyQA | Neural Sparse | 62.2% | 80.6% | 83.6% | 0.700 | 0.734 |
+| Ko-StrategyQA | Semantic | 73.5% | 87.3% | 89.4% | 0.795 | 0.819 |
+| MIRACL-ko | BM25 | 44.1% | 80.8% | 90.6% | 0.589 | — |
+| MIRACL-ko | Neural Sparse | 62.0% | 89.7% | 93.4% | 0.733 | — |
+| MIRACL-ko | Semantic | 70.9% | 93.9% | 97.7% | 0.810 | — |
+| Mr.TyDi-ko | BM25 | 55.6% | 79.1% | 85.7% | 0.656 | — |
+| Mr.TyDi-ko | Neural Sparse | 73.4% | 92.4% | 94.8% | 0.816 | — |
+| Mr.TyDi-ko | Semantic | 84.1% | 95.7% | 96.9% | 0.894 | — |
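The README does not define how R@k and MRR are computed, so the following is only a sketch under one common convention: R@k is counted as a hit when at least one relevant document appears in the top k, and MRR averages the reciprocal rank of the first relevant document. The function names and the toy data are hypothetical.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document is in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, ks=(1, 5, 10)):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    n = len(runs)
    report = {
        f"R@{k}": sum(hit_at_k(ranked, rel, k) for ranked, rel in runs) / n
        for k in ks
    }
    report["MRR"] = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n
    return report


# Two toy queries: the first finds its relevant doc at rank 2, the second at rank 1
toy_runs = [
    (["d3", "d1", "d2"], {"d1"}),
    (["d5", "d4"], {"d5"}),
]
report = evaluate(toy_runs)
print(report)
```

If the benchmarks were instead scored with set-based recall (relevant documents found divided by all relevant documents), `hit_at_k` would need to be swapped for that definition.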
 ## Usage with OpenSearch
+### 1. Register the model
+```json
+POST /_plugins/_ml/models/_register
+{
+  "name": "korean-neural-sparse-encoder",
+  "version": "1.0.0",
+  "model_format": "TORCH_SCRIPT",
+  "model_config": {
+    "model_type": "bert",
+    "embedding_dimension": 1,
+    "framework_type": "huggingface_transformers"
+  },
+  "url": "https://huggingface.co/sewoong/korean-neural-sparse-encoder"
+}
+```
+### 2. Create a sparse index
+```json
+PUT /my-korean-index
+{
+  "settings": {
+    "index.knn": true
+  },
+  "mappings": {
+    "properties": {
+      "content": { "type": "text" },
+      "sparse_embedding": { "type": "rank_features" }
+    }
+  }
+}
+```
+### 3. Search with neural_sparse
+```json
+GET /my-korean-index/_search
+{
+  "query": {
+    "neural_sparse": {
+      "sparse_embedding": {
+        "query_text": "서울 여행 추천 장소",
+        "model_id": "<model_id>"
+      }
+    }
+  }
+}
+```
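The three requests above do not show how documents receive their `sparse_embedding` values. One common approach in OpenSearch is an ingest pipeline using the `sparse_encoding` processor; the sketch below only builds and prints such a request body. The pipeline name `korean-sparse-pipeline` is hypothetical, `<model_id>` is the same placeholder as in step 3, and the field names are taken from the index mapping in step 2.

```python
import json


def sparse_ingest_pipeline(model_id, source_field="content",
                           target_field="sparse_embedding"):
    """Build an ingest pipeline body that encodes text at index time."""
    return {
        "description": "Encode text into sparse token weights at index time",
        "processors": [
            {
                "sparse_encoding": {
                    "model_id": model_id,
                    # maps the raw text field to the rank_features field
                    "field_map": {source_field: target_field},
                }
            }
        ],
    }


body = sparse_ingest_pipeline("<model_id>")
# Hypothetical pipeline name; send this with any HTTP client or Dev Tools
print("PUT /_ingest/pipeline/korean-sparse-pipeline")
print(json.dumps(body, indent=2, ensure_ascii=False))
```

Documents indexed through such a pipeline (for example by setting it as the index's `default_pipeline`) would then carry token weights that the `neural_sparse` query in step 3 can match against.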
+## Sparsity Characteristics
+After training, the model produces ultra-sparse representations:
+- **Query tokens**: ~35 active (out of 50,000 vocabulary)
+- **Document tokens**: ~58 active
+- Activation sparsity > 99.9%
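As a quick arithmetic check, the sparsity implied by the approximate token counts above can be computed directly. This sketch takes the README's figures (about 35 and 58 active tokens, 50,000-entry vocabulary) at face value; the helper name `sparsity` is hypothetical.

```python
VOCAB_SIZE = 50_000  # vocabulary size stated in the model description


def sparsity(active_tokens, vocab_size=VOCAB_SIZE):
    """Fraction of vocabulary entries that are zero in a representation."""
    return 1.0 - active_tokens / vocab_size


query_sparsity = sparsity(35)
doc_sparsity = sparsity(58)
print(f"query: {query_sparsity:.3%}, document: {doc_sparsity:.3%}")
```

With these inputs the query figure is about 99.93% and the document figure about 99.88%, so the 99.9% bound holds for queries and is approximate for documents.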
 ## Training Details
+- **Hardware**: 8× NVIDIA B200 (183GB VRAM each)
+- **Effective Batch Size**: 2,048 (64 per GPU × 4 gradient accumulation × 8 GPUs)
+- **Epochs**: 25
+- **Learning Rate**: 5e-5 with cosine decay
+- **FLOPS Regularization**: λ_q=0.01, λ_d=0.003 with 20K-step quadratic warmup
+- **Mixed Precision**: BF16
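The FLOPS regularization line above can be sketched as follows. The README gives only the target weights (λ_q=0.01, λ_d=0.003) and the 20K-step quadratic warmup, so the exact schedule shape `lambda_max * (step / warmup_steps) ** 2` is an assumption, as are the function names; the FLOPS penalty follows the usual SPLADE formulation (sum over the vocabulary of the squared mean activation in the batch).

```python
def quadratic_warmup(step, lambda_max, warmup_steps=20_000):
    """Assumed schedule: grows quadratically from 0 to lambda_max, then stays flat."""
    progress = min(step / warmup_steps, 1.0)
    return lambda_max * progress ** 2


def flops_penalty(batch):
    """Sum over vocabulary entries of the squared mean absolute activation.

    batch: list of representations, each a list of vocab-sized floats.
    """
    n = len(batch)
    vocab_size = len(batch[0])
    return sum(
        (sum(abs(rep[j]) for rep in batch) / n) ** 2 for j in range(vocab_size)
    )


def total_loss(ranking_loss, query_reps, doc_reps, step,
               lambda_q=0.01, lambda_d=0.003):
    """Ranking loss (e.g. InfoNCE) plus warmed-up FLOPS terms for both sides."""
    return (
        ranking_loss
        + quadratic_warmup(step, lambda_q) * flops_penalty(query_reps)
        + quadratic_warmup(step, lambda_d) * flops_penalty(doc_reps)
    )


# Toy batch: two representations over a 2-entry vocabulary
toy_reps = [[1.0, 0.0], [3.0, 0.0]]
print(flops_penalty(toy_reps))               # mean activation 2.0 on entry 0
print(quadratic_warmup(10_000, 0.01))        # halfway through the warmup
print(total_loss(1.0, toy_reps, toy_reps, step=20_000))
```

Starting with a near-zero weight lets the ranking objective shape the representations first, and the sparsity pressure only reaches full strength once the warmup ends.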
+## License
+Apache 2.0

config.json CHANGED

@@ -1,27 +1,45 @@
 {
   "architectures": [
-    "XLMRobertaForMaskedLM"
   ],
-  "attention_probs_dropout_prob": 0.1,
   "bos_token_id": 0,
-  "classifier_dropout": null,
   "dtype": "float32",
-  "eos_token_id": 2,
-  "hidden_act": "gelu",
-  "hidden_dropout_prob": 0.1,
   "hidden_size": 768,
   "initializer_range": 0.02,
-  "intermediate_size": 3072,
   "layer_norm_eps": 1e-05,
-  "max_position_embeddings": 514,
-  "model_type": "xlm-roberta",
   "num_attention_heads": 12,
-  "num_hidden_layers": 12,
-  "output_past": true,
-  "pad_token_id": 1,
   "position_embedding_type": "absolute",
   "transformers_version": "4.57.6",
-  "type_vocab_size": 1,
-  "use_cache": true,
-  "vocab_size": 250002
 }

 {
   "architectures": [
+    "ModernBertForMaskedLM"
   ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
   "bos_token_id": 0,
+  "classifier_activation": "gelu",
+  "classifier_bias": false,
+  "classifier_dropout": 0.0,
+  "classifier_pooling": "mean",
+  "cls_token_id": 0,
+  "decoder_bias": true,
+  "deterministic_flash_attn": false,
   "dtype": "float32",
+  "embedding_dropout": 0.0,
+  "eos_token_id": 1,
+  "global_attn_every_n_layers": 3,
+  "global_rope_theta": 160000,
+  "gradient_checkpointing": false,
+  "hidden_activation": "gelu",
   "hidden_size": 768,
+  "initializer_cutoff_factor": 2.0,
   "initializer_range": 0.02,
+  "intermediate_size": 1152,
   "layer_norm_eps": 1e-05,
+  "local_attention": 128,
+  "local_rope_theta": 10000.0,
+  "max_position_embeddings": 16384,
+  "mlp_bias": false,
+  "mlp_dropout": 0.0,
+  "model_type": "modernbert",
+  "norm_bias": false,
+  "norm_eps": 1e-05,
   "num_attention_heads": 12,
+  "num_hidden_layers": 22,
+  "pad_token_id": 49999,
   "position_embedding_type": "absolute",
+  "repad_logits_with_grad": false,
+  "sep_token_id": 1,
+  "sparse_pred_ignore_index": -100,
+  "sparse_prediction": false,
   "transformers_version": "4.57.6",
+  "vocab_size": 50000
 }
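The new config combines `global_attn_every_n_layers: 3` with `local_attention: 128` across 22 layers. The sketch below lists which layers would use global attention, assuming the convention of the Hugging Face ModernBERT implementation in which a layer is global when its index is a multiple of `global_attn_every_n_layers` and uses a sliding window otherwise. The helper name is hypothetical.

```python
NUM_LAYERS = 22      # num_hidden_layers
GLOBAL_EVERY = 3     # global_attn_every_n_layers
LOCAL_WINDOW = 128   # local_attention (sliding-window size in tokens)


def attention_kind(layer_idx, every=GLOBAL_EVERY):
    """Assumed rule: layers at multiples of `every` attend globally."""
    return "global" if layer_idx % every == 0 else "local"


layout = [attention_kind(i) for i in range(NUM_LAYERS)]
global_layers = [i for i, kind in enumerate(layout) if kind == "global"]
local_layers = [i for i, kind in enumerate(layout) if kind == "local"]

print("global layers:", global_layers)
print("local layers:", len(local_layers), "with window", LOCAL_WINDOW)
```

Under this assumption most layers only attend within a 128-token window, which is part of what makes the 16,384-token `max_position_embeddings` practical.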

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b29121fc71db41f44f9c2f36c235caadbb4996961c63e980f309b43e16324166
-size 1113205088

 version https://git-lfs.github.com/spec/v1
+oid sha256:ab9f9a35738765142d146b112b3a90e73044e858c73a277c01c504b2e70eb3ff
+size 597503064
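The two LFS pointer sizes give a rough parameter count, since `config.json` declares `float32` weights (4 bytes each). This is only an estimate: a safetensors file also contains a small JSON header, so the true count is slightly lower. The helper name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def approx_params(file_size_bytes, bytes_per_param=BYTES_PER_FLOAT32):
    """Upper-bound parameter estimate from checkpoint size (ignores the header)."""
    return file_size_bytes // bytes_per_param


old_params = approx_params(1_113_205_088)  # previous checkpoint
new_params = approx_params(597_503_064)    # checkpoint after this change

print(f"old: ~{old_params / 1e6:.1f}M, new: ~{new_params / 1e6:.1f}M")
print(f"size ratio: {new_params / old_params:.2f}")
```

The new estimate of roughly 149.4M is consistent with the 149M figure in the updated README.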

special_tokens_map.json CHANGED

@@ -1,15 +1,51 @@
 {
-  "bos_token": "<s>",
-  "cls_token": "<s>",
-  "eos_token": "</s>",
   "mask_token": {
     "content": "<mask>",
-    "lstrip": true,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
-  "pad_token": "<pad>",
-  "sep_token": "</s>",
-  "unk_token": "<unk>"
 }

 {
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<cls>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<\\s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
   "mask_token": {
     "content": "<mask>",
+    "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "<sep>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
 }

tokenizer.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
-size 17082734

 version https://git-lfs.github.com/spec/v1
+oid sha256:ef9bebca9c6529bdefa19909059e07dfdfd7c2f8afeefbf4d230a784a3847d64
+size 1087185

tokenizer_config.json CHANGED

@@ -9,7 +9,7 @@
       "special": true
     },
     "1": {
-      "content": "<pad>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
@@ -17,7 +17,7 @@
       "special": true
     },
     "2": {
-      "content": "</s>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
@@ -25,16 +25,280 @@
       "special": true
     },
     "3": {
-      "content": "<unk>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "250001": {
       "content": "<mask>",
-      "lstrip": true,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
@@ -42,14 +306,17 @@
     }
   },
   "bos_token": "<s>",
-  "clean_up_tokenization_spaces": false,
-  "cls_token": "<s>",
-  "eos_token": "</s>",
   "extra_special_tokens": {},
   "mask_token": "<mask>",
-  "model_max_length": 512,
   "pad_token": "<pad>",
-  "sep_token": "</s>",
-  "tokenizer_class": "XLMRobertaTokenizer",
   "unk_token": "<unk>"
 }

       "special": true
     },
     "1": {
+      "content": "<\\s>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "special": true
     },
     "2": {
+      "content": "<unk>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "special": true
     },
     "3": {
+      "content": "<sep>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "4": {
       "content": "<mask>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<cls>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<unused0>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<unused1>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<unused2>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<unused3>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<unused4>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<unused5>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<unused6>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<unused7>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<unused8>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<unused9>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<unused10>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "17": {
+      "content": "<unused11>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "18": {
+      "content": "<unused12>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "19": {
+      "content": "<unused13>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "20": {
+      "content": "<unused14>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "21": {
+      "content": "<unused15>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "22": {
+      "content": "<unused16>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "23": {
+      "content": "<unused17>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "24": {
+      "content": "<unused18>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "25": {
+      "content": "<unused19>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "26": {
+      "content": "<unused20>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "27": {
+      "content": "<unused21>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "28": {
+      "content": "<unused22>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "29": {
+      "content": "<unused23>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "30": {
+      "content": "<unused24>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "31": {
+      "content": "<unused25>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "32": {
+      "content": "<unused26>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "33": {
+      "content": "<unused27>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "34": {
+      "content": "<unused28>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "35": {
+      "content": "<unused29>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "36": {
+      "content": "<unused30>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "49999": {
+      "content": "<pad>",
+      "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
     }
   },
   "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<cls>",
+  "do_lower_case": false,
+  "eos_token": "<\\s>",
   "extra_special_tokens": {},
   "mask_token": "<mask>",
+  "model_max_length": 16384,
   "pad_token": "<pad>",
+  "sep_token": "<sep>",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
   "unk_token": "<unk>"
 }