Korean Neural Sparse Encoder (V33-ecom-v6)

SPLADEModernBERT sparse retrieval model for Korean, fine-tuned for e-commerce product search.

Model Details

Architecture: SPLADE-max (MLM → log(1+ReLU) → max pool)
Backbone: skt/A.X-Encoder-base (ModernBERT, 149M params)
Vocab: 50K tokens (48.4% Korean)
Sparsity: nz_q≈40, nz_d≈68 (ultra-sparse)
Training: V33 base (25 epochs, 4.84M triplets) + V6 ecom fine-tune (10 epochs)

E-Commerce Fine-Tuning (V6)

3K e-commerce triplets + 12K general triplets (20:80 ratio)
Single negative, no MarginMSE/KD
LR 2e-6, batch 64, 10 epochs
E-commerce data: Korean Wikipedia product articles with TF-IDF hard negatives

Benchmark Results

Benchmark	BM25	V33 Base	V33-ecom-v6	BGE-M3
Ko-StrategyQA (R@1)	53.7%	62.2%	63.0%	73.5%
MIRACL-ko (R@1)	44.1%	62.0%	62.0%	70.9%
Mr.TyDi-ko (R@1)	55.6%	73.4%	75.3%	84.1%
ecom-ko (R@1)	-	67.8%	78.6%	-

Key improvements over V33 base:

ecom-ko: +10.8pp (67.8% → 78.6%)
Mr.TyDi-ko: +1.9pp (73.4% → 75.3%)
Ko-StrategyQA: +0.8pp (62.2% → 63.0%)
No regression on MIRACL-ko

Usage with OpenSearch

// 1. Register model via ML Commons
POST /_plugins/_ml/models/_register
{
  "name": "korean-neural-sparse-encoder",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "sparse_encoding",
    "embedding_dimension": 1,
    "framework_type": "huggingface_transformers"
  },
  "url": "https://huggingface.co/sewoong/korean-neural-sparse-encoder"
}

// 2. Create index with neural sparse field
PUT /my-ecom-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "sparse_embedding": {
        "type": "rank_features"
      }
    }
  }
}

// 3. Neural sparse search query
GET /my-ecom-index/_search
{
  "query": {
    "neural_sparse": {
      "sparse_embedding": {
        "query_text": "삼성 갤럭시 무선 이어폰",
        "model_id": "<model_id>"
      }
    }
  }
}

Usage with Transformers

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder")
model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder")
model.eval()


def encode_sparse(text, max_length=256):
    """Encode text to sparse vector using SPLADE-max."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=max_length,
        truncation=True,
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # SPLADE-max: log(1 + ReLU(logits)) then max-pool over sequence
    sparse = torch.log1p(torch.relu(logits)).max(dim=1).values

    # Zero out special tokens
    special_ids = [tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id]
    for sid in special_ids:
        if sid is not None:
            sparse[0, sid] = 0

    # Extract non-zero token weights
    idx = sparse[0].nonzero().squeeze(-1)
    weights = sparse[0, idx]
    tokens = tokenizer.convert_ids_to_tokens(idx.tolist())
    return dict(zip(tokens, weights.tolist()))


# Example: encode a query
result = encode_sparse("삼성 갤럭시 무선 이어폰")
for token, weight in sorted(result.items(), key=lambda x: -x[1])[:10]:
    print(f"{token}: {weight:.3f}")

# Example: encode a document
doc_result = encode_sparse(
    "삼성 갤럭시 버즈3 프로 무선 이어폰. 24bit Hi-Fi 사운드와 "
    "인텔리전트 액티브 노이즈 캔슬링으로 몰입감 있는 청취 경험을 제공합니다.",
    max_length=256,
)
print(f"\nDocument non-zero tokens: {len(doc_result)}")

Training Details

V33 Base Training

The base model was trained from scratch on 4.84M Korean language triplets spanning 46 diverse datasets.

Parameter	Value
Base model	skt/A.X-Encoder-base (ModernBERT)
Parameters	149M
Training data	4.84M triplets (46 shards)
Loss	InfoNCE + FLOPS regularization
FLOPS lambda_q	0.01
FLOPS lambda_d	0.003
Batch size	2048 (effective)
Learning rate	5e-5
Epochs	25
Hardware	8x NVIDIA B200 (183GB each)
Precision	bf16
Final nz_q	35
Final nz_d	58

V6 E-Commerce Fine-Tuning

Fine-tuned from V33 checkpoint with domain-specific e-commerce data while preserving general retrieval quality.

Parameter	Value
Training data	3K ecom + 12K general (20:80 ratio)
E-commerce source	Korean Wikipedia product articles
Negative mining	TF-IDF char_wb ngrams (2,3)
Loss	InfoNCE + FLOPS (same as V33, no KD)
Learning rate	2e-6
Batch size	64 per GPU
Grad accumulation	4
Epochs	10
Total steps	77
Final nz_q	40
Final nz_d	68

Key lessons from e-commerce fine-tuning:

MarginMSE/KD is harmful for sparse fine-tuning (destroys learned representations)
Must match original training recipe exactly (single neg, same FLOPS lambda)
20:80 domain/general ratio prevents catastrophic forgetting
Category balance: 20 e-commerce categories (electronics, fashion, beauty, food, appliances, furniture, kitchen, sports, automotive, pets, baby, health supplements, stationery, books, music, household, crafts, instruments, camping, gardening)

Fine-Tuning Toolkit

A complete toolkit for fine-tuning this model with your domain-specific data is included in the finetune/ directory.

Quick Start

# Clone the fine-tuning toolkit
git clone https://huggingface.co/sewoong/korean-neural-sparse-encoder
cd korean-neural-sparse-encoder/finetune

# Install dependencies
pip install -r requirements.txt

# Prepare your data (CSV/TSV -> JSONL)
python scripts/prepare_data.py --input your_data.csv --output data/your_domain/

# Fine-tune
python scripts/finetune.py --config configs/finetune.yaml

# Test encoding
python scripts/encode.py --model sewoong/korean-neural-sparse-encoder --text "samsung galaxy wireless earbuds"

Toolkit Contents

File	Description
`finetune/scripts/finetune.py`	Self-contained fine-tuning script (InfoNCE + FLOPS)
`finetune/scripts/encode.py`	Sparse vector encoding and similarity
`finetune/scripts/prepare_data.py`	CSV/TSV to JSONL data converter
`finetune/configs/finetune.yaml`	Proven V33-ecom-v6 training recipe
`finetune/notebooks/finetune_tutorial.ipynb`	Step-by-step Colab tutorial
`finetune/data/sample/`	50 train + 10 val Korean e-commerce sample triplets

GitHub repository: mateon01/korean-neural-sparse-encoder

Citation

@misc{korean-neural-sparse-encoder,
  title={Korean Neural Sparse Encoder},
  author={Sewoong Kim},
  year={2026},
  url={https://huggingface.co/sewoong/korean-neural-sparse-encoder}
}

Downloads last month: 23

Safetensors

Model size

0.1B params

Tensor type

F32

Model tree for sewoong/korean-neural-sparse-encoder

Base model

skt/A.X-Encoder-base

Finetuned

(6)

this model