Korean Neural Sparse Encoder

A SPLADE-based sparse encoder fine-tuned for Korean text, designed for neural sparse retrieval with OpenSearch.

Model Description

This model generates sparse vector representations for Korean text using the SPLADE (SParse Lexical AnD Expansion) approach. It is optimized for:

  • Legal domain terminology: Korean legal terms and concepts
  • Medical domain terminology: Korean medical and healthcare terms
  • General Korean text: Everyday Korean language with synonym expansion

The model uses log(1 + ReLU(MLM_logits)) activation to produce sparse representations suitable for inverted index-based retrieval systems like OpenSearch Neural Sparse Search.
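The effect of this activation can be illustrated on a toy logits vector: ReLU zeroes every negative logit (producing sparsity), and log1p compresses the surviving positive values.

```python
import torch

# Toy MLM logits for one token position over a 6-word vocabulary.
logits = torch.tensor([2.0, -1.0, 0.0, 5.0, -3.0, 1.0])

# SPLADE activation: negative logits are zeroed by ReLU, and log1p
# dampens large positive values, yielding a sparse non-negative vector.
sparse = torch.log1p(torch.relu(logits))
print(sparse)  # zeros where logits <= 0, log(1 + x) elsewhere
```

Only dimensions with a positive logit survive, which is what makes the output compatible with an inverted index.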

Training Details

  • Base Model: skt/A.X-Encoder-base
  • Training Method: Curriculum learning with contrastive loss
  • Parameters: 149,372,240
  • Vocabulary Size: 49,999 tokens
  • Max Sequence Length: 64 tokens

Training Results

Metric     Score
--------   ------
Recall@1   99.8%
MRR        0.9990

Usage

Basic Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder-v1")
model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder-v1")
model.eval()

def encode(text: str) -> torch.Tensor:
    """Encode text to a sparse representation over the vocabulary."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=64,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        # SPLADE activation: log(1 + ReLU(MLM_logits))
        token_scores = torch.log1p(torch.relu(logits))
        # Zero out padding positions, then max-pool over the sequence
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        sparse_repr = (token_scores * mask).max(dim=1).values[0]
    return sparse_repr

# Example: Encode a query
sparse = encode("diabetes treatment methods")
top_values, top_indices = sparse.topk(10)

for idx, val in zip(top_indices, top_values):
    if val > 0:
        print(f"{tokenizer.decode([idx])}: {val:.4f}")
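Retrieval with these representations comes down to a dot product: a document is relevant to a query exactly to the extent that the two sparse vectors activate the same tokens. A minimal sketch, using hypothetical low-dimensional vectors in place of the real vocab-sized (49,999-dimensional) output of `encode`:

```python
import torch

# Hypothetical sparse vectors over a tiny vocabulary; in practice these
# would come from encode() above.
query_sparse = torch.tensor([0.0, 1.2, 0.0, 0.8, 0.0])
doc_sparse   = torch.tensor([0.5, 0.9, 0.0, 1.1, 0.2])

# Relevance is the dot product: only dimensions where both vectors are
# non-zero (i.e. shared activated tokens) contribute to the score.
score = torch.dot(query_sparse, doc_sparse).item()
print(f"relevance: {score:.4f}")  # 1.2*0.9 + 0.8*1.1 = 1.96
```

This is the same scoring an inverted index performs, which is why only the non-zero entries ever need to be stored.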

Get Top Activated Tokens

def get_top_tokens(text: str, top_k: int = 20) -> list:
    """Get top-k activated tokens from text."""
    sparse = encode(text)
    top_values, top_indices = sparse.topk(top_k)

    results = []
    for idx, val in zip(top_indices.tolist(), top_values.tolist()):
        if val > 0:
            token = tokenizer.decode([idx]).strip()
            results.append((token, round(val, 4)))
    return results

# Example
tokens = get_top_tokens("real estate contract termination conditions")
for token, weight in tokens:
    print(f"{token}: {weight}")
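Before a document can be indexed in OpenSearch, its sparse tensor has to be serialized as a token-to-weight map for a `rank_features` field. A minimal sketch; the `id_to_token` map here is a hand-made stand-in for the tokenizer's vocabulary (in practice use `tokenizer.convert_ids_to_tokens`):

```python
import torch

def sparse_to_features(sparse: torch.Tensor, id_to_token: dict) -> dict:
    """Convert a sparse vector to a {token: weight} map suitable for a
    rank_features field. Only non-zero entries are kept."""
    nonzero_ids = sparse.nonzero(as_tuple=True)[0]
    return {id_to_token[i.item()]: round(sparse[i].item(), 4) for i in nonzero_ids}

# Toy example with a 4-token vocabulary.
vocab = {0: "당뇨", 1: "치료", 2: "부동산", 3: "계약"}
features = sparse_to_features(torch.tensor([1.1, 0.0, 0.0, 0.7]), vocab)
print(features)  # {'당뇨': 1.1, '계약': 0.7}
```

The resulting dict is what goes into the `content_sparse` field of the index defined below.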

OpenSearch Integration

This model is designed to work with OpenSearch Neural Sparse Search.

Register Model in OpenSearch

POST /_plugins/_ml/models/_register
{
  "name": "korean-neural-sparse-encoder",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "function_name": "SPARSE_ENCODING"
}

Create Neural Sparse Index

PUT /my-neural-sparse-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "content_sparse": {
        "type": "rank_features"
      }
    }
  }
}

Note that a rank_features field is served by the inverted index directly; no k-NN settings are required.
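At query time, the same token weights can be turned into query DSL. When encoding happens client-side (outside the cluster-managed `neural_sparse` query), one option is a bool query of `rank_feature` clauses, one per activated token. A hypothetical sketch; `build_sparse_query` and its defaults are illustrative, not part of any library:

```python
def build_sparse_query(features: dict, field: str = "content_sparse") -> dict:
    """Build a bool query of rank_feature clauses from a {token: weight}
    map produced by the encoder."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"rank_feature": {"field": f"{field}.{token}", "boost": weight}}
                    for token, weight in features.items()
                ]
            }
        }
    }

body = build_sparse_query({"계약": 0.7, "해지": 0.5})
print(body)
```

Each clause contributes in proportion to the query-side token weight, approximating the dot-product scoring described in the Usage section.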

Intended Use

  • Primary Use: Semantic search for Korean documents
  • Domains: Legal, medical, and general Korean text
  • Task: Document retrieval using sparse vector representations

Limitations

  • Optimized for Korean text; performance on other languages is not guaranteed
  • Maximum sequence length is 64 tokens
  • Best suited for short to medium-length queries and passages

Citation

@misc{korean-neural-sparse-encoder,
  author = {Sewoong Lee},
  title = {Korean Neural Sparse Encoder},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sewoong/korean-neural-sparse-encoder-v1}
}

License

Apache 2.0
