Korean Neural Sparse Encoder

A SPLADE-based sparse encoder fine-tuned for Korean text, designed for neural sparse retrieval with OpenSearch.

Model Description

This model generates sparse vector representations for Korean text using the SPLADE (SParse Lexical AnD Expansion) approach. It is optimized for:

  • Legal domain terminology: Korean legal terms and concepts
  • Medical domain terminology: Korean medical and healthcare terms
  • General Korean text: Everyday Korean language with synonym expansion

The model uses log(1 + ReLU(MLM_logits)) activation to produce sparse representations suitable for inverted index-based retrieval systems like OpenSearch Neural Sparse Search.
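The effect of this activation can be illustrated on a toy logits vector: ReLU zeroes every negative logit (producing sparsity), and log1p compresses the surviving positive values.

```python
import torch

# Toy MLM logits for one token position over a 6-word vocabulary.
logits = torch.tensor([2.0, -1.0, 0.0, 5.0, -3.0, 1.0])

# SPLADE activation: negative logits are zeroed by ReLU, and log1p
# dampens large positive values, yielding a sparse non-negative vector.
sparse = torch.log1p(torch.relu(logits))
print(sparse)  # zeros where logits <= 0, log(1 + x) elsewhere
```

Only dimensions with a positive logit survive, which is what makes the output compatible with an inverted index.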

Training Details

  • Base Model: skt/A.X-Encoder-base
  • Training Method: Curriculum learning with contrastive loss
  • Parameters: 149,372,240
  • Vocabulary Size: 49,999 tokens
  • Max Sequence Length: 64 tokens

Training Results

Metric     Score
--------   ------
Recall@1   99.8%
MRR        0.9990

Usage

Basic Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder-v1")
model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder-v1")
model.eval()

def encode(text: str) -> torch.Tensor:
    """Encode text to a sparse representation over the vocabulary."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=64,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        # SPLADE activation: log(1 + ReLU(MLM_logits))
        token_scores = torch.log1p(torch.relu(logits))
        # Zero out padding positions, then max-pool over the sequence
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        sparse_repr = (token_scores * mask).max(dim=1).values[0]
    return sparse_repr

# Example: Encode a query
sparse = encode("diabetes treatment methods")
top_values, top_indices = sparse.topk(10)

for idx, val in zip(top_indices, top_values):
    if val > 0:
        print(f"{tokenizer.decode([idx])}: {val:.4f}")
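Retrieval with these representations comes down to a dot product: a document is relevant to a query exactly to the extent that the two sparse vectors activate the same tokens. A minimal sketch, using hypothetical low-dimensional vectors in place of the real vocab-sized (49,999-dimensional) output of `encode`:

```python
import torch

# Hypothetical sparse vectors over a tiny vocabulary; in practice these
# would come from encode() above.
query_sparse = torch.tensor([0.0, 1.2, 0.0, 0.8, 0.0])
doc_sparse   = torch.tensor([0.5, 0.9, 0.0, 1.1, 0.2])

# Relevance is the dot product: only dimensions where both vectors are
# non-zero (i.e. shared activated tokens) contribute to the score.
score = torch.dot(query_sparse, doc_sparse).item()
print(f"relevance: {score:.4f}")  # 1.2*0.9 + 0.8*1.1 = 1.96
```

This is the same scoring an inverted index performs, which is why only the non-zero entries ever need to be stored.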

Get Top Activated Tokens

def get_top_tokens(text: str, top_k: int = 20) -> list:
    """Get top-k activated tokens from text."""
    sparse = encode(text)
    top_values, top_indices = sparse.topk(top_k)

    results = []
    for idx, val in zip(top_indices.tolist(), top_values.tolist()):
        if val > 0:
            token = tokenizer.decode([idx]).strip()
            results.append((token, round(val, 4)))
    return results

# Example
tokens = get_top_tokens("real estate contract termination conditions")
for token, weight in tokens:
    print(f"{token}: {weight}")
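Before a document can be indexed in OpenSearch, its sparse tensor has to be serialized as a token-to-weight map for a `rank_features` field. A minimal sketch; the `id_to_token` map here is a hand-made stand-in for the tokenizer's vocabulary (in practice use `tokenizer.convert_ids_to_tokens`):

```python
import torch

def sparse_to_features(sparse: torch.Tensor, id_to_token: dict) -> dict:
    """Convert a sparse vector to a {token: weight} map suitable for a
    rank_features field. Only non-zero entries are kept."""
    nonzero_ids = sparse.nonzero(as_tuple=True)[0]
    return {id_to_token[i.item()]: round(sparse[i].item(), 4) for i in nonzero_ids}

# Toy example with a 4-token vocabulary.
vocab = {0: "당뇨", 1: "치료", 2: "부동산", 3: "계약"}
features = sparse_to_features(torch.tensor([1.1, 0.0, 0.0, 0.7]), vocab)
print(features)  # {'당뇨': 1.1, '계약': 0.7}
```

The resulting dict is what goes into the `content_sparse` field of the index defined below.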

OpenSearch Integration

This model is designed to work with OpenSearch Neural Sparse Search.

Register Model in OpenSearch

POST /_plugins/_ml/models/_register
{
  "name": "korean-neural-sparse-encoder",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "function_name": "SPARSE_ENCODING"
}

Create Neural Sparse Index

PUT /my-neural-sparse-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "content_sparse": {
        "type": "rank_features"
      }
    }
  }
}

Note that a rank_features field is served by the inverted index directly; no k-NN settings are required.
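At query time, the same token weights can be turned into query DSL. When encoding happens client-side (outside the cluster-managed `neural_sparse` query), one option is a bool query of `rank_feature` clauses, one per activated token. A hypothetical sketch; `build_sparse_query` and its defaults are illustrative, not part of any library:

```python
def build_sparse_query(features: dict, field: str = "content_sparse") -> dict:
    """Build a bool query of rank_feature clauses from a {token: weight}
    map produced by the encoder."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"rank_feature": {"field": f"{field}.{token}", "boost": weight}}
                    for token, weight in features.items()
                ]
            }
        }
    }

body = build_sparse_query({"계약": 0.7, "해지": 0.5})
print(body)
```

Each clause contributes in proportion to the query-side token weight, approximating the dot-product scoring described in the Usage section.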

Intended Use

  • Primary Use: Semantic search for Korean documents
  • Domains: Legal, medical, and general Korean text
  • Task: Document retrieval using sparse vector representations

Limitations

  • Optimized for Korean text; performance on other languages is not guaranteed
  • Maximum sequence length is 64 tokens
  • Best suited for short to medium-length queries and passages

Citation

@misc{korean-neural-sparse-encoder,
  author = {Sewoong Lee},
  title = {Korean Neural Sparse Encoder},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sewoong/korean-neural-sparse-encoder-v1}
}

License

Apache 2.0
