SPLADE Sparse Encoder

This is a SPLADE Sparse Encoder model finetuned from yjoonjang/splade-ko-v1 using the sentence-transformers library. It maps sentences & paragraphs to a 50000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

Model Description

Model Type: SPLADE Sparse Encoder
Base model: yjoonjang/splade-ko-v1
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 50000 dimensions
Similarity Function: Dot Product

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Sparse Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sparse Encoders on Hugging Face

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
)
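
The two modules above can be read as follows: the masked-language-model head produces one logit per vocabulary entry for every input token, and the pooling layer collapses those rows into a single vocabulary-sized vector. The sketch below assumes the usual SPLADE formulation, log(1 + relu(logit)) followed by a max over token positions, and ignores padding masks. `splade_pool` and the toy logits are illustrative names and values, not part of the library.

```python
import math

def splade_pool(token_logits):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: one row per input token, each row holding one logit
    per vocabulary entry (hypothetical toy input, not real model output).
    """
    vocab_size = len(token_logits[0])
    pooled = []
    for j in range(vocab_size):
        # relu, then log(1 + x), then max over token positions
        pooled.append(max(math.log1p(max(row[j], 0.0)) for row in token_logits))
    return pooled

# Two tokens, four vocabulary entries
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [-3.0, -0.5, 1.0, 0.0],
]
vector = splade_pool(toy_logits)
print([round(v, 4) for v in vector])
```

Vocabulary entries whose logits are never positive end up at exactly zero, which is where the sparsity of the output comes from.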

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("Ja-ck/splade-ko-yes24-ft")
# Run inference
sentences = [
    '시대에듀전기기능사필기',
    '2025 시대에듀 현직 교사 무료 강의가 있는 전기기능사 필기 한권합격',
    '2022 전기기능장 필기',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 50000]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[22.4728, 16.7258,  9.4879],
#         [16.7258, 40.2059, 12.3283],
#         [ 9.4879, 12.3283, 23.8251]])
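
The scores above are raw dot products rather than cosine similarities, which is why the diagonal is not 1 and differs from sentence to sentence. A minimal sketch of that behaviour, assuming sparse vectors are stored as `{dimension: weight}` dictionaries; `sparse_dot` and the toy vectors are hypothetical and independent of the library's own tensor format.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the vector with fewer active dimensions
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

# Hypothetical vectors: keys are vocabulary indices, values are weights
query = {101: 1.5, 2045: 0.8}
doc = {101: 2.0, 777: 0.3, 2045: 0.5}

print(sparse_dot(query, doc))    # only shared dimensions contribute
print(sparse_dot(query, query))  # self-similarity is the squared norm, not 1
```

Only dimensions active in both vectors contribute to the score, so the cost depends on the number of non-zero entries rather than on the 50000-dimensional vocabulary.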

Training Details

Training Dataset

Unnamed Dataset

Size: 921,681 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 3 tokens; mean: 10.48 tokens; max: 45 tokens	min: 3 tokens; mean: 12.56 tokens; max: 36 tokens	min: 3 tokens; mean: 12.18 tokens; max: 48 tokens

Samples:

anchor	positive	negative
`교결차이역`	`차이역`	`교토`
`교부들의신앙`	`교부들의 그리스도론`	`교부들의 가르침 I`
`(칙칙폭폭)기차여행기차`	`기차 여행`	`핑크퐁 베베핀 칙칙폭폭 사파리 기차놀이`

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 5e-05,
    "query_regularizer_weight": 0.0001
}
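
To make the parameters above concrete, here is a rough pure-Python sketch of how such a loss can be assembled: a cross-entropy ranking term over dot-product scores, plus FLOPS regularizers weighted separately for queries and documents. It assumes that anchors are treated as queries and that positives and negatives are both treated as documents; all function names and the toy vectors are illustrative, not the library's implementation.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def flops(batch):
    """FLOPS regularizer: sum over dimensions of the squared mean activation."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(abs(vec[j]) for vec in batch) / n) ** 2 for j in range(dims))

def ranking_loss(anchors, positives, negatives, scale=1.0):
    """Cross-entropy over dot-product scores with in-batch candidates."""
    candidates = positives + negatives  # the positive for anchor i is at index i
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * dot(anchor, c) for c in candidates]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[i]
    return total / len(anchors)

def splade_loss(anchors, positives, negatives,
                query_weight=1e-4, document_weight=5e-5):
    # Assumption: anchors are regularized as queries, the rest as documents
    return (ranking_loss(anchors, positives, negatives)
            + query_weight * flops(anchors)
            + document_weight * flops(positives + negatives))

# Toy 3-dimensional "sparse" vectors
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
positives = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
negatives = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
loss = splade_loss(anchors, positives, negatives)
print(loss)
```

The regularizer weights trade retrieval quality against sparsity: larger values push more dimensions toward zero.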

Evaluation Dataset

Unnamed Dataset

Size: 5,000 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 3 tokens; mean: 10.21 tokens; max: 45 tokens	min: 3 tokens; mean: 12.59 tokens; max: 39 tokens	min: 3 tokens; mean: 12.01 tokens; max: 43 tokens

Samples:

anchor	positive	negative
`형김유정`	`김유정 - 형`	`형 (김유정 단편 걸작선)`
`어떤마술의금서목록창약`	`창약 어떤 마술의 금서목록 3`	`창약 어떤 마술의 금서목록 05권`
`건축 심리`	`마음의 건축`	`건축을 철학한다`

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 5e-05,
    "query_regularizer_weight": 0.0001
}

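IR Evaluation Metrics Explained

Metric Overview

Metric	Meaning	Interpretation	What good looks like
MRR@K	Mean Reciprocal Rank	Average of the reciprocal of the rank at which the first correct result appears	Higher is better (1.0 = always ranked first)
Recall@K	Recall	Fraction of the correct results that appear in the top K	Higher is better (1.0 = all correct results retrieved)
NDCG@K	Normalized DCG	Ranking quality (the closer correct results are to the top, the higher the score)	Higher is better (1.0 = perfect ranking)
Hit Rate@K	Hit rate	Whether at least one correct result appears in the top K	Higher is better (1.0 = a correct result is always in the top K)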

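What Each Metric Means

1. MRR (Mean Reciprocal Rank)

Example: if the correct result for the query "파이썬입문" ("introduction to Python") is ranked 3rd
→ Reciprocal Rank = 1/3 ≈ 0.333

MRR@K = mean of the reciprocal ranks over all queries (a query whose first correct result falls outside the top K contributes 0)

Business meaning: how far a user has to scroll to find the correct result. The higher the MRR, the closer to the top the correct result is shown.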
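
The MRR@K definition can be sketched in a few lines. The function names, document ids, and toy rankings below are hypothetical and not taken from the model's evaluation code.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant result in the top k, or 0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, k):
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)

runs = [
    (["d7", "d2", "d9"], {"d9"}),  # first relevant result at rank 3
    (["d1", "d4", "d5"], {"d1"}),  # first relevant result at rank 1
    (["d3", "d6", "d8"], {"d0"}),  # no relevant result retrieved
]
print(round(mrr_at_k(runs, k=10), 4))
```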

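2. Recall@K

Example: if there are 2 correct results and only 1 of them is in the top 10
→ Recall@10 = 1/2 = 0.5

Business meaning: how many of the correct results the search returns. Important for recommendation systems.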
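
A minimal sketch of Recall@K for a single query, reproducing the example above. `recall_at_k` and the document ids are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Two relevant items, only one of them retrieved in the top 10
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11"]
relevant = {"d4", "d11"}
print(recall_at_k(ranked, relevant, k=10))
```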

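3. NDCG (Normalized Discounted Cumulative Gain)

Example: a correct result at rank 1 scores high; a correct result at rank 10 scores low
→ higher ranks carry more weight

Business meaning: the quality of the ranking. A correct result at rank 1 is far better than one at rank 10.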
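
The rank discount can be illustrated with a small sketch of NDCG@K. It assumes binary relevance and the common 1 / log2(rank + 1) discount; `ndcg_at_k` and the toy ranking are hypothetical.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance and a 1 / log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant items (up to k) placed at the top
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [f"d{i}" for i in range(1, 11)]
print(ndcg_at_k(ranked, {"d1"}, k=10))   # relevant item at rank 1
print(ndcg_at_k(ranked, {"d10"}, k=10))  # relevant item at rank 10
```

With a single relevant item, moving it from rank 1 to rank 10 lowers the score from 1.0 to 1 / log2(11).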

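4. Hit Rate@K

Example: 1 if the top 10 contains at least one correct result, otherwise 0
→ a binary judgment (present or absent)

Business meaning: the search success rate. The probability that a user finds the product they want on the first page.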
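
Averaging that binary judgment over queries gives Hit Rate@K. A short sketch, with a hypothetical `hit_rate_at_k` helper and toy rankings:

```python
def hit_rate_at_k(runs, k):
    """Share of queries with at least one relevant result in the top k."""
    hits = sum(
        1 for ranked_ids, relevant_ids in runs
        if any(doc_id in relevant_ids for doc_id in ranked_ids[:k])
    )
    return hits / len(runs)

runs = [
    (["d7", "d2", "d9"], {"d9"}),        # hit at rank 3
    (["d1", "d4", "d5"], {"d1", "d5"}),  # two hits still count once
    (["d3", "d6", "d8"], {"d0"}),        # miss
    (["d2", "d0", "d1"], {"d0"}),        # hit at rank 2
]
print(hit_rate_at_k(runs, k=10))
```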

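Recommended Metrics in Practice

Scenario	Recommended metrics	Reason
Search engine	MRR@10, NDCG@10	Placement near the top matters
Recommendation system	Recall@20, Hit Rate@20	Surfacing a variety of relevant products matters
RAG/QA system	Recall@5, MRR@5	Retrieving the right document matters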

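Interpreting the Results

Performance of the current fine-tuned model:

MRR@10 = 0.69: the first correct result typically sits between ranks 1 and 2 (1/0.69 ≈ 1.45 as a rough equivalent rank)
Recall@10 = 0.87: on average, 87% of the correct results appear in the top 10
Hit Rate@10 = 0.90: for 90% of queries, at least one correct result is in the top 10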

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

SparseMultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

FlopsLoss

@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}