tags:
  - sentence-transformers
  - sparse-encoder
  - sparse
  - splade
  - generated_from_trainer
  - dataset_size:921681
  - loss:SpladeLoss
  - loss:SparseMultipleNegativesRankingLoss
  - loss:FlopsLoss
base_model: yjoonjang/splade-ko-v1
widget:
  - text: Node.js 교과서
  - text: Xistory 자이스토리 영어 어법·어휘 완성 (2024년)
  - text: 개념+유형 기본 라이트 초등수학 5-2 (2025년)
  - text: 100발 100중 기출문제집 1학기 기말고사 중등수학 2 (2020년)
  - text: 2025 시대에듀 현직 교사 무료 강의가 있는 전기기능사 필기 한권합격
pipeline_tag: feature-extraction
library_name: sentence-transformers

SPLADE Sparse Encoder

This is a SPLADE Sparse Encoder model fine-tuned from yjoonjang/splade-ko-v1 using the sentence-transformers library. It maps sentences and paragraphs to a 50000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

Model Description

Model Type: SPLADE Sparse Encoder
Base model: yjoonjang/splade-ko-v1
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 50000 dimensions
Similarity Function: Dot Product

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Sparse Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sparse Encoders on Hugging Face

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
)
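The SpladePooling settings above ('max' pooling with a 'relu' activation) can be illustrated with a small pure-Python sketch. It assumes the usual SPLADE formulation, where each token's vocabulary logit is passed through log(1 + ReLU(x)) and the maximum over non-padding tokens is kept per vocabulary entry. The function name `splade_max_pool` and the toy logits are hypothetical and are not taken from the library.

```python
import math

def splade_max_pool(token_logits, attention_mask):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: list of per-token lists, each of length vocab_size
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, logit in enumerate(logits):
            # ReLU removes negative logits; log1p dampens large ones
            weight = math.log1p(max(logit, 0.0))
            # Max pooling keeps the strongest activation per vocabulary entry
            pooled[j] = max(pooled[j], weight)
    return pooled

# Toy input (made up): 3 tokens, the last one padding, over a 4-entry vocabulary
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]
vector = splade_max_pool(toy_logits, toy_mask)

# Only the non-zero entries need to be stored, which is what makes the vector sparse
active = {j: w for j, w in enumerate(vector) if w > 0}
print(active)
```

In the real model the vocabulary has 50000 entries, so most dimensions stay at zero and only a small set of token weights is active per text.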

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("Ja-ck/splade-ko-yes24-ft")
# Run inference
sentences = [
    '시대에듀전기기능사필기',
    '2025 시대에듀 현직 교사 무료 강의가 있는 전기기능사 필기 한권합격',
    '2022 전기기능장 필기',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 50000]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[22.4728, 16.7258,  9.4879],
#         [16.7258, 40.2059, 12.3283],
#         [ 9.4879, 12.3283, 23.8251]])
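Because the similarity function is the dot product, a score matrix like the one above can be reproduced conceptually from sparse vectors alone. The following is a minimal sketch with sparse vectors stored as `{dimension: weight}` dicts; the helper names and the toy weights are hypothetical and are not outputs of this model.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[d] for d, w in u.items() if d in v)

def similarity_matrix(vectors):
    """Pairwise dot products between all vectors."""
    return [[sparse_dot(a, b) for b in vectors] for a in vectors]

# Toy vectors (made up): only dimensions shared by two texts contribute to their score
toy = [
    {10: 2.0, 42: 1.0},
    {10: 1.5, 42: 0.5, 77: 3.0},
    {77: 1.0, 99: 2.0},
]
scores = similarity_matrix(toy)
for row in scores:
    print(row)
```

As in the printed tensor, the matrix is symmetric, and each diagonal entry is the squared norm of one vector, so texts with more or larger active weights score higher against themselves.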

Training Details

Training Dataset

Unnamed Dataset

Size: 921,681 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 3 tokens; mean: 10.48 tokens; max: 45 tokens	min: 3 tokens; mean: 12.56 tokens; max: 36 tokens	min: 3 tokens; mean: 12.18 tokens; max: 48 tokens

Samples:

anchor	positive	negative
`교결차이역`	`차이역`	`교토`
`교부들의신앙`	`교부들의 그리스도론`	`교부들의 가르침 I`
`(칙칙폭폭)기차여행기차`	`기차 여행`	`핑크퐁 베베핀 칙칙폭폭 사파리 기차놀이`

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 5e-05,
    "query_regularizer_weight": 0.0001
}
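The way these parameters combine can be sketched in plain Python. The sketch assumes the usual structure of this loss: an in-batch ranking loss (cross-entropy over dot-product scores, with every positive and negative in the batch acting as a candidate) plus FLOPS regularizers on queries and documents, each scaled by its weight. All function names and the toy vectors are hypothetical; this is not the library's implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def flops_penalty(batch):
    """Sum over dimensions of the squared mean activation across the batch."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dims))

def ranking_loss(queries, positives, negatives, scale=1.0):
    """Cross-entropy where query i must pick positive i among all candidates."""
    candidates = positives + negatives
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, c) for c in candidates]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)

def splade_loss(queries, positives, negatives,
                query_weight=0.0001, document_weight=5e-05):
    """Ranking loss plus weighted sparsity penalties (weights from the config above)."""
    documents = positives + negatives
    return (ranking_loss(queries, positives, negatives)
            + query_weight * flops_penalty(queries)
            + document_weight * flops_penalty(documents))

# Toy batch (made up): one query, one positive, one negative over 2 dimensions
queries = [[1.0, 0.0]]
positives = [[1.0, 0.0]]
negatives = [[0.0, 1.0]]
print(splade_loss(queries, positives, negatives))
```

The small regularizer weights mean the ranking term dominates, while the FLOPS terms gently push average activations toward zero so that the resulting vectors stay sparse.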

Evaluation Dataset

Unnamed Dataset

Size: 5,000 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 3 tokens; mean: 10.21 tokens; max: 45 tokens	min: 3 tokens; mean: 12.59 tokens; max: 39 tokens	min: 3 tokens; mean: 12.01 tokens; max: 43 tokens

Samples:

anchor	positive	negative
`형김유정`	`김유정 - 형`	`형 (김유정 단편 걸작선)`
`어떤마술의금서목록창약`	`창약 어떤 마술의 금서목록 3`	`창약 어떤 마술의 금서목록 05권`
`건축 심리`	`마음의 건축`	`건축을 철학한다`

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 5e-05,
    "query_regularizer_weight": 0.0001
}

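IR Evaluation Metrics Explained

Metric Overview

Metric	Meaning	Interpretation	What good looks like
MRR@K	Mean Reciprocal Rank	Average of the reciprocal rank of the first correct result	Higher is better (1.0 = always ranked first)
Recall@K	Recall	Fraction of the correct results that appear in the top K	Higher is better (1.0 = all correct results retrieved)
NDCG@K	Normalized Discounted Cumulative Gain	Ranking quality (the closer correct results are to the top, the higher the score)	Higher is better (1.0 = perfect ranking)
Hit Rate@K	Hit rate	Whether at least one correct result appears in the top K	Higher is better (1.0 = a correct result is always present)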

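What Each Metric Means

1. MRR (Mean Reciprocal Rank)

Example: if the correct result for the query "파이썬입문" ("Introduction to Python") is ranked 3rd
→ Reciprocal Rank = 1/3 ≈ 0.333

MRR@K = the mean of the reciprocal ranks over all queries, where a query with no correct result in the top K contributes 0

Business meaning: how far a user has to scroll to reach a correct result. The higher the MRR, the closer to the top the correct results appear.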
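The MRR@K definition above can be sketched in a few lines of Python. The function names and the toy document ids are hypothetical, and the sketch assumes the common convention that a query with no correct result in the top K contributes 0.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first correct result in the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, k):
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Toy runs (made up): one (ranking, correct ids) pair per query
runs = [
    (["d7", "d2", "d9"], {"d9"}),  # first correct result at rank 3 -> 1/3
    (["d1", "d4", "d5"], {"d1"}),  # first correct result at rank 1 -> 1
    (["d3", "d6", "d8"], {"d0"}),  # no correct result retrieved -> 0
]
mrr = mrr_at_k(runs, k=10)
print(round(mrr, 4))
```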

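2. Recall@K

Example: if a query has 2 correct results and only 1 of them appears in the top 10
→ Recall@10 = 1/2 = 0.5

Business meaning: how many of the correct results the search actually returns. Important for recommendation systems.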
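A minimal sketch of Recall@K for a single query follows. The function name and document ids are hypothetical; averaging the per-query value over all queries gives the reported figure.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the correct results that appear in the top k."""
    if not relevant_ids:
        return 0.0  # no correct results defined for this query
    found = len(set(ranked_ids[:k]) & set(relevant_ids))
    return found / len(relevant_ids)

# Toy query (made up): 2 correct results, only one of them is retrieved
ranking = ["d4", "d1", "d8", "d3"]
relevant = {"d1", "d5"}
recall = recall_at_k(ranking, relevant, k=10)
print(recall)
```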

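3. NDCG (Normalized Discounted Cumulative Gain)

Example: a correct result at rank 1 earns a high score, while one at rank 10 earns a low score
→ the higher the position, the larger the weight

Business meaning: ranking quality. A correct result at rank 1 is much better than one at rank 10.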
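The position weighting can be made concrete with a short sketch of NDCG@K. It assumes binary relevance (a result is either correct or not) and the common log2(rank + 1) discount; the function names and document ids are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [1.0 if d in relevant_ids else 0.0 for d in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)  # all correct results placed first
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy rankings (made up): the same correct result at rank 1 versus rank 10
fillers = [f"x{i}" for i in range(9)]
at_top = ndcg_at_k(["d1"] + fillers, {"d1"}, k=10)
at_bottom = ndcg_at_k(fillers + ["d1"], {"d1"}, k=10)
print(at_top, round(at_bottom, 3))
```

With a single correct result, the score drops from 1.0 at rank 1 to 1 / log2(11) at rank 10, which is the discounting the example describes.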

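4. Hit Rate@K

Example: 1 if at least one correct result is in the top 10, otherwise 0
→ a binary judgment (present or absent)

Business meaning: search success rate. The probability that a user finds the desired product on the first page.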
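The binary judgment above, averaged over queries, gives Hit Rate@K. A minimal sketch, with hypothetical function names and document ids:

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any correct result is in the top k, otherwise 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def hit_rate_at_k(runs, k):
    """Share of queries with at least one correct result in the top k."""
    return sum(hit_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Toy runs (made up): one (ranking, correct ids) pair per query
runs = [
    (["d2", "d1"], {"d1"}),        # hit at rank 2
    (["d3", "d4"], {"d9"}),        # miss
    (["d5", "d6"], {"d5", "d6"}),  # hit, extra correct results do not add more
    (["d7", "d8"], {"d8"}),        # hit at rank 2
]
hit_rate = hit_rate_at_k(runs, k=10)
print(hit_rate)
```

Unlike Recall@K, a query counts fully as soon as one correct result is found, so Hit Rate@K is never lower than Recall@K on the same rankings.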

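Recommended Metrics in Practice

Scenario	Recommended metrics	Reason
Search engine	MRR@10, NDCG@10	Placement near the top matters
Recommendation system	Recall@20, Hit Rate@20	Surfacing a variety of relevant products matters
RAG/QA system	Recall@5, MRR@5	Retrieving the exact document matters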

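Interpreting the Results

Performance of the current fine-tuned model:

MRR@10 = 0.69: the first correct result typically sits between rank 1 and rank 2 (1/0.69 ≈ 1.45)
Recall@10 = 0.87: on average, 87% of the correct results appear in the top 10
Hit Rate@10 = 0.90: for 90% of queries, at least one correct result is in the top 10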

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

SparseMultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

FlopsLoss

@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}