Instructions to use kekeappa/kor-static-embedding-64 with libraries, inference providers, notebooks, and local apps.
- Libraries
  - sentence-transformers
    How to use kekeappa/kor-static-embedding-64 with sentence-transformers:

        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("kekeappa/kor-static-embedding-64")

        sentences = [
            "The weather is lovely today.",
            "It's so sunny outside!",
            "He drove to the stadium."
        ]
        embeddings = model.encode(sentences)

        similarities = model.similarity(embeddings, embeddings)
        print(similarities.shape)
        # [3, 3]

  - Model2Vec
    How to use kekeappa/kor-static-embedding-64 with Model2Vec:

        from model2vec import StaticModel

        model = StaticModel.from_pretrained("kekeappa/kor-static-embedding-64")

- Notebooks
  - Google Colab
  - Kaggle
Initial: kor-static-embedding-64 (Matryoshka split, 9MB)
- README.md +75 -0
- config_sentence_transformers.json +14 -0
- model.safetensors +3 -0
- modules.json +8 -0
- tokenizer.json +0 -0
README.md
ADDED
@@ -0,0 +1,75 @@
---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- static-embedding
- model2vec
- korean
- ko
- matryoshka
datasets:
- kakaobrain/kor_nli
- mteb/KorSTS
- klue/klue
- Helsinki-NLP/opus-100
base_model: klue/roberta-base
---

# kor-static-embedding-64

An **ultra-lightweight static embedding** model specialized for Korean: **9MB**, **64 dimensions**.

This is a variant of [kekeappa/kor-static-embedding-512](https://huggingface.co/kekeappa/kor-static-embedding-512), built with Matryoshka training and **truncated to 64 dimensions**. The model family comes in four dimensions; choose the one that fits your use case:

| Dimensions | Size | Use case |
|---:|---:|---|
| **[64](https://huggingface.co/kekeappa/kor-static-embedding-64)** | 9MB | 🌐 Browser · mobile · edge |
| **[128](https://huggingface.co/kekeappa/kor-static-embedding-128)** | 17MB | ⚡ Lightweight search and classification |
| **[256](https://huggingface.co/kekeappa/kor-static-embedding-256)** | 34MB | ⚖️ Best value for the size |
| **[512](https://huggingface.co/kekeappa/kor-static-embedding-512)** | 68MB | 🎯 Highest accuracy |

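
The truncation described above can be pictured as keeping a prefix of the larger vector and rescaling it to unit length. The sketch below is illustrative only: `truncate_and_normalize` is a hypothetical helper, the input is a made-up 8-dimensional vector standing in for a 512-dimensional one, and whether the published 64-dimensional weights are exactly the first 64 columns of the 512-dimensional model is an assumption based on the description above.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components of `vec` and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]

# Toy stand-in for a full-size embedding (hypothetical values).
full = [3.0, 4.0, 1.0, 2.0, 0.5, 0.5, 0.1, 0.2]

small = truncate_and_normalize(full, 2)
print(small)  # [0.6, 0.8]
```

Re-normalizing after the cut matters because a prefix of a unit vector is generally shorter than unit length.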
## Performance (KorSTS / KLUE-STS)

| Benchmark | Pearson | **Spearman** |
|---|---:|---:|
| KorSTS-test | 0.7382 | **0.7337** |
| KorSTS-valid | - | **0.7885** |
| KLUE-STS-val | - | **0.6582** |

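
For readers unfamiliar with the two columns, the following is a minimal sketch of how Pearson and Spearman correlations are computed between gold similarity labels and predicted similarities. The scores are made-up toy values, not benchmark data, and `pearson`, `ranks`, and `spearman` are hypothetical helpers rather than the evaluation code used for the table.

```python
import math

def pearson(xs, ys):
    """Linear correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation is Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

gold = [0.0, 1.5, 3.0, 4.2, 5.0]       # toy human similarity labels
pred = [0.10, 0.35, 0.30, 0.80, 0.95]  # toy cosine similarities

print(round(pearson(gold, pred), 4))
print(round(spearman(gold, pred), 4))  # 0.9
```

Spearman only looks at the ordering of the pairs, which is why it is commonly the headline number for STS: a model can rank pairs well even if its raw similarity scale differs from the label scale.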
## Usage

```python
from sentence_transformers import SentenceTransformer

# Load the 64-dimensional static embedding model from the Hugging Face Hub.
model = SentenceTransformer("kekeappa/kor-static-embedding-64")
# Encode two Korean sentences ("Korean sentence", "embedding test") as unit-length vectors.
emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
print(emb.shape)  # (2, 64)
```

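
Because `normalize_embeddings=True` returns unit-length vectors, cosine similarity reduces to a plain dot product, which keeps retrieval cheap. The sketch below shows this with made-up 3-dimensional vectors standing in for real model output; `normalize`, `dot`, and `cosine` are hypothetical helpers and the document names are invented for illustration.

```python
import math

def normalize(v):
    """Rescale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed the long way, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (hypothetical values, not produced by the model).
query = [1.0, 2.0, 2.0]
docs = {"doc_a": [2.0, 4.0, 4.0], "doc_b": [2.0, -1.0, 0.0]}

q = normalize(query)
# On unit vectors, the dot product is the cosine similarity.
scores = {name: dot(q, normalize(vec)) for name, vec in docs.items()}
best = max(scores, key=scores.get)
print(best)  # doc_a
```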
## Features

- **Architecture**: StaticEmbedding (model2vec family), with no transformer attention
- **Inference**: optimized for CPU; no GPU required
- **Speed**: under 1 ms per single query (fast even in the browser)
- **Korean-English compatibility**: trained cross-lingually, so Korean queries can retrieve English documents

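
The "no attention" point above can be made concrete: a static embedding model looks up one fixed vector per token and pools them, with no layers in between. The sketch below is a toy under stated assumptions: the whitespace tokenizer, the 3-dimensional `TOY_TABLE`, and the zero vector for unknown tokens are all hypothetical, and mean pooling is assumed. The real model uses the subword tokenizer in `tokenizer.json` and a 64-dimensional table.

```python
# Hypothetical lookup table: one fixed vector per token.
TOY_TABLE = {
    "오늘": [1.0, 0.0, 2.0],  # "today"
    "날씨": [0.0, 2.0, 2.0],  # "weather"
    "좋다": [2.0, 1.0, 2.0],  # "is good"
}
UNKNOWN = [0.0, 0.0, 0.0]

def encode(sentence):
    """Embed a sentence as the mean of its token vectors (no attention, no context)."""
    tokens = sentence.split()  # stand-in for a real subword tokenizer
    vectors = [TOY_TABLE.get(t, UNKNOWN) for t in tokens]
    if not vectors:
        return list(UNKNOWN)
    dim = len(UNKNOWN)
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

print(encode("오늘 날씨 좋다"))  # [1.0, 1.0, 2.0]
```

Since the work per sentence is a handful of lookups and one average, latency is dominated by tokenization, which is consistent with the CPU and sub-millisecond claims above.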
## Training method

Four-stage training:
1. **Distillation initialization**: vocabulary embeddings from the `BM-K/KoSimCSE-roberta-multitask` teacher → PCA + Zipf weighting
2. **KorNLI MNRL**: `kakaobrain/kor_nli` (multi_nli + snli), 277K triplets
3. **Cross-lingual MNRL**: OPUS-100 ko-en parallel data, 200K pairs
4. **Matryoshka regression**: KorSTS + KLUE-STS + English STS-B translated with NLLB
   - The 64/128/256/512 dimensions are optimized jointly (`MatryoshkaLoss`)

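
The joint optimization in stage 4 can be sketched as evaluating one base loss on several prefix lengths of the same embeddings and summing the results, so that every prefix is pushed to work on its own. The example below is a pure-Python illustration, not the training code: the base loss (mean squared error between cosine similarity and the gold score) is an assumption because the README does not name it, the vectors are made-up 4-dimensional toys, and prefix lengths `(2, 4)` stand in for 64/128/256/512.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mse_cosine_loss(pairs, gold, dim):
    """Assumed base loss: squared error between cosine on the first `dim` dims and gold."""
    errors = [(cosine(a[:dim], b[:dim]) - g) ** 2 for (a, b), g in zip(pairs, gold)]
    return sum(errors) / len(errors)

def matryoshka_loss(pairs, gold, dims):
    """Sum the base loss over every prefix length."""
    return sum(mse_cosine_loss(pairs, gold, d) for d in dims)

# Toy sentence-pair embeddings (hypothetical values).
pairs = [
    ([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]),
    ([0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]),
]
gold = [1.0, 0.0]  # similarity labels already scaled to [0, 1]

print(round(matryoshka_loss(pairs, gold, dims=(2, 4)), 6))  # 0.125
```

In this toy the 2-dimensional prefix already matches the labels, so the whole loss comes from the 4-dimensional view. During training, gradients from every prefix flow into the same table.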
Training code: https://github.com/johunsang/kor-static-embedding-512

## License

Apache 2.0
config_sentence_transformers.json
ADDED
@@ -0,0 +1,14 @@
{
  "__version__": {
    "pytorch": "2.12.0",
    "sentence_transformers": "5.5.0",
    "transformers": "5.8.1"
  },
  "default_prompt_name": null,
  "model_type": "SentenceTransformer",
  "prompts": {
    "document": "",
    "query": ""
  },
  "similarity_fn_name": "cosine"
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:610690daedd7fb81bd2446e6ebaba191f2f0519b74334f62aa649cd69ef13fc4
size 8192096
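
The `size` field in the pointer lines up with a single 64-dimensional embedding table, which offers a quick sanity check on the "9MB" claim. The arithmetic below rests on two assumptions that this commit does not state: a 32,000-token vocabulary (the size used by `klue/roberta-base`) and float32 storage. Under those assumptions the few leftover bytes would be the safetensors header.

```python
FILE_SIZE = 8_192_096   # bytes, from the Git LFS pointer above
EMBED_DIM = 64          # from the model name and README
VOCAB_SIZE = 32_000     # assumption: vocabulary size of klue/roberta-base
BYTES_PER_FLOAT32 = 4   # assumption: weights stored as float32

tensor_bytes = VOCAB_SIZE * EMBED_DIM * BYTES_PER_FLOAT32
header_bytes = FILE_SIZE - tensor_bytes

print(tensor_bytes)  # 8192000
print(header_bytes)  # 96
```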
modules.json
ADDED
@@ -0,0 +1,8 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.sentence_transformer.modules.static_embedding.StaticEmbedding"
  }
]
tokenizer.json
ADDED
The diff for this file is too large to render.