---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- static-embedding
- model2vec
- korean
- ko
- matryoshka
datasets:
- kakaobrain/kor_nli
- mteb/KorSTS
- klue/klue
- Helsinki-NLP/opus-100
base_model: klue/roberta-base
---
# kor-static-embedding-64
An **ultra-lightweight static embedding** model specialized for Korean: **9MB**, **64 dimensions**.
This is a **variant truncated to 64 dimensions** of [kekeappa/kor-static-embedding-512](https://huggingface.co/kekeappa/kor-static-embedding-512), which was trained with Matryoshka learning. The model family comes in four dimensions; choose the one that fits your use case:
| Dimensions | Size | Use case |
|---:|---:|---|
| **[64](https://huggingface.co/kekeappa/kor-static-embedding-64)** | 9MB | 🌐 Browser · mobile · edge |
| **[128](https://huggingface.co/kekeappa/kor-static-embedding-128)** | 17MB | ⚡ Lightweight search and classification |
| **[256](https://huggingface.co/kekeappa/kor-static-embedding-256)** | 34MB | ⚖️ Best value for size |
| **[512](https://huggingface.co/kekeappa/kor-static-embedding-512)** | 68MB | 🎯 Highest accuracy |
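To illustrate what "truncated to 64 dimensions" means for a Matryoshka-trained embedding, here is a minimal sketch in plain Python. It assumes that a smaller variant corresponds to the leading components of the larger vector, rescaled to unit length; the helper name `truncate_and_normalize` and the toy vector are hypothetical, and the published weights may have been produced differently.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # avoid dividing by zero for an all-zero prefix
    return [x / norm for x in head]

# Toy 8-dimensional vector standing in for a full 512-dimensional embedding.
full = [0.5, -0.25, 0.75, 0.1, 0.05, -0.3, 0.2, 0.4]

# Keep only the first 2 components, the way 64 would be kept out of 512.
small = truncate_and_normalize(full, 2)
print(len(small))
```

Rescaling after truncation keeps dot products comparable to cosine similarity, which matters when the shortened vectors are used for search.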
## Performance (KorSTS / KLUE-STS)
| Benchmark | Pearson | **Spearman** |
|---|---:|---:|
| KorSTS-test | 0.7382 | **0.7337** |
| KorSTS-valid | n/a | **0.7885** |
| KLUE-STS-val | n/a | **0.6582** |
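For readers unfamiliar with the two columns, the sketch below shows how Pearson and Spearman correlations between gold STS scores and predicted similarities can be computed with the standard library only. The scores in it are made-up toy values, not results from this model, and the evaluation script actually used for the table may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical gold scores (0 to 5) and predicted cosine similarities.
gold = [0.0, 1.0, 2.0, 3.0, 5.0]
predicted = [0.1, 0.2, 0.4, 0.8, 0.9]

print(pearson(gold, predicted), spearman(gold, predicted))
```

Spearman only looks at the ordering of the pairs, so it is the more common headline metric for STS: an embedding model can rank pairs correctly without its similarities being linear in the gold scores.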
## Usage
```python
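from sentence_transformers import SentenceTransformer

# Load the 64-dimensional static embedding model from the Hugging Face Hub.
model = SentenceTransformer("kekeappa/kor-static-embedding-64")

# Encode two Korean sentences ("Korean sentence", "embedding test") as unit-length vectors.
emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
print(emb.shape)  # (2, 64)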
```
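Because `normalize_embeddings=True` returns unit-length vectors, cosine similarity reduces to a dot product. The following sketch ranks documents against a query that way. It uses toy 2-dimensional vectors in place of real `model.encode` output so that it runs without downloading the model; `rank_documents` is a hypothetical helper name.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy unit vectors standing in for model.encode(..., normalize_embeddings=True).
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_documents(query, docs)
print(ranking)
```

With real embeddings, `query` would be one row of the array returned by `model.encode` and `docs` the remaining rows.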
## Features
- **Architecture**: StaticEmbedding (model2vec family), with no transformer attention
- **Inference**: optimized for CPU; no GPU required
- **Speed**: under 1 ms per single query (fast in the browser as well)
- **Korean-English compatibility**: trained cross-lingually, so Korean queries can retrieve English documents
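The reason a static embedding model needs no attention and runs quickly on CPU is that encoding amounts to a table lookup followed by pooling. The sketch below shows the idea with a made-up 3-token vocabulary and 2-dimensional vectors. It assumes mean pooling, which is the usual choice for this kind of model; the `embed` function and the table are illustrative and are not the library's implementation.

```python
def embed(token_ids, table):
    """Average the table rows for the given token ids (mean pooling)."""
    if not token_ids:
        raise ValueError("need at least one token id")
    dim = len(table[0])
    total = [0.0] * dim
    for token_id in token_ids:
        for j, value in enumerate(table[token_id]):
            total[j] += value
    return [value / len(token_ids) for value in total]

# Hypothetical embedding table: one fixed vector per vocabulary entry.
table = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
]

sentence = embed([0, 2], table)
print(sentence)  # [1.0, 0.5]
```

Each token contributes the same vector regardless of its neighbors, so the cost per sentence is linear in the number of tokens and word order does not affect the result.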
## Training method
Four-stage training:
1. **Distillation initialization**: vocabulary embeddings from the `BM-K/KoSimCSE-roberta-multitask` teacher → PCA + Zipf weighting
2. **KorNLI MNRL**: `kakaobrain/kor_nli` (multi_nli + snli), 277K triplets
3. **Cross-lingual MNRL**: OPUS-100 ko-en parallel data, 200K pairs
4. **Matryoshka regression**: KorSTS + KLUE-STS + English STS-B translated with NLLB
   - 64/128/256/512 dimensions optimized jointly (`MatryoshkaLoss`)
Training code: https://github.com/johunsang/kor-static-embedding-512
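As a rough illustration of the joint optimization in stage 4, the sketch below sums a regression loss over several prefix lengths of the same pair of embeddings. It is a plain-Python toy, not the `MatryoshkaLoss` class from sentence-transformers. The squared-error base loss, the function names, and the 4-dimensional vectors are assumptions made for the example, since the base loss used in training is not stated here.

```python
import math

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def matryoshka_regression_loss(emb_a, emb_b, gold, dims):
    """Sum of squared errors between the gold similarity and the cosine
    similarity computed on each leading prefix of length d in `dims`."""
    total = 0.0
    for d in dims:
        predicted = cosine(emb_a[:d], emb_b[:d])
        total += (predicted - gold) ** 2
    return total

# Toy 4-dimensional embeddings; real training would use dims (64, 128, 256, 512).
a = [1.0, 0.0, 1.0, 0.0]
b = [1.0, 0.0, 0.0, 1.0]

loss = matryoshka_regression_loss(a, b, gold=1.0, dims=(2, 4))
print(loss)
```

In this toy case the 2-dimensional prefixes already agree with the gold score, while the full vectors do not, so only the longer prefix contributes to the loss. Training against the sum pushes every prefix length to be useful on its own, which is what makes the truncated variants viable.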
## License
Apache 2.0