---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- static-embedding
- model2vec
- korean
- ko
- matryoshka
datasets:
- kakaobrain/kor_nli
- mteb/KorSTS
- klue/klue
- Helsinki-NLP/opus-100
base_model: klue/roberta-base
---
# kor-static-embedding-64
An **ultra-lightweight Static Embedding** model specialized for Korean: **9MB**, **64 dimensions**.
This is a variant of [kekeappa/kor-static-embedding-512](https://huggingface.co/kekeappa/kor-static-embedding-512), trained with Matryoshka learning and **truncated to 64 dimensions**. The same model family comes in four dimensions; choose the one that fits your use case:
| Dimensions | Size | Use case |
|---:|---:|---|
| **[64](https://huggingface.co/kekeappa/kor-static-embedding-64)** | 9MB | Browser · mobile · edge |
| **[128](https://huggingface.co/kekeappa/kor-static-embedding-128)** | 17MB | Lightweight search and classification |
| **[256](https://huggingface.co/kekeappa/kor-static-embedding-256)** | 34MB | Best balance of size and accuracy |
| **[512](https://huggingface.co/kekeappa/kor-static-embedding-512)** | 68MB | Highest accuracy |
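
Matryoshka training arranges an embedding so that its leading components remain useful on their own, which is what makes a 64-dimensional cut of the 512-dimensional model workable. The sketch below shows the truncation step in plain Python; `truncate_and_normalize` is a hypothetical helper and the 8-component vector is a stand-in for a real 512-dimensional embedding.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Stand-in for a full-size embedding (assumption: real vectors have 512 components).
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2, 0.3, 0.4]
small = truncate_and_normalize(full, 4)
print(len(small), small)
```

Renormalizing after the cut keeps dot products equal to cosine similarities. sentence-transformers also accepts a `truncate_dim` argument on `SentenceTransformer` for the same purpose.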
## Performance (KorSTS / KLUE-STS)
| Benchmark | Pearson | **Spearman** |
|---|---:|---:|
| KorSTS-test | 0.7382 | **0.7337** |
| KorSTS-valid | - | **0.7885** |
| KLUE-STS-val | - | **0.6582** |
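
STS benchmarks of this kind are usually scored by correlating the model's cosine similarity for each sentence pair with the human-annotated score. The following is a minimal dependency-free sketch of both coefficients; the `gold` and `pred` values are made-up illustration data, not numbers from this model.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

gold = [0.0, 1.5, 3.0, 4.2, 5.0]       # hypothetical human scores
pred = [0.10, 0.35, 0.30, 0.80, 0.95]  # hypothetical cosine similarities
print(round(spearman(gold, pred), 4))
```

Spearman only looks at the ordering of the pairs, so it is unaffected by the scale mismatch between gold scores (0 to 5) and cosine similarities.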
## Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("kekeappa/kor-static-embedding-64")
emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
print(emb.shape) # (2, 64)
```
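
With `normalize_embeddings=True` every vector has unit length, so a plain dot product equals cosine similarity. As a rough sketch of how the output could be used for retrieval, the hypothetical helper `rank_documents` below orders documents by that score; the 2-dimensional vectors are toy stand-ins for rows of `emb`.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy unit-length vectors (assumption: real vectors have 64 components).
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
order = rank_documents(query, docs)
print(order)  # [2, 1, 0]
```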
## Features
- **Architecture**: StaticEmbedding (model2vec family), no transformer attention
- **Inference**: optimized for CPU, no GPU required
- **Speed**: under 1 ms per single query (fast even in the browser)
- **Korean-English compatibility**: trained cross-lingually, so Korean queries can retrieve English documents
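
The reason no attention or GPU is needed is that a static embedding model reduces to a table lookup followed by pooling. The toy sketch below illustrates the idea under loud assumptions: `TOKEN_TABLE` is a hypothetical 3-dimensional table, and whitespace splitting stands in for the real subword tokenizer.

```python
import math

# Hypothetical token table; the real model stores one learned row per vocabulary token.
TOKEN_TABLE = {
    "korean": [1.0, 0.0, 0.0],
    "sentence": [0.0, 1.0, 0.0],
    "embedding": [0.0, 0.0, 1.0],
}

def embed(text, table=TOKEN_TABLE):
    """Look up each known token, mean-pool the rows, then L2-normalize."""
    dim = len(next(iter(table.values())))
    rows = [table[t] for t in text.lower().split() if t in table]
    if not rows:
        return [0.0] * dim  # no known tokens
    mean = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean

vec = embed("Korean sentence")
print(vec)
```

Each query costs one lookup per token plus an average, which is consistent with the sub-millisecond latency listed above.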
## Training method
4-stage training:
1. **Distillation initialization**: vocab embeddings from the `BM-K/KoSimCSE-roberta-multitask` teacher, followed by PCA + Zipf weighting
2. **KorNLI MNRL**: `kakaobrain/kor_nli` (multi_nli + snli), 277K triplets
3. **Cross-lingual MNRL**: OPUS-100 ko-en parallel data, 200K pairs
4. **Matryoshka regression**: KorSTS + KLUE-STS + English STS-B translated with NLLB
   - Joint optimization of the 64/128/256/512 dimensions (`MatryoshkaLoss`)
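
Two ideas from the stages above can be sketched in a few lines: MNRL (multiple negatives ranking loss), where every other positive in the batch acts as a negative, and the Matryoshka wrapper, which sums a base loss over truncated prefixes. This is an illustration only, not the author's training code. It uses pairs rather than triplets, the `scale=20.0` value mirrors the sentence-transformers default, and MNRL is used as the base loss purely for illustration, since the card does not state which base loss the regression stage wraps.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and all other positives in the batch serve as negatives."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims):
    """Sum the base loss over the first `d` components for each d in `dims`."""
    return sum(
        mnrl_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )

# Toy 4-dimensional batch (assumption: real training uses 64/128/256/512).
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
swapped = [aligned[1], aligned[0]]
print(matryoshka_loss(anchors, aligned, dims=(2, 4)))
print(matryoshka_loss(anchors, swapped, dims=(2, 4)))
```

The loss is close to zero when each anchor matches its own positive and large when the pairs are swapped, at every prefix length, which is what pushes the leading dimensions to carry most of the signal.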
Training code: https://github.com/johunsang/kor-static-embedding-512
## License
Apache 2.0