metadata
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- static-embedding
- model2vec
- korean
- ko
- matryoshka
datasets:
- kakaobrain/kor_nli
- mteb/KorSTS
- klue/klue
- Helsinki-NLP/opus-100
base_model: klue/roberta-base
kor-static-embedding-64
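An ultra-lightweight static embedding model specialized for Korean: 9MB, 64 dimensions.
This variant was produced by training kekeappa/kor-static-embedding-512 with Matryoshka learning and truncating it to 64 dimensions. The same model family comes in four dimensions; choose the one that fits your use case: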
| Dimensions | Size | Use case |
|---|---|---|
| 64 | 9MB | Browser · mobile · edge |
| 128 | 17MB | Lightweight search and classification |
| 256 | 34MB | Best value (quality vs. size) |
| 512 | 68MB | Highest accuracy |
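The family relies on the Matryoshka property: the leading components of a larger embedding are trained to remain useful on their own. The following is a minimal sketch of the truncation step, assuming a smaller variant corresponds to a prefix of the larger vector followed by L2 re-normalization. The function name and the toy values are illustrative and do not come from the model's code.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]

# Toy 8-dimensional vector standing in for a larger embedding (hypothetical values).
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2, 0.3, 0.0]
small = truncate_and_normalize(full, 4)
print(small)  # [0.5, -0.5, 0.5, 0.5]
```

Re-normalizing after truncation keeps dot products interpretable as cosine similarities at the smaller size.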
Performance (KorSTS / KLUE-STS)
| Benchmark | Pearson | Spearman |
|---|---|---|
| KorSTS-test | 0.7382 | 0.7337 |
| KorSTS-valid | - | 0.7885 |
| KLUE-STS-val | - | 0.6582 |
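For readers unfamiliar with the two columns, here is a small pure-Python sketch of how Pearson and Spearman correlations are computed. It assumes the usual STS protocol, in which the model's similarity score for each sentence pair is compared with a human-annotated score; the card does not spell out the exact evaluation script, and the numbers below are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical gold labels (0-5 scale) and predicted similarities.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
pred = [0.10, 0.35, 0.30, 0.80, 0.95]
print(round(spearman(gold, pred), 4))  # 0.9
```

Spearman depends only on the ordering of the predictions, which is why it is commonly reported for embedding models whose raw similarity scale differs from the label scale.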
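Usage
from sentence_transformers import SentenceTransformer

# Load the 64-dimensional static embedding model from the Hugging Face Hub
model = SentenceTransformer("kekeappa/kor-static-embedding-64")
# Encode two Korean sentences into L2-normalized vectors
emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
print(emb.shape)  # (2, 64)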
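Because the usage snippet requests normalized embeddings, cosine similarity reduces to a plain dot product. The sketch below ranks documents against a query that way. It uses hypothetical 3-dimensional vectors in place of real `model.encode(...)` output so that it runs without downloading the model; the helper names are illustrative.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query, docs):
    """Return document indices sorted from most to least similar."""
    scores = [dot(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Hypothetical vectors standing in for normalized model output.
query = l2_normalize([1.0, 0.0, 0.0])
docs = [l2_normalize(v) for v in ([0.0, 1.0, 0.0], [3.0, 4.0, 0.0], [2.0, 0.0, 0.0])]
print(rank_by_similarity(query, docs))  # [2, 1, 0]
```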
Features
- Architecture: StaticEmbedding (model2vec family), with no transformer attention
- Inference: optimized for CPU; no GPU required
- Speed: under 1 ms per single query (fast even in the browser)
- Korean-English compatibility: trained cross-lingually, so Korean queries can retrieve English documents
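To make the architecture point concrete: a static embedding model is essentially a lookup table from tokens to vectors, with the sentence vector obtained by averaging. The toy sketch below assumes whitespace tokenization and a three-word vocabulary with made-up values. The real model uses a subword tokenizer and a much larger table, so this only illustrates the mechanism.

```python
# Toy lookup table: token -> 2-dimensional vector (hypothetical values).
TABLE = {
    "cat": [1.0, 0.0],
    "sat": [0.0, 1.0],
    "mat": [1.0, 1.0],
}

def encode(sentence, table=TABLE):
    """Embed a sentence as the mean of its known token vectors."""
    vectors = [table[tok] for tok in sentence.lower().split() if tok in table]
    dim = len(next(iter(table.values())))
    if not vectors:
        return [0.0] * dim  # no known tokens: fall back to the zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(encode("cat sat"))  # [0.5, 0.5]
print(encode("sat cat"))  # [0.5, 0.5], word order is ignored
```

With no attention layers, encoding costs one table lookup per token plus an average, which is consistent with the CPU-only, sub-millisecond claims above. The trade-off is that context and word order do not affect the result.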
Training method
Four-stage training:
- Distillation initialization: vocabulary embeddings from the BM-K/KoSimCSE-roberta-multitask teacher → PCA + Zipf weighting
- KorNLI MNRL (MultipleNegativesRankingLoss): kakaobrain/kor_nli (multi_nli + snli), 277K triplets
- Cross-lingual MNRL: OPUS-100 ko-en parallel corpus, 200K pairs
- Matryoshka regression: KorSTS + KLUE-STS + English STS-B translated with NLLB
  - Joint optimization of the 64/128/256/512 dimensions (MatryoshkaLoss)
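The MNRL objective used in stages two and three can be sketched in a few lines. This is a simplified pure-Python illustration, not the training code: it treats every other positive in the batch as a negative and applies cross-entropy over scaled cosine similarities. The scale of 20 mirrors the sentence-transformers default for `MultipleNegativesRankingLoss`; whether the training run kept that default is an assumption, and the explicit hard negatives from the KorNLI triplets are omitted here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

# Hypothetical 2-dimensional embeddings for a batch of two pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each anchor is closest to its own positive
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each anchor is closest to the wrong positive
print(mnrl_loss(anchors, aligned) < mnrl_loss(anchors, swapped))  # True
```

The same shape applies to the cross-lingual stage, where the anchor would be a Korean sentence and the positive its English translation.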
Training code: https://github.com/johunsang/kor-static-embedding-512
License
Apache 2.0