---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- static-embedding
- model2vec
- korean
- ko
- matryoshka
datasets:
- kakaobrain/kor_nli
- mteb/KorSTS
- klue/klue
- Helsinki-NLP/opus-100
base_model: klue/roberta-base
---

# kor-static-embedding-64

An **ultra-lightweight Static Embedding** model specialized for Korean: **9MB**, **64 dimensions**.

This is a variant of [kekeappa/kor-static-embedding-512](https://huggingface.co/kekeappa/kor-static-embedding-512), which was trained with Matryoshka learning and then **truncated to 64 dimensions**. The same model family is available in four dimensions; choose the one that fits your use case:

| Dimensions | Size | Use case |
|---:|---:|---|
| **[64](https://huggingface.co/kekeappa/kor-static-embedding-64)** | 9MB | Browser, mobile, edge |
| **[128](https://huggingface.co/kekeappa/kor-static-embedding-128)** | 17MB | Lightweight search and classification |
| **[256](https://huggingface.co/kekeappa/kor-static-embedding-256)** | 34MB | Best value for size |
| **[512](https://huggingface.co/kekeappa/kor-static-embedding-512)** | 68MB | Highest accuracy |

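The relationship between the sizes in this family can be illustrated with a minimal sketch of Matryoshka-style truncation: keep the leading components of a larger vector and rescale to unit length. The helper name `truncate_and_normalize` and the toy 8-dimensional vector are hypothetical, and whether the published 64-dimensional checkpoint was produced exactly this way is an assumption.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


# Hypothetical 8-dimensional vector standing in for a 512-dimensional embedding.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2, 0.05, 0.0]

# Truncating to 4 dimensions plays the role of going from 512 to 64.
small = truncate_and_normalize(full, 4)
print(len(small))  # 4
```

Rescaling after truncation matters when similarities are computed as dot products, because the truncated prefix of a unit vector is generally no longer unit length.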
## Performance (KorSTS / KLUE-STS)

| Benchmark | Pearson | **Spearman** |
|---|---:|---:|
| KorSTS-test | 0.7382 | **0.7337** |
| KorSTS-valid | - | **0.7885** |
| KLUE-STS-val | - | **0.6582** |

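For readers unfamiliar with the two columns, here is a minimal sketch of how Pearson and Spearman correlations are typically computed between predicted similarities and gold STS scores. The score lists are made-up illustration data, not results from this model, and the exact evaluation script behind the table is not shown in this card.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up gold scores (0 to 5) and predicted cosine similarities.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]

print(round(spearman(gold, predicted), 4))  # 0.9
```

Spearman only looks at the ordering of the pairs, which is why it is often the headline metric for STS: an embedding model is useful for retrieval if it ranks pairs correctly, even when its raw similarity values are not linearly related to the gold scores.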
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kekeappa/kor-static-embedding-64")

# Inputs mean "Korean sentence" and "embedding test".
emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
print(emb.shape)  # (2, 64)
```

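Because the usage example passes `normalize_embeddings=True`, the returned vectors have unit length, so a plain dot product gives the cosine similarity. The sketch below checks that identity on hypothetical 2-dimensional vectors using only the standard library; it does not load the model.

```python
import math


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Rescale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical vectors standing in for two sentence embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization, the dot product matches the cosine similarity.
similarity = dot(normalize(a), normalize(b))
print(round(similarity, 4))  # 0.96
```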
## Features

- **Architecture**: StaticEmbedding (model2vec family), with no transformer attention
- **Inference**: optimized for CPU; no GPU required
- **Speed**: under 1 ms per single query (fast even in the browser)
- **Korean-English compatibility**: trained cross-lingually, so Korean queries can retrieve English documents

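The absence of attention is why inference is so cheap: a static embedding model is essentially a lookup table followed by pooling. The sketch below shows the idea with a hypothetical three-token vocabulary, 3-dimensional vectors, and whitespace splitting in place of the real subword tokenizer. Mean pooling is assumed here as the pooling step.

```python
# Hypothetical toy token table; the real model stores one 64-dimensional
# vector per subword token.
TOKEN_TABLE = {
    "한국어": [1.0, 0.0, 0.0],
    "문장": [0.0, 1.0, 0.0],
    "임베딩": [0.0, 0.0, 1.0],
}
UNKNOWN = [0.0, 0.0, 0.0]


def embed(text):
    """Embed text by averaging the static vectors of its tokens."""
    tokens = text.split()  # stand-in for the real subword tokenizer
    vectors = [TOKEN_TABLE.get(token, UNKNOWN) for token in tokens]
    if not vectors:
        return list(UNKNOWN)
    dim = len(UNKNOWN)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


print(embed("한국어 문장"))  # [0.5, 0.5, 0.0]
```

Each token contributes the same vector regardless of context, so the cost per sentence is a handful of table lookups and one average, with no matrix multiplications over the sequence.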
## Training method

Four-stage training:

1. **Distillation initialization**: vocabulary embeddings from the `BM-K/KoSimCSE-roberta-multitask` teacher → PCA + Zipf weighting
2. **KorNLI MNRL**: `kakaobrain/kor_nli` (multi_nli + snli), 277K triplets
3. **Cross-lingual MNRL**: OPUS-100 ko-en parallel data, 200K pairs
4. **Matryoshka regression**: KorSTS + KLUE-STS + English STS-B translated with NLLB
   - Joint optimization of the 64/128/256/512 dimensions (`MatryoshkaLoss`)

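The two loss ideas named in the stages above can be sketched in plain Python. Taking MNRL to mean the multiple negatives ranking loss, each anchor is scored against every positive in the batch and the matching positive should win. A Matryoshka-style wrapper sums a base loss over truncated prefixes of the embeddings. The function names, the `scale` value of 20, and the toy vectors are assumptions for illustration, and the author's actual training code may differ. Stage 4 in particular uses a regression objective on STS scores that this card does not spell out.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch negatives: cross-entropy where positives[i] is the target for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, base_loss=mnrl_loss):
    """Sum the base loss over embeddings truncated to each size in `dims`."""
    return sum(
        base_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )


# Toy 4-dimensional batch; dims (2, 4) stand in for 64/128/256/512.
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
swapped = [aligned[1], aligned[0]]

print(mnrl_loss(anchors, aligned) < mnrl_loss(anchors, swapped))  # True
print(matryoshka_loss(anchors, aligned, dims=(2, 4)))
```

Summing over prefixes pushes the most useful information into the leading components, which is what makes the truncated 64-dimensional variant usable on its own.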
Training code: https://github.com/johunsang/kor-static-embedding-512

## License

Apache 2.0