metadata
language:
  - ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - static-embedding
  - model2vec
  - korean
  - ko
  - matryoshka
datasets:
  - kakaobrain/kor_nli
  - mteb/KorSTS
  - klue/klue
  - Helsinki-NLP/opus-100
base_model: klue/roberta-base

kor-static-embedding-256

An ultra-lightweight Static Embedding model specialized for Korean: 34MB, 256 dimensions.

This variant was produced by training kekeappa/kor-static-embedding-512 with Matryoshka learning and truncating it to 256 dimensions. The same model family is available in four dimensions; choose the one that fits your use case:

Dimensions  Size   Use
64          9MB    Browser · mobile · edge
128         17MB   ⚡ Lightweight search and classification
256         34MB   ⚖️ Best value for size
512         68MB   🎯 Highest accuracy
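
The truncation that separates the family members can be sketched in a few lines. This is a minimal illustration, assuming the smaller variants correspond to the leading components of the 512-dimensional vectors; the helper name `truncate_and_normalize` and the toy vector are hypothetical and not taken from the released code.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


# Toy 4-dimensional "full" embedding, cut down to its first 2 components.
full = [3.0, 4.0, 12.0, 0.0]
small = truncate_and_normalize(full, 2)
print(small)  # [0.6, 0.8]
```

Re-normalizing after slicing matters because a prefix of a unit vector is generally shorter than unit length, so dot products would no longer equal cosine similarity. sentence-transformers also accepts a `truncate_dim` argument on `SentenceTransformer(...)` that performs the slicing step at encode time.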

Performance (KorSTS / KLUE-STS)

Benchmark      Pearson  Spearman
KorSTS-test    0.7738   0.7690
KorSTS-valid   -        0.8234
KLUE-STS-val   -        0.6838
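
For readers unfamiliar with the two metrics, the sketch below shows how Pearson and Spearman correlations are computed between gold similarity scores and predicted cosine similarities. It is a pure-Python illustration with made-up numbers; the function names and sample values are hypothetical and this is not the evaluation script behind the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up gold STS scores (0 to 5) and model cosine similarities.
gold = [0.0, 1.5, 3.0, 5.0]
pred = [0.10, 0.35, 0.30, 0.90]
rho = spearman(gold, pred)
r = pearson(gold, pred)
```

Spearman only looks at the ordering of the predictions, which is why it is the metric usually reported for STS: the model is not required to reproduce the gold scale, only to rank sentence pairs consistently with it.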

Usage

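from sentence_transformers import SentenceTransformer

# Load the 256-dimensional static embedding model from the Hugging Face Hub.
model = SentenceTransformer("kekeappa/kor-static-embedding-256")
# normalize_embeddings=True returns unit-length vectors, one row per sentence.
emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
print(emb.shape)  # (2, 256)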
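
Because the vectors are L2-normalized, a plain dot product equals cosine similarity, so ranking documents against a query needs no further scaling. The following sketch uses small hand-written unit vectors in place of real model output; `rank_by_similarity` is a hypothetical helper, not part of sentence-transformers.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query, docs):
    """Return document indices sorted from most to least similar."""
    scores = [dot(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Stand-in unit vectors; in practice these would be rows of `emb`.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
order = rank_by_similarity(query, docs)
print(order)  # [2, 1, 0]
```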

Features

  • Architecture: StaticEmbedding (model2vec family), with no transformer attention
  • Inference: optimized for CPU; no GPU required
  • Speed: single query < 1 ms (fast even in the browser)
  • Korean-English compatibility: trained cross-lingually, so Korean queries can retrieve English documents
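
The reason an attention-free model can answer in under a millisecond is that encoding reduces to a table lookup followed by pooling. The sketch below illustrates the idea under two stated assumptions: token vectors are combined by mean pooling, and tokens are already split. The tiny `TABLE`, its tokens, and the `embed` function are hypothetical and do not reflect the model's real vocabulary or tokenizer.

```python
# Hypothetical 2-dimensional lookup table; the real model stores one
# 256-dimensional row per vocabulary token.
TABLE = {
    "검색": [1.0, 0.0],
    "search": [0.8, 0.2],
    "문서": [0.0, 1.0],
}


def embed(tokens, table, dim):
    """Look up each known token and average the vectors (mean pooling)."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * dim  # no known tokens: fall back to a zero vector
    return [sum(col) / len(vecs) for col in zip(*vecs)]


sentence_vec = embed(["검색", "문서"], TABLE, dim=2)
print(sentence_vec)  # [0.5, 0.5]
```

No step depends on the other tokens in the sentence, so the cost grows linearly with the number of tokens and there is no matrix multiplication to offload to a GPU. The trade-off is that word order and context are not modeled.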

Training method

Training runs in four stages:

  1. Distillation initialization: vocabulary embeddings from the BM-K/KoSimCSE-roberta-multitask teacher → PCA + Zipf weighting
  2. KorNLI MNRL: 277K triplets from kakaobrain/kor_nli (multi_nli + snli)
  3. Cross-lingual MNRL: 200K ko-en parallel pairs from OPUS-100
  4. Matryoshka regression: KorSTS + KLUE-STS + English STS-B translated with NLLB
    • The 64/128/256/512 dimensions are optimized jointly (MatryoshkaLoss)
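
The joint optimization in stage 4 can be pictured as applying the same regression loss to several prefixes of each embedding and summing the results. The sketch below assumes the per-dimension loss is mean squared error between cosine similarity and the gold score; the source does not say which regression loss was used, so treat that choice, the function names, and the toy data as illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def matryoshka_regression_loss(pairs, gold, dims):
    """Sum, over each prefix length in `dims`, of the MSE between the
    cosine similarity of the truncated pair and the gold score."""
    total = 0.0
    for d in dims:
        sq_errors = [
            (cosine(u[:d], v[:d]) - g) ** 2 for (u, v), g in zip(pairs, gold)
        ]
        total += sum(sq_errors) / len(sq_errors)
    return total


# Toy 4-dim pair: identical in the first 2 dims, orthogonal over all 4.
pairs = [([1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, -1.0])]
gold = [1.0]  # gold similarity rescaled to [0, 1]
loss = matryoshka_regression_loss(pairs, gold, dims=[2, 4])
print(loss)  # 1.0
```

In the toy example the 2-dimensional prefix already matches the gold score, so the entire loss comes from the 4-dimensional view. Summing over prefixes is what pushes the most useful information into the leading components, which is what makes later truncation to 64, 128, or 256 dimensions viable.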

Training code: https://github.com/johunsang/kor-static-embedding-512
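
Stages 2 and 3 of the list above both rely on MNRL (MultipleNegativesRankingLoss in sentence-transformers), which treats every other positive in the batch as a negative for a given anchor. The following pure-Python sketch shows the core computation under simplifying assumptions: it ignores the explicit hard negatives that triplet data provides, and `mnrl_loss` with its toy vectors is a hypothetical stand-in rather than the linked training code. The scale of 20.0 mirrors the library default.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positive acts as an in-batch negative."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnrl_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = mnrl_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

With correctly paired vectors the loss is close to zero, and with the pairs swapped it is close to the scale value. For the cross-lingual stage, the anchor would be a Korean sentence and its positive the parallel English sentence, which is what pulls translations toward the same region of the embedding space.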

License

Apache 2.0