kekeappa committed on
Commit
cd83eb6
·
verified ·
1 Parent(s): 4f3a9b1

Initial: kor-static-embedding-64 (Matryoshka split, 9MB)

README.md ADDED
@@ -0,0 +1,75 @@
+ ---
+ language:
+ - ko
+ license: apache-2.0
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - static-embedding
+ - model2vec
+ - korean
+ - ko
+ - matryoshka
+ datasets:
+ - kakaobrain/kor_nli
+ - mteb/KorSTS
+ - klue/klue
+ - Helsinki-NLP/opus-100
+ base_model: klue/roberta-base
+ ---
+
+ # kor-static-embedding-64
+
+ An **ultra-lightweight Static Embedding** model specialized for Korean: **9MB**, **64 dimensions**.
+
+ A **variant truncated to 64 dimensions** from [kekeappa/kor-static-embedding-512](https://huggingface.co/kekeappa/kor-static-embedding-512), which was trained with Matryoshka learning. The model family comes in four dimensions; choose the one that fits your use case:
+
+ | Dimensions | Size | Use case |
+ |---:|---:|---|
+ | **[64](https://huggingface.co/kekeappa/kor-static-embedding-64)** | 9MB | 🌐 Browser · mobile · edge |
+ | **[128](https://huggingface.co/kekeappa/kor-static-embedding-128)** | 17MB | ⚡ Lightweight search and classification |
+ | **[256](https://huggingface.co/kekeappa/kor-static-embedding-256)** | 34MB | ⚖️ Best size-to-accuracy trade-off |
+ | **[512](https://huggingface.co/kekeappa/kor-static-embedding-512)** | 68MB | 🎯 Highest accuracy |
+
+ ## Performance (KorSTS / KLUE-STS)
+
+ | Benchmark | Pearson | **Spearman** |
+ |---|---:|---:|
+ | KorSTS-test | 0.7382 | **0.7337** |
+ | KorSTS-valid | n/a | **0.7885** |
+ | KLUE-STS-val | n/a | **0.6582** |
+
+ ## Usage
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("kekeappa/kor-static-embedding-64")
+ emb = model.encode(["한국어 문장", "임베딩 테스트"], normalize_embeddings=True)
+ print(emb.shape) # (2, 64)
+ ```
+
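+ ## Features
+
+ - **Architecture**: StaticEmbedding (model2vec family), with no transformer attention
+ - **Inference**: optimized for CPU; no GPU required
+ - **Speed**: under 1 ms per single query (fast even in the browser)
+ - **Korean-English compatibility**: trained cross-lingually, so Korean queries can retrieve English documents
+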
+ ## Training method
+
+ 4-stage training:
+ 1. **Distillation initialization**: vocab embeddings from the `BM-K/KoSimCSE-roberta-multitask` teacher → PCA + Zipf weighting
+ 2. **KorNLI MNRL**: `kakaobrain/kor_nli` (multi_nli + snli), 277K triplets
+ 3. **Cross-lingual MNRL**: OPUS-100 ko-en parallel data, 200K pairs
+ 4. **Matryoshka regression**: KorSTS + KLUE-STS + English STS-B translated with NLLB
+    - 64/128/256/512 dimensions optimized jointly (`MatryoshkaLoss`)
+
+ Training code: https://github.com/johunsang/kor-static-embedding-512
+
+ ## License
+
+ Apache 2.0
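
Review note: the README describes this model as the 512-dimensional model cut down to 64 dimensions. Below is a minimal sketch of that idea, assuming the truncation simply keeps the leading components and re-normalizes them. The commit does not include the export script, so this is an illustration rather than the author's procedure. The toy 8-dimensional vectors and the helper names `l2_normalize`, `truncate_embedding`, and `cosine` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then re-normalize (Matryoshka-style)."""
    if dim > len(vec):
        raise ValueError("target dimension exceeds the embedding size")
    return l2_normalize(vec[:dim])


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dim vectors standing in for 512-dim embeddings (made-up values).
full_a = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
full_b = [0.8, 0.2, 0.25, -0.1, -0.3, 0.4, 0.1, -0.05]

# Truncate 8 -> 4 here, analogous to 512 -> 64 in the README.
small_a = truncate_embedding(full_a, 4)
small_b = truncate_embedding(full_b, 4)

print(len(small_a), round(cosine(small_a, small_b), 4))
```

Re-normalizing after truncation matters because a prefix of a unit vector is generally shorter than unit length, and downstream cosine or dot-product scoring usually expects normalized inputs.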
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "pytorch": "2.12.0",
+     "sentence_transformers": "5.5.0",
+     "transformers": "5.8.1"
+   },
+   "default_prompt_name": null,
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "document": "",
+     "query": ""
+   },
+   "similarity_fn_name": "cosine"
+ }
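
Review note: both prompts in this config are empty strings and `default_prompt_name` is null, so selecting the `query` or `document` prompt should leave the input text unchanged. The sketch below checks that reading with the standard `json` module only. The `apply_prompt` helper is hypothetical and only mimics prefix-style prompting; it is not the sentence-transformers API.

```python
import json

# Config copied from the hunk above.
CONFIG_TEXT = """
{
  "__version__": {
    "pytorch": "2.12.0",
    "sentence_transformers": "5.5.0",
    "transformers": "5.8.1"
  },
  "default_prompt_name": null,
  "model_type": "SentenceTransformer",
  "prompts": {
    "document": "",
    "query": ""
  },
  "similarity_fn_name": "cosine"
}
"""

config = json.loads(CONFIG_TEXT)


def apply_prompt(text, prompt_name=None):
    """Hypothetical helper: prepend the named prompt, or the default one if set."""
    name = prompt_name if prompt_name is not None else config["default_prompt_name"]
    prefix = config["prompts"].get(name, "") if name is not None else ""
    return prefix + text


query_text = apply_prompt("한국어 문장", prompt_name="query")
document_text = apply_prompt("한국어 문장", prompt_name="document")

# With empty prompts, query and document inputs are identical strings.
print(query_text == document_text, config["similarity_fn_name"])
```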
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:610690daedd7fb81bd2446e6ebaba191f2f0519b74334f62aa649cd69ef13fc4
+ size 8192096
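
Review note: the LFS pointer size can be sanity-checked against the 64-dimension claim. The arithmetic below assumes the file holds a single float32 embedding table (4 bytes per value), which the commit does not state. A safetensors file also begins with an 8-byte header length followed by a JSON header, so a small remainder is expected.

```python
FILE_SIZE_BYTES = 8192096   # "size" field from the LFS pointer above
EMBEDDING_DIM = 64          # dimension stated in the README
BYTES_PER_FLOAT32 = 4       # assumption: weights are stored as float32

row_bytes = EMBEDDING_DIM * BYTES_PER_FLOAT32
rows, leftover = divmod(FILE_SIZE_BYTES, row_bytes)

# rows: number of 64-dim float32 vectors that fit in the file
# leftover: bytes not explained by the table (plausibly the safetensors header)
print(rows, leftover)  # 32000 96
```

Under these assumptions the file fits a table of 32,000 token vectors plus 96 bytes of header. That is consistent with a 32,000-entry vocabulary, although the vocabulary size is not given anywhere in this commit.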
modules.json ADDED
@@ -0,0 +1,8 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.sentence_transformer.modules.static_embedding.StaticEmbedding"
+   }
+ ]
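
Review note: `modules.json` lists a single `StaticEmbedding` module, with no transformer or separate pooling stage. The sketch below shows what such a pipeline plausibly reduces to, assuming the module looks up one vector per token id and averages them (mean pooling). The 3-token, 2-dimension table and the `static_embed` function are hypothetical; the real model uses its own tokenizer and a 64-dimension table.

```python
# Toy embedding table: row index = token id (made-up values).
TOY_TABLE = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def static_embed(token_ids, table):
    """Average the table rows for the given token ids into one sentence vector."""
    if not token_ids:
        raise ValueError("need at least one token id")
    dim = len(table[0])
    total = [0.0] * dim
    for token_id in token_ids:
        for d, value in enumerate(table[token_id]):
            total[d] += value
    return [value / len(token_ids) for value in total]


sentence_vector = static_embed([0, 2], TOY_TABLE)
print(sentence_vector)  # [1.0, 0.5]
```

Because each sentence costs only a few table lookups and an average, this kind of model runs comfortably on CPU, which matches the README's inference claims.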
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff