thkmon kekeappa committed on
Commit
04169cc
·
0 Parent(s):

Duplicate from kekeappa/kor-static-embedding-512

Co-authored-by: jo <kekeappa@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
0_StaticEmbedding/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6766657f4150b56c3f6eab07ca89a1cc833e334b9fe96b1c33ca798fae8d6b42
+ size 65536096
0_StaticEmbedding/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,318 @@
+ ---
+ language:
+ - ko
+ license: apache-2.0
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - static-embedding
+ - model2vec
+ - korean
+ - ko
+ - klue
+ - korsts
+ datasets:
+ - kakaobrain/kor_nli
+ - mteb/KorSTS
+ - klue/klue
+ base_model: klue/roberta-base
+ ---
+
+ # kor-static-embedding-512
+
+ A Korean-specialized **Static Embedding** model: an ultra-lightweight Korean sentence embedding that works with nothing but a token-embedding lookup plus averaging, with no transformer.
+
+ At **68MB**, it reaches **92% of BGE-M3's performance** (by average Spearman on Korean STS) with **158× faster** inference on CPU.
+
+ ## Model Overview
+
+ | Item | Value |
+ |---|---|
+ | Architecture | `sentence_transformers.models.StaticEmbedding` ([model2vec](https://github.com/MinishLab/model2vec) family) |
+ | Base tokenizer | `klue/roberta-base` (Korean vocab, 32K) |
+ | Embedding dimension | **512** |
+ | Parameter count | 16,384,000 |
+ | Model size | **68MB** |
+ | Training data | KorNLI (multi_nli + snli) + KorSTS + KLUE-STS |
+ | Inference environment | Best on CPU (no GPU needed) |
+ | Multilingual | Korean only |
+
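The parameter count in the table follows directly from the vocabulary size and the embedding dimension. The sketch below redoes that arithmetic; the 4-bytes-per-parameter figure is an assumption (float32 storage), not something the README states.

```python
# Figures taken from the overview table.
VOCAB_SIZE = 32_000    # klue/roberta-base vocabulary (32K)
EMBEDDING_DIM = 512    # embedding dimension

# Assumption: each weight is stored as a 4-byte float32 value.
BYTES_PER_PARAM = 4

num_params = VOCAB_SIZE * EMBEDDING_DIM
weight_bytes = num_params * BYTES_PER_PARAM

print(f"parameters:   {num_params:,}")
print(f"weight bytes: {weight_bytes:,}")
print(f"weights (MB): {weight_bytes / 1e6:.1f}")
```

Under that assumption the weights alone come to 65,536,000 bytes, which is consistent with the 65,536,096-byte `model.safetensors` pointer in this commit; the rest of the quoted 68MB presumably comes from `tokenizer.json` and the small config files.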
+ ## Installation and Usage
+
+ ### Step 1: Install
+
+ ```bash
+ # A virtual environment is recommended
+ python3 -m venv .venv
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
+
+ # Install the package (pulls in torch; a CPU-only build works too)
+ pip install sentence-transformers
+ ```
+
+ > Installing `sentence-transformers` is enough: dependencies such as `torch`, `transformers`, and `huggingface_hub` are pulled in automatically.
+ > To save disk space, use the CPU-only torch build: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
+
+ ### Step 2: Load the model
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("kekeappa/kor-static-embedding-512")
+ # The model is downloaded automatically on first run (~68MB)
+ # Cache location: ~/.cache/huggingface/hub/
+ ```
+
+ ### Step 3: Extract embeddings
+
+ ```python
+ sentences = [
+     "오늘 날씨가 정말 좋네요.",  # "The weather is really nice today."
+     "햇살이 따뜻하고 기분 좋은 하루입니다.",  # "A warm, sunny, pleasant day."
+     "비가 와서 우산을 챙겨야 합니다.",  # "It's raining, so take an umbrella."
+ ]
+ embeddings = model.encode(sentences, normalize_embeddings=True)
+ print(embeddings.shape) # (3, 512)
+
+ # Cosine similarity (the dot product of normalized vectors equals the cosine)
+ similarity_matrix = embeddings @ embeddings.T
+ print(similarity_matrix)
+ ```
+
+ ### Step 4: Usage examples
+
+ #### A. Semantic Search
+ ```python
+ import numpy as np
+
+ # Index the corpus (once)
+ corpus = [
+     "김치찌개 만드는 법",  # "How to make kimchi stew"
+     "딥러닝 입문 강의",  # "Intro to deep learning lecture"
+     "주말 등산 추천 코스",  # "Recommended weekend hiking routes"
+     "파이썬 데이터 분석",  # "Python data analysis"
+     "제주도 여행 일정",  # "Jeju Island travel itinerary"
+ ]
+ corpus_emb = model.encode(corpus, normalize_embeddings=True, batch_size=64)
+
+ # Query (can be called repeatedly)
+ def search(query, top_k=3):
+     q_emb = model.encode([query], normalize_embeddings=True)
+     scores = (q_emb @ corpus_emb.T)[0]
+     top_idx = np.argsort(-scores)[:top_k]
+     return [(corpus[i], float(scores[i])) for i in top_idx]
+
+ print(search("인공지능 학습"))  # "AI learning"
+ # → [('딥러닝 입문 강의', 0.41), ('파이썬 데이터 분석', 0.18), ...]
+ ```
+
+ #### B. Similarity between two sentences
+ ```python
+ # "Good morning" in native Korean vs. the loanword "gutmoning"
+ emb = model.encode(["좋은 아침입니다", "굿모닝이에요"], normalize_embeddings=True)
+ similarity = float((emb[0] * emb[1]).sum())
+ print(f"Similarity: {similarity:.4f}")
+ ```
+
+ #### C. Clustering (KMeans)
+ ```python
+ from sklearn.cluster import KMeans  # requires scikit-learn
+
+ # Three topics: cooking, programming, travel
+ sentences = [
+     "김치찌개 끓이는 법", "된장찌개 만들기", "비빔밥 레시피",
+     "파이썬 입문", "자바스크립트 기초", "리액트 사용법",
+     "제주도 여행", "부산 여행 코스", "경주 역사 탐방",
+ ]
+ emb = model.encode(sentences, normalize_embeddings=True)
+ labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(emb)
+ for i, s in enumerate(sentences):
+     print(f"[{labels[i]}] {s}")
+ ```
+
+ #### D. Vector DB integration (FAISS / Qdrant / Chroma)
+ ```python
+ # FAISS example (reuses `model` and `corpus` from example A)
+ import faiss
+ import numpy as np
+
+ embeddings = model.encode(corpus, normalize_embeddings=True).astype("float32")
+ index = faiss.IndexFlatIP(512) # Inner product (equals cosine, since vectors are normalized)
+ index.add(embeddings)
+
+ # Search
+ query_emb = model.encode(["인공지능"], normalize_embeddings=True).astype("float32")  # "artificial intelligence"
+ distances, indices = index.search(query_emb, k=3)
+ for idx, dist in zip(indices[0], distances[0]):
+     print(f" [{dist:.4f}] {corpus[idx]}")
+ ```
+
+ ### Key options
+
+ | Option | Description | Default | Recommended |
+ |---|---|---|---|
+ | `normalize_embeddings` | L2 normalization (for cosine similarity) | `False` | **`True`** |
+ | `batch_size` | Batch size (larger is faster on CPU) | 32 | **128-512** |
+ | `show_progress_bar` | tqdm progress bar | `None` (shown only at INFO/DEBUG log level) | `True` for bulk processing, `False` for API calls |
+ | `convert_to_numpy` | Convert the output to a numpy array | `True` | `True` in most cases |
+ | `device` | "cpu" / "cuda" / "mps" | Auto-detected | CPU is optimal (no GPU needed) |
+
+ ### Troubleshooting
+
+ | Problem | Cause / Fix |
+ |---|---|
+ | `ModuleNotFoundError: sentence_transformers` | `pip install sentence-transformers` |
+ | First load is very slow | The model is being downloaded (~68MB). Once cached, it loads in about 0.3 seconds |
+ | Scores on Korean sentences are too low | Check whether `normalize_embeddings=True` is missing |
+ | Out of memory | Reduce `batch_size` (e.g. 32 → 8) |
+ | Word order / negation is not distinguished | An inherent limitation of Static Embedding (see [Limitations](#limitations) below) |
+
+ ## Benchmarks (compared with BAAI/bge-m3)
+
+ ### Performance (Spearman correlation)
+
+ | Benchmark | N | **kor-static-embedding-512** | BAAI/bge-m3 | Ratio |
+ |---|---:|---:|---:|---:|
+ | KorSTS-test | 1,376 | **0.7758** | 0.8026 | **96.7%** |
+ | KorSTS-valid | 1,465 | **0.8248** | 0.8317 | **99.2%** |
+ | KLUE-STS-validation | 519 | **0.7119** | 0.8773 | 81.1% |
+ | **Average** | - | **0.7708** | 0.8372 | **92.1%** |
+
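The ratio column and the average row can be reproduced from the per-benchmark scores. The sketch below assumes the average is an unweighted mean over the three benchmarks (not weighted by N), which is what matches the numbers in the table.

```python
# Spearman scores copied from the table: (this model, BAAI/bge-m3).
scores = {
    "KorSTS-test": (0.7758, 0.8026),
    "KorSTS-valid": (0.8248, 0.8317),
    "KLUE-STS-validation": (0.7119, 0.8773),
}

def ratio_percent(ours: float, baseline: float) -> float:
    """Score of this model as a percentage of the baseline score."""
    return round(100.0 * ours / baseline, 1)

per_benchmark = {name: ratio_percent(o, b) for name, (o, b) in scores.items()}

# Unweighted (macro) averages across the three benchmarks.
avg_ours = sum(o for o, _ in scores.values()) / len(scores)
avg_baseline = sum(b for _, b in scores.values()) / len(scores)
avg_ratio = ratio_percent(avg_ours, avg_baseline)

print(per_benchmark)
print(round(avg_ours, 4), round(avg_baseline, 4), avg_ratio)
```

Note that the headline "92%" is the ratio of the two averages; the per-benchmark ratios range from 81.1% to 99.2%.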
+ ### Size and resources (as a percentage, BGE-M3 = 100%)
+
+ | Item | BGE-M3 | **kor-static-embedding-512** | Ratio |
+ |---|---:|---:|---:|
+ | Parameter count | 100% (567.8M) | **2.89%** (16.4M) | 97.1% saved |
+ | Disk size | 100% (2,168MB) | **3.14%** (68MB) | 96.9% saved |
+ | Embedding dimension | 100% (1024) | **50%** (512) | 50% smaller |
+
+ ### Speed details (CPU, Apple M2)
+
+ #### 1. Model load time (lower is better)
+
+ | Model | Load time | Ratio |
+ |---|---:|---:|
+ | BGE-M3 | 24,042ms (24.0 s) | 100% |
+ | **kor-static-embedding-512** | **310ms** | **1.29%** (78× faster) |
+
+ #### 2. Single-query latency (lower is better)
+
+ | Model | p50 | p95 | p99 | Ratio (p50) |
+ |---|---:|---:|---:|---:|
+ | BGE-M3 | 23.02ms | 24.30ms | 31.50ms | 100% |
+ | **kor-static-embedding-512** | **0.96ms** | 2.03ms | 2.37ms | **4.19%** (24× faster) |
+
+ #### 3. Batch throughput (higher is better)
+
+ | Batch | BGE-M3 | **kor-static-embedding-512** | Ratio |
+ |---:|---:|---:|---:|
+ | 1 | 42.5 sent/s | 1,132.9 sent/s | **2,662%** (26.6× faster) |
+ | 8 | 252.1 sent/s | 6,490.3 sent/s | **2,574%** (25.7× faster) |
+ | 32 | 346.3 sent/s | 20,095.5 sent/s | **5,803%** (58.0× faster) |
+ | 128 | 343.3 sent/s | 39,568.9 sent/s | **11,525%** (115× faster) |
+ | **512** | 324.6 sent/s | **92,468.3 sent/s** | **28,489%** (285× faster) |
+
+ → BGE-M3's throughput saturates at batch 32, while **kor-static-embedding-512 keeps scaling through batch 512**.
+
+ #### 4. Real-world scenario: large-scale indexing time
+
+ | Documents | BGE-M3 | **kor-static-embedding-512** | Ratio |
+ |---:|---:|---:|---:|
+ | 10,000 | 38.2 s | **0.3 s** | 0.82% |
+ | 100,000 | 6.4 min | **3.1 s** | 0.82% |
+ | 1,000,000 | 1.1 h | **31 s** | 0.82% |
+ | 10,000,000 | 10.6 h | **5.2 min** | 0.82% |
+ | 100,000,000 (estimated) | 4.4 days | **52 min** | 0.82% |
+
+ → **Indexing 1M documents: about 1 hour → about 30 seconds** (122× faster)
+
+ #### 5. Cost and resource savings summary
+
+ | Item | Savings |
+ |---|---:|
+ | CPU infrastructure cost (at equal throughput) | **~99% lower** |
+ | Memory usage | **~97% lower** |
+ | Response latency (as perceived by users) | **~96% shorter** |
+ | Cold start (serverless) | 24 s → 0.3 s (**99% shorter**) |
+
+ ## Training recipe
+
+ **Stage 1: KorNLI MultipleNegativesRankingLoss**
+ - Data: `kakaobrain/kor_nli` (multi_nli + snli)
+ - Entailment pairs serve as positives and contradiction pairs as hard negatives → **277,826 triplets**
+ - Loss: `MultipleNegativesRankingLoss`
+ - batch=2048, lr=2e-1, epoch=1
+ - Training time: about 25 seconds (A100 80GB PCIe)
+
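The Stage 1 bullet about positives and hard negatives can be sketched as follows. This is one plausible way to turn NLI rows into (anchor, positive, negative) triplets; the toy rows, the label strings, and the `build_triplets` helper are illustrative assumptions, and the README does not say exactly how the author paired entailment and contradiction hypotheses.

```python
from collections import defaultdict

# Toy NLI rows as (premise, hypothesis, label). These rows and label strings
# are assumptions for illustration, not the actual kor_nli schema.
rows = [
    ("A man is playing a guitar.", "A person is making music.", "entailment"),
    ("A man is playing a guitar.", "The man is asleep.", "contradiction"),
    ("A man is playing a guitar.", "The man is on a stage.", "neutral"),
    ("Two dogs run on the beach.", "Animals are outdoors.", "entailment"),
]

def build_triplets(rows):
    """Pair each entailment hypothesis with each contradiction of the same premise."""
    by_premise = defaultdict(lambda: {"entailment": [], "contradiction": []})
    for premise, hypothesis, label in rows:
        if label in ("entailment", "contradiction"):  # neutral rows are skipped
            by_premise[premise][label].append(hypothesis)

    triplets = []
    for premise, group in by_premise.items():
        for positive in group["entailment"]:
            for negative in group["contradiction"]:
                triplets.append((premise, positive, negative))
    return triplets

triplets = build_triplets(rows)
for anchor, positive, negative in triplets:
    print(anchor, "|", positive, "|", negative)
```

In this sketch a premise without both an entailment and a contradiction hypothesis yields no triplet, so the second premise is dropped.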
+ **Stage 2: STS regression fine-tune**
+ - Data: KorSTS-train (5,691) + KLUE-STS-train (11,668) = 17,359 pairs
+ - Loss: `CosineSimilarityLoss`
+ - batch=64, lr=2e-2, epoch=4
+ - Training time: about 18 seconds (A100 80GB PCIe)
+ - Best checkpoint: selected by KorSTS-valid Spearman
+
+ **Scores at the end of Stage 1** (for reference):
+ - KorSTS-test Spearman: 0.7519
+ - KorSTS-valid Spearman: 0.7983
+ - KLUE-STS-val Spearman: 0.5757
+
+ → Stage 2 (STS regression) lifts the KLUE score in particular, from 0.58 to 0.71.
+
+ ## Suitable uses
+
+ ✅ **Recommended**
+ - First-stage retrieval for large-scale RAG (quickly narrowing down millions of documents)
+ - Semantic search, FAQ matching, recommendation systems
+ - Clustering, deduplication, category classification
+ - On-device / mobile Korean embeddings
+ - 2-stage search: kor-static-512 (first stage) + BGE-M3 (second-stage reranking)
+
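The 2-stage search item can be sketched roughly as below. Random vectors stand in for both models' embeddings, and `two_stage_search`, `shortlist_size`, and the array names are hypothetical; in a real pipeline only the shortlisted documents would be encoded or looked up with the slower model.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in embeddings (random, purely illustrative): 512-d vectors play the
# role of the fast static model, 1024-d vectors the role of the slow reranker.
num_docs = 1000
fast_docs = l2_normalize(rng.standard_normal((num_docs, 512)).astype(np.float32))
slow_docs = l2_normalize(rng.standard_normal((num_docs, 1024)).astype(np.float32))

def two_stage_search(fast_query, slow_query, shortlist_size=50, top_k=5):
    # Stage 1: cheap cosine scores over the whole corpus, keep a shortlist.
    fast_scores = fast_docs @ fast_query
    shortlist = np.argsort(-fast_scores)[:shortlist_size]
    # Stage 2: rescore only the shortlist with the more expensive vectors.
    slow_scores = slow_docs[shortlist] @ slow_query
    order = np.argsort(-slow_scores)[:top_k]
    return shortlist[order], slow_scores[order]

# Use document 42's own vectors as the query, so it should rank first.
doc_ids, scores = two_stage_search(fast_docs[42], slow_docs[42])
print(doc_ids, scores)
```

The design point is that stage 1 touches every document but is cheap, while stage 2 is expensive but only sees `shortlist_size` candidates.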
+ ❌ **Not suitable**
+ - Tasks where word order or subtle contextual differences matter (the model has no word-order information)
+ - Multilingual search (Korean only)
+ - Cases that need the absolute best performance on news-domain data such as KLUE (BGE-M3 recommended)
+ - A single embedding for long documents of 8,000 tokens or more (mean pooling weakens as length grows)
+
+ ## Architecture
+
+ This model does not use transformer attention. Instead:
+
+ ```
+ Input: "오늘 날씨가 좋네요" ("The weather is nice today")
+     ↓
+ [1] klue/roberta-base tokenizer
+     → token ID sequence
+     ↓
+ [2] StaticEmbedding (32000 × 512 lookup table, 16.4M params)
+     → each token → 512-dimensional vector
+     ↓
+ [3] Mean pooling
+     → 512-dimensional sentence vector
+     ↓
+ [4] L2 normalization (when normalize_embeddings=True)
+ ```
+
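Steps [1] to [4] of the diagram can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the library's implementation: whitespace splitting stands in for the real tokenizer, and a random 6 × 8 table stands in for the trained 32000 × 512 table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): a 6-entry vocabulary and 8-dimensional vectors.
vocab = {"오늘": 0, "날씨가": 1, "좋네요": 2, "비가": 3, "와요": 4, "[UNK]": 5}
table = rng.standard_normal((len(vocab), 8)).astype(np.float32)

def encode(sentence: str, normalize: bool = True) -> np.ndarray:
    # [1] Tokenize: map each whitespace-separated token to an ID.
    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sentence.split()]
    # [2] Look up one vector per token in the static table.
    token_vectors = table[token_ids]
    # [3] Mean pooling: average the token vectors into one sentence vector.
    sentence_vector = token_vectors.mean(axis=0)
    # [4] Optional L2 normalization.
    if normalize:
        sentence_vector = sentence_vector / np.linalg.norm(sentence_vector)
    return sentence_vector

vec = encode("오늘 날씨가 좋네요")
print(vec.shape, float(np.linalg.norm(vec)))
```

Because there is no attention or any other per-sentence computation beyond a lookup and an average, the cost per sentence is essentially proportional to its token count.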
+ This model applies to Korean the paradigm validated in [Tom Aarsen's Static Embeddings blog post (HuggingFace)](https://huggingface.co/blog/static-embeddings) and [MinishLab's model2vec](https://github.com/MinishLab/model2vec).
+
+ ## Limitations
+
+ 1. **Ignores word order**: weak at distinguishing "철수가 영희를 좋아한다" (Cheolsu likes Yeonghui) ↔ "영희가 철수를 좋아한다" (Yeonghui likes Cheolsu)
+ 2. **Weak polysemy handling**: the "은행" (bank) in "은행 직원" (bank employee) and in "강변 은행" (riverside bank) gets the same vector
+ 3. **Performance gap on the KLUE domain**: on news-domain data the gap to BGE-M3 is large (0.71 vs 0.88)
+ 4. **Weak handling of negation/irony**: "좋아하지 않는다" (does not like) may be treated as similar to "좋아한다" (likes)
+
+ These limitations are inherent to all bag-of-words (BoW) style static embeddings. When accuracy is paramount, BGE-M3 is recommended.
+
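The word-order limitation is easy to see in a toy setting: if two sentences contain the same tokens in a different order, mean pooling gives them the same vector. The token splits below are hand-written assumptions (the real tokenizer may split differently), and the table is random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-written token splits (an assumption; the real tokenizer may differ).
sentence_a = ["철수", "가", "영희", "를", "좋아한다"]  # "Cheolsu likes Yeonghui"
sentence_b = ["영희", "가", "철수", "를", "좋아한다"]  # "Yeonghui likes Cheolsu"

# Toy lookup table with random 8-dimensional vectors, one row per token.
vocab = {tok: i for i, tok in enumerate(sorted(set(sentence_a + sentence_b)))}
table = rng.standard_normal((len(vocab), 8))

def mean_pool(tokens):
    """Average the table rows of the given tokens; order does not matter."""
    return table[[vocab[t] for t in tokens]].mean(axis=0)

vec_a = mean_pool(sentence_a)
vec_b = mean_pool(sentence_b)
cosine = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f"cosine similarity: {cosine:.6f}")
```

The two vectors agree up to floating-point rounding, so no downstream similarity measure can tell the sentences apart in this setting.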
+ ## Citation
+
+ If you use this model, please also cite the work it builds on:
+
+ - Static Embeddings: https://huggingface.co/blog/static-embeddings
+ - model2vec: https://github.com/MinishLab/model2vec
+ - KorSTS / KorNLI: KakaoBrain KorNLUDatasets
+ - KLUE: https://klue-benchmark.com
+
+ ## License
+
+ Apache 2.0
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.57.6",
+     "pytorch": "2.4.1+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
modules.json ADDED
@@ -0,0 +1,8 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "0_StaticEmbedding",
+     "type": "sentence_transformers.models.StaticEmbedding"
+   }
+ ]