---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- static-embedding
- model2vec
- korean
- ko
- klue
- korsts
datasets:
- kakaobrain/kor_nli
- mteb/KorSTS
- klue/klue
base_model: klue/roberta-base
---
# kor-static-embedding-512
A Korean-specific **Static Embedding** model: an ultra-lightweight Korean sentence embedding that runs on token-embedding lookup plus averaging alone, with no transformer.
At **68 MB**, it reaches **92% of BGE-M3's performance** (average Spearman on Korean STS) with **158× faster** inference on CPU.
## Model overview
| Item | Value |
|---|---|
| Architecture | `sentence_transformers.models.StaticEmbedding` ([model2vec](https://github.com/MinishLab/model2vec) family) |
| Base tokenizer | `klue/roberta-base` (Korean vocab, 32K) |
| Embedding dimension | **512** |
| Parameters | 16,384,000 |
| Model size | **68 MB** |
| Training data | KorNLI (multi_nli + snli) + KorSTS + KLUE-STS |
| Inference environment | Best on CPU (no GPU required) |
| Multilingual | Korean only |
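The parameter count in the table follows from the lookup-table shape (vocabulary size × embedding dimension). The short check below reproduces it and estimates the raw weight size. It assumes the weights are stored as float32, which this card does not state; under that assumption the table accounts for about 65.5 MB, and the rest of the 68 MB would be tokenizer and config files.

```python
# Back-of-the-envelope size check for a static embedding table.
VOCAB_SIZE = 32_000      # vocabulary size from the overview table
EMBED_DIM = 512          # embedding dimension from the overview table
BYTES_PER_PARAM = 4      # assumption: weights are stored as float32

n_params = VOCAB_SIZE * EMBED_DIM
weight_bytes = n_params * BYTES_PER_PARAM

print(f"parameters: {n_params:,}")                 # parameters: 16,384,000
print(f"weights:    {weight_bytes / 1e6:.1f} MB")  # weights:    65.5 MB
```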
## Installation and usage
### Step 1: Install
```bash
# A virtual environment is recommended
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install the package (pulls in torch; a CPU-only build is fine)
pip install sentence-transformers
```
> `sentence-transformers` is the only package you need to install; dependencies such as `torch`, `transformers`, and `huggingface_hub` are pulled in automatically.
> To save disk space, install the CPU-only torch build before installing `sentence-transformers`: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
### Step 2: Load the model
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kekeappa/kor-static-embedding-512")
# The model is downloaded automatically on first run (~68 MB)
# Default cache location: ~/.cache/huggingface/hub/
```
### Step 3: Extract embeddings
```python
sentences = [
    "오늘 날씨가 정말 좋네요.",              # "The weather is really nice today."
    "햇살이 따뜻하고 기분 좋은 하루입니다.",  # "The sun is warm and it is a pleasant day."
    "비가 와서 우산을 챙겨야 합니다.",        # "It is raining, so I need to take an umbrella."
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape) # (3, 512)

# Cosine similarity (dot product of normalized vectors = cosine)
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix)
```
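The comment in the snippet above relies on the fact that the dot product of L2-normalized vectors equals their cosine similarity. The sketch below checks this with toy 3-dimensional vectors standing in for real 512-dimensional embeddings. The helper names are illustrative and not part of `sentence-transformers`.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for model outputs.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

raw_dot = dot(a, b)                               # depends on vector lengths
unit_dot = dot(l2_normalize(a), l2_normalize(b))  # matches the cosine

print(raw_dot, round(unit_dot, 4), round(cosine(a, b), 4))
```

Without normalization the raw dot product (24.0 here) scales with vector length, which is why scores from `embeddings @ embeddings.T` are only comparable as cosines when `normalize_embeddings=True` is set.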
### Step 4: Usage examples
#### A. Semantic search
```python
import numpy as np

# Index the corpus (once). `model` is the SentenceTransformer loaded in step 2.
corpus = [
    "김치찌개 만드는 법",    # how to make kimchi stew
    "딥러닝 입문 강의",      # introductory deep learning course
    "주말 등산 추천 코스",   # recommended weekend hiking routes
    "파이썬 데이터 분석",    # data analysis with Python
    "제주도 여행 일정",      # Jeju Island travel itinerary
]
corpus_emb = model.encode(corpus, normalize_embeddings=True, batch_size=64)

# Query (can be called repeatedly)
def search(query, top_k=3):
    q_emb = model.encode([query], normalize_embeddings=True)
    scores = (q_emb @ corpus_emb.T)[0]
    top_idx = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in top_idx]

print(search("인공지능 학습"))  # query: "artificial intelligence learning"
# → [('딥러닝 입문 강의', 0.41), ('파이썬 데이터 분석', 0.18), ...]
```
#### B. Similarity of two sentences
```python
# "Good morning" (formal) vs. "Good morning" (casual loanword)
emb = model.encode(["좋은 아침입니다", "굿모닝이에요"], normalize_embeddings=True)
similarity = float((emb[0] * emb[1]).sum())
print(f"Similarity: {similarity:.4f}")
```
#### C. Clustering (KMeans)
```python
# Requires scikit-learn: pip install scikit-learn
from sklearn.cluster import KMeans

sentences = [
    # cooking: kimchi stew, soybean paste stew, bibimbap recipe
    "김치찌개 끓이는 법", "된장찌개 만들기", "비빔밥 레시피",
    # programming: Python intro, JavaScript basics, how to use React
    "파이썬 입문", "자바스크립트 기초", "리액트 사용법",
    # travel: Jeju trip, Busan travel route, Gyeongju history tour
    "제주도 여행", "부산 여행 코스", "경주 역사 탐방",
]
emb = model.encode(sentences, normalize_embeddings=True)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(emb)
for i, s in enumerate(sentences):
    print(f"[{labels[i]}] {s}")
```
#### D. Vector DB integration (FAISS / Qdrant / Chroma)
```python
# FAISS example (requires: pip install faiss-cpu)
# `model` and `corpus` are the objects defined in the examples above.
import faiss

embeddings = model.encode(corpus, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(512) # Inner product (= cosine, because vectors are normalized)
index.add(embeddings)

# Search; query: "artificial intelligence"
query_emb = model.encode(["인공지능"], normalize_embeddings=True).astype("float32")
distances, indices = index.search(query_emb, k=3)
for idx, dist in zip(indices[0], distances[0]):
    print(f" [{dist:.4f}] {corpus[idx]}")
```
### Key options
| Option | Description | Default | Recommended |
|---|---|---|---|
| `normalize_embeddings` | L2 normalization (for cosine similarity) | `False` | **`True`** |
| `batch_size` | Batch size (larger is faster on CPU) | 32 | **128~512** |
| `show_progress_bar` | tqdm progress bar | `None` (shown only at INFO/DEBUG log level) | `True` for bulk processing, `False` for API calls |
| `convert_to_numpy` | Convert output to a numpy array | `True` | `True` in most cases |
| `device` | "cpu" / "cuda" / "mps" | Auto-detected | CPU is best (no GPU required) |
### Troubleshooting
| Problem | Cause / fix |
|---|---|
| `ModuleNotFoundError: sentence_transformers` | `pip install sentence-transformers` |
| First load is very slow | The model is being downloaded (~68 MB). Once cached, it loads in about 0.3 s |
| Scores on Korean sentences are too low | Check whether `normalize_embeddings=True` is missing |
| Out of memory | Reduce `batch_size` (e.g. 32 → 8) |
| Word order / negation is not distinguished | Inherent limitation of static embeddings (see the Limitations section below) |
## Benchmarks (compared with BAAI/bge-m3)
### Quality (Spearman correlation)
| Benchmark | N | **kor-static-embedding-512** | BAAI/bge-m3 | Ratio |
|---|---:|---:|---:|---:|
| KorSTS-test | 1,376 | **0.7758** | 0.8026 | **96.7%** |
| KorSTS-valid | 1,465 | **0.8248** | 0.8317 | **99.2%** |
| KLUE-STS-validation | 519 | **0.7119** | 0.8773 | 81.1% |
| **Average** | - | **0.7708** | 0.8372 | **92.1%** |
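The ratio column and the headline 92% figure can be re-derived from the Spearman scores in the table. The snippet below does that arithmetic; it shows that 92.1% is the ratio of the two averages rather than the average of the three per-benchmark ratios.

```python
# Spearman scores copied from the table: (kor-static-embedding-512, BAAI/bge-m3).
scores = {
    "KorSTS-test": (0.7758, 0.8026),
    "KorSTS-valid": (0.8248, 0.8317),
    "KLUE-STS-validation": (0.7119, 0.8773),
}

# Per-benchmark ratio of this model's score to BGE-M3's score.
for name, (static, bge) in scores.items():
    print(f"{name}: {static / bge:.1%}")

# Unweighted mean over the three benchmarks, then the ratio of the means.
mean_static = sum(s for s, _ in scores.values()) / len(scores)
mean_bge = sum(b for _, b in scores.values()) / len(scores)
print(f"average: {mean_static:.4f} vs {mean_bge:.4f} -> {mean_static / mean_bge:.1%}")
# average: 0.7708 vs 0.8372 -> 92.1%
```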
### Size and resources (as a percentage, BGE-M3 = 100%)
| Item | BGE-M3 | **kor-static-embedding-512** | Difference |
|---|---:|---:|---:|
| Parameters | 100% (567.8M) | **2.89%** (16.4M) | 97.1% smaller |
| Disk size | 100% (2,168 MB) | **3.14%** (68 MB) | 96.9% smaller |
| Embedding dimension | 100% (1024) | **50%** (512) | 50% smaller |
### Speed details (CPU, Apple M2)
#### 1. Model load time (lower is better)
| Model | Load time | Ratio |
|---|---:|---:|
| BGE-M3 | 24,042 ms (24.0 s) | 100% |
| **kor-static-embedding-512** | **310 ms** | **1.29%** (78× faster) |
#### 2. Single-query latency (lower is better)
| Model | p50 | p95 | p99 | Ratio (p50) |
|---|---:|---:|---:|---:|
| BGE-M3 | 23.02 ms | 24.30 ms | 31.50 ms | 100% |
| **kor-static-embedding-512** | **0.96 ms** | 2.03 ms | 2.37 ms | **4.19%** (24× faster) |
#### 3. Batch throughput (higher is better)
| Batch size | BGE-M3 | **kor-static-embedding-512** | Ratio |
|---:|---:|---:|---:|
| 1 | 42.5 sent/s | 1,132.9 sent/s | **2,662%** (26.6× faster) |
| 8 | 252.1 sent/s | 6,490.3 sent/s | **2,574%** (25.7× faster) |
| 32 | 346.3 sent/s | 20,095.5 sent/s | **5,803%** (58.0× faster) |
| 128 | 343.3 sent/s | 39,568.9 sent/s | **11,525%** (115× faster) |
| **512** | 324.6 sent/s | **92,468.3 sent/s** | **28,489%** (285× faster) |

→ BGE-M3's throughput saturates around batch 32, while **kor-static-embedding-512 keeps scaling up to batch 512**. The gain is sublinear: going from batch 128 to 512 (4× the batch size) yields about 2.3× the throughput.
#### 4. Real-world scenario: large-scale indexing time
| Documents | BGE-M3 | **kor-static-embedding-512** | Ratio |
|---:|---:|---:|---:|
| 10,000 | 38.2 s | **0.3 s** | 0.82% |
| 100,000 | 6.4 min | **3.1 s** | 0.82% |
| 1 million | 1.1 h | **31 s** | 0.82% |
| 10 million | 10.6 h | **5.2 min** | 0.82% |
| 100 million (estimated) | 4.4 days | **52 min** | 0.82% |

→ **Indexing 1 million documents: about 1.1 hours → about 31 seconds** (122× faster)
#### 5. Cost and resource savings summary
| Item | Savings |
|---|---:|
| CPU infrastructure cost (at equal throughput) | **~99% lower** |
| Memory usage | **~97% lower** |
| Response latency (as perceived by the user) | **~96% shorter** |
| Cold start (serverless) | 24 s → 0.3 s (**99% shorter**) |
## Training recipe
**Stage 1: KorNLI MultipleNegativesRankingLoss**
- Data: `kakaobrain/kor_nli` (multi_nli + snli)
- Entailment pairs as positives, contradiction pairs as hard negatives → **277,826 triplets**
- Loss: `MultipleNegativesRankingLoss`
- batch=2048, lr=2e-1, epoch=1
- Training time: about 25 s (A100 80GB PCIe)

**Stage 2: STS regression fine-tune**
- Data: KorSTS-train (5,691) + KLUE-STS-train (11,668) = 17,359 pairs
- Loss: `CosineSimilarityLoss`
- batch=64, lr=2e-2, epoch=4
- Training time: about 18 s (A100 80GB PCIe)
- Best checkpoint: selected by KorSTS-valid Spearman

**Scores at the end of Stage 1** (for reference):
- KorSTS-test Spearman: 0.7519
- KorSTS-valid Spearman: 0.7983
- KLUE-STS-val Spearman: 0.5757

→ Stage 2 (STS regression) lifts the KLUE score in particular, from 0.58 to 0.71.
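Stage 1 turns NLI rows into (anchor, positive, hard negative) triplets. This card does not show the exact construction, so the sketch below is only one plausible way to do it: group hypotheses by premise, then pair each entailment with each contradiction. The field names, string labels, and toy English rows are illustrative assumptions; check the actual schema of `kakaobrain/kor_nli` before reusing it.

```python
from collections import defaultdict

# Toy NLI rows (hypothetical schema: premise / hypothesis / string label).
rows = [
    {"premise": "A man is cooking.", "hypothesis": "A person prepares food.", "label": "entailment"},
    {"premise": "A man is cooking.", "hypothesis": "A man is sleeping.", "label": "contradiction"},
    {"premise": "A man is cooking.", "hypothesis": "He is a chef.", "label": "neutral"},
    {"premise": "Kids play outside.", "hypothesis": "Children are outdoors.", "label": "entailment"},
]

def build_triplets(rows):
    """Return (anchor, positive, hard_negative) for premises that have both kinds."""
    by_premise = defaultdict(lambda: {"entailment": [], "contradiction": []})
    for row in rows:
        # Neutral rows are ignored; only entailment/contradiction are used.
        if row["label"] in ("entailment", "contradiction"):
            by_premise[row["premise"]][row["label"]].append(row["hypothesis"])

    triplets = []
    for premise, hyps in by_premise.items():
        for positive in hyps["entailment"]:
            for negative in hyps["contradiction"]:
                triplets.append((premise, positive, negative))
    return triplets

triplets = build_triplets(rows)
print(triplets)
# [('A man is cooking.', 'A person prepares food.', 'A man is sleeping.')]
```

The second premise has an entailment but no contradiction, so it yields no triplet. Triplets in this shape match the (anchor, positive, negative) input that `MultipleNegativesRankingLoss` accepts.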
## Suitable uses
✅ **Recommended**
- First-stage retrieval for large-scale RAG (quickly narrowing down millions of documents)
- Semantic search, FAQ matching, recommendation systems
- Clustering, deduplication, category classification
- On-device / mobile Korean embeddings
- Two-stage search: kor-static-512 (first stage) + BGE-M3 (second-stage reranking)

❌ **Not suitable**
- Tasks where subtle differences in word order or context matter (the model has no word-order information)
- Multilingual search (Korean only)
- Cases that need the absolute best quality in a news domain such as KLUE (BGE-M3 recommended)
- Single embeddings of long documents of 8,000 tokens or more (mean pooling gets weaker as length grows)
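The two-stage search pattern listed under the recommended uses can be sketched as follows: a fast scorer shortlists candidates and a slower, more accurate scorer reranks only that shortlist. To keep the sketch runnable without downloading any model, the two scorers are hypothetical stand-ins: word overlap plays the role of the static model and character-bigram overlap plays the role of the reranker. In practice, stage 1 would be a matrix product or vector-index lookup over precomputed embeddings rather than a per-document function call.

```python
def two_stage_search(query, corpus, cheap_score, expensive_score,
                     shortlist_size=100, top_k=3):
    """Shortlist with a fast scorer, then rerank the shortlist with a slow one."""
    # Stage 1: score every document with the cheap scorer and keep the best few.
    shortlist = sorted(corpus, key=lambda doc: cheap_score(query, doc),
                       reverse=True)[:shortlist_size]
    # Stage 2: rerank only the shortlist with the expensive scorer.
    reranked = sorted(shortlist, key=lambda doc: expensive_score(query, doc),
                      reverse=True)
    return reranked[:top_k]


def word_overlap(query, doc):
    """Stand-in for the fast model: number of shared whitespace tokens."""
    return len(set(query.split()) & set(doc.split()))


def bigram_overlap(query, doc):
    """Stand-in for the reranker: number of shared character bigrams."""
    def grams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    return len(grams(query) & grams(doc))


docs = ["python data analysis", "deep learning course",
        "python web course", "hiking trails"]
result = two_stage_search("python course", docs, word_overlap, bigram_overlap,
                          shortlist_size=2, top_k=1)
print(result)  # ['python web course']
```

The expensive scorer runs on `shortlist_size` documents instead of the whole corpus, which is where the cost saving comes from.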
## Architecture
This model does not use transformer attention. Instead:
```
Input: "오늘 날씨가 좋네요" ("The weather is nice today")
  ↓
[1] klue/roberta-base tokenizer
    → sequence of token IDs
  ↓
[2] StaticEmbedding (32000 × 512 lookup table, 16.4M params)
    → each token → 512-dimensional vector
  ↓
[3] Mean pooling
    → 512-dimensional sentence vector
  ↓
[4] L2 normalization (when normalize_embeddings=True)
```
This applies to Korean the approach validated in [Tom Aarsen's Static Embeddings blog post (Hugging Face)](https://huggingface.co/blog/static-embeddings) and [MinishLab's model2vec](https://github.com/MinishLab/model2vec).
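The pipeline in the diagram (lookup, mean pooling, optional L2 normalization) is small enough to sketch in plain Python. The table below is a made-up 4-token, 3-dimensional stand-in for the real 32000 × 512 table, and token IDs are passed in directly instead of coming from the `klue/roberta-base` tokenizer. It illustrates the idea and is not the library's implementation.

```python
import math

# Hypothetical lookup table: token ID -> vector (values are made up).
EMBEDDING_TABLE = {
    0: [0.5, 0.5, 0.0],
    1: [1.0, 0.0, 2.0],
    2: [0.0, 3.0, 2.0],
    3: [2.0, 0.0, 2.0],
}

def embed(token_ids, normalize=True):
    """Look up token vectors, mean-pool them, and optionally L2-normalize.

    Assumes at least one token and a nonzero pooled vector.
    """
    vectors = [EMBEDDING_TABLE[t] for t in token_ids]             # step [2]: lookup
    pooled = [sum(col) / len(vectors) for col in zip(*vectors)]   # step [3]: mean
    if normalize:                                                 # step [4]: L2 norm
        norm = math.sqrt(sum(x * x for x in pooled))
        pooled = [x / norm for x in pooled]
    return pooled

print(embed([1, 2, 3], normalize=False))  # [1.0, 1.0, 2.0]
print(embed([1, 2, 3]))                   # same direction, unit length
```

There is no per-sentence computation beyond an average, which is why inference cost is dominated by tokenization and memory lookups rather than matrix multiplications.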
## Limitations
1. **Word order is ignored**: weak at distinguishing "철수가 영희를 좋아한다" ("Cheolsu likes Younghee") ↔ "영희가 철수를 좋아한다" ("Younghee likes Cheolsu")
2. **Weak polysemy handling**: the word "은행" ("bank") in "은행 직원" ("bank employee") and in "강변 은행" ("riverside bank") gets the same vector
3. **Quality gap in the KLUE domain**: in the news domain, the gap to BGE-M3 is large (0.71 vs 0.88)
4. **Weak handling of negation and irony**: "좋아하지 않는다" ("does not like") may be treated as similar to "좋아한다" ("likes")

These limitations are inherent to all bag-of-words-style static embeddings. When accuracy is paramount, BGE-M3 is recommended.
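The word-order limitation follows directly from mean pooling: averaging is order-independent, so two sentences that tokenize to the same multiset of tokens get exactly the same vector. The toy check below uses made-up word-level vectors. The real tokenizer works on subwords, so whether a given sentence pair collapses to the same vector depends on how it is tokenized.

```python
# Hypothetical token vectors; the tokens and values are made up for illustration.
TABLE = {
    "철수": [1.0, 0.0],      # "Cheolsu"
    "영희": [0.0, 1.0],      # "Younghee"
    "좋아한다": [1.0, 1.0],  # "likes"
}

def mean_pool(tokens):
    """Average the token vectors dimension by dimension."""
    vectors = [TABLE[t] for t in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

a = mean_pool(["철수", "영희", "좋아한다"])
b = mean_pool(["영희", "철수", "좋아한다"])  # same tokens, different order

print(a == b)  # True: the two orderings are indistinguishable after pooling
```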
## Citation
If you use this model, please also cite the work it builds on:
- Static Embeddings: https://huggingface.co/blog/static-embeddings
- model2vec: https://github.com/MinishLab/model2vec
- KorSTS / KorNLI: KakaoBrain KorNLUDatasets
- KLUE: https://klue-benchmark.com
## License
Apache 2.0