🏥 Vet KM-BERT Cross-Encoder
A Korean Cross-Encoder model specialized for the veterinary medicine domain. It is used in the reranking stage of a RAG system.
Model Information
- Base Model: madatnlp/km-bert
- Task: Binary Classification (question-document relevance judgment)
- Language: Korean
- Domain: Veterinary Medicine
Training Data
- Dataset: 213 veterinary documents (5 clinical departments)
  - Internal medicine, ophthalmology, surgery, dentistry, dermatology
- Number of questions: 600 (420 for training, 180 for evaluation)
- Curation method: LLM Scoring + Graph Refinement (LightGCN)
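The 420/180 figures correspond to a 70/30 partition of the 600 questions. A minimal sketch of such a split follows. The shuffling, the seed, and the function name `split_questions` are assumptions for illustration, since the card does not describe how the split was actually made.

```python
import random

def split_questions(questions, train_fraction=0.7, seed=42):
    """Shuffle a copy of `questions` and split it into (train, eval) lists."""
    shuffled = list(questions)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumed)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder integer IDs standing in for the 600 questions.
train, evaluation = split_questions(range(600))
print(len(train), len(evaluation))  # 420 180
```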
Performance
| Metric | Score |
|---|---|
| Accuracy | ~68% |
| F1-Score | ~0.72 |
| Precision | ~0.71 |
| Recall | ~0.73 |
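As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported figures should agree with one another. The sketch below uses only the rounded values from the table; the helper name `f1_from_pr` is illustrative.

```python
def f1_from_pr(precision, recall):
    """Return F1, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)

f1 = f1_from_pr(0.71, 0.73)
print(f"{f1:.2f}")  # 0.72
```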
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load the model and tokenizer
model_name = "JOhyeongi/vet-kmbert-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Inference
query = "강아지가 구토를 해요."  # "My dog is vomiting."
document = "강아지 구토의 원인은 다양합니다..."  # "Vomiting in dogs has many causes..."
# Encode the query and document together as a single sentence pair
inputs = tokenizer(
    [[query, document]],
    padding=True,
    truncation=True,
    return_tensors="pt",
    max_length=512
)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=1)
score = probs[0][1].item()  # Relevance score (probability of class 1)
print(f"Relevance Score: {score:.4f}")
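In a reranking stage, the same score is typically computed for every candidate document returned by the retriever, and the candidates are then sorted by it. The following is a minimal sketch of that step, not the project's actual implementation: `rerank`, `score_fn`, and `toy_score` are hypothetical names, and `score_fn` stands in for a function that wraps the tokenizer and model calls shown above.

```python
def rerank(query, documents, score_fn, top_k=3):
    """Score each (query, document) pair and return the top_k best, highest first."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def toy_score(query, document):
    """Toy stand-in for the cross-encoder: fraction of query words found in the document."""
    query_words = set(query.split())
    return len(query_words & set(document.split())) / max(len(query_words), 1)

candidates = [
    "cat dental care",
    "dog skin allergy",
    "common causes of dog vomiting",
]
top = rerank("dog vomiting causes", candidates, toy_score, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Passing the scorer in as a parameter keeps the sorting logic independent of the model, so the toy scorer can be swapped for the real cross-encoder without changing `rerank`.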
Full RAG Pipeline
This model is part of the following project:
- Repository: catholic_retreival
- Pipeline: Rationale Generation → Retrieval → Reranking → Answer Generation
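The four stages can be read as a simple chain in which each stage consumes the output of the previous one. The skeleton below only illustrates that data flow under assumed interfaces: every name in it (`run_pipeline` and the four stage callables) is hypothetical and does not come from the repository, and passing the rationale into retrieval is likewise an assumption.

```python
def run_pipeline(question, generate_rationale, retrieve, rerank_stage, generate_answer):
    """Chain the four stages; each argument after `question` is a caller-supplied callable."""
    rationale = generate_rationale(question)        # 1. Rationale Generation
    candidates = retrieve(question, rationale)      # 2. Retrieval
    ranked = rerank_stage(question, candidates)     # 3. Reranking (this model's role)
    return generate_answer(question, ranked)        # 4. Answer Generation

# Stub stages, used only to show the call shape.
answer = run_pipeline(
    "dog vomiting",
    generate_rationale=lambda q: f"rationale for {q}",
    retrieve=lambda q, r: ["doc b", "doc a"],
    rerank_stage=lambda q, docs: sorted(docs),
    generate_answer=lambda q, docs: f"answer based on {docs[0]}",
)
print(answer)  # answer based on doc a
```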
Training Configuration
Epochs: 3
Batch Size: 8
Learning Rate: 2e-5
Max Length: 512
Optimizer: AdamW
Weight Decay: 0.01
Warmup Steps: 500
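For readers who want to reproduce a similar setup, the values above can be collected in one place. The sketch below is an assumption about how they might be organized, not the original training script: the `TrainConfig` class is hypothetical, and the keys returned by `to_trainer_kwargs` follow Hugging Face `TrainingArguments` parameter names on the assumption that a `Trainer`-style loop was used. Max Length applies to the tokenizer rather than the trainer, so it is kept out of that mapping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed on the model card (container class is hypothetical)."""
    epochs: int = 3
    batch_size: int = 8
    learning_rate: float = 2e-5
    max_length: int = 512      # tokenizer truncation length
    weight_decay: float = 0.01
    warmup_steps: int = 500

    def to_trainer_kwargs(self):
        """Map fields to TrainingArguments keyword names (assumed Trainer usage)."""
        return {
            "num_train_epochs": self.epochs,
            "per_device_train_batch_size": self.batch_size,
            "learning_rate": self.learning_rate,
            "weight_decay": self.weight_decay,
            "warmup_steps": self.warmup_steps,
        }

config = TrainConfig()
kwargs = config.to_trainer_kwargs()
print(kwargs)
```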
License
MIT License
Citation
@misc{vet-kmbert-cross-encoder,
title={Vet KM-BERT Cross-Encoder: Korean Veterinary RAG System},
author={Catholic University},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/JOhyeongi/vet-kmbert-cross-encoder}
}
Contact
- GitHub: jasonhk24/catholic_retreival
- Issues: GitHub Issues