🏥 Vet KM-BERT Cross-Encoder

A Korean cross-encoder model specialized for the veterinary domain. It is used in the reranking stage of a RAG system.

Model Information

Base Model: madatnlp/km-bert
Task: Binary Classification (query-document relevance judgment)
Language: Korean
Domain: Veterinary Medicine

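Training Data

Dataset: 213 veterinary documents (5 clinical departments)
- Internal medicine, ophthalmology, surgery, dentistry, dermatology
Number of questions: 600 (420 training, 180 evaluation)
Curation method: LLM Scoring + Graph Refinement (LightGCN)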

Performance

Metric	Score
Accuracy	~68%
F1-Score	~0.72
Precision	~0.71
Recall	~0.73
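
The four metrics above are the standard binary-classification quantities. The sketch below shows how they could be computed from gold labels and predicted labels; it is an illustration, not the evaluation script used for this model, and the names `binary_metrics` and `f1_from` are hypothetical. It also checks that the reported precision and recall are consistent with the reported F1.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Consistency check on the reported (approximate) values.
reported_f1 = round(f1_from(0.71, 0.73), 2)
```

With precision 0.71 and recall 0.73, the harmonic mean rounds to 0.72, which matches the F1-Score in the table.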

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model
model_name = "JOhyeongi/vet-kmbert-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Inference
query = "강아지가 구토를 해요."  # "My dog is vomiting."
document = "강아지 구토의 원인은 다양합니다..."  # "Vomiting in dogs has many possible causes..."

inputs = tokenizer(
    [[query, document]], 
    padding=True, 
    truncation=True, 
    return_tensors="pt",
    max_length=512
)

with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=1)
    score = probs[0][1].item()  # Relevance score

print(f"Relevance Score: {score:.4f}")
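
The example above scores a single query-document pair. In a reranking step the same score is usually computed for every retrieved candidate and the candidates are then sorted by it. The following is a minimal sketch under that assumption: `rerank` and `make_cross_encoder_scorer` are hypothetical helper names, `top_k` and `batch_size` are illustrative defaults, and class index 1 is taken to mean "relevant", as in the example above.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[str, Sequence[str]], List[float]]


def rerank(query: str, documents: Sequence[str], score_fn: ScoreFn,
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (document, score) pairs, highest score first."""
    scores = score_fn(query, documents)
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def make_cross_encoder_scorer(
    model_name: str = "JOhyeongi/vet-kmbert-cross-encoder",
    batch_size: int = 8,
) -> ScoreFn:
    """Build a score function backed by the cross-encoder (hypothetical helper)."""
    # Heavy imports are deferred so that rerank() can be used with any scorer.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def score_fn(query: str, documents: Sequence[str]) -> List[float]:
        scores: List[float] = []
        for start in range(0, len(documents), batch_size):
            batch = list(documents[start:start + batch_size])
            # Pair the same query with each document in the batch.
            inputs = tokenizer(
                [query] * len(batch), batch,
                padding=True, truncation=True,
                max_length=512, return_tensors="pt",
            )
            with torch.no_grad():
                logits = model(**inputs).logits
            # Probability of class 1, assumed to be the "relevant" class.
            scores.extend(torch.softmax(logits, dim=1)[:, 1].tolist())
        return scores

    return score_fn
```

A call such as `rerank(query, candidates, make_cross_encoder_scorer())` would then return the highest-scoring candidates. Passing the scorer in as a function keeps the sorting logic independent of the model, so it can be exercised with a stub scorer.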

Full RAG Pipeline

This model is part of the following project:

Repository: catholic_retreival
Pipeline: Rationale Generation → Retrieval → Reranking → Answer Generation
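
The four pipeline stages can be pictured as a simple chain of functions. The sketch below is only a schematic of that ordering and is not taken from the repository: every stage is a caller-supplied function, the name `answer_question` is hypothetical, and the way the rationale is passed to retrieval is an assumption.

```python
from typing import Callable, List


def answer_question(
    question: str,
    generate_rationale: Callable[[str], str],
    retrieve: Callable[[str, str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate_answer: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> str:
    """Run the stages in order: rationale, retrieval, reranking, answer."""
    # 1. Rationale Generation (assumed to produce text that guides retrieval).
    rationale = generate_rationale(question)
    # 2. Retrieval: fetch candidate documents.
    candidates = retrieve(question, rationale)
    # 3. Reranking: this is where the cross-encoder would be applied.
    top_docs = rerank(question, candidates)[:top_k]
    # 4. Answer Generation from the highest-ranked documents.
    return generate_answer(question, top_docs)
```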

Training Configuration

Epochs: 3
Batch Size: 8
Learning Rate: 2e-5
Max Length: 512
Optimizer: AdamW
Weight Decay: 0.01
Warmup Steps: 500
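
The settings above can be captured in a small configuration object. This is a sketch rather than the project's training script: `TrainConfig` is a hypothetical name, the keyword names in `to_training_kwargs` follow Hugging Face `TrainingArguments`, and the optimizer is omitted on the assumption that the `Trainer` default (AdamW) is used. The number of training pairs is not stated in this card, so `total_steps` takes it as an input.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 3
    batch_size: int = 8
    learning_rate: float = 2e-5
    max_length: int = 512      # applied at tokenization time, not by the trainer
    weight_decay: float = 0.01
    warmup_steps: int = 500

    def total_steps(self, num_train_pairs: int) -> int:
        """Optimizer steps, assuming one device and no gradient accumulation."""
        return math.ceil(num_train_pairs / self.batch_size) * self.epochs

    def to_training_kwargs(self, output_dir: str) -> Dict[str, Any]:
        """Keyword arguments that could be passed to transformers.TrainingArguments."""
        return {
            "output_dir": output_dir,
            "num_train_epochs": self.epochs,
            "per_device_train_batch_size": self.batch_size,
            "learning_rate": self.learning_rate,
            "weight_decay": self.weight_decay,
            "warmup_steps": self.warmup_steps,
        }


config = TrainConfig()
```

Comparing `config.total_steps(n)` with `config.warmup_steps` for the actual number of training pairs `n` is a useful sanity check: if the total were below 500, a linear warmup would not finish before training ends.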

License

MIT License

Citation

@misc{vet-kmbert-cross-encoder,
  title={Vet KM-BERT Cross-Encoder: Korean Veterinary RAG System},
  author={Catholic University},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/JOhyeongi/vet-kmbert-cross-encoder}
}

Contact

GitHub: jasonhk24/catholic_retreival
Issues: GitHub Issues


Model size: 98.7M params
Tensor type: F32 (Safetensors)
