# Korean Claim Detection Model for Fact-Checking

## Model Description

This binary classifier **automatically detects claims that require fact-checking** in Korean sentences.

It identifies check-worthy claims in news articles, political debates, and social media posts, which automates the first step of a fact-checking workflow.

- **Base Model**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Task**: Claim Detection (Checkworthy Sentence Classification)
- **Language**: Korean (한국어)
- **Labels**:
  - `0`: Non-checkworthy (sentence that does not need fact-checking)
  - `1`: Checkworthy claim (claim that needs fact-checking)

## Model Objective

Given a Korean sentence, the model predicts:
- Whether it contains a **verifiable factual claim**
- How strongly it **warrants fact-checking**, expressed as the predicted probability of label `1` (the label itself is binary)

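### Checkworthy Claim Examples

✅ **Label 1 (Checkworthy)**:
- "청년 실업률이 지난 3년간 계속 상승했습니다" (The youth unemployment rate has risen steadily over the past three years.)
- "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다" (Our country's GDP growth rate has exceeded the OECD average.)
- "이 정책으로 일자리가 100만 개 창출될 것입니다" (This policy will create one million jobs.)

❌ **Label 0 (Non-checkworthy)**:
- "오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요" (Today's debate is being held at the SBS studio in Sangam-dong.)
- "국민 여러분께 감사드립니다" (Thank you to all citizens.)
- "제 생각에는 이 정책이 좋은 것 같습니다" (Personally, I think this policy seems good.)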

## Dataset

### Data Source
- **Source**: CLEF CheckThat! Lab 2024
- **Task**: Task 1 - Check-Worthiness Estimation
- **Original Dataset**: English political debates and speeches
- **Translation**: Machine-translated to Korean for training

### Dataset Size
- **Training Set**: 22,501 samples
- **Validation Set**: 1,032 samples
- **Test Set**: 318 samples

### Data Characteristics
- Sentences drawn from political debates, speeches, and news articles
- High-quality labels assigned by professional fact-checkers
- Class imbalance: Label 0 (65%) vs. Label 1 (35%)

## Training Details

### Training Hyperparameters
- **Epochs**: 5
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 3e-05
- **Weight Decay**: 0.01
- **Warmup Ratio**: 0.1
- **Precision**: BF16
- **Optimizer**: adamw_torch_fused
- **Max Sequence Length**: 128 tokens
- **Seed**: 42
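
As a rough worked example, the hyperparameters above translate into a step count and a warmup length. The sketch below is plain arithmetic and assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; the training-set size of 22,501 is taken from the Dataset section, and the exact rounding used by the training framework may differ.

```python
import math

# Values stated in this model card
train_samples = 22_501
batch_size = 32
epochs = 5
warmup_ratio = 0.1

# Assumption: single device, no gradient accumulation, partial last batch kept
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs  # upper bound; early stopping can end sooner
warmup_steps = round(total_steps * warmup_ratio)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps (max): {total_steps}")
print(f"warmup steps: {warmup_steps}")
```

Under these assumptions the learning rate ramps up over roughly the first half epoch before decaying.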

### Training Environment
- **GPU**: NVIDIA GeForce RTX 4090 (24GB)
- **Training Time**: 1.87 minutes
- **Framework**: Hugging Face Transformers
- **Early Stopping**: Patience 3 (based on F1 score)
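
The patience rule above can be illustrated with a small, framework-free sketch. This is a simplified model of early stopping, not the Transformers implementation; the `stopping_epoch` helper and the F1 values are hypothetical and do not come from the actual training log.

```python
def stopping_epoch(f1_per_epoch, patience=3):
    """Return the 1-based epoch at which training stops early, or None if it never does."""
    best_f1 = float("-inf")
    epochs_without_improvement = 0
    for epoch, f1 in enumerate(f1_per_epoch, start=1):
        if f1 > best_f1:
            # New best validation F1: reset the counter
            best_f1 = f1
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation F1 per epoch (illustrative only)
declining = [0.93, 0.92, 0.91, 0.90, 0.89]
improving = [0.90, 0.91, 0.92, 0.93, 0.94]

print(stopping_epoch(declining))
print(stopping_epoch(improving))
```

With 5 epochs and a patience of 3, stopping can only trigger at epoch 4 or 5, so the rule mainly guards against a run whose best checkpoint comes very early.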

## Performance

### Validation Metrics
- **Accuracy**: 97.58%
- **F1 Score**: 94.80%
- **Precision**: 93.83%
- **Recall**: 95.80%

### Test Metrics
- **Accuracy**: 89.31%
- **F1 Score**: 82.65%
- **Precision**: 92.05%
- **Recall**: 75.00%

### Confusion Matrix (Test Set)
```
           Predicted
           0      1
Actual 0   203    7    (96.7% specificity)
       1    27    81   (75.0% recall)
```

**Interpretation**:
- **High precision (92.05%)**: 81 of the 88 sentences predicted as checkworthy actually need fact-checking
- **Moderate recall (75.00%)**: the model detects 81 of the 108 checkworthy sentences and misses 27
- **Few false positives (7)**: unnecessary fact-check requests are kept low
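
The reported test metrics can be re-derived from the four counts in the confusion matrix. The sketch below is plain arithmetic on those counts; the 96.7% figure next to the first matrix row is the share of label-0 sentences classified correctly (specificity), not overall accuracy.

```python
# Counts from the test-set confusion matrix
tn, fp = 203, 7    # actual label 0
fn, tp = 27, 81    # actual label 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("specificity", specificity),
]:
    print(f"{name}: {value:.2%}")
```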

## How to Use

### 1. Installation

```bash
pip install transformers torch
```

### 2. Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier
model_name = "jonghhhh/claim_factcheck"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Use a GPU if one is available (optional)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # disable dropout for inference

print(f"✅ Model loaded (device: {device})")
```

### 3. Inference Example

#### Single-Sentence Classification

```python
def predict_claim(text):
    """
    Decide whether a sentence is a claim that needs fact-checking.

    Args:
        text (str): Korean sentence to analyze

    Returns:
        dict: {
            'text': the input sentence,
            'is_checkworthy': True/False,
            'confidence': probability of the predicted label (0.0-1.0),
            'label': 0 or 1,
            'probabilities': {'non_checkworthy': 0.xx, 'checkworthy': 0.xx}
        }
    """
    # Tokenize, truncating to the 128-token length used in training
    inputs = tokenizer(
        text,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        predicted_label = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][predicted_label].item()

    return {
        'text': text,
        'is_checkworthy': bool(predicted_label),
        'confidence': confidence,
        'label': predicted_label,
        'probabilities': {
            'non_checkworthy': probs[0][0].item(),
            'checkworthy': probs[0][1].item()
        }
    }

# Usage example (the model expects Korean input)
examples = [
    "오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요.",  # Today's debate is being held at the SBS studio in Sangam-dong.
    "청년 실업률이 최근 3년간 계속 상승하고 있습니다.",  # The youth unemployment rate has kept rising over the last three years.
    "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다.",  # Our country's GDP growth rate has exceeded the OECD average.
    "국민 여러분께 진심으로 감사드립니다."  # My sincere thanks to all citizens.
]

for text in examples:
    result = predict_claim(text)
    print(f"\n📝 Input: {result['text']}")
    print("🔍 Needs fact-checking" if result['is_checkworthy'] else "✅ No fact-checking needed")
    print(f"Confidence: {result['confidence']*100:.1f}%")
    print(f"Probabilities: Non-CW {result['probabilities']['non_checkworthy']*100:.1f}% | CW {result['probabilities']['checkworthy']*100:.1f}%")
```

**Example output** (first two sentences):
```

📝 Input: 오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요.
✅ No fact-checking needed
Confidence: 98.2%
Probabilities: Non-CW 98.2% | CW 1.8%

📝 Input: 청년 실업률이 최근 3년간 계속 상승하고 있습니다.
🔍 Needs fact-checking
Confidence: 94.3%
Probabilities: Non-CW 5.7% | CW 94.3%
```

#### Batch Processing

```python
def predict_claims_batch(texts, batch_size=32):
    """
    Classify several sentences in batches.

    Args:
        texts (list): list of sentences
        batch_size (int): number of sentences per batch

    Returns:
        list: one prediction dict per sentence
    """
    results = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]

        # Tokenize the batch, padding to the longest sentence in it
        inputs = tokenizer(
            batch_texts,
            truncation=True,
            max_length=128,
            padding=True,
            return_tensors="pt"
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Run inference on the batch
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)
            predicted_labels = torch.argmax(probs, dim=-1).tolist()

        # Collect one result per sentence
        for j, text in enumerate(batch_texts):
            label = predicted_labels[j]
            results.append({
                'text': text,
                'is_checkworthy': bool(label),
                'confidence': probs[j][label].item(),
                'label': label
            })

    return results

# Batch inference example
texts = [
    "국회의원 정원을 300명으로 확대하겠습니다.",  # We will expand the number of National Assembly seats to 300.
    "감사합니다.",  # Thank you.
    "2024년 경제성장률이 2.1%를 기록했습니다.",  # Economic growth in 2024 was 2.1%.
    # ... more sentences
]

batch_results = predict_claims_batch(texts)
checkworthy_claims = [r for r in batch_results if r['is_checkworthy']]
print(f"✅ {len(checkworthy_claims)} of {len(texts)} sentences need fact-checking")
```

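### 4. Real-world Use Case

```python
import re

# Extract fact-check candidates from a news article
def extract_checkworthy_claims(article_text, threshold=0.7):
    """
    Extract the sentences in an article that need fact-checking.

    Args:
        article_text (str): full text of the news article
        threshold (float): minimum confidence for a checkworthy prediction (0.0-1.0)

    Returns:
        list: predictions for the checkworthy sentences, highest confidence first
    """
    # Simple sentence split: break on whitespace that follows '.', '!' or '?'.
    # Splitting on every '.' would cut decimals such as "2.3%p" in half.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', article_text) if s.strip()]

    # Batch prediction
    results = predict_claims_batch(sentences)

    # Keep only checkworthy sentences at or above the threshold
    checkworthy_claims = [
        r for r in results
        if r['is_checkworthy'] and r['confidence'] >= threshold
    ]

    # Sort by confidence, highest first
    checkworthy_claims.sort(key=lambda x: x['confidence'], reverse=True)

    return checkworthy_claims

# Usage example: a short Korean article about a government economic policy
# announcement, including the claim that youth unemployment fell by 2.3%p
article = """
정부는 오늘 경제정책 방향을 발표했습니다.
청년 실업률이 지난해 대비 2.3%p 감소했다고 밝혔습니다.
이는 역대 최대 폭의 하락입니다.
앞으로도 일자리 창출에 힘쓰겠다고 강조했습니다.
"""

claims = extract_checkworthy_claims(article, threshold=0.8)
print(f"🔍 Fact-check candidates found: {len(claims)}\n")

for i, claim in enumerate(claims, 1):
    print(f"{i}. {claim['text']}")
    print(f"   Confidence: {claim['confidence']*100:.1f}%\n")
```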

## Model Architecture

- **Model Type**: ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
- **Hidden Size**: 768
- **Number of Layers**: 12
- **Number of Attention Heads**: 12
- **Vocabulary Size**: 32,000
- **Max Sequence Length**: 128 tokens (as used in training and in the examples above)
- **Classification Head**: Dense layer followed by a linear output projection (768 → 2)
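
To make the last step concrete: the head emits two logits per sentence, and softmax turns them into the two class probabilities used in the inference examples. The sketch below is framework-free, and the logit values are hypothetical rather than taken from the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [non-checkworthy, checkworthy]
logits = [-1.2, 1.6]
probs = softmax(logits)

label = probs.index(max(probs))  # same choice as argmax over the logits
confidence = probs[label]

print(f"label={label}, confidence={confidence:.3f}")
```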

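## Limitations

1. **Domain-specific**: Tuned for political and news text, so performance may drop on everyday conversation or technical documents
2. **Length limit**: Inputs are truncated to 128 tokens, the length used in training; because the tokenizer splits words into subwords, this covers fewer than 128 words
3. **Machine-translated data**: Trained on data translated from English, so performance may differ on naturally written Korean
4. **Binary classification**: Classifies check-worthiness as 0/1 only; the model was not trained to produce a graded score
5. **False negatives**: On the test set the model misses 25% of checkworthy claims (recall 75%)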
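
One possible mitigation for the missed claims in item 5 is to threshold the checkworthy probability directly instead of taking the argmax, which is equivalent to a fixed 0.5 cutoff. The sketch below is an assumption-laden illustration: `flag_checkworthy` is a hypothetical helper, the probabilities are made up, and a suitable threshold would have to be chosen on validation data.

```python
def flag_checkworthy(checkworthy_probs, threshold=0.5):
    """Return the indices of sentences whose P(checkworthy) meets the threshold."""
    return [i for i, p in enumerate(checkworthy_probs) if p >= threshold]

# Hypothetical P(checkworthy) values for five sentences,
# e.g. the 'checkworthy' entry returned by predict_claim
probs = [0.94, 0.42, 0.08, 0.61, 0.35]

default_flags = flag_checkworthy(probs, threshold=0.5)
recall_oriented_flags = flag_checkworthy(probs, threshold=0.3)

print(default_flags)
print(recall_oriented_flags)
```

Lowering the threshold flags more sentences, which tends to raise recall at the cost of more false positives for human fact-checkers to discard.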

## Future Improvements

- [ ] Further training on a native Korean fact-checking dataset
- [ ] Model upgrade to handle longer context (max_length 256+)
- [ ] Multi-class classification (check-worthiness on a 0-5 scale)
- [ ] Topic category classification for claims

## License

This model follows the license of its base model, [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022).

## Citation

If you use this model in research or a project, please cite it as follows:

```bibtex
@misc{korean-claim-factcheck-2025,
  author = {Jonghhhh},
  title = {Korean Claim Detection Model for Fact-Checking},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/jonghhhh/claim_factcheck}},
  note = {Based on KcELECTRA-base-v2022}
}
```

## References

- **Base Model**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Dataset**: [CLEF CheckThat! Lab 2024](https://clef2025.clef-initiative.eu/index.php?page=Pages/Labs/CheckThat.html)
- **Paper**: [CheckThat! Lab: Check-Worthiness, Subjectivity, and Persuasion](https://link.springer.com/chapter/10.1007/978-3-031-13643-6_24)

## Contact

If you have questions or feedback, please open an issue.

---

**Tags**: `claim-detection`, `fact-checking`, `korean`, `electra`, `text-classification`, `checkworthy`, `misinformation-detection`