Korean Claim Detection Model for Fact-Checking
Model Description
This model is a binary classifier that automatically detects claims requiring fact-checking in Korean sentences. It identifies verifiable claims in news articles, political debates, and social media posts, automating the first step of the fact-checking workflow.
- Base Model: beomi/KcELECTRA-base-v2022
- Task: Claim Detection (Checkworthy Sentence Classification)
- Language: Korean
- Labels:
  - 0: Sentence that does not need fact-checking (non-checkworthy)
  - 1: Claim that needs fact-checking (checkworthy)
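The label scheme above can be captured in a small helper for downstream code. A minimal sketch; the dict and function names are illustrative, not part of the released model's config:

```python
# Illustrative mapping for the two classes described above.
ID2LABEL = {0: "non_checkworthy", 1: "checkworthy"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def describe(label_id):
    """Map a predicted class id (0 or 1) to its human-readable name."""
    return ID2LABEL[label_id]

print(describe(1))  # -> "checkworthy"
```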
Model Objective
The model analyzes input Korean sentences to determine:
- Whether they contain verifiable factual claims
- The degree to which fact-checking is needed
Checkworthy Claim Examples
Label 1 (Checkworthy):
- "청년 실업률이 지난 3년간 계속 상승했습니다" (The youth unemployment rate has kept rising over the past three years)
- "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다" (Korea's GDP growth rate exceeded the OECD average)
- "이 정책으로 일자리가 100만 개 창출될 것입니다" (This policy will create one million jobs)
Label 0 (Non-checkworthy):
- "오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요" (Today's debate is being held at the SBS Sangam-dong studio)
- "국민 여러분께 감사드립니다" (Thank you, fellow citizens)
- "제 생각에는 이 정책이 좋을 것 같습니다" (In my view, this policy seems good)
Dataset
Data Source
- Source: CLEF CheckThat! Lab 2024
- Task: Task 1 - Check-Worthiness Estimation
- Original Dataset: English political debates and speeches
- Translation: Machine-translated to Korean for training
Dataset Size
- Training Set: 22,501 samples
- Validation Set: 1,032 samples
- Test Set: 318 samples
Data Characteristics
- Sentences drawn from political debates, speeches, and news articles
- High-quality data labeled by professional fact-checkers
- Class imbalance: Label 0 (65%) vs. Label 1 (35%)
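The card does not state how the 65/35 class imbalance was handled during training. One common remedy is inverse-frequency class weighting in the loss; the sketch below only illustrates how such weights would be derived from the stated split, and is not the authors' recipe:

```python
# Class fractions from the stated 65/35 split (this is an illustration;
# the card does not say that class weighting was actually used).
class_fractions = {0: 0.65, 1: 0.35}

# Weight each class by 1 / frequency, normalized so the weights average to 1.
raw = {label: 1.0 / frac for label, frac in class_fractions.items()}
mean = sum(raw.values()) / len(raw)
weights = {label: w / mean for label, w in raw.items()}

print(weights)  # the minority class (label 1) gets the larger weight
```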
Training Details
Training Hyperparameters
- Epochs: 5
- Batch Size (Train): 32
- Batch Size (Eval): 64
- Learning Rate: 3e-05
- Weight Decay: 0.01
- Warmup Ratio: 0.1
- Precision: BF16
- Optimizer: adamw_torch_fused
- Max Sequence Length: 128 tokens
- Seed: 42
Training Environment
- GPU: NVIDIA GeForce RTX 4090 (24GB)
- Training Time: 1.87 minutes
- Framework: Hugging Face Transformers
- Early Stopping: Patience 3 (based on F1 score)
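The hyperparameters listed above map directly onto Hugging Face `TrainingArguments`. The sketch below is a reconstruction from the card's numbers, not the authors' actual training script; the output directory is a placeholder, and argument names follow recent `transformers` releases:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Reconstruction of the setup from the hyperparameters listed above.
args = TrainingArguments(
    output_dir="claim-detect-kcelectra",   # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    bf16=True,
    optim="adamw_torch_fused",
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",            # early stopping tracks F1
)

# Patience of 3 evaluations, matching "Early Stopping: Patience 3" above.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```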
Performance
Validation Metrics
- Accuracy: 97.58%
- F1 Score: 94.80%
- Precision: 93.83%
- Recall: 95.80%
Test Metrics
- Accuracy: 89.31%
- F1 Score: 82.65%
- Precision: 92.05%
- Recall: 75.00%
Confusion Matrix (Test Set)

             Predicted 0   Predicted 1
Actual 0         203             7      (96.7% of actual 0s correct)
Actual 1          27            81      (75.0% recall)

Performance interpretation:
- High precision (92.05%): of the sentences the model predicts as checkworthy, 92% actually need fact-checking
- Moderate recall (75.00%): the model catches 75% of the truly checkworthy sentences
- Few false positives (7): minimizes unnecessary fact-check requests
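The reported test metrics can be re-derived from the confusion matrix above; a quick consistency check in plain Python:

```python
# Re-derive the test metrics from the confusion matrix above.
tn, fp = 203, 7    # actual 0: predicted 0 / predicted 1
fn, tp = 27, 81    # actual 1: predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.8931 precision=0.9205 recall=0.7500 f1=0.8265
```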
How to Use
1. Installation
pip install transformers torch
2. Loading the Model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load the model
model_name = "jonghhhh/claim_factcheck"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Use GPU if available (optional)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
print(f"Model loaded (device: {device})")
3. Inference Example
Single-Sentence Classification
def predict_claim(text):
    """
    Determine whether the input sentence is a claim that needs fact-checking.

    Args:
        text (str): Korean sentence to analyze

    Returns:
        dict: {
            'text': input sentence,
            'is_checkworthy': True/False,
            'confidence': 0.0-1.0,
            'label': 0 or 1,
            'probabilities': {'non_checkworthy': 0.xx, 'checkworthy': 0.xx}
        }
    """
    # Tokenize
    inputs = tokenizer(
        text,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)

    predicted_label = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][predicted_label].item()

    return {
        'text': text,
        'is_checkworthy': bool(predicted_label),
        'confidence': confidence,
        'label': predicted_label,
        'probabilities': {
            'non_checkworthy': probs[0][0].item(),
            'checkworthy': probs[0][1].item()
        }
    }
# Usage example
examples = [
    "오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요.",  # studio small talk
    "청년 실업률이 최근 3년간 계속 상승하고 있습니다.",        # unemployment claim
    "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다.",        # GDP claim
    "국민 여러분께 진심으로 감사드립니다."                    # expression of thanks
]

for text in examples:
    result = predict_claim(text)
    print(f"\nInput: {result['text']}")
    print("Fact-check needed" if result['is_checkworthy'] else "No fact-check needed")
    print(f"Confidence: {result['confidence']*100:.1f}%")
    print(f"Probabilities: Non-CW {result['probabilities']['non_checkworthy']*100:.1f}% | CW {result['probabilities']['checkworthy']*100:.1f}%")
Example output:

Input: 청년 실업률이 최근 3년간 계속 상승하고 있습니다.
Fact-check needed
Confidence: 94.3%
Probabilities: Non-CW 5.7% | CW 94.3%

Input: 오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요.
No fact-check needed
Confidence: 98.2%
Probabilities: Non-CW 98.2% | CW 1.8%
Batch Processing
def predict_claims_batch(texts, batch_size=32):
    """
    Classify multiple sentences in batches.

    Args:
        texts (list): list of sentences
        batch_size (int): batch size

    Returns:
        list: one prediction dict per sentence
    """
    results = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]

        # Tokenize the batch
        inputs = tokenizer(
            batch_texts,
            truncation=True,
            max_length=128,
            padding=True,
            return_tensors="pt"
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Batch inference
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)
            predicted_labels = torch.argmax(probs, dim=-1).cpu().numpy()

        # Collect results
        for j, text in enumerate(batch_texts):
            results.append({
                'text': text,
                'is_checkworthy': bool(predicted_labels[j]),
                'confidence': probs[j][predicted_labels[j]].item(),
                'label': int(predicted_labels[j])
            })
    return results
# Batch inference example
texts = [
    "국회의원 정원을 300명으로 확대하겠습니다.",  # pledge to expand the National Assembly to 300 seats
    "감사합니다.",                                # "Thank you."
    "2024년 경제 성장률이 2.1%를 기록했습니다.",   # 2024 growth-rate claim
    # ... more sentences
]
batch_results = predict_claims_batch(texts)
checkworthy_claims = [r for r in batch_results if r['is_checkworthy']]
print(f"Of {len(texts)} sentences, {len(checkworthy_claims)} need fact-checking")
4. Real-world Use Case
# Extract fact-check targets from a news article
def extract_checkworthy_claims(article_text, threshold=0.7):
    """
    Extract the sentences in an article that need fact-checking.

    Args:
        article_text (str): full text of the news article
        threshold (float): cutoff for the checkworthy decision (0.0-1.0)

    Returns:
        list: sentences flagged for fact-checking
    """
    # Split into sentences (simplistic example)
    sentences = [s.strip() for s in article_text.split('.') if s.strip()]

    # Batch prediction
    results = predict_claims_batch(sentences)

    # Keep only checkworthy sentences at or above the threshold
    checkworthy_claims = [
        r for r in results
        if r['is_checkworthy'] and r['confidence'] >= threshold
    ]

    # Sort by confidence, highest first
    checkworthy_claims.sort(key=lambda x: x['confidence'], reverse=True)
    return checkworthy_claims

# Usage example
article = """
정부는 오늘 경제 정책 방향을 발표했습니다.
청년 실업률이 지난해 대비 2.3%p 감소했다고 밝혔습니다.
이는 역대 최대 폭의 하락입니다.
앞으로도 일자리 창출에 힘쓰겠다고 강조했습니다.
"""

claims = extract_checkworthy_claims(article, threshold=0.8)
print(f"Fact-check targets found: {len(claims)}\n")
for i, claim in enumerate(claims, 1):
    print(f"{i}. {claim['text']}")
    print(f"   Confidence: {claim['confidence']*100:.1f}%\n")
Model Architecture
- Model Type: ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
- Hidden Size: 768
- Number of Layers: 12
- Number of Attention Heads: 12
- Vocabulary Size: 32,000
- Max Sequence Length: 128 tokens
- Classification Head: Linear layer (768 โ 2)
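For scale, the classification head described above is tiny relative to the encoder; rough parameter counts follow directly from the listed dimensions. A back-of-the-envelope check, not an official figure:

```python
# Back-of-the-envelope parameter counts from the dimensions listed above.
hidden_size, num_labels, vocab_size = 768, 2, 32_000

# Token embedding table: one hidden-size vector per vocabulary entry.
embedding_params = vocab_size * hidden_size

# Classification head: Linear(768 -> 2) weight matrix plus bias.
head_params = hidden_size * num_labels + num_labels

print(embedding_params)  # 24576000
print(head_params)       # 1538
```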
Limitations
- Domain specificity: optimized for political and news text; performance may drop on everyday conversation or technical documents
- Length limit: handles at most 128 tokens (roughly 100-150 words)
- Machine-translated training data: trained on data translated from English, so performance may differ on natural Korean phrasing
- Binary classification: checkworthiness is reported only as 0/1 (no fine-grained score)
- False negatives: can miss up to 25% of real claims (recall 75%)
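When missed claims (false negatives) are costlier than extra review work, the recall limitation can be mitigated at inference time by thresholding the checkworthy probability instead of taking the argmax. A sketch; the threshold and probability values below are illustrative:

```python
# Decide from the checkworthy probability with an adjustable threshold.
# Lowering the threshold below 0.5 trades precision for recall, which can
# recover some claims the default argmax decision would miss.
def is_checkworthy(prob_checkworthy, threshold=0.5):
    return prob_checkworthy >= threshold

borderline = 0.42  # e.g. a claim scored just below the default cutoff
print(is_checkworthy(borderline))                 # False at the default 0.5
print(is_checkworthy(borderline, threshold=0.3))  # True with a lower cutoff
```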
Future Improvements
- Further training on native Korean fact-checking datasets
- Model upgrade to handle longer contexts (max_length 256+)
- Multi-class classification (checkworthiness scores on a 0-5 scale)
- Topic/category classification of claims
License
This model follows the license of its base model, beomi/KcELECTRA-base-v2022.
Citation
If you use this model in research or a project, please cite it as follows:
@misc{korean-claim-factcheck-2025,
author = {Jonghhhh},
title = {Korean Claim Detection Model for Fact-Checking},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/jonghhhh/claim_factcheck}},
note = {Based on KcELECTRA-base-v2022}
}
References
- Base Model: beomi/KcELECTRA-base-v2022
- Dataset: CLEF CheckThat! Lab 2024
- Paper: CheckThat! Lab: Check-Worthiness, Subjectivity, and Persuasion
Contact
If you have questions or feedback, please open an issue on the model repository.
Tags: claim-detection, fact-checking, korean, electra, text-classification, checkworthy, misinformation-detection