Upload Korean Claim Detection Model for Fact-Checking

Files changed (9)

README.md +372 -0
config.json +31 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +58 -0
training_args.bin +3 -0
training_metadata.json +45 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,372 @@

+# Korean Claim Detection Model for Fact-Checking
+## Model Description
+This model is a binary classifier that **automatically detects claims that require fact-checking** in Korean sentences. It identifies check-worthy claims in news articles, political debates, and social media posts, automating the first step of a fact-checking workflow.
+- **Base Model**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
+- **Task**: Claim Detection (Checkworthy Sentence Classification)
+- **Language**: Korean
+- **Labels**:
+  - `0`: Sentence that does not need fact-checking (non-checkworthy)
+  - `1`: Claim that needs fact-checking (checkworthy)
+## Model Objective
+Given a Korean sentence, the model determines:
+- Whether it contains a **verifiable factual claim**
+- **How strongly it needs fact-checking**, read from the softmax probability of label `1` (the predicted label itself is binary)
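+### Checkworthy Claim Examples
+✅ **Label 1 (Checkworthy)**:
+- "청년 실업률이 지난 3년간 계속 상승했습니다" (The youth unemployment rate has risen steadily over the past three years.)
+- "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다" (Our country's GDP growth rate has exceeded the OECD average.)
+- "이 정책으로 일자리가 100만 개 창출될 것입니다" (This policy will create one million jobs.)
+❌ **Label 0 (Non-checkworthy)**:
+- "오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요" (Today's debate is being held at the SBS studio in Sangam-dong.)
+- "국민 여러분께 감사드립니다" (I thank my fellow citizens.)
+- "제 생각에는 이 정책이 좋은 것 같습니다" (In my opinion, this policy seems good.)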
+## Dataset
+### Data Source
+- **Source**: CLEF CheckThat! Lab 2024
+- **Task**: Task 1 - Check-Worthiness Estimation
+- **Original Dataset**: English political debates and speeches
+- **Translation**: Machine-translated to Korean for training
+### Dataset Size
+- **Training Set**: 22,501 samples
+- **Validation Set**: 1,032 samples
+- **Test Set**: 318 samples
+### Data Characteristics
+- Sentences extracted from political debates, speeches, and news articles
+- High-quality data labeled by professional fact-checkers
+- Class imbalance: Label 0 (65%) vs Label 1 (35%)
+## Training Details
+### Training Hyperparameters
+- **Epochs**: 5 (maximum; the metrics in `training_metadata.json` were recorded at epoch 3.41)
+- **Batch Size (Train)**: 32
+- **Batch Size (Eval)**: 64
+- **Learning Rate**: 3e-05
+- **Weight Decay**: 0.01
+- **Warmup Ratio**: 0.1
+- **Precision**: BF16
+- **Optimizer**: adamw_torch_fused
+- **Max Sequence Length**: 128 tokens
+- **Seed**: 42
+### Training Environment
+- **GPU**: NVIDIA GeForce RTX 4090 (24GB)
+- **Training Time**: 1.87 minutes
+- **Framework**: Hugging Face Transformers
+- **Early Stopping**: Patience 3 (based on F1 score)
+## Performance
+### Validation Metrics
+- **Accuracy**: 97.58%
+- **F1 Score**: 94.80%
+- **Precision**: 93.83%
+- **Recall**: 95.80%
+### Test Metrics
+- **Accuracy**: 89.31%
+- **F1 Score**: 82.65%
+- **Precision**: 92.05%
+- **Recall**: 75.00%
+### Confusion Matrix (Test Set)
+```
+           Predicted
+           0      1
+Actual 0   203    7    (specificity 96.7%)
+       1    27    81   (recall 75.0%)
+```
+**Interpretation**:
+- **High precision (92.05%)**: 81 of the 88 sentences predicted as checkworthy actually need fact-checking
+- **Moderate recall (75.00%)**: the model finds 81 of the 108 checkworthy sentences and misses 27
+- **Few false positives (7)**: unnecessary fact-check requests are kept to a minimum
+## How to Use
+### 1. Installation
+```bash
+pip install transformers torch
+```
+### 2. Loading the Model
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load the tokenizer and model
+model_name = "jonghhhh/claim_factcheck"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Use the GPU if one is available (optional)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+model.eval()
+print(f"✅ Model loaded (device: {device})")
+```
+### 3. Inference Example
+#### Single-Sentence Classification
+```python
+def predict_claim(text):
+    """
+    Decide whether the input sentence is a claim that needs fact-checking.
+    Args:
+        text (str): Korean sentence to analyze
+    Returns:
+        dict: {
+            'text': the input sentence,
+            'is_checkworthy': True/False,
+            'confidence': 0.0-1.0 (probability of the predicted label),
+            'label': 0 or 1,
+            'probabilities': {'non_checkworthy': 0.xx, 'checkworthy': 0.xx}
+        }
+    """
+    # Tokenize
+    inputs = tokenizer(
+        text,
+        truncation=True,
+        max_length=128,
+        return_tensors="pt"
+    )
+    inputs = {k: v.to(device) for k, v in inputs.items()}
+    # Run inference
+    with torch.no_grad():
+        outputs = model(**inputs)
+        probs = torch.softmax(outputs.logits, dim=-1)
+        predicted_label = torch.argmax(probs, dim=-1).item()
+        confidence = probs[0][predicted_label].item()
+    return {
+        'text': text,
+        'is_checkworthy': bool(predicted_label),
+        'confidence': confidence,
+        'label': predicted_label,
+        'probabilities': {
+            'non_checkworthy': probs[0][0].item(),
+            'checkworthy': probs[0][1].item()
+        }
+    }
+# Usage example (the inputs are Korean sentences)
+examples = [
+    "오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요.",
+    "청년 실업률이 최근 3년간 계속 상승하고 있습니다.",
+    "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다.",
+    "국민 여러분께 진심으로 감사드립니다."
+]
+for text in examples:
+    result = predict_claim(text)
+    print(f"\n📝 Input: {result['text']}")
+    print(f"{'🔍 Fact-check needed' if result['is_checkworthy'] else '✅ No fact-check needed'}")
+    print(f"Confidence: {result['confidence']*100:.1f}%")
+    print(f"Probabilities: Non-CW {result['probabilities']['non_checkworthy']*100:.1f}% | CW {result['probabilities']['checkworthy']*100:.1f}%")
+```
+**Example output** (excerpt):
+```
+📝 Input: 오늘 토론회는 SBS 상암동 스튜디오에서 진행하고 있고요.
+✅ No fact-check needed
+Confidence: 98.2%
+Probabilities: Non-CW 98.2% | CW 1.8%
+📝 Input: 청년 실업률이 최근 3년간 계속 상승하고 있습니다.
+🔍 Fact-check needed
+Confidence: 94.3%
+Probabilities: Non-CW 5.7% | CW 94.3%
+```
+#### Batch Processing
+```python
+def predict_claims_batch(texts, batch_size=32):
+    """
+    Classify several sentences in batches.
+    Args:
+        texts (list): list of sentences
+        batch_size (int): batch size
+    Returns:
+        list: one prediction dict per sentence
+    """
+    results = []
+    for i in range(0, len(texts), batch_size):
+        batch_texts = texts[i:i+batch_size]
+        # Tokenize the batch, padding to its longest sentence
+        inputs = tokenizer(
+            batch_texts,
+            truncation=True,
+            max_length=128,
+            padding=True,
+            return_tensors="pt"
+        )
+        inputs = {k: v.to(device) for k, v in inputs.items()}
+        # Run inference on the batch
+        with torch.no_grad():
+            outputs = model(**inputs)
+            probs = torch.softmax(outputs.logits, dim=-1)
+            predicted_labels = torch.argmax(probs, dim=-1).tolist()
+        # Collect the results
+        for j, text in enumerate(batch_texts):
+            label = predicted_labels[j]
+            results.append({
+                'text': text,
+                'is_checkworthy': bool(label),
+                'confidence': probs[j][label].item(),
+                'label': label
+            })
+    return results
+# Batch inference example
+texts = [
+    "국회의원 정원을 300명으로 확대하겠습니다.",
+    "감사합니다.",
+    "2024년 경제성장률이 2.1%를 기록했습니다.",
+    # ... more sentences
+]
+batch_results = predict_claims_batch(texts)
+checkworthy_claims = [r for r in batch_results if r['is_checkworthy']]
+print(f"✅ {len(checkworthy_claims)} of {len(texts)} sentences need fact-checking")
+```
+### 4. Real-world Use Case
+```python
+import re
+# Extract fact-check candidates from a news article
+def extract_checkworthy_claims(article_text, threshold=0.7):
+    """
+    Extract the sentences in an article that need fact-checking.
+    Args:
+        article_text (str): full text of the news article
+        threshold (float): minimum confidence for a checkworthy prediction (0.0-1.0)
+    Returns:
+        list: fact-check candidates, highest confidence first
+    """
+    # Naive sentence split: break on whitespace that follows ., ! or ?
+    # (splitting on every '.' would cut decimals such as "2.3%p" in half)
+    sentences = [
+        s.strip()
+        for s in re.split(r'(?<=[.!?])\s+', article_text.strip())
+        if s.strip()
+    ]
+    # Batch prediction
+    results = predict_claims_batch(sentences)
+    # Keep only checkworthy sentences at or above the threshold
+    checkworthy_claims = [
+        r for r in results
+        if r['is_checkworthy'] and r['confidence'] >= threshold
+    ]
+    # Sort by confidence, highest first
+    checkworthy_claims.sort(key=lambda x: x['confidence'], reverse=True)
+    return checkworthy_claims
+# Usage example
+article = """
+정부는 오늘 경제정책 방향을 발표했습니다.
+청년 실업률이 지난해 대비 2.3%p 감소했다고 밝혔습니다.
+이는 역대 최대 폭의 하락입니다.
+앞으로도 일자리 창출에 힘쓰겠다고 강조했습니다.
+"""
+claims = extract_checkworthy_claims(article, threshold=0.8)
+print(f"🔍 Fact-check candidates found: {len(claims)}\n")
+for i, claim in enumerate(claims, 1):
+    print(f"{i}. {claim['text']}")
+    print(f"   Confidence: {claim['confidence']*100:.1f}%\n")
+```
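+## Model Architecture
+- **Model Type**: ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
+- **Hidden Size**: 768
+- **Number of Layers**: 12
+- **Number of Attention Heads**: 12
+- **Vocabulary Size**: 54,343 (`vocab_size` in `config.json`)
+- **Max Sequence Length**: 128 tokens for training and inference (`max_position_embeddings` in `config.json` is 512)
+- **Classification Head**: dense layer (768 → 768) followed by an output projection (768 → 2), as in the standard `ElectraForSequenceClassification` head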
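+## Limitations
+1. **Domain-specific**: optimized for the political and news domain, so performance may drop on everyday conversation or technical documents
+2. **Length limit**: inputs are truncated to 128 tokens, which is at most 126 words because `[CLS]` and `[SEP]` take two positions and every word needs at least one subword token
+3. **Machine-translated data**: trained on data translated from English, so performance may differ on natural Korean phrasing
+4. **Binary classification**: trained on 0/1 labels only; the softmax probability can serve as a rough score, but it is not a graded checkworthiness rating
+5. **False negatives**: missed 25% of the checkworthy claims in the test set (27 of 108, recall 75%)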
+## Future Improvements
+- [ ] Further training on a native Korean fact-checking dataset
+- [ ] Upgrade the model to handle longer context (max_length 256+)
+- [ ] Multi-class classification (checkworthy score on a 0-5 scale)
+- [ ] Add topic-category classification for claims
+## License
+This model follows the license of its base model, [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022).
+## Citation
+If you use this model in research or a project, please cite it as follows:
+```bibtex
+@misc{korean-claim-factcheck-2025,
+  author = {Jonghhhh},
+  title = {Korean Claim Detection Model for Fact-Checking},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/jonghhhh/claim_factcheck}},
+  note = {Based on KcELECTRA-base-v2022}
+}
+```
+## References
+- **Base Model**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
+- **Dataset**: [CLEF CheckThat! Lab 2024](https://clef2025.clef-initiative.eu/index.php?page=Pages/Labs/CheckThat.html)
+- **Paper**: [CheckThat! Lab: Check-Worthiness, Subjectivity, and Persuasion](https://link.springer.com/chapter/10.1007/978-3-031-13643-6_24)
+## Contact
+If you have questions or feedback, please open an issue!
+---
+**Tags**: `claim-detection`, `fact-checking`, `korean`, `electra`, `text-classification`, `checkworthy`, `misinformation-detection`
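
The README's `predict_claim` and `extract_checkworthy_claims` come down to two steps: a softmax over the two logits, then a cutoff on the probability of label `1`. The sketch below reproduces that arithmetic in plain Python without loading the model. The function names `softmax` and `decide` and the logit values are made up for illustration and are not part of this commit.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits, threshold=0.5):
    """Return (label, p_checkworthy) for one sentence's two logits.

    Label 1 is assigned only when P(label 1) >= threshold. With
    threshold=0.5 this agrees with argmax except on exact ties.
    """
    p_non, p_cw = softmax(logits)
    label = 1 if p_cw >= threshold else 0
    return label, p_cw

# Hypothetical logits [non_checkworthy, checkworthy]; not real model output.
for logits in ([-1.2, 1.6], [0.3, 0.9], [2.0, -2.0]):
    for threshold in (0.5, 0.8):
        label, p_cw = decide(logits, threshold)
        print(f"logits={logits} threshold={threshold} -> label={label} p={p_cw:.3f}")
```

For any threshold of 0.5 or higher, this cutoff selects the same sentences as the README's `is_checkworthy and confidence >= threshold` filter. Raising the threshold generally trades recall for precision, which may matter given the 75% test recall reported above.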

config.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "architectures": [
+    "ElectraForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "embedding_size": 768,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "electra",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "summary_activation": "gelu",
+  "summary_last_dropout": 0.1,
+  "summary_type": "first",
+  "summary_use_proj": true,
+  "tokenizer_class": "BertTokenizer",
+  "transformers_version": "4.57.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 54343
+}
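
As a sanity check on `config.json`, the parameter count can be estimated from the listed dimensions and compared with the 511,137,368-byte `model.safetensors` pointer in this commit. The sketch assumes the standard Hugging Face ELECTRA layout: biased linear layers, LayerNorm after the embeddings and after each sublayer, and a classification head made of a 768 → 768 dense layer plus a 768 → 2 projection. That layout is an assumption and is not stated anywhere in the commit.

```python
# Values copied from config.json in this commit.
vocab_size = 54343
hidden = 768            # hidden_size == embedding_size, so no embedding projection
intermediate = 3072
layers = 12
max_positions = 512
type_vocab = 2
num_labels = 2          # from training_metadata.json

def linear(n_in, n_out):
    """Parameters of a dense layer with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * linear(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)
# Assumed head layout: dense (768 -> 768) plus output projection (768 -> 2).
head = linear(hidden, hidden) + linear(hidden, num_labels)

total = embeddings + layers * encoder_layer + head
print(f"parameters: {total:,}")
print(f"float32 bytes: {total * 4:,}")
```

Under these assumptions the estimate is about 127.8M parameters, or roughly 511.1 MB in float32. That is a few tens of kilobytes below the size recorded for `model.safetensors`, a gap small enough to be plausibly explained by the file's header and metadata.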

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3e0fb4a6b3146e77ad3eede30b9246f51bf97c0ab75862c735c92d55c1b48d0d
+size 511137368

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,58 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bc3478f8b371f9fc9e4c05d2fc0dec8b6043db500963a3fa7e0f63967d10323a
+size 5368
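
`training_args.bin` is a binary blob, but the hyperparameters repeated in the README and in `training_metadata.json` are enough to reconstruct the step schedule. The sketch below works through that arithmetic. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these is stated in the commit.

```python
import math

train_samples = 22_501           # training set size from the README
batch_size = 32                  # per_device_train_batch_size
max_epochs = 5                   # num_train_epochs
warmup_ratio = 0.1
final_epoch = 3.409090909090909  # "epoch" field in training_metadata.json

# Assumptions: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
max_steps = steps_per_epoch * max_epochs
warmup_steps = math.ceil(max_steps * warmup_ratio)
steps_at_final_eval = round(final_epoch * steps_per_epoch)

print(f"steps per epoch:     {steps_per_epoch}")
print(f"max training steps:  {max_steps}")
print(f"warmup steps:        {warmup_steps}")
print(f"steps at final eval: {steps_at_final_eval}")
```

Under these assumptions the recorded epoch value corresponds to a whole number of optimizer steps, well short of the five-epoch maximum. That would be consistent with the early stopping (patience 3) described in the README, though the commit does not record the stopping step itself.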

training_metadata.json ADDED

	@@ -0,0 +1,45 @@

+{
+  "model_name": "beomi/KcELECTRA-base-v2022",
+  "task": "binary_text_classification",
+  "num_labels": 2,
+  "training_args": {
+    "num_train_epochs": 5,
+    "per_device_train_batch_size": 32,
+    "per_device_eval_batch_size": 64,
+    "learning_rate": 3e-05,
+    "weight_decay": 0.01,
+    "warmup_ratio": 0.1,
+    "bf16": true,
+    "optimizer": "adamw_torch_fused",
+    "seed": 42
+  },
+  "training_time_minutes": 1.87,
+  "validation_metrics": {
+    "eval_loss": 0.06963474303483963,
+    "eval_accuracy": 0.9757751937984496,
+    "eval_f1": 0.9480249480249481,
+    "eval_precision": 0.9382716049382716,
+    "eval_recall": 0.957983193277311,
+    "eval_runtime": 0.5253,
+    "eval_samples_per_second": 1964.52,
+    "eval_steps_per_second": 32.361,
+    "epoch": 3.409090909090909
+  },
+  "test_metrics": {
+    "eval_loss": 0.26781579852104187,
+    "eval_accuracy": 0.8930817610062893,
+    "eval_f1": 0.826530612244898,
+    "eval_precision": 0.9204545454545454,
+    "eval_recall": 0.75,
+    "eval_runtime": 0.1474,
+    "eval_samples_per_second": 2156.782,
+    "eval_steps_per_second": 33.912,
+    "epoch": 3.409090909090909
+  },
+  "confusion_matrix": {
+    "TN": 203,
+    "FP": 7,
+    "FN": 27,
+    "TP": 81
+  }
+}
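
The `test_metrics` above can be cross-checked against the `confusion_matrix` block using the standard definitions of accuracy, precision, recall, and F1. The snippet below is only a verification sketch; the variable names are chosen for illustration and the four counts are copied from the JSON.

```python
# Counts copied from the confusion_matrix block of training_metadata.json.
cm = {"TN": 203, "FP": 7, "FN": 27, "TP": 81}

total = sum(cm.values())
accuracy = (cm["TP"] + cm["TN"]) / total
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])
specificity = cm["TN"] / (cm["TN"] + cm["FP"])  # true-negative rate
f1 = 2 * precision * recall / (precision + recall)

print(f"samples:     {total}")
print(f"accuracy:    {accuracy:.4f}")
print(f"precision:   {precision:.4f}")
print(f"recall:      {recall:.4f}")
print(f"specificity: {specificity:.4f}")
print(f"f1:          {f1:.4f}")
```

The derived values agree with `eval_accuracy`, `eval_precision`, `eval_recall`, and `eval_f1` in `test_metrics`, and the sample count matches the 318-sentence test set listed in the README.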

vocab.txt ADDED

The diff for this file is too large to render. See raw diff