# Korean Claim Detection Model for Fact-Checking
## Model Description
This model is a binary classifier that **automatically detects claims requiring fact-checking** in Korean sentences. It identifies verifiable claims in news articles, political debates, and social media posts, automating the first step of the fact-checking workflow.
- **Base Model**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Task**: Claim Detection (Check-Worthy Sentence Classification)
- **Language**: Korean
- **Labels**:
  - `0`: Non-checkworthy sentence (no fact-check needed)
  - `1`: Checkworthy claim (fact-check needed)
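In Hugging Face terms, a label scheme like this usually lives in the model config as an `id2label` mapping. A minimal sketch (the exact label strings stored in the published `config.json` may differ):

```python
# Hypothetical id2label/label2id mapping for this binary scheme;
# the actual strings in the model's config.json may differ.
id2label = {0: "non_checkworthy", 1: "checkworthy"}
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[1])              # checkworthy
print(label2id["checkworthy"])  # 1
```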
## Model Objective
Given a Korean input sentence, the model determines:
- Whether it contains a **verifiable factual claim**
- **How strongly it warrants fact-checking** (via the predicted probability)
### Checkworthy Claim Examples
✅ **Label 1 (Checkworthy)**:
- "청년 실업률이 지난 3년간 계속 상승했습니다" (Youth unemployment has risen continuously over the past three years)
- "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다" (Korea's GDP growth rate exceeded the OECD average)
- "이 정책으로 일자리가 100만 개 창출될 것입니다" (This policy will create one million jobs)

❌ **Label 0 (Non-checkworthy)**:
- "오늘 토론회는 SBS 여의도 스튜디오에서 진행하고 있고요" (Today's debate is being held at the SBS Yeouido studio)
- "국민 여러분께 감사드립니다" (Thank you, fellow citizens)
- "제 생각에는 이 정책이 좋은 것 같습니다" (In my opinion, this policy seems good)
## Dataset
### Source
- **Source**: CLEF CheckThat! Lab 2024
- **Task**: Task 1 - Check-Worthiness Estimation
- **Original Dataset**: English political debates and speeches
- **Translation**: Machine-translated into Korean for training
### Size
- **Training Set**: 22,501 samples
- **Validation Set**: 1,032 samples
- **Test Set**: 318 samples
### Characteristics
- Sentences drawn from political debates, speeches, and news articles
- High-quality labels annotated by professional fact-checkers
- Class imbalance: Label 0 (65%) vs. Label 1 (35%)
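One common mitigation for a 65/35 split like this (not necessarily the one used for this model) is inverse-frequency class weighting in the loss. A quick sketch of the arithmetic, using weights of the form `w_c = 1 / (K * p_c)` so that the expected weight over the data stays 1:

```python
# Inverse-frequency class weights for the 65/35 split described above.
# With K = 2 classes and w_c = 1 / (K * p_c), the weighted loss keeps
# the same overall scale: sum_c p_c * w_c == 1.
p = {0: 0.65, 1: 0.35}  # class priors from the training set
K = len(p)
weights = {c: 1.0 / (K * p_c) for c, p_c in p.items()}

print(weights)  # label 0 is down-weighted (~0.77), label 1 up-weighted (~1.43)
```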
## Training Details
### Hyperparameters
- **Epochs**: 5
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 3e-05
- **Weight Decay**: 0.01
- **Warmup Ratio**: 0.1
- **Precision**: BF16
- **Optimizer**: adamw_torch_fused
- **Max Sequence Length**: 128 tokens
- **Seed**: 42
### Training Environment
- **GPU**: NVIDIA GeForce RTX 4090 (24 GB)
- **Training Time**: 1.87 minutes
- **Framework**: Hugging Face Transformers
- **Early Stopping**: patience 3, monitoring validation F1
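With warmup ratio 0.1, batch size 32, 5 epochs, and 22,501 training samples, the learning-rate warmup covers roughly the first 10% of optimizer steps. A back-of-the-envelope check (assuming no gradient accumulation and a dataloader that keeps the final partial batch):

```python
import math

samples, batch_size, epochs, warmup_ratio = 22_501, 32, 5, 0.1

steps_per_epoch = math.ceil(samples / batch_size)  # 704 batches per epoch
total_steps = steps_per_epoch * epochs             # 3520 optimizer steps
warmup_steps = int(total_steps * warmup_ratio)     # 352 warmup steps

print(steps_per_epoch, total_steps, warmup_steps)  # 704 3520 352
```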
## Performance
### Validation Metrics
- **Accuracy**: 97.58%
- **F1 Score**: 94.80%
- **Precision**: 93.83%
- **Recall**: 95.80%
### Test Metrics
- **Accuracy**: 89.31%
- **F1 Score**: 82.65%
- **Precision**: 92.05%
- **Recall**: 75.00%
### Confusion Matrix (Test Set)
```
              Predicted
               0     1
Actual  0    203     7   (96.7% true negative rate)
        1     27    81   (75.0% recall)
```
**Interpreting the results**:
- **High precision (92.05%)**: of the sentences the model flags as checkworthy, 92% genuinely need fact-checking
- **Moderate recall (75.00%)**: the model catches 75% of the truly checkworthy sentences
- **Few false positives (7)**: unnecessary fact-check requests are minimized
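The test metrics above can be re-derived directly from the confusion matrix; a quick sanity check:

```python
# Recompute the test metrics from the confusion matrix:
# TN = 203, FP = 7, FN = 27, TP = 81
tn, fp, fn, tp = 203, 7, 27, 81

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.4f} p={precision:.4f} r={recall:.4f} f1={f1:.4f}")
# acc=0.8931 p=0.9205 r=0.7500 f1=0.8265
```

All four figures match the reported test metrics, so the confusion matrix and the summary table are consistent with each other.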
## How to Use
### 1. Installation
```bash
pip install transformers torch
```
### 2. Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and model
model_name = "jonghhhh/claim_factcheck"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Use a GPU if available (optional)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

print(f"Model loaded (device: {device})")
```
### 3. Inference Example
#### Single-Sentence Classification
```python
def predict_claim(text):
    """
    Decide whether the input sentence is a checkworthy claim.

    Args:
        text (str): Korean sentence to analyze

    Returns:
        dict: {
            'text': input sentence,
            'is_checkworthy': True/False,
            'confidence': 0.0-1.0 (confidence in the predicted label),
            'label': 0 or 1,
            'probabilities': {'non_checkworthy': 0.xx, 'checkworthy': 0.xx}
        }
    """
    # Tokenize
    inputs = tokenizer(
        text,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        predicted_label = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][predicted_label].item()

    return {
        'text': text,
        'is_checkworthy': bool(predicted_label),
        'confidence': confidence,
        'label': predicted_label,
        'probabilities': {
            'non_checkworthy': probs[0][0].item(),
            'checkworthy': probs[0][1].item()
        }
    }

# Example usage
examples = [
    "오늘 토론회는 SBS 여의도 스튜디오에서 진행하고 있고요.",
    "청년 실업률이 최근 3년간 계속 상승하고 있습니다.",
    "우리나라 GDP 성장률은 OECD 평균을 넘어섰습니다.",
    "국민 여러분께 진심으로 감사드립니다."
]

for text in examples:
    result = predict_claim(text)
    print(f"\nInput: {result['text']}")
    print("Checkworthy" if result['is_checkworthy'] else "Not checkworthy")
    print(f"Confidence: {result['confidence']*100:.1f}%")
    print(f"Probabilities: Non-CW {result['probabilities']['non_checkworthy']*100:.1f}% | CW {result['probabilities']['checkworthy']*100:.1f}%")
```
**Example output**:
```
Input: 청년 실업률이 최근 3년간 계속 상승하고 있습니다.
Checkworthy
Confidence: 94.3%
Probabilities: Non-CW 5.7% | CW 94.3%

Input: 오늘 토론회는 SBS 여의도 스튜디오에서 진행하고 있고요.
Not checkworthy
Confidence: 98.2%
Probabilities: Non-CW 98.2% | CW 1.8%
```
#### Batch Processing
```python
def predict_claims_batch(texts, batch_size=32):
    """
    Classify multiple sentences in batches.

    Args:
        texts (list): list of sentences
        batch_size (int): batch size

    Returns:
        list: prediction result dict for each sentence
    """
    results = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]

        # Tokenize the batch
        inputs = tokenizer(
            batch_texts,
            truncation=True,
            max_length=128,
            padding=True,
            return_tensors="pt"
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Batch inference
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)
            predicted_labels = torch.argmax(probs, dim=-1).cpu().numpy()

        # Collect results
        for j, text in enumerate(batch_texts):
            results.append({
                'text': text,
                'is_checkworthy': bool(predicted_labels[j]),
                'confidence': probs[j][predicted_labels[j]].item(),
                'label': int(predicted_labels[j])
            })
    return results

# Batch inference example
texts = [
    "국회의원 정수를 300명으로 확대하겠습니다.",
    "감사합니다.",
    "2024년 경제성장률이 2.1%를 기록했습니다.",
    # ... more sentences
]

batch_results = predict_claims_batch(texts)
checkworthy_claims = [r for r in batch_results if r['is_checkworthy']]
print(f"{len(checkworthy_claims)} of {len(texts)} sentences need fact-checking")
```
### 4. Real-World Use Case
```python
# Extract fact-check targets from a news article
def extract_checkworthy_claims(article_text, threshold=0.7):
    """
    Extract the sentences in an article that need fact-checking.

    Args:
        article_text (str): full article text
        threshold (float): confidence threshold for checkworthy (0.0-1.0)

    Returns:
        list: sentences to fact-check, sorted by confidence
    """
    # Sentence splitting (simplistic example)
    sentences = [s.strip() for s in article_text.split('.') if s.strip()]

    # Batch prediction
    results = predict_claims_batch(sentences)

    # Keep only checkworthy sentences at or above the threshold
    checkworthy_claims = [
        r for r in results
        if r['is_checkworthy'] and r['confidence'] >= threshold
    ]

    # Sort by confidence, highest first
    checkworthy_claims.sort(key=lambda x: x['confidence'], reverse=True)
    return checkworthy_claims

# Example usage
article = """
정부는 오늘 경제 정책 방향을 발표했습니다.
청년 실업률이 지난해 대비 2.3%p 감소했다고 밝혔습니다.
이는 역대 최대 폭의 하락입니다.
앞으로도 일자리 창출에 힘쓰겠다고 강조했습니다.
"""

claims = extract_checkworthy_claims(article, threshold=0.8)
print(f"Fact-check targets found: {len(claims)}\n")
for i, claim in enumerate(claims, 1):
    print(f"{i}. {claim['text']}")
    print(f"   Confidence: {claim['confidence']*100:.1f}%\n")
```
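Note that `split('.')` above is a deliberately crude splitter: it will break numbers such as "2.3%p" apart. For real articles a dedicated Korean sentence splitter (e.g. the `kss` library) is a better choice; as a stdlib-only improvement, splitting on sentence-final punctuation followed by whitespace avoids the decimal-point problem:

```python
import re

def split_sentences(text):
    """Split text on ./!/? followed by whitespace, keeping the punctuation.

    A rough stdlib-only heuristic; a dedicated Korean splitter such as
    `kss` handles abbreviations and quoted speech far better.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in parts if s]

text = "정부는 오늘 경제 정책 방향을 발표했습니다. 실업률이 2.3%p 감소했습니다."
print(split_sentences(text))  # two sentences; "2.3%p" stays intact
```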
## Model Architecture
- **Model Type**: ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
- **Hidden Size**: 768
- **Number of Layers**: 12
- **Number of Attention Heads**: 12
- **Vocabulary Size**: 32,000
- **Max Sequence Length**: 128 tokens
- **Classification Head**: Linear layer (768 → 2)
## Limitations
1. **Domain specificity**: optimized for political/news text; performance may degrade on casual conversation or technical documents
2. **Length limit**: processes at most 128 tokens (roughly 100-150 words)
3. **Machine-translated data**: trained on data translated from English, so performance may differ on naturally written Korean
4. **Binary classification**: check-worthiness is only labeled 0/1; no fine-grained score beyond the predicted probability
5. **False negatives**: up to 25% of true claims may be missed (recall 75%)
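Because recall is the weaker metric, one option (at the cost of precision) is to lower the decision threshold on the checkworthy probability instead of taking the argmax. A minimal sketch with hypothetical model probabilities:

```python
def classify(p_checkworthy, threshold=0.5):
    """Flag a sentence as checkworthy when P(label=1) >= threshold."""
    return p_checkworthy >= threshold

# Hypothetical P(checkworthy) outputs for four sentences
probs = [0.92, 0.41, 0.35, 0.07]

default = [classify(p) for p in probs]           # argmax-equivalent (0.5)
high_recall = [classify(p, 0.3) for p in probs]  # looser threshold

print(default)      # [True, False, False, False]
print(high_recall)  # [True, True, True, False]
```

The threshold should be tuned on the validation set, since every borderline sentence admitted this way trades a possible false negative for a possible false positive.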
## Future Improvements
- [ ] Further training on native Korean fact-checking datasets
- [ ] Model upgrade for longer contexts (max_length 256+)
- [ ] Multi-class classification (check-worthiness on a 0-5 scale)
- [ ] Topic/category classification of claims
## License
This model follows the license of its base model, [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022).
## Citation
If you use this model in research or a project, please cite it as follows:
```bibtex
@misc{korean-claim-factcheck-2025,
  author = {Jonghhhh},
  title = {Korean Claim Detection Model for Fact-Checking},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/jonghhhh/claim_factcheck}},
  note = {Based on KcELECTRA-base-v2022}
}
```
## References
- **Base Model**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Dataset**: [CLEF CheckThat! Lab 2024](https://clef2025.clef-initiative.eu/index.php?page=Pages/Labs/CheckThat.html)
- **Paper**: [CheckThat! Lab: Check-Worthiness, Subjectivity, and Persuasion](https://link.springer.com/chapter/10.1007/978-3-031-13643-6_24)
## Contact
If you have questions or feedback, please open an issue!
---
**Tags**: `claim-detection`, `fact-checking`, `korean`, `electra`, `text-classification`, `checkworthy`, `misinformation-detection`