# ConDec: Detecting Contextual Deception in Scientific ML Claims

A benchmark and accompanying paper for detecting technically-true-but-misleading claims in ML research.

## What is Contextual Deception?
A claim is contextually deceptive when it is literally true but the context in which it's presented would lead a reasonable reader to a materially different conclusion than the full evidence would support.
**Example:** "Our model achieves 95.3% accuracy on benchmark X", but the model was tested on 10 benchmarks and degraded on 7. The statement is true; the omitted context makes it deceptive.
## This Repository Contains

### Paper

- `paper.tex` – Full LaTeX paper (ACL format) with all sections, tables, and bibliography
- `SUPPLEMENTARY.md` – Detailed supplementary materials (annotation guidelines, error analysis, prompts)
- `TAXONOMY.md` – Full taxonomy of 8 contextual deception types with definitions and examples

### Dataset

The benchmark dataset lives at `t6harsh/contextual-deception-detection`.

It includes:

- `dataset_builder.py` – Complete dataset construction pipeline (Sources A, B, C)
- `evaluate.py` – Evaluation framework for all 3 benchmark tasks
- `finetune_condec_bert.py` – Fine-tuning script for ConDec-BERT
- `analyze_dataset.py` – Analysis and LaTeX table generation
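If the dataset is hosted on the Hugging Face Hub under that ID, it should load with the `datasets` library; the split name below is an assumption, so check the dataset card for the actual configurations.

```python
from datasets import load_dataset

# Load the ConDec benchmark from the Hub. The "train" split name is an
# assumption; see the dataset card for the actual splits and configs.
ds = load_dataset("t6harsh/contextual-deception-detection", split="train")
print(ds[0])
```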
## Key Results
| Model | Accuracy | Macro F1 | Misleading F1 |
|---|---|---|---|
| Human (expert) | 89.7% | 87.2 | 85.1 |
| ConDec-BERT (finetuned) | 74.1% | 70.8 | 67.4 |
| o1-preview | 68.4% | 65.1 | 61.3 |
| GPT-4 | 62.3% | 58.7 | 54.1 |
| Llama 3.1 405B | 58.3% | 54.2 | 49.8 |
All models substantially underperform humans. The gap is largest on Scope Exaggeration and Ambiguous Hedging, precisely the deception types most common in ML papers.
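For reference, the table's metrics can be reproduced from per-claim predictions with scikit-learn. The label strings below are placeholders, and "Misleading F1" is read here as the F1 of the misleading class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy predictions -- the label set is assumed for illustration.
y_true = ["misleading", "supported", "misleading", "insufficient"]
y_pred = ["misleading", "supported", "supported", "insufficient"]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
# "Misleading F1": the F1 score computed for the misleading class alone.
misleading_f1 = f1_score(y_true, y_pred, labels=["misleading"], average=None)[0]
print(f"acc={accuracy:.2f}  macro_f1={macro_f1:.2f}  misleading_f1={misleading_f1:.2f}")
```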
## 8 Deception Types

- **Selective Reporting** – Cherry-picking favorable results
- **Scope Exaggeration** – Claiming broader applicability than supported
- **Baseline Manipulation** – Unfair comparisons
- **Metric Gaming** – Choosing metrics that hide failures
- **Opportunistic Splitting** – Non-standard evaluation protocols
- **Context Omission** – Leaving out critical experimental details
- **Ambiguous Hedging** – Vague language masking weak results
- **Causal Overclaiming** – Implying causation from correlation
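For code that filters or aggregates by type, the taxonomy maps naturally onto an enum; the string values below are illustrative identifiers, not necessarily the labels used in the released data.

```python
from enum import Enum

class DeceptionType(Enum):
    """The 8-type taxonomy from TAXONOMY.md. The string values are
    illustrative; the dataset may encode the types differently."""
    SELECTIVE_REPORTING = "selective_reporting"
    SCOPE_EXAGGERATION = "scope_exaggeration"
    BASELINE_MANIPULATION = "baseline_manipulation"
    METRIC_GAMING = "metric_gaming"
    OPPORTUNISTIC_SPLITTING = "opportunistic_splitting"
    CONTEXT_OMISSION = "context_omission"
    AMBIGUOUS_HEDGING = "ambiguous_hedging"
    CAUSAL_OVERCLAIMING = "causal_overclaiming"
```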
## Three Benchmark Tasks

1. **Contextual Sufficiency Judgment** – Classify whether the context supports the claim (4-way)
2. **Missing Context Identification** – Generate descriptions of what is absent
3. **Reader Inference Prediction** – Predict naive vs. informed reader inferences
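As a minimal sketch of Task 1 at inference time, assuming the fine-tuned checkpoint produced by the Quick Start below and a simple claim/context concatenation (both assumptions), a `transformers` pipeline would look like this:

```python
from transformers import pipeline

# Run the 4-way contextual sufficiency judgment (Task 1) with the
# fine-tuned checkpoint. The "./condec-bert" path matches the Quick Start
# below; the claim/context input format is an assumption.
classifier = pipeline("text-classification", model="./condec-bert")
result = classifier(
    "Claim: Our model achieves 95.3% accuracy on benchmark X. "
    "Context: Tested on 10 benchmarks; accuracy degraded on 7."
)
print(result)  # e.g. [{"label": ..., "score": ...}]
```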
## Citation

```bibtex
@inproceedings{condec2025,
  title={Detecting Contextual Deception: A Benchmark for Identifying
         Technically-True-but-Misleading Claims in Scientific ML Papers},
  author={Anonymous},
  booktitle={Proceedings of ACL},
  year={2025}
}
```
## Quick Start

```bash
# Install dependencies
pip install torch transformers datasets scikit-learn

# Generate dataset
python dataset_builder.py

# Analyze dataset
python analyze_dataset.py --data condec_dataset.jsonl --latex

# Fine-tune ConDec-BERT
python finetune_condec_bert.py --data condec_dataset.jsonl --output ./condec-bert

# Evaluate
python evaluate.py --data condec_dataset.jsonl --task task1
```