CTI-Bench Dataset Processing Script
This repository contains the processing script used to convert the original CTI-Bench TSV files into well-structured Hugging Face datasets with comprehensive documentation.
🎯 Overview
The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:
- cti_bench_mcq - Multiple Choice Questions (2,500 entries)
- cti_bench_ate - Attack Technique Extraction (60 entries)
- cti_bench_vsp - Vulnerability Severity Prediction (1,000 entries)
- cti_bench_taa - Threat Actor Attribution (50 entries)
- cti_bench_rcm - Root Cause Mapping (1,000 entries)
- cti_bench_rcm_2021 - Root Cause Mapping 2021 (1,000 entries)
📊 Processed Datasets
All processed datasets are available at: tuandunghcmut
- 📋 Multiple Choice Questions
- 🔍 Attack Technique Extraction
- ⚠️ Vulnerability Severity Prediction
- 🎭 Threat Actor Attribution
- 🔄 Root Cause Mapping
- 🔄 Root Cause Mapping 2021
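A minimal sketch of loading one of the processed datasets listed above. It assumes each dataset lives at `<username>/<dataset_name>` on the Hub (for example `tuandunghcmut/cti_bench_mcq`); the helper names `repo_id` and `load_cti_dataset` are hypothetical and not part of the script.

```python
DATASET_NAMES = [
    "cti_bench_mcq",
    "cti_bench_ate",
    "cti_bench_vsp",
    "cti_bench_taa",
    "cti_bench_rcm",
    "cti_bench_rcm_2021",
]


def repo_id(dataset_name: str, username: str = "tuandunghcmut") -> str:
    """Build a Hub repository id. The '<username>/<dataset>' layout is an assumption."""
    if dataset_name not in DATASET_NAMES:
        raise ValueError(f"unknown dataset: {dataset_name}")
    return f"{username}/{dataset_name}"


def load_cti_dataset(dataset_name: str, username: str = "tuandunghcmut"):
    """Download one processed dataset from the Hub (requires network access)."""
    # Imported lazily so that repo_id() stays usable without the datasets package.
    from datasets import load_dataset

    return load_dataset(repo_id(dataset_name, username))
```

Calling `load_cti_dataset("cti_bench_mcq")` should return the dataset object, provided the repository exists under that name.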
🚀 Usage
Prerequisites
pip install pandas datasets huggingface_hub
Authentication
Make sure you're logged in to Hugging Face:
huggingface-cli login
# or
hf auth login
Running the Script
- Clone the original CTI-Bench repository:
git clone https://github.com/xashru/cti-bench.git
- Run the processing script:
python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
Command Line Options
- --username: Your Hugging Face username (required)
- --token: Hugging Face token (optional if already logged in)
- --data-dir: Path to the CTI-Bench data directory (default: cti-bench/data)
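The options above could be declared with `argparse` roughly as follows. This is a sketch of an equivalent parser, not the script's actual code; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented command line options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True, help="Hugging Face username")
    parser.add_argument(
        "--token",
        default=None,
        help="Hugging Face token (optional if already logged in)",
    )
    parser.add_argument(
        "--data-dir",
        default="cti-bench/data",
        help="Path to the CTI-Bench data directory",
    )
    return parser
```

With this parser, `build_parser().parse_args(["--username", "alice"])` leaves `token` as `None` and `data_dir` at its default.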
🔧 Features
Data Processing
- ✅ Standardized Schema: All datasets include consistent field naming
- ✅ Task Type Labels: Each entry includes a task_type field for identification
- ✅ Clean Data: Proper handling of missing values and data types
- ✅ Chunk Processing: Handles large files efficiently
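The processing features above can be sketched as follows. The script itself depends on pandas; this illustration uses the standard library `csv` module instead so it runs without extra packages. The snake_case column normalization, the empty-string fill for missing values, and all function names are assumptions rather than the script's actual behavior.

```python
import csv
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def read_tsv_in_chunks(
    lines: Iterable[str], chunk_size: int = 500
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows from tab-separated text lines."""
    reader = csv.DictReader(lines, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def clean_row(row: Dict[str, str], task_type: str) -> Dict[str, str]:
    """Normalize column names, replace missing values with '', and add task_type."""
    cleaned = {}
    for key, value in row.items():
        if key is None:  # surplus cells beyond the header are dropped
            continue
        name = key.strip().lower().replace(" ", "_")
        cleaned[name] = (value or "").strip()
    cleaned["task_type"] = task_type
    return cleaned


def process_tsv(
    lines: Iterable[str], task_type: str, chunk_size: int = 500
) -> List[Dict[str, str]]:
    """Read a TSV chunk by chunk and return the cleaned rows."""
    rows: List[Dict[str, str]] = []
    for chunk in read_tsv_in_chunks(lines, chunk_size):
        rows.extend(clean_row(row, task_type) for row in chunk)
    return rows
```

An open file handle can be passed as `lines`, since `csv.DictReader` accepts any iterable of strings.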
Documentation
- 📚 Comprehensive READMEs: Each dataset gets a detailed README with:
- Dataset description and statistics
- Field explanations
- Usage examples
- Citation information
- Task categories
- 🎯 Task-Specific Info: Tailored documentation for each CTI task type
- 📖 Code Examples: Ready-to-use Python snippets
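As an illustration of how a per-dataset README might be assembled from the pieces listed above, here is a small sketch. The function name `render_dataset_card`, the section titles, and the layout are hypothetical; the real script's templates may differ.

```python
from typing import Dict

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences


def render_dataset_card(
    title: str,
    description: str,
    num_rows: int,
    fields: Dict[str, str],
    repo_id: str,
) -> str:
    """Return README text with description, statistics, fields, and a usage snippet."""
    lines = [
        f"# {title}",
        "",
        description,
        "",
        "## Statistics",
        "",
        f"- Entries: {num_rows:,}",
        "",
        "## Fields",
        "",
    ]
    lines += [f"- `{name}`: {meaning}" for name, meaning in fields.items()]
    lines += [
        "",
        "## Usage",
        "",
        f"{FENCE}python",
        "from datasets import load_dataset",
        f'ds = load_dataset("{repo_id}")',
        FENCE,
        "",
    ]
    return "\n".join(lines)
```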
Upload Features
- 🚀 Batch Processing: Processes all 6 datasets in one run
- 📤 Auto-Upload: Automatically uploads to Hugging Face Hub
- 📝 README Integration: Uploads documentation alongside data
- ⚡ Progress Tracking: Detailed logging and progress reports
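The upload flow described above could look roughly like this. It is a sketch under stated assumptions: it uses `Dataset.from_list` and `Dataset.push_to_hub` from `datasets` and `HfApi.upload_file` from `huggingface_hub`, which is one reasonable way to do it but not necessarily what the script does. The function names, the logger name, and the `uploader` parameter are hypothetical.

```python
import logging
from typing import Callable, Dict, List, Optional, Tuple

logger = logging.getLogger("cti_bench_upload")

Rows = List[Dict[str, str]]


def upload_dataset(
    rows: Rows, readme_text: str, repo_id: str, token: Optional[str] = None
) -> None:
    """Push the rows as a dataset, then upload the README next to the data."""
    # Lazy imports: only needed when an upload actually happens.
    from datasets import Dataset
    from huggingface_hub import HfApi

    Dataset.from_list(rows).push_to_hub(repo_id, token=token)
    HfApi(token=token).upload_file(
        path_or_fileobj=readme_text.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="dataset",
    )


def upload_all(
    prepared: Dict[str, Tuple[Rows, str]],
    username: str,
    token: Optional[str] = None,
    uploader: Callable[[Rows, str, str, Optional[str]], None] = upload_dataset,
) -> List[str]:
    """Upload every prepared dataset in one run and log progress along the way."""
    uploaded = []
    total = len(prepared)
    for index, (name, (rows, readme_text)) in enumerate(prepared.items(), start=1):
        repo_id = f"{username}/{name}"
        logger.info("[%d/%d] uploading %d rows to %s", index, total, len(rows), repo_id)
        uploader(rows, readme_text, repo_id, token)
        uploaded.append(repo_id)
    return uploaded
```

Passing a different `uploader` makes it possible to dry-run the batch loop without touching the Hub.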
📁 Dataset Structure
Each processed dataset has a fixed schema. The MCQ, ATE, and VSP schemas are:
Multiple Choice Questions (MCQ)
{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
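Given the MCQ schema above, a simple way to score model responses against `ground_truth` might look like this. The answer-extraction rule (take the last standalone capital letter A to D in the response) is a hypothetical heuristic, not the benchmark's official evaluation procedure.

```python
import re
from typing import List, Optional

CHOICE_PATTERN = re.compile(r"\b([ABCD])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last standalone A-D letter in a response, or None if absent."""
    matches = CHOICE_PATTERN.findall(response)
    return matches[-1] if matches else None


def mcq_accuracy(ground_truths: List[str], responses: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the ground truth."""
    if len(ground_truths) != len(responses):
        raise ValueError("ground_truths and responses must have the same length")
    if not ground_truths:
        return 0.0
    correct = sum(
        extract_choice(response) == truth.strip()
        for truth, response in zip(ground_truths, responses)
    )
    return correct / len(ground_truths)
```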
Attack Technique Extraction (ATE)
{
    'url': str,           # Source MITRE software URL
    'platform': str,      # Target platform (Enterprise, Mobile, etc.)
    'description': str,   # Malware/attack description
    'prompt': str,        # Full instruction with MITRE reference
    'ground_truth': str,  # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str      # Always "attack_technique_extraction"
}
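Because the ATE `ground_truth` field is a comma-separated string of technique IDs, it usually needs parsing before comparison. The sketch below extracts IDs with a regular expression and computes per-sample precision, recall, and F1; the function names and the choice to collapse sub-techniques such as `T1071.001` to their parent ID are assumptions.

```python
import re
from typing import Dict, Set

TECHNIQUE_ID = re.compile(r"\bT\d{4}\b")


def parse_techniques(text: str) -> Set[str]:
    """Collect MITRE technique IDs such as 'T1071' from free text."""
    return set(TECHNIQUE_ID.findall(text))


def technique_scores(prediction: str, ground_truth: str) -> Dict[str, float]:
    """Compare predicted and reference technique sets for a single sample."""
    predicted = parse_techniques(prediction)
    expected = parse_techniques(ground_truth)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```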
Vulnerability Severity Prediction (VSP)
{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
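To compare a predicted `cvss_vector` with the reference numerically, the vector can be converted to a base score. The sketch below follows the CVSS v3.1 base score equations and metric weights as published in the specification; it covers base metrics only, does no input validation, and the function names are hypothetical.

```python
import math
from typing import Dict

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required weights depend on whether Scope is changed.
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def parse_vector(vector: str) -> Dict[str, str]:
    """Split 'CVSS:3.1/AV:N/AC:L/...' into a metric-to-value mapping."""
    parts = vector.strip().split("/")
    if parts and parts[0].startswith("CVSS:"):
        parts = parts[1:]
    return dict(part.split(":", 1) for part in parts)


def roundup(value: float) -> float:
    """Round up to one decimal place using the integer method from CVSS v3.1."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the CVSS v3.1 base score from a vector string."""
    m = parse_vector(vector)
    changed = m["S"] == "C"
    iss = 1 - (
        (1 - WEIGHTS["C"][m["C"]])
        * (1 - WEIGHTS["I"][m["I"]])
        * (1 - WEIGHTS["A"][m["A"]])
    )
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    exploitability = (
        8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]] * pr * WEIGHTS["UI"][m["UI"]]
    )
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))
```

The absolute difference between `base_score(predicted)` and `base_score(reference)` then gives a per-sample error that can be averaged over the dataset.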
🎓 Original CTI-Bench Paper
This processing script is based on the CTI-Bench dataset from:
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
NeurIPS 2024
GitHub | Hugging Face
📄 Citation
If you use these processed datasets or this script, please cite the original paper:
@article{ctibench2024,
title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
author={[Authors]},
journal={NeurIPS 2024},
year={2024}
}
🤝 Contributing
Feel free to submit issues or pull requests to improve the processing script or documentation.
📜 License
This script is provided under the same license terms as the original CTI-Bench dataset.
Total Processed Samples: 5,610 cybersecurity evaluation examples across the 6 datasets! 🎯