# CTI-Bench Dataset Processing Script

This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.

## 🎯 Overview

The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:

1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
2. **cti_bench_ate** - Attack Technique Extraction (60 entries)
3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
5. **cti_bench_rcm** - Reverse Cyber Mapping (1,000 entries)
6. **cti_bench_rcm_2021** - Reverse Cyber Mapping 2021 (1,000 entries)

## 📊 Processed Datasets

All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)

- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
- 🔄 [Reverse Cyber Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
- 🔄 [Reverse Cyber Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)

## 🚀 Usage

### Prerequisites

```bash
pip install pandas datasets huggingface_hub
```

### Authentication

Make sure you're logged in to Hugging Face:

```bash
huggingface-cli login
# or
hf auth login
```

### Running the Script

1. Clone the original CTI-Bench repository:

   ```bash
   git clone https://github.com/xashru/cti-bench.git
   ```

2. Run the processing script:

   ```bash
   python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
   ```

### Command Line Options

- `--username`: Your Hugging Face username (required)
- `--token`: Hugging Face token (optional if already logged in)
- `--data-dir`: Path to CTI-bench data directory (default: `cti-bench/data`)

## 🔧 Features

### Data Processing

- ✅ **Standardized Schema**: All datasets include consistent field naming
- ✅ **Task Type Labels**: Each entry includes a `task_type` field for identification
- ✅ **Clean Data**: Proper handling of missing values and data types
- ✅ **Chunk Processing**: Handles large files efficiently

### Documentation

- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
  - Dataset description and statistics
  - Field explanations
  - Usage examples
  - Citation information
  - Task categories
- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
- 📖 **Code Examples**: Ready-to-use Python snippets

### Upload Features

- 🚀 **Batch Processing**: Processes all 6 datasets in one run
- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
- 📝 **README Integration**: Uploads documentation alongside data
- ⚡ **Progress Tracking**: Detailed logging and progress reports

## 📁 Dataset Structure

Each processed dataset follows this structure:

### Multiple Choice Questions (MCQ)

```python
{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
```

### Attack Technique Extraction (ATE)

```python
{
    'url': str,           # Source MITRE software URL
    'platform': str,      # Target platform (Enterprise, Mobile, etc.)
    'description': str,   # Malware/attack description
    'prompt': str,        # Full instruction with MITRE reference
    'ground_truth': str,  # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str      # Always "attack_technique_extraction"
}
```

### Vulnerability Severity Prediction (VSP)

```python
{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
```

## 🎓 Original CTI-Bench Paper

This processing script is based on the CTI-Bench dataset from:

> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**
> NeurIPS 2024
> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)

## 📄 Citation

If you use these processed datasets or this script, please cite the original paper:

```bibtex
@article{ctibench2024,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={[Authors]},
  journal={NeurIPS 2024},
  year={2024}
}
```

## 🤝 Contributing

Feel free to submit issues or pull requests to improve the processing script or documentation.

## 📜 License

This script is provided under the same license terms as the original CTI-Bench dataset.

---

**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 different task types! 🎯
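
As an illustration of how the ATE `ground_truth` field described under Dataset Structure might be consumed, here is a minimal sketch that parses technique IDs and scores a model answer against them. The helper names `parse_technique_ids` and `score_ate` are hypothetical and are not part of `process_cti_bench_with_docs.py`. The sketch assumes that `ground_truth` is a comma-separated string of IDs like `"T1071, T1573"` and that sub-technique suffixes (for example `.001`) can be ignored. Set-based precision, recall, and F1 are one possible metric, not necessarily the one used by the original benchmark.

```python
from __future__ import annotations

import re

# Hypothetical helpers for illustration only; not part of the processing script.
# Matches a technique ID such as "T1071"; any ".001" sub-technique suffix is ignored.
_TECHNIQUE_ID = re.compile(r"\bT\d{4}\b")


def parse_technique_ids(text: str) -> set[str]:
    """Extract the set of MITRE ATT&CK technique IDs mentioned in a string."""
    return set(_TECHNIQUE_ID.findall(text.upper()))


def score_ate(prediction: str, ground_truth: str) -> dict[str, float]:
    """Compare predicted technique IDs with the ATE ground_truth field."""
    predicted = parse_technique_ids(prediction)
    expected = parse_technique_ids(ground_truth)
    hits = len(predicted & expected)

    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # One of two predicted IDs is correct, and one of two expected IDs is found.
    print(score_ate("The sample uses T1071 and T1059.", "T1071, T1573"))
    # → {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

The same parsing step works on a record loaded from the processed ATE dataset, since each record carries `ground_truth` as a plain string.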