# CTI-Bench Dataset Processing Script

This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.
## 🎯 Overview

The script processes six CTI-Bench task files and uploads each one as a separate, documented dataset:

1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
2. **cti_bench_ate** - Attack Technique Extraction (60 entries)
3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
5. **cti_bench_rcm** - Root Cause Mapping (1,000 entries)
6. **cti_bench_rcm_2021** - Root Cause Mapping 2021 (1,000 entries)
## 📊 Processed Datasets

All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)

- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)
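As a minimal sketch, any of the datasets listed above could be loaded with the `datasets` library. The helper names `repo_id` and `load_task` are hypothetical and not part of the processing script. The repository IDs follow the links above. Split names are not stated here, so the sketch loads whatever splits each repository exposes.

```python
# Short task keys matching the dataset repository suffixes listed above.
TASKS = ("mcq", "ate", "vsp", "taa", "rcm", "rcm_2021")


def repo_id(task: str, owner: str = "tuandunghcmut") -> str:
    """Build the Hub repository ID for one task (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return f"{owner}/cti_bench_{task}"


def load_task(task: str):
    """Download one processed dataset from the Hub (requires network access)."""
    # Imported lazily so repo_id() stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(task))
```

Calling `load_task("mcq")` should return a `DatasetDict`; printing it shows which splits and columns are available.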
## 🚀 Usage

### Prerequisites

```bash
pip install pandas datasets huggingface_hub
```

### Authentication

Make sure you're logged in to Hugging Face:

```bash
huggingface-cli login
# or
hf auth login
```
### Running the Script

1. Clone the original CTI-Bench repository:

```bash
git clone https://github.com/xashru/cti-bench.git
```

2. Run the processing script:

```bash
python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
```
### Command Line Options

- `--username`: Your Hugging Face username (required)
- `--token`: Hugging Face token (optional if already logged in)
- `--data-dir`: Path to the CTI-Bench data directory (default: `cti-bench/data`)
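The options above could be declared with `argparse` roughly as follows. This is a sketch under the assumption that the script uses `argparse`; the function name `build_parser` and the help strings are illustrative, and the script's actual parser is not shown in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented options (illustrative, not the script's own code)."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True,
                        help="Hugging Face username that will own the datasets")
    parser.add_argument("--token", default=None,
                        help="Hugging Face token; omit if already logged in")
    parser.add_argument("--data-dir", default="cti-bench/data",
                        help="Path to the CTI-Bench data directory")
    return parser


# Parse an explicit argument list instead of sys.argv so the example runs anywhere.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```

Note that `argparse` exposes `--data-dir` as the attribute `args.data_dir`.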
## 🔧 Features

### Data Processing

- ✅ **Standardized Schema**: All datasets include consistent field naming
- ✅ **Task Type Labels**: Each entry includes a `task_type` field for identification
- ✅ **Clean Data**: Proper handling of missing values and data types
- ✅ **Chunk Processing**: Handles large files efficiently
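The processing steps listed above might look like the following sketch, which reads a TSV file in chunks with pandas, normalizes column names, fills missing values, and adds a `task_type` label. The function name `read_tsv_in_chunks`, the snake_case renaming rule, the chunk size, and the sample rows are all assumptions made for illustration; the script's real logic may differ.

```python
import io

import pandas as pd


def read_tsv_in_chunks(source, task_type: str, chunksize: int = 500) -> pd.DataFrame:
    """Read a TSV file chunk by chunk and return one cleaned DataFrame."""
    chunks = []
    # dtype=str keeps every field as text, so IDs and vectors are not coerced.
    for chunk in pd.read_csv(source, sep="\t", dtype=str, chunksize=chunksize):
        # Assumed naming rule: lower-case, spaces replaced by underscores.
        chunk.columns = [c.strip().lower().replace(" ", "_") for c in chunk.columns]
        chunk = chunk.fillna("")          # missing values become empty strings
        chunk["task_type"] = task_type    # label every row with its task
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


# Tiny in-memory TSV standing in for a real task file (made-up rows).
sample = io.StringIO("URL\tDescription\nhttps://example.org/1\tfirst\nhttps://example.org/2\t\n")
df = read_tsv_in_chunks(sample, task_type="example_task", chunksize=1)
print(df)
```

Passing a file path instead of the `StringIO` object works the same way, since `pd.read_csv` accepts both.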
### Documentation

- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
  - Dataset description and statistics
  - Field explanations
  - Usage examples
  - Citation information
  - Task categories
- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
- 📖 **Code Examples**: Ready-to-use Python snippets
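To illustrate how a per-dataset README could be assembled from the data itself, here is a small sketch that builds a Markdown card with a description, basic statistics, and a field list. The function name `build_dataset_card`, the card layout, and the sample data are hypothetical; the READMEs the script actually generates contain more sections than this.

```python
import pandas as pd


def build_dataset_card(name: str, description: str, df: pd.DataFrame) -> str:
    """Render a minimal Markdown dataset card (illustrative layout only)."""
    lines = [
        f"# {name}",
        "",
        description,
        "",
        "## Statistics",
        "",
        f"- Entries: {len(df):,}",
        f"- Fields: {len(df.columns)}",
        "",
        "## Fields",
        "",
    ]
    # One bullet per column, in the order the columns appear in the data.
    lines += [f"- `{col}`" for col in df.columns]
    return "\n".join(lines) + "\n"


demo = pd.DataFrame({"url": ["u1", "u2"], "task_type": ["demo", "demo"]})
card = build_dataset_card("cti_bench_demo", "A made-up example dataset.", demo)
print(card)
```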
### Upload Features

- 🚀 **Batch Processing**: Processes all 6 datasets in one run
- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
- 📝 **README Integration**: Uploads documentation alongside data
- ⚡ **Progress Tracking**: Detailed logging and progress reports
## 📁 Dataset Structure

Each processed dataset has a task-specific schema. The MCQ, ATE, and VSP schemas are listed here:
### Multiple Choice Questions (MCQ)

```python
{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
```
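Given the MCQ schema above, one simple way to score model outputs is to pull the last standalone A-D letter from each response and compare it with `ground_truth`. This is a naive heuristic written for illustration; `extract_choice` and `mcq_accuracy` are hypothetical names, and the benchmark's own evaluation procedure may differ.

```python
import re


def extract_choice(model_output: str):
    """Return the last standalone letter A-D in the output, or None if absent."""
    matches = re.findall(r"\b([A-D])\b", model_output)
    return matches[-1] if matches else None


def mcq_accuracy(records, outputs) -> float:
    """Fraction of outputs whose extracted letter equals the record's ground_truth."""
    correct = sum(
        extract_choice(out) == rec["ground_truth"]
        for rec, out in zip(records, outputs)
    )
    return correct / len(records)


# Made-up records and responses, only to show the call pattern.
records = [{"ground_truth": "B"}, {"ground_truth": "D"}]
outputs = ["The answer is B.", "I would pick C"]
print(mcq_accuracy(records, outputs))
```

Taking the last match rather than the first reduces false hits from a capital "A" used as an article earlier in the response.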
### Attack Technique Extraction (ATE)

```python
{
    'url': str,           # Source MITRE software URL
    'platform': str,      # Target platform (Enterprise, Mobile, etc.)
    'description': str,   # Malware/attack description
    'prompt': str,        # Full instruction with MITRE reference
    'ground_truth': str,  # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str      # Always "attack_technique_extraction"
}
```
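Because `ground_truth` in the ATE schema is a comma-separated string of technique IDs, comparing predictions usually starts with parsing it into a set. The sketch below does that with a regular expression and computes a set-based F1 score. The helper names are hypothetical, sub-technique suffixes such as `.001` are deliberately dropped, and the choice of F1 is an assumption rather than a statement about the benchmark's official metric.

```python
import re

# Matches main technique IDs such as T1071; a ".001" suffix is ignored.
TECHNIQUE_ID = re.compile(r"T\d{4}")


def parse_techniques(text: str) -> set:
    """Extract the set of MITRE ATT&CK technique IDs mentioned in a string."""
    return set(TECHNIQUE_ID.findall(text))


def technique_f1(predicted: set, truth: set) -> float:
    """Harmonic mean of precision and recall over two ID sets."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


truth = parse_techniques("T1071, T1573")
predicted = parse_techniques("The sample uses T1071 and T1059.")
print(technique_f1(predicted, truth))
```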
### Vulnerability Severity Prediction (VSP)

```python
{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
```
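The `cvss_vector` field holds a CVSS v3.1 vector string, which is a slash-separated list of `metric:value` pairs. A minimal parsing sketch follows. The function name `parse_cvss_vector` and the example vector are illustrative, and the sketch assumes the vector may or may not carry the `CVSS:3.1` prefix.

```python
def parse_cvss_vector(vector: str) -> dict:
    """Split a CVSS v3.1 vector string into a {metric: value} dictionary."""
    parts = vector.strip().split("/")
    # Drop the optional "CVSS:3.1" version prefix if it is present.
    if parts and parts[0].startswith("CVSS:"):
        parts = parts[1:]
    return dict(part.split(":", 1) for part in parts if part)


def matching_metrics(predicted: str, truth: str) -> int:
    """Count base metrics on which two vectors agree."""
    pred, gold = parse_cvss_vector(predicted), parse_cvss_vector(truth)
    return sum(pred.get(metric) == value for metric, value in gold.items())


example = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
metrics = parse_cvss_vector(example)
print(metrics["AV"], len(metrics))
```

Comparing vectors metric by metric, as `matching_metrics` does, gives a finer-grained signal than exact string equality.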
## 🎓 Original CTI-Bench Paper

This processing script is based on the CTI-Bench dataset from:

> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**
> NeurIPS 2024
> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)
## 📄 Citation

If you use these processed datasets or this script, please cite the original paper:

```bibtex
@article{ctibench2024,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={[Authors]},
  journal={NeurIPS 2024},
  year={2024}
}
```
## 🤝 Contributing

Feel free to submit issues or pull requests to improve the processing script or documentation.
## 📜 License

This script is provided under the same license terms as the original CTI-Bench dataset.

---

**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯