CTI-Bench Dataset Processing Script

This repository contains the processing script used to convert the original CTI-Bench TSV files into well-structured Hugging Face datasets with comprehensive documentation.

🎯 Overview

The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:

cti_bench_mcq - Multiple Choice Questions (2,500 entries)
cti_bench_ate - Attack Technique Extraction (60 entries)
cti_bench_vsp - Vulnerability Severity Prediction (1,000 entries)
cti_bench_taa - Threat Actor Attribution (50 entries)
cti_bench_rcm - Root Cause Mapping (1,000 entries)
cti_bench_rcm_2021 - Root Cause Mapping 2021 (1,000 entries)

📊 Processed Datasets

All processed datasets are available at: tuandunghcmut

🚀 Usage

Prerequisites

pip install pandas datasets huggingface_hub

Authentication

Make sure you're logged in to Hugging Face:

huggingface-cli login
# or
hf auth login

Running the Script

Clone the original CTI-Bench repository:

git clone https://github.com/xashru/cti-bench.git

Run the processing script:

python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME

Command Line Options

--username: Your Hugging Face username (required)
--token: Hugging Face token (optional if already logged in)
--data-dir: Path to CTI-bench data directory (default: cti-bench/data)
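The options above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual source: the function name `build_parser` and the help strings are assumptions, while the flag names and the default path come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    # Required: the account that will own the uploaded datasets.
    parser.add_argument("--username", required=True, help="Your Hugging Face username")
    # Optional: only needed when no cached login is available.
    parser.add_argument("--token", default=None, help="Hugging Face token")
    # argparse exposes this flag as the attribute `data_dir`.
    parser.add_argument(
        "--data-dir",
        default="cti-bench/data",
        help="Path to the CTI-Bench data directory",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```

Passing an explicit list to `parse_args` keeps the example independent of how the script is invoked.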

🔧 Features

Data Processing

✅ Standardized Schema: All datasets include consistent field naming
✅ Task Type Labels: Each entry includes a task_type field for identification
✅ Clean Data: Proper handling of missing values and data types
✅ Chunk Processing: Handles large files efficiently
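The processing steps listed above can be illustrated with a small sketch. It uses the standard-library `csv` module as a stand-in for the pandas-based processing the script performs, and the helper name `load_tsv_rows`, the sample column names, and the normalization rules (lowercase snake_case field names, empty string for missing values) are assumptions made for illustration.

```python
import csv
import io


def load_tsv_rows(tsv_text: str, task_type: str) -> list[dict[str, str]]:
    """Read TSV text into dicts with normalized field names and a task_type label."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for raw in reader:
        row = {}
        for key, value in raw.items():
            if key is None:
                # Cells beyond the header width are collected under None; skip them.
                continue
            name = key.strip().lower().replace(" ", "_")
            # Missing cells come back as None; store them as empty strings.
            row[name] = (value or "").strip()
        row["task_type"] = task_type
        rows.append(row)
    return rows


# Hypothetical two-row sample; the second row is missing its last column.
SAMPLE_TSV = (
    "URL\tDescription\tGT\n"
    "https://example.org/1\tSome malware description\tT1071\n"
    "https://example.org/2\tAnother description\n"
)

for row in load_tsv_rows(SAMPLE_TSV, "attack_technique_extraction"):
    print(row)
```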

Documentation

📚 Comprehensive READMEs: Each dataset gets a detailed README with:
- Dataset description and statistics
- Field explanations
- Usage examples
- Citation information
- Task categories
🎯 Task-Specific Info: Tailored documentation for each CTI task type
📖 Code Examples: Ready-to-use Python snippets

Upload Features

🚀 Batch Processing: Processes all 6 datasets in one run
📤 Auto-Upload: Automatically uploads to Hugging Face Hub
📝 README Integration: Uploads documentation alongside data
⚡ Progress Tracking: Detailed logging and progress reports

📁 Dataset Structure

Each processed dataset has a task-specific schema. Three of them are shown below:

Multiple Choice Questions (MCQ)

{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B  
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
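As an illustration of how the MCQ schema might be consumed, the sketch below checks that a record carries the documented fields and scores a list of predicted letters against `ground_truth`. The helper names `is_valid_mcq` and `mcq_accuracy` are hypothetical, and plain accuracy is an assumed metric rather than the benchmark's official evaluation code.

```python
MCQ_FIELDS = (
    "url", "question", "option_a", "option_b", "option_c", "option_d",
    "prompt", "ground_truth", "task_type",
)


def is_valid_mcq(record: dict) -> bool:
    """Return True if all documented fields are strings and the answer is A-D."""
    has_fields = all(isinstance(record.get(field), str) for field in MCQ_FIELDS)
    return has_fields and record["ground_truth"].strip().upper() in {"A", "B", "C", "D"}


def mcq_accuracy(records: list, predictions: list) -> float:
    """Fraction of records whose predicted letter matches ground_truth."""
    if not records:
        return 0.0
    correct = sum(
        rec["ground_truth"].strip().upper() == pred.strip().upper()
        for rec, pred in zip(records, predictions)
    )
    return correct / len(records)


# Hypothetical record used only to exercise the helpers.
example = {
    "url": "https://example.org/technique",
    "question": "Which option is correct?",
    "option_a": "First", "option_b": "Second",
    "option_c": "Third", "option_d": "Fourth",
    "prompt": "Answer with A, B, C, or D.",
    "ground_truth": "A",
    "task_type": "multiple_choice_question",
}
print(is_valid_mcq(example), mcq_accuracy([example, example], ["a", "B"]))
```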

Attack Technique Extraction (ATE)

{
    'url': str,          # Source MITRE software URL
    'platform': str,     # Target platform (Enterprise, Mobile, etc.)
    'description': str,  # Malware/attack description
    'prompt': str,       # Full instruction with MITRE reference
    'ground_truth': str, # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str     # Always "attack_technique_extraction"
}
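Because `ground_truth` for this task is a comma-separated string of technique IDs, a consumer will usually want to turn it into a set before comparing it with a model's answer. The sketch below does that and computes set-based precision, recall, and F1. The helper names are hypothetical, the metric choice is an assumption rather than the benchmark's official scorer, and the pattern deliberately keeps only the main technique ID (so `T1071.001` is read as `T1071`).

```python
import re

# Matches main technique IDs such as T1071; a sub-technique suffix is ignored.
TECHNIQUE_ID = re.compile(r"\bT\d{4}\b")


def parse_technique_ids(text: str) -> set[str]:
    """Extract the set of main MITRE ATT&CK technique IDs from free text."""
    return set(TECHNIQUE_ID.findall(text.upper()))


def precision_recall_f1(predicted: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one sample."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


expected_ids = parse_technique_ids("T1071, T1573")
predicted_ids = parse_technique_ids("The sample uses T1071 and T1059.")
print(precision_recall_f1(predicted_ids, expected_ids))
```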

Vulnerability Severity Prediction (VSP)

{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
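The `cvss_vector` field is easier to compare metric by metric once it is split into its components. The sketch below parses a CVSS v3.1 base vector into a dictionary. The helper name `parse_cvss_vector` is hypothetical, and it assumes vectors follow the standard `CVSS:3.1/AV:N/...` layout, treating the version prefix as optional in case the dataset stores vectors without it.

```python
# The eight base metrics defined by CVSS v3.1.
BASE_METRICS = ("AV", "AC", "PR", "UI", "S", "C", "I", "A")


def parse_cvss_vector(vector: str) -> dict[str, str]:
    """Split a CVSS v3.1 base vector into a metric -> value mapping."""
    parts = vector.strip().split("/")
    metrics = {}
    if parts and parts[0].startswith("CVSS:"):
        # Record the version prefix separately, e.g. "3.1".
        metrics["version"] = parts[0].split(":", 1)[1]
        parts = parts[1:]
    for part in parts:
        name, sep, value = part.partition(":")
        if not sep or not name or not value:
            raise ValueError(f"Malformed metric: {part!r}")
        metrics[name] = value
    missing = [m for m in BASE_METRICS if m not in metrics]
    if missing:
        raise ValueError(f"Missing base metrics: {missing}")
    return metrics


parsed_vector = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(parsed_vector["AV"], parsed_vector["C"])
```

Comparing two parsed vectors key by key shows which individual metrics a prediction got wrong, which a single string comparison would hide.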

🎓 Original CTI-Bench Paper

This processing script is based on the CTI-Bench dataset from:

CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
NeurIPS 2024
GitHub | Hugging Face

📄 Citation

If you use these processed datasets or this script, please cite the original paper:

@article{ctibench2024,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={[Authors]},
  journal={NeurIPS 2024},
  year={2024}
}

🤝 Contributing

Feel free to submit issues or pull requests to improve the processing script or documentation.

📜 License

This script is provided under the same license terms as the original CTI-Bench dataset.

Total Processed Samples: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯
