# CTI-Bench Dataset Processing Script

This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.

## 🎯 Overview

The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:

1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
2. **cti_bench_ate** - Attack Technique Extraction (60 entries)  
3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
5. **cti_bench_rcm** - Root Cause Mapping (1,000 entries)
6. **cti_bench_rcm_2021** - Root Cause Mapping 2021 (1,000 entries)

## 📊 Processed Datasets

All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)

- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)
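
The repository IDs above share one naming pattern, so a small helper can build them and load any task. This is a minimal sketch: `repo_id` and `load_task` are hypothetical helper names, not part of the processing script, and `load_task` needs network access plus the `datasets` package.

```python
# Task suffixes taken from the dataset list above.
TASKS = ("mcq", "ate", "vsp", "taa", "rcm", "rcm_2021")


def repo_id(task, username="tuandunghcmut"):
    """Build the Hub repository ID for one CTI-Bench task."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return f"{username}/cti_bench_{task}"


def load_task(task, username="tuandunghcmut"):
    """Download one processed dataset from the Hugging Face Hub."""
    # Imported lazily so repo_id() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(task, username))


print(repo_id("mcq"))
```

Calling `load_task("mcq")` would return whatever splits the uploaded dataset defines; the split names are not listed here, so inspect the returned object before indexing into it.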

## 🚀 Usage

### Prerequisites

```bash
pip install pandas datasets huggingface_hub
```

### Authentication

Make sure you're logged in to Hugging Face:

```bash
huggingface-cli login
# or, with the newer `hf` CLI:
hf auth login
```

### Running the Script

1. Clone the original CTI-Bench repository:
```bash
git clone https://github.com/xashru/cti-bench.git
```

2. Run the processing script from the directory that contains the `cti-bench` clone, so that the default `--data-dir` resolves:
```bash
python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
```

### Command Line Options

- `--username`: Your Hugging Face username (required)
- `--token`: Hugging Face token (optional if already logged in)
- `--data-dir`: Path to CTI-bench data directory (default: `cti-bench/data`)
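
For reference, the three options could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the option list above, not the script's actual source; the help strings and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Declare the command line options documented above."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True,
                        help="Your Hugging Face username")
    parser.add_argument("--token", default=None,
                        help="Hugging Face token (optional if already logged in)")
    parser.add_argument("--data-dir", default="cti-bench/data",
                        help="Path to the CTI-bench data directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```

Note that `argparse` exposes `--data-dir` as the attribute `args.data_dir`.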

## 🔧 Features

### Data Processing
- ✅ **Standardized Schema**: All datasets include consistent field naming
- ✅ **Task Type Labels**: Each entry includes a `task_type` field for identification
- ✅ **Clean Data**: Proper handling of missing values and data types
- ✅ **Chunk Processing**: Handles large files efficiently
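
The processing steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's implementation: the real script depends on pandas, the input column names (`URL`, `Question`, `Option A`) are assumed, and the sample row is a placeholder rather than real benchmark data.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunk_size=500):
    """Yield lists of rows from a TSV file, chunk_size rows at a time."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def standardize(row, task_type):
    """Normalize field names, replace missing values, and add task_type."""
    record = {}
    for key, value in row.items():
        if key is None:  # surplus cells beyond the header are dropped
            continue
        name = key.strip().lower().replace(" ", "_")
        record[name] = (value or "").strip()  # missing values become ""
    record["task_type"] = task_type
    return record


# Placeholder input; a real run would open a file from cti-bench/data.
sample = "URL\tQuestion\tOption A\nhttps://example.com\tPlaceholder question?\t\n"
records = [
    standardize(row, "multiple_choice_question")
    for chunk in iter_chunks(io.StringIO(sample), chunk_size=1)
    for row in chunk
]
print(records)
```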

### Documentation
- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
  - Dataset description and statistics
  - Field explanations
  - Usage examples
  - Citation information
  - Task categories
- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
- 📖 **Code Examples**: Ready-to-use Python snippets

### Upload Features
- 🚀 **Batch Processing**: Processes all 6 datasets in one run
- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
- 📝 **README Integration**: Uploads documentation alongside data
- ⚡ **Progress Tracking**: Detailed logging and progress reports

## 📁 Dataset Structure

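Each processed dataset has its own schema. The schemas for MCQ, ATE, and VSP are shown below; the README of each dataset documents its fields in full.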

### Multiple Choice Questions (MCQ)
```python
{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B  
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
```
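
Given records with this schema, scoring a model reduces to comparing its answer letter with `ground_truth`. The sketch below assumes predictions have already been reduced to a single letter; `format_options` and `accuracy` are hypothetical helper names, and the records are placeholders.

```python
def format_options(record):
    """Render the four options as lettered lines."""
    return "\n".join(
        f"{letter}) {record['option_' + letter.lower()]}" for letter in "ABCD"
    )


def accuracy(records, predictions):
    """Fraction of predictions that match ground_truth, ignoring case."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(
        pred.strip().upper() == rec["ground_truth"].strip().upper()
        for rec, pred in zip(records, predictions)
    )
    return correct / len(records)


# Placeholder records carrying only the fields used here.
demo = [
    {"option_a": "w", "option_b": "x", "option_c": "y", "option_d": "z",
     "ground_truth": "A"},
    {"option_a": "w", "option_b": "x", "option_c": "y", "option_d": "z",
     "ground_truth": "C"},
]
print(format_options(demo[0]))
print(accuracy(demo, ["a", "B"]))  # 0.5
```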

### Attack Technique Extraction (ATE)
```python
{
    'url': str,          # Source MITRE software URL
    'platform': str,     # Target platform (Enterprise, Mobile, etc.)
    'description': str,  # Malware/attack description
    'prompt': str,       # Full instruction with MITRE reference
    'ground_truth': str, # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str     # Always "attack_technique_extraction"
}
```
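
Because `ground_truth` is a comma-separated string of technique IDs, comparing it with a model answer usually starts by turning both into sets. The following sketch computes a per-sample F1 score. The regular expression and helper names are assumptions, and sub-technique suffixes such as `.001` are deliberately reduced to the parent technique.

```python
import re

# Matches main technique IDs such as T1071; "T1071.001" yields "T1071".
TECHNIQUE_ID = re.compile(r"T\d{4}")


def parse_techniques(text):
    """Extract the set of MITRE technique IDs mentioned in a string."""
    return set(TECHNIQUE_ID.findall(text.upper()))


def sample_f1(predicted_text, ground_truth_text):
    """F1 score between predicted and reference technique sets."""
    predicted = parse_techniques(predicted_text)
    truth = parse_techniques(ground_truth_text)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


print(sorted(parse_techniques("T1071, T1573")))
print(sample_f1("T1071, T1059", "T1071, T1573"))  # 0.5
```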

### Vulnerability Severity Prediction (VSP)
```python
{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
```
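
A `cvss_vector` string can be turned into a numeric base score with the formulas from the CVSS v3.1 specification. The sketch below is an illustration for consumers of the dataset, not something the processing script is known to do. It covers base metrics only, and the function names are assumptions.

```python
import math

# Base metric weights from the CVSS v3.1 specification.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on whether Scope is changed.
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def parse_vector(vector):
    """Split 'CVSS:3.1/AV:N/...' into a {'AV': 'N', ...} mapping."""
    parts = vector.strip().split("/")
    if parts and parts[0].upper().startswith("CVSS:"):
        parts = parts[1:]  # drop the version prefix if present
    return dict(part.split(":", 1) for part in parts)


def roundup(value):
    """Round up to one decimal place as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the CVSS v3.1 base score for a vector string."""
    m = parse_vector(vector)
    changed = m["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    iss = 1 - (
        (1 - WEIGHTS["C"][m["C"]])
        * (1 - WEIGHTS["I"][m["I"]])
        * (1 - WEIGHTS["A"][m["A"]])
    )
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    exploitability = (
        8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
        * pr * WEIGHTS["UI"][m["UI"]]
    )
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```

Comparing the score derived from a predicted vector with the score derived from `cvss_vector` gives a numeric error per sample, which is easier to aggregate than exact string matches.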

## 🎓 Original CTI-Bench Paper

This processing script is based on the CTI-Bench dataset from:

> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**  
> NeurIPS 2024  
> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)

## 📄 Citation

If you use these processed datasets or this script, please cite the original paper:

```bibtex
@article{ctibench2024,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={[Authors]},
  journal={NeurIPS 2024},
  year={2024}
}
```

## 🤝 Contributing

Feel free to submit issues or pull requests to improve the processing script or documentation.

## 📜 License

This script is provided under the same license terms as the original CTI-Bench dataset.

---

**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯