# CTI-Bench Dataset Processing Script

This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.

## 🎯 Overview

The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:

1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
2. **cti_bench_ate** - Attack Technique Extraction (60 entries)  
3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
5. **cti_bench_rcm** - Root Cause Mapping (1,000 entries)
6. **cti_bench_rcm_2021** - Root Cause Mapping 2021 (1,000 entries)

## 📊 Processed Datasets

All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)

- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)
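
As a minimal sketch of how one of these datasets might be loaded, the snippet below builds the repository ID from the naming pattern in the links above and passes it to `datasets.load_dataset`. The helper names `repo_id` and `load_task` are hypothetical and are not part of the processing script; split names are not assumed here, so inspect the returned object to see what is available.

```python
def repo_id(task: str, username: str = "tuandunghcmut") -> str:
    """Build the Hub repository ID for a task suffix such as 'mcq' or 'rcm_2021'."""
    return f"{username}/cti_bench_{task}"


def load_task(task: str, username: str = "tuandunghcmut"):
    """Download one processed dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that repo_id() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(task, username))
```

Calling `load_task("mcq")` should return the Multiple Choice Questions dataset; printing the result lists its splits and columns.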

## 🚀 Usage

### Prerequisites

```bash
pip install pandas datasets huggingface_hub
```

### Authentication

Make sure you're logged in to Hugging Face:

```bash
huggingface-cli login
# or
hf auth login
```

### Running the Script

1. Clone the original CTI-Bench repository:
```bash
git clone https://github.com/xashru/cti-bench.git
```

2. Run the processing script:
```bash
python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
```

### Command Line Options

- `--username`: Your Hugging Face username (required)
- `--token`: Hugging Face token (optional if already logged in)
- `--data-dir`: Path to CTI-bench data directory (default: `cti-bench/data`)
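
The options above could be declared with `argparse` roughly as follows. This is a sketch based only on the option list, not the script's actual parser; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented command line options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True, help="Hugging Face username")
    parser.add_argument("--token", default=None, help="Hugging Face token (optional if already logged in)")
    parser.add_argument("--data-dir", default="cti-bench/data", help="Path to the CTI-Bench data directory")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```

Note that `argparse` exposes `--data-dir` as the attribute `args.data_dir`.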

## 🔧 Features

### Data Processing
- ✅ **Standardized Schema**: All datasets include consistent field naming
- ✅ **Task Type Labels**: Each entry includes a `task_type` field for identification
- ✅ **Clean Data**: Proper handling of missing values and data types
- ✅ **Chunk Processing**: Handles large files efficiently
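
To illustrate the kind of cleanup described above, here is a small sketch that reads TSV text, normalizes column names, replaces missing values with empty strings, and attaches a `task_type` label. It uses only the standard library `csv` module for portability, whereas the actual script depends on pandas; the function name `read_tsv_rows`, the naming rule, and the sample column names are assumptions for illustration.

```python
import csv
import io
from typing import Dict, List


def read_tsv_rows(tsv_text: str, task_type: str) -> List[Dict[str, str]]:
    """Parse TSV text into dictionaries with normalized keys and a task_type label."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for raw in reader:
        row = {}
        for key, value in raw.items():
            if key is None:
                continue  # skip surplus cells that have no header
            # Hypothetical naming rule: lower-case, spaces become underscores.
            name = key.strip().lower().replace(" ", "_")
            # Cells missing from a short row arrive as None; store "" instead.
            row[name] = (value or "").strip()
        row["task_type"] = task_type
        rows.append(row)
    return rows


# Hypothetical sample: the second row is missing its last cell.
sample = "URL\tQuestion\tGT\nhttps://example.org/1\tWhat is X?\tA\nhttps://example.org/2\tWhat is Y?\n"
for row in read_tsv_rows(sample, "multiple_choice_question"):
    print(row)
```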

### Documentation
- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
  - Dataset description and statistics
  - Field explanations
  - Usage examples
  - Citation information
  - Task categories
- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
- 📖 **Code Examples**: Ready-to-use Python snippets

### Upload Features
- 🚀 **Batch Processing**: Processes all 6 datasets in one run
- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
- 📝 **README Integration**: Uploads documentation alongside data
- ⚡ **Progress Tracking**: Detailed logging and progress reports

## 📁 Dataset Structure

The field layouts of three of the processed datasets are shown below:

### Multiple Choice Questions (MCQ)
```python
{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B  
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
```
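
Given records with this layout, model responses could be scored against `ground_truth` along these lines. This is a sketch, not the benchmark's official evaluation: the answer-extraction heuristic (take the last standalone capital A to D in the response) and the function names are assumptions.

```python
import re
from typing import Dict, List, Optional

# Matches a capital A, B, C, or D that stands alone as a word.
_CHOICE = re.compile(r"\b([ABCD])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last standalone A-D letter in a model response, or None."""
    matches = _CHOICE.findall(response)
    return matches[-1] if matches else None


def mcq_accuracy(records: List[Dict[str, str]], responses: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the record's ground_truth."""
    if not records:
        return 0.0
    correct = sum(
        extract_choice(resp) == rec["ground_truth"].strip()
        for rec, resp in zip(records, responses)
    )
    return correct / len(records)
```

The heuristic can misread a sentence that ends with the article "A", so a stricter prompt format may be preferable in practice.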

### Attack Technique Extraction (ATE)
```python
{
    'url': str,          # Source MITRE software URL
    'platform': str,     # Target platform (Enterprise, Mobile, etc.)
    'description': str,  # Malware/attack description
    'prompt': str,       # Full instruction with MITRE reference
    'ground_truth': str, # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str     # Always "attack_technique_extraction"
}
```
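
Because `ground_truth` here is a comma-separated list of technique IDs, predictions can be compared as sets. The sketch below extracts IDs of the form `T` plus four digits (so a sub-technique such as `T1071.001` counts as `T1071`) and computes precision, recall, and F1; treating sub-techniques this way and the function names are assumptions, not the benchmark's documented protocol.

```python
import re
from typing import Set, Tuple

# A technique ID is "T" followed by four digits; any ".NNN" suffix is ignored.
_TECHNIQUE = re.compile(r"T\d{4}")


def technique_ids(text: str) -> Set[str]:
    """Collect the distinct MITRE technique IDs mentioned in a string."""
    return set(_TECHNIQUE.findall(text))


def precision_recall_f1(predicted: str, ground_truth: str) -> Tuple[float, float, float]:
    """Compare predicted and reference technique IDs as sets."""
    pred, truth = technique_ids(predicted), technique_ids(ground_truth)
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```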

### Vulnerability Severity Prediction (VSP)
```python
{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
```
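
A `cvss_vector` string can be split into its individual metrics for inspection or comparison. The sketch below assumes the standard CVSS v3.1 layout (`CVSS:3.1/AV:N/AC:L/...`) and treats the `CVSS:` prefix as optional, since the exact form stored in the dataset is not stated here; the function name is hypothetical.

```python
from typing import Dict


def parse_cvss_vector(vector: str) -> Dict[str, str]:
    """Split a CVSS vector string into a {metric: value} dictionary."""
    metrics: Dict[str, str] = {}
    for part in vector.strip().split("/"):
        name, _, value = part.partition(":")
        if not name or not value:
            raise ValueError(f"Malformed CVSS component: {part!r}")
        if name == "CVSS":
            metrics["version"] = value  # e.g. "3.1"
        else:
            metrics[name] = value
    return metrics


example = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(example["version"], example["AV"], example["C"])
```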

## 🎓 Original CTI-Bench Paper

This processing script is based on the CTI-Bench dataset from:

> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**  
> NeurIPS 2024  
> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)

## 📄 Citation

If you use these processed datasets or this script, please cite the original paper:

```bibtex
@article{ctibench2024,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={[Authors]},
  journal={NeurIPS 2024},
  year={2024}
}
```

## 🤝 Contributing

Feel free to submit issues or pull requests to improve the processing script or documentation.

## 📜 License

This script is provided under the same license terms as the original CTI-Bench dataset.

---

**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯