# CTI-Bench Dataset Processing Script
This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.
## 🎯 Overview
The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:
1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
2. **cti_bench_ate** - Attack Technique Extraction (60 entries)
3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
5. **cti_bench_rcm** - Root Cause Mapping (1,000 entries)
6. **cti_bench_rcm_2021** - Root Cause Mapping 2021 (1,000 entries)
## 📊 Processed Datasets
All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)
- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)
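The repositories listed above can presumably be loaded with the `datasets` library. The sketch below is a minimal example under that assumption: the helper names `repo_id` and `load_task` are hypothetical, and because split names are not documented here, `load_task` simply returns whatever `load_dataset` provides for the repository.

```python
DEFAULT_OWNER = "tuandunghcmut"  # account linked in the list above

# Dataset suffixes taken from the repository names above.
TASKS = ("mcq", "ate", "vsp", "taa", "rcm", "rcm_2021")


def repo_id(task: str, owner: str = DEFAULT_OWNER) -> str:
    """Build the Hub repository ID for one processed dataset."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return f"{owner}/cti_bench_{task}"


def load_task(task: str, owner: str = DEFAULT_OWNER):
    """Download one processed dataset from the Hub (needs network access)."""
    # Imported lazily so that repo_id() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(task, owner))
```

For example, `load_task("mcq")` would fetch `tuandunghcmut/cti_bench_mcq`.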
## 🚀 Usage
### Prerequisites
```bash
pip install pandas datasets huggingface_hub
```
### Authentication
Make sure you're logged in to Hugging Face:
```bash
huggingface-cli login
# or
hf auth login
```
### Running the Script
1. Clone the original CTI-Bench repository:
```bash
git clone https://github.com/xashru/cti-bench.git
```
2. Run the processing script:
```bash
python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
```
### Command Line Options
- `--username`: Your Hugging Face username (required)
- `--token`: Hugging Face token (optional if already logged in)
- `--data-dir`: Path to the CTI-Bench data directory (default: `cti-bench/data`)
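As an illustration of how these options fit together, here is a minimal `argparse` sketch. It mirrors the three flags and the default listed above, but it is not the script's actual parser; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True,
                        help="Your Hugging Face username")
    parser.add_argument("--token", default=None,
                        help="Hugging Face token (optional if already logged in)")
    parser.add_argument("--data-dir", default="cti-bench/data",
                        help="Path to the CTI-Bench data directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```

Note that `argparse` exposes `--data-dir` as the attribute `args.data_dir`.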
## 🔧 Features
### Data Processing
- ✅ **Standardized Schema**: All datasets include consistent field naming
- ✅ **Task Type Labels**: Each entry includes a `task_type` field for identification
- ✅ **Clean Data**: Proper handling of missing values and data types
- ✅ **Chunk Processing**: Handles large files efficiently
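The processing steps listed above can be sketched roughly as follows. This is a dependency-free illustration using the standard library `csv` module rather than the script's own code (which presumably uses pandas, given the prerequisites). The function names, the column-name normalization, the empty-string fill for missing values, and the chunk size are all assumptions.

```python
import csv
import io
from itertools import islice


def clean_row(row: dict, task_type: str) -> dict:
    """Normalize column names, replace missing values with "", add task_type."""
    cleaned = {
        key.strip().lower().replace(" ", "_"): (value or "").strip()
        for key, value in row.items()
        if key is not None  # DictReader stores overflow cells under the key None
    }
    cleaned["task_type"] = task_type
    return cleaned


def iter_clean_chunks(handle, task_type: str, chunk_size: int = 500):
    """Yield lists of cleaned rows so a large TSV never sits fully in memory."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield [clean_row(row, task_type) for row in chunk]


# Tiny in-memory example; the real column names in the TSV files may differ.
sample = io.StringIO("URL\tQuestion\nhttp://x\tWhat?\nhttp://y\t\n")
for chunk in iter_clean_chunks(sample, "multiple_choice_question", chunk_size=1):
    print(chunk)
```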
### Documentation
- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
  - Dataset description and statistics
  - Field explanations
  - Usage examples
  - Citation information
  - Task categories
- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
- 📖 **Code Examples**: Ready-to-use Python snippets
### Upload Features
- 🚀 **Batch Processing**: Processes all 6 datasets in one run
- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
- 📝 **README Integration**: Uploads documentation alongside data
- ⚡ **Progress Tracking**: Detailed logging and progress reports
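A batch upload loop along these lines might look like the sketch below. It is not the script's actual implementation: the TSV file names, the helper names `plan_uploads` and `upload_all`, and the placeholder README text are assumptions. The Hub calls used (`Dataset.from_pandas`, `Dataset.push_to_hub`, `HfApi.upload_file`) are part of the `datasets` and `huggingface_hub` libraries.

```python
import logging
from pathlib import Path

log = logging.getLogger("cti_bench_upload")

# Assumed TSV file names inside the cloned repository's data directory.
TASK_FILES = {
    "cti_bench_mcq": "cti-mcq.tsv",
    "cti_bench_ate": "cti-ate.tsv",
    "cti_bench_vsp": "cti-vsp.tsv",
    "cti_bench_taa": "cti-taa.tsv",
    "cti_bench_rcm": "cti-rcm.tsv",
    "cti_bench_rcm_2021": "cti-rcm-2021.tsv",
}


def plan_uploads(username: str, data_dir: str = "cti-bench/data"):
    """Pair each source TSV path with its target Hub repository ID."""
    return [
        (Path(data_dir) / filename, f"{username}/{name}")
        for name, filename in TASK_FILES.items()
    ]


def upload_all(username: str, data_dir: str = "cti-bench/data", token=None):
    """Process and upload every dataset in one run (needs network access)."""
    # Third-party imports are kept local so plan_uploads() works without them.
    import pandas as pd
    from datasets import Dataset
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    jobs = plan_uploads(username, data_dir)
    for index, (tsv_path, repo) in enumerate(jobs, start=1):
        log.info("[%d/%d] processing %s", index, len(jobs), tsv_path)
        frame = pd.read_csv(tsv_path, sep="\t")
        Dataset.from_pandas(frame).push_to_hub(repo, token=token)
        # Placeholder README; the real script generates fuller documentation.
        readme = f"# {repo}\n\nProcessed from {tsv_path.name}.\n"
        api.upload_file(
            path_or_fileobj=readme.encode("utf-8"),
            path_in_repo="README.md",
            repo_id=repo,
            repo_type="dataset",
        )
        log.info("[%d/%d] uploaded %s", index, len(jobs), repo)
```

Keeping the planning step separate from the network calls makes it possible to inspect which files would be uploaded before anything is pushed.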
## 📁 Dataset Structure
The schemas for the MCQ, ATE, and VSP datasets are shown below:
### Multiple Choice Questions (MCQ)
```python
{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
```
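Given records with this schema, a simple accuracy check against model answers could look like the sketch below. The function name `mcq_accuracy` and the case-insensitive comparison are assumptions, not part of the processing script or of the benchmark's official evaluation code.

```python
def mcq_accuracy(records, predictions):
    """Fraction of predictions that match each record's 'ground_truth' letter."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(
        record["ground_truth"].strip().upper() == prediction.strip().upper()
        for record, prediction in zip(records, predictions)
    )
    return correct / len(records)


# Hand-written records that only carry the field needed for scoring.
example_records = [{"ground_truth": "A"}, {"ground_truth": "C"}]
print(mcq_accuracy(example_records, ["a", "B"]))
```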
### Attack Technique Extraction (ATE)
```python
{
    'url': str,           # Source MITRE software URL
    'platform': str,      # Target platform (Enterprise, Mobile, etc.)
    'description': str,   # Malware/attack description
    'prompt': str,        # Full instruction with MITRE reference
    'ground_truth': str,  # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str      # Always "attack_technique_extraction"
}
```
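Because `ground_truth` holds a comma-separated string of technique IDs, comparing it with a model's output usually starts with parsing both into sets. The sketch below shows one way to do that and to compute set-based precision, recall, and F1. The function names are hypothetical, and dropping sub-technique suffixes (for example treating `T1059.001` as `T1059`) is an assumption made for this illustration.

```python
import re

# Matches top-level technique IDs such as T1071; sub-technique suffixes are ignored.
TECHNIQUE_ID = re.compile(r"T\d{4}")


def parse_technique_ids(text: str) -> set:
    """Extract the set of MITRE ATT&CK technique IDs mentioned in a string."""
    return set(TECHNIQUE_ID.findall(text.upper()))


def precision_recall_f1(predicted: set, truth: set):
    """Set-overlap precision, recall, and F1 for one sample."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = parse_technique_ids("T1071, T1573")
predicted = parse_technique_ids("The sample uses T1071 and T1059.")
print(precision_recall_f1(predicted, truth))
```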
### Vulnerability Severity Prediction (VSP)
```python
{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
```
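The `cvss_vector` field is easier to compare metric by metric once it is split into its components. The following sketch parses a vector string into a dictionary and reports which of the eight CVSS v3.1 base metrics are missing, which can help when validating model output. The function names are hypothetical, and whether the stored strings carry the `CVSS:3.1/` prefix is not stated here, so the parser accepts both forms.

```python
# The eight base metrics defined by CVSS v3.1.
BASE_METRICS = ("AV", "AC", "PR", "UI", "S", "C", "I", "A")


def parse_cvss_vector(vector: str) -> dict:
    """Split a CVSS vector string into a {metric: value} dictionary."""
    parts = [part for part in vector.strip().split("/") if part]
    metrics = {"version": None}
    if parts and parts[0].upper().startswith("CVSS:"):
        metrics["version"] = parts[0].split(":", 1)[1]
        parts = parts[1:]
    for part in parts:
        key, separator, value = part.partition(":")
        if not separator or not key or not value:
            raise ValueError(f"malformed metric {part!r} in {vector!r}")
        metrics[key.upper()] = value.upper()
    return metrics


def missing_base_metrics(metrics: dict) -> list:
    """List base metrics absent from a parsed vector, in specification order."""
    return [name for name in BASE_METRICS if name not in metrics]


parsed = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(parsed["AV"], missing_base_metrics(parsed))
```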
## 🎓 Original CTI-Bench Paper
This processing script is based on the CTI-Bench dataset from:
> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**
> NeurIPS 2024
> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)
## 📄 Citation
If you use these processed datasets or this script, please cite the original paper:
```bibtex
@article{ctibench2024,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={[Authors]},
  journal={NeurIPS 2024},
  year={2024}
}
```
## 🤝 Contributing
Feel free to submit issues or pull requests to improve the processing script or documentation.
## 📜 License
This script is provided under the same license terms as the original CTI-Bench dataset.
---
**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯