# CTI-Bench Dataset Processing Script
This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.
## 🎯 Overview
The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:
1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
2. **cti_bench_ate** - Attack Technique Extraction (60 entries)
3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
5. **cti_bench_rcm** - Root Cause Mapping (1,000 entries)
6. **cti_bench_rcm_2021** - Root Cause Mapping 2021 (1,000 entries)
## 📊 Processed Datasets
All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)
- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)
## 🚀 Usage
### Prerequisites
```bash
pip install pandas datasets huggingface_hub
```
### Authentication
Make sure you're logged in to Hugging Face:
```bash
huggingface-cli login
# or
hf auth login
```
### Running the Script
1. Clone the original CTI-Bench repository:
```bash
git clone https://github.com/xashru/cti-bench.git
```
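2. Run the processing script from the directory that contains the cloned `cti-bench` folder, so that the default `--data-dir` resolves: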
```bash
python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
```
### Command Line Options
- `--username`: Your Hugging Face username (required)
- `--token`: Hugging Face token (optional if already logged in)
- `--data-dir`: Path to CTI-bench data directory (default: `cti-bench/data`)
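The options above map naturally onto `argparse`. The following is a minimal sketch of such a parser, not the script's actual code: the function name `build_parser` and the help strings are assumptions, while the flag names and the default data directory come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True, help="Hugging Face username")
    parser.add_argument(
        "--token", default=None, help="Hugging Face token (optional if already logged in)"
    )
    parser.add_argument(
        "--data-dir", default="cti-bench/data", help="Path to the CTI-Bench data directory"
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
# argparse exposes --data-dir as the attribute data_dir.
print(args.username, args.token, args.data_dir)
```

Passing an explicit list to `parse_args` is only for demonstration; a real script would call `parse_args()` with no arguments to read `sys.argv`.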
## 🔧 Features
### Data Processing
- ✅ **Standardized Schema**: All datasets include consistent field naming
- ✅ **Task Type Labels**: Each entry includes a `task_type` field for identification
- ✅ **Clean Data**: Proper handling of missing values and data types
- ✅ **Chunk Processing**: Handles large files efficiently
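As a rough illustration of what the processing steps above involve, the sketch below normalizes rows from a TSV source. It is not the script's implementation: the script depends on pandas, whereas this sketch uses only the standard library so it runs anywhere. The helper names, the snake_case renaming rule, and the choice to replace missing values with empty strings are all assumptions.

```python
import csv
import io
import re
from typing import Dict, Iterable, Iterator


def to_snake_case(name: str) -> str:
    """Turn a header such as 'Option A' into 'option_a'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def normalize_rows(lines: Iterable[str], task_type: str) -> Iterator[Dict[str, str]]:
    """Yield one cleaned dict per TSV row, streaming rather than loading the whole file."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        cleaned = {
            to_snake_case(key): (value or "").strip()  # missing cells become ""
            for key, value in row.items()
            if key is not None  # a None key holds surplus cells from over-long rows
        }
        cleaned["task_type"] = task_type  # label every entry with its task
        yield cleaned


# Tiny in-memory TSV with an empty 'Option A' cell, used only for demonstration.
sample = io.StringIO("URL\tQuestion\tOption A\nhttps://example.org\tWhat is X?\t\n")
for record in normalize_rows(sample, "multiple_choice_question"):
    print(record)
```

Because `normalize_rows` is a generator, rows are cleaned one at a time, which is one simple way to keep memory use flat on large files.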
### Documentation
- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
  - Dataset description and statistics
  - Field explanations
  - Usage examples
  - Citation information
  - Task categories
- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
- 📖 **Code Examples**: Ready-to-use Python snippets
### Upload Features
- 🚀 **Batch Processing**: Processes all 6 datasets in one run
- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
- 📝 **README Integration**: Uploads documentation alongside data
- ⚡ **Progress Tracking**: Detailed logging and progress reports
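The batch step can be pictured as a loop over a fixed task table. The sketch below only builds the plan (input path and target repository per dataset) and performs no upload. The TSV file names are assumptions about the layout of the data directory, and `build_plan` is a hypothetical helper; the repository names follow the pattern of the dataset links listed earlier.

```python
from pathlib import Path
from typing import List, Tuple

# Dataset name -> assumed TSV file name inside the data directory.
TASKS = {
    "cti_bench_mcq": "cti-mcq.tsv",
    "cti_bench_ate": "cti-ate.tsv",
    "cti_bench_vsp": "cti-vsp.tsv",
    "cti_bench_taa": "cti-taa.tsv",
    "cti_bench_rcm": "cti-rcm.tsv",
    "cti_bench_rcm_2021": "cti-rcm-2021.tsv",
}


def build_plan(username: str, data_dir: str = "cti-bench/data") -> List[Tuple[Path, str]]:
    """Return (tsv_path, repo_id) pairs, one per dataset, in a stable order."""
    return [
        (Path(data_dir) / filename, f"{username}/{name}")
        for name, filename in TASKS.items()
    ]


for tsv_path, repo_id in build_plan("YOUR_HF_USERNAME"):
    print(f"{tsv_path} -> {repo_id}")
```

Keeping the task table as plain data makes it easy to log progress per dataset or to rerun a single task after a failure.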
## 📁 Dataset Structure
The field schemas for three of the datasets are shown below. Every dataset includes the `task_type` field.
### Multiple Choice Questions (MCQ)
```python
{
    'url': str,           # Source MITRE ATT&CK URL
    'question': str,      # The cybersecurity question
    'option_a': str,      # Multiple choice option A
    'option_b': str,      # Multiple choice option B
    'option_c': str,      # Multiple choice option C
    'option_d': str,      # Multiple choice option D
    'prompt': str,        # Full instruction prompt
    'ground_truth': str,  # Correct answer (A, B, C, or D)
    'task_type': str      # Always "multiple_choice_question"
}
```
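Because `ground_truth` is a single letter, a model's reply can be scored by exact match once the chosen letter is extracted. The sketch below shows one way to do that. The extraction heuristic (take the last standalone uppercase A-D letter in the reply) and the function names are assumptions, not the benchmark's official evaluation procedure.

```python
import re
from typing import Iterable, Optional


def extract_choice(reply: str) -> Optional[str]:
    """Return the last standalone uppercase A-D letter in the reply, or None."""
    matches = re.findall(r"\b([A-D])\b", reply)
    return matches[-1] if matches else None


def accuracy(replies: Iterable[str], ground_truths: Iterable[str]) -> float:
    """Fraction of replies whose extracted letter equals the ground truth."""
    pairs = list(zip(replies, ground_truths))
    if not pairs:
        return 0.0
    correct = sum(
        extract_choice(reply) == truth.strip().upper() for reply, truth in pairs
    )
    return correct / len(pairs)


# Illustrative replies, not taken from the dataset.
replies = ["The answer is B.", "C", "I am not sure"]
print(accuracy(replies, ["B", "D", "A"]))
```

The heuristic is deliberately simple and can misfire on replies that mention several letters, so a stricter output format in the prompt is usually the more robust fix.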
### Attack Technique Extraction (ATE)
```python
{
    'url': str,           # Source MITRE software URL
    'platform': str,      # Target platform (Enterprise, Mobile, etc.)
    'description': str,   # Malware/attack description
    'prompt': str,        # Full instruction with MITRE reference
    'ground_truth': str,  # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str      # Always "attack_technique_extraction"
}
```
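The `ground_truth` field packs several technique IDs into one comma-separated string, so comparing predictions usually starts with parsing it into a set. The sketch below does that and computes a per-example F1 score. The helper names and the choice to truncate sub-techniques such as `T1071.001` to their parent ID are assumptions, not the benchmark's official scoring.

```python
import re
from typing import Set

# Matches IDs like T1071 and sub-technique IDs like T1071.001.
TECHNIQUE_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def parse_techniques(text: str) -> Set[str]:
    """Collect technique IDs from free text, truncating sub-techniques to the parent ID."""
    return {match.split(".")[0] for match in TECHNIQUE_RE.findall(text.upper())}


def f1_score(predicted: Set[str], expected: Set[str]) -> float:
    """Harmonic mean of precision and recall over two sets of technique IDs."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


expected = parse_techniques("T1071, T1573")
# Illustrative model reply, not taken from the dataset.
predicted = parse_techniques("Likely techniques: T1071.001 and T1059")
print(sorted(predicted), f1_score(predicted, expected))
```

Using sets makes the comparison insensitive to the order and repetition of IDs in a model's reply.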
### Vulnerability Severity Prediction (VSP)
```python
{
    'url': str,          # CVE URL
    'description': str,  # CVE vulnerability description
    'prompt': str,       # CVSS instruction prompt
    'cvss_vector': str,  # CVSS v3.1 vector string
    'task_type': str     # Always "vulnerability_severity_prediction"
}
```
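A CVSS v3.1 vector string is a slash-separated list of `metric:value` pairs behind a `CVSS:3.1` prefix, so a predicted vector can be compared with `cvss_vector` metric by metric once both are parsed. The sketch below parses the eight base metrics. `parse_cvss_vector` is a hypothetical helper, and the example vector is illustrative rather than taken from the dataset.

```python
from typing import Dict

# The eight base metrics of CVSS v3.1, in their canonical order.
BASE_METRICS = ("AV", "AC", "PR", "UI", "S", "C", "I", "A")


def parse_cvss_vector(vector: str) -> Dict[str, str]:
    """Parse a CVSS v3.x vector string into a dict of its base metrics."""
    parts = vector.strip().split("/")
    if parts and parts[0].upper().startswith("CVSS:"):
        parts = parts[1:]  # drop the version prefix, e.g. "CVSS:3.1"
    metrics: Dict[str, str] = {}
    for part in parts:
        name, separator, value = part.partition(":")
        if not separator or not value:
            raise ValueError(f"Malformed metric: {part!r}")
        metrics[name.upper()] = value.upper()
    missing = [name for name in BASE_METRICS if name not in metrics]
    if missing:
        raise ValueError(f"Missing base metrics: {', '.join(missing)}")
    # Return only the base metrics, in canonical order.
    return {name: metrics[name] for name in BASE_METRICS}


example = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(example["AV"], example["C"])
```

Raising `ValueError` on incomplete vectors keeps malformed model outputs from being silently scored as partial matches.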
## 🎓 Original CTI-Bench Paper
This processing script is based on the CTI-Bench dataset from:
> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**
> NeurIPS 2024
> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)
## 📄 Citation
If you use these processed datasets or this script, please cite the original paper:
```bibtex
@article{ctibench2024,
title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
author={[Authors]},
journal={NeurIPS 2024},
year={2024}
}
```
## 🤝 Contributing
Feel free to submit issues or pull requests to improve the processing script or documentation.
## 📜 License
This script is provided under the same license terms as the original CTI-Bench dataset.
---
**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯