	# CTI-Bench Dataset Processing Script

	This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.

	## 🎯 Overview

	The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:

	1. cti_bench_mcq - Multiple Choice Questions (2,500 entries)
	2. cti_bench_ate - Attack Technique Extraction (60 entries)
	3. cti_bench_vsp - Vulnerability Severity Prediction (1,000 entries)
	4. cti_bench_taa - Threat Actor Attribution (50 entries)
	5. cti_bench_rcm - Root Cause Mapping (1,000 entries)
	6. cti_bench_rcm_2021 - Root Cause Mapping 2021 (1,000 entries)

	## 📊 Processed Datasets

	All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)

	- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
	- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
	- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
	- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
	- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
	- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)
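
As a minimal sketch, the processed datasets listed above could be loaded with the `datasets` library. The helper names `repo_id` and `load_task` are hypothetical, and this README does not state which splits each repository exposes, so the sketch returns whatever the Hub provides.

```python
# Task suffixes taken from the dataset list above.
TASKS = ("mcq", "ate", "vsp", "taa", "rcm", "rcm_2021")


def repo_id(task: str, username: str = "tuandunghcmut") -> str:
    """Build the Hub repository ID for one processed CTI-Bench task."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return f"{username}/cti_bench_{task}"


def load_task(task: str, username: str = "tuandunghcmut"):
    """Download one processed dataset (needs network access and `datasets`)."""
    # Imported lazily so that repo_id() also works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(task, username))
```

Calling `load_task("mcq")` would download `tuandunghcmut/cti_bench_mcq`; inspect the returned object to see the available splits before indexing into it.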

	## 🚀 Usage

	### Prerequisites

	```bash
	pip install pandas datasets huggingface_hub
	```

	### Authentication

	Make sure you're logged in to Hugging Face:

	```bash
	huggingface-cli login
	# or
	hf auth login
	```

	### Running the Script

	1. Clone the original CTI-Bench repository:
	```bash
	git clone https://github.com/xashru/cti-bench.git
	```

	2. Run the processing script:
	```bash
	python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
	```

	### Command Line Options

	- `--username`: Your Hugging Face username (required)
	- `--token`: Hugging Face token (optional if already logged in)
	- `--data-dir`: Path to CTI-bench data directory (default: `cti-bench/data`)
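
The options above map naturally onto an `argparse` parser. The following is a sketch of how such a parser might look, not the script's actual code; the function name `build_parser` is hypothetical, while the flag names and the default come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three documented options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True, help="Your Hugging Face username")
    parser.add_argument("--token", default=None, help="Hugging Face token (optional if already logged in)")
    parser.add_argument("--data-dir", default="cti-bench/data", help="Path to CTI-bench data directory")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```

Note that `argparse` exposes `--data-dir` as `args.data_dir`.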

	## 🔧 Features

	### Data Processing
	- ✅ Standardized Schema: All datasets include consistent field naming
	- ✅ Task Type Labels: Each entry includes a `task_type` field for identification
	- ✅ Clean Data: Proper handling of missing values and data types
	- ✅ Chunk Processing: Handles large files efficiently
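
To illustrate the processing steps listed above, here is a small sketch that reads TSV rows in chunks, normalizes column names, fills missing values, and adds a `task_type` label. It uses the standard-library `csv` module as a stand-in for pandas so that it runs without extra dependencies. The function names, the chunk size, and the sample column names are assumptions, not taken from the script.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def normalize_row(row: Dict[str, str], task_type: str) -> Dict[str, str]:
    """Snake-case column names, replace missing values with "", add task_type."""
    clean = {
        key.strip().lower().replace(" ", "_"): (value or "").strip()
        for key, value in row.items()
        if key is not None  # DictReader stores surplus fields under the key None
    }
    clean["task_type"] = task_type
    return clean


def iter_chunks(
    lines: Iterable[str], task_type: str, chunk_size: int = 1000
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size normalized rows from TSV lines."""
    reader = csv.DictReader(lines, delimiter="\t")
    while True:
        chunk = [normalize_row(row, task_type) for row in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


# Tiny in-memory TSV with hypothetical column names and one empty field.
sample = io.StringIO(
    "URL\tQuestion\tOption A\n"
    "https://example.org/1\tWhat is X?\t\n"
    "https://example.org/2\tWhat is Y?\tfoo\n"
)
chunks = list(iter_chunks(sample, "multiple_choice_question", chunk_size=1))
print(len(chunks), chunks[0][0])
```

Because `iter_chunks` is a generator, only one chunk of rows is held in memory at a time.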

	### Documentation
	- 📚 Comprehensive READMEs: Each dataset gets a detailed README with:
	  - Dataset description and statistics
	  - Field explanations
	  - Usage examples
	  - Citation information
	  - Task categories
	- 🎯 Task-Specific Info: Tailored documentation for each CTI task type
	- 📖 Code Examples: Ready-to-use Python snippets
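
A per-dataset README of the kind described above could be rendered from a small template. This sketch is only illustrative: `render_readme`, its parameters, and the section headings are hypothetical and do not reproduce the script's actual output.

```python
from typing import Dict


def render_readme(name: str, description: str, num_rows: int, fields: Dict[str, str]) -> str:
    """Render a short Markdown README for one dataset."""
    lines = [
        f"# {name}",
        "",
        description,
        "",
        "## Statistics",
        "",
        f"- Entries: {num_rows:,}",
        "",
        "## Fields",
        "",
    ]
    # One bullet per field, in the order the caller supplied them.
    lines += [f"- `{field}`: {meaning}" for field, meaning in fields.items()]
    return "\n".join(lines) + "\n"


readme = render_readme(
    name="cti_bench_mcq",
    description="Multiple Choice Questions from CTI-Bench.",
    num_rows=2500,
    fields={"question": "The cybersecurity question", "ground_truth": "Correct answer (A, B, C, or D)"},
)
print(readme)
```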

	### Upload Features
	- 🚀 Batch Processing: Processes all 6 datasets in one run
	- 📤 Auto-Upload: Automatically uploads to Hugging Face Hub
	- 📝 README Integration: Uploads documentation alongside data
	- ⚡ Progress Tracking: Detailed logging and progress reports
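
The batch and progress-tracking behavior listed above can be sketched as a loop that logs each dataset and keeps going if one upload fails. Whether the real script continues after a failure is an assumption here, and `process_all` and `fake_upload` are hypothetical names. A real uploader might wrap `datasets.Dataset.push_to_hub`; the stand-in below avoids any network access.

```python
import logging
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("cti_bench_upload")


def process_all(tasks: List[str], upload: Callable[[str], int]) -> Dict[str, int]:
    """Run upload(task) for each task; return rows uploaded per successful task."""
    results: Dict[str, int] = {}
    for index, task in enumerate(tasks, start=1):
        log.info("[%d/%d] processing %s", index, len(tasks), task)
        try:
            results[task] = upload(task)
        except Exception:
            # Log the traceback and continue so one failure does not stop the batch.
            log.exception("upload failed for %s", task)
    log.info("finished: %d/%d datasets uploaded", len(results), len(tasks))
    return results


def fake_upload(task: str) -> int:
    """Stand-in uploader that fails for one task and reports 10 rows otherwise."""
    if task == "broken":
        raise RuntimeError("simulated failure")
    return 10


summary = process_all(["mcq", "broken", "ate"], fake_upload)
```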

	## 📁 Dataset Structure

	Each processed dataset has a task-specific schema. The schemas for three of the six datasets are shown below:

	### Multiple Choice Questions (MCQ)
	```python
	{
	    'url': str,  # Source MITRE ATT&CK URL
	    'question': str,  # The cybersecurity question
	    'option_a': str,  # Multiple choice option A
	    'option_b': str,  # Multiple choice option B
	    'option_c': str,  # Multiple choice option C
	    'option_d': str,  # Multiple choice option D
	    'prompt': str,  # Full instruction prompt
	    'ground_truth': str,  # Correct answer (A, B, C, or D)
	    'task_type': str  # Always "multiple_choice_question"
	}
	```
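
Given the `ground_truth` field above, a model's answers could be scored roughly as follows. This is a sketch under stated assumptions: the answer-extraction heuristic (take the last standalone capital letter A to D in the response) and the function names are hypothetical, and this README does not describe how the benchmark itself scores responses.

```python
import re
from typing import Dict, List, Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the last standalone capital A-D in a model response, if any."""
    matches = re.findall(r"\b([ABCD])\b", response)
    return matches[-1] if matches else None


def mcq_accuracy(records: List[Dict[str, str]], responses: List[str]) -> float:
    """Fraction of responses whose extracted letter equals ground_truth."""
    if not records or len(records) != len(responses):
        raise ValueError("records and responses must be non-empty and of equal length")
    correct = sum(
        extract_choice(response) == record["ground_truth"].strip().upper()
        for record, response in zip(records, responses)
    )
    return correct / len(records)


records = [{"ground_truth": "B"}, {"ground_truth": "D"}]
responses = ["The answer is B.", "I think C"]
print(mcq_accuracy(records, responses))
```

The heuristic is deliberately simple; a response that mentions several letters is scored on the last one only.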

	### Attack Technique Extraction (ATE)
	```python
	{
	    'url': str,  # Source MITRE software URL
	    'platform': str,  # Target platform (Enterprise, Mobile, etc.)
	    'description': str,  # Malware/attack description
	    'prompt': str,  # Full instruction with MITRE reference
	    'ground_truth': str,  # MITRE technique IDs (e.g., "T1071, T1573")
	    'task_type': str  # Always "attack_technique_extraction"
	}
	```
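
Since `ground_truth` holds a comma-separated string of technique IDs, comparing it with a prediction usually starts by parsing both into sets. The sketch below computes a per-record F1 score; the choice of F1, the decision to drop sub-technique suffixes such as `.001`, and the function names are assumptions made for illustration.

```python
import re
from typing import Set

# Matches main technique IDs such as T1071; a trailing ".001" is not captured.
TECHNIQUE_ID = re.compile(r"\bT\d{4}\b")


def parse_techniques(text: str) -> Set[str]:
    """Extract the set of main MITRE technique IDs found in text."""
    return set(TECHNIQUE_ID.findall(text))


def f1_score(predicted: Set[str], expected: Set[str]) -> float:
    """Harmonic mean of precision and recall over two ID sets."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


expected = parse_techniques("T1071, T1573")
predicted = parse_techniques("T1071, T1059.001")
print(sorted(predicted), f1_score(predicted, expected))
```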

	### Vulnerability Severity Prediction (VSP)
	```python
	{
	    'url': str,  # CVE URL
	    'description': str,  # CVE vulnerability description
	    'prompt': str,  # CVSS instruction prompt
	    'cvss_vector': str,  # CVSS v3.1 vector string
	    'task_type': str  # Always "vulnerability_severity_prediction"
	}
	```
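
A `cvss_vector` string can be turned into a numeric base score, which makes predicted and reference vectors easier to compare. The sketch below follows the base-score equations and metric weights of the CVSS v3.1 specification. It assumes the field carries the `CVSS:3.1/` prefix and only base metrics; the function names are hypothetical, and this README does not say how the benchmark scores predictions.

```python
import math
from typing import Dict

AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
# Privileges Required weights depend on Scope (U = unchanged, C = changed).
PR = {"U": {"N": 0.85, "L": 0.62, "H": 0.27}, "C": {"N": 0.85, "L": 0.68, "H": 0.5}}
BASE_METRICS = ("AV", "AC", "PR", "UI", "S", "C", "I", "A")


def parse_vector(vector: str) -> Dict[str, str]:
    """Split 'CVSS:3.1/AV:N/...' into a metric -> value mapping."""
    prefix, _, rest = vector.strip().partition("/")
    if prefix != "CVSS:3.1":
        raise ValueError(f"expected a CVSS:3.1 vector, got {vector!r}")
    metrics = dict(part.split(":", 1) for part in rest.split("/"))
    missing = [name for name in BASE_METRICS if name not in metrics]
    if missing:
        raise ValueError(f"missing base metrics: {missing}")
    return metrics


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the CVSS v3.1 base score for a vector string."""
    m = parse_vector(vector)
    changed = m["S"] == "C"
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR[m["S"]][m["PR"]] * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total, 10)) if changed else roundup(min(total, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```

With base scores in hand, a predicted vector can be compared with the reference by absolute score difference, or metric by metric using `parse_vector`.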

	## 🎓 Original CTI-Bench Paper

	This processing script is based on the CTI-Bench dataset from:

	> CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
	> NeurIPS 2024
	> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)

	## 📄 Citation

	If you use these processed datasets or this script, please cite the original paper:

	```bibtex
	@article{ctibench2024,
	  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
	  author={[Authors]},
	  journal={NeurIPS 2024},
	  year={2024}
	}
	```

	## 🤝 Contributing

	Feel free to submit issues or pull requests to improve the processing script or documentation.

	## 📜 License

	This script is provided under the same license terms as the original CTI-Bench dataset.

	---

	Total processed samples: 5,610 cybersecurity evaluation examples across the 6 datasets. 🎯