tuandunghcmut/cti_bench_processor


tuandunghcmut committed on Sep 27, 2025

Commit

be7aeb1


1 Parent(s): d6060fa

Upload README.md with huggingface_hub


README.md +160 -0


	@@ -0,0 +1,160 @@

+# CTI-Bench Dataset Processing Script
+This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.
+## 🎯 Overview
+The script processes 6 different CTI-Bench task files and uploads them as separate, documented datasets:
+1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
+2. **cti_bench_ate** - Attack Technique Extraction (60 entries)
+3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
+4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
+5. **cti_bench_rcm** - Root Cause Mapping (1,000 entries)
+6. **cti_bench_rcm_2021** - Root Cause Mapping 2021 (1,000 entries)
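
As a quick sanity check on the figures in the list above, the per-dataset counts can be summed. This is only an arithmetic sketch; the dictionary name `DATASET_SIZES` and the helper `total_entries` are hypothetical and not part of the processing script.

```python
# Entry counts copied from the overview list above.
DATASET_SIZES = {
    "cti_bench_mcq": 2500,
    "cti_bench_ate": 60,
    "cti_bench_vsp": 1000,
    "cti_bench_taa": 50,
    "cti_bench_rcm": 1000,
    "cti_bench_rcm_2021": 1000,
}


def total_entries(sizes):
    """Return the total number of entries across all datasets."""
    return sum(sizes.values())


print(total_entries(DATASET_SIZES))  # 5610
```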
+## 📊 Processed Datasets
+All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)
+- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
+- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
+- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
+- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
+- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
+- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)
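
The datasets linked above can presumably be loaded with the `datasets` library. The sketch below assumes each repository loads with its default configuration; the helper names `repo_id` and `load_task` are hypothetical, and the available split names are not stated in this README, so inspect the returned object before indexing into it.

```python
def repo_id(task, username="tuandunghcmut"):
    """Build the Hub repository id for a task suffix such as 'mcq' or 'rcm_2021'."""
    return f"{username}/cti_bench_{task}"


def load_task(task, username="tuandunghcmut"):
    """Download one processed dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that repo_id() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id(task, username))


print(repo_id("mcq"))  # tuandunghcmut/cti_bench_mcq
```

Calling `load_task("mcq")` should return a `DatasetDict`; printing it lists the splits and columns that were actually uploaded.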
+## 🚀 Usage
+### Prerequisites
+```bash
+pip install pandas datasets huggingface_hub
+```
+### Authentication
+Make sure you're logged in to Hugging Face:
+```bash
+huggingface-cli login
+# or
+hf auth login
+```
+### Running the Script
+1. Clone the original CTI-Bench repository:
+```bash
+git clone https://github.com/xashru/cti-bench.git
+```
+2. Run the processing script:
+```bash
+python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
+```
+### Command Line Options
+- `--username`: Your Hugging Face username (required)
+- `--token`: Hugging Face token (optional if already logged in)
+- `--data-dir`: Path to CTI-bench data directory (default: `cti-bench/data`)
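
The three options above map naturally onto `argparse`. The following is a minimal sketch of such a parser, not the script's actual code: only the flag names and the `cti-bench/data` default come from this README, while the help strings and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command line options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hugging Face Hub."
    )
    parser.add_argument("--username", required=True, help="Hugging Face username")
    parser.add_argument("--token", default=None, help="Hugging Face token (optional if logged in)")
    parser.add_argument("--data-dir", default="cti-bench/data", help="Path to the CTI-bench data directory")
    return parser


# argparse exposes --data-dir as the attribute `data_dir`.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```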
+## 🔧 Features
+### Data Processing
+- ✅ **Standardized Schema**: All datasets include consistent field naming
+- ✅ **Task Type Labels**: Each entry includes a `task_type` field for identification
+- ✅ **Clean Data**: Proper handling of missing values and data types
+- ✅ **Chunk Processing**: Handles large files efficiently
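
To make the processing features above concrete, here is a rough sketch of the cleaning step. It uses the standard-library `csv` module as a stand-in for pandas (which the prerequisites suggest the real script uses), it does not attempt chunked reading, and the column names in the sample are invented for illustration.

```python
import csv
import io


def process_tsv(text, task_type):
    """Parse TSV text into cleaned records, each tagged with `task_type`."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        # Normalise column names; turn missing values (None) into empty strings.
        record = {
            key.strip().lower().replace(" ", "_"): (value or "").strip()
            for key, value in row.items()
            if key is not None  # surplus fields are collected under the key None
        }
        record["task_type"] = task_type
        records.append(record)
    return records


# Hypothetical two-row sample; the second row has an empty description.
sample = (
    "URL\tDescription\tGT\n"
    "https://example.com/a\tSome text\tT1071\n"
    "https://example.com/b\t\tT1573\n"
)
rows = process_tsv(sample, "attack_technique_extraction")
print(rows[1])
```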
+### Documentation
+- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
+  - Dataset description and statistics
+  - Field explanations
+  - Usage examples
+  - Citation information
+  - Task categories
+- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
+- 📖 **Code Examples**: Ready-to-use Python snippets
+### Upload Features
+- 🚀 **Batch Processing**: Processes all 6 datasets in one run
+- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
+- 📝 **README Integration**: Uploads documentation alongside data
+- ⚡ **Progress Tracking**: Detailed logging and progress reports
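
The upload features above could be implemented roughly as follows. This is a sketch under assumptions, not the script's actual code: `TASKS`, `plan_uploads`, and `upload_dataset` are hypothetical names, and the upload function is defined but not called here because it requires network access and valid credentials.

```python
# Task suffixes taken from the dataset names in this README.
TASKS = ["mcq", "ate", "vsp", "taa", "rcm", "rcm_2021"]


def plan_uploads(username):
    """Return the Hub repository ids that one batch run would write to."""
    return [f"{username}/cti_bench_{task}" for task in TASKS]


def upload_dataset(records, readme_text, repo_id, token=None):
    """Push a list of record dicts and a README to a dataset repository."""
    # Imported lazily so plan_uploads() works without these packages installed.
    from datasets import Dataset
    from huggingface_hub import HfApi

    Dataset.from_list(records).push_to_hub(repo_id, token=token)
    HfApi(token=token).upload_file(
        path_or_fileobj=readme_text.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="dataset",
    )


for target in plan_uploads("YOUR_HF_USERNAME"):
    print(target)
```

One design caveat: `push_to_hub` may itself write dataset card metadata into `README.md`, so uploading a hand-written README afterwards could overwrite it. A real script would likely need to merge the two.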
+## 📁 Dataset Structure
+The schemas of three of the processed datasets are shown below:
+### Multiple Choice Questions (MCQ)
+```python
+{
+    'url': str,           # Source MITRE ATT&CK URL
+    'question': str,      # The cybersecurity question
+    'option_a': str,      # Multiple choice option A
+    'option_b': str,      # Multiple choice option B
+    'option_c': str,      # Multiple choice option C
+    'option_d': str,      # Multiple choice option D
+    'prompt': str,        # Full instruction prompt
+    'ground_truth': str,  # Correct answer (A, B, C, or D)
+    'task_type': str      # Always "multiple_choice_question"
+}
+```
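
Given the MCQ schema above, scoring reduces to comparing a predicted letter with `ground_truth`. The sketch below assumes predictions have already been reduced to a single letter; extracting that letter from free-form model output is left out, and `mcq_accuracy` is a hypothetical helper rather than the benchmark's official scorer.

```python
def mcq_accuracy(records, predictions):
    """Return the fraction of predictions that match each record's ground_truth letter."""
    if not records:
        return 0.0
    correct = sum(
        pred.strip().upper() == rec["ground_truth"].strip().upper()
        for rec, pred in zip(records, predictions)
    )
    return correct / len(records)


# Hypothetical records that carry only the field needed for scoring.
records = [{"ground_truth": "A"}, {"ground_truth": "C"}]
print(mcq_accuracy(records, ["a", "B"]))  # 0.5
```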
+### Attack Technique Extraction (ATE)
+```python
+{
+    'url': str,          # Source MITRE software URL
+    'platform': str,     # Target platform (Enterprise, Mobile, etc.)
+    'description': str,  # Malware/attack description
+    'prompt': str,       # Full instruction with MITRE reference
+    'ground_truth': str, # MITRE technique IDs (e.g., "T1071, T1573")
+    'task_type': str     # Always "attack_technique_extraction"
+}
+```
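
Because the ATE `ground_truth` field is a comma-separated string of technique IDs, it usually needs to be split into a set before comparison. The sketch below shows one way to do that and to compute a per-sample F1 score. The choice of F1 here is an assumption for illustration and may differ from the metric used in the original benchmark.

```python
def parse_techniques(text):
    """Split a string such as 'T1071, T1573' into a set of technique IDs."""
    return {part.strip().upper() for part in text.split(",") if part.strip()}


def technique_f1(ground_truth, predicted):
    """Compute the F1 score between two comma-separated technique ID strings."""
    truth = parse_techniques(ground_truth)
    pred = parse_techniques(predicted)
    overlap = len(truth & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


print(sorted(parse_techniques("T1071, T1573")))  # ['T1071', 'T1573']
print(technique_f1("T1071, T1573", "T1071, T1573"))  # 1.0
```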
+### Vulnerability Severity Prediction (VSP)
+```python
+{
+    'url': str,          # CVE URL
+    'description': str,  # CVE vulnerability description
+    'prompt': str,       # CVSS instruction prompt
+    'cvss_vector': str,  # CVSS v3.1 vector string
+    'task_type': str     # Always "vulnerability_severity_prediction"
+}
+```
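
The `cvss_vector` field is easier to work with once split into its individual metrics. The sketch below assumes the stored string carries the standard `CVSS:3.1/` prefix followed by `NAME:VALUE` pairs; `parse_cvss_vector` is a hypothetical helper, and it does not compute the numeric base score.

```python
def parse_cvss_vector(vector):
    """Parse a CVSS vector string into a dict of metric name to value."""
    prefix, *metrics = vector.strip().split("/")
    if not prefix.startswith("CVSS:"):
        raise ValueError(f"not a CVSS vector string: {vector!r}")
    parsed = {"version": prefix.split(":", 1)[1]}
    for metric in metrics:
        name, _, value = metric.partition(":")
        parsed[name] = value
    return parsed


example = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(example["version"], example["AV"], example["C"])  # 3.1 N H
```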
+## 🎓 Original CTI-Bench Paper
+This processing script is based on the CTI-Bench dataset from:
+> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**
+> NeurIPS 2024
+> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)
+## 📄 Citation
+If you use these processed datasets or this script, please cite the original paper:
+```bibtex
+@article{ctibench2024,
+  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
+  author={[Authors]},
+  journal={NeurIPS 2024},
+  year={2024}
+}
+```
+## 🤝 Contributing
+Feel free to submit issues or pull requests to improve the processing script or documentation.
+## 📜 License
+This script is provided under the same license terms as the original CTI-Bench dataset.
+---
+**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯