tuandunghcmut committed on
Commit be7aeb1 · verified · 1 Parent(s): d6060fa

Upload README.md with huggingface_hub

Files changed (1): README.md +160 -0
README.md ADDED
@@ -0,0 +1,160 @@
# CTI-Bench Dataset Processing Script

This repository contains the processing script used to convert the original [CTI-Bench](https://github.com/xashru/cti-bench) TSV files into well-structured Hugging Face datasets with comprehensive documentation.

## 🎯 Overview

The script processes six CTI-Bench task files and uploads each one as a separate, documented dataset:

1. **cti_bench_mcq** - Multiple Choice Questions (2,500 entries)
2. **cti_bench_ate** - Attack Technique Extraction (60 entries)
3. **cti_bench_vsp** - Vulnerability Severity Prediction (1,000 entries)
4. **cti_bench_taa** - Threat Actor Attribution (50 entries)
5. **cti_bench_rcm** - Root Cause Mapping (1,000 entries)
6. **cti_bench_rcm_2021** - Root Cause Mapping 2021 (1,000 entries)
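
For orientation, the list above can be captured in a small lookup table. This is only a sketch: `DATASET_SIZES`, `repo_ids`, and `load_all` are hypothetical helper names that are not part of the processing script, the entry counts are copied from this README rather than re-checked against the Hub, and `load_all` needs network access and the `datasets` package.

```python
# Entry counts as listed in this README (not re-verified against the Hub).
DATASET_SIZES = {
    "cti_bench_mcq": 2500,
    "cti_bench_ate": 60,
    "cti_bench_vsp": 1000,
    "cti_bench_taa": 50,
    "cti_bench_rcm": 1000,
    "cti_bench_rcm_2021": 1000,
}


def repo_ids(username):
    """Build the Hub repository ID for each processed dataset."""
    return [f"{username}/{name}" for name in DATASET_SIZES]


def load_all(username):
    """Download every processed dataset (requires network access)."""
    from datasets import load_dataset  # lazy import: only needed when loading

    return {repo_id: load_dataset(repo_id) for repo_id in repo_ids(username)}


print(sum(DATASET_SIZES.values()))  # 5610
```

Calling `load_all("tuandunghcmut")` would fetch all six datasets in one go. The split names inside each returned object are not documented here, so inspect them before indexing.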

## 📊 Processed Datasets

All processed datasets are available at: [tuandunghcmut](https://huggingface.co/tuandunghcmut)

- 📋 [Multiple Choice Questions](https://huggingface.co/datasets/tuandunghcmut/cti_bench_mcq)
- 🔍 [Attack Technique Extraction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_ate)
- ⚠️ [Vulnerability Severity Prediction](https://huggingface.co/datasets/tuandunghcmut/cti_bench_vsp)
- 🎭 [Threat Actor Attribution](https://huggingface.co/datasets/tuandunghcmut/cti_bench_taa)
- 🔄 [Root Cause Mapping](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm)
- 🔄 [Root Cause Mapping 2021](https://huggingface.co/datasets/tuandunghcmut/cti_bench_rcm_2021)

## 🚀 Usage

### Prerequisites

```bash
pip install pandas datasets huggingface_hub
```

### Authentication

Make sure you're logged in to Hugging Face:

```bash
huggingface-cli login
# or
hf auth login
```

### Running the Script

1. Clone the original CTI-Bench repository:
   ```bash
   git clone https://github.com/xashru/cti-bench.git
   ```

2. Run the processing script:
   ```bash
   python process_cti_bench_with_docs.py --username YOUR_HF_USERNAME
   ```

### Command Line Options

- `--username`: Your Hugging Face username (required)
- `--token`: Hugging Face token (optional if already logged in)
- `--data-dir`: Path to the CTI-Bench data directory (default: `cti-bench/data`)
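
The three options above map naturally onto `argparse`. The parser below is a minimal sketch of how such a command line could be declared. It is an assumption about the shape of the interface, not the code in `process_cti_bench_with_docs.py`, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the three documented options."""
    parser = argparse.ArgumentParser(
        description="Process CTI-Bench TSV files and upload them to the Hub."
    )
    parser.add_argument("--username", required=True,
                        help="Hugging Face username")
    parser.add_argument("--token", default=None,
                        help="Hugging Face token (optional if already logged in)")
    parser.add_argument("--data-dir", default="cti-bench/data",
                        help="Path to the CTI-Bench data directory")
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(["--username", "YOUR_HF_USERNAME"])
print(args.username, args.token, args.data_dir)
```

Note that `argparse` exposes `--data-dir` as the attribute `args.data_dir`.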

## 🔧 Features

### Data Processing
- ✅ **Standardized Schema**: All datasets use consistent field naming
- ✅ **Task Type Labels**: Each entry includes a `task_type` field that identifies its task
- ✅ **Clean Data**: Proper handling of missing values and data types
- ✅ **Chunk Processing**: Reads large files in chunks rather than all at once
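
To make the data-processing features above concrete, here is a rough sketch of chunked TSV cleaning. It deliberately uses the standard-library `csv` module instead of pandas so that it runs without extra dependencies. The function name `iter_chunks`, the column names in the sample, and the default CSV quoting rules are all assumptions, and the real script may handle these details differently.

```python
import csv
import io
from itertools import islice


def iter_chunks(tsv_file, task_type, chunk_size=1000):
    """Yield lists of cleaned rows from a TSV file, chunk_size rows at a time."""
    # restval="" fills cells that are missing at the end of a short row.
    reader = csv.DictReader(tsv_file, delimiter="\t", restval="")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        cleaned = []
        for row in chunk:
            # Drop overflow cells (stored under the key None) and trim whitespace.
            record = {key: value.strip()
                      for key, value in row.items() if key is not None}
            record["task_type"] = task_type  # label every entry with its task
            cleaned.append(record)
        yield cleaned


# Tiny in-memory sample: the second row is missing its Description cell.
sample = io.StringIO(
    "URL\tDescription\n"
    "https://example.org/a\tfirst\n"
    "https://example.org/b\n"
)
chunks = list(iter_chunks(sample, "demo_task", chunk_size=1))
print(len(chunks), chunks[1][0])
```

Yielding one chunk at a time keeps memory use bounded by `chunk_size` rows, which is the point of chunked reading.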

### Documentation
- 📚 **Comprehensive READMEs**: Each dataset gets a detailed README with:
  - Dataset description and statistics
  - Field explanations
  - Usage examples
  - Citation information
  - Task categories
- 🎯 **Task-Specific Info**: Tailored documentation for each CTI task type
- 📖 **Code Examples**: Ready-to-use Python snippets

### Upload Features
- 🚀 **Batch Processing**: Processes all 6 datasets in one run
- 📤 **Auto-Upload**: Automatically uploads to Hugging Face Hub
- 📝 **README Integration**: Uploads documentation alongside data
- ⚡ **Progress Tracking**: Detailed logging and progress reports
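
The upload step can be approximated with the public `datasets` and `huggingface_hub` APIs. The function below is a sketch under assumptions: `push_dataset_with_readme` is a hypothetical name, the records are assumed to be a list of dictionaries, and the actual script may structure its upload differently. The README is uploaded last so that it is not replaced by any card that `push_to_hub` may write.

```python
def push_dataset_with_readme(records, repo_id, readme_text, token=None):
    """Upload a list of dict records as a dataset, then attach its README."""
    # Lazy imports so the function can be defined without the libraries installed.
    from datasets import Dataset
    from huggingface_hub import HfApi

    # Push the data first; this creates the dataset repository if needed.
    Dataset.from_list(records).push_to_hub(repo_id, token=token)

    # Then upload the documentation as README.md in the same dataset repo.
    HfApi(token=token).upload_file(
        path_or_fileobj=readme_text.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="dataset",
    )
```

Leaving `token` as `None` falls back to the credentials stored by the login command.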

## 📁 Dataset Structure

Each processed dataset has its own schema; three of them are shown below.

### Multiple Choice Questions (MCQ)
```python
{
    'url': str,            # Source MITRE ATT&CK URL
    'question': str,       # The cybersecurity question
    'option_a': str,       # Multiple choice option A
    'option_b': str,       # Multiple choice option B
    'option_c': str,       # Multiple choice option C
    'option_d': str,       # Multiple choice option D
    'prompt': str,         # Full instruction prompt
    'ground_truth': str,   # Correct answer (A, B, C, or D)
    'task_type': str       # Always "multiple_choice_question"
}
```
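
Given records with this shape, a simple accuracy check could look like the sketch below. It assumes model answers arrive as single letters and compares them to `ground_truth` case-insensitively. `mcq_accuracy` is a hypothetical helper, not the official CTI-Bench evaluation.

```python
def mcq_accuracy(records, predictions):
    """Fraction of predictions that match `ground_truth`, ignoring case and spaces."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(
        pred.strip().upper() == rec["ground_truth"].strip().upper()
        for rec, pred in zip(records, predictions)
    )
    return correct / len(records)


# Toy records containing only the field the metric needs.
records = [
    {"ground_truth": "A"},
    {"ground_truth": "C"},
    {"ground_truth": "D"},
    {"ground_truth": "B"},
]
print(mcq_accuracy(records, ["a", "C ", "B", "B"]))  # 0.75
```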

### Attack Technique Extraction (ATE)
```python
{
    'url': str,            # Source MITRE software URL
    'platform': str,       # Target platform (Enterprise, Mobile, etc.)
    'description': str,    # Malware/attack description
    'prompt': str,         # Full instruction with MITRE reference
    'ground_truth': str,   # MITRE technique IDs (e.g., "T1071, T1573")
    'task_type': str       # Always "attack_technique_extraction"
}
```
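
Because `ground_truth` is a comma-separated string of technique IDs, comparing it with a model answer usually means turning both into sets first. The sketch below does that and computes a per-example F1 score. The regular expression, the choice to ignore sub-technique suffixes, and the helper names are assumptions made for illustration.

```python
import re

# Matches main technique IDs such as T1071; a ".001" suffix is ignored.
TECHNIQUE_ID = re.compile(r"T\d{4}")


def parse_techniques(text):
    """Extract the set of main technique IDs found in a string."""
    return set(TECHNIQUE_ID.findall(text.upper()))


def technique_f1(ground_truth, prediction):
    """F1 score between the technique sets of one example."""
    truth = parse_techniques(ground_truth)
    pred = parse_techniques(prediction)
    overlap = len(truth & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


print(sorted(parse_techniques("T1071, T1573")))
print(technique_f1("T1071, T1573", "T1071, T1059"))  # 0.5
```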

### Vulnerability Severity Prediction (VSP)
```python
{
    'url': str,            # CVE URL
    'description': str,    # CVE vulnerability description
    'prompt': str,         # CVSS instruction prompt
    'cvss_vector': str,    # CVSS v3.1 vector string
    'task_type': str       # Always "vulnerability_severity_prediction"
}
```
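
A CVSS v3.1 vector string is a slash-separated list of `metric:value` pairs behind a `CVSS:3.1` prefix, so it is easy to split into a dictionary for per-metric comparison. The sketch below assumes the `cvss_vector` field carries that prefix. `parse_cvss_vector` is a hypothetical helper and the example vector is made up for illustration.

```python
def parse_cvss_vector(vector):
    """Split a CVSS v3.x vector string into (version, {metric: value})."""
    prefix, *parts = vector.strip().split("/")
    if not prefix.startswith("CVSS:"):
        raise ValueError(f"missing CVSS prefix: {vector!r}")
    # Each remaining part looks like "AV:N"; split it into metric and value.
    metrics = dict(part.split(":", 1) for part in parts)
    return prefix.split(":", 1)[1], metrics


# Illustrative vector, not taken from the dataset.
version, metrics = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(version, metrics["AV"], len(metrics))
```

With both the reference and the predicted vector parsed this way, the eight base metrics can be compared one by one.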

## 🎓 Original CTI-Bench Paper

This processing script is based on the CTI-Bench dataset from:

> **CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence**
> NeurIPS 2024
> [GitHub](https://github.com/xashru/cti-bench) | [Hugging Face](https://huggingface.co/datasets/AI4Sec/cti-bench)

## 📄 Citation

If you use these processed datasets or this script, please cite the original paper:

```bibtex
@article{ctibench2024,
  title={CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence},
  author={[Authors]},
  journal={NeurIPS 2024},
  year={2024}
}
```

## 🤝 Contributing

Feel free to submit issues or pull requests to improve the processing script or documentation.

## 📜 License

This script is provided under the same license terms as the original CTI-Bench dataset.

---

**Total Processed Samples**: 5,610 cybersecurity evaluation examples across 6 datasets! 🎯