tuandunghcmut committed
Commit d6060fa · verified · 1 Parent(s): 615fc78

Upload process_cti_bench_with_docs.py with huggingface_hub

Files changed (1):
  1. process_cti_bench_with_docs.py (+603, new file)
process_cti_bench_with_docs.py ADDED

#!/usr/bin/env python3
"""
Process CTI-Bench TSV files into Hugging Face datasets with comprehensive README documentation.
"""

import argparse
import logging
import os
import tempfile
from pathlib import Path

import pandas as pd
from datasets import Dataset
from huggingface_hub import HfApi

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def generate_mcq_readme(dataset_size):
    """Generate README for the Multiple Choice Questions dataset."""
    return f"""# CTI-Bench: Multiple Choice Questions (MCQ)

## Dataset Description

This dataset contains **{dataset_size:,} multiple choice questions** focused on cybersecurity knowledge, largely based on the MITRE ATT&CK framework. It is part of the CTI-Bench suite for evaluating Large Language Models on Cyber Threat Intelligence tasks.

## Dataset Structure

Each example contains:
- **url**: Source URL (typically MITRE ATT&CK technique pages)
- **question**: The cybersecurity question
- **option_a**: First multiple choice option
- **option_b**: Second multiple choice option
- **option_c**: Third multiple choice option
- **option_d**: Fourth multiple choice option
- **prompt**: Full prompt with instructions for the model
- **ground_truth**: Correct answer (A, B, C, or D)
- **task_type**: Always "multiple_choice_question"

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("tuandunghcmut/cti_bench_mcq")

# Access a sample
sample = dataset['train'][0]
print(f"Question: {{sample['question']}}")
print(f"Options: A) {{sample['option_a']}}, B) {{sample['option_b']}}")
print(f"Answer: {{sample['ground_truth']}}")
```

## Example

**Question:** Which of the following mitigations involves preventing applications from running that haven't been downloaded from legitimate repositories?

**Options:**
- A) Audit
- B) Execution Prevention
- C) Operating System Configuration
- D) User Account Control

**Answer:** B

## Citation

If you use this dataset, please cite the original CTI-Bench paper:

```bibtex
@inproceedings{{ctibench2024,
  title={{CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence}},
  author={{[Authors]}},
  booktitle={{NeurIPS 2024}},
  year={{2024}}
}}
```

## Original Source

This dataset is derived from [CTI-Bench](https://github.com/xashru/cti-bench) and is available under the same license terms.

## Tasks

This dataset is designed for:
- ✅ Multiple choice question answering
- ✅ Cybersecurity knowledge evaluation
- ✅ MITRE ATT&CK framework understanding
- ✅ Model benchmarking on CTI tasks
"""

def generate_ate_readme(dataset_size):
    """Generate README for the Attack Technique Extraction dataset."""
    return f"""# CTI-Bench: Attack Technique Extraction (ATE)

## Dataset Description

This dataset contains **{dataset_size} examples** for extracting MITRE Enterprise attack technique IDs from malware and attack descriptions. It tests a model's ability to map cybersecurity descriptions to specific MITRE ATT&CK techniques.

## Dataset Structure

Each example contains:
- **url**: Source URL (typically MITRE software/malware pages)
- **platform**: Target platform (Enterprise, Mobile, etc.)
- **description**: Detailed description of the malware or attack technique
- **prompt**: Full instruction prompt with MITRE technique reference list
- **ground_truth**: Comma-separated list of main MITRE technique IDs (e.g., "T1071, T1573, T1083")
- **task_type**: Always "attack_technique_extraction"

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("tuandunghcmut/cti_bench_ate")

# Access a sample
sample = dataset['train'][0]
print(f"Description: {{sample['description']}}")
print(f"MITRE Techniques: {{sample['ground_truth']}}")
```

## Example

**Description:** 3PARA RAT is a remote access tool (RAT) developed in C++ and associated with the group Putter Panda. It communicates with its command and control (C2) servers via HTTP, with commands encrypted using the DES algorithm in CBC mode...

**Expected Output:** T1071, T1573, T1083, T1070

## MITRE ATT&CK Techniques

The dataset covers techniques such as:
- **T1071**: Application Layer Protocol
- **T1573**: Encrypted Channel
- **T1083**: File and Directory Discovery
- **T1105**: Ingress Tool Transfer
- And many more...

## Citation

```bibtex
@inproceedings{{ctibench2024,
  title={{CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence}},
  author={{[Authors]}},
  booktitle={{NeurIPS 2024}},
  year={{2024}}
}}
```

## Original Source

This dataset is derived from [CTI-Bench](https://github.com/xashru/cti-bench) and is available under the same license terms.

## Tasks

This dataset is designed for:
- ✅ Named entity recognition (MITRE technique IDs)
- ✅ Information extraction from cybersecurity text
- ✅ MITRE ATT&CK framework mapping
- ✅ Threat intelligence analysis
"""

def generate_vsp_readme(dataset_size):
    """Generate README for the Vulnerability Severity Prediction dataset."""
    return f"""# CTI-Bench: Vulnerability Severity Prediction (VSP)

## Dataset Description

This dataset contains **{dataset_size:,} CVE descriptions** with corresponding CVSS v3.1 base scores. It evaluates a model's ability to assess vulnerability severity and generate proper CVSS vector strings.

## Dataset Structure

Each example contains:
- **url**: CVE URL (typically from nvd.nist.gov)
- **description**: CVE description detailing the vulnerability
- **prompt**: Full instruction prompt explaining CVSS v3.1 metrics
- **cvss_vector**: Ground-truth CVSS v3.1 vector string (e.g., "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
- **task_type**: Always "vulnerability_severity_prediction"

## CVSS v3.1 Metrics

The dataset covers all base metrics:
- **AV** (Attack Vector): Network (N), Adjacent (A), Local (L), Physical (P)
- **AC** (Attack Complexity): Low (L), High (H)
- **PR** (Privileges Required): None (N), Low (L), High (H)
- **UI** (User Interaction): None (N), Required (R)
- **S** (Scope): Unchanged (U), Changed (C)
- **C** (Confidentiality): None (N), Low (L), High (H)
- **I** (Integrity): None (N), Low (L), High (H)
- **A** (Availability): None (N), Low (L), High (H)

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("tuandunghcmut/cti_bench_vsp")

# Access a sample
sample = dataset['train'][0]
print(f"CVE: {{sample['description']}}")
print(f"CVSS Vector: {{sample['cvss_vector']}}")
```

## Example

**CVE Description:** In the Linux kernel through 6.7.1, there is a use-after-free in cec_queue_msg_fh, related to drivers/media/cec/core/cec-adap.c...

**CVSS Vector:** CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

## Citation

```bibtex
@inproceedings{{ctibench2024,
  title={{CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence}},
  author={{[Authors]}},
  booktitle={{NeurIPS 2024}},
  year={{2024}}
}}
```

## Original Source

This dataset is derived from [CTI-Bench](https://github.com/xashru/cti-bench) and is available under the same license terms.

## Tasks

This dataset is designed for:
- ✅ Vulnerability severity assessment
- ✅ CVSS score calculation
- ✅ Risk analysis and prioritization
- ✅ Cybersecurity impact evaluation
"""

def generate_taa_readme(dataset_size):
    """Generate README for the Threat Actor Attribution dataset."""
    return f"""# CTI-Bench: Threat Actor Attribution (TAA)

## Dataset Description

This dataset contains **{dataset_size} examples** for threat actor attribution tasks. It evaluates a model's ability to identify and attribute cyber attacks to specific threat actors based on attack patterns, techniques, and indicators.

## Dataset Structure

Each example contains:
- **task_type**: Always "threat_actor_attribution"
- Additional fields vary based on the specific attribution task
- Common fields include threat descriptions, attack patterns, and attribution targets

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("tuandunghcmut/cti_bench_taa")

# Access a sample
sample = dataset['train'][0]
print(f"Task: {{sample['task_type']}}")
```

## Attribution Categories

The dataset may cover attribution to:
- **APT Groups**: Advanced Persistent Threat organizations
- **Nation-State Actors**: Government-sponsored cyber units
- **Cybercriminal Organizations**: Profit-motivated threat groups
- **Hacktivist Groups**: Ideologically motivated actors

## Citation

```bibtex
@inproceedings{{ctibench2024,
  title={{CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence}},
  author={{[Authors]}},
  booktitle={{NeurIPS 2024}},
  year={{2024}}
}}
```

## Original Source

This dataset is derived from [CTI-Bench](https://github.com/xashru/cti-bench) and is available under the same license terms.

## Tasks

This dataset is designed for:
- ✅ Threat actor identification
- ✅ Attribution analysis
- ✅ Attack pattern recognition
- ✅ Intelligence correlation
"""

def generate_rcm_readme(dataset_size, variant=""):
    """Generate README for the Reverse Cyber Mapping dataset."""
    variant_text = f" ({variant})" if variant else ""
    return f"""# CTI-Bench: Reverse Cyber Mapping (RCM){variant_text}

## Dataset Description

This dataset contains **{dataset_size:,} examples** for reverse cyber mapping tasks. It evaluates a model's ability to work backwards from observed indicators or effects to identify the underlying attack techniques, tools, or threat actors.

## Dataset Structure

Each example contains:
- **task_type**: Always "reverse_cyber_mapping"
- Additional fields vary based on the specific mapping task
- Common fields include indicators, observables, and mapping targets

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("tuandunghcmut/cti_bench_rcm{'_2021' if '2021' in variant else ''}")

# Access a sample
sample = dataset['train'][0]
print(f"Task: {{sample['task_type']}}")
```

## Reverse Mapping Categories

The dataset may include mapping from:
- **Indicators of Compromise (IoCs)** → Attack techniques
- **Network signatures** → Malware families
- **Attack patterns** → Threat actors
- **Behavioral analysis** → MITRE techniques

## Citation

```bibtex
@inproceedings{{ctibench2024,
  title={{CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence}},
  author={{[Authors]}},
  booktitle={{NeurIPS 2024}},
  year={{2024}}
}}
```

## Original Source

This dataset is derived from [CTI-Bench](https://github.com/xashru/cti-bench) and is available under the same license terms.

## Tasks

This dataset is designed for:
- ✅ Reverse engineering of attack chains
- ✅ Indicator-to-technique mapping
- ✅ Threat hunting and investigation
- ✅ Forensic analysis
"""

def process_mcq_dataset(file_path):
    """Process the Multiple Choice Questions dataset."""
    logger.info(f"Processing MCQ dataset: {file_path}")

    df = pd.read_csv(file_path, sep='\t')

    # Clean and structure the data
    processed_data = []
    for _, row in df.iterrows():
        processed_data.append({
            'url': str(row['URL']) if pd.notna(row['URL']) else '',
            'question': str(row['Question']) if pd.notna(row['Question']) else '',
            'option_a': str(row['Option A']) if pd.notna(row['Option A']) else '',
            'option_b': str(row['Option B']) if pd.notna(row['Option B']) else '',
            'option_c': str(row['Option C']) if pd.notna(row['Option C']) else '',
            'option_d': str(row['Option D']) if pd.notna(row['Option D']) else '',
            'prompt': str(row['Prompt']) if pd.notna(row['Prompt']) else '',
            'ground_truth': str(row['GT']) if pd.notna(row['GT']) else '',
            'task_type': 'multiple_choice_question'
        })

    return Dataset.from_list(processed_data)

def process_ate_dataset(file_path):
    """Process the Attack Technique Extraction dataset."""
    logger.info(f"Processing ATE dataset: {file_path}")

    df = pd.read_csv(file_path, sep='\t')

    processed_data = []
    for _, row in df.iterrows():
        processed_data.append({
            'url': str(row['URL']) if pd.notna(row['URL']) else '',
            'platform': str(row['Platform']) if pd.notna(row['Platform']) else '',
            'description': str(row['Description']) if pd.notna(row['Description']) else '',
            'prompt': str(row['Prompt']) if pd.notna(row['Prompt']) else '',
            'ground_truth': str(row['GT']) if pd.notna(row['GT']) else '',
            'task_type': 'attack_technique_extraction'
        })

    return Dataset.from_list(processed_data)

def process_vsp_dataset(file_path):
    """Process the Vulnerability Severity Prediction dataset."""
    logger.info(f"Processing VSP dataset: {file_path}")

    df = pd.read_csv(file_path, sep='\t')

    processed_data = []
    for _, row in df.iterrows():
        processed_data.append({
            'url': str(row['URL']) if pd.notna(row['URL']) else '',
            'description': str(row['Description']) if pd.notna(row['Description']) else '',
            'prompt': str(row['Prompt']) if pd.notna(row['Prompt']) else '',
            'cvss_vector': str(row['GT']) if pd.notna(row['GT']) else '',
            'task_type': 'vulnerability_severity_prediction'
        })

    return Dataset.from_list(processed_data)

def process_taa_dataset(file_path):
    """Process the Threat Actor Attribution dataset."""
    logger.info(f"Processing TAA dataset: {file_path}")

    # Read in chunks due to potentially large file size
    chunk_list = []
    chunk_size = 10000

    for chunk in pd.read_csv(file_path, sep='\t', chunksize=chunk_size):
        chunk_list.append(chunk)

    df = pd.concat(chunk_list, ignore_index=True)

    processed_data = []
    for _, row in df.iterrows():
        # Handle different possible column structures for TAA
        data_entry = {'task_type': 'threat_actor_attribution'}

        # Try to map common column names
        for col in df.columns:
            col_lower = col.lower()
            if 'url' in col_lower:
                data_entry['url'] = str(row[col]) if pd.notna(row[col]) else ''
            elif 'description' in col_lower or 'text' in col_lower:
                data_entry['description'] = str(row[col]) if pd.notna(row[col]) else ''
            elif 'prompt' in col_lower:
                data_entry['prompt'] = str(row[col]) if pd.notna(row[col]) else ''
            elif col == 'GT' or 'ground' in col_lower or 'truth' in col_lower:
                data_entry['ground_truth'] = str(row[col]) if pd.notna(row[col]) else ''
            else:
                # Include other columns as-is, with snake_cased keys
                data_entry[col.lower().replace(' ', '_')] = str(row[col]) if pd.notna(row[col]) else ''

        processed_data.append(data_entry)

    return Dataset.from_list(processed_data)

def process_rcm_dataset(file_path):
    """Process the Reverse Cyber Mapping dataset."""
    logger.info(f"Processing RCM dataset: {file_path}")

    # Read in chunks due to potentially large file size
    chunk_list = []
    chunk_size = 10000

    for chunk in pd.read_csv(file_path, sep='\t', chunksize=chunk_size):
        chunk_list.append(chunk)

    df = pd.concat(chunk_list, ignore_index=True)

    processed_data = []
    for _, row in df.iterrows():
        data_entry = {'task_type': 'reverse_cyber_mapping'}

        # Map columns dynamically
        for col in df.columns:
            col_lower = col.lower()
            if 'url' in col_lower:
                data_entry['url'] = str(row[col]) if pd.notna(row[col]) else ''
            elif 'description' in col_lower or 'text' in col_lower:
                data_entry['description'] = str(row[col]) if pd.notna(row[col]) else ''
            elif 'prompt' in col_lower:
                data_entry['prompt'] = str(row[col]) if pd.notna(row[col]) else ''
            elif col == 'GT' or 'ground' in col_lower or 'truth' in col_lower:
                data_entry['ground_truth'] = str(row[col]) if pd.notna(row[col]) else ''
            else:
                data_entry[col.lower().replace(' ', '_')] = str(row[col]) if pd.notna(row[col]) else ''

        processed_data.append(data_entry)

    return Dataset.from_list(processed_data)

def upload_dataset_to_hub_with_readme(dataset, dataset_name, username, readme_content, token=None):
    """Upload a dataset to the Hugging Face Hub together with its README."""
    try:
        logger.info(f"Uploading {dataset_name} to Hugging Face Hub...")

        # First, push the dataset
        dataset.push_to_hub(
            repo_id=f"{username}/{dataset_name}",
            token=token,
            private=False
        )

        # Then upload the README file using HfApi
        api = HfApi()

        # Create a temporary README file
        with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False) as f:
            f.write(readme_content)
            readme_path = f.name

        try:
            # Upload README file
            api.upload_file(
                path_or_fileobj=readme_path,
                path_in_repo="README.md",
                repo_id=f"{username}/{dataset_name}",
                repo_type="dataset",
                token=token
            )
        finally:
            # Clean up the temp file
            os.unlink(readme_path)

        logger.info(f"Successfully uploaded {dataset_name} with documentation to {username}/{dataset_name}")
        return True

    except Exception as e:
        logger.error(f"Error uploading {dataset_name}: {str(e)}")
        return False

def main():
    parser = argparse.ArgumentParser(description='Process CTI-Bench TSV files and upload them to the Hugging Face Hub with documentation')
    parser.add_argument('--username', default='tuandunghcmut', help='Hugging Face username')
    parser.add_argument('--token', help='Hugging Face token (optional if logged in via CLI)')
    parser.add_argument('--data-dir', default='cti-bench/data', help='Directory containing TSV files')

    args = parser.parse_args()

    data_dir = Path(args.data_dir)

    # Map each file to its dataset name, processor, and README generator
    file_processors = {
        'cti-mcq.tsv': ('cti_bench_mcq', process_mcq_dataset, generate_mcq_readme),
        'cti-ate.tsv': ('cti_bench_ate', process_ate_dataset, generate_ate_readme),
        'cti-vsp.tsv': ('cti_bench_vsp', process_vsp_dataset, generate_vsp_readme),
        'cti-taa.tsv': ('cti_bench_taa', process_taa_dataset, generate_taa_readme),
        'cti-rcm.tsv': ('cti_bench_rcm', process_rcm_dataset, lambda size: generate_rcm_readme(size)),
        'cti-rcm-2021.tsv': ('cti_bench_rcm_2021', process_rcm_dataset, lambda size: generate_rcm_readme(size, "2021")),
    }

    successful_uploads = []
    failed_uploads = []

    # Process each file
    for filename, (dataset_name, processor_func, readme_generator) in file_processors.items():
        file_path = data_dir / filename

        if not file_path.exists():
            logger.warning(f"File not found: {file_path}")
            failed_uploads.append(filename)
            continue

        try:
            logger.info(f"Processing {filename}...")

            # Process the dataset
            dataset = processor_func(file_path)
            dataset_size = len(dataset)
            logger.info(f"Created dataset with {dataset_size:,} entries")

            # Generate README
            readme_content = readme_generator(dataset_size)

            # Upload to Hub with README
            success = upload_dataset_to_hub_with_readme(
                dataset, dataset_name, args.username, readme_content, args.token
            )

            if success:
                successful_uploads.append(dataset_name)
                logger.info(f"✅ Successfully processed and uploaded: {dataset_name}")
            else:
                failed_uploads.append(filename)
                logger.error(f"❌ Failed to upload: {dataset_name}")

        except Exception as e:
            logger.error(f"❌ Error processing {filename}: {str(e)}")
            failed_uploads.append(filename)

    # Summary
    logger.info("\n🎉 Processing complete!")
    logger.info(f"✅ Successfully uploaded {len(successful_uploads)} datasets with documentation:")
    for name in successful_uploads:
        logger.info(f" - https://huggingface.co/datasets/{args.username}/{name}")

    if failed_uploads:
        logger.info(f"❌ Failed to process {len(failed_uploads)} files:")
        for name in failed_uploads:
            logger.info(f" - {name}")

    logger.info(f"\nVisit https://huggingface.co/{args.username} to see your uploaded datasets with full documentation!")


if __name__ == "__main__":
    main()
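Because the TAA and RCM TSV layouts vary, `process_taa_dataset` and `process_rcm_dataset` map columns onto a normalized schema heuristically by name. That mapping step can be sketched in isolation (a minimal sketch on a hypothetical row dict; the `pd.notna` handling used in the full script is omitted):

```python
def normalize_row(row, task_type):
    """Map loosely named TSV columns onto the normalized schema used above."""
    entry = {'task_type': task_type}
    for col, value in row.items():
        col_lower = col.lower()
        if 'url' in col_lower:
            entry['url'] = str(value)
        elif 'description' in col_lower or 'text' in col_lower:
            entry['description'] = str(value)
        elif 'prompt' in col_lower:
            entry['prompt'] = str(value)
        elif col == 'GT' or 'ground' in col_lower or 'truth' in col_lower:
            entry['ground_truth'] = str(value)
        else:
            # Fall back to a snake_cased version of the original column name
            entry[col.lower().replace(' ', '_')] = str(value)
    return entry

# Hypothetical row illustrating the mapping
row = {'URL': 'https://example.com/report', 'Text': 'Sample report', 'GT': 'APT1', 'Extra Field': 'x'}
print(normalize_row(row, 'threat_actor_attribution'))
```

Matching on substrings keeps the loaders tolerant of header variations such as `Source URL` or `Ground Truth`, at the cost of occasionally folding an unrelated column (anything containing "text", say) into `description`.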