| See `curate_dataset.py` in this repository for the full dataset curation pipeline. | |
| The dataset curation script combines: | |
| 1. AlicanKiraz0/All-CVE-Records-Training-Dataset (10K samples) | |
| 2. m-a-p/Code-Feedback (5K samples) | |
| 3. nvidia/OpenCodeReasoning (5K samples) | |
| 4. Synthetic cybersecurity examples (JSON output, AST, GDB, ROP) | |
| Run with: `python curate_dataset.py` | |
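
The combining step listed above can be pictured as capped sampling from each source followed by a shuffle. The following is only a minimal sketch of that idea using toy in-memory records; it is not the contents of `curate_dataset.py`. The function name `mix_sources`, the record shape, the `source` tag, and the scaled-down quotas are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def mix_sources(
    sources: Dict[str, Tuple[Sequence[Record], int]],
    seed: int = 0,
) -> List[Record]:
    """Sample up to `quota` records from each source, tag them, and shuffle.

    Hypothetical helper: the real pipeline may select and format
    records differently.
    """
    rng = random.Random(seed)
    combined: List[Record] = []
    for name, (records, quota) in sources.items():
        # Never request more records than the source actually holds.
        k = min(quota, len(records))
        for rec in rng.sample(list(records), k):
            # Record where each example came from so the mix can be audited.
            combined.append({**rec, "source": name})
    rng.shuffle(combined)
    return combined


# Toy stand-ins for the real sources, with quotas scaled down.
toy_sources = {
    "cve": ([{"text": f"cve-{i}"} for i in range(20)], 10),
    "code_feedback": ([{"text": f"feedback-{i}"} for i in range(8)], 5),
    "synthetic": ([{"text": f"synthetic-{i}"} for i in range(3)], 5),
}

mixed = mix_sources(toy_sources)
counts = {name: sum(r["source"] == name for r in mixed) for name in toy_sources}
print(len(mixed), counts)
```

Fixing the seed keeps the sampled mix reproducible between runs, and the `min` guard means a source smaller than its quota contributes everything it has rather than raising an error.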