---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
- security
- vulnerability-detection
- code-repair
- zero-day
- exploit-scanner
- cybersecurity
- sft
- qlora
- peft
datasets:
- hitoshura25/megavul
- yikun-li/TitanVul
- yikun-li/CleanVul
language:
- en
pipeline_tag: text-generation
---
# Zero-Day Exploit Scanner & Fixer

A fine-tuned code security model that **detects vulnerabilities** and **generates fixes** across multiple programming languages.

Built on **Qwen2.5-Coder-7B-Instruct** with QLoRA fine-tuning on 90K+ real-world vulnerability-fix pairs from CVE/CWE databases.
## What It Does

Given any code snippet, this model will:

1. **SCAN**: Determine if the code contains a security vulnerability (VULNERABLE / SAFE)
2. **IDENTIFY**: Classify the vulnerability type (CWE ID) and link to known CVEs
3. **EXPLAIN**: Describe the attack vector, impact, and exploitation mechanism
4. **FIX**: Generate corrected code that patches the vulnerability
5. **DOCUMENT**: Explain what was changed and why
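
For automated triage, the SCAN verdict can be pulled out of the model's free-text response. The helper below is a minimal sketch, assuming the response contains the literal word `VULNERABLE` or `SAFE` as listed in step 1; the exact output format is not specified in this card, and the name `extract_verdict` is hypothetical.

```python
import re
from typing import Optional

# Assumption: the response contains the literal verdict word VULNERABLE or SAFE.
# Word boundaries keep "UNSAFE" from being read as "SAFE".
_VERDICT_RE = re.compile(r"\b(VULNERABLE|SAFE)\b")


def extract_verdict(response: str) -> Optional[str]:
    """Return the first standalone verdict word in the response, or None."""
    match = _VERDICT_RE.search(response)
    return match.group(1) if match else None


print(extract_verdict("Verdict: VULNERABLE (CWE-120, stack buffer overflow)"))
```

A phrase such as "NOT VULNERABLE" would still be read as `VULNERABLE` by this sketch, so a `None` or surprising result is best routed to manual review.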
## Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) |
| **Method** | QLoRA (4-bit NF4 quantization) |
| **LoRA Config** | r=16, α=32, dropout=0.05 |
| **Target Modules** | q, k, v, o, gate, up, down projections |
| **Training** | SFT with assistant-only loss |
| **Max Length** | 2048 tokens |
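
To give a sense of the adapter's size, the sketch below counts the parameters that LoRA with r=16 adds across the seven target projections. The layer dimensions and the `*_proj` module names are assumptions taken from the public Qwen2.5-7B configuration, not from this card, so treat the result as an estimate.

```python
# Assumed dimensions from the public Qwen2.5-7B config (not stated in this card).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size (MLP width)
KV_DIM = 4 * 128       # num_key_value_heads * head_dim
LAYERS = 28            # num_hidden_layers
RANK = 16              # LoRA r, from the table above

# (in_features, out_features) of each targeted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS
print(f"{total:,} trainable adapter parameters")
```

Under these assumptions the adapter holds about 40 million trainable parameters, well under 1% of the 7B base model.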
## Training Data

Combined from three curated vulnerability datasets plus safe counterexamples, totaling **~90K samples**:

| Dataset | Samples | Languages | Source |
|---------|---------|-----------|--------|
| [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul) | ~17K | C/C++ | 992 repos, 169 CWE types, 2006-2023 |
| [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul) | ~38K | C, C++, Java, Python, JS | Aggregated from 7 sources, deduplicated |
| [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) | ~26K | Multi-language | LLM-filtered, vulnerability_score ≥ 1 |
| **Safe samples** | ~12K | Multi-language | Fixed code from TitanVul (negative examples) |
### Data Quality Controls

- CleanVul is filtered by `vulnerability_score >= 1`, which removes the ~27% of samples judged to be noise
- TitanVul aggregates and deduplicates BigVul, DiverseVul, CVEFixes, PrimeVul, and other sources
- Safe code examples taken from patched functions are included to reduce the false positive rate
- Each sample includes CVE ID, CWE type, vulnerability description, and commit message
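
The score filter and deduplication steps can be pictured with a small pure-Python sketch. It is illustrative only: the `vulnerability_score` field comes from the list above, while the `code` field name and the whitespace-normalized hashing are assumptions, not the datasets' actual pipeline.

```python
import hashlib


def normalize(code: str) -> str:
    """Collapse whitespace so formatting-only differences hash identically."""
    return " ".join(code.split())


def filter_and_dedupe(samples, min_score=1):
    """Keep samples scoring at least min_score, dropping repeated code bodies."""
    seen = set()
    kept = []
    for sample in samples:
        if sample["vulnerability_score"] < min_score:
            continue  # below the quality threshold
        digest = hashlib.sha256(normalize(sample["code"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same code already kept
        seen.add(digest)
        kept.append(sample)
    return kept


# Hypothetical records; the "code" field name is an assumption.
samples = [
    {"code": "strcpy(buf, input);", "vulnerability_score": 3},
    {"code": "strcpy(buf,   input);", "vulnerability_score": 2},  # whitespace duplicate
    {"code": "int x = 0;", "vulnerability_score": 0},              # below threshold
]
print(len(filter_and_dedupe(samples)))
```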
## Quick Start

### Installation

```bash
pip install transformers peft torch bitsandbytes accelerate
```
### Python API

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit NF4, matching the quantization used for training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter and load the tokenizer from the adapter repository
model = PeftModel.from_pretrained(base_model, "jacobmahon/zero-day-exploit-scanner-fixer")
tokenizer = AutoTokenizer.from_pretrained("jacobmahon/zero-day-exploit-scanner-fixer")

# Scan code
messages = [
    {"role": "system", "content": "You are a security expert. Analyze code for vulnerabilities and provide fixes."},
    {"role": "user", "content": "Analyze this C code for vulnerabilities:\n```c\nvoid process(char *input) {\n char buf[64];\n strcpy(buf, input);\n}\n```"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,  # sampling must be on for temperature/top_p to take effect
        temperature=0.3,
        top_p=0.9,
    )

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
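
To scan snippets in other languages, the message list can be built by a small helper. This is a sketch that reuses the system prompt and user-message wording from the example above; the function name `build_messages` and the choice to lowercase the language for the fence tag are assumptions.

```python
SYSTEM_PROMPT = (
    "You are a security expert. Analyze code for vulnerabilities and provide fixes."
)


def build_messages(code: str, language: str) -> list:
    """Build chat messages in the same shape as the Python API example."""
    fence = "`" * 3  # Markdown code fence, built here to avoid nesting fences
    user_content = (
        f"Analyze this {language} code for vulnerabilities:\n"
        f"{fence}{language.lower()}\n{code}\n{fence}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("char buf[10]; gets(buf);", "C")
print(messages[1]["content"])
```

The returned list can be passed directly to `tokenizer.apply_chat_template`.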
### CLI Usage

```bash
# Scan a code string
python inference.py --code "char buf[10]; gets(buf);"
# Scan a file
python inference.py --file vulnerable.c
# Interactive mode
python inference.py --interactive
```
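
The contents of `inference.py` are not reproduced in this card. As a hedged illustration of how the three modes above could be exposed, the sketch below uses `argparse` with a mutually exclusive group; the function names and the stdin handling for interactive mode are hypothetical and may differ from the actual script.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv=None):
    """Parse exactly one of --code, --file, or --interactive."""
    parser = argparse.ArgumentParser(
        description="Scan code for vulnerabilities (illustrative sketch)."
    )
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--code", help="code string to scan")
    mode.add_argument("--file", type=Path, help="path to a source file to scan")
    mode.add_argument("--interactive", action="store_true",
                      help="read a snippet from standard input")
    return parser.parse_args(argv)


def load_snippet(args) -> str:
    """Return the code to scan for whichever mode was selected."""
    if args.code is not None:
        return args.code
    if args.file is not None:
        return args.file.read_text(encoding="utf-8")
    return sys.stdin.read()  # simplistic stand-in for an interactive loop
```

For example, `load_snippet(parse_args(["--code", "char buf[10]; gets(buf);"]))` returns the code string unchanged.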
## Supported Vulnerability Types

The model has been trained on **169+ CWE types**, including:

| Category | CWE Examples |
|----------|-------------|
| **Memory Safety** | CWE-119 (Buffer Overflow), CWE-120 (Buffer Copy Without Size Check), CWE-416 (Use After Free), CWE-476 (NULL Pointer Deref) |
| **Injection** | CWE-79 (XSS), CWE-89 (SQL Injection), CWE-78 (OS Command Injection) |
| **Authentication** | CWE-287 (Improper Auth), CWE-306 (Missing Auth), CWE-798 (Hardcoded Credentials) |
| **Cryptography** | CWE-327 (Broken Crypto), CWE-330 (Insufficient Randomness) |
| **Race Conditions** | CWE-362 (Race Condition), CWE-367 (TOCTOU) |
| **Input Validation** | CWE-20 (Improper Input Validation), CWE-190 (Integer Overflow) |
| **Access Control** | CWE-862 (Missing Authorization), CWE-863 (Incorrect Authorization) |
| **Information Disclosure** | CWE-200 (Info Exposure), CWE-209 (Error Message Info Leak) |
## Training Recipe

Based on research from:

- **R2Vul** (arXiv:2504.04699): Structured reasoning for vulnerability detection (81.47% F1)
- **MSIVD** (arXiv:2406.05892): Multi-task instruction tuning (0.92 F1 on BigVul)
- **SecRepair** (arXiv:2401.03374): Combined detection + repair with RL
- **SecureCode**: QLoRA recipe with r=16, α=32, lr=2e-4, 3 epochs
- **TitanVul** (arXiv:2507.21817): 0.881 OOD accuracy on BenchVul benchmark
### Hyperparameters

```python
learning_rate = 2e-4  # LoRA rate, ~10x a typical full fine-tuning rate
num_train_epochs = 3
per_device_train_batch_size = 2
gradient_accumulation_steps = 8  # effective batch = 2 x 8 = 16 per device
max_length = 2048
lr_scheduler = "cosine"
warmup_steps = 100
optimizer = "adamw_torch"
quantization = "4-bit NF4 (double quant)"
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
```
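
The batch and schedule settings above imply a rough step budget and learning-rate curve. The sketch below works through that arithmetic, assuming a single GPU and the approximate 90K-sample figure (the exact sample count is not given), and uses the standard linear-warmup plus cosine-decay formula; real trainers may round step counts slightly differently.

```python
import math

TRAIN_SAMPLES = 90_000  # assumption: the "~90K" figure; exact count not given
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
WARMUP_STEPS = 100
BASE_LR = 2e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # single GPU assumed
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch: {effective_batch}, total optimizer steps: {total_steps}")
for step in (0, 50, 100, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions, training runs for roughly 16,900 optimizer steps, with the learning rate peaking at 2e-4 after step 100 and decaying to zero by the end.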
## Limitations & Ethical Use

- **Not a replacement for professional security audits**: Use as a screening tool alongside manual review
- **May produce false positives/negatives**: Always verify findings with static analysis tools (CodeQL, Semgrep)
- **Training data bias**: Primarily C/C++ and Java; coverage for newer languages (Rust, Go, Kotlin) is limited
- **Zero-day detection**: The model generalizes from known vulnerability patterns; truly novel attack vectors may not be detected
- **Do not use for malicious purposes**: This tool is designed for defensive security only
## Evaluation

Recommended evaluation benchmarks:

- [BenchVul](https://huggingface.co/datasets/yikun-li/BenchVul): MITRE Top 25 CWEs, balanced real-world + synthetic
- [SVEN](https://huggingface.co/datasets/bstee615/sven): Curated CWE-typed pairs with character-level diffs
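
When scoring detection on a benchmark like these, precision, recall, and F1 on the VULNERABLE class are the usual summary metrics. The following is a minimal sketch with made-up labels; it assumes each prediction has already been reduced to a `VULNERABLE` or `SAFE` string, and it reports no results for this model.

```python
def binary_metrics(y_true, y_pred, positive="VULNERABLE"):
    """Compute precision, recall, and F1 for the positive (vulnerable) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels purely to exercise the function.
y_true = ["VULNERABLE", "VULNERABLE", "SAFE", "SAFE"]
y_pred = ["VULNERABLE", "SAFE", "VULNERABLE", "SAFE"]
print(binary_metrics(y_true, y_pred))
```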
## Training

To reproduce or fine-tune further:

```bash
# Install dependencies
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes
# Run training (requires a GPU with 24 GB+ of memory)
python train.py
```

See `train.py` in this repository for the full training script.
## License

Apache 2.0

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for Qwen2.5-Coder-7B-Instruct
- [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul), [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul), [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) dataset authors
- Research teams behind R2Vul, MSIVD, SecRepair, and SecureCode