---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
- security
- vulnerability-detection
- code-repair
- zero-day
- exploit-scanner
- cybersecurity
- sft
- qlora
- peft
datasets:
- hitoshura25/megavul
- yikun-li/TitanVul
- yikun-li/CleanVul
language:
- en
pipeline_tag: text-generation
---

# 🔒 Zero-Day Exploit Scanner & Fixer

A fine-tuned code security model that **detects vulnerabilities** and **generates fixes** across multiple programming languages. Built on **Qwen2.5-Coder-7B-Instruct** with QLoRA fine-tuning on 90K+ real-world vulnerability-fix pairs from CVE/CWE databases.

## 🎯 What It Does

Given a code snippet, the model is trained to:

1. **SCAN**: Determine whether the code contains a security vulnerability (VULNERABLE / SAFE)
2. **IDENTIFY**: Classify the vulnerability type (CWE ID) and link it to known CVEs
3. **EXPLAIN**: Describe the attack vector, impact, and exploitation mechanism
4. **FIX**: Generate corrected code that patches the vulnerability
5. **DOCUMENT**: Explain what was changed and why

## 🏗️ Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) |
| **Method** | QLoRA (4-bit NF4 quantization) |
| **LoRA Config** | r=16, α=32, dropout=0.05 |
| **Target Modules** | q, k, v, o, gate, up, down projections |
| **Training** | SFT with assistant-only loss |
| **Max Length** | 2048 tokens |

## 📊 Training Data

Combined from 3 curated vulnerability datasets plus safe (patched) samples, totaling **90K+ samples**:

| Dataset | Samples | Languages | Source / Notes |
|---------|---------|-----------|----------------|
| [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul) | ~17K | C/C++ | 992 repos, 169 CWE types, 2006-2023 |
| [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul) | ~38K | C, C++, Java, Python, JS | Aggregated from 7 sources, deduplicated |
| [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) | ~26K | Multi-language | LLM-filtered, vulnerability_score ≥ 1 |
| **Safe samples** | ~12K | Multi-language | Fixed code from TitanVul (negative examples) |

### Data Quality Controls

- CleanVul is filtered by `vulnerability_score >= 1` (removes ~27% noise)
- TitanVul aggregates and deduplicates BigVul + DiverseVul + CVEFixes + PrimeVul + more
- Safe code examples taken from patched functions are included to reduce the false positive rate
- Each sample includes CVE ID, CWE type, vulnerability description, and commit message

## 🚀 Quick Start

### Installation

```bash
pip install transformers peft torch bitsandbytes accelerate
```

### Python API

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit NF4 and attach the LoRA adapter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "jacobmahon/zero-day-exploit-scanner-fixer")
tokenizer = AutoTokenizer.from_pretrained("jacobmahon/zero-day-exploit-scanner-fixer")

# Scan code
messages = [
    {"role": "system", "content": "You are a security expert. Analyze code for vulnerabilities and provide fixes."},
    {"role": "user", "content": "Analyze this C code for vulnerabilities:\n```c\nvoid process(char *input) {\n    char buf[64];\n    strcpy(buf, input);\n}\n```"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True makes the temperature/top_p settings take effect explicitly
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.3, top_p=0.9)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### CLI Usage

```bash
# Scan a code string
python inference.py --code "char buf[10]; gets(buf);"

# Scan a file
python inference.py --file vulnerable.c

# Interactive mode
python inference.py --interactive
```

## 📋 Supported Vulnerability Types

The model has been trained on **169+ CWE types** including:

| Category | CWE Examples |
|----------|-------------|
| **Memory Safety** | CWE-119 (Buffer Overflow), CWE-120 (Buffer Copy), CWE-416 (Use After Free), CWE-476 (NULL Pointer Deref) |
| **Injection** | CWE-79 (XSS), CWE-89 (SQL Injection), CWE-78 (OS Command Injection) |
| **Authentication** | CWE-287 (Improper Auth), CWE-306 (Missing Auth), CWE-798 (Hardcoded Credentials) |
| **Cryptography** | CWE-327 (Broken Crypto), CWE-330 (Insufficient Randomness) |
| **Race Conditions** | CWE-362 (Race Condition), CWE-367 (TOCTOU) |
| **Input Validation** | CWE-20 (Improper Input Validation), CWE-190 (Integer Overflow) |
| **Access Control** | CWE-862 (Missing Authorization), CWE-863 (Incorrect Authorization) |
| **Information Disclosure** | CWE-200 (Info Exposure), CWE-209 (Error Message Info Leak) |

## 🔬 Training Recipe

Based on research from:

- **R2Vul** (arXiv:2504.04699): Structured reasoning for vulnerability detection (81.47% F1)
- **MSIVD** (arXiv:2406.05892): Multi-task instruction tuning (0.92 F1 on BigVul)
- **SecRepair** (arXiv:2401.03374): Combined detection + repair with RL
- **SecureCode**: QLoRA recipe: r=16, α=32, lr=2e-4, 3 epochs
- **TitanVul** (arXiv:2507.21817): 0.881 OOD accuracy on BenchVul benchmark

### Hyperparameters

```python
learning_rate = 2e-4              # LoRA-optimized (10x base)
num_train_epochs = 3
per_device_train_batch_size = 2
gradient_accumulation_steps = 8   # Effective batch = 16
max_length = 2048
lr_scheduler = "cosine"
warmup_steps = 100
optimizer = "adamw_torch"
quantization = "4-bit NF4 (double quant)"
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
```

## ⚠️ Limitations & Ethical Use

- **Not a replacement for professional security audits**: Use as a screening tool alongside manual review
- **May produce false positives/negatives**: Always verify findings with static analysis tools (CodeQL, Semgrep)
- **Training data bias**: Primarily C/C++ and Java; coverage for newer languages (Rust, Go, Kotlin) is limited
- **Zero-day detection**: The model generalizes from known vulnerability patterns; truly novel attack vectors may not be detected
- **Do not use for malicious purposes**: This tool is designed for defensive security only

## 📚 Evaluation

Recommended evaluation benchmarks:

- [BenchVul](https://huggingface.co/datasets/yikun-li/BenchVul): MITRE Top 25 CWEs, balanced real-world + synthetic
- [SVEN](https://huggingface.co/datasets/bstee615/sven): Curated CWE-typed pairs with character-level diffs

## 🏃 Training

To reproduce or fine-tune further:

```bash
# Install dependencies
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes

# Run training (requires 24GB+ GPU)
python train.py
```

See `train.py` in this repository for the full training script.
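
Before launching a run, it can help to sanity-check what the hyperparameters above imply. The sketch below is a rough estimate, not part of `train.py`: it uses the rounded sample counts from the Training Data table, assumes a single GPU and no held-out split, and takes the layer dimensions (`HIDDEN`, `INTERMEDIATE`, `KV_DIM`, `LAYERS`) from the published Qwen2.5-7B configuration, which this card does not state. The helper names `training_steps` and `lora_params` are hypothetical.

```python
import math

# Rounded sample counts from the Training Data table.
SAMPLES = {"MegaVul": 17_000, "TitanVul": 38_000, "CleanVul": 26_000, "Safe": 12_000}

# Values from the Hyperparameters section.
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
WARMUP_STEPS = 100
LORA_RANK = 16

# ASSUMPTION: dimensions taken from the published Qwen2.5-7B config, not from this card.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 3584, 18944, 512, 28


def training_steps(num_samples, num_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * EPOCHS


def lora_params(rank=LORA_RANK):
    """Count LoRA weights across the seven targeted projections in every layer."""
    # An adapter on a (d_in -> d_out) linear layer adds rank * (d_in + d_out) weights.
    shapes = {
        "q_proj": (HIDDEN, HIDDEN),
        "k_proj": (HIDDEN, KV_DIM),
        "v_proj": (HIDDEN, KV_DIM),
        "o_proj": (HIDDEN, HIDDEN),
        "gate_proj": (HIDDEN, INTERMEDIATE),
        "up_proj": (HIDDEN, INTERMEDIATE),
        "down_proj": (INTERMEDIATE, HIDDEN),
    }
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * LAYERS


total_samples = sum(SAMPLES.values())
effective_batch, total_steps = training_steps(total_samples)
print(f"samples={total_samples} effective_batch={effective_batch} steps={total_steps}")
print(f"warmup fraction={WARMUP_STEPS / total_steps:.4f}")
print(f"LoRA trainable parameters={lora_params():,}")
```

Under these assumptions the estimate comes to roughly 17.4K optimizer steps over 3 epochs, with the 100 warmup steps covering well under 1% of training, and about 40.4M trainable LoRA parameters. The actual figures depend on the real dataset size after filtering and on how the trainer rounds partial accumulation steps.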

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for Qwen2.5-Coder-7B-Instruct
- [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul), [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul), [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) dataset authors
- Research teams behind R2Vul, MSIVD, SecRepair, and SecureCode