---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
- security
- vulnerability-detection
- code-repair
- zero-day
- exploit-scanner
- cybersecurity
- sft
- qlora
- peft
datasets:
- hitoshura25/megavul
- yikun-li/TitanVul
- yikun-li/CleanVul
language:
- en
pipeline_tag: text-generation
---
# 🔒 Zero-Day Exploit Scanner & Fixer
A fine-tuned code security model that **detects vulnerabilities** and **generates fixes** across multiple programming languages.
Built on **Qwen2.5-Coder-7B-Instruct** with QLoRA fine-tuning on ~90K real-world samples: vulnerability-fix pairs tied to CVE/CWE records, plus patched functions used as safe examples.
## 🎯 What It Does
Given any code snippet, this model will:
1. **SCAN** β€” Determine if the code contains a security vulnerability (VULNERABLE / SAFE)
2. **IDENTIFY** β€” Classify the vulnerability type (CWE ID) and link to known CVEs
3. **EXPLAIN** β€” Describe the attack vector, impact, and exploitation mechanism
4. **FIX** β€” Generate corrected code that patches the vulnerability
5. **DOCUMENT** β€” Explain what was changed and why
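
The card does not specify the exact response format, so the sketch below is only one way to pull the verdict and CWE IDs out of a free-text reply. It assumes the reply contains the literal word VULNERABLE or SAFE and writes weaknesses as `CWE-<number>`; `ScanResult` and `parse_scan` are hypothetical names, not part of this repository.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ScanResult:
    """Hypothetical container for the verdict and CWE IDs found in a reply."""
    verdict: str                                  # "VULNERABLE", "SAFE", or "UNKNOWN"
    cwe_ids: list = field(default_factory=list)   # e.g. ["CWE-120"]


def parse_scan(reply: str) -> ScanResult:
    """Extract a verdict and CWE IDs from a free-text model reply.

    Assumes the literal labels VULNERABLE / SAFE and the CWE-<number> form;
    the real output format may differ.
    """
    upper = reply.upper()
    # Check VULNERABLE first: a vulnerable report may still mention a "safe" fix.
    if re.search(r"\bVULNERABLE\b", upper):
        verdict = "VULNERABLE"
    elif re.search(r"\bSAFE\b", upper):
        verdict = "SAFE"
    else:
        verdict = "UNKNOWN"
    # Keep first-seen order and drop duplicate IDs.
    cwe_ids = list(dict.fromkeys(re.findall(r"CWE-\d+", upper)))
    return ScanResult(verdict=verdict, cwe_ids=cwe_ids)


result = parse_scan("Verdict: VULNERABLE. This is CWE-120 (related: CWE-119).")
print(result.verdict, result.cwe_ids)
```

Keyword matching is deliberately naive: a phrase such as "not vulnerable" would still be read as VULNERABLE, so treat the parsed verdict as a hint and keep the full reply for review.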
## 🏗️ Architecture
| Component | Details |
|-----------|---------|
| **Base Model** | [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) |
| **Method** | QLoRA (4-bit NF4 quantization) |
| **LoRA Config** | r=16, α=32, dropout=0.05 |
| **Target Modules** | q, k, v, o, gate, up, down projections |
| **Training** | SFT with assistant-only loss |
| **Max Length** | 2048 tokens |
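
As a rough check on adapter size, the LoRA settings in the table can be turned into a trainable-parameter count. The layer shapes below (hidden size 3584, 512-wide key/value projections, MLP width 18944, 28 layers) and the `*_proj` module names are assumptions based on the base model's published configuration, not values stated in this card. Each adapted linear layer adds r × (in + out) parameters.

```python
# Assumed Qwen2.5-7B shapes; verify against the base model's config.json.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 3584, 512, 18944, 28
RANK = 16  # LoRA r from the table above

# (in_features, out_features) for each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(rank, shapes, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_params(RANK, TARGET_SHAPES, NUM_LAYERS)
print(f"{total:,} trainable LoRA parameters")
```

Under these assumed shapes the adapter has roughly 40M trainable parameters, well under 1% of the 7B base model.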
## 📊 Training Data
Combined from 3 curated vulnerability datasets totaling **~90K samples**:
| Dataset | Samples | Languages | Source |
|---------|---------|-----------|--------|
| [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul) | ~17K | C/C++ | 992 repos, 169 CWE types, 2006-2023 |
| [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul) | ~38K | C, C++, Java, Python, JS | Aggregated from 7 sources, deduplicated |
| [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) | ~26K | Multi-language | LLM-filtered, vulnerability_score ≥ 1 |
| **Safe samples** | ~12K | Multi-language | Fixed code from TitanVul (negative examples) |
### Data Quality Controls
- CleanVul filtered by `vulnerability_score >= 1` (removes ~27% noise)
- TitanVul aggregates and deduplicates BigVul + DiverseVul + CVEFixes + PrimeVul + more
- Safe code examples from patched functions reduce false positive rate
- Each sample includes CVE ID, CWE type, vulnerability description, and commit message
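
A minimal sketch of the two controls that translate directly into code: the score filter and the construction of safe (negative) samples from patched functions. The card applies the filter to CleanVul and draws safe samples from TitanVul's fixed code; the toy rows below stand in for both. Only `vulnerability_score` is named in this card; `func_before`, `func_after`, and the label strings are hypothetical.

```python
def filter_by_score(rows, min_score=1):
    """Keep rows whose vulnerability_score meets the threshold."""
    return [r for r in rows if r.get("vulnerability_score", 0) >= min_score]


def to_examples(row, include_safe=True):
    """Turn one vulnerability-fix pair into labelled examples.

    The pre-patch function becomes a VULNERABLE example; the patched
    function optionally becomes a SAFE (negative) example.
    """
    examples = [{"code": row["func_before"], "label": "VULNERABLE"}]
    if include_safe:
        examples.append({"code": row["func_after"], "label": "SAFE"})
    return examples


# Toy rows for illustration only; not taken from any of the datasets.
rows = [
    {"func_before": "gets(buf);",
     "func_after": "fgets(buf, sizeof buf, stdin);",
     "vulnerability_score": 3},
    {"func_before": "x = 1;", "func_after": "x = 2;", "vulnerability_score": 0},
]

kept = filter_by_score(rows)
dataset = [ex for row in kept for ex in to_examples(row)]
print(len(kept), len(dataset))
```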
## 🚀 Quick Start
### Installation
```bash
pip install transformers peft torch bitsandbytes accelerate
```
### Python API
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit NF4 with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# Attach the LoRA adapter and load the matching tokenizer
model = PeftModel.from_pretrained(base_model, "jacobmahon/zero-day-exploit-scanner-fixer")
tokenizer = AutoTokenizer.from_pretrained("jacobmahon/zero-day-exploit-scanner-fixer")

# Scan code
messages = [
    {"role": "system", "content": "You are a security expert. Analyze code for vulnerabilities and provide fixes."},
    {"role": "user", "content": "Analyze this C code for vulnerabilities:\n```c\nvoid process(char *input) {\n char buf[64];\n strcpy(buf, input);\n}\n```"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Sampling must be enabled for temperature/top_p to take effect
    outputs = model.generate(
        **inputs, max_new_tokens=1024, do_sample=True, temperature=0.3, top_p=0.9
    )
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
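
To scan snippets in other languages, the prompt construction above can be wrapped in a small helper. `build_messages` is a hypothetical name, and the user-prompt wording simply mirrors the example; whether it matches the prompts used during training is an assumption.

```python
SYSTEM_PROMPT = (
    "You are a security expert. "
    "Analyze code for vulnerabilities and provide fixes."
)
# Built programmatically so this example nests cleanly inside markdown.
FENCE = "`" * 3


def build_messages(code: str, language: str) -> list:
    """Build chat messages in the same shape as the Quick Start example."""
    tag = language.lower()  # fence info string, e.g. "c" or "python"
    user = (
        f"Analyze this {language} code for vulnerabilities:\n"
        f"{FENCE}{tag}\n{code}\n{FENCE}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


messages = build_messages('query = "SELECT * FROM t WHERE id=" + user_id', "Python")
print(messages[1]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` in the same way as the hand-written `messages` list.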
### CLI Usage
```bash
# Scan a code string
python inference.py --code "char buf[10]; gets(buf);"
# Scan a file
python inference.py --file vulnerable.c
# Interactive mode
python inference.py --interactive
```
## 📋 Supported Vulnerability Types
The model has been trained on **169+ CWE types** including:
| Category | CWE Examples |
|----------|-------------|
| **Memory Safety** | CWE-119 (Buffer Overflow), CWE-120 (Buffer Copy), CWE-416 (Use After Free), CWE-476 (NULL Pointer Deref) |
| **Injection** | CWE-79 (XSS), CWE-89 (SQL Injection), CWE-78 (OS Command Injection) |
| **Authentication** | CWE-287 (Improper Auth), CWE-306 (Missing Auth), CWE-798 (Hardcoded Credentials) |
| **Cryptography** | CWE-327 (Broken Crypto), CWE-330 (Insufficient Randomness) |
| **Race Conditions** | CWE-362 (Race Condition), CWE-367 (TOCTOU) |
| **Input Validation** | CWE-20 (Improper Input Validation), CWE-190 (Integer Overflow) |
| **Access Control** | CWE-862 (Missing Authorization), CWE-863 (Incorrect Authorization) |
| **Information Disclosure** | CWE-200 (Info Exposure), CWE-209 (Error Message Info Leak) |
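
For triage, the table can be encoded as a lookup so that reported CWE IDs are grouped by category. This covers only the examples listed above, not all 169+ types; `categorize` is a hypothetical helper.

```python
# CWE numbers copied from the table above, grouped by category.
CWE_CATEGORIES = {
    "Memory Safety": [119, 120, 416, 476],
    "Injection": [79, 89, 78],
    "Authentication": [287, 306, 798],
    "Cryptography": [327, 330],
    "Race Conditions": [362, 367],
    "Input Validation": [20, 190],
    "Access Control": [862, 863],
    "Information Disclosure": [200, 209],
}

# Invert to a CWE number -> category lookup.
CATEGORY_BY_CWE = {
    cwe: category
    for category, numbers in CWE_CATEGORIES.items()
    for cwe in numbers
}


def categorize(cwe_id: str) -> str:
    """Map an ID such as 'CWE-89' to its category, or 'Other' if not listed."""
    number = int(cwe_id.split("-")[-1])
    return CATEGORY_BY_CWE.get(number, "Other")


print(categorize("CWE-89"), "|", categorize("CWE-416"), "|", categorize("CWE-22"))
```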
## 🔬 Training Recipe
Based on research from:
- **R2Vul** (arXiv:2504.04699): Structured reasoning for vulnerability detection (81.47% F1)
- **MSIVD** (arXiv:2406.05892): Multi-task instruction tuning (0.92 F1 on BigVul)
- **SecRepair** (arXiv:2401.03374): Combined detection + repair with RL
- **SecureCode**: QLoRA recipe with r=16, α=32, lr=2e-4, 3 epochs
- **TitanVul** (arXiv:2507.21817): 0.881 OOD accuracy on BenchVul benchmark
### Hyperparameters
```python
learning_rate = 2e-4 # LoRA-optimized (10x base)
num_train_epochs = 3
per_device_train_batch_size = 2
gradient_accumulation_steps = 8 # Effective batch = 16
max_length = 2048
lr_scheduler = "cosine"
warmup_steps = 100
optimizer = "adamw_torch"
quantization = "4-bit NF4 (double quant)"
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
```
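
To see how these settings interact, the sketch below estimates the optimizer step count and the learning rate at a given step. It assumes about 90K training samples on a single GPU, and a linear warmup followed by cosine decay to zero, which is the usual shape of a cosine scheduler with warmup; the exact schedule and step count produced by `train.py` may differ.

```python
import math

NUM_SAMPLES = 90_000      # assumption: approximate dataset size from this card
EFFECTIVE_BATCH = 2 * 8   # per-device batch x gradient accumulation, one GPU assumed
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_STEPS = 100

steps_per_epoch = math.ceil(NUM_SAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"total optimizer steps: {total_steps}")
for step in (0, 50, 100, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With these assumptions the warmup covers well under 1% of training, so nearly the whole run follows the cosine decay.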
## ⚠️ Limitations & Ethical Use
- **Not a replacement for professional security audits** β€” Use as a screening tool alongside manual review
- **May produce false positives/negatives** β€” Always verify findings with static analysis tools (CodeQL, Semgrep)
- **Training data bias** β€” Primarily C/C++ and Java; coverage for newer languages (Rust, Go, Kotlin) is limited
- **Zero-day detection** β€” The model generalizes from known vulnerability patterns; truly novel attack vectors may not be detected
- **Do not use for malicious purposes** β€” This tool is designed for defensive security only
## 📚 Evaluation
Recommended evaluation benchmarks:
- [BenchVul](https://huggingface.co/datasets/yikun-li/BenchVul): MITRE Top 25 CWEs, balanced real-world + synthetic
- [SVEN](https://huggingface.co/datasets/bstee615/sven): Curated CWE-typed pairs with character-level diffs
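
Detection quality on such benchmarks can be summarized with precision, recall, and F1 over the VULNERABLE class. A minimal sketch, assuming labels are the strings `VULNERABLE` and `SAFE`; `detection_metrics` is a hypothetical helper, and the labels shown are made-up illustration data, not results for this model.

```python
def detection_metrics(y_true, y_pred, positive="VULNERABLE"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustration data only.
y_true = ["VULNERABLE", "VULNERABLE", "SAFE", "SAFE"]
y_pred = ["VULNERABLE", "SAFE", "VULNERABLE", "SAFE"]
print(detection_metrics(y_true, y_pred))
```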
## 🏃 Training
To reproduce or fine-tune further:
```bash
# Install dependencies
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes
# Run training (requires 24GB+ GPU)
python train.py
```
See `train.py` in this repository for the full training script.
## 📄 License
Apache 2.0
## 🙏 Acknowledgments
- [Qwen Team](https://huggingface.co/Qwen) for Qwen2.5-Coder-7B-Instruct
- [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul), [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul), [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) dataset authors
- Research teams behind R2Vul, MSIVD, SecRepair, and SecureCode