---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
  - security
  - vulnerability-detection
  - code-repair
  - zero-day
  - exploit-scanner
  - cybersecurity
  - sft
  - qlora
  - peft
datasets:
  - hitoshura25/megavul
  - yikun-li/TitanVul
  - yikun-li/CleanVul
language:
  - en
pipeline_tag: text-generation
---

# 🔒 Zero-Day Exploit Scanner & Fixer

A fine-tuned code security model that **detects vulnerabilities** and **generates fixes** across multiple programming languages.

Built on **Qwen2.5-Coder-7B-Instruct** with QLoRA fine-tuning on 90K+ real-world vulnerability-fix pairs from CVE/CWE databases.

## 🎯 What It Does

Given a code snippet, the model is trained to:

1. **SCAN** — Determine if the code contains a security vulnerability (VULNERABLE / SAFE)
2. **IDENTIFY** — Classify the vulnerability type (CWE ID) and link to known CVEs
3. **EXPLAIN** — Describe the attack vector, impact, and exploitation mechanism
4. **FIX** — Generate corrected code that patches the vulnerability
5. **DOCUMENT** — Explain what was changed and why
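
The steps above suggest a response that can be post-processed mechanically. The sketch below pulls a verdict and any CWE/CVE identifiers out of a response string with regular expressions. This card does not specify an exact output format, so the reliance on the literal `VULNERABLE` / `SAFE` keywords and the names `ScanReport` and `parse_report` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Standard identifier patterns; the verdict keywords come from the SCAN step above.
CWE_RE = re.compile(r"\bCWE-(\d+)\b", re.IGNORECASE)
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
VERDICT_RE = re.compile(r"\b(VULNERABLE|SAFE)\b")


@dataclass
class ScanReport:
    verdict: str  # "VULNERABLE", "SAFE", or "UNKNOWN" when no keyword is found
    cwe_ids: list = field(default_factory=list)
    cve_ids: list = field(default_factory=list)


def parse_report(response: str) -> ScanReport:
    """Hypothetical helper: extract verdict, CWE IDs, and CVE IDs from model text."""
    match = VERDICT_RE.search(response)
    verdict = match.group(1) if match else "UNKNOWN"
    # dict.fromkeys de-duplicates while keeping first-seen order.
    cwe_ids = list(dict.fromkeys(f"CWE-{n}" for n in CWE_RE.findall(response)))
    cve_ids = list(dict.fromkeys(m.upper() for m in CVE_RE.findall(response)))
    return ScanReport(verdict, cwe_ids, cve_ids)
```

If the model's actual responses use different wording for the verdict, the `VERDICT_RE` pattern would need to be adjusted accordingly.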

## 🏗️ Architecture

| Component | Details |
|-----------|---------|
| **Base Model** | [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) |
| **Method** | QLoRA (4-bit NF4 quantization) |
| **LoRA Config** | r=16, α=32, dropout=0.05 |
| **Target Modules** | q, k, v, o, gate, up, down projections |
| **Training** | SFT with assistant-only loss |
| **Max Length** | 2048 tokens |
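
As a rough illustration of what the table implies, the sketch below writes the LoRA settings as keyword arguments and estimates the number of trainable adapter parameters. The module names follow the usual Qwen2 naming (`q_proj`, `k_proj`, and so on); the layer count and projection shapes are assumptions about the 7B base model and should be checked against its `config.json`.

```python
# Keyword arguments matching the table above; they mirror peft.LoraConfig fields.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# ASSUMED (in_features, out_features) per projection and layer count for the
# 7B base model; verify against the base model's config before relying on them.
ASSUMED_SHAPES = {
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
    "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944),
    "down_proj": (18944, 3584),
}
ASSUMED_NUM_LAYERS = 28


def lora_param_count(shapes: dict, num_layers: int, r: int) -> int:
    """Each adapted linear layer gains A (r x in) and B (out x r): r * (in + out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(ASSUMED_SHAPES, ASSUMED_NUM_LAYERS, LORA_KWARGS["r"])
print(f"Estimated trainable LoRA parameters: {trainable:,}")
```

With `peft` installed, `LoraConfig(**LORA_KWARGS)` would build the corresponding adapter configuration.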

## 📊 Training Data

Combined from 3 curated vulnerability datasets, plus safe (patched) samples drawn from TitanVul, totaling **~90K samples**:

| Dataset | Samples | Languages | Source |
|---------|---------|-----------|--------|
| [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul) | ~17K | C/C++ | 992 repos, 169 CWE types, 2006-2023 |
| [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul) | ~38K | C, C++, Java, Python, JS | Aggregated from 7 sources, deduplicated |
| [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) | ~26K | Multi-language | LLM-filtered, vulnerability_score ≥ 1 |
| **Safe samples** | ~12K | Multi-language | Fixed code from TitanVul (negative examples) |

### Data Quality Controls
- CleanVul filtered by `vulnerability_score >= 1` (removes ~27% noise)
- TitanVul aggregates and deduplicates BigVul + DiverseVul + CVEFixes + PrimeVul + more
- Safe code examples drawn from patched functions are included as negatives to help reduce the false positive rate
- Each sample includes CVE ID, CWE type, vulnerability description, and commit message
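
The score filter and deduplication described above could look roughly like the following. This is a minimal sketch over plain dictionaries, not the actual preprocessing code: only the `vulnerability_score` field name comes from this card, while the `code` key, the whitespace normalization, and the helper names are hypothetical.

```python
import hashlib


def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted copies hash identically."""
    return " ".join(code.split())


def filter_and_dedup(samples, min_score=1):
    """Drop low-score samples, then drop exact duplicates of normalized code."""
    seen = set()
    kept = []
    for sample in samples:
        # Samples without a score (e.g. from other datasets) pass the score filter.
        score = sample.get("vulnerability_score")
        if score is not None and score < min_score:
            continue
        digest = hashlib.sha256(normalize(sample["code"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(sample)
    return kept


samples = [
    {"code": "strcpy(buf, input);", "vulnerability_score": 3},
    {"code": "strcpy(buf,   input);", "vulnerability_score": 2},  # duplicate after normalization
    {"code": "int x = 0;", "vulnerability_score": 0},            # below threshold
    {"code": "gets(buf);"},                                      # no score field
]
print(len(filter_and_dedup(samples)))
```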

## 🚀 Quick Start

### Installation

```bash
pip install transformers peft torch bitsandbytes accelerate
```

### Python API

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "jacobmahon/zero-day-exploit-scanner-fixer")
tokenizer = AutoTokenizer.from_pretrained("jacobmahon/zero-day-exploit-scanner-fixer")

# Scan code
messages = [
    {"role": "system", "content": "You are a security expert. Analyze code for vulnerabilities and provide fixes."},
    {"role": "user", "content": "Analyze this C code for vulnerabilities:\n```c\nvoid process(char *input) {\n    char buf[64];\n    strcpy(buf, input);\n}\n```"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

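with torch.no_grad():
    # do_sample=True ensures temperature/top_p are applied regardless of the default generation config
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.3, top_p=0.9)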

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
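
When scanning many snippets, it can help to build the chat messages with a small helper instead of hand-escaping strings. The sketch below reuses the system prompt and user-prompt wording from the example above; the helper name `build_messages` and its parameters are hypothetical.

```python
FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown

SYSTEM_PROMPT = (
    "You are a security expert. Analyze code for vulnerabilities and provide fixes."
)


def build_messages(code: str, language_name: str = "C", fence_tag: str = "c") -> list:
    """Hypothetical helper: wrap a snippet in the prompt format shown above."""
    user_content = (
        f"Analyze this {language_name} code for vulnerabilities:\n"
        f"{FENCE}{fence_tag}\n{code.rstrip()}\n{FENCE}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("void process(char *input) {\n    char buf[64];\n    strcpy(buf, input);\n}\n")
print(messages[1]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly as in the example above.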

### CLI Usage

```bash
# Scan a code string
python inference.py --code "char buf[10]; gets(buf);"

# Scan a file
python inference.py --file vulnerable.c

# Interactive mode
python inference.py --interactive
```
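
For readers adapting the CLI, the three modes above map naturally onto a mutually exclusive argument group. The following is a hypothetical sketch of such argument handling, not the contents of `inference.py`; only the flag names `--code`, `--file`, and `--interactive` are taken from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scan code for vulnerabilities (sketch).")
    # Exactly one input mode must be chosen.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--code", help="code string to scan")
    group.add_argument("--file", help="path to a source file to scan")
    group.add_argument("--interactive", action="store_true", help="read snippets in a loop")
    return parser


def resolve_source(args: argparse.Namespace):
    """Return the code to scan, or None when interactive mode was requested."""
    if args.code is not None:
        return args.code
    if args.file is not None:
        with open(args.file, encoding="utf-8") as handle:
            return handle.read()
    return None


args = build_parser().parse_args(["--code", "char buf[10]; gets(buf);"])
print(resolve_source(args))
```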

## 📋 Supported Vulnerability Types

The model has been trained on **169+ CWE types** including:

| Category | CWE Examples |
|----------|-------------|
| **Memory Safety** | CWE-119 (Buffer Overflow), CWE-120 (Buffer Copy), CWE-416 (Use After Free), CWE-476 (NULL Pointer Deref) |
| **Injection** | CWE-79 (XSS), CWE-89 (SQL Injection), CWE-78 (OS Command Injection) |
| **Authentication** | CWE-287 (Improper Auth), CWE-306 (Missing Auth), CWE-798 (Hardcoded Credentials) |
| **Cryptography** | CWE-327 (Broken Crypto), CWE-330 (Insufficient Randomness) |
| **Race Conditions** | CWE-362 (Race Condition), CWE-367 (TOCTOU) |
| **Input Validation** | CWE-20 (Improper Input Validation), CWE-190 (Integer Overflow) |
| **Access Control** | CWE-862 (Missing Authorization), CWE-863 (Incorrect Authorization) |
| **Information Disclosure** | CWE-200 (Info Exposure), CWE-209 (Error Message Info Leak) |
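
For triage, CWE IDs reported by the model can be grouped into the categories from the table above. The lookup below is a minimal sketch that covers only the example CWEs listed in the table; the `categorize` helper and the `"Other"` fallback are illustrative choices.

```python
# Category -> example CWE numbers, copied from the table above.
CWE_CATEGORIES = {
    "Memory Safety": [119, 120, 416, 476],
    "Injection": [79, 89, 78],
    "Authentication": [287, 306, 798],
    "Cryptography": [327, 330],
    "Race Conditions": [362, 367],
    "Input Validation": [20, 190],
    "Access Control": [862, 863],
    "Information Disclosure": [200, 209],
}

# Invert the mapping for direct lookup by CWE number.
CWE_TO_CATEGORY = {
    cwe: category for category, ids in CWE_CATEGORIES.items() for cwe in ids
}


def categorize(cwe_id: str) -> str:
    """Map an ID such as 'CWE-89' to its table category, or 'Other' if unlisted."""
    number = int(cwe_id.split("-")[-1])
    return CWE_TO_CATEGORY.get(number, "Other")


print(categorize("CWE-89"))
print(categorize("CWE-416"))
```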

## 🔬 Training Recipe

Based on research from:
- **R2Vul** (arXiv:2504.04699) — Structured reasoning for vulnerability detection (81.47% F1)
- **MSIVD** (arXiv:2406.05892) — Multi-task instruction tuning (0.92 F1 on BigVul)
- **SecRepair** (arXiv:2401.03374) — Combined detection + repair with RL
- **SecureCode** — QLoRA recipe: r=16, α=32, lr=2e-4, 3 epochs
- **TitanVul** (arXiv:2507.21817) — 0.881 OOD accuracy on BenchVul benchmark

### Hyperparameters

```python
learning_rate = 2e-4        # LoRA learning rate, roughly 10x a typical full fine-tuning rate
num_train_epochs = 3
per_device_train_batch_size = 2
gradient_accumulation_steps = 8  # Effective batch = 16
max_length = 2048
lr_scheduler = "cosine"
warmup_steps = 100
optimizer = "adamw_torch"
quantization = "4-bit NF4 (double quant)"
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
```
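
To make the schedule concrete, the sketch below derives the effective batch size and the approximate number of optimizer steps from the values above. It assumes a single GPU and the ~90K sample count quoted in this card; actual step counts depend on the exact dataset size after filtering and on how the trainer rounds partial batches.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      num_devices=1, warmup_steps=100):
    """Approximate optimizer-step counts for a gradient-accumulation setup."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# Assumed inputs: ~90K samples, one GPU, hyperparameters from the block above.
schedule = training_schedule(90_000, per_device_batch=2, grad_accum=8, epochs=3)
print(schedule)
```

Under these assumptions the 100 warmup steps cover well under 1% of training, so the cosine decay governs nearly the whole run.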

## ⚠️ Limitations & Ethical Use

- **Not a replacement for professional security audits** — Use as a screening tool alongside manual review
- **May produce false positives/negatives** — Always verify findings with static analysis tools (CodeQL, Semgrep)
- **Training data bias** — Primarily C/C++ and Java; coverage for newer languages (Rust, Go, Kotlin) is limited
- **Zero-day detection** — The model generalizes from known vulnerability patterns; truly novel attack vectors may not be detected
- **Do not use for malicious purposes** — This tool is designed for defensive security only

## 📚 Evaluation

Recommended evaluation benchmarks:
- [BenchVul](https://huggingface.co/datasets/yikun-li/BenchVul) — MITRE Top 25 CWEs, balanced real-world + synthetic
- [SVEN](https://huggingface.co/datasets/bstee615/sven) — Curated CWE-typed pairs with character-level diffs
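
Whichever benchmark is used, detection quality is usually summarized with precision, recall, and F1 on the `VULNERABLE` class. The following is a small, dependency-free sketch of that computation; the label strings and the `binary_metrics` helper are assumptions, and the benchmark datasets' own column names are not shown here.

```python
def binary_metrics(y_true, y_pred, positive="VULNERABLE"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = ["VULNERABLE", "VULNERABLE", "SAFE", "SAFE"]
y_pred = ["VULNERABLE", "SAFE", "VULNERABLE", "SAFE"]
print(binary_metrics(y_true, y_pred))
```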

## 🏃 Training

To reproduce or fine-tune further:

```bash
# Install dependencies
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes

# Run training (requires a GPU with 24 GB+ of memory)
python train.py
```

See `train.py` in this repository for the full training script.

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for Qwen2.5-Coder-7B-Instruct
- [MegaVul](https://huggingface.co/datasets/hitoshura25/megavul), [TitanVul](https://huggingface.co/datasets/yikun-li/TitanVul), [CleanVul](https://huggingface.co/datasets/yikun-li/CleanVul) dataset authors
- Research teams behind R2Vul, MSIVD, SecRepair, and SecureCode