license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
  - security
  - vulnerability-detection
  - code-repair
  - zero-day
  - exploit-scanner
  - cybersecurity
  - sft
  - qlora
  - peft
datasets:
  - hitoshura25/megavul
  - yikun-li/TitanVul
  - yikun-li/CleanVul
language:
  - en
pipeline_tag: text-generation

🔒 Zero-Day Exploit Scanner & Fixer

A fine-tuned code security model that detects vulnerabilities and generates fixes across multiple programming languages.

Built on Qwen2.5-Coder-7B-Instruct and fine-tuned with QLoRA on 90K+ real-world samples: vulnerable functions paired with their fixes, plus patched functions used as safe examples, drawn from CVE- and CWE-labeled datasets.

🎯 What It Does

Given a code snippet, the model is trained to:

SCAN — Determine if the code contains a security vulnerability (VULNERABLE / SAFE)
IDENTIFY — Classify the vulnerability type (CWE ID) and link to known CVEs
EXPLAIN — Describe the attack vector, impact, and exploitation mechanism
FIX — Generate corrected code that patches the vulnerability
DOCUMENT — Explain what was changed and why
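
The card does not specify an output format, so downstream tooling has to extract the verdict and identifiers from free-form text. The sketch below shows one way to do that, assuming the response contains the word VULNERABLE or SAFE and writes identifiers in the usual `CWE-<n>` and `CVE-<year>-<n>` form. `ScanReport` and `parse_report` are hypothetical names introduced for illustration, not part of this repository.

```python
import re
from dataclasses import dataclass, field

CWE_PATTERN = re.compile(r"\bCWE-\d+\b")
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


@dataclass
class ScanReport:
    verdict: str                                  # "VULNERABLE", "SAFE", or "UNKNOWN"
    cwe_ids: list = field(default_factory=list)   # e.g. ["CWE-120"]
    cve_ids: list = field(default_factory=list)   # e.g. ["CVE-2021-3156"]


def parse_report(response: str) -> ScanReport:
    """Extract a verdict and identifiers from free-form model output (assumed format)."""
    upper = response.upper()
    # Word boundaries keep "UNSAFE" from matching SAFE.
    if re.search(r"\bVULNERABLE\b", upper):
        verdict = "VULNERABLE"
    elif re.search(r"\bSAFE\b", upper):
        verdict = "SAFE"
    else:
        verdict = "UNKNOWN"
    # dict.fromkeys removes duplicates while preserving first-seen order.
    cwe_ids = list(dict.fromkeys(CWE_PATTERN.findall(upper)))
    cve_ids = list(dict.fromkeys(CVE_PATTERN.findall(upper)))
    return ScanReport(verdict, cwe_ids, cve_ids)
```

Keyword matching is deliberately naive: a phrase such as "not vulnerable" would still be read as VULNERABLE, so any extracted verdict or CVE ID should be checked against the full response.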

🏗️ Architecture

Component	Details
Base Model	Qwen/Qwen2.5-Coder-7B-Instruct
Method	QLoRA (4-bit NF4 quantization)
LoRA Config	r=16, α=32, dropout=0.05
Target Modules	q, k, v, o, gate, up, down projections
Training	SFT with assistant-only loss
Max Length	2048 tokens
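
To give a sense of how small the adapter is relative to the 7B base model, the sketch below estimates the number of trainable LoRA parameters implied by r=16 on the seven targeted projections. The layer dimensions and the `*_proj` module names are assumptions taken from the base model's published configuration and should be verified against its `config.json`.

```python
# Assumed dimensions of the 7B base model; verify against its config.json.
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size (MLP width)
KV_DIM = 512           # num_key_value_heads (4) * head_dim (128)
NUM_LAYERS = 28        # num_hidden_layers
RANK = 16              # LoRA rank r from the table above

# (input features, output features) for each targeted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable adapter parameters")
```

Under these assumed dimensions the adapter has roughly 40 million trainable parameters, well under 1% of the base model. With α=32 and r=16, standard LoRA scaling multiplies the adapter update by α/r = 2.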

📊 Training Data

Combined from three curated vulnerability datasets, plus safe examples taken from TitanVul's fixed code, totaling 90K+ samples:

Dataset	Samples	Languages	Source
MegaVul	~17K	C/C++	992 repos, 169 CWE types, 2006-2023
TitanVul	~38K	C, C++, Java, Python, JS	Aggregated from 7 sources, deduplicated
CleanVul	~26K	Multi-language	LLM-filtered, vulnerability_score ≥ 1
Safe samples	~12K	Multi-language	Fixed code from TitanVul (negative examples)
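
As a quick check on the mix, the approximate counts in the table can be summed to see the share of safe (negative) examples. The figures below are the rounded values from the table, so the results are estimates only.

```python
# Approximate sample counts in thousands, copied from the table above.
samples_k = {"MegaVul": 17, "TitanVul": 38, "CleanVul": 26, "Safe samples": 12}

total_k = sum(samples_k.values())
vulnerable_k = total_k - samples_k["Safe samples"]
safe_share = samples_k["Safe samples"] / total_k

for name, count in samples_k.items():
    print(f"{name:<13} ~{count}K  ({count / total_k:.1%})")
print(f"Total ~{total_k}K, vulnerable ~{vulnerable_k}K, safe share {safe_share:.1%}")
```

With these rounded figures, roughly one sample in eight is a safe example, so the training set is weighted heavily toward vulnerable code.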

Data Quality Controls

CleanVul filtered by vulnerability_score >= 1 (removes ~27% noise)
TitanVul aggregates and deduplicates BigVul + DiverseVul + CVEFixes + PrimeVul + more
Safe code examples from patched functions serve as negative examples, intended to reduce the false positive rate
Each sample includes CVE ID, CWE type, vulnerability description, and commit message
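
The filtering and deduplication steps above can be sketched as follows. This is a minimal illustration, not the pipeline used by the dataset authors or by `train.py`: the `code` field name and the whitespace-normalized hashing are assumptions, and only `vulnerability_score` comes from the card.

```python
import hashlib


def normalize(code: str) -> str:
    """Collapse whitespace so formatting-only variants hash identically."""
    return " ".join(code.split())


def clean(records, min_score=1):
    """Keep records that meet the score threshold and drop duplicate code."""
    seen = set()
    kept = []
    for rec in records:
        # Records without a score (from other datasets) pass through unfiltered.
        score = rec.get("vulnerability_score")
        if score is not None and score < min_score:
            continue
        digest = hashlib.sha256(normalize(rec["code"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same code already kept
        seen.add(digest)
        kept.append(rec)
    return kept
```

Exact-match hashing only catches copies that differ in whitespace; near-duplicates with renamed variables would need a fuzzier comparison.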

🚀 Quick Start

Installation

pip install transformers peft torch bitsandbytes accelerate

Python API

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "jacobmahon/zero-day-exploit-scanner-fixer")
tokenizer = AutoTokenizer.from_pretrained("jacobmahon/zero-day-exploit-scanner-fixer")

# Scan code
messages = [
    {"role": "system", "content": "You are a security expert. Analyze code for vulnerabilities and provide fixes."},
    {"role": "user", "content": "Analyze this C code for vulnerabilities:\n```c\nvoid process(char *input) {\n    char buf[64];\n    strcpy(buf, input);\n}\n```"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,  # make sampling explicit so temperature/top_p are applied
        temperature=0.3,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
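
To scan snippets other than the hard-coded C example, the message list can be built by a small helper. The sketch below reuses the system prompt and user-message layout from the example above; `build_messages` is a hypothetical helper name, and using the lowercased language name as the fence tag is an assumption.

```python
SYSTEM_PROMPT = (
    "You are a security expert. Analyze code for vulnerabilities and provide fixes."
)


def build_messages(code: str, language: str = "C") -> list:
    """Wrap a snippet in the same system/user structure as the example above."""
    fence = "`" * 3  # Markdown code fence, built here to keep the source readable
    user = (
        f"Analyze this {language} code for vulnerabilities:\n"
        f"{fence}{language.lower()}\n{code.strip()}\n{fence}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```

The returned list can be passed to `tokenizer.apply_chat_template(...)` exactly as `messages` is in the example. Since training used a maximum length of 2048 tokens, very long files are probably better split into individual functions before scanning.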

CLI Usage

# Scan a code string
python inference.py --code "char buf[10]; gets(buf);"

# Scan a file
python inference.py --file vulnerable.c

# Interactive mode
python inference.py --interactive
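
For readers who want to wrap the model in their own script, the three modes above map naturally onto a mutually exclusive argument group. The sketch below is an illustration of that idea only; it is not the repository's actual `inference.py`, and `parse_args` and `load_snippet` are hypothetical names.

```python
import argparse


def parse_args(argv=None):
    """Parse the three input modes; exactly one must be given."""
    parser = argparse.ArgumentParser(description="Scan code for vulnerabilities")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--code", help="code snippet passed as a string")
    source.add_argument("--file", help="path to a source file to scan")
    source.add_argument("--interactive", action="store_true",
                        help="read snippets from standard input in a loop")
    return parser.parse_args(argv)


def load_snippet(args):
    """Return the code to scan, or None when interactive mode is requested."""
    if args.interactive:
        return None
    if args.file:
        # errors="replace" keeps the scan going on files with odd encodings.
        with open(args.file, encoding="utf-8", errors="replace") as handle:
            return handle.read()
    return args.code
```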

📋 Supported Vulnerability Types

The model has been trained on 169+ CWE types, including:

Category	CWE Examples
Memory Safety	CWE-119 (Buffer Overflow), CWE-120 (Buffer Copy), CWE-416 (Use After Free), CWE-476 (NULL Pointer Deref)
Injection	CWE-79 (XSS), CWE-89 (SQL Injection), CWE-78 (OS Command Injection)
Authentication	CWE-287 (Improper Auth), CWE-306 (Missing Auth), CWE-798 (Hardcoded Credentials)
Cryptography	CWE-327 (Broken Crypto), CWE-330 (Insufficient Randomness)
Race Conditions	CWE-362 (Race Condition), CWE-367 (TOCTOU)
Input Validation	CWE-20 (Improper Input Validation), CWE-190 (Integer Overflow)
Access Control	CWE-862 (Missing Authorization), CWE-863 (Incorrect Authorization)
Information Disclosure	CWE-200 (Info Exposure), CWE-209 (Error Message Info Leak)
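
When aggregating scan results, it can help to group reported CWE IDs by the categories in the table. The sketch below encodes the table as a lookup; `categorize` is a hypothetical helper, and any CWE not listed in the table falls back to "Other".

```python
# CWE numbers grouped exactly as in the table above.
CWE_CATEGORIES = {
    "Memory Safety": [119, 120, 416, 476],
    "Injection": [79, 89, 78],
    "Authentication": [287, 306, 798],
    "Cryptography": [327, 330],
    "Race Conditions": [362, 367],
    "Input Validation": [20, 190],
    "Access Control": [862, 863],
    "Information Disclosure": [200, 209],
}

# Invert the table so each CWE number maps to its category.
CATEGORY_BY_CWE = {
    cwe: category for category, ids in CWE_CATEGORIES.items() for cwe in ids
}


def categorize(cwe_id: str) -> str:
    """Map an identifier such as 'CWE-89' to its category, or 'Other'."""
    number = int(cwe_id.upper().replace("CWE-", ""))
    return CATEGORY_BY_CWE.get(number, "Other")
```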

🔬 Training Recipe

Based on research from:

R2Vul (arXiv:2504.04699) — Structured reasoning for vulnerability detection (81.47% F1)
MSIVD (arXiv:2406.05892) — Multi-task instruction tuning (0.92 F1 on BigVul)
SecRepair (arXiv:2401.03374) — Combined detection + repair with RL
SecureCode — QLoRA recipe: r=16, α=32, lr=2e-4, 3 epochs
TitanVul (arXiv:2507.21817) — 0.881 OOD accuracy on BenchVul benchmark

Hyperparameters

learning_rate = 2e-4        # common LoRA setting, roughly 10x a typical full fine-tuning rate
num_train_epochs = 3
per_device_train_batch_size = 2
gradient_accumulation_steps = 8  # Effective batch = 16
max_length = 2048
lr_scheduler = "cosine"
warmup_steps = 100
optimizer = "adamw_torch"
quantization = "4-bit NF4 (double quant)"
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
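
The schedule implied by these values can be worked through numerically. The sketch below assumes about 93K samples (the sum of the approximate dataset sizes), a single GPU, and the usual linear-warmup-then-cosine-decay shape; the step counts a real trainer reports may differ slightly.

```python
import math

LEARNING_RATE = 2e-4
EPOCHS = 3
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
WARMUP_STEPS = 100
NUM_SAMPLES = 93_000   # assumption: sum of the approximate dataset sizes
NUM_DEVICES = 1        # assumption: single-GPU training

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to the peak rate, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch {effective_batch}, {total_steps} optimizer steps")
for step in (0, 50, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the 100 warmup steps cover well under 1% of training, so the run spends nearly all of its time on the cosine decay.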

⚠️ Limitations & Ethical Use

Not a replacement for professional security audits — Use as a screening tool alongside manual review
May produce false positives/negatives — Always verify findings with static analysis tools (CodeQL, Semgrep)
Training data bias — Primarily C/C++ and Java; coverage for newer languages (Rust, Go, Kotlin) is limited
Zero-day detection — The model generalizes from known vulnerability patterns; truly novel attack vectors may not be detected
Do not use for malicious purposes — This tool is designed for defensive security only

📚 Evaluation

Recommended evaluation benchmarks:

BenchVul — MITRE Top 25 CWEs, balanced real-world + synthetic
SVEN — Curated CWE-typed pairs with character-level diffs
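
Whichever benchmark is used, the detection step reduces to binary classification, so precision, recall, F1, and false positive rate are the natural metrics. The sketch below computes them from lists of VULNERABLE / SAFE labels; `binary_metrics` is a hypothetical helper and is not tied to either benchmark's official scoring script.

```python
def binary_metrics(y_true, y_pred, positive="VULNERABLE"):
    """Compute detection metrics, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
    }
```

Reporting the false positive rate alongside F1 is useful here because the safe examples in the training data were included specifically to keep false alarms down.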

🏃 Training

To reproduce or fine-tune further:

# Install dependencies
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes

# Run training (requires a GPU with 24 GB+ of memory)
python train.py

See train.py in this repository for the full training script.

📄 License

Apache 2.0

🙏 Acknowledgments

Qwen Team for Qwen2.5-Coder-7B-Instruct
MegaVul, TitanVul, CleanVul dataset authors
Research teams behind R2Vul, MSIVD, SecRepair, and SecureCode