PySecPatch-7B / README.md
abkmystery's picture
Publish PySecPatch 0.1.1
add8f29 verified
|
Raw
History Blame Contribute Delete
4.98 kB
---
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- en
tags:
- code
- python
- cybersecurity
- vulnerability-detection
- vulnerability-repair
- secure-code-generation
datasets:
- abkmystery/PySecPatch-72K
---
# PySecPatch-7B
PySecPatch-7B is a defensive Python security adapter for `Qwen/Qwen2.5-Coder-7B-Instruct`. It is designed for vulnerability triage, CWE classification, security explanation, and candidate secure-code generation. It should be used with human review and automated verification.
**Author:** Ahmed Bin Khalid, Independent Researcher ([ORCID 0000-0002-0616-2604](https://orcid.org/0000-0002-0616-2604))
This repository contains the PEFT adapter, tokenizer files, training metadata, and chat template. The Qwen base weights are not redistributed.
## Evaluation
### Family-disjoint holdout
The final adapter and unmodified base were evaluated on the same 3,200 records.
| Metric | Base | PySecPatch |
|---|---:|---:|
| Classification accuracy | 4.81% | 90.72% |
| Classification F1 | 9.18% | 93.40% |
| Strict JSON | 91.53% | 99.44% |
| Clean-negative preservation | 0.00% | 100.00% |
| Security-control pass | 1.63% | 88.08% |
| Normalized exact repair | 0.17% | 83.33% |
| Parseable fixed code | 6.42% | 99.21% |
Paired classification yielded 2,770 adapter-only correct predictions and 21 base-only correct predictions across 3,200 records (`log10(p) = -787.25`, exact two-sided McNemar).
### External and repository evaluation
On the pinned SALLM scored subset, PySecPatch achieved 26.77% functional pass, 31.88% security-test pass, and 9.58% secure-functional pass. Four of 100 prompts lacked upstream fixtures.
Repository-format holdout performance was 38.00% patch application and 34.38% security-control pass. On a frozen 24-case repository suite, vulnerability detection and clean preservation were perfect, but none of 12 vulnerable patches passed every acceptance gate. These results do not support autonomous repair or state-of-the-art claims.
## Training
The adapter was trained in two consecutive QLoRA stages. Stage A used 8,400 train records from a 12,000-record corpus. Stage B continued from that adapter using 48,000 train records from a separate 60,000-record corpus. Validation, test, and holdout splits never entered optimization.
Both corpora contain generated Python examples and are released under Apache-2.0. The combined dataset repository contains 72,000 records spanning 43 CWEs in the second stage and 35 CWEs in the first stage.
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter = "abkmystery/PySecPatch-7B"
tokenizer = AutoTokenizer.from_pretrained(adapter)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
Use the system instruction:
```text
You are PySecPatch, a defensive Python secure coding model. Identify vulnerabilities, explain risk, and produce minimal safe patches. Return strict JSON only.
```
The expected response keys are `is_vulnerable`, `cwe`, `vuln_type`, `vulnerable_lines`, `explanation`, `fixed_code`, `patch_summary`, and `safe_test`.
## Intended Use
- Defensive analysis of user-supplied Python code.
- Candidate finding classification and CWE identification.
- Security explanations and review assistance.
- Candidate snippet repairs subject to tests and human review.
- Research on security specialization and generalization.
## Limitations
- External secure-functional generation is substantially weaker than controlled holdout performance.
- Repository-level unified diffs often fail to apply or pass verification.
- Line localization is moderate (`F1 = 0.4407`).
- The training corpora are generated rather than mined from real repositories.
- The model can miss vulnerabilities and can produce plausible but incomplete repairs.
Do not use PySecPatch for autonomous deployment, unauthorized scanning, exploit development, or offensive automation.
## Reproducibility
Adapter model SHA-256:
```text
4c2b5c7c0d2982b99de9c319e998274fc12f3aae5bf8d2c3b5db58c5864dc65b
```
Full evaluation reports, raw predictions, frozen hashes, and scripts are linked from the GitHub and archival releases.
Training dataset: [`10.5281/zenodo.21016753`](https://doi.org/10.5281/zenodo.21016753).
## Citation
```bibtex
@software{khalid2026pysecpatch,
author = {Bin Khalid, Ahmed},
title = {PySecPatch: Defensive Python Vulnerability Triage and Repair Research Artifacts},
year = {2026},
version = {0.1.1},
url = {https://github.com/abkmystery/PySecPatch}
}
```
Released under the Apache License 2.0. See `LICENSE`.
Current software archive: [`10.5281/zenodo.21015885`](https://doi.org/10.5281/zenodo.21015885). All versions: [`10.5281/zenodo.21015503`](https://doi.org/10.5281/zenodo.21015503).