Publish PySecPatch 0.1.1

add8f29 verified 3 days ago

4.98 kB

	---
	base_model: Qwen/Qwen2.5-Coder-7B-Instruct
	library_name: peft
	pipeline_tag: text-generation
	license: apache-2.0
	language:
	- en
	tags:
	- code
	- python
	- cybersecurity
	- vulnerability-detection
	- vulnerability-repair
	- secure-code-generation
	datasets:
	- abkmystery/PySecPatch-72K
	---

	# PySecPatch-7B

	PySecPatch-7B is a defensive Python security adapter for `Qwen/Qwen2.5-Coder-7B-Instruct`. It is designed for vulnerability triage, CWE classification, security explanation, and candidate secure-code generation. It should be used with human review and automated verification.

	Author: Ahmed Bin Khalid, Independent Researcher ([ORCID 0000-0002-0616-2604](https://orcid.org/0000-0002-0616-2604))

	This repository contains the PEFT adapter, tokenizer files, training metadata, and chat template. The Qwen base weights are not redistributed.

	## Evaluation

	### Family-disjoint holdout

	The final adapter and unmodified base were evaluated on the same 3,200 records.

	\| Metric \| Base \| PySecPatch \|
	\|---\|---:\|---:\|
	\| Classification accuracy \| 4.81% \| 90.72% \|
	\| Classification F1 \| 9.18% \| 93.40% \|
	\| Strict JSON \| 91.53% \| 99.44% \|
	\| Clean-negative preservation \| 0.00% \| 100.00% \|
	\| Security-control pass \| 1.63% \| 88.08% \|
	\| Normalized exact repair \| 0.17% \| 83.33% \|
	\| Parseable fixed code \| 6.42% \| 99.21% \|

	Paired classification yielded 2,770 adapter-only correct predictions and 21 base-only correct predictions across 3,200 records (`log10(p) = -787.25`, exact two-sided McNemar).

	### External and repository evaluation

	On the pinned SALLM scored subset, PySecPatch achieved 26.77% functional pass, 31.88% security-test pass, and 9.58% secure-functional pass. Four of 100 prompts lacked upstream fixtures.

	Repository-format holdout performance was 38.00% patch application and 34.38% security-control pass. On a frozen 24-case repository suite, vulnerability detection and clean preservation were perfect, but none of 12 vulnerable patches passed every acceptance gate. These results do not support autonomous repair or state-of-the-art claims.

	## Training

	The adapter was trained in two consecutive QLoRA stages. Stage A used 8,400 train records from a 12,000-record corpus. Stage B continued from that adapter using 48,000 train records from a separate 60,000-record corpus. Validation, test, and holdout splits never entered optimization.

	Both corpora contain generated Python examples and are released under Apache-2.0. The combined dataset repository contains 72,000 records spanning 43 CWEs in the second stage and 35 CWEs in the first stage.

	## Usage

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	base = "Qwen/Qwen2.5-Coder-7B-Instruct"
	adapter = "abkmystery/PySecPatch-7B"

	tokenizer = AutoTokenizer.from_pretrained(adapter)
	model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
	model = PeftModel.from_pretrained(model, adapter)
	```

	Use the system instruction:

	```text
	You are PySecPatch, a defensive Python secure coding model. Identify vulnerabilities, explain risk, and produce minimal safe patches. Return strict JSON only.
	```

	The expected response keys are `is_vulnerable`, `cwe`, `vuln_type`, `vulnerable_lines`, `explanation`, `fixed_code`, `patch_summary`, and `safe_test`.

	## Intended Use

	- Defensive analysis of user-supplied Python code.
	- Candidate finding classification and CWE identification.
	- Security explanations and review assistance.
	- Candidate snippet repairs subject to tests and human review.
	- Research on security specialization and generalization.

	## Limitations

	- External secure-functional generation is substantially weaker than controlled holdout performance.
	- Repository-level unified diffs often fail to apply or pass verification.
	- Line localization is moderate (`F1 = 0.4407`).
	- The training corpora are generated rather than mined from real repositories.
	- The model can miss vulnerabilities and can produce plausible but incomplete repairs.

	Do not use PySecPatch for autonomous deployment, unauthorized scanning, exploit development, or offensive automation.

	## Reproducibility

	Adapter model SHA-256:

	```text
	4c2b5c7c0d2982b99de9c319e998274fc12f3aae5bf8d2c3b5db58c5864dc65b
	```

	Full evaluation reports, raw predictions, frozen hashes, and scripts are linked from the GitHub and archival releases.

	Training dataset: [`10.5281/zenodo.21016753`](https://doi.org/10.5281/zenodo.21016753).

	## Citation

	```bibtex
	@software{khalid2026pysecpatch,
	author = {Bin Khalid, Ahmed},
	title = {PySecPatch: Defensive Python Vulnerability Triage and Repair Research Artifacts},
	year = {2026},
	version = {0.1.1},
	url = {https://github.com/abkmystery/PySecPatch}
	}
	```

	Released under the Apache License 2.0. See `LICENSE`.

	Current software archive: [`10.5281/zenodo.21015885`](https://doi.org/10.5281/zenodo.21015885). All versions: [`10.5281/zenodo.21015503`](https://doi.org/10.5281/zenodo.21015503).