CVE Backport Code Generation – Qwen2.5-Coder-32B (v4)
Fine-tuned Qwen2.5-Coder-32B-Instruct for security patch backporting via per-hunk code generation, with CVE test case generation.
Instead of generating unified diffs, this model takes a vulnerable code region and a fix description, and outputs the fixed version of the code. A programmatic diff then produces the final patch. Optionally, the model can also generate a test case that verifies the fix.
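The programmatic diff step can be sketched with Python's `difflib`, assuming the model returns the full fixed region as plain text (the helper name and example snippet here are illustrative, not the tool's actual code):

```python
import difflib

def region_to_patch(original: str, fixed: str, path: str) -> str:
    """Turn a model-generated fixed region into a unified diff.

    `original` is the vulnerable region sent to the model, `fixed` is
    the model's output, `path` locates the region's file.
    """
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)

# Example: a one-line fix inside a two-line region.
patch = region_to_patch(
    "len = strlen(s);\ncopy(buf, s);\n",
    "len = strlen(s);\ncopy(buf, s, len);\n",
    "lib/ftp.c",
)
```

Because the model only ever emits code, the diff is always syntactically valid patch output regardless of how the model formats its answer.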
## Quick Start

```sh
git clone https://github.com/openSUSE/cve-backport-tool
cd cve-backport-tool
./setup.sh   # downloads GGUF, registers with ollama

python3 cve-backport.py \
  --cve CVE-2024-1234 \
  --package curl \
  --patch upstream-fix.patch \
  --obs-fetch --obs-project openSUSE:Leap:15.6:Update \
  --retry 3
```
## GGUF Downloads

| File | Quant | Size | Notes |
|---|---|---|---|
| cve-backport-codegen-v4-q8_0.gguf | Q8_0 | 33 GB | Recommended (v4, 36K dataset + test generation) |
| cve-backport-codegen-v3-q8_0.gguf | Q8_0 | 33 GB | v3 (35K dataset, 98% precision) |
## Evaluation (v4)
Per-hunk evaluation on 100 held-out examples the model never saw during training:
| Metric | v3 (n=20) | v4 (n=100) |
|---|---|---|
| Average recall | 94% | 93% |
| Average precision | 98% | 95% |
| Exact match | 16/20 | 87/100 |
| Failures (<10%) | 0/20 | 4/100 |
By tier:
- Identical (upstream patch applies directly): 94% recall
- Adapted (line numbers/context differ): 86% recall
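The card does not spell out how per-hunk recall and precision are scored; one plausible reading, treated here purely as an assumption, is set overlap between the reference patch's changed lines and the generated patch's changed lines:

```python
def line_metrics(reference_changed: set[str],
                 generated_changed: set[str]) -> tuple[float, float]:
    """Precision/recall over changed lines.

    ASSUMED metric for illustration only; the repo's actual scoring
    may differ (e.g. it may weight hunks or normalize whitespace).
    """
    tp = len(reference_changed & generated_changed)  # correctly changed lines
    precision = tp / len(generated_changed) if generated_changed else 0.0
    recall = tp / len(reference_changed) if reference_changed else 0.0
    return precision, recall
```

Under this reading, "exact match" corresponds to precision = recall = 1.0 for a hunk, and a "failure" is a hunk whose recall falls below the 10% threshold in the table above.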
## Test Generation (new in v4)
50 held-out CVEs with known reference tests:
- Average quality score: 0.67
- All 50 produced structurally valid tests
- 17/50 matched reference test exactly
## Comparison with Frontier Models
Same eval, same 100 examples, optimized prompts with markdown stripping:
| Model | Recall | Precision | Exact | Failures |
|---|---|---|---|---|
| CVE Backport v4 (32B fine-tuned) | 93% | 95% | 87/100 | 4 |
| Gemini 3.1 Pro (frontier, zero-shot) | 27% | 24% | 10/100 | 50 |
| Gemini 2.0 Flash (frontier, zero-shot) | 13% | 17% | 4/100 | 81 |
Fine-tuning on 36K domain-specific examples outperforms frontier models by 3-7x on this task.
## Prompt Format
ChatML format. Each prompt covers one hunk region with 15 lines of context padding.
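Extracting one hunk region with its context padding can be sketched as follows (a hypothetical helper, not the tool's actual code; it assumes 1-based hunk line numbers):

```python
def hunk_region(file_lines: list[str], hunk_start: int, hunk_end: int,
                pad: int = 15) -> tuple[int, int, str]:
    """Return the 1-based start/end line numbers and the text of the
    hunk region plus `pad` lines of context on each side, clamped to
    the file boundaries."""
    start = max(1, hunk_start - pad)
    end = min(len(file_lines), hunk_end + pad)
    return start, end, "".join(file_lines[start - 1:end])
```

The returned `start`/`end` pair is what fills the `## Lines:` header in the prompt below.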
### Code Generation (3-turn)
System:
You are a security patch backporting assistant.
Given vulnerable source code and a description of the upstream fix, output the FIXED version of the code.
Rules:
- Output ONLY the fixed code, nothing else – no explanations, no markdown fences
- Preserve exact formatting, indentation, and style of the original
- Make ONLY the changes described in the fix – do not modify anything else
- Do not add comments about what you changed
User:
## File: lib/ftp.c
## Lines: 2836-2912
```c
{vulnerable code region with 15-line padding}
```
## Fix
CVE-2017-8817: FTP wildcard matching – zero terminate the entry path
```diff
{upstream patch}
```
Assistant: The fixed code (same region with the security fix applied).
### Test Generation (5-turn, new in v4)
After the code generation exchange, an optional follow-up user turn requests a test:
User:
Write a test case that:
1. Triggers the vulnerability in the original code above
2. Passes after applying your fix
Output ONLY the test code, nothing else.
Assistant: Test code targeting the specific CVE.
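Assuming the standard Qwen ChatML layout (`<|im_start|>role ... <|im_end|>`), a single-hunk prompt can be assembled roughly as below; the helper and the truncated fix text are illustrative, and in practice the tokenizer's own chat template would normally do this rendering:

```python
def chatml(messages: list[dict[str, str]]) -> str:
    """Render messages in ChatML and end with the assistant header,
    so the model completes with the fixed code."""
    body = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )
    return body + "<|im_start|>assistant\n"

SYSTEM = (
    "You are a security patch backporting assistant.\n"
    "Given vulnerable source code and a description of the upstream "
    "fix, output the FIXED version of the code."
)

# {code} and {patch} stand in for the padded region and upstream diff.
user = (
    "## File: lib/ftp.c\n## Lines: 2836-2912\n"
    "```c\n{code}\n```\n"
    "## Fix\nCVE-2017-8817: FTP wildcard matching\n"
    "```diff\n{patch}\n```"
)

prompt = chatml([{"role": "system", "content": SYSTEM},
                 {"role": "user", "content": user}])
```

For the 5-turn variant, the assistant's fixed code and the test-request user turn are appended as two more messages before the final assistant header.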
## Training

| Parameter | Value |
|---|---|
| Base model | Qwen2.5-Coder-32B-Instruct |
| Method | QLoRA (4-bit NF4, r=64, alpha=128) |
| Epochs | 2 |
| Learning rate | 1e-4 |
| Max sequence length | 4,096 tokens |
| Batch size | 1 (gradient accumulation 8) |
| Training examples | 36,166 (35,396 codegen + 770 codegen+test) |
| Training time | 41.2 hours |
| Hardware | 2x NVIDIA H100 NVL 94GB |
| Label masking | Multi-turn aware (both assistant segments trained) |
## Training Data

openSUSE/cve-backport-codegen-dataset – 36,166 per-hunk examples from openSUSE maintenance patches, covering 145+ packages and 2,300+ CVEs, with per-example SPDX license metadata.
## Reproducibility

Trained using the Teapot composable training pipeline:

```sh
teapot compose configs/cve-backport.config
teapot train configs/cve-backport.config --backend qlora-hf
teapot eval configs/cve-backport.config
```
Dataset: openSUSE/cve-backport-codegen-dataset (train.jsonl + eval.jsonl).
## Intended Use

This model assists with security patch backporting in Linux distribution maintenance. It is a research tool: all generated patches must be reviewed by a maintainer before application.
Important: This model was fine-tuned for code generation accuracy, not for safety alignment. It inherits the base model's safety training but has no additional guardrails. In particular:
- The model follows fix descriptions literally. If the fix description contains malicious instructions (e.g., "add a backdoor"), the model will comply. Fix descriptions must come from trusted sources – typically upstream patches, not user input.
- The tool is designed for use with trusted inputs (upstream CVE patches, OBS source packages). It should not be exposed as a public API without input validation.
- Generated patches and test cases must always be reviewed by a maintainer before application.
Adding safety training to the fine-tuning was considered but deliberately deferred: our evaluation showed that domain precision (98% in v3) is sensitive to training data composition, and mixing safety examples risks degrading the model's core capability. The correct mitigation is input validation in the tool, not model-level refusal.
## Known Issues
- **Prompt echo (v4):** The v4 model occasionally echoes prompt structure (`## File:`, markdown fences) into its code output, likely from the 5-turn test generation training data. The CLI tool strips these automatically. This is a minor regression from v3.
- **Test generation quality varies:** Test cases for simple vulnerability patterns (null deref, bounds check, injection) are useful. For complex multi-file patches with adapted context, the model may produce generic placeholder tests.
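The stripping the CLI performs for the prompt-echo issue can be sketched as follows (a hypothetical helper; the tool's actual implementation may differ):

```python
import re

def strip_prompt_echo(output: str) -> str:
    """Remove echoed prompt structure from generated code:
    markdown fences and '## File:' / '## Lines:' / '## Fix' headers."""
    kept = []
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            continue  # drop opening/closing markdown fences
        if re.match(r"^## (File|Lines|Fix)\b", stripped):
            continue  # drop echoed prompt section headers
        kept.append(line)
    return "\n".join(kept).strip("\n") + "\n"
```

Real code lines that merely contain `#` comments pass through untouched, since only exact fence and header patterns are dropped.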
## License
Apache-2.0 (inherited from Qwen2.5-Coder-32B-Instruct).