SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models
Paper: arXiv:2512.18542
Security-specialized code generation model fine-tuned on the SecureCode and SecureCode Web datasets.
Part of the SecureCode model collection by perfecXion.ai.
Model overview:
| Property | Value |
|---|---|
| Base Model | google/gemma-4-26b-a4b-it |
| Architecture | Gemma 4 Mixture-of-Experts (26B total, 4B active per token) |
| Method | QLoRA (4-bit NormalFloat quantization) |
| Parameters Trained | ~1-2% via LoRA adapters |
| Tier | Tier 3: Large Security Specialist |
QLoRA configuration:
| Parameter | Value |
|---|---|
| Quantization | 4-bit NormalFloat (NF4) |
| Compute Dtype | bfloat16 |
| Double Quantization | Enabled |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
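The adapter settings above map onto a `peft` `LoraConfig` roughly as follows. This is a sketch reconstructed from the table, not the published training script:

```python
from peft import LoraConfig

# Sketch only: values mirror the QLoRA configuration table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```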
Training hyperparameters:
| Parameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with 100-step warmup |
| Epochs | 3 |
| Per-device Batch Size | 2 |
| Gradient Accumulation | 8x |
| Effective Batch Size | 16 |
| Max Sequence Length | 4,096 tokens |
| Optimizer | paged_adamw_8bit |
| Precision | bf16 |
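For reference, these hyperparameters correspond to a standard `transformers` `TrainingArguments` along these lines. The output path is an assumption; the actual training script is not included in this card:

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameter table above.
training_args = TrainingArguments(
    output_dir="gemma4-26b-securecode",  # assumed output path
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size: 2 x 8 = 16
    optim="paged_adamw_8bit",
    bf16=True,
)
```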
Training hardware:
| Component | Specification |
|---|---|
| System | NVIDIA DGX Spark |
| GPU | NVIDIA GB10 |
| Memory | 128 GB Unified (CPU/GPU) |
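A back-of-envelope estimate shows why 4-bit quantization makes this fit comfortably in the 128 GB of unified memory. The figure below covers weights only and ignores quantization constants, LoRA adapters, optimizer state, KV cache, and activations, so the real footprint is higher:

```python
# Weight memory for a 26B-parameter model stored at 4 bits per parameter.
total_params = 26e9
bytes_per_param = 0.5  # 4 bits
weight_gb = total_params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of the 128 GB unified memory")  # prints "~13 GB of the 128 GB unified memory"
```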
Combined and deduplicated from two datasets:
| Dataset | Examples | Focus |
|---|---|---|
| scthornton/securecode | 2,185 | Web + AI/ML security (OWASP Top 10 2021 + LLM Top 10 2025) |
| scthornton/securecode-web | 1,378 | Web security with framework-specific patterns |
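The combine-and-deduplicate step can be sketched as hashing each example and keeping first occurrences. The `"text"` field name here is a hypothetical placeholder; the actual dataset schema and pipeline are not shown in this card:

```python
import hashlib

def dedup(datasets):
    """Merge datasets, keeping only the first copy of each unique example."""
    seen = set()
    merged = []
    for ds in datasets:
        for ex in ds:
            key = hashlib.sha256(ex["text"].encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                merged.append(ex)
    return merged

a = [{"text": "example 1"}, {"text": "example 2"}]
b = [{"text": "example 2"}, {"text": "example 3"}]
print(len(dedup([a, b])))  # → 3
```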
Vulnerability Standards: OWASP Top 10 (2021) and OWASP Top 10 for LLM Applications (2025)
Programming Languages: Python, JavaScript, Java, Go, PHP, TypeScript, C#, Ruby, Rust, Kotlin, YAML, HCL
Frameworks: 49+, including LangChain, OpenAI, Anthropic, HuggingFace, Django, Express.js, Spring Boot, and FastAPI
Training Format: 4-turn conversational examples
Every example is grounded in real CVEs and published security incidents.
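A 4-turn conversational example in chat-message form might look like the following. The turn contents and schema here are illustrative assumptions, not taken from the dataset itself:

```python
# Hypothetical 4-turn training example (user request, secure answer,
# security follow-up, hardening guidance).
example = {
    "messages": [
        {"role": "user", "content": "Write a Flask login endpoint."},
        {"role": "assistant", "content": "Here is an implementation that hashes passwords..."},
        {"role": "user", "content": "How do I harden it against brute-force attacks?"},
        {"role": "assistant", "content": "Add rate limiting and account lockout..."},
    ]
}

roles = [m["role"] for m in example["messages"]]
print(roles)  # → ['user', 'assistant', 'user', 'assistant']
```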
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load the base model with 4-bit quantization (matches training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # double quantization, as in training
)

base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26b-a4b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("scthornton/gemma4-26b-securecode")

# Attach the SecureCode LoRA adapters to the quantized base model
model = PeftModel.from_pretrained(base_model, "scthornton/gemma4-26b-securecode")

messages = [
    {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Standard code models generate functional but often insecure code. SecureCode-trained models are fine-tuned to produce secure implementations by default.

SecureCode model collection:
| Model | Parameters | Base |
|---|---|---|
| llama-3.2-3b-securecode | 3B | Llama 3.2 3B |
| codegemma-7b-securecode | 7B | CodeGemma 7B IT |
| deepseek-coder-6.7b-securecode | 6.7B | DeepSeek Coder |
| qwen-coder-7b-securecode | 7B | Qwen Coder 7B |
| codellama-13b-securecode | 13B | Code Llama 13B |
| qwen2.5-coder-14b-securecode | 14B | Qwen 2.5 Coder 14B |
| starcoder2-15b-securecode | 15B | StarCoder2 15B |
| granite-20b-code-securecode | 20B | Granite 20B Code |
| gemma4-26b-securecode | 26B (4B active) | Gemma 4 26B-A4B IT |
```bibtex
@misc{thornton2026securecode,
  title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode},
  note={arXiv:2512.18542}
}
```