
CodeT5+ Vulnerability Fixer

A code repair model that generates secure fixes for vulnerable code. Given vulnerable code, a CWE type, and a programming language, it produces a patched version.

Fine-tuned from Salesforce/codet5p-220m (220M parameters) on 7,374 vulnerable→fixed code pairs.

Quick Start

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "ayshajavd/codet5p-vuln-fixer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
model.eval()

# CWE-aware input format
code = """
def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    conn = sqlite3.connect('db.sqlite')
    return conn.execute(query).fetchone()
"""

input_text = f"fix SQL Injection vulnerability in python: {code}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=3,
    )

fixed_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fixed_code)
```

Model Details

| Property | Value |
|---|---|
| Architecture | T5ForConditionalGeneration (encoder-decoder, 8 layers each) |
| Base Model | Salesforce/codet5p-220m |
| Parameters | 222,882,048 (222M) |
| Task | Seq2Seq code repair (vulnerable → fixed) |
| Input Format | `fix <CWE_NAME> vulnerability in <language>: <code>` |
| Max Sequence Length | 512 tokens (input and output) |
| Generation | Beam search (num_beams=5) |

Evaluation Results (Test Set — 941 samples)

| Metric | Score |
|---|---|
| BLEU | 81.0 |
| ROUGE-1 | 0.802 |
| ROUGE-2 | 0.745 |
| ROUGE-L | 0.788 |
| Exact Match | 1.4% |
| Eval Loss | 0.175 |
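The gap between high BLEU and low exact match follows from how the two metrics score outputs; a minimal exact-match check (the whitespace normalization here is an assumption of this sketch, not the evaluation script) looks like:

```python
# Exact match requires the generated fix to equal the reference string
# (here after simple whitespace normalization, an assumption of this
# sketch). BLEU instead credits partial n-gram overlap, which is why
# BLEU can be high (81.0) while exact match stays low (1.4%).
def exact_match(pred: str, ref: str) -> bool:
    normalize = lambda s: " ".join(s.split())
    return normalize(pred) == normalize(ref)
```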

vs Previous Model (flan-t5-small)

| | Old (v1) | New (v2) | Improvement |
|---|---|---|---|
| Base model | flan-t5-small (60M) | CodeT5+ 220M | 3.7× larger |
| Eval loss | 0.547 | 0.175 | 3.1× lower |
| CWE-aware input | — | ✓ | Context about vulnerability type |
| BLEU evaluation | — | 81.0 | Proper code similarity metric |

Supported Languages

Python, JavaScript, Java, C, C++, PHP, Go, Ruby

The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul).

Training Details

| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 (constant schedule) |
| Effective Batch Size | 32 (8/device × 2 GPUs × 2 grad_accum) |
| Epochs | 6 (early stopped; epoch 3 best) |
| Best Epoch | 3 (eval_loss=0.1752) |
| Precision | fp16 |
| Gradient Checkpointing | Enabled |
| Early Stopping | Patience=3 |
| Optimizer | AdamW |
| Hardware | 2× NVIDIA T4 16GB (Kaggle) |
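The patience-based early stopping used above can be sketched as follows; this is a standalone illustration, not the Hugging Face Trainer's actual callback.

```python
# Illustrative sketch of early stopping with patience=3 on eval loss:
# stop once the best epoch is `patience` epochs behind, keep the best.
def best_epoch(eval_losses, patience=3):
    """Scan per-epoch eval losses and return the best epoch's index."""
    best = 0
    for i, loss in enumerate(eval_losses):
        if loss < eval_losses[best]:
            best = i
        elif i - best >= patience:
            break  # no improvement for `patience` epochs: stop early
    return best

# Losses improve through epoch 3 (index 2), then stall; training stops
# early and the epoch-3 checkpoint is kept.
losses = [0.50, 0.30, 0.175, 0.20, 0.21, 0.22]
```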

Training Recipe References

  • T5APR (arxiv:2309.15742): lr=1e-4, constant scheduler — Optuna-validated for CodeT5 code repair
  • MultiMend (arxiv:2501.16044): Same config, validated on 6 benchmarks

Training Data

Trained on the code-security-vulnerability-dataset:

  • 7,374 training samples (vulnerable code with fixes)
  • 994 validation samples
  • 941 test samples

The dataset was filtered from 175K total samples down to vulnerable samples with meaningful code fixes (longer than 10 characters).
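The filtering step described above can be sketched as below; the field names are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of the dataset filter: keep only vulnerable
# samples whose fix is meaningful (>10 characters) and actually differs
# from the input. Field names are assumptions, not the real schema.
def keep_sample(sample: dict) -> bool:
    fix = (sample.get("fixed_code") or "").strip()
    return (
        sample.get("is_vulnerable", False)
        and len(fix) > 10
        and fix != sample.get("vulnerable_code")
    )

samples = [
    {"is_vulnerable": True, "vulnerable_code": "q = f'... {u}'",
     "fixed_code": "cur.execute('... ?', (u,))"},
    {"is_vulnerable": True, "vulnerable_code": "x", "fixed_code": "x"},    # no real fix
    {"is_vulnerable": False, "vulnerable_code": "y", "fixed_code": "y2"},  # not vulnerable
]
filtered = [s for s in samples if keep_sample(s)]
```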

Input Format

The model uses a CWE-aware input format that tells it what vulnerability to fix:

`fix <Vulnerability Name> vulnerability in <language>: <vulnerable code>`

Examples:

  • `fix SQL Injection vulnerability in python: <code>`
  • `fix Buffer Overflow vulnerability in c: <code>`
  • `fix Cross-Site Scripting vulnerability in javascript: <code>`
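A small helper can assemble this input string consistently; the function name is ours, but the format mirrors the examples above.

```python
# Build the model's expected CWE-aware input string.
# `build_prompt` is a hypothetical helper, not part of the model's API.
def build_prompt(cwe_name: str, language: str, code: str) -> str:
    return f"fix {cwe_name} vulnerability in {language}: {code}"

prompt = build_prompt("SQL Injection", "python", "<code>")
```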

Limitations

  1. 512 token limit: Long functions are truncated — fix quality degrades for very long code
  2. Formatting: Generated fixes may lose original indentation/formatting
  3. Rare CWEs: Performance is lower on vulnerability types with few training examples
  4. Not a replacement: Should complement manual code review and established SAST tools
  5. Language bias: Strongest on C/C++ (largest training subset)
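One way to work around the 512-token limit (limitation 1) is to split long source files at function boundaries and fix each chunk separately. The sketch below uses a crude line-prefix heuristic for Python; in practice you would measure length with the model's tokenizer instead.

```python
# Rough sketch: split Python source into top-level chunks starting at
# 'def '/'class ' lines, so each chunk can be fixed within the token
# budget. A heuristic illustration, not part of the model's API.
def split_functions(source: str):
    chunks, current = [], []
    for line in source.splitlines():
        if line.startswith(("def ", "class ")) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

code = "def a():\n    pass\n\ndef b():\n    pass\n"
chunks = split_functions(code)
```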

Interactive Demo

Try the model in our Code Security Analyzer Space — paste any code and get vulnerability detection + fix suggestions.

Citation

```bibtex
@misc{codet5p-vuln-fixer,
  title={CodeT5+ Vulnerability Fixer: CWE-Aware Code Repair with Seq2Seq Generation},
  author={ayshajavd},
  year={2025},
  url={https://huggingface.co/ayshajavd/codet5p-vuln-fixer}
}
```