
HaiJava-Surgeon-v2 πŸ”§


A specialized code language model for Java bug fixing and code refinement.


πŸ“Š Quick Stats

| Metric      | Value                      |
|-------------|----------------------------|
| Parameters  | 7.6 billion                |
| Base Model  | Qwen2.5-Coder-7B-Instruct  |
| Task        | Java Bug Fixing            |
| BLEU Score  | 0.5876 (Good)              |
| Exact Match | 1.97%                      |
| Model Size  | ~15 GB                     |

🎯 Model Overview

HaiJava-Surgeon-v2 is a fine-tuned code language model specialized for automatically identifying and fixing bugs in Java code. Built on Qwen2.5-Coder-7B-Instruct, this model has been trained using supervised fine-tuning with LoRA adapters on the CodeXGLUE Code Refinement dataset.

Key Features

  • βœ… Specialized for Java: Fine-tuned specifically for Java bug patterns
  • βœ… Method-Level Fixes: Handles complete method-level bug fixes
  • βœ… Multiple Bug Types: Null checks, bounds checking, exception handling, logic errors
  • βœ… Consistent Performance: Stable BLEU ~0.59 across diverse samples
  • βœ… Production Ready: Merged model, no adapter loading required
  • βœ… Optimized: Supports BF16 for modern GPUs (GB10, H100, etc.)

πŸ“ˆ Evaluation Results

Evaluated on CodeXGLUE Code Refinement benchmark (3,400 samples):

Overall Performance

  • BLEU Score: 0.5876 (58.76% similarity to human fixes)
  • Exact Match: 1.97% (67/3,400)
  • Quality Grade: B+ (Good)
  • Position: Above baseline (0.50), approaching CodeBERT (0.65)

Comparison to Baselines

| Model                | BLEU   | Status           |
|----------------------|--------|------------------|
| PLBART (SOTA)        | 0.74   | State-of-the-art |
| CodeBERT             | 0.65   | Strong baseline  |
| HaiJava-Surgeon-v2   | 0.5876 | This model       |
| Transformer Baseline | 0.50   | Basic seq2seq    |

πŸ“„ See full evaluation details: EVALUATION_RESULTS.json
πŸ“‹ Complete model card: MODEL_CARD.md


πŸ—οΈ Model Details

  • Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
  • Training Method: LoRA fine-tuning + merging
  • Merge Date: 2026-01-18
  • Precision: FP16 / BF16
  • Parameters: 7.6 billion
  • Architecture: Transformer decoder (32 layers, 4K hidden, GQA)
  • Context Length: 2,048 tokens (training) / 32,768 (max)

LoRA Configuration (Stage 2 Training)

  • Rank (r): 64
  • Alpha: 16
  • Dropout: 0.05
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Training Dataset: CodeXGLUE Code Refinement (small subset, ~5,835 samples)
  • Starting Point: v1 model (merged with its LoRA adapters)
  • Training Steps: 4,000 steps
  • Purpose: Further refine bug-fixing on clearer, shorter examples
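Since the Transformers/PEFT stack is listed under Training Infrastructure, the configuration above would correspond roughly to the following PEFT `LoraConfig`. The exact training script is not published, so treat this as an illustrative sketch, not the actual code:

```python
# Sketch of the LoRA setup described above, assuming Hugging Face PEFT.
# Values mirror the list above; the real training script may differ.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # LoRA rank
    lora_alpha=16,      # scaling factor
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```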

πŸš€ Quick Start

Installation

pip install transformers torch accelerate

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Haiintel/haijava-surgeon-qwen2.5-coder-7b-sft-v2",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("Haiintel/haijava-surgeon-qwen2.5-coder-7b-sft-v2")

# Prepare buggy code
buggy_code = """
public int divide(int x, int y) {
    return x / y;  // Bug: No division by zero check
}
"""

# Create prompt
messages = [
    {"role": "system", "content": "You are a Java code fixing assistant."},
    {"role": "user", "content": f"Fix the following buggy Java code:\n\nBuggy Code:\n{buggy_code}\n\nFixed Code:\n"}
]

# Generate fix
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.1,  # ignored when do_sample=False (greedy decoding)
    do_sample=False
)

# Decode only the newly generated tokens, skipping the echoed prompt
fixed_code = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(fixed_code)

Expected Output

public int divide(int x, int y) {
    if (y == 0) {
        throw new ArithmeticException("Division by zero");
    }
    return x / y;
}

Recommended Generation Parameters

# For production (deterministic, reliable)
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.1,      # ignored when do_sample is False
    "do_sample": False,
    "repetition_penalty": 1.05,
    "use_cache": True
}

# For exploration (diverse alternatives)
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95,
    "do_sample": True,
    "num_return_sequences": 3
}
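The exploration config relies on temperature sampling: logits are divided by the temperature before the softmax, so values well below 1.0 sharpen the distribution (near-deterministic picks) while values near 1.0 keep it diverse. A minimal stdlib sketch of the mechanism, using toy logits for three candidate tokens rather than the model's real decoder:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # toy scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1) # near-deterministic, like the production config
warm = softmax_with_temperature(logits, 0.7) # more diverse, like the exploration config
print(cold[0] > warm[0])  # True: low temperature concentrates mass on the top token
```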

πŸ’‘ Example Bug Fixes

The model can handle various bug types:

1. Null Pointer Prevention

// Input (buggy)
Calendar cal = Calendar.getInstance();
cal.setTime(date);  // Bug: date might be null

// Output (fixed)
Calendar cal = null;
if (date != null) {
    cal = Calendar.getInstance();
    cal.setTime(date);
}

2. Bounds Checking

// Input (buggy)
if (index == array.length - 1) {
    return false;
}

// Output (fixed)
if (index >= array.length - 1) {  // Fixed: >= instead of ==
    return false;
}

3. Exception Handling

// Input (buggy)
file.seek(position);
file.write(value);

// Output (fixed)
try {
    file.seek(position);
    file.write(value);
} catch (Exception e) {
    e.printStackTrace();
}

πŸ“ Repository Structure

haijava-surgeon-v2/
β”œβ”€β”€ model-*.safetensors      # Model weights (4 shards, ~15GB total)
β”œβ”€β”€ config.json              # Model configuration
β”œβ”€β”€ generation_config.json   # Default generation settings
β”œβ”€β”€ tokenizer.json           # Tokenizer vocabulary
β”œβ”€β”€ tokenizer_config.json    # Tokenizer configuration
β”œβ”€β”€ README.md               # This file
β”œβ”€β”€ MODEL_CARD.md           # Comprehensive model documentation
└── EVALUATION_RESULTS.json # Detailed evaluation metrics

πŸ“š Documentation

  • MODEL_CARD.md - Complete model documentation

    • Training details
    • Evaluation methodology
    • Limitations and ethical considerations
    • Usage examples
    • Citation information
  • EVALUATION_RESULTS.json - Detailed evaluation metrics

    • Performance statistics
    • Checkpoint progression
    • Baseline comparisons
    • Qualitative analysis

🎯 Use Cases

βœ… Recommended Use

  • Developer assistance during code writing
  • Bug fix suggestions in code review
  • Educational tool for learning Java patterns
  • Research in automated program repair
  • Rapid prototyping of bug fixes

⚠️ Not Recommended

  • Fully autonomous code fixing without review
  • Security-critical bug fixes (always verify manually)
  • Production deployment without testing
  • Large-scale refactoring

βš™οΈ System Requirements

Minimum

  • GPU: 16GB VRAM (e.g., V100, RTX 4090)
  • RAM: 32GB
  • Storage: 20GB

Recommended

  • GPU: 24GB+ VRAM (e.g., A100, H100, GB10)
  • RAM: 64GB
  • Storage: 50GB

Performance

  • Inference Speed: ~0.5-2 seconds per fix (A100)
  • Batch Processing: ~3-5 seconds for 8 samples
  • Evaluation: ~12 seconds/sample (GB10, during benchmark)

πŸ”„ Differences from Base Model

This is a merged model - LoRA adapters have been fully integrated into the base weights.

Advantages

  • βœ… No adapter loading - Direct inference, simpler code
  • βœ… Faster inference - No adapter overhead during generation
  • βœ… Easier deployment - Single model, no dependency on PEFT
  • βœ… Better compatibility - Works with standard transformers

Trade-offs

  • ❌ Larger size - ~15 GB vs ~0.5 GB for adapters only
  • ❌ Less flexible - Cannot easily switch or update adapters

⚠️ Limitations

  1. Scope: Method-level fixes; may struggle with multi-file changes
  2. Exact Match: Low (1.97%) - multiple valid fixes exist for most bugs
  3. No Verification: Doesn't execute code; always test generated fixes
  4. Context Limit: 2,048 tokens during training (32K max in base model)
  5. Java-Specific: Optimized for Java; other languages not tested

Always review and test AI-generated code before deployment!


πŸ“Š Evaluation Details

Metrics Explained

  • BLEU (0.5876): Measures text similarity to human fixes

    • 0.5876 = 58.76% n-gram overlap with reference solutions
    • Good performance (0.5-0.7 range for code generation)
    • Above baseline (0.50), approaching strong baselines (0.65)
  • Exact Match (1.97%): Character-perfect matches

    • Low but normal for code generation
    • Many valid ways to fix the same bug
    • BLEU is more informative than EM
  • Consistency: Stable across all 3,400 evaluated samples

    • No performance degradation over time
    • Reliable quality
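For intuition, exact match and clipped n-gram precision (the core ingredient of BLEU, minus the brevity penalty and smoothing) can be sketched in a few lines of stdlib Python. For real scoring, use `sacrebleu` or the CodeXGLUE evaluation script; the helpers below are illustrative only:

```python
from collections import Counter

def exact_match(predictions, references):
    """Fraction of predictions that are character-identical to their reference."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def ngram_precision(prediction, reference, n=2):
    """Clipped n-gram precision on whitespace tokens (core of BLEU, simplified)."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())     # counts clipped to the reference
    return overlap / max(sum(pred.values()), 1)

preds = ["if ( y == 0 ) throw e ;", "return x / y ;"]
refs  = ["if ( y == 0 ) throw e ;", "return x * y ;"]
print(exact_match(preds, refs))                       # 0.5
print(ngram_precision("return x / y ;", "return x * y ;"))  # 0.5
```

This also illustrates why exact match is so much lower than BLEU here: the second prediction differs from its reference by one token, scoring zero on exact match while still sharing half of its bigrams.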

What We Measured

βœ… Text similarity (BLEU, Exact Match)
⚠️ Syntax validity (to be added)
⚠️ Functional correctness (requires Defects4J)
⚠️ Test pass rate (future work)
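Until syntax validation is wired into the pipeline, a crude stdlib heuristic can flag obviously malformed generations. The delimiter-balance check below is only a sketch (it ignores string literals and comments); a real check would invoke `javac` or a proper Java parser:

```python
def roughly_balanced(java_source):
    """Crude sanity check: braces/brackets/parens pair up. Not a real parser."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in java_source:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack  # balanced only if nothing is left open

print(roughly_balanced("public int f(int x) { return x; }"))  # True
print(roughly_balanced("public int f(int x) { return x; "))   # False
```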

See PROPER_BUG_FIXING_EVALUATION.md for methodology discussion.


πŸ› οΈ Training Infrastructure

  • Hardware: NVIDIA GB10 (128GB VRAM)
  • Framework: PyTorch 2.10, Transformers, PEFT
  • Training Time: ~10-12 hours
  • Evaluation Time: ~11 hours (3,400 samples)
  • Docker: nvcr.io/nvidia/pytorch:25.11-py3

πŸ“– Citation

If you use this model in your research, please cite:

@software{haijava_surgeon_v2_2026,
  title={HaiJava-Surgeon-v2: A Fine-tuned Code LLM for Java Bug Fixing},
  author={HaiIntel Research},
  year={2026},
  version={2.0},
  url={https://github.com/yourusername/hai-java-surgeon}
}

πŸ“„ License

Apache 2.0 License (inherited from Qwen2.5-Coder)

  • βœ… Commercial use allowed
  • βœ… Modification allowed
  • βœ… Distribution allowed
  • ⚠️ No warranty provided

See LICENSE for full terms.


πŸ™ Acknowledgments

  • Qwen Team - Base model (Qwen2.5-Coder-7B-Instruct)
  • Microsoft Research - CodeXGLUE dataset
  • HuggingFace - Transformers and PEFT libraries
  • NVIDIA - GPU infrastructure

πŸ“ž Contact & Support


Model Version: 2.0
Last Updated: 2026-01-19
Status: βœ… Evaluated (3,400 samples)
BLEU: 0.5876 (Good)
