
HaiJava-Surgeon-v2 πŸ”§


A specialized code language model for Java bug fixing and code refinement.


πŸ“Š Quick Stats

| Metric      | Value                      |
|-------------|----------------------------|
| Parameters  | 7.6 billion                |
| Base Model  | Qwen2.5-Coder-7B-Instruct  |
| Task        | Java Bug Fixing            |
| BLEU Score  | 0.5876 (Good)              |
| Exact Match | 1.97%                      |
| Model Size  | ~15 GB                     |

🎯 Model Overview

HaiJava-Surgeon-v2 is a fine-tuned code language model specialized for automatically identifying and fixing bugs in Java code. Built on Qwen2.5-Coder-7B-Instruct, this model has been trained using supervised fine-tuning with LoRA adapters on the CodeXGLUE Code Refinement dataset.

Key Features

  • βœ… Specialized for Java: Fine-tuned specifically for Java bug patterns
  • βœ… Method-Level Fixes: Handles complete method-level bug fixes
  • βœ… Multiple Bug Types: Null checks, bounds checking, exception handling, logic errors
  • βœ… Consistent Performance: Stable BLEU ~0.59 across diverse samples
  • βœ… Production Ready: Merged model, no adapter loading required
  • βœ… Optimized: Supports BF16 for modern GPUs (GB10, H100, etc.)

πŸ“ˆ Evaluation Results

Evaluated on CodeXGLUE Code Refinement benchmark (3,400 samples):

Overall Performance

  • BLEU Score: 0.5876 (58.76% similarity to human fixes)
  • Exact Match: 1.97% (67/3,400)
  • Quality Grade: B+ (Good)
  • Position: Above baseline (0.50), approaching CodeBERT (0.65)

Comparison to Baselines

| Model                | BLEU   | Status           |
|----------------------|--------|------------------|
| PLBART (SOTA)        | 0.74   | State-of-the-art |
| CodeBERT             | 0.65   | Strong baseline  |
| HaiJava-Surgeon-v2   | 0.5876 | This model       |
| Transformer Baseline | 0.50   | Basic seq2seq    |

πŸ“„ See full evaluation details: EVALUATION_RESULTS.json
πŸ“‹ Complete model card: MODEL_CARD.md


πŸ—οΈ Model Details

  • Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
  • Training Method: LoRA fine-tuning + merging
  • Merge Date: 2026-01-18
  • Precision: FP16 / BF16
  • Parameters: 7.6 billion
  • Architecture: Transformer decoder (32 layers, 4K hidden, GQA)
  • Context Length: 2,048 tokens (training) / 32,768 (max)

LoRA Configuration (Stage 2 Training)

  • Rank (r): 64
  • Alpha: 16
  • Dropout: 0.05
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Training Dataset: CodeXGLUE Code Refinement (small subset, ~5,835 samples)
  • Starting Point: v1 model (merged with its LoRA adapters)
  • Training Steps: 4,000 steps
  • Purpose: Further refine bug-fixing on clearer, shorter examples
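Since the Transformers/PEFT stack is listed under Training Infrastructure, the configuration above would correspond roughly to the following PEFT `LoraConfig`. The exact training script is not published, so treat this as an illustrative sketch, not the actual code:

```python
# Sketch of the LoRA setup described above, assuming Hugging Face PEFT.
# Values mirror the list above; the real training script may differ.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # LoRA rank
    lora_alpha=16,      # scaling factor
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```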

πŸš€ Quick Start

Installation

pip install transformers torch accelerate

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Haiintel/haijava-surgeon-qwen2.5-coder-7b-sft-v2",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("Haiintel/haijava-surgeon-qwen2.5-coder-7b-sft-v2")

# Prepare buggy code
buggy_code = """
public int divide(int x, int y) {
    return x / y;  // Bug: No division by zero check
}
"""

# Create prompt
messages = [
    {"role": "system", "content": "You are a Java code fixing assistant."},
    {"role": "user", "content": f"Fix the following buggy Java code:\n\nBuggy Code:\n{buggy_code}\n\nFixed Code:\n"}
]

# Generate fix
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.1,  # ignored when do_sample=False (greedy decoding)
    do_sample=False
)

# Decode only the newly generated tokens, skipping the echoed prompt
fixed_code = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(fixed_code)

Expected Output

public int divide(int x, int y) {
    if (y == 0) {
        throw new ArithmeticException("Division by zero");
    }
    return x / y;
}

Recommended Generation Parameters

# For production (deterministic, reliable)
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.1,      # ignored when do_sample is False
    "do_sample": False,
    "repetition_penalty": 1.05,
    "use_cache": True
}

# For exploration (diverse alternatives)
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95,
    "do_sample": True,
    "num_return_sequences": 3
}
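The exploration config relies on temperature sampling: logits are divided by the temperature before the softmax, so values well below 1.0 sharpen the distribution (near-deterministic picks) while values near 1.0 keep it diverse. A minimal stdlib sketch of the mechanism, using toy logits for three candidate tokens rather than the model's real decoder:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # toy scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1) # near-deterministic, like the production config
warm = softmax_with_temperature(logits, 0.7) # more diverse, like the exploration config
print(cold[0] > warm[0])  # True: low temperature concentrates mass on the top token
```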

πŸ’‘ Example Bug Fixes

The model can handle various bug types:

1. Null Pointer Prevention

// Input (buggy)
Calendar cal = Calendar.getInstance();
cal.setTime(date);  // Bug: date might be null

// Output (fixed)
Calendar cal = null;
if (date != null) {
    cal = Calendar.getInstance();
    cal.setTime(date);
}

2. Bounds Checking

// Input (buggy)
if (index == array.length - 1) {
    return false;
}

// Output (fixed)
if (index >= array.length - 1) {  // Fixed: >= instead of ==
    return false;
}

3. Exception Handling

// Input (buggy)
file.seek(position);
file.write(value);

// Output (fixed)
try {
    file.seek(position);
    file.write(value);
} catch (Exception e) {
    e.printStackTrace();
}

πŸ“ Repository Structure

haijava-surgeon-v2/
β”œβ”€β”€ model-*.safetensors      # Model weights (4 shards, ~15GB total)
β”œβ”€β”€ config.json              # Model configuration
β”œβ”€β”€ generation_config.json   # Default generation settings
β”œβ”€β”€ tokenizer.json           # Tokenizer vocabulary
β”œβ”€β”€ tokenizer_config.json    # Tokenizer configuration
β”œβ”€β”€ README.md               # This file
β”œβ”€β”€ MODEL_CARD.md           # Comprehensive model documentation
└── EVALUATION_RESULTS.json # Detailed evaluation metrics

πŸ“š Documentation

  • MODEL_CARD.md - Complete model documentation

    • Training details
    • Evaluation methodology
    • Limitations and ethical considerations
    • Usage examples
    • Citation information
  • EVALUATION_RESULTS.json - Detailed evaluation metrics

    • Performance statistics
    • Checkpoint progression
    • Baseline comparisons
    • Qualitative analysis

🎯 Use Cases

βœ… Recommended Use

  • Developer assistance during code writing
  • Bug fix suggestions in code review
  • Educational tool for learning Java patterns
  • Research in automated program repair
  • Rapid prototyping of bug fixes

⚠️ Not Recommended

  • Fully autonomous code fixing without review
  • Security-critical bug fixes (always verify manually)
  • Production deployment without testing
  • Large-scale refactoring

βš™οΈ System Requirements

Minimum

  • GPU: 16GB VRAM (e.g., V100, RTX 4090)
  • RAM: 32GB
  • Storage: 20GB

Recommended

  • GPU: 24GB+ VRAM (e.g., A100, H100, GB10)
  • RAM: 64GB
  • Storage: 50GB

Performance

  • Inference Speed: ~0.5-2 seconds per fix (A100)
  • Batch Processing: ~3-5 seconds for 8 samples
  • Evaluation: ~12 seconds/sample (GB10, during benchmark)

πŸ”„ Differences from Base Model

This is a merged model - LoRA adapters have been fully integrated into the base weights.

Advantages

  • βœ… No adapter loading - Direct inference, simpler code
  • βœ… Faster inference - No adapter overhead during generation
  • βœ… Easier deployment - Single model, no dependency on PEFT
  • βœ… Better compatibility - Works with standard transformers

Trade-offs

  • ❌ Larger size - ~15 GB vs ~0.5 GB for adapters only
  • ❌ Less flexible - Cannot easily switch or update adapters

⚠️ Limitations

  1. Scope: Method-level fixes; may struggle with multi-file changes
  2. Exact Match: Low (1.97%) - multiple valid fixes exist for most bugs
  3. No Verification: Doesn't execute code; always test generated fixes
  4. Context Limit: 2,048 tokens during training (32K max in base model)
  5. Java-Specific: Optimized for Java; other languages not tested

Always review and test AI-generated code before deployment!


πŸ“Š Evaluation Details

Metrics Explained

  • BLEU (0.5876): Measures text similarity to human fixes

    • 0.5876 = 58.76% n-gram overlap with reference solutions
    • Good performance (0.5-0.7 range for code generation)
    • Above baseline (0.50), approaching strong baselines (0.65)
  • Exact Match (1.97%): Character-perfect matches

    • Low but normal for code generation
    • Many valid ways to fix the same bug
    • BLEU is more informative than EM
  • Consistency: Stable across all 3,400 evaluated samples

    • No performance degradation over time
    • Reliable quality
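For intuition, exact match and clipped n-gram precision (the core ingredient of BLEU, minus the brevity penalty and smoothing) can be sketched in a few lines of stdlib Python. For real scoring, use `sacrebleu` or the CodeXGLUE evaluation script; the helpers below are illustrative only:

```python
from collections import Counter

def exact_match(predictions, references):
    """Fraction of predictions that are character-identical to their reference."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def ngram_precision(prediction, reference, n=2):
    """Clipped n-gram precision on whitespace tokens (core of BLEU, simplified)."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())     # counts clipped to the reference
    return overlap / max(sum(pred.values()), 1)

preds = ["if ( y == 0 ) throw e ;", "return x / y ;"]
refs  = ["if ( y == 0 ) throw e ;", "return x * y ;"]
print(exact_match(preds, refs))                       # 0.5
print(ngram_precision("return x / y ;", "return x * y ;"))  # 0.5
```

This also illustrates why exact match is so much lower than BLEU here: the second prediction differs from its reference by one token, scoring zero on exact match while still sharing half of its bigrams.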

What We Measured

βœ… Text similarity (BLEU, Exact Match)
⚠️ Syntax validity (to be added)
⚠️ Functional correctness (requires Defects4J)
⚠️ Test pass rate (future work)
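Until syntax validation is wired into the pipeline, a crude stdlib heuristic can flag obviously malformed generations. The delimiter-balance check below is only a sketch (it ignores string literals and comments); a real check would invoke `javac` or a proper Java parser:

```python
def roughly_balanced(java_source):
    """Crude sanity check: braces/brackets/parens pair up. Not a real parser."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in java_source:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack  # balanced only if nothing is left open

print(roughly_balanced("public int f(int x) { return x; }"))  # True
print(roughly_balanced("public int f(int x) { return x; "))   # False
```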

See PROPER_BUG_FIXING_EVALUATION.md for methodology discussion.


πŸ› οΈ Training Infrastructure

  • Hardware: NVIDIA GB10 (128GB VRAM)
  • Framework: PyTorch 2.10, Transformers, PEFT
  • Training Time: ~10-12 hours
  • Evaluation Time: ~11 hours (3,400 samples)
  • Docker: nvcr.io/nvidia/pytorch:25.11-py3

πŸ“– Citation

If you use this model in your research, please cite:

@software{haijava_surgeon_v2_2026,
  title={HaiJava-Surgeon-v2: A Fine-tuned Code LLM for Java Bug Fixing},
  author={HaiIntel Research},
  year={2026},
  version={2.0},
  url={https://github.com/yourusername/hai-java-surgeon}
}

πŸ“„ License

Apache 2.0 License (inherited from Qwen2.5-Coder)

  • βœ… Commercial use allowed
  • βœ… Modification allowed
  • βœ… Distribution allowed
  • ⚠️ No warranty provided

See LICENSE for full terms.


πŸ™ Acknowledgments

  • Qwen Team - Base model (Qwen2.5-Coder-7B-Instruct)
  • Microsoft Research - CodeXGLUE dataset
  • HuggingFace - Transformers and PEFT libraries
  • NVIDIA - GPU infrastructure

πŸ“ž Contact & Support


Model Version: 2.0
Last Updated: 2026-01-19
Status: βœ… Evaluated (3,400 samples)
BLEU: 0.5876 (Good)
