# HaiJava-Surgeon-v2
A specialized code language model for Java bug fixing and code refinement.
## Quick Stats
| Metric | Value |
|---|---|
| Parameters | 7.6 billion |
| Base Model | Qwen2.5-Coder-7B-Instruct |
| Task | Java Bug Fixing |
| BLEU Score | 0.5876 (Good) |
| Exact Match | 1.97% |
| Model Size | ~15 GB |
## Model Overview
HaiJava-Surgeon-v2 is a fine-tuned code language model specialized for automatically identifying and fixing bugs in Java code. Built on Qwen2.5-Coder-7B-Instruct, this model has been trained using supervised fine-tuning with LoRA adapters on the CodeXGLUE Code Refinement dataset.
### Key Features

- ✅ **Specialized for Java**: Fine-tuned specifically for Java bug patterns
- ✅ **Method-Level Fixes**: Handles complete method-level bug fixes
- ✅ **Multiple Bug Types**: Null checks, bounds checking, exception handling, logic errors
- ✅ **Consistent Performance**: Stable BLEU ~0.59 across diverse samples
- ✅ **Production Ready**: Merged model, no adapter loading required
- ✅ **Optimized**: Supports BF16 on modern GPUs (GB10, H100, etc.)
## Evaluation Results

Evaluated on the CodeXGLUE Code Refinement benchmark (3,400 samples).

### Overall Performance

- **BLEU Score**: 0.5876 (58.76% similarity to human fixes)
- **Exact Match**: 1.97% (67/3,400)
- **Quality Grade**: B+ (Good)
- **Position**: Above the Transformer baseline (0.50), approaching CodeBERT (0.65)
### Comparison to Baselines
| Model | BLEU | Status |
|---|---|---|
| PLBART (SOTA) | 0.74 | State-of-the-art |
| CodeBERT | 0.65 | Strong baseline |
| HaiJava-Surgeon-v2 | 0.5876 | This model |
| Transformer Baseline | 0.50 | Basic seq2seq |
See `EVALUATION_RESULTS.json` for full evaluation details and `MODEL_CARD.md` for the complete model card.
## Model Details

- **Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
- **Training Method**: LoRA fine-tuning followed by adapter merging
- **Merge Date**: 2026-01-18
- **Precision**: FP16 / BF16
- **Parameters**: 7.6 billion
- **Architecture**: Transformer decoder (28 layers, 3,584 hidden size, GQA)
- **Context Length**: 2,048 tokens (training) / 32,768 tokens (base model maximum)
### LoRA Configuration (Stage 2 Training)

- **Rank (r)**: 64
- **Alpha**: 16
- **Dropout**: 0.05
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Training Dataset**: CodeXGLUE Code Refinement (small subset, ~5,835 samples)
- **Starting Point**: the v1 model (with its LoRA adapters already merged)
- **Training Steps**: 4,000
- **Purpose**: Further refine bug fixing on clearer, shorter examples
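For reference, these hyperparameters map onto the `peft` library's `LoraConfig` roughly as follows. This is a sketch, not the exact training script; in particular, `task_type="CAUSAL_LM"` is an assumption for a causal-LM fine-tune:

```python
from peft import LoraConfig  # pip install peft

# Stage 2 LoRA hyperparameters as listed above.
# task_type is an assumption (standard for causal-LM fine-tuning).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

With a config like this, `peft.get_peft_model(base_model, lora_config)` wraps the base model so that only the adapter weights are trained.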
## Quick Start

### Installation

```bash
pip install transformers torch accelerate
```
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "haijava-surgeon-v2",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("haijava-surgeon-v2")

# Prepare buggy code
buggy_code = """
public int divide(int x, int y) {
    return x / y; // Bug: No division by zero check
}
"""

# Create prompt
messages = [
    {"role": "system", "content": "You are a Java code fixing assistant."},
    {"role": "user", "content": f"Fix the following buggy Java code:\n\nBuggy Code:\n{buggy_code}\n\nFixed Code:\n"}
]

# Generate fix
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False  # greedy decoding; temperature has no effect without sampling
)
fixed_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fixed_code)
```
### Expected Output

```java
public int divide(int x, int y) {
    if (y == 0) {
        throw new ArithmeticException("Division by zero");
    }
    return x / y;
}
```
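Note that `tokenizer.decode(outputs[0], ...)` returns the full transcript, prompt included. A minimal post-processing sketch (the `extract_fix` helper and the `"Fixed Code:"` marker are assumptions based on the prompt format shown above):

```python
def extract_fix(decoded: str, marker: str = "Fixed Code:") -> str:
    """Return the text after the last occurrence of `marker`;
    fall back to the whole string if the marker is absent."""
    idx = decoded.rfind(marker)
    if idx == -1:
        return decoded.strip()
    return decoded[idx + len(marker):].strip()

decoded = "Fix the following buggy Java code:\n...\nFixed Code:\npublic int divide(int x, int y) { ... }"
print(extract_fix(decoded))  # -> public int divide(int x, int y) { ... }
```

A more robust alternative is to slice the generated token IDs before decoding, e.g. `tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)`, which keeps only the newly generated tokens.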
### Recommended Generation Parameters

```python
# For production (deterministic, reliable)
generation_config = {
    "max_new_tokens": 512,
    "do_sample": False,          # greedy decoding; temperature has no effect here
    "repetition_penalty": 1.05,
    "use_cache": True
}

# For exploration (diverse alternatives)
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95,
    "do_sample": True,
    "num_return_sequences": 3
}
```
## Example Bug Fixes

The model handles a range of common bug types:

### 1. Null Pointer Prevention

```java
// Input (buggy)
Calendar cal = Calendar.getInstance();
cal.setTime(date); // Bug: date might be null

// Output (fixed)
Calendar cal = null;
if (date != null) {
    cal = Calendar.getInstance();
    cal.setTime(date);
}
```

### 2. Bounds Checking

```java
// Input (buggy)
if (index == array.length - 1) {
    return false;
}

// Output (fixed)
if (index >= array.length - 1) { // Fixed: >= instead of ==
    return false;
}
```

### 3. Exception Handling

```java
// Input (buggy)
file.seek(position);
file.write(value);

// Output (fixed)
try {
    file.seek(position);
    file.write(value);
} catch (Exception e) {
    e.printStackTrace();
}
```
## Repository Structure

```
haijava-surgeon-v2/
├── model-*.safetensors      # Model weights (4 shards, ~15 GB total)
├── config.json              # Model configuration
├── generation_config.json   # Default generation settings
├── tokenizer.json           # Tokenizer vocabulary
├── tokenizer_config.json    # Tokenizer configuration
├── README.md                # This file
├── MODEL_CARD.md            # Comprehensive model documentation
└── EVALUATION_RESULTS.json  # Detailed evaluation metrics
```
## Documentation

**MODEL_CARD.md** - Complete model documentation:
- Training details
- Evaluation methodology
- Limitations and ethical considerations
- Usage examples
- Citation information

**EVALUATION_RESULTS.json** - Detailed evaluation metrics:
- Performance statistics
- Checkpoint progression
- Baseline comparisons
- Qualitative analysis
## Use Cases

### ✅ Recommended

- Developer assistance during code writing
- Bug fix suggestions in code review
- Educational tool for learning Java patterns
- Research in automated program repair
- Rapid prototyping of bug fixes

### ⚠️ Not Recommended

- Fully autonomous code fixing without review
- Security-critical bug fixes (always verify manually)
- Production deployment without testing
- Large-scale refactoring
## System Requirements

### Minimum

- **GPU**: 16 GB VRAM (e.g., V100, RTX 4090)
- **RAM**: 32 GB
- **Storage**: 20 GB

### Recommended

- **GPU**: 24 GB+ VRAM (e.g., A100, H100, GB10)
- **RAM**: 64 GB
- **Storage**: 50 GB

### Performance

- **Inference Speed**: ~0.5-2 seconds per fix (A100)
- **Batch Processing**: ~3-5 seconds for 8 samples
- **Evaluation**: ~12 seconds/sample (GB10, during benchmarking)
## Differences from the Base Model

This is a **merged model**: the LoRA adapters have been fully integrated into the base weights.

### Advantages

- ✅ **No adapter loading**: direct inference, simpler code
- ✅ **Faster inference**: no adapter overhead during generation
- ✅ **Easier deployment**: a single model, no dependency on PEFT
- ✅ **Better compatibility**: works with standard `transformers`

### Trade-offs

- ❌ **Larger size**: ~15 GB vs. ~0.5 GB for the adapters alone
- ❌ **Less flexible**: adapters cannot easily be swapped or updated
## Limitations

- **Scope**: Method-level fixes; may struggle with multi-file changes
- **Exact Match**: Low (1.97%); multiple valid fixes exist for most bugs
- **No Verification**: The model does not execute code; always test generated fixes
- **Context Limit**: 2,048 tokens during training (32K maximum in the base model)
- **Java-Specific**: Optimized for Java; other languages are untested

**Always review and test AI-generated code before deployment!**
## Evaluation Details

### Metrics Explained

**BLEU (0.5876)**: Measures text similarity to the human-written fixes
- 0.5876 = 58.76% n-gram overlap with reference solutions
- Good performance (0.5-0.7 is a typical range for code generation)
- Above the Transformer baseline (0.50), approaching strong baselines (0.65)

**Exact Match (1.97%)**: Character-perfect matches
- Low, but normal for code generation
- There are many valid ways to fix the same bug
- BLEU is therefore more informative than EM here

**Consistency**: Stable across all 3,400 evaluated samples
- No performance degradation over the run
- Reliable quality
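To make these two metrics concrete, here is a self-contained sketch of exact match and a simplified BLEU (clipped n-gram precision up to 4-grams with add-one smoothing and a brevity penalty). The benchmark's official scorer differs in its smoothing details, so treat this as an illustration only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified BLEU: smoothed, clipped n-gram precision times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_c, hyp_c = ngrams(ref, n), ngrams(hyp, n)
        overlap = sum((hyp_c & ref_c).values())      # clipped matches
        total = max(sum(hyp_c.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))       # brevity penalty
    return bp * math.exp(log_prec / max_n)

def exact_match(reference: str, hypothesis: str) -> bool:
    """Character-perfect match after trimming surrounding whitespace."""
    return reference.strip() == hypothesis.strip()

ref = "if (y == 0) throw new ArithmeticException();"
hyp = "if (y == 0) { throw new ArithmeticException(); }"
print(exact_match(ref, hyp))             # False: not character-identical
print(0 < sentence_bleu(ref, hyp) < 1)   # True: partial n-gram overlap
```

This illustrates why exact match is so low: the hypothesis above is a perfectly valid fix and scores reasonably on n-gram overlap, yet fails the character-identical test.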
### What We Measured

- ✅ Text similarity (BLEU, Exact Match)
- ⚠️ Syntax validity (to be added)
- ⚠️ Functional correctness (requires Defects4J)
- ⚠️ Test pass rate (future work)

See PROPER_BUG_FIXING_EVALUATION.md for a discussion of the methodology.
## Training Infrastructure

- **Hardware**: NVIDIA GB10 (128 GB VRAM)
- **Framework**: PyTorch 2.10, Transformers, PEFT
- **Training Time**: ~10-12 hours
- **Evaluation Time**: ~11 hours (3,400 samples)
- **Docker**: `nvcr.io/nvidia/pytorch:25.11-py3`
## Citation

If you use this model in your research, please cite:

```bibtex
@software{haijava_surgeon_v2_2026,
  title={HaiJava-Surgeon-v2: A Fine-tuned Code LLM for Java Bug Fixing},
  author={HaiIntel Research},
  year={2026},
  version={2.0},
  url={https://github.com/yourusername/hai-java-surgeon}
}
```
## License

Apache 2.0 License (inherited from Qwen2.5-Coder)

- ✅ Commercial use allowed
- ✅ Modification allowed
- ✅ Distribution allowed
- ⚠️ No warranty provided

See LICENSE for full terms.
## Acknowledgments

- **Qwen Team**: base model (Qwen2.5-Coder-7B-Instruct)
- **Microsoft Research**: CodeXGLUE dataset
- **Hugging Face**: Transformers and PEFT libraries
- **NVIDIA**: GPU infrastructure
## Contact & Support

- **Issues**: GitHub Issues
- **Documentation**: See MODEL_CARD.md
- **Evaluation Guides**: See ../PROPER_BUG_FIXING_EVALUATION.md

---

**Model Version**: 2.0
**Last Updated**: 2026-01-19
**Status**: ✅ Evaluated (3,400 samples), BLEU 0.5876 (Good)