# HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
- Model Name: HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
- Model Type: Supervised Fine-Tuned (SFT), merged LoRA + base model
- Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
- Fine-tuning: checkpoint-1000 (1,000 training steps on Java bug-fixing)
- Version: v1.0
- Release Date: 2026-01-02
- Status: Ready for Production / Further Training
## Model Performance
This model is the result of merging checkpoint-1000 (LoRA adapter) into the base Qwen2.5-Coder-7B-Instruct model.
### MultiPL-E Java Benchmark Results
| Model | Pass@1 | Passed | Total | Improvement |
|---|---|---|---|---|
| Base Model (Qwen2.5-Coder-7B-Instruct) | 67.72% | 107 | 158 | Baseline |
| This Model (Fine-Tuned) | 82.28% | 130 | 158 | +14.56 pp |
**Key Achievements:**
- +23 problems solved compared to the base model (130 vs. 107)
- 27 problems where the fine-tuned model passes but the base model fails
- 103 problems where both models pass
**Benchmark Details:**
- Dataset: MultiPL-E Java (158 programming problems translated from HumanEval)
- Evaluation Date: 2026-01-08
- Temperature: 0.0 (deterministic)
- Max Tokens: 1024
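At temperature 0.0 the benchmark generates a single greedy completion per problem, so pass@1 reduces to the fraction of problems whose completion passes all tests. A minimal sketch of that arithmetic (the helper name is illustrative, not part of the evaluation harness):

```python
def pass_at_1(passed: int, total: int) -> float:
    """Greedy (temperature 0.0) pass@1: fraction of problems solved in one attempt."""
    return passed / total

# Figures from the table above
base = pass_at_1(107, 158)  # base model
sft = pass_at_1(130, 158)   # fine-tuned model
print(f"{base:.2%} -> {sft:.2%} (+{(sft - base) * 100:.2f} pp)")
# 67.72% -> 82.28% (+14.56 pp)
```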
### Internal Evaluation Results (50-sample test set)
| Metric | Base Model | This Model (Merged) | Improvement |
|---|---|---|---|
| Overall Accuracy | 9/50 (18%) | 14/50 (28%) | +55.6% |
| Syntax Errors | 6/10 (60%) | 9/10 (90%) | +50% |
| Logic Bugs | 3/10 (30%) | 4/10 (40%) | +33% |
| API Misuse | 0/10 (0%) | 0/10 (0%) | No change |
| Edge Cases | 0/10 (0%) | 0/10 (0%) | No change |
| OOD JavaScript | 0/2 (0%) | 1/2 (50%) | +50 pp |
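Note that the Improvement column reports gains relative to the base model's count (e.g. 9/50 → 14/50 is (14 − 9)/9 ≈ +55.6%), unlike the MultiPL-E table above, which reports the absolute change in pass@1. A quick check, with an illustrative helper:

```python
def rel_improvement(base: int, tuned: int) -> float:
    """Relative gain over the base model's count, in percent."""
    return (tuned - base) / base * 100

print(f"{rel_improvement(9, 14):.1f}%")  # overall accuracy row -> 55.6%
print(f"{rel_improvement(6, 9):.1f}%")   # syntax errors row    -> 50.0%
```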
**Statistical Significance:** p = 0.0238 (significant at α = 0.05)
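The MultiPL-E per-problem outcomes above also admit a significance check: 27 problems pass only under the fine-tuned model, 103 pass under both, so 107 − 103 = 4 pass only under the base model. An exact McNemar test on those discordant pairs can be sketched with the stdlib (the reported p = 0.0238 comes from the separate 50-sample evaluation, whose test procedure is not shown here):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pairs.
    b = cases only model B solves, c = cases only model C solves."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 27 SFT-only passes vs. 4 base-only passes (derived from the benchmark table)
p = mcnemar_exact(27, 4)
print(f"p = {p:.2e}")  # well below 0.05
```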
## Use Cases

### 1. Further Training

Use this merged model as the base for continued fine-tuning:
```yaml
# LLaMA-Factory training config
model_name_or_path: ./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
finetuning_type: lora  # can apply a new LoRA on top
lora_target: q_proj,v_proj
```
Benefits:
- Start from improved baseline (28% accuracy vs 18%)
- No adapter overhead during training
- Can apply new LoRA adapters for specialized tasks
### 2. Direct Inference

Use for production inference without adapter loading:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
# No adapter loading needed -- the LoRA weights are already merged
```
Benefits:
- Faster loading (no adapter merge at runtime)
- Simpler deployment (single model, no adapter files)
- Same performance as base + adapter
### 3. Production Deployment

Deploy directly to production environments:
```bash
# Copy to the deployment server
scp -r HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1 user@server:/models/

# Use in production
python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```
## Model Files
| File | Size | Description |
|---|---|---|
| `model-00001-of-00004.safetensors` | ~3.5 GB | Model weights (shard 1) |
| `model-00002-of-00004.safetensors` | ~3.5 GB | Model weights (shard 2) |
| `model-00003-of-00004.safetensors` | ~3.5 GB | Model weights (shard 3) |
| `model-00004-of-00004.safetensors` | ~3.5 GB | Model weights (shard 4) |
| `config.json` | ~1 KB | Model configuration |
| `tokenizer.json` | ~7 MB | Tokenizer vocabulary |
| `generation_config.json` | ~1 KB | Generation parameters |

**Total Size:** ~14 GB
## Training Details

### Original LoRA Training (checkpoint-1000)
- Training Steps: 1000
- LoRA Rank (r): 16
- LoRA Alpha: 32
- Target Modules: q_proj, v_proj
- Dropout: 0.05
- Training Data: Java bug-fixing samples
### Merge Process
- Method: `merge_and_unload()` from the PEFT library
- Precision: float16
- Merge Date: 2026-01-02
- Verification: Passed (model loads successfully)
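Conceptually, `merge_and_unload()` folds each adapter into its frozen target weight as W ← W + (α/r)·B·A, after which the adapter modules are discarded. A toy numeric sketch of that update (pure Python with r = 1 for readability; the real merge uses r = 16, α = 32, which yields the same scaling factor α/r = 2):

```python
def merge_lora(W, B, A, alpha, r):
    """Return W + (alpha / r) * B @ A for list-of-lists matrices."""
    scale = alpha / r
    rows, cols, inner = len(B), len(A[0]), len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (d x d)
B = [[1.0], [0.0]]            # LoRA up-projection (d x r)
A = [[0.5, 0.5]]              # LoRA down-projection (r x d)
print(merge_lora(W, B, A, alpha=2, r=1))  # [[2.0, 1.0], [0.0, 1.0]]
```

After merging, inference touches only the combined weight, which is why the merged checkpoint needs no adapter files at load time.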
## Quick Start

### Load for Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    trust_remote_code=True,
)

# Generate
prompt = "Fix the bug in this Java code: int x = 10"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### Load for Further Training
```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the merged model as the new base
base_model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply a fresh LoRA for specialized training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # can expand targets
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
# Continue training...
```
## Comparison with Alternatives
| Model | Exact Match | Pros | Cons |
|---|---|---|---|
| Base Model | 9/50 (18%) | General purpose | Lower accuracy on Java bugs |
| Base + LoRA Adapter | 14/50 (28%) | Modular, smaller files | Requires adapter loading |
| This Merged Model | 14/50 (28%) | Fast loading, simple deployment, ready for further training | Larger file size (~14 GB) |
## Known Limitations
Based on evaluation, this model still struggles with:
- API Misuse Detection (0% accuracy)
- Edge Case Handling (0% accuracy)
- Null Pointer Exception Fixes (0% accuracy)
- Python Bug Fixing (0% accuracy on OOD samples)
Recommendation: Continue training with more diverse samples focusing on these categories.
## Related Files

- Evaluation Report: `../local_inference/CHECKPOINT_COMPARISON_54_vs_1000.md`
- Original LoRA Checkpoint: `../checkpoint-1000/`
- Merge Script: `../merge_lora_to_base.py`
- Evaluation Results: `../local_inference/evaluation_results_sequential_*.json`
## Version History
| Version | Date | Description |
|---|---|---|
| v1.0 | 2026-01-02 | Initial merge of checkpoint-1000 into base model |
## License
Inherits license from base model: Qwen/Qwen2.5-Coder-7B-Instruct
## Acknowledgments
- Base Model: Qwen Team (Alibaba Cloud)
- Fine-tuning Framework: LLaMA-Factory
- Evaluation Framework: Custom 50-sample test suite
For questions or issues, refer to the evaluation documentation in `local_inference/`.