# HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Name**: HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Type**: Supervised Fine-Tuned (SFT) - Merged LoRA + Base Model
**Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
**Fine-tuning**: checkpoint-1000 (1000 training steps on Java bug-fixing)
**Version**: v1.0
**Release Date**: 2026-01-02
**Status**: ✅ Ready for Production / Further Training
---
## 📊 Model Performance
This model is the result of merging checkpoint-1000 (LoRA adapter) into the base Qwen2.5-Coder-7B-Instruct model.
### MultiPL-E Java Benchmark Results
| Model | Pass@1 | Passed | Total | Improvement |
|-------|--------|--------|-------|-------------|
| **Base Model (Qwen2.5-Coder-7B-Instruct)** | 67.72% | 107 | 158 | Baseline |
| **This Model (Fine-Tuned)** | **82.28%** | **130** | **158** | **+14.56 pp** ✅ |
**Key Achievements**:
- ✅ **+23 problems solved** compared to base model
- ✅ **27 problems** where SFT passes but base fails
- ✅ **103 problems** where both models pass
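The counts above imply 4 problems where the base model passes but the fine-tuned model fails (107 passed minus 103 shared). As a hedged sanity check (the card does not state which significance test was used), an exact McNemar test on the 27-vs-4 discordant pairs can be computed with the standard library alone:

```python
from math import comb

# Discordant pairs from the MultiPL-E comparison:
# 27 problems only the fine-tuned model passes,
# 4 problems only the base model passes (107 passed - 103 shared).
sft_only, base_only = 27, 4
n = sft_only + base_only  # 31 discordant pairs

# Exact (binomial) McNemar test: under H0, each discordant pair is
# equally likely to favor either model (p = 0.5 per pair).
tail = sum(comb(n, k) for k in range(min(sft_only, base_only) + 1)) / 2**n
p_two_sided = min(1.0, 2 * tail)

print(f"two-sided p = {p_two_sided:.2e}")  # well below 0.05
```

Under this test the MultiPL-E gap is significant by a wide margin; the 0.0238 p-value reported below is for the much smaller 50-sample internal set.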
**Benchmark Details**:
- **Dataset**: MultiPL-E Java (158 programming problems translated from HumanEval)
- **Evaluation Date**: 2026-01-08
- **Temperature**: 0.0 (deterministic)
- **Max Tokens**: 1024
### Internal Evaluation Results (50-sample test set)
| Metric | Base Model | This Model (Merged) | Improvement |
|--------|-----------|---------------------|-------------|
| **Overall Accuracy** | 9/50 (18%) | 14/50 (28%) | **+55.6%** ✅ |
| **Syntax Errors** | 6/10 (60%) | 9/10 (90%) | **+50%** ✅ |
| **Logic Bugs** | 3/10 (30%) | 4/10 (40%) | **+33%** ✅ |
| **API Misuse** | 0/10 (0%) | 0/10 (0%) | No change |
| **Edge Cases** | 0/10 (0%) | 0/10 (0%) | No change |
| **OOD JavaScript** | 0/2 (0%) | 1/2 (50%) | **+50 pp** ✅ |
**Statistical Significance**: p-value = 0.0238 (significant at α = 0.05)
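The improvement column reports gains relative to the base model (the OOD row is an absolute jump, since the base scored 0%). A quick check of the arithmetic:

```python
def relative_gain(base_pct, tuned_pct):
    """Relative improvement of the tuned model over the base, in percent."""
    return (tuned_pct - base_pct) / base_pct * 100

print(round(relative_gain(18, 28), 1))  # overall accuracy: 55.6
print(round(relative_gain(60, 90), 1))  # syntax errors: 50.0
print(round(relative_gain(30, 40), 1))  # logic bugs: 33.3
```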
---
## 🎯 Use Cases
### 1. Further Training
Use this merged model as the base for continued fine-tuning:
```yaml
# LLaMA-Factory training config
model_name_or_path: ./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
finetuning_type: lora # Can apply new LoRA on top
lora_target: q_proj,v_proj
```
**Benefits**:
- Start from improved baseline (28% accuracy vs 18%)
- No adapter overhead during training
- Can apply new LoRA adapters for specialized tasks
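Assuming a standard LLaMA-Factory installation, a config like the one above is typically launched through the `llamafactory-cli` entry point (sketch; the YAML filename is a placeholder for your full training config):

```shell
# Launch continued LoRA fine-tuning on top of the merged model
llamafactory-cli train continue_sft.yaml
```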
### 2. Direct Inference
Use for production inference without adapter loading:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
# No adapter loading needed!
```
**Benefits**:
- Faster loading (no adapter merge at runtime)
- Simpler deployment (single model, no adapter files)
- Same performance as base + adapter
### 3. Production Deployment
Deploy directly to production environments:
```bash
# Copy to deployment server
scp -r HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1 user@server:/models/
# Use in production
python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```
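The `inference_server.py` above is not included in this repository; a minimal sketch of what such a script could look like, using only `transformers` and the standard library (the JSON request schema, handler shape, and all flag names besides `--model` are illustrative assumptions):

```python
import argparse
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_parser() -> argparse.ArgumentParser:
    """CLI matching the deployment command above (`--model <path>`)."""
    parser = argparse.ArgumentParser(description="Minimal bug-fixing inference server (sketch)")
    parser.add_argument("--model", required=True, help="Path to the merged model directory")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--max-new-tokens", type=int, default=512)
    return parser


def load_model(path: str):
    # Heavy imports are deferred so the CLI can be inspected without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map="auto"
    )
    return model, AutoTokenizer.from_pretrained(path)


def make_handler(model, tokenizer, max_new_tokens):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Expects a JSON body like {"prompt": "Fix the bug in ..."}
            length = int(self.headers["Content-Length"])
            body = json.loads(self.rfile.read(length))
            messages = [{"role": "user", "content": body["prompt"]}]
            text = tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            inputs = tokenizer([text], return_tensors="pt").to(model.device)
            outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
            reply = tokenizer.decode(
                outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
            )
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps({"response": reply}).encode())

    return Handler


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    model, tokenizer = load_model(args.model)
    handler = make_handler(model, tokenizer, args.max_new_tokens)
    HTTPServer(("0.0.0.0", args.port), handler).serve_forever()

# Run as: python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```

For production loads, a batching server such as vLLM or TGI would be a better fit than this single-request sketch.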
---
## 📁 Model Files
| File | Size | Description |
|------|------|-------------|
| `model-00001-of-00004.safetensors` | ~3.5GB | Model weights (shard 1) |
| `model-00002-of-00004.safetensors` | ~3.5GB | Model weights (shard 2) |
| `model-00003-of-00004.safetensors` | ~3.5GB | Model weights (shard 3) |
| `model-00004-of-00004.safetensors` | ~3.5GB | Model weights (shard 4) |
| `config.json` | ~1KB | Model configuration |
| `tokenizer.json` | ~7MB | Tokenizer vocabulary |
| `generation_config.json` | ~1KB | Generation parameters |
**Total Size**: ~14GB
---
## 🔧 Training Details
### Original LoRA Training (checkpoint-1000)
- **Training Steps**: 1000
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, v_proj
- **Dropout**: 0.05
- **Training Data**: Java bug-fixing samples
### Merge Process
- **Method**: `merge_and_unload()` from PEFT library
- **Precision**: float16
- **Merge Date**: 2026-01-02
- **Verification**: Passed (model loads successfully)
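For reference, a minimal sketch of this merge step, assuming the standard PEFT workflow (the actual `merge_lora_to_base.py` may differ; the commented example paths are illustrative):

```python
def merge_lora(base_path: str, adapter_path: str, out_path: str) -> None:
    """Merge a LoRA adapter into its base model and save standalone fp16 weights."""
    # Imports are deferred so the helper can be defined without torch loaded.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
    merged.save_pretrained(out_path)
    # Ship the tokenizer alongside the merged weights
    AutoTokenizer.from_pretrained(base_path).save_pretrained(out_path)

# merge_lora("Qwen/Qwen2.5-Coder-7B-Instruct", "../checkpoint-1000",
#            "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
```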
---
## 🚀 Quick Start
### Load for Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    trust_remote_code=True,
)

# Generate
prompt = "Fix the bug in this Java code: int x = 10"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### Load for Further Training
```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load merged model as base
base_model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply new LoRA for specialized training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Can expand targets
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# Continue training...
```
---
## 📊 Comparison with Alternatives
| Model | Exact Match | Pros | Cons |
|-------|-------------|------|------|
| **Base Model** | 9/50 (18%) | General purpose | Lower accuracy on Java bugs |
| **Base + LoRA Adapter** | 14/50 (28%) | Modular, smaller files | Requires adapter loading |
| **This Merged Model** | 14/50 (28%) | ✅ Fast loading<br>✅ Simple deployment<br>✅ Ready for more training | Larger file size (~14GB) |
---
## ⚠️ Known Limitations
Based on evaluation, this model still struggles with:
- **API Misuse Detection** (0% accuracy)
- **Edge Case Handling** (0% accuracy)
- **Null Pointer Exception Fixes** (0% accuracy)
- **Python Bug Fixing** (0% accuracy on OOD samples)
**Recommendation**: Continue training with more diverse samples focusing on these categories.
---
## 📚 Related Files
- **Evaluation Report**: `../local_inference/CHECKPOINT_COMPARISON_54_vs_1000.md`
- **Original LoRA Checkpoint**: `../checkpoint-1000/`
- **Merge Script**: `../merge_lora_to_base.py`
- **Evaluation Results**: `../local_inference/evaluation_results_sequential_*.json`
---
## 🔄 Version History
| Version | Date | Description |
|---------|------|-------------|
| v1.0 | 2026-01-02 | Initial merge of checkpoint-1000 into base model |
---
## 📝 License
Inherits license from base model: Qwen/Qwen2.5-Coder-7B-Instruct
---
## 🙏 Acknowledgments
- **Base Model**: Qwen Team (Alibaba Cloud)
- **Fine-tuning Framework**: LLaMA-Factory
- **Evaluation Framework**: Custom 50-sample test suite
---
**For questions or issues, refer to the evaluation documentation in `local_inference/`**