# HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1

**Model Name**: HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Type**: Supervised Fine-Tuned (SFT) - Merged LoRA + Base Model
**Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
**Fine-tuning**: checkpoint-1000 (1000 training steps on Java bug-fixing)
**Version**: v1.0
**Release Date**: 2026-01-02
**Status**: ✅ Ready for Production / Further Training

---

## 📊 Model Performance

This model is the result of merging checkpoint-1000 (a LoRA adapter) into the base Qwen2.5-Coder-7B-Instruct model.

### MultiPL-E Java Benchmark Results

| Model | Pass@1 | Passed | Total | Improvement |
|-------|--------|--------|-------|-------------|
| **Base Model (Qwen2.5-Coder-7B-Instruct)** | 67.72% | 107 | 158 | Baseline |
| **This Model (Fine-Tuned)** | **82.28%** | **130** | **158** | **+14.56%** ✅ |

**Key Achievements**:
- ✅ **+23 problems solved** compared to the base model
- ✅ **27 problems** where the SFT model passes but the base model fails
- ✅ **103 problems** where both models pass

**Benchmark Details**:
- **Dataset**: MultiPL-E Java (158 programming problems translated from HumanEval)
- **Evaluation Date**: 2026-01-08
- **Temperature**: 0.0 (deterministic)
- **Max Tokens**: 1024

### Internal Evaluation Results (50-sample test set)

| Metric | Base Model | This Model (Merged) | Improvement |
|--------|-----------|---------------------|-------------|
| **Overall Accuracy** | 9/50 (18%) | 14/50 (28%) | **+55.6%** ✅ |
| **Syntax Errors** | 6/10 (60%) | 9/10 (90%) | **+50%** ✅ |
| **Logic Bugs** | 3/10 (30%) | 4/10 (40%) | **+33%** ✅ |
| **API Misuse** | 0/10 (0%) | 0/10 (0%) | No change |
| **Edge Cases** | 0/10 (0%) | 0/10 (0%) | No change |
| **OOD JavaScript** | 0/2 (0%) | 1/2 (50%) | **+50%** ✅ |

**Statistical Significance**: p-value = 0.0238 (significant at α = 0.05)

---

## 🎯 Use Cases

### 1. Further Training
Use this merged model as the base for continued fine-tuning:

```yaml
# LLaMA-Factory training config
model_name_or_path: ./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
finetuning_type: lora  # a new LoRA can be applied on top
lora_target: q_proj,v_proj
```

**Benefits**:
- Start from an improved baseline (28% accuracy vs. 18%)
- No adapter overhead during training
- New LoRA adapters can be applied for specialized tasks

### 2. Direct Inference

Use for production inference without adapter loading:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
# No adapter loading needed!
```

**Benefits**:
- Faster loading (no adapter merge at runtime)
- Simpler deployment (a single model, no adapter files)
- Same performance as base + adapter

### 3. Production Deployment

Deploy directly to production environments:

```bash
# Copy to the deployment server
scp -r HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1 user@server:/models/

# Use in production
python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```

---

## 📁 Model Files

| File | Size | Description |
|------|------|-------------|
| `model-00001-of-00004.safetensors` | ~3.5GB | Model weights (shard 1) |
| `model-00002-of-00004.safetensors` | ~3.5GB | Model weights (shard 2) |
| `model-00003-of-00004.safetensors` | ~3.5GB | Model weights (shard 3) |
| `model-00004-of-00004.safetensors` | ~3.5GB | Model weights (shard 4) |
| `config.json` | ~1KB | Model configuration |
| `tokenizer.json` | ~7MB | Tokenizer vocabulary |
| `generation_config.json` | ~1KB | Generation parameters |

**Total Size**: ~14GB

---

## 🔧 Training Details

### Original LoRA Training (checkpoint-1000)

- **Training Steps**: 1000
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, v_proj
- **Dropout**: 0.05
- **Training Data**: Java bug-fixing samples

### Merge Process

- **Method**: `merge_and_unload()` from the PEFT library
- **Precision**: float16
- **Merge Date**: 2026-01-02
- **Verification**: Passed (model loads successfully)

---

## 🚀 Quick Start

### Load for Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    trust_remote_code=True
)

# Generate
prompt = "Fix the bug in this Java code: int x = 10"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

### Load for Further Training

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Load the merged model as the base
base_model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Apply a new LoRA for specialized training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # targets can be expanded
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)

# Continue training...
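# --- Hedged sketch (not part of the original card): one concrete way to
# --- "continue training" is the standard Hugging Face Trainer loop. The
# --- output directory and dataset below are placeholders, not artifacts
# --- that ship with this model.
training_args = TrainingArguments(
    output_dir="./sft-v2-lora",        # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=1e-4,
    fp16=True,
    logging_steps=10,
)
# from transformers import Trainer
# trainer = Trainer(model=model, args=training_args, train_dataset=my_dataset)
# trainer.train()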
```

---

## 📊 Comparison with Alternatives

| Model | Exact Match | Pros | Cons |
|-------|-------------|------|------|
| **Base Model** | 9/50 (18%) | General purpose | Lower accuracy on Java bugs |
| **Base + LoRA Adapter** | 14/50 (28%) | Modular, smaller files | Requires adapter loading |
| **This Merged Model** | 14/50 (28%) | ✅ Fast loading<br>✅ Simple deployment<br>✅ Ready for more training | Larger file size (~14GB) |

---

## ⚠️ Known Limitations

Based on evaluation, this model still struggles with:

- **API Misuse Detection** (0% accuracy)
- **Edge Case Handling** (0% accuracy)
- **Null Pointer Exception Fixes** (0% accuracy)
- **Python Bug Fixing** (0% accuracy on OOD samples)

**Recommendation**: Continue training with more diverse samples focused on these categories.

---

## 📚 Related Files

- **Evaluation Report**: `../local_inference/CHECKPOINT_COMPARISON_54_vs_1000.md`
- **Original LoRA Checkpoint**: `../checkpoint-1000/`
- **Merge Script**: `../merge_lora_to_base.py`
- **Evaluation Results**: `../local_inference/evaluation_results_sequential_*.json`

---

## 🔄 Version History

| Version | Date | Description |
|---------|------|-------------|
| v1.0 | 2026-01-02 | Initial merge of checkpoint-1000 into the base model |

---

## 📝 License

Inherits the license of the base model: Qwen/Qwen2.5-Coder-7B-Instruct

---

## 🙏 Acknowledgments

- **Base Model**: Qwen Team (Alibaba Cloud)
- **Fine-tuning Framework**: LLaMA-Factory
- **Evaluation Framework**: Custom 50-sample test suite

---

**For questions or issues, refer to the evaluation documentation in `local_inference/`**
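As a footnote on the significance figures reported earlier: the MultiPL-E results imply a paired comparison with 27 problems solved only by the fine-tuned model and 4 (107 − 103) solved only by the base model; that 27/4 split is *derived* from the reported counts, not stated in the evaluation itself. The sketch below is an illustrative McNemar exact test on those discordant counts, not part of the original evaluation pipeline (the card's p = 0.0238 refers to the separate 50-sample internal eval).

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases where only model A succeeds; c: cases where only model B succeeds.
    Under H0, each discordant case is a fair coin flip between A and B.
    """
    n = b + c
    k = min(b, c)
    # Two-sided: double the binomial tail of the smaller count, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Discordant counts derived from the MultiPL-E table (assumption, see above).
p = mcnemar_exact(27, 4)
print(f"McNemar exact p-value: {p:.2e}")  # prints a value around 3.4e-05
```

On these derived counts the benchmark gain is significant well beyond the α = 0.05 threshold, consistent with the card's overall claim.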