# HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Name**: HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Type**: Supervised Fine-Tuned (SFT) - Merged LoRA + Base Model
**Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
**Fine-tuning**: checkpoint-1000 (1000 training steps on Java bug-fixing)
**Version**: v1.0
**Release Date**: 2026-01-02
**Status**: Ready for Production / Further Training
---
## Model Performance
This model is the result of merging checkpoint-1000 (LoRA adapter) into the base Qwen2.5-Coder-7B-Instruct model.
### MultiPL-E Java Benchmark Results
| Model | Pass@1 | Passed | Total | Improvement |
|-------|--------|--------|-------|-------------|
| **Base Model (Qwen2.5-Coder-7B-Instruct)** | 67.72% | 107 | 158 | Baseline |
| **This Model (Fine-Tuned)** | **82.28%** | **130** | **158** | **+14.56%** |
**Key Achievements**:
- **+23 problems solved** compared to the base model
- **27 problems** where the SFT model passes but the base model fails
- **103 problems** where both models pass
**Benchmark Details**:
- **Dataset**: MultiPL-E Java (158 programming problems translated from HumanEval)
- **Evaluation Date**: 2026-01-08
- **Temperature**: 0.0 (deterministic)
- **Max Tokens**: 1024
### Internal Evaluation Results (50-sample test set)
| Metric | Base Model | This Model (Merged) | Improvement |
|--------|-----------|---------------------|-------------|
| **Overall Accuracy** | 9/50 (18%) | 14/50 (28%) | **+55.6%** |
| **Syntax Errors** | 6/10 (60%) | 9/10 (90%) | **+50%** |
| **Logic Bugs** | 3/10 (30%) | 4/10 (40%) | **+33%** |
| **API Misuse** | 0/10 (0%) | 0/10 (0%) | No change |
| **Edge Cases** | 0/10 (0%) | 0/10 (0%) | No change |
| **OOD JavaScript** | 0/2 (0%) | 1/2 (50%) | **+50%** |
**Statistical Significance**: p-value = 0.0238 (significant at α=0.05)
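The report does not state which test produced this p-value. One standard choice for paired pass/fail outcomes is McNemar's exact test, which considers only the discordant problems (those where exactly one model passes). A sketch with hypothetical discordant counts, not the actual evaluation data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts:
    b = problems only the SFT model passes, c = problems only the base passes.
    Under H0, each discordant problem is a fair coin flip."""
    n = b + c
    k = max(b, c)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: 9 SFT-only wins vs 1 base-only win
print(round(mcnemar_exact(9, 1), 4))  # 0.0215, significant at alpha = 0.05
```

The key point is that a paired test on the same 50 samples is more powerful than comparing the two accuracy totals as independent proportions.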
---
## Use Cases
### 1. Further Training
Use this merged model as the base for continued fine-tuning:
```yaml
# LLaMA-Factory training config
model_name_or_path: ./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
finetuning_type: lora # Can apply new LoRA on top
lora_target: q_proj,v_proj
```
**Benefits**:
- Start from improved baseline (28% accuracy vs 18%)
- No adapter overhead during training
- Can apply new LoRA adapters for specialized tasks
### 2. Direct Inference
Use for production inference without adapter loading:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
# No adapter loading needed!
```
**Benefits**:
- Faster loading (no adapter merge at runtime)
- Simpler deployment (single model, no adapter files)
- Same performance as base + adapter
### 3. Production Deployment
Deploy directly to production environments:
```bash
# Copy to deployment server
scp -r HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1 user@server:/models/
# Use in production
python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```
---
## Model Files
| File | Size | Description |
|------|------|-------------|
| `model-00001-of-00004.safetensors` | ~3.5GB | Model weights (shard 1) |
| `model-00002-of-00004.safetensors` | ~3.5GB | Model weights (shard 2) |
| `model-00003-of-00004.safetensors` | ~3.5GB | Model weights (shard 3) |
| `model-00004-of-00004.safetensors` | ~3.5GB | Model weights (shard 4) |
| `config.json` | ~1KB | Model configuration |
| `tokenizer.json` | ~7MB | Tokenizer vocabulary |
| `generation_config.json` | ~1KB | Generation parameters |
**Total Size**: ~14GB
---
## Training Details
### Original LoRA Training (checkpoint-1000)
- **Training Steps**: 1000
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, v_proj
- **Dropout**: 0.05
- **Training Data**: Java bug-fixing samples
### Merge Process
- **Method**: `merge_and_unload()` from PEFT library
- **Precision**: float16
- **Merge Date**: 2026-01-02
- **Verification**: Passed (model loads successfully)
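Conceptually, `merge_and_unload()` folds each LoRA update into its frozen linear layer: W_merged = W + (lora_alpha / r) · B·A, after which the adapter can be discarded. A toy sketch of that arithmetic with tiny matrices (not the real 7B weights):

```python
# Toy illustration of a LoRA merge for one linear layer.
# W: (out x in) frozen weight; A: (r x in) and B: (out x r) LoRA factors.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, r, lora_alpha):
    scale = lora_alpha / r          # PEFT scales the update by alpha / rank
    BA = matmul(B, A)               # low-rank update, same shape as W
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
B = [[1.0], [2.0]]            # out x r, with r = 1
A = [[3.0, 4.0]]              # r x in
print(merge_lora(W, A, B, r=1, lora_alpha=2))  # [[7.0, 8.0], [12.0, 17.0]]
```

Because the update is folded in once, the merged model has identical outputs to base + adapter (up to float16 rounding) but no extra matmuls at inference time.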
---
## Quick Start
### Load for Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    trust_remote_code=True,
)

# Generate
prompt = "Fix the bug in this Java code: int x = 10"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### Load for Further Training
```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Load merged model as the new base
base_model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply a new LoRA for specialized training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # can expand targets
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
# Continue training...
```
---
## Comparison with Alternatives
| Model | Exact Match | Pros | Cons |
|-------|-------------|------|------|
| **Base Model** | 9/50 (18%) | General purpose | Lower accuracy on Java bugs |
| **Base + LoRA Adapter** | 14/50 (28%) | Modular, smaller files | Requires adapter loading |
| **This Merged Model** | 14/50 (28%) | Fast loading<br/>Simple deployment<br/>Ready for more training | Larger file size (~14GB) |
---
## Known Limitations
Based on evaluation, this model still struggles with:
- **API Misuse Detection** (0% accuracy)
- **Edge Case Handling** (0% accuracy)
- **Null Pointer Exception Fixes** (0% accuracy)
- **Python Bug Fixing** (0% accuracy on OOD samples)
**Recommendation**: Continue training with more diverse samples focusing on these categories.
---
## Related Files
- **Evaluation Report**: `../local_inference/CHECKPOINT_COMPARISON_54_vs_1000.md`
- **Original LoRA Checkpoint**: `../checkpoint-1000/`
- **Merge Script**: `../merge_lora_to_base.py`
- **Evaluation Results**: `../local_inference/evaluation_results_sequential_*.json`
---
## Version History
| Version | Date | Description |
|---------|------|-------------|
| v1.0 | 2026-01-02 | Initial merge of checkpoint-1000 into base model |
---
## License
Inherits license from base model: Qwen/Qwen2.5-Coder-7B-Instruct
---
## Acknowledgments
- **Base Model**: Qwen Team (Alibaba Cloud)
- **Fine-tuning Framework**: LLaMA-Factory
- **Evaluation Framework**: Custom 50-sample test suite
---
**For questions or issues, refer to the evaluation documentation in `local_inference/`**