# HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1

**Model Name**: HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Type**: Supervised Fine-Tuned (SFT) - Merged LoRA + Base Model
**Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
**Fine-tuning**: checkpoint-1000 (1000 training steps on Java bug-fixing)
**Version**: v1.0
**Release Date**: 2026-01-02
**Status**: ✅ Ready for Production / Further Training

---

## 📊 Model Performance

This model is the result of merging checkpoint-1000 (a LoRA adapter) into the base Qwen2.5-Coder-7B-Instruct model.

### MultiPL-E Java Benchmark Results

| Model | Pass@1 | Passed | Total | Improvement |
|-------|--------|--------|-------|-------------|
| **Base Model (Qwen2.5-Coder-7B-Instruct)** | 67.72% | 107 | 158 | Baseline |
| **This Model (Fine-Tuned)** | **82.28%** | **130** | **158** | **+14.56%** ✅ |

**Key Achievements**:
- ✅ **+23 problems solved** compared to the base model
- ✅ **27 problems** where the SFT model passes but the base model fails
- ✅ **103 problems** where both models pass

**Benchmark Details**:
- **Dataset**: MultiPL-E Java (158 programming problems translated from HumanEval)
- **Evaluation Date**: 2026-01-08
- **Temperature**: 0.0 (deterministic)
- **Max Tokens**: 1024

### Internal Evaluation Results (50-sample test set)

| Metric | Base Model | This Model (Merged) | Improvement |
|--------|-----------|---------------------|-------------|
| **Overall Accuracy** | 9/50 (18%) | 14/50 (28%) | **+55.6%** ✅ |
| **Syntax Errors** | 6/10 (60%) | 9/10 (90%) | **+50%** ✅ |
| **Logic Bugs** | 3/10 (30%) | 4/10 (40%) | **+33%** ✅ |
| **API Misuse** | 0/10 (0%) | 0/10 (0%) | No change |
| **Edge Cases** | 0/10 (0%) | 0/10 (0%) | No change |
| **OOD JavaScript** | 0/2 (0%) | 1/2 (50%) | **+50%** ✅ |

**Statistical Significance**: p-value = 0.0238 (significant at α = 0.05)

---

## 🎯 Use Cases

### 1. Further Training
Use this merged model as the base for continued fine-tuning:

```yaml
# LLaMA-Factory training config
model_name_or_path: ./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
finetuning_type: lora  # a new LoRA can be applied on top
lora_target: q_proj,v_proj
```

**Benefits**:
- Start from an improved baseline (28% accuracy vs. 18%)
- No adapter overhead during training
- New LoRA adapters can be applied for specialized tasks

### 2. Direct Inference

Use for production inference without adapter loading:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
# No adapter loading needed!
```

**Benefits**:
- Faster loading (no adapter merge at runtime)
- Simpler deployment (a single model, no adapter files)
- Same performance as base + adapter

### 3. Production Deployment

Deploy directly to production environments:

```bash
# Copy to the deployment server
scp -r HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1 user@server:/models/

# Use in production
python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```

---

## 📁 Model Files

| File | Size | Description |
|------|------|-------------|
| `model-00001-of-00004.safetensors` | ~3.5GB | Model weights (shard 1) |
| `model-00002-of-00004.safetensors` | ~3.5GB | Model weights (shard 2) |
| `model-00003-of-00004.safetensors` | ~3.5GB | Model weights (shard 3) |
| `model-00004-of-00004.safetensors` | ~3.5GB | Model weights (shard 4) |
| `config.json` | ~1KB | Model configuration |
| `tokenizer.json` | ~7MB | Tokenizer vocabulary |
| `generation_config.json` | ~1KB | Generation parameters |

**Total Size**: ~14GB

---

## 🔧 Training Details

### Original LoRA Training (checkpoint-1000)

- **Training Steps**: 1000
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, v_proj
- **Dropout**: 0.05
- **Training Data**: Java bug-fixing samples

### Merge Process

- **Method**: `merge_and_unload()` from the PEFT library
- **Precision**: float16
- **Merge Date**: 2026-01-02
- **Verification**: Passed (model loads successfully)

---

## 🚀 Quick Start

### Load for Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    trust_remote_code=True
)

# Generate
prompt = "Fix the bug in this Java code: int x = 10"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

### Load for Further Training

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Load the merged model as the base
base_model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Apply a new LoRA for specialized training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # targets can be expanded
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)

# Continue training...
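# --- Hedged sketch (not part of the original card): one concrete way to
# --- "continue training" is the standard Hugging Face Trainer loop. The
# --- output directory and dataset below are placeholders, not artifacts
# --- that ship with this model.
training_args = TrainingArguments(
    output_dir="./sft-v2-lora",        # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=1e-4,
    fp16=True,
    logging_steps=10,
)
# from transformers import Trainer
# trainer = Trainer(model=model, args=training_args, train_dataset=my_dataset)
# trainer.train()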
```

---

## 📊 Comparison with Alternatives

| Model | Exact Match | Pros | Cons |
|-------|-------------|------|------|
| **Base Model** | 9/50 (18%) | General purpose | Lower accuracy on Java bugs |
| **Base + LoRA Adapter** | 14/50 (28%) | Modular, smaller files | Requires adapter loading |
| **This Merged Model** | 14/50 (28%) | ✅ Fast loading<br>✅ Simple deployment<br>✅ Ready for more training | Larger file size (~14GB) |

---

## ⚠️ Known Limitations

Based on evaluation, this model still struggles with:

- **API Misuse Detection** (0% accuracy)
- **Edge Case Handling** (0% accuracy)
- **Null Pointer Exception Fixes** (0% accuracy)
- **Python Bug Fixing** (0% accuracy on OOD samples)

**Recommendation**: Continue training with more diverse samples focused on these categories.

---

## 📚 Related Files

- **Evaluation Report**: `../local_inference/CHECKPOINT_COMPARISON_54_vs_1000.md`
- **Original LoRA Checkpoint**: `../checkpoint-1000/`
- **Merge Script**: `../merge_lora_to_base.py`
- **Evaluation Results**: `../local_inference/evaluation_results_sequential_*.json`

---

## 🔄 Version History

| Version | Date | Description |
|---------|------|-------------|
| v1.0 | 2026-01-02 | Initial merge of checkpoint-1000 into the base model |

---

## 📝 License

Inherits the license of the base model: Qwen/Qwen2.5-Coder-7B-Instruct

---

## 🙏 Acknowledgments

- **Base Model**: Qwen Team (Alibaba Cloud)
- **Fine-tuning Framework**: LLaMA-Factory
- **Evaluation Framework**: Custom 50-sample test suite

---

**For questions or issues, refer to the evaluation documentation in `local_inference/`**
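As a footnote on the significance figures reported earlier: the MultiPL-E results imply a paired comparison with 27 problems solved only by the fine-tuned model and 4 (107 − 103) solved only by the base model; that 27/4 split is *derived* from the reported counts, not stated in the evaluation itself. The sketch below is an illustrative McNemar exact test on those discordant counts, not part of the original evaluation pipeline (the card's p = 0.0238 refers to the separate 50-sample internal eval).

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases where only model A succeeds; c: cases where only model B succeeds.
    Under H0, each discordant case is a fair coin flip between A and B.
    """
    n = b + c
    k = min(b, c)
    # Two-sided: double the binomial tail of the smaller count, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Discordant counts derived from the MultiPL-E table (assumption, see above).
p = mcnemar_exact(27, 4)
print(f"McNemar exact p-value: {p:.2e}")  # prints a value around 3.4e-05
```

On these derived counts the benchmark gain is significant well beyond the α = 0.05 threshold, consistent with the card's overall claim.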