# HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1

**Model Name**: HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Type**: Supervised Fine-Tuned (SFT) - Merged LoRA + Base Model
**Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
**Fine-tuning**: checkpoint-1000 (1000 training steps on Java bug-fixing)
**Version**: v1.0
**Release Date**: 2026-01-02
**Status**: Ready for Production / Further Training

---

## Model Performance

This model is the result of merging checkpoint-1000 (a LoRA adapter) into the base Qwen2.5-Coder-7B-Instruct model.

### MultiPL-E Java Benchmark Results
| Model | Pass@1 | Passed | Total | Improvement |
|-------|--------|--------|-------|-------------|
| **Base Model (Qwen2.5-Coder-7B-Instruct)** | 67.72% | 107 | 158 | Baseline |
| **This Model (Fine-Tuned)** | **82.28%** | **130** | **158** | **+14.56 pp** |
**Key Achievements**:

- **+23 problems solved** compared to the base model
- **27 problems** where the SFT model passes but the base model fails
- **103 problems** where both models pass

**Benchmark Details**:

- **Dataset**: MultiPL-E Java (158 programming problems translated from HumanEval)
- **Evaluation Date**: 2026-01-08
- **Temperature**: 0.0 (deterministic)
- **Max Tokens**: 1024
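With greedy decoding (temperature 0.0, one completion per problem), pass@1 reduces to the fraction of problems whose single completion passes all tests. The reported counts are mutually consistent, which a few lines of arithmetic confirm (the base-only count of 4 is derived here, not stated in the report):

```python
# Reported MultiPL-E Java counts (158 problems total).
TOTAL = 158
base_passed, sft_passed = 107, 130
both_pass, sft_only = 103, 27

# With a single greedy sample, pass@1 is simply passed / total.
base_pass_at_1 = base_passed / TOTAL  # -> 67.72%
sft_pass_at_1 = sft_passed / TOTAL    # -> 82.28%

# Derived cells of the 2x2 pass/fail contingency table.
base_only = base_passed - both_pass             # problems only the base model solves
neither = TOTAL - both_pass - sft_only - base_only

assert sft_passed == both_pass + sft_only       # 130 = 103 + 27
assert base_only == 4 and neither == 24
print(f"base {base_pass_at_1:.2%}, sft {sft_pass_at_1:.2%}, +{sft_passed - base_passed} problems")
```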
### Internal Evaluation Results (50-sample test set)

| Metric | Base Model | This Model (Merged) | Improvement |
|--------|-----------|---------------------|-------------|
| **Overall Accuracy** | 9/50 (18%) | 14/50 (28%) | **+55.6%** |
| **Syntax Errors** | 6/10 (60%) | 9/10 (90%) | **+50%** |
| **Logic Bugs** | 3/10 (30%) | 4/10 (40%) | **+33%** |
| **API Misuse** | 0/10 (0%) | 0/10 (0%) | No change |
| **Edge Cases** | 0/10 (0%) | 0/10 (0%) | No change |
| **OOD JavaScript** | 0/2 (0%) | 1/2 (50%) | **+50%** |
**Statistical Significance**: p = 0.0238 (significant at α = 0.05)
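The report does not state which test produced this p-value. For paired pass/fail outcomes on the same 50 samples, an exact McNemar test on the discordant pairs is a common choice; the sketch below implements it with the standard library, and the counts in the example call are hypothetical (they are not the actual discordant counts behind p = 0.0238):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for discordant pair counts:
    b = base passes / SFT fails, c = SFT passes / base fails.
    Under H0 the discordant pairs follow Binomial(b + c, 0.5)."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts, for illustration only.
print(mcnemar_exact(1, 9))  # ~0.0215
```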
---

## Use Cases

### 1. Further Training

Use this merged model as the base for continued fine-tuning:

```yaml
# LLaMA-Factory training config
model_name_or_path: ./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
finetuning_type: lora  # a new LoRA can be applied on top
lora_target: q_proj,v_proj
```
**Benefits**:

- Start from an improved baseline (28% accuracy vs. 18%)
- No adapter overhead during training
- New LoRA adapters can be applied for specialized tasks

### 2. Direct Inference

Use for production inference without adapter loading:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
# No adapter loading needed
```
**Benefits**:

- Faster loading (no adapter merge at runtime)
- Simpler deployment (single model, no adapter files)
- Same performance as base + adapter

### 3. Production Deployment

Deploy directly to production environments:

```bash
# Copy to the deployment server
scp -r HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1 user@server:/models/

# Use in production
python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```
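`inference_server.py` is project-specific and not included here. As a rough illustration of what such an entry point might look like (the flag name `--model` matches the command above; everything else is an assumption), a minimal interactive serving loop could be:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the --model flag used in the deployment command above.
    parser = argparse.ArgumentParser(description="Minimal chat inference loop")
    parser.add_argument("--model", required=True, help="Path to the merged model directory")
    parser.add_argument("--max-new-tokens", type=int, default=512)
    return parser

def main() -> None:
    args = build_parser().parse_args()
    # Heavy imports stay inside main() so the module imports without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        args.model, torch_dtype=torch.float16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    while True:
        prompt = input("user> ")
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=args.max_new_tokens, do_sample=False)
        print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

A production deployment would typically wrap this in an HTTP server with batching rather than a stdin loop.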
---

## Model Files

| File | Size | Description |
|------|------|-------------|
| `model-00001-of-00004.safetensors` | ~3.5GB | Model weights (shard 1) |
| `model-00002-of-00004.safetensors` | ~3.5GB | Model weights (shard 2) |
| `model-00003-of-00004.safetensors` | ~3.5GB | Model weights (shard 3) |
| `model-00004-of-00004.safetensors` | ~3.5GB | Model weights (shard 4) |
| `config.json` | ~1KB | Model configuration |
| `tokenizer.json` | ~7MB | Tokenizer vocabulary |
| `generation_config.json` | ~1KB | Generation parameters |

**Total Size**: ~14GB
---

## Training Details

### Original LoRA Training (checkpoint-1000)

- **Training Steps**: 1000
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, v_proj
- **Dropout**: 0.05
- **Training Data**: Java bug-fixing samples

### Merge Process

- **Method**: `merge_and_unload()` from the PEFT library
- **Precision**: float16
- **Merge Date**: 2026-01-02
- **Verification**: Passed (model loads successfully)
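The merge steps above can be sketched with PEFT's standard API. This is a minimal illustration, not the project's actual `merge_lora_to_base.py`; the function name and paths in the commented example are placeholders:

```python
def merge_lora(base_path: str, adapter_path: str, out_path: str) -> None:
    """Fold a LoRA adapter into its base model and save a standalone checkpoint."""
    # Heavy imports are local so the function can be defined without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    # merge_and_unload() bakes the adapter deltas into the base weights
    # and returns a plain transformers model with no PEFT wrappers.
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
    merged.save_pretrained(out_path, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_path).save_pretrained(out_path)

# Example (placeholder paths):
# merge_lora(
#     "Qwen/Qwen2.5-Coder-7B-Instruct",
#     "../checkpoint-1000",
#     "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
# )
```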
---

## Quick Start

### Load for Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    trust_remote_code=True,
)

# Generate
prompt = "Fix the bug in this Java code: int x = 10"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### Load for Further Training

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the merged model as the new base
base_model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply a new LoRA for specialized training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # targets can be expanded
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# Continue training...
```
---

## Comparison with Alternatives

| Model | Exact Match | Pros | Cons |
|-------|-------------|------|------|
| **Base Model** | 9/50 (18%) | General purpose | Lower accuracy on Java bugs |
| **Base + LoRA Adapter** | 14/50 (28%) | Modular, smaller files | Requires adapter loading |
| **This Merged Model** | 14/50 (28%) | Fast loading<br/>Simple deployment<br/>Ready for more training | Larger file size (~14GB) |

---
## Known Limitations

Based on evaluation, this model still struggles with:

- **API Misuse Detection** (0% accuracy)
- **Edge Case Handling** (0% accuracy)
- **Null Pointer Exception Fixes** (0% accuracy)
- **Python Bug Fixing** (0% accuracy on OOD samples)

**Recommendation**: Continue training with more diverse samples focused on these categories.

---
## Related Files

- **Evaluation Report**: `../local_inference/CHECKPOINT_COMPARISON_54_vs_1000.md`
- **Original LoRA Checkpoint**: `../checkpoint-1000/`
- **Merge Script**: `../merge_lora_to_base.py`
- **Evaluation Results**: `../local_inference/evaluation_results_sequential_*.json`

---

## Version History

| Version | Date | Description |
|---------|------|-------------|
| v1.0 | 2026-01-02 | Initial merge of checkpoint-1000 into the base model |

---

## License

Inherits the license of the base model: Qwen/Qwen2.5-Coder-7B-Instruct

---

## Acknowledgments

- **Base Model**: Qwen Team (Alibaba Cloud)
- **Fine-tuning Framework**: LLaMA-Factory
- **Evaluation Framework**: Custom 50-sample test suite

---

**For questions or issues, refer to the evaluation documentation in `local_inference/`.**