# HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1

**Model Name**: HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
**Model Type**: Supervised Fine-Tuned (SFT) - Merged LoRA + Base Model
**Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
**Fine-tuning**: checkpoint-1000 (1000 training steps on Java bug-fixing)
**Version**: v1.0
**Release Date**: 2026-01-02
**Status**: Ready for Production / Further Training

---

## Model Performance

This model is the result of merging checkpoint-1000 (a LoRA adapter) into the base Qwen2.5-Coder-7B-Instruct model.

### MultiPL-E Java Benchmark Results
| Model | Pass@1 | Passed | Total | Improvement |
|-------|--------|--------|-------|-------------|
| **Base Model (Qwen2.5-Coder-7B-Instruct)** | 67.72% | 107 | 158 | Baseline |
| **This Model (Fine-Tuned)** | **82.28%** | **130** | **158** | **+14.56 pp** |
**Key Achievements**:

- **+23 problems solved** compared to the base model
- **27 problems** where the SFT model passes but the base model fails
- **103 problems** where both models pass

**Benchmark Details**:

- **Dataset**: MultiPL-E Java (158 programming problems translated from HumanEval)
- **Evaluation Date**: 2026-01-08
- **Temperature**: 0.0 (deterministic)
- **Max Tokens**: 1024
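With greedy decoding (temperature 0.0, one completion per problem), pass@1 reduces to the fraction of problems whose single completion passes all tests. The reported counts are mutually consistent, which a few lines of arithmetic confirm (the base-only count of 4 is derived here, not stated in the report):

```python
# Reported MultiPL-E Java counts (158 problems total).
TOTAL = 158
base_passed, sft_passed = 107, 130
both_pass, sft_only = 103, 27

# With a single greedy sample, pass@1 is simply passed / total.
base_pass_at_1 = base_passed / TOTAL  # -> 67.72%
sft_pass_at_1 = sft_passed / TOTAL    # -> 82.28%

# Derived cells of the 2x2 pass/fail contingency table.
base_only = base_passed - both_pass             # problems only the base model solves
neither = TOTAL - both_pass - sft_only - base_only

assert sft_passed == both_pass + sft_only       # 130 = 103 + 27
assert base_only == 4 and neither == 24
print(f"base {base_pass_at_1:.2%}, sft {sft_pass_at_1:.2%}, +{sft_passed - base_passed} problems")
```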
### Internal Evaluation Results (50-sample test set)

| Metric | Base Model | This Model (Merged) | Improvement |
|--------|-----------|---------------------|-------------|
| **Overall Accuracy** | 9/50 (18%) | 14/50 (28%) | **+55.6%** |
| **Syntax Errors** | 6/10 (60%) | 9/10 (90%) | **+50%** |
| **Logic Bugs** | 3/10 (30%) | 4/10 (40%) | **+33%** |
| **API Misuse** | 0/10 (0%) | 0/10 (0%) | No change |
| **Edge Cases** | 0/10 (0%) | 0/10 (0%) | No change |
| **OOD JavaScript** | 0/2 (0%) | 1/2 (50%) | **+50%** |
**Statistical Significance**: p = 0.0238 (significant at α = 0.05)
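The report does not state which test produced this p-value. For paired pass/fail outcomes on the same 50 samples, an exact McNemar test on the discordant pairs is a common choice; the sketch below implements it with the standard library, and the counts in the example call are hypothetical (they are not the actual discordant counts behind p = 0.0238):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for discordant pair counts:
    b = base passes / SFT fails, c = SFT passes / base fails.
    Under H0 the discordant pairs follow Binomial(b + c, 0.5)."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts, for illustration only.
print(mcnemar_exact(1, 9))  # ~0.0215
```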
---

## Use Cases

### 1. Further Training

Use this merged model as the base for continued fine-tuning:

```yaml
# LLaMA-Factory training config
model_name_or_path: ./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
finetuning_type: lora  # a new LoRA can be applied on top
lora_target: q_proj,v_proj
```
**Benefits**:

- Start from an improved baseline (28% accuracy vs. 18%)
- No adapter overhead during training
- New LoRA adapters can be applied for specialized tasks

### 2. Direct Inference

Use for production inference without adapter loading:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1")
# No adapter loading needed
```
**Benefits**:

- Faster loading (no adapter merge at runtime)
- Simpler deployment (single model, no adapter files)
- Same performance as base + adapter

### 3. Production Deployment

Deploy directly to production environments:

```bash
# Copy to the deployment server
scp -r HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1 user@server:/models/

# Use in production
python inference_server.py --model /models/HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1
```
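`inference_server.py` is project-specific and not included here. As a rough illustration of what such an entry point might look like (the flag name `--model` matches the command above; everything else is an assumption), a minimal interactive serving loop could be:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the --model flag used in the deployment command above.
    parser = argparse.ArgumentParser(description="Minimal chat inference loop")
    parser.add_argument("--model", required=True, help="Path to the merged model directory")
    parser.add_argument("--max-new-tokens", type=int, default=512)
    return parser

def main() -> None:
    args = build_parser().parse_args()
    # Heavy imports stay inside main() so the module imports without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        args.model, torch_dtype=torch.float16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    while True:
        prompt = input("user> ")
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=args.max_new_tokens, do_sample=False)
        print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

A production deployment would typically wrap this in an HTTP server with batching rather than a stdin loop.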
---

## Model Files

| File | Size | Description |
|------|------|-------------|
| `model-00001-of-00004.safetensors` | ~3.5GB | Model weights (shard 1) |
| `model-00002-of-00004.safetensors` | ~3.5GB | Model weights (shard 2) |
| `model-00003-of-00004.safetensors` | ~3.5GB | Model weights (shard 3) |
| `model-00004-of-00004.safetensors` | ~3.5GB | Model weights (shard 4) |
| `config.json` | ~1KB | Model configuration |
| `tokenizer.json` | ~7MB | Tokenizer vocabulary |
| `generation_config.json` | ~1KB | Generation parameters |

**Total Size**: ~14GB
---

## Training Details

### Original LoRA Training (checkpoint-1000)

- **Training Steps**: 1000
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, v_proj
- **Dropout**: 0.05
- **Training Data**: Java bug-fixing samples

### Merge Process

- **Method**: `merge_and_unload()` from the PEFT library
- **Precision**: float16
- **Merge Date**: 2026-01-02
- **Verification**: Passed (model loads successfully)
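The merge steps above can be sketched with PEFT's standard API. This is a minimal illustration, not the project's actual `merge_lora_to_base.py`; the function name and paths in the commented example are placeholders:

```python
def merge_lora(base_path: str, adapter_path: str, out_path: str) -> None:
    """Fold a LoRA adapter into its base model and save a standalone checkpoint."""
    # Heavy imports are local so the function can be defined without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    # merge_and_unload() bakes the adapter deltas into the base weights
    # and returns a plain transformers model with no PEFT wrappers.
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
    merged.save_pretrained(out_path, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_path).save_pretrained(out_path)

# Example (placeholder paths):
# merge_lora(
#     "Qwen/Qwen2.5-Coder-7B-Instruct",
#     "../checkpoint-1000",
#     "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
# )
```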
---

## Quick Start

### Load for Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    trust_remote_code=True,
)

# Generate
prompt = "Fix the bug in this Java code: int x = 10"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### Load for Further Training

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the merged model as the new base
base_model = AutoModelForCausalLM.from_pretrained(
    "./HaiJava-Surgeon-Qwen2.5-Coder-7B-SFT-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply a new LoRA for specialized training
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # targets can be expanded
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# Continue training...
```
---

## Comparison with Alternatives

| Model | Exact Match | Pros | Cons |
|-------|-------------|------|------|
| **Base Model** | 9/50 (18%) | General purpose | Lower accuracy on Java bugs |
| **Base + LoRA Adapter** | 14/50 (28%) | Modular, smaller files | Requires adapter loading |
| **This Merged Model** | 14/50 (28%) | Fast loading<br/>Simple deployment<br/>Ready for more training | Larger file size (~14GB) |

---
## Known Limitations

Based on evaluation, this model still struggles with:

- **API Misuse Detection** (0% accuracy)
- **Edge Case Handling** (0% accuracy)
- **Null Pointer Exception Fixes** (0% accuracy)
- **Python Bug Fixing** (0% accuracy on OOD samples)

**Recommendation**: Continue training with more diverse samples focused on these categories.

---
## Related Files

- **Evaluation Report**: `../local_inference/CHECKPOINT_COMPARISON_54_vs_1000.md`
- **Original LoRA Checkpoint**: `../checkpoint-1000/`
- **Merge Script**: `../merge_lora_to_base.py`
- **Evaluation Results**: `../local_inference/evaluation_results_sequential_*.json`

---

## Version History

| Version | Date | Description |
|---------|------|-------------|
| v1.0 | 2026-01-02 | Initial merge of checkpoint-1000 into the base model |

---

## License

Inherits the license of the base model: Qwen/Qwen2.5-Coder-7B-Instruct

---

## Acknowledgments

- **Base Model**: Qwen Team (Alibaba Cloud)
- **Fine-tuning Framework**: LLaMA-Factory
- **Evaluation Framework**: Custom 50-sample test suite

---

**For questions or issues, refer to the evaluation documentation in `local_inference/`.**