Upload README.md with huggingface_hub

09ef04a verified 3 days ago

8.2 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen3.5-9B
	tags:
	- qwen3.5
	- code
	- tool-calling
	- lora
	- sft
	- dpo
	- unsloth
	- reasoning
	- chain-of-thought
	datasets:
	- nohurry/Opus-4.6-Reasoning-3000x-filtered
	- Roman1111111/claude-opus-4.6-10000x
	- TeichAI/claude-4.5-opus-high-reasoning-250x
	- Jackrong/Qwen3.5-reasoning-700x
	- togethercomputer/CoderForge-Preview
	- TIGER-Lab/AceCode-V2-122K
	language:
	- en
	pipeline_tag: text-generation
	---

	# Qwen3.5-DeltaCoder-9B

	> Reliable tool-calling for agentic coding — LoRA fine-tune of Qwen3.5-9B
	> v1.1-DPO released — DPO alignment improves code correctness and self-verification.
	> If you downloaded before March 28, 2026, please re-pull to get v1.1-DPO.

	[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Base Model](https://img.shields.io/badge/Base-Qwen3.5--9B-purple)](https://huggingface.co/Qwen/Qwen3.5-9B)
	[![HuggingFace](https://img.shields.io/badge/HuggingFace-GGUF-yellow)](https://huggingface.co/danielcherubini/Qwen3.5-DeltaCoder-9B-GGUF)
	[![LoRA](https://img.shields.io/badge/HuggingFace-LoRA-orange)](https://huggingface.co/danielcherubini/Qwen3.5-DeltaCoder-9B)

	Small language models can reason about code, but they struggle to call tools reliably. DeltaCoder takes a strong reasoning base and teaches it to produce correctly-formatted JSON tool calls — the kind that coding agents like [OpenCode](https://github.com/opencode-ai/opencode), [Pi](https://github.com/badlogic/pi-mono), and [Cline](https://github.com/cline/cline) depend on.

	v1.1-DPO adds Direct Preference Optimization to further improve code correctness — the model now self-corrects its own bugs rather than submitting wrong answers.

	## Downloads

	\| Format \| Link \| Size \|
	\|--------\|------\|------\|
	\| GGUF Q4_K_M (recommended) \| [HuggingFace](https://huggingface.co/danielcherubini/Qwen3.5-DeltaCoder-9B-GGUF) \| ~5.5 GB \|
	\| GGUF Q5_K_M \| [HuggingFace](https://huggingface.co/danielcherubini/Qwen3.5-DeltaCoder-9B-GGUF) \| ~6.5 GB \|
	\| GGUF BF16 \| [HuggingFace](https://huggingface.co/danielcherubini/Qwen3.5-DeltaCoder-9B-GGUF) \| ~17.9 GB \|
	\| DPO LoRA adapter \| [HuggingFace](https://huggingface.co/danielcherubini/Qwen3.5-DeltaCoder-9B) \| ~700 MB \|

	## The Problem

	[Jackrong's Qwen3.5-9B reasoning distill](https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2) scores 53.7% on HumanEval — best-in-class at 9B. But when used as a coding agent, it frequently produces malformed JSON tool calls:

	```
	tool=edit, error=JSON Parse error: Property name must be a string literal
	tool=bash, error=JSON Parse error: Expected '}'
	```

	DeltaCoder fixes this, and v1.1-DPO further improves code correctness through preference learning.

	## What's New in v1.1-DPO

	- Self-correcting behavior — detects and fixes its own bugs during agentic tasks
	- Improved code correctness — trained on 4,519 preference pairs from AceCode-V2-122K
	- Two-stage merge — v1 SFT tool-calling improvements + DPO code quality improvements combined
	- 13 GGUF quants — from Q2_K to BF16, covering all VRAM configurations

	## Training Details

	### v1 — SFT (Tool-Call Reliability)

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Base model \| Qwen3.5-9B (hybrid GDN architecture) \|
	\| Method \| LoRA (r=64, alpha=32) \|
	\| Dataset \| [CoderForge-Preview](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview) `filtered_reward1` (50K subset) \|
	\| Sequence length \| 4096 \|
	\| Effective batch size \| 16 \|
	\| Learning rate \| 1e-4 (cosine) \|
	\| Epochs \| 1 \|
	\| Hardware \| NVIDIA H200 140GB (Vast.ai) \|
	\| Training time \| ~10 hours \|
	\| Final loss \| ~0.94 \|

	### v1.1 — DPO (Code Correctness)

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Method \| DPO (Direct Preference Optimization) \|
	\| Dataset \| [AceCode-V2-122K](https://huggingface.co/datasets/TIGER-Lab/AceCode-V2-122K) — 4,519 preference pairs \|
	\| Pair generation \| 10K problems × 8 samples, keep if ≥1 pass AND ≥1 fail (45% keep rate) \|
	\| Beta \| 0.1 \|
	\| Loss type \| sigmoid \|
	\| Learning rate \| 5e-6 (cosine) \|
	\| Effective batch size \| 16 \|
	\| Hardware \| NVIDIA H100 80GB (Vast.ai) \|
	\| Training time \| ~3.7 hours \|
	\| Final loss \| 0.538 \|
	\| Rewards/margins (final) \| ~1.0 \|
	\| Rewards/accuracies (final) \| ~80% \|

	### LoRA Target Modules

	All major weight matrices adapted across the hybrid architecture:

	- Full Attention (8/32 layers): `q_proj`, `k_proj`, `v_proj`, `o_proj`
	- Gated Delta Net (24/32 layers): `in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`
	- MLP (all 32 layers): `gate_proj`, `up_proj`, `down_proj`

	## Usage

	### Ollama

	```bash
	ollama create deltacoder -f Modelfile
	```

	### llama.cpp / ik_llama.cpp

	```bash
	./llama-server -m DeltaCoder-9B-v1.1-DPO-Q5_K_M.gguf -ngl 999 -c 131072 -ctk f16 -ctv q4_0 -fa 1 --jinja
	```

	### With PEFT (Python)

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel
	import torch

	base = AutoModelForCausalLM.from_pretrained(
	"Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2",
	torch_dtype=torch.bfloat16,
	trust_remote_code=True,
	)
	model = PeftModel.from_pretrained(base, "danielcherubini/Qwen3.5-DeltaCoder-9B")
	tokenizer = AutoTokenizer.from_pretrained("danielcherubini/Qwen3.5-DeltaCoder-9B")
	```

	## Benchmarks

	\| Model \| HumanEval \| HumanEval+ \| Terminal-Bench Easy \|
	\|-------\|-----------\|------------\|-------------------\|
	\| Jackrong Qwen3.5-9B-v2 (base) \| 53.7% \| — \| — \|
	\| DeltaCoder-9B v1 (temp=0.6) \| 50.6% \| 49.4% \| 2/4 (50%) \|
	\| DeltaCoder-9B v1.1-DPO (temp=0.6) \| TBD \| TBD \| 2/4 (50%)* \|

	*v1.1-DPO timed out on 2 tasks that v1 answered incorrectly — behavioral improvement confirmed, re-evaluating with extended timeout.

	## Recommended Sampling Settings

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| temperature \| 0.6 \|
	\| top_k \| 20 \|
	\| top_p \| 0.95 \|
	\| min_p \| 0.0 \|
	\| presence_penalty \| 0.0 \|
	\| repeat_penalty \| 1.0 \|

	> [!WARNING]
	> Do not use temperature below 0.5 — low temperatures cause deterministic looping in multi-turn agentic use.

	### KV Cache Quantization

	\| Context Length \| KV Cache \| VRAM (Q4_K_M) \| Generation Speed \|
	\|---------------\|----------\|---------------\|-----------------\|
	\| 102,400 \| f16/q4_0 \| ~8.5 GB \| ~111 tok/s \|
	\| 131,072 \| f16/q4_0 \| ~9.1 GB \| ~110 tok/s \|

	## Key Findings

	> [!NOTE]
	> Qwen3.5 is a VLM — Unsloth treats it as a vision model. For text-only DPO training, use standard HuggingFace + PEFT + TRL directly (no Unsloth DPOTrainer).

	> [!WARNING]
	> Do not use `flash_attention_2` with sample packing on Qwen3.5 — training loss goes to 0. Use `attn_implementation="eager"` instead.

	- Qwen3.5 uses Gated Delta Networks — include `in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj` in LoRA target modules or 75% of attention layers are untrained
	- DPO pairs generated on-policy using `Qwen/Qwen3.5-9B` base with vLLM async inference (32 concurrent requests)
	- Keep rate of 45.2% from 10K AceCode problems (4,519 pairs used for training)

	## Project Structure

	```
	scripts/
	train_unsloth.py # v1 SFT training
	train_dpo.py # v1.1 DPO training (HF + PEFT + TRL)
	generate_dpo_pairs.py # Async on-policy pair generation
	merge_and_export_dpo.py # Two-stage merge + GGUF export
	```

	## Status

	- [x] v1 SFT fine-tune (CoderForge, H200, ~10hrs)
	- [x] GGUF export (all quants Q2_K → BF16)
	- [x] HumanEval benchmarking (50.6% / 49.4%)
	- [x] Terminal-Bench evaluation (2/4 easy tasks)
	- [x] DPO pair generation (4,519 pairs from AceCode-V2-122K)
	- [x] v1.1-DPO training (H100, ~3.7hrs)
	- [x] v1.1-DPO GGUF export + HuggingFace release
	- [ ] v1.1-DPO HumanEval benchmarking
	- [ ] v1.1-DPO Terminal-Bench extended timeout evaluation

	## Acknowledgements

	- [Unsloth](https://unsloth.ai) for Qwen3.5 SFT training support
	- [Together AI](https://together.ai) for the CoderForge dataset
	- [TIGER Lab](https://huggingface.co/TIGER-Lab) for AceCode-V2-122K
	- [Jackrong](https://huggingface.co/Jackrong) for the reasoning distillation
	- [Qwen](https://huggingface.co/Qwen) for the base model