🪐 Qwopus3.5-4B-Coder

Coder SFT Release

Compact Agentic Coding Model Fine-Tuned for Debugging, Tool Use, and Structured Reasoning

🧬 Trace Inversion 🧠 4B Dense Model ⚡ MTP n=2 Tested 🛠️ Agentic Coding 🏆 benchlocal Evaluated

💡 What is Qwopus3.5-4B-Coder?

🪐 Qwopus3.5-4B-Coder is a compact coding and agent model built on the Qwen3.5 4B family. It is optimized for local execution, code debugging, structured tool-use behavior, and reasoning-heavy developer workflows. The training recipe follows the Qwopus Coder line: Trace Inversion for learnable reasoning traces, high-quality agent trajectories for tool behavior, and curriculum SFT to preserve formatting stability under longer contexts.

🧩 Structured Debugging Targets bug localization, minimal patch reasoning, and environment-verified code repair behavior.
🪶 Agent Trace Alignment Learns from tool-call trajectories that include real feedback loops, not only isolated prompt-response pairs.
🔁 MTP-Ready Evaluation Benchmarked with Multi-Token Prediction enabled at n=2 against the Qwen3.5-4B-MTP reference.
⚡ Local-First Design To empower resource-constrained users, the 4B size serves as the sweet spot for running agentic tasks on 16GB laptops, making it an excellent choice for handling simple repetitive tasks and service monitoring operations.

💡 1. Base Model, MTP Setup, and Training Stack

🧠 1.1 Base Model Specifications (Qwopus3.5-4B-Coder)

Qwopus3.5-4B-Coder inherits the compact Qwen3.5 4B dense architecture and adapts it toward agentic coding, debugging, tool routing, and long-form reasoning. The 4B size is intended to keep the model deployable on local machines while retaining enough reasoning capacity for developer workflows.

Attribute Specifications & Details
🧠 Architecture Dense Transformer / 4B-class Qwen3.5 family
🎯 Primary Focus Agentic coding, code debugging, tool-use stability, instruction following
🧬 Training Recipe Trace Inversion + high-quality agent trajectories + curriculum SFT
⚡ Tested MTP Variant Qwopus3.5-4B-Coder-MTP, configured with MTP n=2
💾 Reference Model Qwen3.5-4B-MTP, configured with MTP n=2
🔬 Additional 4B Comparison Similar Public 4B Claude-Distilled Variant
🧪 1.2 Hardware Cooperation & Joint Collaboration
This project continues the Qwopus collaboration path with engineer Kyle Hessling, whose hardware support and evaluation feedback helped make the local model testing workflow practical and reproducible.
👉 Follow hardware and model training updates on X / Twitter: @KyleHessling1
🦥 1.3 Fine-Tuning Framework (Unsloth)
Training and adaptation use the Qwopus SFT workflow with Unsloth acceleration where applicable, focusing on efficient supervised fine-tuning, stable LoRA-style adaptation, and clean model-card reproducibility.
👉 Unsloth documentation: unsloth.ai/docs

Community Release Notice: Qwopus3.5-4B-Coder is an experimental community release intended for research, local coding experiments, and agent workflow exploration. It has not undergone full safety evaluation or broad general-domain benchmarking.


📖 2. Background and Motivation

⚠️ 2.1 Why a 4B Coder Model?
A 4B-class model is small enough to run locally with practical latency, but large enough to benefit from structured reasoning and agent trace training. The goal of this release is not to maximize raw benchmark size, but to create a compact coding assistant that remains stable under debugging, instruction-following, and tool-use pressure.
🧬 2.2 Trace Inversion and Agent Behavior
Commercial and frontier models often expose only compressed reasoning summaries. Qwopus-style training uses Trace Inversion to reconstruct these compressed "reasoning bubbles" into fuller learnable reasoning traces. For coding, this is paired with agent trajectories that include tool definitions, tool calls, and real feedback, teaching the model to reason through interactive work rather than only produce static answers.

📊 3. benchlocal Evaluation and Baseline Comparison

📊 benchlocal Agent & Coding Benchmark

Local MTP comparison across official Qwen, Claude-Opus reasoning distill, and 9B reference rows for debugging, agent workflow, tool routing, and instruction following.

🏆 Suite Average 82.0% vs. 74.0% baseline (+8.0 pp)
🐛 BugFind-15 71 / 100 +19 Delta over baseline
🧠 HermesAgent-20 64 / 100 +3 Delta over baseline
🛠️ ToolCall-15 100 / 100 Perfect tool routing
Test configuration: Models were evaluated through benchlocal using LM Studio on the same local Apple Silicon / MLX / GGUF-style setup as the 9B Coder evaluation. Tested 4B models: Qwopus3.5-4B-Coder-MTP, Qwen3.5-4B-MTP, and Similar Public 4B Claude-Distilled Variant. MTP was set to n=2 for the MTP rows, sampling used temperature=1.0 and top_p=0.95. Each scenario allowed up to three answer attempts per model; a scenario was counted as correct if any attempt passed. Deep-blue rows mark Qwopus3.5-9B-Coder-GGUF reference scores. Other 9B comparison models use neutral rows, and the official Qwen/Qwen3.5-9B baseline is kept on a white background.
🐛 3.1 BugFind-15: Code Debugging and Bug Localization
Model Score Delta Dimension Scores Readout
Qwopus3.5-4B-Coder-MTP 71 +19 A: 53 / B: 67 / C: 73 / D: 77 / E: 83 Clear debugging lead.
Qwen3.5-4B-MTP 52 baseline A: 43 / B: 53 / C: 67 / D: 32 / E: 60 Lower consistency on bug-fix scenarios.
Similar Public 4B Claude-Distilled Variant 45 -26 A: 36 / B: 40 / C: 53 / D: 32 / E: 67 Lower debugging consistency in this run.
Qwopus3.5-9B-Coder-GGUF 79 9B reference A: 67 / B: 87 / C: 100 / D: 77 / E: 43 Leading 9B-class row in this pack.
Qwen3.5-9B-DeepSeek-V4-Flash 75 9B comparison A: 67 / B: 100 / C: 67 / D: 57 / E: 80 Public comparison row from 9B card.
Other Public 9B Agent Model 58 9B comparison A: 29 / B: 87 / C: 73 / D: 20 / E: 67 Public comparison row from 9B card.
🧠 3.2 HermesAgent-20: Memory, Workspace Orchestration, and Agent Workflow
Model Score Delta Visible Dimension Scores Readout
Qwopus3.5-4B-Coder-MTP 64 +3 memory_recall 71 / workspace_orchestration 70 / skills_procedural_memory 50 / scheduling_delivery 75 Better memory and workspace behavior.
Qwen3.5-4B-MTP 61 baseline memory_recall 41 / workspace_orchestration 45 / skills_procedural_memory 100 / scheduling_delivery 68 Stronger visible procedural-memory slice.
Similar Public 4B Claude-Distilled Variant 57 -7 memory_recall 69 / workspace_orchestration 41 / skills_procedural_memory 55 / scheduling_delivery 70 / delegation_recovery_boundaries 51 Competitive memory recall, lower workspace orchestration.
Qwopus3.5-9B-Coder-GGUF 85 9B reference 84 / 93 / 88 / 75 / 84 Leading 9B-class row in this pack.
Qwen/Qwen3.5-9B 71 official 9B baseline 75 / 58 / 100 / 53 / 69 Official Qwen reference row.
Other Public 9B Agent Model 68 9B comparison 71 / 83 / 43 / 61 / 80 Public comparison row from 9B card.
DJLougen/Harmonic-Hermes-9B 47 9B comparison 60 / 45 / 23 / 69 / 38 Public comparison row from 9B card.
🛠️ 3.3 ToolCall-15: Tool Routing Stability
Model Score Delta Dimension Scores Readout
Qwopus3.5-4B-Coder-MTP 100 +10 A: 100 / B: 100 / C: 100 / D: 100 / E: 100 Perfect tool-routing run.
Qwen3.5-4B-MTP 90 baseline A: 100 / B: 100 / C: 100 / D: 83 / E: 67 Minor failures in later categories.
Similar Public 4B Claude-Distilled Variant 77 -23 A: 100 / B: 33 / C: 67 / D: 83 / E: 100 Strong A/E categories, weaker B/C tool-routing slices.
Qwopus3.5-9B-Coder-GGUF 100 9B reference A: 100 / B: 100 / C: 100 / D: 100 / E: 100 Matches the leading tool-call score.
Qwen/Qwen3.5-9B 100 official 9B baseline A: 100 / B: 100 / C: 100 / D: 100 / E: 100 Official Qwen reference row.
Other Public 9B Agent Model 93 9B comparison A: 100 / B: 100 / C: 100 / D: 67 / E: 100 Public comparison row from 9B card.
📄 3.4 InstructFollow-15: Formatting and Constraint Following
Model Score Delta Dimension Scores Readout
Qwopus3.5-4B-Coder-MTP 93 0 A: 100 / B: 100 / C: 100 / D: 65 / E: 100 Tie
Qwen3.5-4B-MTP 93 0 A: 100 / B: 100 / C: 100 / D: 65 / E: 100 Tie
Similar Public 4B Claude-Distilled Variant 60 -33 A: 65 / B: 35 / C: 100 / D: 60 / E: 39 Lower constraint-following reliability in this run.
Qwopus3.5-9B-Coder-GGUF 93 9B reference A: 100 / B: 100 / C: 100 / D: 67 / E: 100 Reported 9B Coder reference score.

🍎 All screenshots of the test interfaces have been uploaded to the image folder in the repository. Click the link below to view and verify:
🔗 View Test Screenshots

❤️ Kyle Hessling for his generous hardware and equipment support. You can follow him for more updates on X / Twitter: @KyleHessling1.


🗺️ 4. Training & Data Pipeline Overview

The training process fuses Trace Inversion data augmentation with a Three-Stage Curriculum Learning pipeline. The core engineering focuses on expanding context length gradually while training on reconstructed reasoning traces and real agent trajectories to keep the output format stable.

       [ 🗺️ Trace Inversion: Reconstructing Distillation Workflow ]

  A. Surrogate Model Training (Trace Inverter)
     Open-source Model (GLM-5.1 / DS-V4) ──► Complete Reasoning Chain ──► [ Qwen3-235B Compression ] ──► Reasoning Bubbles
                                              │                                   │
                                              └──────────► [ Training ] ◄─────────┘
                                                   (Base: Qwen3-4B-Instruct)
                                                   (Result: Trace-Inverter-4B)

  B. Inversion Phase: Reconstructing Claude-4.7-Max
     _______________________________________________________
    |                                                       |
    |  Claude-4.7-Max API ──► Compressed Bubbles + Answer   |
    |_______________________________________________________|
                      │
                      ▼
    [ 🧠 Trace-Inverter-4B (Logic Reconstructor) ] ──► Synthetic Deep Reasoning Trace (Learnable CoT)
                      │
                      ▼
    [ 🧩 Data Splicing ] ◄────────── (Original Prompt + Response)
    (Embed reconstructed CoT in <think> tags, splicing with original prompt/response)
                      │
                      ▼
             (Result: claude-opus-4.6/4.7 inverted sets)

  C. Final Coder SFT Curriculum Pipeline
     ___________________________________________
    |                                           |
    |       Base Model (Qwen3.5-4B family)      |
    |___________________________________________|
                      │
                      ▼
    [ 📦 Phase 1: Format Inception ] ──► [ 🛠️ Phase 2: Agent/Coding Expansion ] ──► [ 🚀 Phase 3: Long-Context SFT ]
      ( < 4096 tokens )                     ( 4096 - 8192 tokens )                     ( 8192 - 32K tokens )
      (Stable <think> format)               (Tool traces + coding tasks)               (Long / multi-turn / replay)
                      │                                                                            │
                      └─────────────────────────────┬──────────────────────────────────────────────┘
                                                    ▼
                                   ________________________________________
                                  |                                        |
                                  |   🌟 Final Model: Qwopus3.5-4B-Coder   |
                                  |________________________________________|

🎯 5. Three-Stage Curriculum Learning

To steadily scale reasoning quality under local and long-context inference, Qwopus3.5-4B-Coder uses a curriculum-style data mixture. The model is first stabilized on short, clean reasoning samples, then exposed to complex coding and agent traces, and finally reinforced with longer contexts plus replay data. This section also describes the fine-tuning context-length distribution; runtime long-context extension guidance is covered in Section 6.

Curriculum Stage Focus & Sample Characteristics Strategy Details
📦 Stage 1: Format Inception • Limit context within 4,096 tokens
• Emphasize stable reasoning templates
Focuses on short-to-medium length, cleanly formatted reasoning samples. The primary goal is to establish reliable structured reasoning output, including stable <think> boundaries, before exposing the model to longer chains.
🛠️ Stage 2: Complexity Expansion • Extend length to 4,096 - 8,192 tokens
• Introduce higher-difficulty coding and agent samples
Gradually increases the ratio of complex reasoning chains, code debugging tasks, and multi-turn tool traces. The model learns to connect reasoning, action selection, and environment feedback.
🚀 Stage 3: Long-Context SFT • Progressively scale samples up to 32K tokens
• Use short-sample replay
Pushes the model toward long-context and multi-turn reasoning while replaying high-quality short samples to reduce instruction-following drift. The 32K figure describes the fine-tuning sequence/data mixture target, not a hard architectural limit.

📄 6. Context Length and Long-Context Usage

📄 6.1 Runtime Context Guidance

During fine-tuning, this model was trained with a maximum sequence length of 32K tokens. The training data mixture was also constructed around samples up to 32K tokens, so the context-length distribution in this model card reflects the fine-tuning data distribution rather than a hard architectural limit.

The model still inherits the native long-context capability of the Qwen3.5-family base model. Longer context windows such as 128K or 256K may be available in compatible inference runtimes, depending on backend support and configuration.

For practical long-context inference beyond 32K, especially when using llama.cpp / GGUF, it is recommended to enable RoPE/YaRN scaling instead of only increasing n_ctx or --ctx-size. Directly setting a larger context window without RoPE scaling may work in some setups, but it can be less stable and may not deliver the expected long-context behavior.

This follows Qwen community guidance for GGUF long-context usage. In a Qwen GGUF discussion, a Qwen maintainer noted that "128K context length needs YaRN" and later clarified that supported scaling should be explicitly enabled rather than assumed to be on by default. Reference: Qwen/Qwen2.5-72B-Instruct-GGUF discussion #2.

Community feedback also suggests that RoPE/YaRN scaling can improve long-context stability for this model family. One user reported that, on HermesAgent-20, Qwopus3.6-35B-A3B-v1 performed better when extending from 32K to 128K via RoPE scaling than when directly setting a 128K context window without scaling, with scores of 83 vs. 72 in their setup. This result may vary depending on backend, quantization type, KV cache settings, hardware, and benchmark configuration, but it is consistent with the recommendation to use RoPE/YaRN scaling for contexts beyond 32K.

Example llama.cpp configuration for extending from 32K to 128K:

./llama-server \
  -m model.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768

For 256K context, users may need to adjust the scaling factor and validate the result in their own workload:

./llama-server \
  -m model.gguf \
  --ctx-size 262144 \
  --rope-scaling yarn \
  --rope-scale 8 \
  --yarn-orig-ctx 32768

Please note that long-context behavior may vary depending on inference backend, quantization type, KV cache settings, available memory, and task type. For best results, benchmark the target workload when using contexts beyond 32K.


🎯 7. Recommended Use Cases and Limitations

Good Fits
Code debugging, small repository tasks, tool-call routing, local coding agents, structured instruction following, development workflow assistants, and reasoning traces where concise local latency matters.
Known Limits
As a compact model, it can still miss broad world knowledge, complex repository-wide dependencies, or highly specialized domain requirements. Tool-call behavior depends strongly on prompt format and tool schema consistency.

Deployment note: The model may emit reasoning inside <think> and </think> tags. Front-end applications and agent frameworks should parse or hide these sections where appropriate.


📚 8. Resources & Guides

👉 GitHub Repository: Jackrong-llm-finetuning-guide Access the repository to dive into the codebase and reproduce our results locally or on Google Colab.

👉 Qwen MTP GGUF Processing Workflow A custom splitting and merging methodology designed specifically for Qwen series Multi-Token Prediction (MTP) heads.

👉 benchlocal Evaluation Framework The evaluation framework used to run the local agentic and coding benchmarks.


🙏 9. Acknowledgements

Special thanks to:

  • The Qwen team for providing the powerful Qwen3.5 base model.
  • Unsloth for providing the highly efficient fine-tuning framework.
  • Open-source datasets and community contributors.
  • Kyle Hessling for the close collaboration on hardware and evaluation support.

📖 10. Citation

@misc{jackrong_qwopus35_4b_coder,
  title        = {Qwopus3.5-4B-Coder},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwopus3.5-4B-Coder}}
}
Downloads last month
-
Safetensors
Model size
5B params
Tensor type
BF16
·
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Jackrong/Qwopus3.5-4B-Coder

Finetuned
Qwen/Qwen3.5-4B
Adapter
(226)
this model

Datasets used to train Jackrong/Qwopus3.5-4B-Coder

Collection including Jackrong/Qwopus3.5-4B-Coder