Instructions to use Jackrong/Qwopus3.5-4B-Coder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Jackrong/Qwopus3.5-4B-Coder with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The chat messages below mix image and text parts and are passed as `text=`,
# which matches the "image-text-to-text" pipeline rather than "text-generation".
pipe = pipeline("image-text-to-text", model="Jackrong/Qwopus3.5-4B-Coder")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Jackrong/Qwopus3.5-4B-Coder")
model = AutoModelForMultimodalLM.from_pretrained("Jackrong/Qwopus3.5-4B-Coder")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use Jackrong/Qwopus3.5-4B-Coder with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Jackrong/Qwopus3.5-4B-Coder"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.5-4B-Coder",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
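
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM, and the response shape assumed here is the usual OpenAI-style `choices[0].message.content`.

```python
import json
import urllib.request

MODEL_ID = "Jackrong/Qwopus3.5-4B-Coder"


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = MODEL_ID) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to a running server and return the assistant text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(chat("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should target a server started on port 30000 instead, such as the SGLang examples in this section.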

Use Docker

docker model run hf.co/Jackrong/Qwopus3.5-4B-Coder

SGLang

How to use Jackrong/Qwopus3.5-4B-Coder with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Jackrong/Qwopus3.5-4B-Coder" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.5-4B-Coder",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Jackrong/Qwopus3.5-4B-Coder" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.5-4B-Coder",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use Jackrong/Qwopus3.5-4B-Coder with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.5-4B-Coder to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.5-4B-Coder to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Jackrong/Qwopus3.5-4B-Coder to start chatting

Load model with FastModel

# Install Unsloth first (shell command, run outside Python):
#   pip install unsloth
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="Jackrong/Qwopus3.5-4B-Coder",
    max_seq_length=2048,
)

Docker Model Runner
How to use Jackrong/Qwopus3.5-4B-Coder with Docker Model Runner:
```
docker model run hf.co/Jackrong/Qwopus3.5-4B-Coder
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

🪐 Qwopus3.5-4B-Coder

Coder SFT Release

Compact Agentic Coding Model Fine-Tuned for Debugging, Tool Use, and Structured Reasoning

🧬 Trace Inversion · 🧠 4B Dense Model · ⚡ MTP n=2 Tested · 🛠️ Agentic Coding · 🏆 benchlocal Evaluated

💡 What is Qwopus3.5-4B-Coder?

🪐 Qwopus3.5-4B-Coder is a compact coding and agent model built on the Qwen3.5 4B family. It is optimized for local execution, code debugging, structured tool-use behavior, and reasoning-heavy developer workflows. The training recipe follows the Qwopus Coder line: Trace Inversion for learnable reasoning traces, high-quality agent trajectories for tool behavior, and curriculum SFT to preserve formatting stability under longer contexts.

🧩 Structured Debugging: Targets bug localization, minimal patch reasoning, and environment-verified code repair behavior.

🪶 Agent Trace Alignment: Learns from tool-call trajectories that include real feedback loops, not only isolated prompt-response pairs.

🔁 MTP-Ready Evaluation: Benchmarked with Multi-Token Prediction enabled at n=2 against the Qwen3.5-4B-MTP reference.

⚡ Local-First Design: The 4B size targets resource-constrained users. It is intended as a sweet spot for running agentic tasks on 16GB laptops, and it suits simple repetitive tasks and service-monitoring operations.

💡 1. Base Model, MTP Setup, and Training Stack

🧠 1.1 Base Model Specifications (Qwopus3.5-4B-Coder)

Qwopus3.5-4B-Coder inherits the compact Qwen3.5 4B dense architecture and adapts it toward agentic coding, debugging, tool routing, and long-form reasoning. The 4B size is intended to keep the model deployable on local machines while retaining enough reasoning capacity for developer workflows.

Attribute	Specifications & Details
🧠 Architecture	Dense Transformer / 4B-class Qwen3.5 family
🎯 Primary Focus	Agentic coding, code debugging, tool-use stability, instruction following
🧬 Training Recipe	Trace Inversion + high-quality agent trajectories + curriculum SFT
⚡ Tested MTP Variant	Qwopus3.5-4B-Coder-MTP, configured with MTP `n=2`
💾 Reference Model	Qwen3.5-4B-MTP, configured with MTP `n=2`
🔬 Additional 4B Comparison	Similar Public 4B Claude-Distilled Variant

🧪 1.2 Hardware Cooperation & Joint Collaboration

This project continues the Qwopus collaboration path with engineer Kyle Hessling, whose hardware support and evaluation feedback helped make the local model testing workflow practical and reproducible.

👉 Follow hardware and model training updates on X / Twitter: @KyleHessling1

🦥 1.3 Fine-Tuning Framework (Unsloth)

Training and adaptation use the Qwopus SFT workflow with Unsloth acceleration where applicable, focusing on efficient supervised fine-tuning, stable LoRA-style adaptation, and clean model-card reproducibility.

👉 Unsloth documentation: unsloth.ai/docs

Community Release Notice: Qwopus3.5-4B-Coder is an experimental community release intended for research, local coding experiments, and agent workflow exploration. It has not undergone full safety evaluation or broad general-domain benchmarking.

📖 2. Background and Motivation

⚠️ 2.1 Why a 4B Coder Model?

A 4B-class model is small enough to run locally with practical latency, but large enough to benefit from structured reasoning and agent trace training. The goal of this release is not to maximize raw benchmark size, but to create a compact coding assistant that remains stable under debugging, instruction-following, and tool-use pressure.

🧬 2.2 Trace Inversion and Agent Behavior

Commercial and frontier models often expose only compressed reasoning summaries. Qwopus-style training uses Trace Inversion to reconstruct these compressed "reasoning bubbles" into fuller learnable reasoning traces. For coding, this is paired with agent trajectories that include tool definitions, tool calls, and real feedback, teaching the model to reason through interactive work rather than only produce static answers.

📊 3. benchlocal Evaluation and Baseline Comparison

📊 benchlocal Agent & Coding Benchmark

Local MTP comparison across official Qwen, Claude-Opus reasoning distill, and 9B reference rows for debugging, agent workflow, tool routing, and instruction following.

🏆 Suite Average: 82.0% vs. 74.0% baseline (+8.0 pp)

🐛 BugFind-15: 71 / 100 (+19 over baseline)

🧠 HermesAgent-20: 64 / 100 (+3 over baseline)

🛠️ ToolCall-15: 100 / 100 (perfect tool routing)

Test configuration: Models were evaluated through benchlocal using LM Studio on the same local Apple Silicon / MLX / GGUF-style setup as the 9B Coder evaluation. Tested 4B models: Qwopus3.5-4B-Coder-MTP, Qwen3.5-4B-MTP, and Similar Public 4B Claude-Distilled Variant. MTP was set to n=2 for the MTP rows; sampling used temperature=1.0 and top_p=0.95. Each scenario allowed up to three answer attempts per model, and a scenario was counted as correct if any attempt passed. In the tables below, "baseline" refers to Qwen3.5-4B-MTP.

The 9B rows are reference rows. In the rendered model card, deep-blue rows mark Qwopus3.5-9B-Coder-GGUF reference scores, other 9B comparison models use neutral rows, and the official Qwen/Qwen3.5-9B baseline is kept on a white background.
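
The "up to three attempts, any pass counts" rule described above can be sketched as follows. This is only an illustration of that rule: the attempt logs are hypothetical, and benchlocal's actual scoring may weight scenarios or dimensions differently.

```python
from typing import Dict, List

MAX_ATTEMPTS = 3  # attempt budget stated in the test configuration


def scenario_passed(attempts: List[bool]) -> bool:
    """A scenario counts as correct if any of the first MAX_ATTEMPTS attempts passed."""
    return any(attempts[:MAX_ATTEMPTS])


def pass_rate(results: Dict[str, List[bool]]) -> float:
    """Percentage of scenarios with at least one passing attempt."""
    if not results:
        return 0.0
    passed = sum(scenario_passed(attempts) for attempts in results.values())
    return 100.0 * passed / len(results)


# Hypothetical attempt logs, not real benchlocal output.
example = {
    "scenario_1": [False, True],                # passes on the second attempt
    "scenario_2": [False, False, False],        # fails all three attempts
    "scenario_3": [True],                       # passes on the first attempt
    "scenario_4": [False, False, False, True],  # fourth attempt is ignored
}
print(f"{pass_rate(example):.1f}%")  # prints 50.0%
```

Because any passing attempt counts, scores under this rule are closer to a best-of-three measure than to single-shot accuracy.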

🐛 3.1 BugFind-15: Code Debugging and Bug Localization

Model	Score	Delta	Dimension Scores	Readout
Qwopus3.5-4B-Coder-MTP	71	+19	A: 53 / B: 67 / C: 73 / D: 77 / E: 83	Clear debugging lead.
Qwen3.5-4B-MTP	52	baseline	A: 43 / B: 53 / C: 67 / D: 32 / E: 60	Lower consistency on bug-fix scenarios.
Similar Public 4B Claude-Distilled Variant	45	-7 (-26 vs. Qwopus3.5-4B-Coder-MTP)	A: 36 / B: 40 / C: 53 / D: 32 / E: 67	Lower debugging consistency in this run.
Qwopus3.5-9B-Coder-GGUF	79	9B reference	A: 67 / B: 87 / C: 100 / D: 77 / E: 43	Leading 9B-class row in this pack.
Qwen3.5-9B-DeepSeek-V4-Flash	75	9B comparison	A: 67 / B: 100 / C: 67 / D: 57 / E: 80	Public comparison row from 9B card.
Other Public 9B Agent Model	58	9B comparison	A: 29 / B: 87 / C: 73 / D: 20 / E: 67	Public comparison row from 9B card.

🧠 3.2 HermesAgent-20: Memory, Workspace Orchestration, and Agent Workflow

Model	Score	Delta	Visible Dimension Scores	Readout
Qwopus3.5-4B-Coder-MTP	64	+3	memory_recall 71 / workspace_orchestration 70 / skills_procedural_memory 50 / scheduling_delivery 75	Better memory and workspace behavior.
Qwen3.5-4B-MTP	61	baseline	memory_recall 41 / workspace_orchestration 45 / skills_procedural_memory 100 / scheduling_delivery 68	Stronger visible procedural-memory slice.
Similar Public 4B Claude-Distilled Variant	57	-4 (-7 vs. Qwopus3.5-4B-Coder-MTP)	memory_recall 69 / workspace_orchestration 41 / skills_procedural_memory 55 / scheduling_delivery 70 / delegation_recovery_boundaries 51	Competitive memory recall, lower workspace orchestration.
Qwopus3.5-9B-Coder-GGUF	85	9B reference	84 / 93 / 88 / 75 / 84	Leading 9B-class row in this pack.
Qwen/Qwen3.5-9B	71	official 9B baseline	75 / 58 / 100 / 53 / 69	Official Qwen reference row.
Other Public 9B Agent Model	68	9B comparison	71 / 83 / 43 / 61 / 80	Public comparison row from 9B card.
DJLougen/Harmonic-Hermes-9B	47	9B comparison	60 / 45 / 23 / 69 / 38	Public comparison row from 9B card.

🛠️ 3.3 ToolCall-15: Tool Routing Stability

Model	Score	Delta	Dimension Scores	Readout
Qwopus3.5-4B-Coder-MTP	100	+10	A: 100 / B: 100 / C: 100 / D: 100 / E: 100	Perfect tool-routing run.
Qwen3.5-4B-MTP	90	baseline	A: 100 / B: 100 / C: 100 / D: 83 / E: 67	Minor failures in later categories.
Similar Public 4B Claude-Distilled Variant	77	-13 (-23 vs. Qwopus3.5-4B-Coder-MTP)	A: 100 / B: 33 / C: 67 / D: 83 / E: 100	Strong A/E categories, weaker B/C tool-routing slices.
Qwopus3.5-9B-Coder-GGUF	100	9B reference	A: 100 / B: 100 / C: 100 / D: 100 / E: 100	Matches the leading tool-call score.
Qwen/Qwen3.5-9B	100	official 9B baseline	A: 100 / B: 100 / C: 100 / D: 100 / E: 100	Official Qwen reference row.
Other Public 9B Agent Model	93	9B comparison	A: 100 / B: 100 / C: 100 / D: 67 / E: 100	Public comparison row from 9B card.

📄 3.4 InstructFollow-15: Formatting and Constraint Following

Model	Score	Delta	Dimension Scores	Readout
Qwopus3.5-4B-Coder-MTP	93	0	A: 100 / B: 100 / C: 100 / D: 65 / E: 100	Ties the baseline.
Qwen3.5-4B-MTP	93	baseline	A: 100 / B: 100 / C: 100 / D: 65 / E: 100	Ties Qwopus3.5-4B-Coder-MTP.
Similar Public 4B Claude-Distilled Variant	60	-33	A: 65 / B: 35 / C: 100 / D: 60 / E: 39	Lower constraint-following reliability in this run.
Qwopus3.5-9B-Coder-GGUF	93	9B reference	A: 100 / B: 100 / C: 100 / D: 67 / E: 100	Reported 9B Coder reference score.
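
The suite averages quoted at the top of this section can be reproduced from the four per-suite scores in the tables above. The sketch below assumes the suite average is an unweighted mean of the four suite scores, which matches the reported 82.0% and 74.0%.

```python
# Per-suite scores copied from the tables above.
scores = {
    "Qwopus3.5-4B-Coder-MTP": {
        "BugFind-15": 71,
        "HermesAgent-20": 64,
        "ToolCall-15": 100,
        "InstructFollow-15": 93,
    },
    "Qwen3.5-4B-MTP": {
        "BugFind-15": 52,
        "HermesAgent-20": 61,
        "ToolCall-15": 90,
        "InstructFollow-15": 93,
    },
}


def suite_average(per_suite: dict) -> float:
    """Unweighted mean over the suite scores (an assumption about the aggregation)."""
    return sum(per_suite.values()) / len(per_suite)


coder = suite_average(scores["Qwopus3.5-4B-Coder-MTP"])
baseline = suite_average(scores["Qwen3.5-4B-MTP"])
print(f"{coder:.1f} vs. {baseline:.1f} ({coder - baseline:+.1f} pp)")
# prints: 82.0 vs. 74.0 (+8.0 pp)
```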

🍎 All screenshots of the test interfaces have been uploaded to the image folder in the repository, where they can be viewed and verified.

❤️ Thanks to Kyle Hessling for his generous hardware and equipment support. You can follow him for more updates on X / Twitter: @KyleHessling1.

🗺️ 4. Training & Data Pipeline Overview

The training process fuses Trace Inversion data augmentation with a Three-Stage Curriculum Learning pipeline. The core engineering focuses on expanding context length gradually while training on reconstructed reasoning traces and real agent trajectories to keep the output format stable.

       [ 🗺️ Trace Inversion: Reconstructing Distillation Workflow ]

  A. Surrogate Model Training (Trace Inverter)
     Open-source Model (GLM-5.1 / DS-V4) ──► Complete Reasoning Chain ──► [ Qwen3-235B Compression ] ──► Reasoning Bubbles
                                              │                                   │
                                              └──────────► [ Training ] ◄─────────┘
                                                   (Base: Qwen3-4B-Instruct)
                                                   (Result: Trace-Inverter-4B)

  B. Inversion Phase: Reconstructing Claude-4.7-Max
     _______________________________________________________
    |                                                       |
    |  Claude-4.7-Max API ──► Compressed Bubbles + Answer   |
    |_______________________________________________________|
                      │
                      ▼
    [ 🧠 Trace-Inverter-4B (Logic Reconstructor) ] ──► Synthetic Deep Reasoning Trace (Learnable CoT)
                      │
                      ▼
    [ 🧩 Data Splicing ] ◄────────── (Original Prompt + Response)
    (Embed reconstructed CoT in <think> tags, splicing with original prompt/response)
                      │
                      ▼
             (Result: claude-opus-4.6/4.7 inverted sets)

  C. Final Coder SFT Curriculum Pipeline
     ___________________________________________
    |                                           |
    |       Base Model (Qwen3.5-4B family)      |
    |___________________________________________|
                      │
                      ▼
    [ 📦 Phase 1: Format Inception ] ──► [ 🛠️ Phase 2: Agent/Coding Expansion ] ──► [ 🚀 Phase 3: Long-Context SFT ]
      ( < 4096 tokens )                     ( 4096 - 8192 tokens )                     ( 8192 - 32K tokens )
      (Stable <think> format)               (Tool traces + coding tasks)               (Long / multi-turn / replay)
                      │                                                                            │
                      └─────────────────────────────┬──────────────────────────────────────────────┘
                                                    ▼
                                   ________________________________________
                                  |                                        |
                                  |   🌟 Final Model: Qwopus3.5-4B-Coder   |
                                  |________________________________________|
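
The data-splicing step in part B of the diagram (embedding the reconstructed CoT in `<think>` tags and joining it with the original prompt and response) might look roughly like the sketch below. The function name, the chat-message layout, and the exact whitespace around the tags are assumptions for illustration; the actual Qwopus data format is not specified in this card.

```python
from typing import Dict, List


def splice_sample(prompt: str, reconstructed_cot: str, response: str) -> List[Dict[str, str]]:
    """Combine a prompt, a reconstructed reasoning trace, and the original answer.

    The reasoning trace is wrapped in <think> tags and placed before the
    original response inside a single assistant turn (assumed layout).
    """
    assistant_text = (
        f"<think>\n{reconstructed_cot.strip()}\n</think>\n\n{response.strip()}"
    )
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": assistant_text},
    ]


# Hypothetical example values, not taken from the training data.
sample = splice_sample(
    prompt="Why does this loop skip the last item?",
    reconstructed_cot="The range stops at len(xs) - 1, so the final index is never visited.",
    response="Use range(len(xs)) instead of range(len(xs) - 1).",
)
print(sample[1]["content"])
```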

🎯 5. Three-Stage Curriculum Learning

To steadily scale reasoning quality under local and long-context inference, Qwopus3.5-4B-Coder uses a curriculum-style data mixture. The model is first stabilized on short, clean reasoning samples, then exposed to complex coding and agent traces, and finally reinforced with longer contexts plus replay data. This section also describes the fine-tuning context-length distribution; runtime long-context extension guidance is covered in Section 6.

Curriculum Stage	Focus & Sample Characteristics	Strategy Details
📦 Stage 1: Format Inception	• Limit context within 4,096 tokens • Emphasize stable reasoning templates	Focuses on short-to-medium length, cleanly formatted reasoning samples. The primary goal is to establish reliable structured reasoning output, including stable `<think>` boundaries, before exposing the model to longer chains.
🛠️ Stage 2: Complexity Expansion	• Extend length to 4,096 - 8,192 tokens • Introduce higher-difficulty coding and agent samples	Gradually increases the ratio of complex reasoning chains, code debugging tasks, and multi-turn tool traces. The model learns to connect reasoning, action selection, and environment feedback.
🚀 Stage 3: Long-Context SFT	• Progressively scale samples up to 32K tokens • Use short-sample replay	Pushes the model toward long-context and multi-turn reasoning while replaying high-quality short samples to reduce instruction-following drift. The 32K figure describes the fine-tuning sequence/data mixture target, not a hard architectural limit.
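
As a rough illustration of the length bands in the table above, the sketch below assigns a sample to a curriculum stage by token count. The boundary handling (inclusive upper limits, 32K taken as 32,768) is an assumption, and the short-sample replay used in Stage 3 is not modeled here.

```python
from typing import Optional

# (stage number, inclusive upper token limit), following the table above.
STAGE_LIMITS = [(1, 4096), (2, 8192), (3, 32768)]


def curriculum_stage(num_tokens: int) -> Optional[int]:
    """Return the curriculum stage for a sample length, or None if it exceeds 32K."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    for stage, limit in STAGE_LIMITS:
        if num_tokens <= limit:
            return stage
    return None  # longer than the fine-tuning target; would be dropped or truncated


for length in (1200, 4096, 6000, 20000, 40000):
    print(length, curriculum_stage(length))
```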

📄 6. Context Length and Long-Context Usage

📄 6.1 Runtime Context Guidance

During fine-tuning, this model was trained with a maximum sequence length of 32K tokens. The training data mixture was also constructed around samples up to 32K tokens, so the context-length distribution in this model card reflects the fine-tuning data distribution rather than a hard architectural limit.

The model still inherits the native long-context capability of the Qwen3.5-family base model. Longer context windows such as 128K or 256K may be available in compatible inference runtimes, depending on backend support and configuration.

For practical long-context inference beyond 32K, especially when using llama.cpp / GGUF, it is recommended to enable RoPE/YaRN scaling instead of only increasing n_ctx or --ctx-size. Directly setting a larger context window without RoPE scaling may work in some setups, but it can be less stable and may not deliver the expected long-context behavior.

This follows Qwen community guidance for GGUF long-context usage. In a Qwen GGUF discussion, a Qwen maintainer noted that "128K context length needs YaRN" and later clarified that supported scaling should be explicitly enabled rather than assumed to be on by default. Reference: Qwen/Qwen2.5-72B-Instruct-GGUF discussion #2.

Community feedback also suggests that RoPE/YaRN scaling can improve long-context stability for this model family. One user reported that, on HermesAgent-20, Qwopus3.6-35B-A3B-v1 performed better when extending from 32K to 128K via RoPE scaling than when directly setting a 128K context window without scaling, with scores of 83 vs. 72 in their setup. This result may vary depending on backend, quantization type, KV cache settings, hardware, and benchmark configuration, but it is consistent with the recommendation to use RoPE/YaRN scaling for contexts beyond 32K.

Example llama.cpp configuration for extending from 32K to 128K:

./llama-server \
  -m model.gguf \
  --ctx-size 131072 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 32768

For 256K context, users may need to adjust the scaling factor and validate the result in their own workload:

./llama-server \
  -m model.gguf \
  --ctx-size 262144 \
  --rope-scaling yarn \
  --rope-scale 8 \
  --yarn-orig-ctx 32768

Please note that long-context behavior may vary depending on inference backend, quantization type, KV cache settings, available memory, and task type. For best results, benchmark the target workload when using contexts beyond 32K.
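
The scaling factors in the two llama.cpp examples above follow from dividing the target context by the original context (131072 / 32768 = 4 and 262144 / 32768 = 8). A small sketch that derives the same flags; the helper name `yarn_args` is illustrative, and the 32,768 default mirrors the `--yarn-orig-ctx` value used in the examples.

```python
from typing import List


def yarn_args(target_ctx: int, orig_ctx: int = 32768) -> List[str]:
    """Build llama-server context flags for YaRN scaling from orig_ctx to target_ctx."""
    if target_ctx <= orig_ctx:
        raise ValueError("target_ctx must exceed orig_ctx for scaling to apply")
    scale = target_ctx / orig_ctx
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{scale:g}",  # e.g. 4 for 128K, 8 for 256K
        "--yarn-orig-ctx", str(orig_ctx),
    ]


print(" ".join(["./llama-server", "-m", "model.gguf"] + yarn_args(131072)))
```

The resulting settings should still be validated on the target workload, as noted above.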

🎯 7. Recommended Use Cases and Limitations

✅ Good Fits

Code debugging, small repository tasks, tool-call routing, local coding agents, structured instruction following, development workflow assistants, and reasoning-heavy tasks where low local latency matters.

❌ Known Limits

As a compact model, it can still miss broad world knowledge, complex repository-wide dependencies, or highly specialized domain requirements. Tool-call behavior depends strongly on prompt format and tool schema consistency.

Deployment note: The model may emit reasoning inside <think> and </think> tags. Front-end applications and agent frameworks should parse or hide these sections where appropriate.
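
One way a front end might separate the reasoning from the visible answer is sketched below. The function name `split_reasoning` is illustrative, and the handling of an unterminated `<think>` block (treating the remainder as reasoning) is an assumption about truncated generations.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) with <think>...</think> sections removed from the answer."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text)

    # If generation stopped inside a <think> block, hide the unfinished reasoning too.
    open_idx = answer.find("<think>")
    if open_idx != -1:
        tail = answer[open_idx + len("<think>"):].strip()
        reasoning = (reasoning + "\n" + tail).strip()
        answer = answer[:open_idx]

    return reasoning, answer.strip()


reasoning, answer = split_reasoning(
    "<think>The index is off by one.</think>Use range(len(xs))."
)
print(answer)  # prints: Use range(len(xs)).
```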

📚 8. Resources & Guides

👉 GitHub Repository: Jackrong-llm-finetuning-guide. Access the repository to explore the codebase and reproduce the results locally or on Google Colab.

👉 Qwen MTP GGUF Processing Workflow: A custom splitting and merging methodology designed specifically for Qwen series Multi-Token Prediction (MTP) heads.

👉 benchlocal Evaluation Framework: The evaluation framework used to run the local agentic and coding benchmarks.

🙏 9. Acknowledgements

Special thanks to:

The Qwen team for providing the powerful Qwen3.5 base model.
Unsloth for providing the highly efficient fine-tuning framework.
Open-source datasets and community contributors.
Kyle Hessling for the close collaboration on hardware and evaluation support.

📖 10. Citation

@misc{jackrong_qwopus35_4b_coder,
  title        = {Qwopus3.5-4B-Coder},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jackrong/Qwopus3.5-4B-Coder}}
}