base_model:
- Qwen/Qwen3.5-4B
tags:
- text-generation-inference
- transformers
- unsloth
- qwen3_5
- reasoning
- chain-of-thought
- mtp
- multi-token-prediction
- speculative-decoding
- lora
- sft
- agent
- tool-use
- function-calling
- coder
license: apache-2.0
language:
- en
- zh
- es
- ru
- ja
pipeline_tag: text-generation
datasets:
- Jackrong/Claude-opus-4.6-TraceInversion-9000x
- Jackrong/Claude-opus-4.7-TraceInversion-5000x
- lambda/hermes-agent-reasoning-traces
1. Base Model, MTP Setup, and Training Stack
Community Release Notice: Qwopus3.5-4B-Coder is an experimental community release intended for research, local coding experiments, and agent workflow exploration. It has not undergone full safety evaluation or broad general-domain benchmarking.
2. Background and Motivation
3. benchlocal Evaluation and Baseline Comparison
4. Training & Data Pipeline Overview
The training process fuses Trace Inversion data augmentation with a Three-Stage Curriculum Learning pipeline. The core engineering focuses on expanding context length gradually while training on reconstructed reasoning traces and real agent trajectories to keep the output format stable.
[ Trace Inversion: Reconstructing Distillation Workflow ]

A. Surrogate Model Training (Trace Inverter)

Open-source Model (GLM-5.1 / DS-V4) ──► Complete Reasoning Chain ──► [ Qwen3-235B Compression ] ──► Reasoning Bubbles
                                                   │                                                        │
                                                   └────────────────────► [ Training ] ◄────────────────────┘
                                                                   (Base: Qwen3-4B-Instruct)
                                                                  (Result: Trace-Inverter-4B)

B. Inversion Phase: Reconstructing Claude-4.7-Max

 _______________________________________________________
|                                                       |
|  Claude-4.7-Max API ──► Compressed Bubbles + Answer   |
|_______________________________________________________|
                     │
                     ▼
[ Trace-Inverter-4B (Logic Reconstructor) ] ──► Synthetic Deep Reasoning Trace (Learnable CoT)
                     │
                     ▼
             [ Data Splicing ] ◄────────── (Original Prompt + Response)
(Embed reconstructed CoT in <think> tags, splicing with original prompt/response)
                     │
                     ▼
(Result: claude-opus-4.6/4.7 inverted sets)

C. Final Coder SFT Curriculum Pipeline

 ___________________________________________
|                                           |
|      Base Model (Qwen3.5-4B family)       |
|___________________________________________|
              │
              ▼
[ Phase 1: Format Inception ] ──► [ Phase 2: Agent/Coding Expansion ] ──► [ Phase 3: Long-Context SFT ]
      ( < 4096 tokens )                 ( 4096 - 8192 tokens )                ( 8192 - 32K tokens )
   (Stable <think> format)           (Tool traces + coding tasks)         (Long / multi-turn / replay)
              │                                                                         │
              └────────────────────────────────────┬────────────────────────────────────┘
                                                   ▼
                               ________________________________________
                              |                                        |
                              |    Final Model: Qwopus3.5-4B-Coder     |
                              |________________________________________|
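The Data Splicing step in part B of the diagram can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the function name `splice_sample`, the chat-message schema, and the exact whitespace around the tags are hypothetical. Only the idea of wrapping the reconstructed CoT in <think> tags and joining it with the original prompt and response comes from the diagram.

```python
def splice_sample(prompt: str, reconstructed_cot: str, response: str) -> list:
    """Build one SFT sample from a prompt, a reconstructed trace, and the original answer.

    Hypothetical helper: the message layout below is an assumption for illustration.
    """
    cot = reconstructed_cot.strip()
    answer = response.strip()
    # The reconstructed reasoning goes inside <think> tags, ahead of the original response.
    assistant_text = f"<think>\n{cot}\n</think>\n\n{answer}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": assistant_text},
    ]


if __name__ == "__main__":
    sample = splice_sample(
        prompt="Reverse a string in Python.",
        reconstructed_cot="Slicing with a step of -1 walks the string backwards.",
        response="Use s[::-1].",
    )
    print(sample[1]["content"])
```

Keeping the original response untouched and only inserting the trace means the final answer distribution stays that of the source data, while the <think> block supplies the learnable reasoning.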
5. Three-Stage Curriculum Learning
To steadily scale reasoning quality under local and long-context inference, Qwopus3.5-4B-Coder uses a curriculum-style data mixture. The model is first stabilized on short, clean reasoning samples, then exposed to complex coding and agent traces, and finally reinforced with longer contexts plus replay data. This section also describes the fine-tuning context-length distribution; runtime long-context extension guidance is covered in Section 6.
| Curriculum Stage | Focus & Sample Characteristics | Strategy Details |
|---|---|---|
| Stage 1: Format Inception | • Limit context within 4,096 tokens<br>• Emphasize stable reasoning templates | Focuses on short-to-medium length, cleanly formatted reasoning samples. The primary goal is to establish reliable structured reasoning output, including stable <think> boundaries, before exposing the model to longer chains. |
| Stage 2: Complexity Expansion | • Extend length to 4,096 - 8,192 tokens<br>• Introduce higher-difficulty coding and agent samples | Gradually increases the ratio of complex reasoning chains, code debugging tasks, and multi-turn tool traces. The model learns to connect reasoning, action selection, and environment feedback. |
| Stage 3: Long-Context SFT | • Progressively scale samples up to 32K tokens<br>• Use short-sample replay | Pushes the model toward long-context and multi-turn reasoning while replaying high-quality short samples to reduce instruction-following drift. The 32K figure describes the fine-tuning sequence/data mixture target, not a hard architectural limit. |
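As a rough illustration of the length-based staging in the table, the sketch below buckets samples by token count and builds a Stage 3 mixture with short-sample replay. It is an assumption-laden example rather than the actual training code: the function names, the handling of the exact 4,096 and 8,192 boundaries, and the `replay_ratio` value are all hypothetical, since the model card does not state them.

```python
import random
from typing import Optional

# Stage boundaries in tokens, taken from the curriculum table.
STAGE_1_MAX = 4096
STAGE_2_MAX = 8192
STAGE_3_MAX = 32 * 1024


def assign_stage(num_tokens: int) -> Optional[int]:
    """Map a sample's token count to a curriculum stage (None if over 32K)."""
    if num_tokens < STAGE_1_MAX:
        return 1
    if num_tokens <= STAGE_2_MAX:
        return 2
    if num_tokens <= STAGE_3_MAX:
        return 3
    return None  # Assumption: samples beyond the 32K target are left out.


def build_stage3_mix(samples, replay_ratio=0.2, seed=0):
    """Combine long samples with a replayed subset of short ones.

    samples: list of (sample_id, num_tokens) pairs.
    replay_ratio: hypothetical fraction of short samples relative to long ones.
    """
    long_samples = [s for s in samples if assign_stage(s[1]) == 3]
    short_samples = [s for s in samples if assign_stage(s[1]) == 1]
    rng = random.Random(seed)
    k = min(len(short_samples), round(replay_ratio * len(long_samples)))
    return long_samples + rng.sample(short_samples, k)


if __name__ == "__main__":
    data = [("a", 100), ("b", 200), ("c", 10000), ("d", 20000),
            ("e", 9000), ("f", 30000), ("g", 12000)]
    print(build_stage3_mix(data))
```

Fixing the random seed keeps the replayed subset reproducible across runs, which makes it easier to compare mixtures.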
6. Context Length and Long-Context Usage
Community feedback also suggests that RoPE/YaRN scaling can improve long-context stability for this model family. One user reported that, on HermesAgent-20, Qwopus3.6-35B-A3B-v1 performed better when extending from 32K to 128K via RoPE scaling than when directly setting a 128K context window without scaling, with scores of 83 vs. 72 in their setup. This result may vary depending on backend, quantization type, KV cache settings, hardware, and benchmark configuration, but it is consistent with the recommendation to use RoPE/YaRN scaling for contexts beyond 32K.
Example llama.cpp configuration for extending from 32K to 128K:
./llama-server \
-m model.gguf \
--ctx-size 131072 \
--rope-scaling yarn \
--rope-scale 4 \
--yarn-orig-ctx 32768
For 256K context, users may need to adjust the scaling factor and validate the result in their own workload:
./llama-server \
-m model.gguf \
--ctx-size 262144 \
--rope-scaling yarn \
--rope-scale 8 \
--yarn-orig-ctx 32768
Please note that long-context behavior may vary depending on inference backend, quantization type, KV cache settings, available memory, and task type. For best results, benchmark the target workload when using contexts beyond 32K.
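The two commands above use a scaling factor equal to the target context divided by the original 32K context (131072 / 32768 = 4 and 262144 / 32768 = 8). The sketch below makes that arithmetic explicit for other targets. The helper name `yarn_server_args` is hypothetical, and treating the factor as a simple ratio is an assumption drawn from the two examples. The flags themselves are the ones shown in the commands.

```python
def yarn_server_args(model_path: str, target_ctx: int, orig_ctx: int = 32768) -> list:
    """Build a llama-server argument list for a given target context length."""
    args = ["./llama-server", "-m", model_path, "--ctx-size", str(target_ctx)]
    if target_ctx > orig_ctx:
        # Scale factor = target context / original context, e.g. 131072 / 32768 = 4.
        scale = target_ctx / orig_ctx
        args += [
            "--rope-scaling", "yarn",
            "--rope-scale", f"{scale:g}",
            "--yarn-orig-ctx", str(orig_ctx),
        ]
    return args


if __name__ == "__main__":
    print(" ".join(yarn_server_args("model.gguf", 131072)))
    print(" ".join(yarn_server_args("model.gguf", 262144)))
```

For targets at or below 32K the helper leaves the scaling flags out, matching the guidance that scaling is meant for contexts beyond 32K.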
7. Recommended Use Cases and Limitations
Deployment note: The model may emit reasoning inside <think> and </think> tags. Front-end applications and agent frameworks should parse or hide these sections where appropriate.
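One way a front end might separate the reasoning from the visible answer is sketched below. It is a minimal example, not part of the model's tooling: the function name `split_reasoning` is hypothetical, and treating an unclosed <think> tag as reasoning (for instance when generation is cut off by a token limit) is an assumption.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, visible_answer) for one model output."""
    parts = [m.strip() for m in THINK_RE.findall(text)]
    visible = THINK_RE.sub("", text)
    # Assumption: an unclosed <think> means the output was truncated mid-reasoning,
    # so everything after the tag is treated as reasoning and hidden.
    open_idx = visible.find("<think>")
    if open_idx != -1:
        parts.append(visible[open_idx + len("<think>"):].strip())
        visible = visible[:open_idx]
    reasoning = "\n".join(p for p in parts if p)
    return reasoning, visible.strip()


if __name__ == "__main__":
    raw = "<think>\nCheck the edge case first.\n</think>\n\nUse s[::-1]."
    reasoning, answer = split_reasoning(raw)
    print(answer)
```

Returning both parts lets an application log or collapse the reasoning instead of discarding it.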
8. Resources & Guides
- GitHub Repository: Jackrong-llm-finetuning-guide. Access the repository to dive into the codebase and reproduce our results locally or on Google Colab.
- Qwen MTP GGUF Processing Workflow. A custom splitting and merging methodology designed specifically for Qwen series Multi-Token Prediction (MTP) heads.
- benchlocal Evaluation Framework. The evaluation framework used to run the local agentic and coding benchmarks.
9. Acknowledgements
Special thanks to:
- The Qwen team for providing the powerful Qwen3.5 base model.
- Unsloth for providing the highly efficient fine-tuning framework.
- Open-source datasets and community contributors.
- Kyle Hessling for the close collaboration on hardware and evaluation support.
10. Citation
@misc{jackrong_qwopus35_4b_coder,
title = {Qwopus3.5-4B-Coder},
author = {Jackrong},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Jackrong/Qwopus3.5-4B-Coder}}
}