# Qwen3-Coder-Next — Opus 4.6 Reasoning Distilled (Full Fine-Tune)

Full parameter fine-tune of Qwen/Qwen3-Coder-Next (~80B total / ~3B active, MoE) with Claude Opus 4.6 reasoning distillation. Trained on 8x NVIDIA H100 80GB SXM with DeepSpeed ZeRO-3.

## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Coder-Next |
| Architecture | qwen3_next (Mixture of Experts) |
| Total Parameters | ~80B |
| Active Parameters | ~3B per token (10 of 512 experts) |
| Expert Config | 512 experts, 10 active per token, intermediate=512 |
| Shared Expert | Yes (intermediate=512) |
| Attention | Hybrid: linear attention + full attention every 4th layer |
| Hidden Size | 2048 |
| Layers | 48 |
| Attention Heads | 16 (2 KV heads, GQA) |
| Intermediate Size | 5120 |
| Vocab Size | 151,936 |
| Max Context | 262,144 tokens (256K) |
| RoPE | Partial rotary (0.25), theta=5M |
| Precision | BF16 |
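As a sanity check on the ~3B active-parameter figure, the MoE portion can be estimated directly from the table above. This is a rough sketch only: attention, embedding, and router parameters are excluded, so the true active count is higher.

```python
# Back-of-envelope estimate of per-token active MoE parameters,
# using the figures from the Model Details table. Gated expert MLPs
# (gate/up/down projections) are assumed.

HIDDEN, LAYERS = 2048, 48
EXPERT_INTER, ACTIVE_EXPERTS = 512, 10

# One gated expert MLP: gate + up + down projections
expert_params = 3 * HIDDEN * EXPERT_INTER            # ~3.1M per expert
routed = LAYERS * ACTIVE_EXPERTS * expert_params     # experts actually routed to
shared = LAYERS * expert_params                      # shared expert, always active

active = routed + shared
print(f"MoE active params ≈ {active / 1e9:.2f}B")    # ≈ 1.66B before attention/embeddings
```

Adding embeddings (~151,936 × 2048 ≈ 0.3B each for input and output) and the hybrid attention layers brings the total into the ~3B range quoted above.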

## Benchmark Results

Evaluated by Claude Opus 4.6 across 26 tests in 6 categories, scored 1-10 on correctness, completeness, clarity, and adherence to instructions.

| Category | Base Model | Opus Distilled | Winner | Delta |
|---|---|---|---|---|
| Coding | 8.2 | 8.5 | Opus Distilled | +0.3 |
| Bug Detection | 8.4 | 9.0 | Opus Distilled | +0.6 |
| Probability | 8.6 | 8.0 | Base | -0.6 |
| Tool Calling | 3.4 | 7.2 | Opus Distilled | +3.8 |
| Logic | 8.5 | 7.5 | Base | -1.0 |
| Instruction Following | 7.7 | 7.0 | Base | -0.7 |
| **Overall** | **7.35** | **7.73** | **Opus Distilled** | **+0.38** |

## Detailed Test Results

### Coding (6 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Python: Trie Implementation | 9 | 9 | Tie | Both correct and complete |
| Python: Async Web Scraper | 8 | 9 | Opus | Opus uses a cleaner class-based design with semaphore rate limiting |
| Rust: Custom Iterator | 8 | 9 | Opus | Opus adds an idiomatic `impl Iterator<Item = u64>` convenience function |
| TypeScript: Event Emitter | 8 | 8 | Tie | Both implement type-safe emitters with different valid approaches |
| SQL: Complex Query | 7 | 8 | Opus | Base has a bug: references a window function in the WHERE clause of the same SELECT |
| Python: Graph BFS/DFS | 9 | 8 | Base | Base covers more methods within the token budget |

### Bug Detection (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Off-by-one in binary search | 9 | 9 | Tie | Both find all 4 bugs |
| Race condition in Go | 8 | 9 | Opus | Opus adds a deadlock diagram + 3 solutions vs 2 |
| Memory leak in C++ | 9 | 9 | Tie | Both identify Rule of Three/Five violations |
| Security bugs in JavaScript | 8 | 9 | Opus | Opus adds a severity table, catches JWT forgery and missing rate limiting |
| Deadlock in Python threading | 8 | 9 | Opus | Opus provides a step-by-step timeline diagram + Coffman conditions |

### Probability (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Bayes' Theorem | 9 | 9 | Tie | Both correct (~1.94%) |
| Birthday Problem Variant | 9 | 8 | Base | Base more mathematically rigorous |
| Monty Hall Extended | 9 | 9 | Tie | Both correctly derive P(switch)=2/5 |
| Markov Chain | 8 | 6 | Base | Opus made a computation error and had to restart |
| Combinatorics: Card Hands | 9 | 9 | Tie | Both correct on completed sections |

### Tool Calling (5 tests) — Largest improvement

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Weather API planning | 2 | 7 | Opus | Base outputs a single call (37 tokens); Opus chains all 3 tools |
| Database CRUD operations | 3 | 8 | Opus | Base: 1 tool call (119 tokens); Opus: complete 4-step workflow |
| Multi-step file operations | 4 | 3 | Base | Both perform poorly on this test |
| API orchestration | 2 | 7 | Opus | Base outputs a malformed tool call; Opus plans 3 clear steps |
| Complex reasoning with tools | 6 | 7 | Opus | Base batches lookups correctly but stops; Opus completes the reasoning |

### Logic (2 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Sudoku Solver Explanation | 9 | 8 | Base | Base explains constraint propagation more clearly within the token budget |
| Einstein's Riddle | 8 | 7 | Base | Base makes more deduction progress within the token budget |

### Instruction Following (3 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Structured JSON output | 10 | 6 | Base | Base: clean JSON only (453 tokens). Opus: 2048 tokens, ignored the "ONLY JSON" constraint |
| Code with exact constraints | 5 | 6 | Opus | Both struggle; Base self-corrects mid-response |
| Multi-format output | 8 | 8 | Tie | Both truncated, similar quality |

## Key Findings

- Tool Calling: Largest improvement (+3.8). The base model outputs only the first tool call and stops; Opus Distilled plans full multi-step tool chains with reasoning between steps.
- Bug Detection: Opus Distilled provides more structured analysis (severity tables, timeline diagrams) and catches more edge cases (+0.6).
- Coding: Opus Distilled favors class-based architectures with better design patterns, and caught a SQL bug (window function in a WHERE clause) that Base missed.
- Probability: Base is more concise and made fewer computation errors; Opus Distilled erred on a Markov chain steady-state calculation.
- Logic: Base makes better progress within token budgets; Opus Distilled spends more tokens on preamble.
- Instruction Following: Base adheres more strictly to output-format constraints (e.g., "output ONLY valid JSON").
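The tool-calling gap can be pictured with a minimal dispatch loop. This is an illustrative sketch only: the tool names, stub implementations, and three-step plan are hypothetical, not taken from the eval harness.

```python
# Illustrative sketch: the base model emitted a single tool call and stopped,
# while the distilled model plans and executes a chain like the one below.
# All tools here are hypothetical stubs.

def get_coordinates(city: str) -> dict:
    return {"lat": 48.85, "lon": 2.35}            # stubbed geocoding lookup

def get_forecast(lat: float, lon: float) -> dict:
    return {"temp_c": 19, "conditions": "clear"}  # stubbed weather API

def format_report(city: str, forecast: dict) -> str:
    return f"{city}: {forecast['temp_c']}°C, {forecast['conditions']}"

TOOLS = {
    "get_coordinates": get_coordinates,
    "get_forecast": get_forecast,
    "format_report": format_report,
}

# A multi-step chain, each call feeding the next — the behavior the
# distilled model produces where the base model stops after step 1:
coords = TOOLS["get_coordinates"](city="Paris")
forecast = TOOLS["get_forecast"](**coords)
report = TOOLS["format_report"]("Paris", forecast)
print(report)  # Paris: 19°C, clear
```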

## Verdict

Opus Distilled wins overall, driven by massively better tool calling and modestly better bug detection and coding. Base wins on math/probability (fewer computation errors), logic (better token efficiency), and instruction following (stricter constraint adherence). For coding-assistant use cases where tool calling matters, Opus Distilled is clearly superior.

## Performance

Both models run at comparable speeds on an RTX PRO 6000 Blackwell (96GB):

| Metric | Base | Opus Distilled |
|---|---|---|
| Tokens/sec | 100.5 | 102.5 |
| Avg response length | 1,085 tokens | 1,464 tokens |

## Training Details

### Hardware & Infrastructure

- GPUs: 8x NVIDIA H100 80GB SXM with NVLink
- System RAM: 2 TB DDR5
- Distribution: DeepSpeed ZeRO-3 (parameters sharded across all 8 GPUs)
- Optimizer Offload: AdamW optimizer states offloaded to CPU RAM (~700GB)
- Platform: RunPod

### Hyperparameters

| Parameter | Value |
|---|---|
| Method | Full fine-tune (all parameters) |
| Framework | HuggingFace TRL 1.0.0 + DeepSpeed 0.18.9 |
| Transformers | 5.4.0 |
| Optimizer | AdamW (CPU offloaded via DeepSpeed ZeRO-3) |
| Learning Rate | 2e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Weight Decay | 0.01 |
| Gradient Clipping | 1.0 |
| Epochs | 3 |
| Effective Batch Size | 32 (1 per GPU x 4 grad accum x 8 GPUs) |
| Max Sequence Length | 8192 (training context window) |
| Gradient Checkpointing | Enabled (non-reentrant) |
| Precision | BF16 |
| Total Steps | 303 |
| Seed | 42 |
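The ZeRO-3 setup described above corresponds roughly to a DeepSpeed config like the following. This is a sketch using DeepSpeed's documented JSON keys, not the exact file used for this run.

```python
import json

# Hedged sketch of a DeepSpeed ZeRO-3 config matching the settings above:
# BF16 training, stage-3 parameter sharding, AdamW states offloaded to
# CPU RAM, gradient clipping of 1.0, and the batch sizes from the table.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
}
print(json.dumps(ds_config, indent=2))
```

With 8 GPUs, the micro-batch of 1 and 4 accumulation steps give the effective batch size of 32 listed above.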

### Training Progression

| Metric | Step 1 | Step 50 | Step 114 | Step 150 | Step 214 | Step 303 (Final) |
|---|---|---|---|---|---|---|
| Loss | 0.870 | 0.498 | 0.244 | 0.210 | 0.115 | 0.062 |
| Token Accuracy | 78.0% | 84.4% | 91.5% | 93.5% | 96.5% | 98.1% |
| Learning Rate | 0 | 1.94e-5 | 1.51e-5 | 1.29e-5 | 4.66e-6 | 5.99e-10 |
| Epoch | 0.01 | 0.50 | 1.11 | 1.33 | 2.10 | 3.00 |
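The Learning Rate row is consistent with linear warmup over ~5% of steps followed by cosine decay to zero. A sketch of that schedule is below; the logged values won't match it exactly, since they depend on the scheduler implementation and step-counting offsets.

```python
import math

TOTAL, PEAK = 303, 2e-5          # total steps and peak LR from the table
WARMUP = int(0.05 * TOTAL)       # 5% warmup ≈ 15 steps

def lr_at(step: int) -> float:
    """Linear warmup to PEAK, then cosine decay to ~0 over the rest of training."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return 0.5 * PEAK * (1 + math.cos(math.pi * progress))

for s in (1, 50, 150, 303):
    print(f"step {s}: lr ≈ {lr_at(s):.3g}")
```

At step 50 this gives roughly 1.9e-5, in line with the logged 1.94e-5; by step 303 it has decayed to essentially zero.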

### Datasets

3,204 examples after quality filtering (required `<think>` tags and >200 characters of assistant content):

| Dataset | Examples | Description |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,321 | Claude Opus 4.6 reasoning traces (thinking + solution) |
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | High-quality Claude reasoning conversations |
| Jackrong/Qwen3.5-reasoning-700x | 633 | Qwen reasoning conversations |

### Data Format

Each training example follows this structure:

```
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
<think>
{chain-of-thought reasoning}
</think>

{final answer}<|im_end|>
```
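For illustration, the layout can be rendered with a small formatter. This is a sketch assuming the ChatML-style markers shown above; the actual pipeline uses the tokenizer's chat template.

```python
def format_example(problem: str, reasoning: str, answer: str) -> str:
    """Render one training example in the ChatML-style layout shown above."""
    return (
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n{reasoning}\n</think>\n\n"
        f"{answer}<|im_end|>"
    )

sample = format_example("What is 2+2?", "Add the two numbers.", "4")
print(sample)
```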

### Quality Filter

Examples were filtered to require:

1. At least one assistant message containing `<think>` tags
2. Assistant content longer than 200 characters

This removed low-quality or non-reasoning examples from the combined dataset.
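A minimal version of this filter might look like the following. This is a sketch: the actual filtering script is not published, and it assumes the 200-character threshold applies to the combined assistant content of a conversation.

```python
def passes_quality_filter(messages: list[dict]) -> bool:
    """Keep an example only if some assistant turn contains <think> tags
    and the combined assistant content exceeds 200 characters."""
    assistant = [m["content"] for m in messages if m["role"] == "assistant"]
    has_think = any("<think>" in c and "</think>" in c for c in assistant)
    long_enough = sum(len(c) for c in assistant) > 200
    return has_think and long_enough
```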

## Reasoning Format

The model produces reasoning inside `<think>...</think>` tags:

```
<think>
Let me analyze this step by step...
1. First consideration
2. Second consideration
3. Conclusion
</think>

Here is the final answer based on my analysis.
```

When serving with llama.cpp, use `--reasoning-format deepseek` with a thinking-aware chat template to separate reasoning from visible output.
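Client-side, the same separation can be done with a small parser (a sketch; `--reasoning-format deepseek` performs this server-side in llama.cpp):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible answer.
    Returns (reasoning, answer); reasoning is empty if no tags are found."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_reasoning("<think>\nCheck edge cases\n</think>\n\nFinal answer.")
```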

## GGUF Quantizations

See samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-GGUF for quantized GGUF versions (Q4_K_M, Q6_K, Q8_0, BF16).

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled",
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Implement a thread-safe LRU cache in Python"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True so the temperature setting actually takes effect
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```