# Qwen3-Coder-Next — Opus 4.6 Reasoning Distilled (Full Fine-Tune)

Full parameter fine-tune of Qwen/Qwen3-Coder-Next (~80B total / ~3B active, MoE) with Claude Opus 4.6 reasoning distillation. Trained on 8x NVIDIA H100 80GB SXM with DeepSpeed ZeRO-3.

## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Coder-Next |
| Architecture | qwen3_next (Mixture of Experts) |
| Total Parameters | ~80B |
| Active Parameters | ~3B per token (10 of 512 experts) |
| Expert Config | 512 experts, 10 active per token, intermediate=512 |
| Shared Expert | Yes (intermediate=512) |
| Attention | Hybrid: linear attention + full attention every 4th layer |
| Hidden Size | 2048 |
| Layers | 48 |
| Attention Heads | 16 (2 KV heads, GQA) |
| Intermediate Size | 5120 |
| Vocab Size | 151,936 |
| Max Context | 262,144 tokens (256K) |
| RoPE | Partial rotary (0.25), theta=5M |
| Precision | BF16 |
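As a sanity check on the ~3B active-parameter figure, the MoE portion can be estimated directly from the table above. This is a rough sketch only: attention, embedding, and router parameters are excluded, so the true active count is higher.

```python
# Back-of-envelope estimate of per-token active MoE parameters,
# using the figures from the Model Details table. Gated expert MLPs
# (gate/up/down projections) are assumed.

HIDDEN, LAYERS = 2048, 48
EXPERT_INTER, ACTIVE_EXPERTS = 512, 10

# One gated expert MLP: gate + up + down projections
expert_params = 3 * HIDDEN * EXPERT_INTER            # ~3.1M per expert
routed = LAYERS * ACTIVE_EXPERTS * expert_params     # experts actually routed to
shared = LAYERS * expert_params                      # shared expert, always active

active = routed + shared
print(f"MoE active params ≈ {active / 1e9:.2f}B")    # ≈ 1.66B before attention/embeddings
```

Adding embeddings (~151,936 × 2048 ≈ 0.3B each for input and output) and the hybrid attention layers brings the total into the ~3B range quoted above.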

## Benchmark Results

Evaluated by Claude Opus 4.6 across 26 tests in 6 categories, scored 1-10 on correctness, completeness, clarity, and adherence to instructions.

| Category | Base Model | Opus Distilled | Winner | Delta |
|---|---|---|---|---|
| Coding | 8.2 | 8.5 | Opus Distilled | +0.3 |
| Bug Detection | 8.4 | 9.0 | Opus Distilled | +0.6 |
| Probability | 8.6 | 8.0 | Base | -0.6 |
| Tool Calling | 3.4 | 7.2 | Opus Distilled | +3.8 |
| Logic | 8.5 | 7.5 | Base | -1.0 |
| Instruction Following | 7.7 | 7.0 | Base | -0.7 |
| **Overall** | **7.35** | **7.73** | **Opus Distilled** | **+0.38** |

## Detailed Test Results

### Coding (6 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Python: Trie Implementation | 9 | 9 | Tie | Both correct and complete |
| Python: Async Web Scraper | 8 | 9 | Opus | Opus uses a cleaner class-based design with semaphore rate limiting |
| Rust: Custom Iterator | 8 | 9 | Opus | Opus adds an idiomatic `impl Iterator<Item = u64>` convenience function |
| TypeScript: Event Emitter | 8 | 8 | Tie | Both implement type-safe emitters with different valid approaches |
| SQL: Complex Query | 7 | 8 | Opus | Base has a bug: references a window function in the WHERE clause of the same SELECT |
| Python: Graph BFS/DFS | 9 | 8 | Base | Base covers more methods within the token budget |

### Bug Detection (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Off-by-one in binary search | 9 | 9 | Tie | Both find all 4 bugs |
| Race condition in Go | 8 | 9 | Opus | Opus adds a deadlock diagram + 3 solutions vs 2 |
| Memory leak in C++ | 9 | 9 | Tie | Both identify Rule of Three/Five violations |
| Security bugs in JavaScript | 8 | 9 | Opus | Opus adds a severity table, catches JWT forgery and missing rate limiting |
| Deadlock in Python threading | 8 | 9 | Opus | Opus provides a step-by-step timeline diagram + Coffman conditions |

### Probability (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Bayes' Theorem | 9 | 9 | Tie | Both correct (~1.94%) |
| Birthday Problem Variant | 9 | 8 | Base | Base more mathematically rigorous |
| Monty Hall Extended | 9 | 9 | Tie | Both correctly derive P(switch)=2/5 |
| Markov Chain | 8 | 6 | Base | Opus made a computation error and had to restart |
| Combinatorics: Card Hands | 9 | 9 | Tie | Both correct on completed sections |

### Tool Calling (5 tests) — Largest improvement

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Weather API planning | 2 | 7 | Opus | Base outputs a single call (37 tokens); Opus chains all 3 tools |
| Database CRUD operations | 3 | 8 | Opus | Base: 1 tool call (119 tokens); Opus: complete 4-step workflow |
| Multi-step file operations | 4 | 3 | Base | Both perform poorly on this test |
| API orchestration | 2 | 7 | Opus | Base outputs a malformed tool call; Opus plans 3 clear steps |
| Complex reasoning with tools | 6 | 7 | Opus | Base batches lookups correctly but stops; Opus completes the reasoning |

### Logic (2 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Sudoku Solver Explanation | 9 | 8 | Base | Base explains constraint propagation more clearly within the token budget |
| Einstein's Riddle | 8 | 7 | Base | Base makes more deduction progress within the token budget |

### Instruction Following (3 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Structured JSON output | 10 | 6 | Base | Base: clean JSON only (453 tokens). Opus: 2048 tokens, ignored the "ONLY JSON" constraint |
| Code with exact constraints | 5 | 6 | Opus | Both struggle; Base self-corrects mid-response |
| Multi-format output | 8 | 8 | Tie | Both truncated, similar quality |

## Key Findings

- Tool Calling: Largest improvement (+3.8). The base model outputs only the first tool call and stops; Opus Distilled plans full multi-step tool chains with reasoning between steps.
- Bug Detection: Opus Distilled provides more structured analysis (severity tables, timeline diagrams) and catches more edge cases (+0.6).
- Coding: Opus Distilled favors class-based architectures with better design patterns, and caught a SQL bug (window function in a WHERE clause) that Base missed.
- Probability: Base is more concise and made fewer computation errors; Opus Distilled erred on a Markov chain steady-state calculation.
- Logic: Base makes better progress within token budgets; Opus Distilled spends more tokens on preamble.
- Instruction Following: Base adheres more strictly to output-format constraints (e.g., "output ONLY valid JSON").
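The tool-calling gap can be pictured with a minimal dispatch loop. This is an illustrative sketch only: the tool names, stub implementations, and three-step plan are hypothetical, not taken from the eval harness.

```python
# Illustrative sketch: the base model emitted a single tool call and stopped,
# while the distilled model plans and executes a chain like the one below.
# All tools here are hypothetical stubs.

def get_coordinates(city: str) -> dict:
    return {"lat": 48.85, "lon": 2.35}            # stubbed geocoding lookup

def get_forecast(lat: float, lon: float) -> dict:
    return {"temp_c": 19, "conditions": "clear"}  # stubbed weather API

def format_report(city: str, forecast: dict) -> str:
    return f"{city}: {forecast['temp_c']}°C, {forecast['conditions']}"

TOOLS = {
    "get_coordinates": get_coordinates,
    "get_forecast": get_forecast,
    "format_report": format_report,
}

# A multi-step chain, each call feeding the next — the behavior the
# distilled model produces where the base model stops after step 1:
coords = TOOLS["get_coordinates"](city="Paris")
forecast = TOOLS["get_forecast"](**coords)
report = TOOLS["format_report"]("Paris", forecast)
print(report)  # Paris: 19°C, clear
```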

## Verdict

Opus Distilled wins overall, driven by massively better tool calling and modestly better bug detection and coding. Base wins on math/probability (fewer computation errors), logic (better token efficiency), and instruction following (stricter constraint adherence). For coding-assistant use cases where tool calling matters, Opus Distilled is clearly superior.

## Performance

Both models run at comparable speeds on an RTX PRO 6000 Blackwell (96GB):

| Metric | Base | Opus Distilled |
|---|---|---|
| Tokens/sec | 100.5 | 102.5 |
| Avg response length | 1,085 tokens | 1,464 tokens |

## Training Details

### Hardware & Infrastructure

- GPUs: 8x NVIDIA H100 80GB SXM with NVLink
- System RAM: 2 TB DDR5
- Distribution: DeepSpeed ZeRO-3 (parameters sharded across all 8 GPUs)
- Optimizer Offload: AdamW optimizer states offloaded to CPU RAM (~700GB)
- Platform: RunPod

### Hyperparameters

| Parameter | Value |
|---|---|
| Method | Full fine-tune (all parameters) |
| Framework | HuggingFace TRL 1.0.0 + DeepSpeed 0.18.9 |
| Transformers | 5.4.0 |
| Optimizer | AdamW (CPU offloaded via DeepSpeed ZeRO-3) |
| Learning Rate | 2e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Weight Decay | 0.01 |
| Gradient Clipping | 1.0 |
| Epochs | 3 |
| Effective Batch Size | 32 (1 per GPU x 4 grad accum x 8 GPUs) |
| Max Sequence Length | 8192 (training context window) |
| Gradient Checkpointing | Enabled (non-reentrant) |
| Precision | BF16 |
| Total Steps | 303 |
| Seed | 42 |
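The ZeRO-3 setup described above corresponds roughly to a DeepSpeed config like the following. This is a sketch using DeepSpeed's documented JSON keys, not the exact file used for this run.

```python
import json

# Hedged sketch of a DeepSpeed ZeRO-3 config matching the settings above:
# BF16 training, stage-3 parameter sharding, AdamW states offloaded to
# CPU RAM, gradient clipping of 1.0, and the batch sizes from the table.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
}
print(json.dumps(ds_config, indent=2))
```

With 8 GPUs, the micro-batch of 1 and 4 accumulation steps give the effective batch size of 32 listed above.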

### Training Progression

| Metric | Step 1 | Step 50 | Step 114 | Step 150 | Step 214 | Step 303 (Final) |
|---|---|---|---|---|---|---|
| Loss | 0.870 | 0.498 | 0.244 | 0.210 | 0.115 | 0.062 |
| Token Accuracy | 78.0% | 84.4% | 91.5% | 93.5% | 96.5% | 98.1% |
| Learning Rate | 0 | 1.94e-5 | 1.51e-5 | 1.29e-5 | 4.66e-6 | 5.99e-10 |
| Epoch | 0.01 | 0.50 | 1.11 | 1.33 | 2.10 | 3.00 |
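The Learning Rate row is consistent with linear warmup over ~5% of steps followed by cosine decay to zero. A sketch of that schedule is below; the logged values won't match it exactly, since they depend on the scheduler implementation and step-counting offsets.

```python
import math

TOTAL, PEAK = 303, 2e-5          # total steps and peak LR from the table
WARMUP = int(0.05 * TOTAL)       # 5% warmup ≈ 15 steps

def lr_at(step: int) -> float:
    """Linear warmup to PEAK, then cosine decay to ~0 over the rest of training."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return 0.5 * PEAK * (1 + math.cos(math.pi * progress))

for s in (1, 50, 150, 303):
    print(f"step {s}: lr ≈ {lr_at(s):.3g}")
```

At step 50 this gives roughly 1.9e-5, in line with the logged 1.94e-5; by step 303 it has decayed to essentially zero.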

### Datasets

3,204 examples after quality filtering (required `<think>` tags and >200 characters of assistant content):

| Dataset | Examples | Description |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 2,321 | Claude Opus 4.6 reasoning traces (thinking + solution) |
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | High-quality Claude reasoning conversations |
| Jackrong/Qwen3.5-reasoning-700x | 633 | Qwen reasoning conversations |

### Data Format

Each training example follows this structure:

```
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
<think>
{chain-of-thought reasoning}
</think>

{final answer}<|im_end|>
```
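For illustration, the layout can be rendered with a small formatter. This is a sketch assuming the ChatML-style markers shown above; the actual pipeline uses the tokenizer's chat template.

```python
def format_example(problem: str, reasoning: str, answer: str) -> str:
    """Render one training example in the ChatML-style layout shown above."""
    return (
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>\n{reasoning}\n</think>\n\n"
        f"{answer}<|im_end|>"
    )

sample = format_example("What is 2+2?", "Add the two numbers.", "4")
print(sample)
```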

### Quality Filter

Examples were filtered to require:

1. At least one assistant message containing `<think>` tags
2. Assistant content longer than 200 characters

This removed low-quality or non-reasoning examples from the combined dataset.
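A minimal version of this filter might look like the following. This is a sketch: the actual filtering script is not published, and it assumes the 200-character threshold applies to the combined assistant content of a conversation.

```python
def passes_quality_filter(messages: list[dict]) -> bool:
    """Keep an example only if some assistant turn contains <think> tags
    and the combined assistant content exceeds 200 characters."""
    assistant = [m["content"] for m in messages if m["role"] == "assistant"]
    has_think = any("<think>" in c and "</think>" in c for c in assistant)
    long_enough = sum(len(c) for c in assistant) > 200
    return has_think and long_enough
```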

## Reasoning Format

The model produces reasoning inside `<think>...</think>` tags:

```
<think>
Let me analyze this step by step...
1. First consideration
2. Second consideration
3. Conclusion
</think>

Here is the final answer based on my analysis.
```

When serving with llama.cpp, use `--reasoning-format deepseek` with a thinking-aware chat template to separate reasoning from visible output.
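Client-side, the same separation can be done with a small parser (a sketch; `--reasoning-format deepseek` performs this server-side in llama.cpp):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible answer.
    Returns (reasoning, answer); reasoning is empty if no tags are found."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_reasoning("<think>\nCheck edge cases\n</think>\n\nFinal answer.")
```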

## GGUF Quantizations

See samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-GGUF for quantized GGUF versions (Q4_K_M, Q6_K, Q8_0, BF16).

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled",
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Implement a thread-safe LRU cache in Python"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True so the temperature setting actually takes effect
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```