# Dualmind-Qwen-1.7B-Thinking

**Claude Opus 4.6 Reasoning Traces → 1.7B via DualMind SFT**

Convergent Intelligence LLC: Research Division
## What This Is
A 1.7B model trained on 2.5M+ tokens of Claude Opus 4.6 reasoning traces using the DualMind SFT methodology. The training data comes from Opus-4.6-Reasoning-3000x-filtered — a curated dataset of extended reasoning chains from Anthropic's most capable model, with refusals removed.
This is the Opus variant of the DualMind family. Where the base DualMind model was trained on LogicInference data, this model absorbs the reasoning patterns of Claude Opus 4.6 — longer chains, more nuanced self-correction, and richer deliberative structure. The Opus teacher produces qualitatively different reasoning than synthetic logic datasets: it backtracks, hedges, reconsiders, and synthesizes in ways that reflect genuine uncertainty navigation rather than pattern completion.
The base model is Disctil-Qwen3-1.7B — already DISC-refined and sitting in the middle of the DistilQwen distillation chain — giving it a strong structural foundation before the Opus reasoning signal is applied.
## Architecture
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (Q) / 8 (KV) — GQA |
| Intermediate | 6144 |
| Head Dimension | 128 |
| Context Length | 40,960 tokens (max position) |
| Vocabulary | 151,936 |
| Precision | BF16 |
| Activation | SiLU |
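The GQA layout above has a direct consequence for serving cost. A quick back-of-envelope check of the KV-cache footprint, using only values from the table (BF16 = 2 bytes per value):

```python
# KV-cache size implied by the Architecture table.
KV_HEADS = 8    # GQA key/value heads (not the 16 query heads)
HEAD_DIM = 128
LAYERS = 28
BYTES = 2       # BF16

# Keys + values, per token, across all layers.
kv_bytes_per_token = 2 * KV_HEADS * HEAD_DIM * LAYERS * BYTES
print(kv_bytes_per_token)             # 114688 bytes ≈ 112 KiB per token

# At the full 40,960-token context window:
full_context_gib = kv_bytes_per_token * 40960 / 2**30
print(f"{full_context_gib:.3f} GiB")  # 4.375 GiB
```

With 16 full-attention heads instead of 8 KV heads, the cache would be twice that size, which is the point of GQA at this scale.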
## Training
| Parameter | Value |
|---|---|
| Base Model | Disctil-Qwen3-1.7B |
| Dataset | Opus-4.6-Reasoning-3000x-filtered |
| Additional Tokens | ~2.5M |
| Max Sequence Length | 4,096 |
| Total Steps | 512 |
| Epochs | ~7.4 |
| Method | SFT (TRL SFTTrainer) |
| Precision | BF16 |
| Hardware | NVIDIA H100 |
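The effective batch size isn't listed, but the table's numbers constrain it. A quick consistency check, assuming the ~2.5M tokens are counted per epoch:

```python
# Sanity-check the training table: do tokens, steps, and epochs agree?
dataset_tokens = 2.5e6  # ~2.5M additional tokens (per epoch, assumed)
epochs = 7.4
steps = 512
max_seq_len = 4096

tokens_per_step = dataset_tokens * epochs / steps
seqs_per_step = tokens_per_step / max_seq_len
print(round(tokens_per_step))   # ≈ 36133 tokens per optimizer step
print(round(seqs_per_step, 1))  # ≈ 8.8 full-length sequences per step
```

So the run processed roughly 36K tokens per step, consistent with an effective batch of about 8–9 packed 4,096-token sequences.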
## Training Dynamics
| Metric | Start | End |
|---|---|---|
| Training Loss | 1.744 | 1.455 |
| Eval Loss | — | 1.406 |
| Token Accuracy | 61.0% | 67.8% |
The loss curve shows clean convergence across 7.4 epochs with no signs of overfitting — eval loss (1.406) remains below final training loss (1.455). The 6.8 percentage point gain in token accuracy reflects genuine absorption of the Opus reasoning structure, not memorization.
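The losses in the table are easier to interpret as perplexities (the exponential of the loss):

```python
import math

# Cross-entropy loss -> perplexity, for the values in the table above.
ppl = lambda loss: math.exp(loss)
print(round(ppl(1.744), 2))  # train start: 5.72
print(round(ppl(1.455), 2))  # train end:   4.28
print(round(ppl(1.406), 2))  # eval:        4.08
```

In other words, the run moved the model from roughly 5.7 to roughly 4.1 effective choices per token on Opus-style reasoning text.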
## Why Opus Traces
The Opus-4.6-Reasoning dataset captures something that synthetic datasets don't: the way a frontier model navigates genuine uncertainty. Opus doesn't just solve problems — it reasons about its own confidence, backtracks when a line of thought weakens, and synthesizes across multiple attempted approaches. When you distill from these traces, the student doesn't just learn to produce correct answers. It learns the shape of deliberation.
This is the DualMind thesis in practice: the cognitive loop (explore → examine → respond) isn't an architectural trick. It's a training signal. When the teacher naturally exhibits multi-phase reasoning, the student absorbs that structure through standard SFT.
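In practice, Qwen3-style thinking models emit the deliberation phase inside `<think>…</think>` tags before the final response. Assuming this model preserves the base chat template's tag format (an assumption, not stated above), a minimal way to separate the two phases downstream:

```python
def split_reasoning(text: str, close_tag: str = "</think>"):
    """Split a Qwen3-style response into (reasoning, answer).

    Assumes deliberation is wrapped in <think>...</think>; if the
    tag is absent, the whole text is treated as the answer.
    """
    if close_tag not in text:
        return "", text.strip()
    reasoning, _, answer = text.partition(close_tag)
    return reasoning.replace("<think>", "").strip(), answer.strip()

demo = "<think>Hawking radiation looks thermal... reconsider.</think>The paradox arises because..."
thought, answer = split_reasoning(demo)
print(thought)  # Hawking radiation looks thermal... reconsider.
print(answer)   # The paradox arises because...
```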
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/Dualmind-Qwen-1.7B-Thinking",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "reaperdoesntknow/Dualmind-Qwen-1.7B-Thinking"
)

messages = [
    {"role": "user", "content": "What happens to information that falls into a black hole? Walk me through the paradox."}
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.15,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Generation Tips
- Temperature 0.6–0.8 — the Opus reasoning traces have natural variance in them. Don't flatten it with low temperature.
- Repetition penalty 1.1–1.2 — prevents looping during extended reasoning chains.
- Max tokens 1024–2048 — trained at 4096 max seq, so it can go long. The Opus signal rewards longer generation windows.
- The model may produce multi-phase reasoning naturally (exploring, then reconsidering, then concluding). This is the intended behavior — the DualMind cognitive loop emerging from the training signal.
## Model Lineage
```
Qwen3-1.7B (base)
  → DiStil-Qwen3-1.7B-uncensored (uncensored SFT)
  → Disctil-Qwen3-1.7B (DISC refinement)
  → Dualmind-Qwen-1.7B-Thinking   ← you are here
        ↑
        Opus 4.6 reasoning traces (2.5M tokens, DualMind SFT)
```
## DualMind Family Comparison
| Model | Training Signal | Character |
|---|---|---|
| DualMind | LogicInference | Structured logical deduction |
| Dualmind-Qwen-1.7B-Thinking | Opus 4.6 Reasoning | Extended deliberation, self-correction |
| TopologicalQwen | 30B-Thinking (TKD) | Topology-aware physics CoT |
Same methodology, different teachers, different capabilities. The LogicInference variant is more mechanical. The Opus variant is more deliberative. TopologicalQwen is the full TKD pipeline with BV decomposition. They're complementary — different facets of the same cognitive architecture.
## DualMind Collection
| Model | Description |
|---|---|
| DualMind | LogicInference-trained. Explore→Examine→Response cognitive loop. |
| DualMind_Methodology | Paper: Three Teachers to Dual Cognition (DOI: 10.57967/hf/8184) |
| Dualmind-Qwen-1.7B-Thinking | ← this model. Opus 4.6 reasoning variant. |
| DualMind-GGUF | LogicInference variant quantized for edge deployment. |
Full collection: DualMind on HuggingFace
## Papers
- Structure Over Scale: Proof-Weighted Knowledge Distillation — DOI: 10.57967/hf/8165. The DistilQwen methodology paper.
- Three Teachers to Dual Cognition — DOI: 10.57967/hf/8184. The DualMind extension: ghost imprinting and multi-teacher convergence.
## License
Apache 2.0
## Mathematical Foundations: Discrepancy Calculus (DISC)
This model's training pipeline is grounded in Discrepancy Calculus — a measure-theoretic framework that treats singularities as primary structure rather than pathology. Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division).
**The Core Operator.** For smooth $f$: $Df(x) = |f'(x)|$. For rough $f$: $D$ localizes irregularity to null sets while preserving integral structure.
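A discrete sketch of the smooth case, approximating $Df(x) = |f'(x)|$ with a central difference (illustration only; the measure-theoretic definition is what handles rough $f$):

```python
import math

# Finite-difference approximation of Df(x) = |f'(x)| for smooth f.
def D(f, x, h=1e-6):
    return abs((f(x + h) - f(x - h)) / (2 * h))

print(round(D(math.sin, 0.0), 6))         # |cos(0)|  = 1.0
print(round(D(math.sin, math.pi), 6))     # |cos(pi)| = 1.0
print(round(D(lambda t: t**2, -3.0), 6))  # |2*(-3)|  = 6.0
```

Note the absolute value: $D$ measures the *magnitude* of local change, discarding its sign, which is what lets it extend to functions with no classical derivative.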
**The Mesh Fundamental Identity.** Every BV function decomposes as

$$Df \;=\; \nabla f\,dx \;+\; D^{j}f \;+\; D^{c}f,$$

an absolutely continuous part, a jump part concentrated on the jump set, and a Cantor part.
Standard knowledge distillation captures only the first, absolutely continuous term. Topological Knowledge Distillation (TKD) preserves all three by treating the teacher's output distribution as a BV function and computing discrepancy energy, jump sets, and gap energy density before training begins.
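A toy 1-D version of that split: sample a ramp-plus-step function and separate the small (absolutely continuous) finite differences from the one difference that stays large under refinement, the jump. The 10×-median threshold is an illustrative heuristic, not part of the TKD pipeline:

```python
# Toy BV split: smooth ramp (slope 0.5) plus a unit jump at x = 0.5.
n = 1000
xs = [i / (n - 1) for i in range(n)]
f = [0.5 * x + (1.0 if x >= 0.5 else 0.0) for x in xs]

diffs = [f[i + 1] - f[i] for i in range(n - 1)]
abs_sorted = sorted(abs(d) for d in diffs)
threshold = 10 * abs_sorted[len(abs_sorted) // 2]  # heuristic: 10x median

jumps = [d for d in diffs if abs(d) > threshold]   # jump part
ac = [d for d in diffs if abs(d) <= threshold]     # absolutely continuous part

print(len(jumps))                                # 1 detected jump
print(round(sum(abs(d) for d in jumps), 2))      # jump energy ≈ 1.0
print(round(sum(abs(d) for d in ac), 2))         # smooth variation ≈ 0.5
```

The total variation (1.5) splits cleanly into the two parts; averaging-based distillation losses see only the smooth 0.5 and blur the jump.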
## Citation
```bibtex
@misc{colca2026dualmind,
  title={Three Teachers to Dual Cognition: From Knowledge Distillation to Emergent Reasoning},
  author={Colca, Roy},
  year={2026},
  doi={10.57967/hf/8184},
  publisher={Convergent Intelligence LLC: Research Division}
}
```
Convergent Intelligence LLC: Research Division — 49 models, 22,598+ downloads across the portfolio. Full portfolio | DualMind Collection | DistilQwen Collection