---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3
- sft
- trl
- topological-knowledge-distillation
- disc
- convergent-intelligence
base_model: Qwen/Qwen3-1.7B
---
# TopologicalQwen

**Topology-Aware Knowledge Distillation from Qwen3-30B-A3B → 1.7B**

Convergent Intelligence LLC: Research Division
## What This Is
TopologicalQwen is a 1.7B-parameter model distilled from Qwen3-30B-A3B using Topological Knowledge Distillation (TKD): a methodology that treats the teacher's output distribution over a concatenated token stream as a bounded variation (BV) function and decomposes knowledge transfer into three channels via the Mesh Fundamental Identity:

1. **Smooth distillation (AC component)**: standard KL divergence over regions where the teacher's distribution varies continuously. This is where every other KD method stops.
2. **Jump corrections (`D^j f`)**: explicit correction terms at conceptual boundaries where the teacher's distribution is discontinuous: points where topic, register, or reasoning mode shifts. Standard KD smears across these boundaries, losing structural information.
3. **Drift corrections (`D^c f`)**: the Cantor (singular-continuous) component, capturing gradual distributional drift that neither the smooth nor the jump terms account for. This is the residual structure that emerges in generation quality.

Standard knowledge distillation handles only channel (1). TKD captures all three.
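In classical analysis, this three-way split mirrors the Lebesgue decomposition of a BV function's distributional derivative. A sketch of the identity as the channels above describe it (the precise "Mesh Fundamental Identity" is defined in the linked *Structure Over Scale* paper):

```latex
% Derivative of a BV function f splits into three mutually singular parts:
% absolutely continuous (smooth), jump, and Cantor (singular-continuous).
Df \;=\; \underbrace{f'(x)\,dx}_{\text{AC: smooth distillation}}
\;+\; \underbrace{\sum_i \big(f(x_i^+) - f(x_i^-)\big)\,\delta_{x_i}}_{D^j f:\ \text{jump corrections}}
\;+\; \underbrace{D^c f}_{\text{Cantor: drift corrections}}
```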
## Architecture
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (Q) / 8 (KV), GQA |
| Intermediate Size | 6144 |
| Context Length | 40,960 tokens |
| Vocabulary | 151,936 |
| Precision | BF16 training (autocast), BF16/FP16 inference |
## Training
**Student:** Disctil-Qwen3-1.7B (DISC-refined uncensored Qwen3)
**Teacher:** Qwen3-30B-A3B-Thinking-2507
Datasets (physics CoT, ~1,599 samples):
- CoT Differential Equations (636 examples)
- CoT Theoretical Mechanics (307 examples)
- CoT Electromagnetism (580 examples)
- CoT General Relativity (76 examples)
**DualMind format**: each training sample is restructured into `<explore>` (derivation), `<examine>` (verification/self-critique), and `<response>` (clean answer) blocks. The model learns a cognitive loop: generate reasoning, then critique it, then synthesize.
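As a minimal sketch of the restructuring step (the three tags come from the card; the sample fields, template spacing, and `to_dualmind` name are assumptions for illustration):

```python
# Hypothetical sketch of DualMind restructuring. Only the <explore>/<examine>/
# <response> tags and the "##USER:" prefix are taken from the card; everything
# else here is an assumed template.
def to_dualmind(question: str, derivation: str, critique: str, answer: str) -> str:
    """Wrap one CoT sample into the explore -> examine -> response loop."""
    return (
        f"##USER:\n{question}\n\n"
        f"<explore>\n{derivation}\n</explore>\n"
        f"<examine>\n{critique}\n</examine>\n"
        f"<response>\n{answer}\n</response>"
    )

sample = to_dualmind(
    "Solve y'' + y = 0.",
    "Try y = e^(rt); the characteristic equation r^2 + 1 = 0 gives r = +/- i ...",
    "Check: both sin t and cos t satisfy the ODE; the Wronskian is nonzero.",
    "y(t) = C1 cos t + C2 sin t",
)
print(sample)
```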
### TKD Pipeline (4 phases)
**Phase 1: Teacher logit caching.** A single forward pass through the 30B teacher, with top-64 logit compression to disk. One pass; no repeated teacher inference.
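A minimal sketch of the caching step, assuming a top-k values-plus-indices layout (the card specifies k=64; the file format and the `compress_logits`/`expand_logits` helpers are illustrative, not the actual pipeline code):

```python
import torch

# Storing only the top-k values + indices per position shrinks a 151,936-vocab
# logit row by roughly 2000x. The on-disk layout below is an assumption.
def compress_logits(logits: torch.Tensor, k: int = 64):
    """logits: (seq_len, vocab) -> (values, indices), each (seq_len, k)."""
    values, indices = torch.topk(logits, k=k, dim=-1)
    return values, indices

def expand_logits(values, indices, vocab_size: int, fill: float = -1e4):
    """Rebuild a dense (approximate) logit tensor for the KD loss; entries
    outside the top-k are filled with a large negative constant."""
    dense = torch.full((values.shape[0], vocab_size), fill)
    dense.scatter_(-1, indices, values)
    return dense

# One pass over the teacher, then cache to disk:
teacher_logits = torch.randn(8, 151_936)  # stand-in for a real forward pass
vals, idx = compress_logits(teacher_logits)
torch.save({"values": vals, "indices": idx}, "/tmp/teacher_cache.pt")
```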
**Phase 2: DISC topology pass.** A vectorized discrepancy operator maps the knowledge manifold: jump detection at a 3σ threshold with 1.25× amplification, and gap energy density computed over 64-token windows.
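A sketch of the jump-detection arithmetic under the card's 3σ threshold and 1.25× amplifier, with a synthetic discrepancy signal standing in for the actual DISC operator:

```python
import numpy as np

# Positions where the per-token discrepancy changes by more than 3 standard
# deviations get their loss weight amplified by 1.25x. The discrepancy signal
# itself (normally produced by the DISC operator) is a stand-in here.
def jump_weights(discrepancy: np.ndarray, sigma_mult: float = 3.0,
                 amplifier: float = 1.25) -> np.ndarray:
    diffs = np.abs(np.diff(discrepancy, prepend=discrepancy[0]))
    threshold = sigma_mult * diffs.std()
    weights = np.ones_like(discrepancy)
    weights[diffs > threshold] = amplifier
    return weights

def gap_energy(discrepancy: np.ndarray, window: int = 64) -> np.ndarray:
    """Mean squared discrepancy over fixed 64-token windows."""
    n = len(discrepancy) // window * window
    return (discrepancy[:n] ** 2).reshape(-1, window).mean(axis=1)
```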
**Phase 3: Topology-guided adaptive windowing.** 512-token windows are cut at low-discrepancy positions (overlap 32–128) rather than at a fixed stride. The topology indicates where to cut without losing information across boundaries.
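The windowing rule can be sketched as follows; the window size and overlap band come from the card, while the band-search heuristic below is an assumption:

```python
import numpy as np

# Instead of a fixed stride, each next window starts inside the previous
# window's tail (the 32-128 token overlap band), at the position where the
# discrepancy signal is lowest, so cuts avoid conceptual boundaries.
def adaptive_windows(discrepancy: np.ndarray, window: int = 512,
                     min_overlap: int = 32, max_overlap: int = 128):
    windows, start = [], 0
    n = len(discrepancy)
    while True:
        end = min(start + window, n)
        windows.append((start, end))
        if end == n:
            break
        # Candidate band [end - max_overlap, end - min_overlap); pick the
        # lowest-discrepancy position as the next window's start.
        lo, hi = end - max_overlap, end - min_overlap
        start = lo + int(np.argmin(discrepancy[lo:hi]))
    return windows
```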
**Phase 4: Curriculum-ordered continuous KD.** A 4-phase curriculum (easiest 30% first). Proof-weighted loss: 2.25× → 1.1× decaying weights on reasoning tokens. KD alpha ramps from 0 → 0.45, starting at 15% of training and reaching the target at 45%. KL divergence at T=2.0. Effective batch size 32 (2 × 16 gradient accumulation). Cosine LR: 5e-6 → 5e-7.
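A sketch of the loss arithmetic this phase describes. The alpha ramp schedule, T=2.0, and proof weighting follow the card; the exact blending of KL and cross-entropy is an assumption:

```python
import torch
import torch.nn.functional as F

# Alpha ramps linearly from 0 to the 0.45 target between 15% and 45% of
# training progress, per the card.
def kd_alpha(progress: float, target: float = 0.45,
             start: float = 0.15, end: float = 0.45) -> float:
    if progress <= start:
        return 0.0
    if progress >= end:
        return target
    return target * (progress - start) / (end - start)

def kd_loss(student_logits, teacher_logits, labels, proof_weight,
            alpha: float, T: float = 2.0):
    """Per-token KL (temperature T, rescaled by T^2) blended with CE,
    then weighted by the proof-token weights (2.25x decaying to 1.1x)."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(-1) * (T * T)
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    per_token = alpha * kl + (1 - alpha) * ce
    return (proof_weight * per_token).mean()
```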
### Hyperparameters
| Parameter | Value |
|---|---|
| Effective batch size | 32 (2 × 16 accumulation) |
| Learning rate | 5e-6 → 5e-7 (cosine) |
| Warmup steps | 30 |
| Weight decay | 1e-3 |
| Gradient clip | 1.0 |
| Temperature | 2.0 |
| KD target Ξ± | 0.45 |
| Proof weight | 2.25× → 1.1× |
| Jump threshold | 3σ |
| Jump amplifier | 1.25× |
| Precision | BF16 (autocast) |
Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)
## Usage
The model responds in the DualMind format: `<explore>` → `<examine>` → `<response>`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/TopologicalQwen",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/TopologicalQwen")

# Prompt with DualMind format: start the explore block
prompt = (
    "##USER:\n"
    "Prove that every convergent sequence is a Cauchy sequence.\n\n"
    "<explore>\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.6,
    repetition_penalty=1.15,
)
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result)

# Verify mode transitions
assert "<explore>" in result and "</explore>" in result    # derivation
assert "<examine>" in result and "</examine>" in result    # self-critique
assert "<response>" in result and "</response>" in result  # clean answer
```
## What the Output Looks Like
```text
<explore>
[Unconstrained derivation: the model works through the proof freely]
</explore>
<examine>
[Adversarial self-critique: the model critiques its own derivation]
</examine>
<response>
[Clean final answer synthesized from the internal dialogue]
</response>
```
This is the multi-model collision array collapsed into a single architecture. The dialectical structure that produces novel insights from architectural diversity is recreated through role-conditioned generation on shared weights.
## Distillation Chain
```text
Qwen3-1.7B (base)
  → DiStil-Qwen3-1.7B-uncensored (uncensored SFT)
  → Disctil-Qwen3-1.7B (DISC refinement)
  → TopologicalQwen (TKD from 30B-Thinking teacher + DualMind format)  ← you are here
```
## What Makes This Different
The broader Convergent Intelligence portfolio (43 models, 12,000+ downloads) was trained on CPU at FP32 for a total compute cost of $24. That proves the methodology β structure beats scale.
This model is the exception. TopologicalQwen was trained on Colab H100 at BF16 precision with a 30B-parameter teacher. Same TKD methodology, premium compute. This is the DistilQwen collection's answer to "what happens when you give this pipeline real hardware?"
The result: a 1.7B model that exhibits dual-mental-modality reasoning (explore → examine → respond) with structural quality that standard distillation at any precision doesn't produce. The methodology is the constant; the hardware is the variable. Either way, the results shouldn't exist at this parameter count.
Every knowledge distillation method in the literature treats the teacher's output as a smooth function and minimizes KL divergence globally. This works for the easy parts β regions where the teacher's distribution varies slowly. But language has structure: topic shifts, reasoning mode transitions, register changes. At these boundaries, the teacher's distribution jumps. Standard KD averages across these jumps, teaching the student a blurred version of the teacher's actual knowledge.
TKD uses the DISC (Discrepancy Calculus) framework to detect these structural features before training, then allocates capacity and loss weight accordingly. The result is a student that preserves the teacher's structural understanding, not just its surface statistics.
The empirical evidence: this model at 1.7B consistently produces responses with structural reasoning quality that standard distillation at the same parameter count does not achieve.
## Related Models
| Model | Description | Downloads |
|---|---|---|
| Qwen3-1.7B-Thinking-Distil | TKD with Thinking teacher | 687 |
| LFM2.5-1.2B-Distilled-SFT | Cross-architecture TKD (LFM → Qwen) | 544 |
| Qwen3-1.7B-Coder-Distilled-SFT | TKD with Coder teacher | 508 |
DistilQwen Collection β Full proof-weighted distillation series (9 models)
## Citation
```bibtex
@misc{colca2026topologicalqwen,
  title={TopologicalQwen: Topology-Aware Knowledge Distillation via Bounded Variation Decomposition},
  author={Colca, Roy S.},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/TopologicalQwen},
  note={Convergent Intelligence LLC: Research Division}
}
```
## From the Convergent Intelligence Portfolio
DistilQwen Collection: our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU; this collection shows what happens when the methodology gets real hardware.
Top model: Qwen3-1.7B-Coder-Distilled-SFT β 508 downloads
Convergent Intelligence LLC: Research Division. *"Where classical analysis fails to see, we begin."*