---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
  - qwen3
  - sft
  - trl
  - topological-knowledge-distillation
  - disc
  - convergent-intelligence
base_model: Qwen/Qwen3-1.7B
---

# TopologicalQwen

**Topology-Aware Knowledge Distillation from Qwen3-30B-A3B → 1.7B**

*Convergent Intelligence LLC: Research Division*


## What This Is

TopologicalQwen is a 1.7B-parameter model distilled from Qwen3-30B-A3B using Topological Knowledge Distillation (TKD) — a methodology that treats the teacher's output distribution over a concatenated token stream as a bounded variation (BV) function and decomposes knowledge transfer into three channels via the Mesh Fundamental Identity:

1. **Smooth distillation (AC component)** — standard KL divergence over regions where the teacher's distribution varies continuously. This is where every other KD method stops.
2. **Jump corrections (D^j f)** — explicit correction terms at conceptual boundaries where the teacher's distribution is discontinuous: the points where topic, register, or reasoning mode shifts. Standard KD smears across them, losing structural information.
3. **Drift corrections (D^c f)** — the Cantor/singular-continuous component capturing gradual distributional drift that neither the smooth nor the jump terms account for. This is the residual structure that shows up in generation quality.

Standard knowledge distillation only handles term (1). TKD captures all three.
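For readers who know BV functions: the three channels above line up with the standard decomposition of the derivative measure of a one-dimensional BV function. The sketch below uses that standard notation; the linked *Structure Over Scale* paper gives the authoritative form of the identity, which may differ in detail:

```latex
% Derivative measure of a BV function f, split into three mutually singular parts:
Df \;=\; \underbrace{f'\,\mathcal{L}^1}_{\text{absolutely continuous}}
\;+\; \underbrace{D^j f}_{\text{jump part}}
\;+\; \underbrace{D^c f}_{\text{Cantor part}}
```

TKD's three loss channels mirror these three terms: smooth KL for the AC part, boundary corrections for the jump part, drift corrections for the Cantor part.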

## Architecture

| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden size | 2048 |
| Layers | 28 |
| Attention heads | 16 (Q) / 8 (KV), GQA |
| Intermediate size | 6144 |
| Context length | 40,960 tokens |
| Vocabulary | 151,936 |
| Precision | FP32 training, BF16/FP16 inference |

## Training

**Student:** Disctil-Qwen3-1.7B (DISC-refined uncensored Qwen3)

**Teacher:** Qwen3-30B-A3B-Thinking-2507

**Datasets** (physics CoT, ~1,599 samples):

- CoT Differential Equations (636 examples)
- CoT Theoretical Mechanics (307 examples)
- CoT Electromagnetism (580 examples)
- CoT General Relativity (76 examples)

**DualMind format** — each training sample is restructured into `<explore>` (derivation), `<examine>` (verification/self-critique), and `<response>` (clean answer) blocks. The model learns a cognitive loop: generate reasoning, then critique it, then synthesize.
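Restructuring a raw CoT sample into the DualMind blocks might look like the sketch below. The `to_dualmind` helper and its field names are hypothetical; only the tag set and the `##USER:` prefix come from this card.

```python
def to_dualmind(question: str, derivation: str, critique: str, answer: str) -> str:
    """Wrap one CoT sample into explore/examine/response blocks (hypothetical layout)."""
    return (
        f"##USER:\n{question}\n\n"
        f"<explore>\n{derivation}\n</explore>\n"
        f"<examine>\n{critique}\n</examine>\n"
        f"<response>\n{answer}\n</response>"
    )

sample = to_dualmind(
    "Show x(t) = A cos(wt) solves x'' + w^2 x = 0.",
    "Differentiate twice: x'' = -A w^2 cos(wt) = -w^2 x.",
    "Check: substituting gives -w^2 x + w^2 x = 0. Holds for all t.",
    "x(t) = A cos(wt) satisfies the ODE since x'' = -w^2 x.",
)
print(sample)
```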

### TKD Pipeline (4 phases)

**Phase 1 — Teacher logit caching:** Single forward pass through the 30B teacher with top-64 logit compression to disk. One pass, no repeated teacher inference.
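Top-64 logit compression can be sketched as follows. The function name and the float16/int32 storage layout are assumptions; only k=64 and the vocabulary size come from the card.

```python
import numpy as np

def compress_logits(logits: np.ndarray, k: int = 64):
    """Keep only the top-k logits per position; store values plus vocab indices.

    logits: (seq_len, vocab_size) teacher logits for one cached sequence.
    Returns (values, indices), each (seq_len, k), cutting per-position storage
    from vocab_size floats to k half-floats plus k int32 indices.
    """
    # argpartition finds the k largest per row without a full O(V log V) sort
    idx = np.argpartition(-logits, k - 1, axis=-1)[:, :k]
    vals = np.take_along_axis(logits, idx, axis=-1)
    return vals.astype(np.float16), idx.astype(np.int32)

seq, vocab = 8, 151_936
rng = np.random.default_rng(0)
vals, idx = compress_logits(rng.standard_normal((seq, vocab)).astype(np.float32))
print(vals.shape, idx.shape)  # (8, 64) (8, 64)
```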

**Phase 2 — DISC topology pass:** Vectorized discrepancy operator maps the knowledge manifold. Jump detection at a 3σ threshold with 1.25× amplification. Gap energy density computed over 64-token windows.
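A minimal sketch of the 3σ jump detection, assuming it operates on a per-token discrepancy signal; the DISC discrepancy operator itself is not reproduced here, and the exact thresholding rule is an assumption.

```python
import numpy as np

def detect_jumps(discrepancy: np.ndarray, n_sigma: float = 3.0, amplifier: float = 1.25):
    """Flag positions where the discrepancy signal jumps beyond n_sigma standard
    deviations of its step sizes; return jump positions and per-token loss
    weights with the amplifier applied at those positions."""
    diffs = np.abs(np.diff(discrepancy))
    threshold = diffs.mean() + n_sigma * diffs.std()
    jumps = np.where(diffs > threshold)[0] + 1  # token just after each jump
    weights = np.ones_like(discrepancy)
    weights[jumps] = amplifier
    return jumps, weights

sig = np.zeros(100)
sig[40:] = 5.0  # one hard conceptual boundary at position 40
jumps, w = detect_jumps(sig)
print(jumps)  # [40]
```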

**Phase 3 — Topology-guided adaptive windowing:** 512-token windows are cut at low-discrepancy positions (overlap 32–128) rather than at a fixed stride. The topology tells you where to cut without losing information across boundaries.
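One way to realize this step, placing each cut at the lowest-discrepancy position inside the allowed overlap band; the window and overlap numbers come from the card, the selection rule is an assumption.

```python
import numpy as np

def adaptive_windows(discrepancy, window=512, min_overlap=32, max_overlap=128):
    """Split a token stream into ~window-sized chunks, starting each next chunk
    at the lowest-discrepancy position inside the allowed overlap band."""
    n = len(discrepancy)
    starts, start = [], 0
    while start + window < n:
        # candidate next-start positions give overlaps in (min_overlap, max_overlap]
        lo, hi = start + window - max_overlap, start + window - min_overlap
        cut = lo + int(np.argmin(discrepancy[lo:hi]))
        starts.append(start)
        start = cut
    starts.append(start)
    return starts

rng = np.random.default_rng(1)
disc = rng.random(2000)  # stand-in discrepancy signal from Phase 2
starts = adaptive_windows(disc)
print(starts)
```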

**Phase 4 — Curriculum-ordered continuous KD:** 4-phase curriculum (easiest 30% first). Proof-weighted loss: 2.25× → 1.1× decaying weights on reasoning tokens. KD α ramps from 0 → 0.45 (starting at 15% of training, reaching the target at 45%). KL divergence at T=2.0. Effective batch size 32 (2 × 16 gradient accumulation). Cosine LR: 5e-6 → 5e-7.
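The per-step objective in Phase 4 can be sketched as proof-weighted cross-entropy blended with a temperature-T KL term against the cached teacher logits, mixed by the ramping α. The `kd_alpha`/`kd_loss` helpers are hypothetical; the constants (α ramp 15% → 45%, T=2.0, proof weight 2.25) come from the card, the linear ramp shape is an assumption.

```python
import torch
import torch.nn.functional as F

def kd_alpha(progress: float, start=0.15, end=0.45, target=0.45) -> float:
    """Ramp the KD mixing weight from 0 at 15% of training to 0.45 at 45%."""
    if progress <= start:
        return 0.0
    return target * min(1.0, (progress - start) / (end - start))

def kd_loss(student_logits, teacher_logits, labels, token_weights, progress, T=2.0):
    """Blend proof-weighted CE with temperature-T KL to the teacher."""
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    ce = (ce * token_weights).mean()
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True, reduction="batchmean",
    ) * (T * T)  # standard T^2 rescaling keeps gradients comparable across T
    a = kd_alpha(progress)
    return (1 - a) * ce + a * kl

s = torch.randn(8, 151_936)                 # student logits for 8 tokens
t = torch.randn(8, 151_936)                 # cached teacher logits
y = torch.randint(0, 151_936, (8,))         # ground-truth next tokens
w = torch.full((8,), 2.25)                  # proof-token weight early in training
loss = kd_loss(s, t, y, w, progress=0.30)   # 30% through training: alpha = 0.225
print(float(loss))
```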

### Hyperparameters

| Parameter | Value |
|---|---|
| Effective batch size | 32 (2 × 16 accum) |
| Learning rate | 5e-6 → 5e-7 (cosine) |
| Warmup steps | 30 |
| Weight decay | 1e-3 |
| Gradient clip | 1.0 |
| Temperature | 2.0 |
| KD target α | 0.45 |
| Proof weight | 2.25 → 1.1 |
| Jump threshold | 3σ |
| Jump amplifier | 1.25× |
| Precision | BF16 (autocast) |

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)

## Usage

The model responds in DualMind format: `<explore>` → `<examine>` → `<response>`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/TopologicalQwen",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/TopologicalQwen")

# Prompt with DualMind format: start the explore block
prompt = (
    "##USER:\n"
    "Prove that every convergent sequence is a Cauchy sequence.\n\n"
    "<explore>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    top_p=0.9, temperature=0.6, repetition_penalty=1.15,
)
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result)

# Verify mode transitions
assert "<explore>" in result and "</explore>" in result    # derivation
assert "<examine>" in result and "</examine>" in result    # self-critique
assert "<response>" in result and "</response>" in result  # clean answer
```

## What the Output Looks Like

```text
<explore>
[Unconstrained derivation: the model works through the proof freely]
</explore>

<examine>
[Adversarial self-critique: the model critiques its own derivation]
</examine>

<response>
[Clean final answer synthesized from the internal dialogue]
</response>
```

This is the multi-model collision array collapsed into a single architecture. The dialectical structure that produces novel insights from architectural diversity is recreated through role-conditioned generation on shared weights.

## Distillation Chain

```text
Qwen3-1.7B (base)
  → DiStil-Qwen3-1.7B-uncensored (uncensored SFT)
    → Disctil-Qwen3-1.7B (DISC refinement)
      → TopologicalQwen (TKD from 30B-Thinking teacher + DualMind format)  ← you are here
```

## What Makes This Different

The broader Convergent Intelligence portfolio (43 models, 12,000+ downloads) was trained on CPU at FP32 for a total compute cost of $24. That proves the methodology: structure beats scale.

This model is the exception. TopologicalQwen was trained on Colab H100 at BF16 precision with a 30B-parameter teacher. Same TKD methodology, premium compute. This is the DistilQwen collection's answer to "what happens when you give this pipeline real hardware?"

The result: a 1.7B model that exhibits dual-mental-modality reasoning (explore → examine → respond) with structural quality that standard distillation at any precision doesn't produce. The methodology is the constant. The hardware is the variable. Both produce results that shouldn't exist at this parameter count.

Standard knowledge distillation methods treat the teacher's output as a smooth function and minimize KL divergence globally. This works for the easy parts — regions where the teacher's distribution varies slowly. But language has structure: topic shifts, reasoning-mode transitions, register changes. At these boundaries the teacher's distribution jumps. Standard KD averages across these jumps, teaching the student a blurred version of the teacher's actual knowledge.

TKD uses the DISC (Discrepancy Calculus) framework to detect these structural features before training, then allocates capacity and loss weight accordingly. The result is a student that preserves the teacher's structural understanding, not just its surface statistics.

The empirical evidence: this model at 1.7B consistently produces responses with structural reasoning quality that standard distillation at the same parameter count does not achieve.

## Related Models

| Model | Description | Downloads |
|---|---|---|
| Qwen3-1.7B-Thinking-Distil | TKD with Thinking teacher | 687 |
| LFM2.5-1.2B-Distilled-SFT | Cross-architecture TKD (LFM → Qwen) | 544 |
| Qwen3-1.7B-Coder-Distilled-SFT | TKD with Coder teacher | 508 |

**DistilQwen Collection** — full proof-weighted distillation series (9 models)

## Citation

```bibtex
@misc{colca2026topologicalqwen,
  title={TopologicalQwen: Topology-Aware Knowledge Distillation via Bounded Variation Decomposition},
  author={Colca, Roy S.},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/TopologicalQwen},
  note={Convergent Intelligence LLC: Research Division}
}
```


## From the Convergent Intelligence Portfolio

**DistilQwen Collection** — our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU; this collection shows what happens when you give the methodology real hardware.

Top model: Qwen3-1.7B-Coder-Distilled-SFT (508 downloads)


*Convergent Intelligence LLC: Research Division*

*"Where classical analysis fails to see, we begin."*