---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3
- sft
- trl
- topological-knowledge-distillation
- disc
- convergent-intelligence
base_model: Qwen/Qwen3-1.7B
---
# TopologicalQwen

**Topology-Aware Knowledge Distillation from Qwen3-30B-A3B → 1.7B**

Convergent Intelligence LLC: Research Division
## What This Is
TopologicalQwen is a 1.7B-parameter model distilled from Qwen3-30B-A3B using Topological Knowledge Distillation (TKD): a methodology that treats the teacher's output distribution over a concatenated token stream as a bounded variation (BV) function and decomposes knowledge transfer into three channels via the Mesh Fundamental Identity:

1. **Smooth distillation (AC component)**: standard KL divergence over regions where the teacher's distribution varies continuously. This is where every other KD method stops.
2. **Jump corrections (`D^j f`)**: explicit correction terms at conceptual boundaries where the teacher's distribution is discontinuous: points where topic, register, or reasoning mode shifts. Standard KD smears across these boundaries, losing structural information.
3. **Drift corrections (`D^c f`)**: the Cantor (singular-continuous) component, capturing gradual distributional drift that neither the smooth nor the jump terms account for. This is the residual structure that emerges in generation quality.

Standard knowledge distillation handles only channel (1). TKD captures all three.
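In classical analysis, this three-way split mirrors the Lebesgue decomposition of a BV function's distributional derivative. A sketch of the identity as the channels above describe it (the precise "Mesh Fundamental Identity" is defined in the linked *Structure Over Scale* paper):

```latex
% Derivative of a BV function f splits into three mutually singular parts:
% absolutely continuous (smooth), jump, and Cantor (singular-continuous).
Df \;=\; \underbrace{f'(x)\,dx}_{\text{AC: smooth distillation}}
\;+\; \underbrace{\sum_i \big(f(x_i^+) - f(x_i^-)\big)\,\delta_{x_i}}_{D^j f:\ \text{jump corrections}}
\;+\; \underbrace{D^c f}_{\text{Cantor: drift corrections}}
```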
## Architecture
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (Q) / 8 (KV), GQA |
| Intermediate Size | 6144 |
| Context Length | 40,960 tokens |
| Vocabulary | 151,936 |
| Precision | BF16 training (autocast), BF16/FP16 inference |
## Training
**Student:** Disctil-Qwen3-1.7B (DISC-refined uncensored Qwen3)
**Teacher:** Qwen3-30B-A3B-Thinking-2507
Datasets (physics CoT, ~1,599 samples):
- CoT Differential Equations (636 examples)
- CoT Theoretical Mechanics (307 examples)
- CoT Electromagnetism (580 examples)
- CoT General Relativity (76 examples)
**DualMind format**: each training sample is restructured into `<explore>` (derivation), `<examine>` (verification/self-critique), and `<response>` (clean answer) blocks. The model learns a cognitive loop: generate reasoning, then critique it, then synthesize.
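As a minimal sketch of the restructuring step (the three tags come from the card; the sample fields, template spacing, and `to_dualmind` name are assumptions for illustration):

```python
# Hypothetical sketch of DualMind restructuring. Only the <explore>/<examine>/
# <response> tags and the "##USER:" prefix are taken from the card; everything
# else here is an assumed template.
def to_dualmind(question: str, derivation: str, critique: str, answer: str) -> str:
    """Wrap one CoT sample into the explore -> examine -> response loop."""
    return (
        f"##USER:\n{question}\n\n"
        f"<explore>\n{derivation}\n</explore>\n"
        f"<examine>\n{critique}\n</examine>\n"
        f"<response>\n{answer}\n</response>"
    )

sample = to_dualmind(
    "Solve y'' + y = 0.",
    "Try y = e^(rt); the characteristic equation r^2 + 1 = 0 gives r = +/- i ...",
    "Check: both sin t and cos t satisfy the ODE; the Wronskian is nonzero.",
    "y(t) = C1 cos t + C2 sin t",
)
print(sample)
```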
### TKD Pipeline (4 phases)
**Phase 1: Teacher logit caching.** A single forward pass through the 30B teacher, with top-64 logit compression to disk. One pass; no repeated teacher inference.
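A minimal sketch of the caching step, assuming a top-k values-plus-indices layout (the card specifies k=64; the file format and the `compress_logits`/`expand_logits` helpers are illustrative, not the actual pipeline code):

```python
import torch

# Storing only the top-k values + indices per position shrinks a 151,936-vocab
# logit row by roughly 2000x. The on-disk layout below is an assumption.
def compress_logits(logits: torch.Tensor, k: int = 64):
    """logits: (seq_len, vocab) -> (values, indices), each (seq_len, k)."""
    values, indices = torch.topk(logits, k=k, dim=-1)
    return values, indices

def expand_logits(values, indices, vocab_size: int, fill: float = -1e4):
    """Rebuild a dense (approximate) logit tensor for the KD loss; entries
    outside the top-k are filled with a large negative constant."""
    dense = torch.full((values.shape[0], vocab_size), fill)
    dense.scatter_(-1, indices, values)
    return dense

# One pass over the teacher, then cache to disk:
teacher_logits = torch.randn(8, 151_936)  # stand-in for a real forward pass
vals, idx = compress_logits(teacher_logits)
torch.save({"values": vals, "indices": idx}, "/tmp/teacher_cache.pt")
```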
**Phase 2: DISC topology pass.** A vectorized discrepancy operator maps the knowledge manifold: jump detection at a 3σ threshold with 1.25× amplification, and gap energy density computed over 64-token windows.
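A sketch of the jump-detection arithmetic under the card's 3σ threshold and 1.25× amplifier, with a synthetic discrepancy signal standing in for the actual DISC operator:

```python
import numpy as np

# Positions where the per-token discrepancy changes by more than 3 standard
# deviations get their loss weight amplified by 1.25x. The discrepancy signal
# itself (normally produced by the DISC operator) is a stand-in here.
def jump_weights(discrepancy: np.ndarray, sigma_mult: float = 3.0,
                 amplifier: float = 1.25) -> np.ndarray:
    diffs = np.abs(np.diff(discrepancy, prepend=discrepancy[0]))
    threshold = sigma_mult * diffs.std()
    weights = np.ones_like(discrepancy)
    weights[diffs > threshold] = amplifier
    return weights

def gap_energy(discrepancy: np.ndarray, window: int = 64) -> np.ndarray:
    """Mean squared discrepancy over fixed 64-token windows."""
    n = len(discrepancy) // window * window
    return (discrepancy[:n] ** 2).reshape(-1, window).mean(axis=1)
```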
**Phase 3: Topology-guided adaptive windowing.** 512-token windows are cut at low-discrepancy positions (overlap 32–128) rather than at a fixed stride. The topology indicates where to cut without losing information across boundaries.
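The windowing rule can be sketched as follows; the window size and overlap band come from the card, while the band-search heuristic below is an assumption:

```python
import numpy as np

# Instead of a fixed stride, each next window starts inside the previous
# window's tail (the 32-128 token overlap band), at the position where the
# discrepancy signal is lowest, so cuts avoid conceptual boundaries.
def adaptive_windows(discrepancy: np.ndarray, window: int = 512,
                     min_overlap: int = 32, max_overlap: int = 128):
    windows, start = [], 0
    n = len(discrepancy)
    while True:
        end = min(start + window, n)
        windows.append((start, end))
        if end == n:
            break
        # Candidate band [end - max_overlap, end - min_overlap); pick the
        # lowest-discrepancy position as the next window's start.
        lo, hi = end - max_overlap, end - min_overlap
        start = lo + int(np.argmin(discrepancy[lo:hi]))
    return windows
```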
**Phase 4: Curriculum-ordered continuous KD.** A 4-phase curriculum (easiest 30% first). Proof-weighted loss: 2.25× → 1.1× decaying weights on reasoning tokens. KD alpha ramps from 0 → 0.45, starting at 15% of training and reaching the target at 45%. KL divergence at T=2.0. Effective batch size 32 (2 × 16 gradient accumulation). Cosine LR: 5e-6 → 5e-7.
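A sketch of the loss arithmetic this phase describes. The alpha ramp schedule, T=2.0, and proof weighting follow the card; the exact blending of KL and cross-entropy is an assumption:

```python
import torch
import torch.nn.functional as F

# Alpha ramps linearly from 0 to the 0.45 target between 15% and 45% of
# training progress, per the card.
def kd_alpha(progress: float, target: float = 0.45,
             start: float = 0.15, end: float = 0.45) -> float:
    if progress <= start:
        return 0.0
    if progress >= end:
        return target
    return target * (progress - start) / (end - start)

def kd_loss(student_logits, teacher_logits, labels, proof_weight,
            alpha: float, T: float = 2.0):
    """Per-token KL (temperature T, rescaled by T^2) blended with CE,
    then weighted by the proof-token weights (2.25x decaying to 1.1x)."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(-1) * (T * T)
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    per_token = alpha * kl + (1 - alpha) * ce
    return (proof_weight * per_token).mean()
```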
### Hyperparameters
| Parameter | Value |
|---|---|
| Effective batch size | 32 (2 × 16 accumulation) |
| Learning rate | 5e-6 → 5e-7 (cosine) |
| Warmup steps | 30 |
| Weight decay | 1e-3 |
| Gradient clip | 1.0 |
| Temperature | 2.0 |
| KD target Ξ± | 0.45 |
| Proof weight | 2.25× → 1.1× |
| Jump threshold | 3σ |
| Jump amplifier | 1.25× |
| Precision | BF16 (autocast) |
Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)
## Usage
The model responds in the DualMind format: `<explore>` → `<examine>` → `<response>`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/TopologicalQwen",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/TopologicalQwen")

# Prompt with DualMind format: start the explore block
prompt = (
    "##USER:\n"
    "Prove that every convergent sequence is a Cauchy sequence.\n\n"
    "<explore>\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.6,
    repetition_penalty=1.15,
)
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result)

# Verify mode transitions
assert "<explore>" in result and "</explore>" in result    # derivation
assert "<examine>" in result and "</examine>" in result    # self-critique
assert "<response>" in result and "</response>" in result  # clean answer
```
## What the Output Looks Like
```text
<explore>
[Unconstrained derivation: the model works through the proof freely]
</explore>
<examine>
[Adversarial self-critique: the model critiques its own derivation]
</examine>
<response>
[Clean final answer synthesized from the internal dialogue]
</response>
```
This is the multi-model collision array collapsed into a single architecture. The dialectical structure that produces novel insights from architectural diversity is recreated through role-conditioned generation on shared weights.
## Distillation Chain
```text
Qwen3-1.7B (base)
  → DiStil-Qwen3-1.7B-uncensored (uncensored SFT)
  → Disctil-Qwen3-1.7B (DISC refinement)
  → TopologicalQwen (TKD from 30B-Thinking teacher + DualMind format)  ← you are here
```
## What Makes This Different
The broader Convergent Intelligence portfolio (43 models, 12,000+ downloads) was trained on CPU at FP32 for a total compute cost of $24. That proves the methodology β structure beats scale.
This model is the exception. TopologicalQwen was trained on Colab H100 at BF16 precision with a 30B-parameter teacher. Same TKD methodology, premium compute. This is the DistilQwen collection's answer to "what happens when you give this pipeline real hardware?"
The result: a 1.7B model that exhibits dual-mental-modality reasoning (explore → examine → respond) with structural quality that standard distillation at any precision doesn't produce. The methodology is the constant; the hardware is the variable. Either way, the results shouldn't exist at this parameter count.
Every knowledge distillation method in the literature treats the teacher's output as a smooth function and minimizes KL divergence globally. This works for the easy parts β regions where the teacher's distribution varies slowly. But language has structure: topic shifts, reasoning mode transitions, register changes. At these boundaries, the teacher's distribution jumps. Standard KD averages across these jumps, teaching the student a blurred version of the teacher's actual knowledge.
TKD uses the DISC (Discrepancy Calculus) framework to detect these structural features before training, then allocates capacity and loss weight accordingly. The result is a student that preserves the teacher's structural understanding, not just its surface statistics.
The empirical evidence: this model at 1.7B consistently produces responses with structural reasoning quality that standard distillation at the same parameter count does not achieve.
## Related Models
| Model | Description | Downloads |
|---|---|---|
| Qwen3-1.7B-Thinking-Distil | TKD with Thinking teacher | 687 |
| LFM2.5-1.2B-Distilled-SFT | Cross-architecture TKD (LFM → Qwen) | 544 |
| Qwen3-1.7B-Coder-Distilled-SFT | TKD with Coder teacher | 508 |
DistilQwen Collection β Full proof-weighted distillation series (9 models)
## Citation
```bibtex
@misc{colca2026topologicalqwen,
  title={TopologicalQwen: Topology-Aware Knowledge Distillation via Bounded Variation Decomposition},
  author={Colca, Roy S.},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/TopologicalQwen},
  note={Convergent Intelligence LLC: Research Division}
}
```
## From the Convergent Intelligence Portfolio
DistilQwen Collection: our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU; this collection shows what happens when the methodology gets real hardware.
Top model: Qwen3-1.7B-Coder-Distilled-SFT β 508 downloads
Convergent Intelligence LLC: Research Division. *"Where classical analysis fails to see, we begin."*