reaperdoesntknow committed
Commit 2ea53cb · verified · 1 Parent(s): df37c1d

Update README.md

Files changed (1): README.md (+88 -12)

README.md CHANGED
@@ -44,22 +44,51 @@ Standard knowledge distillation only handles term (1). TKD captures all three.
  | Vocabulary | 151,936 |
  | Precision | FP32 training, BF16/FP16 inference |

- ## Training Methodology

- The TKD pipeline has four phases:

- **Phase 1 — Teacher logit caching:** Single forward pass through the teacher (Qwen3-30B-A3B) with top-k logit compression to disk. One pass, no repeated teacher inference.

- **Phase 2 — DISC topology pass:** Vectorized discrepancy operator maps the knowledge manifold, identifying where the teacher's distribution has structural features (jumps, drift, curvature).

- **Phase 3 — Topology-guided adaptive windowing:** Training windows cut at low-discrepancy positions rather than fixed stride. The topology tells you where to cut without losing information across boundaries.

- **Phase 4 — Curriculum-ordered continuous KD:** Belt-fed training with proof-weighted loss. 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. Proof weights amplify loss on reasoning-critical tokens.
  Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)

  ## Usage

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -70,14 +99,61 @@ model = AutoModelForCausalLM.from_pretrained(
  )
  tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/TopologicalQwen")

- messages = [{"role": "user", "content": "Derive the Euler-Lagrange equation from the principle of stationary action."}]
- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
- output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, top_p=0.9)
- print(tokenizer.decode(output[0], skip_special_tokens=True))
  ```

- ## Why Topology Matters

  Every knowledge distillation method in the literature treats the teacher's output as a smooth function and minimizes KL divergence globally. This works for the easy parts — regions where the teacher's distribution varies slowly. But language has structure: topic shifts, reasoning mode transitions, register changes. At these boundaries, the teacher's distribution jumps. Standard KD averages across these jumps, teaching the student a blurred version of the teacher's actual knowledge.

  | Vocabulary | 151,936 |
  | Precision | FP32 training, BF16/FP16 inference |

+ ## Training

+ **Student:** [Disctil-Qwen3-1.7B](https://huggingface.co/reaperdoesntknow/Disctil-Qwen3-1.7B) (DISC-refined uncensored Qwen3)
+ **Teacher:** Qwen3-30B-A3B-Thinking-2507

+ **Datasets (physics CoT, ~1,599 samples):**
+ - CoT Differential Equations (636 examples)
+ - CoT Theoretical Mechanics (307 examples)
+ - CoT Electromagnetism (580 examples)
+ - CoT General Relativity (76 examples)

+ **DualMind format** — each training sample is restructured into `<explore>` (derivation), `<examine>` (verification/self-critique), and `<response>` (clean answer) blocks. The model learns a cognitive loop: generate reasoning, then critique it, then synthesize.
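As a concrete illustration, a raw CoT sample could be wrapped into the three blocks like this. The tag names come from this README; the `to_dualmind` helper and its argument names are hypothetical, not the actual data-prep script:

```python
# Hypothetical helper (not the project's actual preprocessing code):
# wraps one training sample into the DualMind block structure.

def to_dualmind(derivation: str, critique: str, answer: str) -> str:
    """Restructure one sample into explore/examine/response blocks."""
    return (
        f"<explore>\n{derivation}\n</explore>\n\n"
        f"<examine>\n{critique}\n</examine>\n\n"
        f"<response>\n{answer}\n</response>"
    )

sample = to_dualmind(
    derivation="Apply the definition of convergence with epsilon/2...",
    critique="Check: does the bound hold for all n, m > N? Yes...",
    answer="Every convergent sequence is Cauchy.",
)
print(sample)
```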
+ ### TKD Pipeline (4 phases)

+ **Phase 1 — Teacher logit caching:** Single forward pass through the 30B teacher with top-64 logit compression to disk. One pass, no repeated teacher inference.

+ **Phase 2 — DISC topology pass:** Vectorized discrepancy operator maps the knowledge manifold. Jump detection at 3σ threshold with 1.25× amplification. Gap energy density computed over 64-token windows.

+ **Phase 3 — Topology-guided adaptive windowing:** 512-token windows cut at low-discrepancy positions (overlap 32–128) rather than fixed stride. The topology tells you where to cut without losing information across boundaries.

+ **Phase 4 — Curriculum-ordered continuous KD:** 4-phase curriculum (easiest 30% first). Proof-weighted loss: 2.25× → 1.1× decaying weights on reasoning tokens. KD alpha ramps from 0 → 0.45 (starting at 15% of training, reaching target at 45%). KL divergence at T=2.0. Effective batch size 32 (2 × 16 gradient accumulation). Cosine LR: 5e-6 → 5e-7.
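Phases 2 and 3 can be sketched roughly as follows. Only the numbers are grounded in this README (3σ jump threshold, 1.25× amplifier, 512-token windows, 32–128 overlap); the discrepancy signal and both function names are illustrative stand-ins, since the DISC operator itself is not specified here:

```python
import numpy as np

def detect_jumps(disc, sigma_thresh=3.0, amp=1.25):
    """Phase 2 sketch: flag positions where the discrepancy signal exceeds
    the 3-sigma threshold, then amplify those positions by 1.25x."""
    mu, sd = disc.mean(), disc.std()
    jumps = disc > mu + sigma_thresh * sd
    amplified = np.where(jumps, disc * amp, disc)
    return amplified, jumps

def cut_windows(disc, win=512, min_overlap=32, max_overlap=128):
    """Phase 3 sketch: place each window boundary at the lowest-discrepancy
    position inside the allowed overlap band, instead of a fixed stride."""
    cuts, pos, n = [0], 0, len(disc)
    while pos + win < n:
        lo, hi = pos + win - max_overlap, pos + win - min_overlap
        cuts.append(lo + int(np.argmin(disc[lo:hi])))  # flattest spot wins
        pos = cuts[-1]
    cuts.append(n)
    return cuts
```

Cutting where the signal is flattest means window boundaries fall between structural features rather than on top of them, which is the stated point of the topology pass.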
+ ### Hyperparameters

+ | Parameter | Value |
+ |-----------|-------|
+ | Effective batch size | 32 (2 × 16 accum) |
+ | Learning rate | 5e-6 → 5e-7 (cosine) |
+ | Warmup steps | 30 |
+ | Weight decay | 1e-3 |
+ | Gradient clip | 1.0 |
+ | Temperature | 2.0 |
+ | KD target α | 0.45 |
+ | Proof weight | 2.25 → 1.1 |
+ | Jump threshold | 3σ |
+ | Jump amplifier | 1.25× |
+ | Precision | BF16 (autocast) |

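Under the stated schedule (KD α ramping 0 → 0.45 between 15% and 45% of training, proof weights decaying 2.25× → 1.1×, T = 2.0), the Phase 4 objective might look roughly like this NumPy sketch. The actual training code is not part of this README, and the linear ramp and linear decay are assumptions:

```python
import numpy as np

def kd_alpha(progress, target=0.45, start=0.15, full=0.45):
    """KD mixing weight: 0 until 15% of training, then a linear ramp
    reaching the 0.45 target at 45% of training (assumed linear)."""
    if progress <= start:
        return 0.0
    return target * min(1.0, (progress - start) / (full - start))

def proof_weight(progress, w0=2.25, w1=1.1):
    """Multiplier on reasoning-critical tokens, decaying 2.25x -> 1.1x."""
    return w0 + (w1 - w0) * progress

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tkd_loss(student_logits, teacher_logits, target_ids, is_proof, progress, T=2.0):
    """Proof-weighted CE mixed with temperature-T KL to the teacher."""
    n = len(target_ids)
    p_s = softmax(student_logits)  # student dist at T=1 for the CE term
    ce = -np.log(p_s[np.arange(n), target_ids] + 1e-12)
    w = np.where(is_proof, proof_weight(progress), 1.0)
    ce = (w * ce).mean()
    p_t = softmax(teacher_logits, T)  # softened distributions for KL
    q_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(q_s + 1e-12))).sum(-1).mean()
    a = kd_alpha(progress)
    return (1 - a) * ce + a * (T ** 2) * kl
```

The T² factor on the KL term is the standard temperature scaling from Hinton-style distillation, which keeps the soft-target gradients comparable in magnitude to the hard-label term.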
  Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)

  ## Usage

+ The model responds in DualMind format: `<explore>` → `<examine>` → `<response>`.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  )
  tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/TopologicalQwen")

+ # Prompt with DualMind format — start the explore block
+ prompt = (
+     "##USER:\n"
+     "Prove that every convergent sequence is a Cauchy sequence.\n\n"
+     "<explore>\n"
+ )
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ output = model.generate(
+     **inputs, max_new_tokens=2048, do_sample=True,
+     top_p=0.9, temperature=0.6, repetition_penalty=1.15
+ )
+ result = tokenizer.decode(output[0], skip_special_tokens=True)
+ print(result)
+
+ # Verify mode transitions
+ assert "<explore>" in result and "</explore>" in result    # derivation
+ assert "<examine>" in result and "</examine>" in result    # self-critique
+ assert "<response>" in result and "</response>" in result  # clean answer
+ ```
+
+ ### What the Output Looks Like
+
  ```
+ <explore>
+ [Unconstrained derivation — the model works through the proof freely]
+ </explore>
+
+ <examine>
+ [Adversarial self-critique — the model critiques its own derivation]
+ </examine>
+
+ <response>
+ [Clean final answer synthesized from the internal dialogue]
+ </response>
+ ```
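Downstream code usually wants only the final answer, so a small parser for this block structure is handy. The tag names are from this README; the regex helper itself is illustrative:

```python
import re

def extract_block(text, tag):
    """Return the contents of the first <tag>...</tag> block, or None."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else None

generation = (
    "<explore>Assume a_n -> L; pick N for eps/2...</explore>\n"
    "<examine>The triangle inequality step checks out.</examine>\n"
    "<response>Every convergent sequence is Cauchy.</response>"
)
print(extract_block(generation, "response"))
# -> Every convergent sequence is Cauchy.
```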
+
+ This is the multi-model collision array collapsed into a single architecture. The dialectical structure that produces novel insights from architectural diversity is recreated through role-conditioned generation on shared weights.
+
+ ## Distillation Chain
+
+ ```
+ Qwen3-1.7B (base)
+ → DiStil-Qwen3-1.7B-uncensored (uncensored SFT)
+ → Disctil-Qwen3-1.7B (DISC refinement)
+ → TopologicalQwen (TKD from 30B-Thinking teacher + DualMind format) ← you are here
+ ```
+
+ ## What Makes This Different
+
+ The broader Convergent Intelligence portfolio ([43 models, 12,000+ downloads](https://huggingface.co/reaperdoesntknow)) was trained on CPU at FP32 for a total compute cost of $24. That proves the methodology — structure beats scale.
+
+ **This model is the exception.** TopologicalQwen was trained on a Colab H100 at BF16 precision with a 30B-parameter teacher. Same TKD methodology, premium compute. This is the DistilQwen collection's answer to "what happens when you give this pipeline real hardware?"
+
+ The result: a 1.7B model that exhibits dual-mental-modality reasoning (explore → examine → respond) with structural quality that standard distillation at any precision doesn't produce. The methodology is the constant; the hardware is the variable. Both produce results that shouldn't exist at this parameter count.

  Every knowledge distillation method in the literature treats the teacher's output as a smooth function and minimizes KL divergence globally. This works for the easy parts — regions where the teacher's distribution varies slowly. But language has structure: topic shifts, reasoning mode transitions, register changes. At these boundaries, the teacher's distribution jumps. Standard KD averages across these jumps, teaching the student a blurred version of the teacher's actual knowledge.