Standard knowledge distillation only handles term (1). TKD captures all three.
| Vocabulary | 151,936 |
| Precision | FP32 training, BF16/FP16 inference |

## Training

**Student:** [Disctil-Qwen3-1.7B](https://huggingface.co/reaperdoesntknow/Disctil-Qwen3-1.7B) (DISC-refined uncensored Qwen3)

**Teacher:** Qwen3-30B-A3B-Thinking-2507

**Datasets (physics CoT, 1,599 samples):**

- CoT Differential Equations (636 examples)
- CoT Theoretical Mechanics (307 examples)
- CoT Electromagnetism (580 examples)
- CoT General Relativity (76 examples)

**DualMind format** – each training sample is restructured into `<explore>` (derivation), `<examine>` (verification/self-critique), and `<response>` (clean answer) blocks. The model learns a cognitive loop: generate reasoning, then critique it, then synthesize.
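The block structure can be sketched as a simple round-trip, assuming only the three tag names above (`build_sample` and `parse_blocks` are illustrative helpers, not part of the released pipeline):

```python
import re

def build_sample(derivation, critique, answer):
    """Hypothetical helper: wrap the three stages in DualMind blocks."""
    return (f"<explore>\n{derivation}\n</explore>\n\n"
            f"<examine>\n{critique}\n</examine>\n\n"
            f"<response>\n{answer}\n</response>")

def parse_blocks(text):
    """Recover each block's body, keyed by tag name."""
    pattern = r"<(explore|examine|response)>\n?(.*?)\n?</\1>"
    return dict(re.findall(pattern, text, re.DOTALL))

sample = build_sample("d/dx x^2 = 2x", "Check: the power rule applies.", "2x")
assert parse_blocks(sample) == {
    "explore": "d/dx x^2 = 2x",
    "examine": "Check: the power rule applies.",
    "response": "2x",
}
```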

### TKD Pipeline (4 phases)
|
| 61 |
|
| 62 |
+
**Phase 1 β Teacher logit caching:** Single forward pass through the 30B teacher with top-64 logit compression to disk. One pass, no repeated teacher inference.
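The compression step can be sketched in NumPy, assuming a dense `(seq_len, vocab)` logit array from the teacher (the actual on-disk layout is not specified here):

```python
import numpy as np

def compress_logits(logits, k=64):
    """Keep only the top-k logits per position as (indices, values) pairs.

    logits: (seq_len, vocab_size) array from one teacher forward pass.
    """
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]   # top-k, unordered
    vals = np.take_along_axis(logits, idx, axis=-1)
    return idx.astype(np.int32), vals.astype(np.float16)

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 151936)).astype(np.float32)  # vocab size per the table above
idx, vals = compress_logits(teacher_logits)
assert idx.shape == (8, 64) and vals.shape == (8, 64)
```

Storing 64 `(int32, float16)` pairs instead of 151,936 floats per position is what makes a single cached pass practical.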

**Phase 2 – DISC topology pass:** Vectorized discrepancy operator maps the knowledge manifold. Jump detection at 3σ threshold with 1.25× amplification. Gap energy density computed over 64-token windows.
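A minimal sketch of the jump rule, assuming a per-token scalar discrepancy signal (the DISC operator itself is not reproduced; the signal below is synthetic):

```python
import numpy as np

def detect_jumps(discrepancy, n_sigma=3.0, amplifier=1.25):
    """Flag positions whose discrepancy exceeds mean + n_sigma * std,
    and amplify the loss weight at those positions by `amplifier`."""
    mu, sigma = discrepancy.mean(), discrepancy.std()
    jumps = discrepancy > mu + n_sigma * sigma
    weights = np.where(jumps, amplifier, 1.0)
    return jumps, weights

# Synthetic signal: smooth noise with one sharp boundary at the end
rng = np.random.default_rng(1)
signal = np.concatenate([rng.normal(0.0, 1.0, 500), [10.0]])
jumps, weights = detect_jumps(signal)
assert jumps[-1] and weights[-1] == 1.25
```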

**Phase 3 – Topology-guided adaptive windowing:** 512-token windows cut at low-discrepancy positions (overlap 32–128) rather than fixed stride. The topology tells you where to cut without losing information across boundaries.
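One way to read the windowing rule is a greedy cut search over the same discrepancy signal – an illustrative sketch, not the pipeline's exact code:

```python
import numpy as np

def adaptive_windows(discrepancy, window=512, min_overlap=32, max_overlap=128):
    """Greedy sketch of Phase 3: each next window starts at the
    lowest-discrepancy position inside the allowed overlap band."""
    starts, pos, n = [0], 0, len(discrepancy)
    while pos + window < n:
        lo = pos + window - max_overlap   # earliest allowed next start
        hi = pos + window - min_overlap   # latest allowed next start
        pos = lo + int(np.argmin(discrepancy[lo:hi]))
        starts.append(pos)
    return starts

disc = np.random.default_rng(2).random(4096)
starts = adaptive_windows(disc)
# Consecutive windows overlap by 32-128 tokens
assert all(32 <= 512 - (b - a) <= 128 for a, b in zip(starts, starts[1:]))
```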

**Phase 4 – Curriculum-ordered continuous KD:** 4-phase curriculum (easiest 30% first). Proof-weighted loss: 2.25× → 1.1× decaying weights on reasoning tokens. KD alpha ramps from 0 → 0.45 (starting at 15% of training, reaching target at 45%). KL divergence at T=2.0. Effective batch size 32 (2 × 16 grad accumulation). Cosine LR: 5e-6 → 5e-7.
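The loss mix and alpha ramp can be sketched with the values above (T=2.0, target α=0.45, ramp over 15%–45% of training); the exact loss formulation used in training is an assumption:

```python
import numpy as np

def kd_alpha(progress, ramp_start=0.15, ramp_end=0.45, target=0.45):
    """KD mixing weight: 0 until 15% of training, then a linear ramp
    reaching the 0.45 target at 45% (values from the README)."""
    frac = (progress - ramp_start) / (ramp_end - ramp_start)
    return target * float(np.clip(frac, 0.0, 1.0))

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, targets, progress, T=2.0):
    """alpha * T^2 * KL(teacher || student at temperature T) + (1 - alpha) * CE."""
    p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(-1).mean()
    ce = -np.log(softmax(student_logits)[np.arange(len(targets)), targets]).mean()
    a = kd_alpha(progress)
    return a * T * T * kl + (1 - a) * ce

rng = np.random.default_rng(3)
loss = kd_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10)),
               targets=np.array([0, 1, 2, 3]), progress=0.30)
assert kd_alpha(0.10) == 0.0 and np.isclose(kd_alpha(0.45), 0.45)
assert np.isfinite(loss)
```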

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Effective batch size | 32 (2 × 16 accum) |
| Learning rate | 5e-6 → 5e-7 (cosine) |
| Warmup steps | 30 |
| Weight decay | 1e-3 |
| Gradient clip | 1.0 |
| Temperature | 2.0 |
| KD target α | 0.45 |
| Proof weight | 2.25 → 1.1 |
| Jump threshold | 3σ |
| Jump amplifier | 1.25× |
| Precision | BF16 (autocast) |
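The schedule rows can be sketched as one function; the linear warmup shape is an assumption, since only the step count is given:

```python
import math

def lr_at(step, total_steps, lr_max=5e-6, lr_min=5e-7, warmup=30):
    """30-step linear warmup, then cosine decay 5e-6 -> 5e-7 (table values)."""
    if step < warmup:
        return lr_max * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

assert abs(lr_at(30, 1000) - 5e-6) < 1e-12   # peak right after warmup
assert abs(lr_at(1000, 1000) - 5e-7) < 1e-12 # floor at the end
```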

Full methodology: [Structure Over Scale (DOI: 10.57967/hf/8165)](https://doi.org/10.57967/hf/8165)

## Usage

The model responds in DualMind format: `<explore>` → `<examine>` → `<response>`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/TopologicalQwen"
)
tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/TopologicalQwen")

# Prompt with DualMind format – start the explore block
prompt = (
    "##USER:\n"
    "Prove that every convergent sequence is a Cauchy sequence.\n\n"
    "<explore>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    top_p=0.9, temperature=0.6, repetition_penalty=1.15
)
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result)

# Verify mode transitions
assert "<explore>" in result and "</explore>" in result      # derivation
assert "<examine>" in result and "</examine>" in result      # self-critique
assert "<response>" in result and "</response>" in result    # clean answer
```

### What the Output Looks Like

```
<explore>
[Unconstrained derivation – the model works through the proof freely]
</explore>

<examine>
[Adversarial self-critique – the model critiques its own derivation]
</examine>

<response>
[Clean final answer synthesized from the internal dialogue]
</response>
```

This is the multi-model collision array collapsed into a single architecture. The dialectical structure that produces novel insights from architectural diversity is recreated through role-conditioned generation on shared weights.

## Distillation Chain

```
Qwen3-1.7B (base)
  → DiStil-Qwen3-1.7B-uncensored (uncensored SFT)
  → Disctil-Qwen3-1.7B (DISC refinement)
  → TopologicalQwen (TKD from 30B-Thinking teacher + DualMind format)  ← you are here
```

## What Makes This Different

The broader Convergent Intelligence portfolio ([43 models, 12,000+ downloads](https://huggingface.co/reaperdoesntknow)) was trained on CPU at FP32 for a total compute cost of $24. That proves the methodology – structure beats scale.

**This model is the exception.** TopologicalQwen was trained on a Colab H100 at BF16 precision with a 30B-parameter teacher. Same TKD methodology, premium compute. This is the DistilQwen collection's answer to "what happens when you give this pipeline real hardware?"

The result: a 1.7B model that exhibits dual-mental-modality reasoning (explore → examine → respond) with structural quality that standard distillation at any precision doesn't produce. The methodology is the constant; the hardware is the variable. Both regimes produce results that shouldn't exist at this parameter count.

Every knowledge distillation method in the literature treats the teacher's output as a smooth function and minimizes KL divergence globally. This works for the easy parts – regions where the teacher's distribution varies slowly. But language has structure: topic shifts, reasoning-mode transitions, register changes. At these boundaries, the teacher's distribution jumps. Standard KD averages across these jumps, teaching the student a blurred version of the teacher's actual knowledge.