# JuliaFluxGPT-distilled
Cross-species knowledge distillation: two JuliaFluxGPT siblings, v1 (JuliaSLM fusion, val_loss=3.698) and v2 (Pythia-14m fusion, val_loss=3.873), serve as co-teachers for a student model. The student inherits v1's perplexity advantage and v2's linguistic quality (superior grammar, coherence, and syntactic complexity).
## Why Distillation?
Weight-level fusion between v1 and v2 fails catastrophically: even a 50/50 alpha blend produces loss=10.3. Evolutionary per-layer search, SLERP, and Kuramoto sync all fail. The models occupy separate loss basins after being fused from different parent species (JuliaSLM vs Pythia-14m). Knowledge distillation bypasses this by operating on output distributions instead of weight matrices.
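The failed weight-space merge amounts to elementwise interpolation of the two checkpoints' parameters. A minimal sketch of that alpha blend, with plain Python dicts and scalar values standing in for PyTorch state dicts (key names are illustrative, not the actual JuliaFluxGPT parameter names):

```python
def alpha_blend(sd_v1, sd_v2, alpha=0.5):
    """Elementwise interpolation: alpha * v1 + (1 - alpha) * v2.

    This is the merge that collapses (loss=10.3 at alpha=0.5) when the
    two parents sit in separate loss basins, as v1 and v2 do.
    """
    assert sd_v1.keys() == sd_v2.keys(), "architectures must match"
    return {k: alpha * sd_v1[k] + (1 - alpha) * sd_v2[k] for k in sd_v1}

# Toy example with scalar "weights" (real state dicts hold tensors):
v1 = {"embed.weight": 1.0, "lm_head.weight": 3.0}
v2 = {"embed.weight": 2.0, "lm_head.weight": 5.0}
blended = alpha_blend(v1, v2, alpha=0.5)
```

Interpolating between basins like this lands the merged weights in a high-loss region between them, which is why the output-distribution route below was needed.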
## Architecture
| Component | Value |
|---|---|
| Parameters | ~23M |
| Embedding dim | 512 |
| Layers | 8 |
| Attention | GQA (8 query, 2 KV heads) |
| Head dim | 64 |
| FFN | SwiGLU (1344 inner) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (base=10000) |
| Context length | 256 |
| Vocab | 2000 (BPE) |
| Weight tying | Yes (embedding = output projection) |
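The table above can be captured as a config object. A sketch with hypothetical field names (these are not the actual `JuliaFluxGPT` constructor arguments), showing how the head dim and GQA grouping follow from the other values:

```python
from dataclasses import dataclass

@dataclass
class JuliaFluxConfig:
    # Values from the Architecture table; field names are illustrative.
    d_model: int = 512
    n_layers: int = 8
    n_query_heads: int = 8      # GQA: 8 query heads...
    n_kv_heads: int = 2         # ...sharing 2 KV heads (4 queries per KV head)
    ffn_inner: int = 1344       # SwiGLU hidden size
    rope_base: float = 10000.0
    context_length: int = 256
    vocab_size: int = 2000
    tie_weights: bool = True    # embedding matrix reused as output projection

cfg = JuliaFluxConfig()
head_dim = cfg.d_model // cfg.n_query_heads   # 512 / 8 = 64, matching the table
queries_per_kv = cfg.n_query_heads // cfg.n_kv_heads
```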
## Training
| Setting | Value |
|---|---|
| Method | Knowledge distillation (warm start) |
| Teacher 1 | JuliaFluxGPT-fused v1 (JuliaSLM fusion, val_loss=3.698) |
| Teacher 2 | JuliaFluxGPT-fused v2 (Pythia fusion, val_loss=3.873) |
| Student init | v1 weights (warm start) |
| Loss | 0.35 KL(s‖v1) + 0.35 KL(s‖v2) + 0.30 CE(s, targets) |
| Temperature | 3.0 |
| Steps | 3000 (best at step 2600) |
| LR | 3e-4 (cosine decay) |
| Optimizer | AdamW (weight_decay=0.01) |
| Val loss | 3.687 (beats both parents) |
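The loss row combines two temperature-softened KL terms against the teachers with a hard-target cross-entropy term. A minimal pure-Python sketch over a single token position, using the table's weights (0.35/0.35/0.30), temperature (3.0), and KL direction (KL(s‖teacher)); real training applies this per position over batched logits, and many KD recipes additionally scale the KL terms by T², which the table does not specify and is therefore omitted:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; T=3.0 flattens the distributions.
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(s_logits, v1_logits, v2_logits, target, T=3.0):
    s_soft = softmax(s_logits, T)
    kd1 = kl(s_soft, softmax(v1_logits, T))   # KL(s || v1)
    kd2 = kl(s_soft, softmax(v2_logits, T))   # KL(s || v2)
    ce = -math.log(softmax(s_logits)[target]) # hard-label CE at T=1
    return 0.35 * kd1 + 0.35 * kd2 + 0.30 * ce

# Toy 4-token vocab: student close to teacher 1, further from teacher 2.
loss = distill_loss([2.0, 1.0, 0.1, -1.0],
                    [2.1, 0.9, 0.0, -1.0],
                    [0.5, 2.0, 0.3, -0.5],
                    target=0)
```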
## Scaling Context
| Model | Params | d_model | Val Loss | Method |
|---|---|---|---|---|
| MicroJulia | 1M | 192 | – | Baseline |
| JuliaSLM | 5M | 256 | 3.54 | Baseline |
| SymbioSLM | 5M | 256 | 3.48 | Multi-organelle |
| MonarchSLM | 5M | 256 | 3.51 | Monarch matrices |
| JuliaFluxGPT-fused (v1) | 23M | 512 | 3.698 | JuliaSLM fusion |
| JuliaFluxGPT-fused-v2 | 23M | 512 | 3.873 | Pythia fusion |
| JuliaFluxGPT-distilled | 23M | 512 | 3.687 | v1+v2 distillation |
## Files
| File | Description |
|---|---|
| `juliaflux_distilled_warm_best.pt` | Best checkpoint (step 2600, val_loss=3.687) |
| `juliaflux_model.py` | Model definition (`JuliaFluxGPT` class) |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
## Links