JuliaFluxGPT-distilled

Cross-species knowledge distillation: two JuliaFluxGPT siblings, v1 (JuliaSLM fusion, val_loss=3.687) and v2 (Pythia-14m fusion, val_loss=3.856), serve as co-teachers for a student model. The student inherits v1's perplexity advantage and v2's linguistic quality (superior grammar, coherence, and syntactic complexity).

Why Distillation?

Weight-level fusion between v1 and v2 fails catastrophically: even a 50/50 alpha blend produces loss=10.3. Evolutionary per-layer search, SLERP, and Kuramoto synchronization all fail as well. Because they were fused from different parent species (JuliaSLM vs. Pythia-14m), the two models occupy separate loss basins. Knowledge distillation bypasses this by operating on output distributions instead of weight matrices.
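The failed alpha-blend baseline is just linear interpolation of corresponding weight tensors. A minimal sketch of that idea (toy NumPy dicts, not the actual fusion code):

```python
import numpy as np

def alpha_blend(state_a, state_b, alpha=0.5):
    """Linearly interpolate two state dicts of matching weight tensors.

    This is the 50/50 blend described above. When the two models sit in
    separate loss basins, the midpoint between basins is not itself a
    low-loss region, which is why the blended model's loss blows up.
    """
    return {name: alpha * state_a[name] + (1.0 - alpha) * state_b[name]
            for name in state_a}

# Toy illustration: two incompatible "solutions" for the same layer
# can average to something that resembles neither.
w_v1 = {"ffn.w1": np.array([[1.0, -2.0], [0.5, 3.0]])}
w_v2 = {"ffn.w1": np.array([[-1.0, 2.0], [-0.5, -3.0]])}
blended = alpha_blend(w_v1, w_v2, alpha=0.5)  # midpoint is all zeros here
```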

Architecture

| Component | Value |
|---|---|
| Parameters | ~23M |
| Embedding dim | 512 |
| Layers | 8 |
| Attention | GQA (8 query, 2 KV heads) |
| Head dim | 64 |
| FFN | SwiGLU (1344 inner) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (base=10000) |
| Context length | 256 |
| Vocab | 2000 (BPE) |
| Weight tying | Yes (embedding = output projection) |
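The table above can be captured as a config object. A sketch with illustrative field names (the actual JuliaFluxGPT class in juliaflux_model.py may name things differently):

```python
from dataclasses import dataclass

@dataclass
class JuliaFluxConfig:
    # Values taken from the architecture table; field names are assumptions.
    d_model: int = 512
    n_layers: int = 8
    n_query_heads: int = 8       # GQA: 8 query heads...
    n_kv_heads: int = 2          # ...sharing 2 key/value heads (groups of 4)
    ffn_inner: int = 1344        # SwiGLU hidden size
    rope_base: float = 10000.0
    context_length: int = 256
    vocab_size: int = 2000       # BPE
    tie_embeddings: bool = True  # embedding matrix reused as output projection

    @property
    def head_dim(self) -> int:
        # 512 / 8 = 64, matching the table
        return self.d_model // self.n_query_heads

cfg = JuliaFluxConfig()
```

With GQA, the query-head count must be a multiple of the KV-head count so each KV head serves an equal group of query heads (here, groups of 4).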

Training

| Setting | Value |
|---|---|
| Method | Knowledge distillation (warm start) |
| Teacher 1 | JuliaFluxGPT-fused v1 (JuliaSLM fusion, val_loss=3.687) |
| Teacher 2 | JuliaFluxGPT-fused v2 (Pythia fusion, val_loss=3.856) |
| Student init | v1 weights (warm start) |
| Loss | 0.35 KL(s\|\|v1) + 0.35 KL(s\|\|v2) + 0.30 CE(s, targets) |
| Temperature | 3.0 |
| Steps | 3000 (best at step 2600) |
| LR | 3e-4 (cosine decay) |
| Optimizer | AdamW (weight_decay=0.01) |
| Val loss | 3.687 (beats both parents) |
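The composite loss (two temperature-softened KL terms plus hard-label cross-entropy) can be sketched in NumPy. The KL direction follows the card's notation KL(s||teacher); the T² rescaling of the KL terms is the standard Hinton convention and is an assumption here, since the card only states T=3.0:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis, numerically stabilized.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q) for rows of probability distributions.
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def distill_loss(s_logits, v1_logits, v2_logits, targets, T=3.0):
    """0.35*KL(s||v1) + 0.35*KL(s||v2) + 0.30*CE, per the training table.

    T**2 scaling keeps the soft-target gradients comparable in magnitude
    to the CE term (a common convention, assumed rather than stated).
    """
    s_soft = softmax(s_logits, T)
    kd1 = kl(s_soft, softmax(v1_logits, T)).mean() * T * T
    kd2 = kl(s_soft, softmax(v2_logits, T)).mean() * T * T
    # Hard-label cross-entropy at T=1 against the ground-truth tokens.
    ce = -np.log(softmax(s_logits)[np.arange(len(targets)), targets]).mean()
    return 0.35 * kd1 + 0.35 * kd2 + 0.30 * ce
```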

Scaling Context

| Model | Params | d_model | Val Loss | Method |
|---|---|---|---|---|
| MicroJulia | 1M | 192 | – | Baseline |
| JuliaSLM | 5M | 256 | 3.54 | Baseline |
| SymbioSLM | 5M | 256 | 3.48 | Multi-organelle |
| MonarchSLM | 5M | 256 | 3.51 | Monarch matrices |
| JuliaFluxGPT-fused (v1) | 23M | 512 | 3.698 | JuliaSLM fusion |
| JuliaFluxGPT-fused-v2 | 23M | 512 | 3.873 | Pythia fusion |
| JuliaFluxGPT-distilled | 23M | 512 | 3.687 | v1+v2 distillation |

Files

| File | Description |
|---|---|
| juliaflux_distilled_warm_best.pt | Best checkpoint (step 2600, val_loss=3.687) |
| juliaflux_model.py | Model definition (JuliaFluxGPT class) |
| vocab.json | BPE vocabulary (2000 tokens) |
| merges.txt | BPE merge rules |

Evaluation results

  • Val Loss on Classical Philosophy Corpus (266M tokens): 3.687 (self-reported)
  • Perplexity on Classical Philosophy Corpus (266M tokens): 39.900 (self-reported)
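The two self-reported numbers are mutually consistent: perplexity is the exponential of the per-token validation loss. A quick sanity check (not from the card):

```python
import math

val_loss = 3.687
perplexity = math.exp(val_loss)  # ~39.9, in line with the reported 39.900
```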