---
language: en
license: mit
pipeline_tag: text-generation
tags:
- pytorch
- llama-style
- rope
- swiglu
- gqa
- rmsnorm
- bpe
- philosophy
- openai-compatible
- symbiogenesis
- distillation
- cross-species
model-index:
- name: JuliaFluxGPT-distilled
  results:
  - task:
      type: text-generation
      name: Philosophy Text Generation
    dataset:
      type: custom
      name: Classical Philosophy Corpus (266M tokens)
    metrics:
    - type: loss
      name: Val Loss
      value: 3.687
    - type: perplexity
      name: Perplexity
      value: 39.9
---
# JuliaFluxGPT-distilled
Cross-species knowledge distillation: two JuliaFluxGPT siblings — v1 (JuliaSLM fusion, val_loss=3.687) and v2 (Pythia-14m fusion, val_loss=3.856) — serve as co-teachers for a student model. The student inherits v1's perplexity advantage and v2's linguistic quality (superior grammar, coherence, and syntactic complexity).
## Why Distillation?
Weight-level fusion between v1 and v2 fails catastrophically: even a 50/50 alpha blend produces loss=10.3. Evolutionary per-layer search, SLERP, and Kuramoto sync all fail. The two models occupy separate loss basins, having been fused from different parent species (JuliaSLM vs Pythia-14m). Knowledge distillation bypasses this by operating on output distributions instead of weight matrices.
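For context, the 50/50 alpha blend referred to above is just a linear interpolation of the two checkpoints' weights. A minimal sketch, assuming both state dicts share identical keys and tensor shapes (the checkpoint names in the usage comment are illustrative, not the released files):

```python
import torch

def alpha_blend(state_dict_a, state_dict_b, alpha=0.5):
    """Linearly interpolate two state dicts: alpha * A + (1 - alpha) * B.

    This is the kind of weight-level fusion described above, which collapses
    (loss ~10.3) for v1 + v2 because the models sit in separate loss basins.
    """
    blended = {}
    for key, tensor_a in state_dict_a.items():
        tensor_b = state_dict_b[key]
        blended[key] = alpha * tensor_a + (1.0 - alpha) * tensor_b
    return blended

# Hypothetical usage:
# sd_v1 = torch.load("juliaflux_fused_v1.pt", map_location="cpu")
# sd_v2 = torch.load("juliaflux_fused_v2.pt", map_location="cpu")
# student.load_state_dict(alpha_blend(sd_v1, sd_v2, alpha=0.5))
```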
## Architecture

| Component | Value |
|---|---|
| Parameters | ~23M |
| Embedding dim | 512 |
| Layers | 8 |
| Attention | GQA (8 query heads, 2 KV heads) |
| Head dim | 64 |
| FFN | SwiGLU (inner dim 1344) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (base=10000) |
| Context length | 256 |
| Vocab | 2000 (BPE) |
| Weight tying | Yes (embedding = output projection) |
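The table above can also be read as a configuration object. The sketch below uses hypothetical field names (they are not the actual constructor arguments of the JuliaFluxGPT class); the values simply mirror the table:

```python
from dataclasses import dataclass

@dataclass
class JuliaFluxConfig:
    # Hypothetical field names; values mirror the architecture table above.
    vocab_size: int = 2000        # BPE vocabulary
    d_model: int = 512            # embedding dimension
    n_layers: int = 8
    n_heads: int = 8              # query heads
    n_kv_heads: int = 2           # GQA: 8 query heads share 2 KV heads
    head_dim: int = 64            # d_model / n_heads
    ffn_inner: int = 1344         # SwiGLU inner dimension
    rope_base: float = 10000.0    # RoPE theta
    max_seq_len: int = 256        # context length
    tie_weights: bool = True      # embedding shared with output projection
```

These values work out to roughly 23M parameters: about 2.7M per layer (attention plus SwiGLU) times 8 layers, plus a tied embedding of about 1M.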
## Training

| Setting | Value |
|---|---|
| Method | Knowledge distillation (warm start) |
| Teacher 1 | JuliaFluxGPT-fused v1 (JuliaSLM fusion, val_loss=3.687) |
| Teacher 2 | JuliaFluxGPT-fused v2 (Pythia fusion, val_loss=3.856) |
| Student init | v1 weights (warm start) |
| Loss | 0.35 KL(s\|\|v1) + 0.35 KL(s\|\|v2) + 0.30 CE(s, targets) |
| Temperature | 3.0 |
| Steps | 3000 (best at step 2600) |
| LR | 3e-4 (cosine decay) |
| Optimizer | AdamW (weight_decay=0.01) |
| Val loss | 3.687 (beats both parents) |
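The loss row above combines soft targets from both teachers with the hard cross-entropy term. Below is a minimal PyTorch sketch, assuming logits of shape (batch, seq, vocab) and the conventional KD direction (teacher distribution as the kl_div target); the exact KL direction and reduction used in the actual training run are not specified here:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, t1_logits, t2_logits, targets,
                 temperature=3.0, w_t1=0.35, w_t2=0.35, w_ce=0.30):
    """Co-teacher distillation sketch: 0.35*KL(teacher1) + 0.35*KL(teacher2) + 0.30*CE.

    Logits are (batch, seq, vocab); targets are (batch, seq) token ids.
    KL terms use temperature-softened distributions with the standard T^2
    scaling so their gradients stay comparable to the CE term.
    """
    T = temperature
    vocab = student_logits.size(-1)
    s_log_probs = F.log_softmax(student_logits / T, dim=-1).reshape(-1, vocab)

    def kl_term(teacher_logits):
        t_probs = F.softmax(teacher_logits / T, dim=-1).reshape(-1, vocab)
        # Teacher distribution as the target of kl_div, scaled by T^2.
        return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * (T * T)

    ce = F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1))
    return w_t1 * kl_term(t1_logits) + w_t2 * kl_term(t2_logits) + w_ce * ce
```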
## Scaling Context
| Model | Params | d_model | Val Loss | Method |
|---|---|---|---|---|
| MicroJulia | 1M | 192 | — | Baseline |
| JuliaSLM | 5M | 256 | 3.54 | Baseline |
| SymbioSLM | 5M | 256 | 3.48 | Multi-organelle |
| MonarchSLM | 5M | 256 | 3.51 | Monarch matrices |
| JuliaFluxGPT-fused (v1) | 23M | 512 | 3.698 | JuliaSLM fusion |
| JuliaFluxGPT-fused-v2 | 23M | 512 | 3.873 | Pythia fusion |
| JuliaFluxGPT-distilled | 23M | 512 | 3.687 | v1+v2 distillation |
## Files
| File | Description |
|---|---|
| `juliaflux_distilled_warm_best.pt` | Best checkpoint (step 2600, val_loss=3.687) |
| `juliaflux_model.py` | Model definition (JuliaFluxGPT class) |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
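A hedged loading sketch using the files above: it assumes the checkpoint is a plain (possibly wrapped) state dict and that JuliaFluxGPT can be constructed without arguments, which may not match the real signature in juliaflux_model.py; the tokenizer step uses the Hugging Face tokenizers library purely to illustrate consuming vocab.json and merges.txt, and the repo's BPE may not be byte-level.

```python
import torch
from juliaflux_model import JuliaFluxGPT  # model definition from this repo

# NOTE: constructor arguments are assumed; check juliaflux_model.py for the real signature.
model = JuliaFluxGPT()

state = torch.load("juliaflux_distilled_warm_best.pt", map_location="cpu")
# Some checkpoints wrap the weights, e.g. {"model": ...}; adjust the key if needed.
if isinstance(state, dict) and "model" in state:
    state = state["model"]
model.load_state_dict(state)
model.eval()

# Illustrative tokenizer loading; the repo's BPE scheme may differ.
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
ids = tokenizer.encode("What is the nature of virtue?").ids
print(ids[:16])
```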
## Links
- Inference Space: LisaMegaWatts/JuliaFluxGPT-distilled
- Parent v1: LisaMegaWatts/JuliaFluxGPT-fused
- Parent v2: LisaMegaWatts/JuliaFluxGPT-fused-v2
- Source code: DavinciDreams/SymbioGPT
- W&B project: symbiogenesis