---
language: en
license: mit
pipeline_tag: text-generation
tags:
- pytorch
- llama-style
- rope
- swiglu
- gqa
- rmsnorm
- bpe
- philosophy
- openai-compatible
- symbiogenesis
- distillation
- cross-species
model-index:
- name: JuliaFluxGPT-distilled
  results:
  - task:
      type: text-generation
      name: Philosophy Text Generation
    dataset:
      type: custom
      name: Classical Philosophy Corpus (266M tokens)
    metrics:
    - type: loss
      name: Val Loss
      value: 3.687
    - type: perplexity
      name: Perplexity
      value: 39.9
---
# JuliaFluxGPT-distilled
Cross-species knowledge distillation: two JuliaFluxGPT siblings — v1 (JuliaSLM fusion, val_loss=3.687) and v2 (Pythia-14m fusion, val_loss=3.856) — serve as co-teachers for a student model. The student inherits v1's perplexity advantage and v2's linguistic quality (superior grammar, coherence, and syntactic complexity).
## Why Distillation?
Weight-level fusion between v1 and v2 fails catastrophically: even a 50/50 alpha blend produces loss=10.3. Evolutionary per-layer search, SLERP, and Kuramoto sync all fail. The two models occupy separate loss basins, having been fused from different parent species (JuliaSLM vs Pythia-14m). Knowledge distillation bypasses this by operating on output distributions instead of weight matrices.
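For context, the 50/50 alpha blend referred to above is just a linear interpolation of the two checkpoints' weights. A minimal sketch, assuming both state dicts share identical keys and tensor shapes (the checkpoint names in the usage comment are illustrative, not the released files):

```python
import torch

def alpha_blend(state_dict_a, state_dict_b, alpha=0.5):
    """Linearly interpolate two state dicts: alpha * A + (1 - alpha) * B.

    This is the kind of weight-level fusion described above, which collapses
    (loss ~10.3) for v1 + v2 because the models sit in separate loss basins.
    """
    blended = {}
    for key, tensor_a in state_dict_a.items():
        tensor_b = state_dict_b[key]
        blended[key] = alpha * tensor_a + (1.0 - alpha) * tensor_b
    return blended

# Hypothetical usage:
# sd_v1 = torch.load("juliaflux_fused_v1.pt", map_location="cpu")
# sd_v2 = torch.load("juliaflux_fused_v2.pt", map_location="cpu")
# student.load_state_dict(alpha_blend(sd_v1, sd_v2, alpha=0.5))
```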
## Architecture

| Component | Value |
|---|---|
| Parameters | ~23M |
| Embedding dim | 512 |
| Layers | 8 |
| Attention | GQA (8 query heads, 2 KV heads) |
| Head dim | 64 |
| FFN | SwiGLU (inner dim 1344) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (base=10000) |
| Context length | 256 |
| Vocab | 2000 (BPE) |
| Weight tying | Yes (embedding = output projection) |
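The table above can also be read as a configuration object. The sketch below uses hypothetical field names (they are not the actual constructor arguments of the JuliaFluxGPT class); the values simply mirror the table:

```python
from dataclasses import dataclass

@dataclass
class JuliaFluxConfig:
    # Hypothetical field names; values mirror the architecture table above.
    vocab_size: int = 2000        # BPE vocabulary
    d_model: int = 512            # embedding dimension
    n_layers: int = 8
    n_heads: int = 8              # query heads
    n_kv_heads: int = 2           # GQA: 8 query heads share 2 KV heads
    head_dim: int = 64            # d_model / n_heads
    ffn_inner: int = 1344         # SwiGLU inner dimension
    rope_base: float = 10000.0    # RoPE theta
    max_seq_len: int = 256        # context length
    tie_weights: bool = True      # embedding shared with output projection
```

These values work out to roughly 23M parameters: about 2.7M per layer (attention plus SwiGLU) times 8 layers, plus a tied embedding of about 1M.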
## Training

| Setting | Value |
|---|---|
| Method | Knowledge distillation (warm start) |
| Teacher 1 | JuliaFluxGPT-fused v1 (JuliaSLM fusion, val_loss=3.687) |
| Teacher 2 | JuliaFluxGPT-fused v2 (Pythia fusion, val_loss=3.856) |
| Student init | v1 weights (warm start) |
| Loss | 0.35 KL(s\|\|v1) + 0.35 KL(s\|\|v2) + 0.30 CE(s, targets) |
| Temperature | 3.0 |
| Steps | 3000 (best at step 2600) |
| LR | 3e-4 (cosine decay) |
| Optimizer | AdamW (weight_decay=0.01) |
| Val loss | 3.687 (beats both parents) |
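The loss row above combines soft targets from both teachers with the hard cross-entropy term. Below is a minimal PyTorch sketch, assuming logits of shape (batch, seq, vocab) and the conventional KD direction (teacher distribution as the kl_div target); the exact KL direction and reduction used in the actual training run are not specified here:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, t1_logits, t2_logits, targets,
                 temperature=3.0, w_t1=0.35, w_t2=0.35, w_ce=0.30):
    """Co-teacher distillation sketch: 0.35*KL(teacher1) + 0.35*KL(teacher2) + 0.30*CE.

    Logits are (batch, seq, vocab); targets are (batch, seq) token ids.
    KL terms use temperature-softened distributions with the standard T^2
    scaling so their gradients stay comparable to the CE term.
    """
    T = temperature
    vocab = student_logits.size(-1)
    s_log_probs = F.log_softmax(student_logits / T, dim=-1).reshape(-1, vocab)

    def kl_term(teacher_logits):
        t_probs = F.softmax(teacher_logits / T, dim=-1).reshape(-1, vocab)
        # Teacher distribution as the target of kl_div, scaled by T^2.
        return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * (T * T)

    ce = F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1))
    return w_t1 * kl_term(t1_logits) + w_t2 * kl_term(t2_logits) + w_ce * ce
```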
## Scaling Context
| Model | Params | d_model | Val Loss | Method |
|---|---|---|---|---|
| MicroJulia | 1M | 192 | — | Baseline |
| JuliaSLM | 5M | 256 | 3.54 | Baseline |
| SymbioSLM | 5M | 256 | 3.48 | Multi-organelle |
| MonarchSLM | 5M | 256 | 3.51 | Monarch matrices |
| JuliaFluxGPT-fused (v1) | 23M | 512 | 3.698 | JuliaSLM fusion |
| JuliaFluxGPT-fused-v2 | 23M | 512 | 3.873 | Pythia fusion |
| JuliaFluxGPT-distilled | 23M | 512 | 3.687 | v1+v2 distillation |
## Files
| File | Description |
|---|---|
| `juliaflux_distilled_warm_best.pt` | Best checkpoint (step 2600, val_loss=3.687) |
| `juliaflux_model.py` | Model definition (JuliaFluxGPT class) |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
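A hedged loading sketch using the files above: it assumes the checkpoint is a plain (possibly wrapped) state dict and that JuliaFluxGPT can be constructed without arguments, which may not match the real signature in juliaflux_model.py; the tokenizer step uses the Hugging Face tokenizers library purely to illustrate consuming vocab.json and merges.txt, and the repo's BPE may not be byte-level.

```python
import torch
from juliaflux_model import JuliaFluxGPT  # model definition from this repo

# NOTE: constructor arguments are assumed; check juliaflux_model.py for the real signature.
model = JuliaFluxGPT()

state = torch.load("juliaflux_distilled_warm_best.pt", map_location="cpu")
# Some checkpoints wrap the weights, e.g. {"model": ...}; adjust the key if needed.
if isinstance(state, dict) and "model" in state:
    state = state["model"]
model.load_state_dict(state)
model.eval()

# Illustrative tokenizer loading; the repo's BPE scheme may differ.
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
ids = tokenizer.encode("What is the nature of virtue?").ids
print(ids[:16])
```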
## Links
- Inference Space: LisaMegaWatts/JuliaFluxGPT-distilled
- Parent v1: LisaMegaWatts/JuliaFluxGPT-fused
- Parent v2: LisaMegaWatts/JuliaFluxGPT-fused-v2
- Source code: DavinciDreams/SymbioGPT
- W&B project: symbiogenesis