# JuliaDistill v2

A ~5M-parameter LLaMA-style student model, the second iteration of knowledge distillation from JuliaFluxGPT. It uses a larger BPE vocabulary (4,000 tokens) than JuliaGPTDistill (2,000 tokens).

## Distillation Lineage

| Model | Params | Vocab | Steps | Val Loss |
|-------|--------|-------|-------|----------|
| JuliaFluxGPT (teacher) | ~10M | 2,000 BPE | 30,382 | 6.62 |
| JuliaGPTDistill | ~5M | 2,000 BPE | 4,089 | 7.44 |
| juliadistill-v2 | ~5M | 4,000 BPE | 104,961 | 7.66 |

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) |
| Embedding dim | 256 |
| Layers | 4 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 64 |
| Context length | 256 tokens |
| Vocabulary | 4,000 (ByteLevel BPE) |
| Dropout | 0.1 |
| Weight tying | Yes |
| Framework | Julia + Flux.jl |
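With 4 query heads over 2 KV heads, each pair of query heads shares one KV head. A minimal sketch of that grouped-query mapping, using the head counts from the table above (shapes and names are illustrative assumptions, not the repo's implementation):

```julia
# GQA head sharing: 4 query heads, 2 KV heads (ratio 2:1).
# Shapes and names are illustrative, not the repo's actual code.
n_head, n_kv_head, head_dim, seqlen = 4, 2, 64, 256

kv = randn(Float32, head_dim, seqlen, n_kv_head)  # only 2 K (or V) heads stored

group = n_head ÷ n_kv_head                        # 2 query heads per KV head
k_for_head(h) = kv[:, :, (h - 1) ÷ group + 1]     # KV slice used by query head h
```

Query heads 1–2 read KV head 1, and heads 3–4 read KV head 2, halving the KV cache relative to full multi-head attention.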

## Distillation Settings

| Parameter | Value |
|-----------|-------|
| Teacher model | JuliaFluxGPT (512d/8L/8Q/2KV) |
| KD temperature | 4.0 |
| KD alpha | 0.5 |
| Loss | 0.5 * CE + 0.5 * KL(teacher \|\| student) |
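The objective above can be sketched as follows. The function name, the `T^2` scaling of the KL term (a common KD convention), and the per-batch averaging are assumptions here; the repo's actual training code may differ:

```julia
using Flux: logitcrossentropy
using NNlib: softmax, logsoftmax

# Sketch of: loss = (1 - α) * CE + α * KL(teacher || student),
# with temperature-softened distributions (T = 4.0, α = 0.5 per the table).
function kd_loss(student_logits, teacher_logits, targets; T = 4.0f0, α = 0.5f0)
    ce = logitcrossentropy(student_logits, targets)      # hard-label term
    p_t = softmax(teacher_logits ./ T; dims = 1)         # softened teacher probs
    log_p_t = logsoftmax(teacher_logits ./ T; dims = 1)
    log_p_s = logsoftmax(student_logits ./ T; dims = 1)
    kl = sum(p_t .* (log_p_t .- log_p_s)) / size(targets, 2)
    return (1 - α) * ce + α * T^2 * kl                   # T^2 keeps gradient scale
end
```

When student and teacher logits agree, the KL term vanishes, leaving only the weighted cross-entropy.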

## Training

| Parameter | Value |
|-----------|-------|
| Dataset | philosophy-corpus |
| Tokenizer | BPE (4,000 vocab, ByteLevel) |
| Training steps | 104,961 |
| Best val loss | 7.66 |
| Optimizer | AdamW (lr=6e-4, weight_decay=0.01) |
| LR schedule | Cosine decay with 5% warmup |
| Gradient clipping | ClipNorm(1.0) |
| Hardware | NVIDIA RTX 3060 12GB |
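The "cosine decay with 5% warmup" schedule above can be sketched as a step-indexed function; the exact functional form is an assumption (linear warmup, decay to zero), with the peak LR and step count taken from the table:

```julia
# Sketch of the LR schedule: linear warmup over the first 5% of steps,
# then cosine decay to zero. Assumed form; the repo's code may differ.
function lr_at(step; max_lr = 6f-4, total = 104_961,
               warmup = round(Int, 0.05 * total))
    step <= warmup && return max_lr * step / warmup   # linear warmup
    progress = (step - warmup) / (total - warmup)     # 0 → 1 after warmup
    return max_lr * 0.5f0 * (1 + cos(pi * progress))  # cosine decay
end
```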

## Inference Settings

| Parameter | Value |
|-----------|-------|
| vocab_size | 4,000 |
| context_length | 256 |
| temperature | 0.8 |
| top_k | 40 |

Note: This model uses a 4,000-token BPE tokenizer, different from the 2,000-token tokenizer used by JuliaFluxGPT and JuliaGPTDistill. No tokenizer file is included in this repo.
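Temperature 0.8 with top-k 40 can be sketched as below; the function name is illustrative, and this is one common way to combine the two settings, not necessarily the repo's sampler:

```julia
using NNlib: softmax
using StatsBase: sample, Weights

# Sketch: top-k + temperature sampling over a logits vector
# (temperature 0.8, top_k 40 per the settings above).
function sample_token(logits; temperature = 0.8f0, top_k = 40)
    idx = partialsortperm(logits, 1:top_k; rev = true)  # keep the 40 best logits
    probs = softmax(logits[idx] ./ temperature)         # renormalize over top-k
    return idx[sample(1:top_k, Weights(probs))]         # map back to vocab index
end
```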

## Checkpoint Format

JLD2 files containing:

- `model_state` — Flux model weights
- `hyperparams` — `Dict("n_embd" => 256, "n_layer" => 4, "n_head" => 4, "n_kv_head" => 2, "vocab_size" => 4000, "block_size" => 256, "dropout" => 0.1, "kd_temperature" => 4.0, "kd_alpha" => 0.5)`
- `step`, `best_val_loss`, `train_losses`, `val_losses`
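Restoring a checkpoint with these keys might look like the sketch below; `build_model` is a hypothetical constructor, and the repo's actual loading code may differ:

```julia
using JLD2, Flux

# Sketch: restore a checkpoint, assuming the keys listed above.
function load_checkpoint(path)                 # e.g. "best_model.jld2"
    ckpt = JLD2.load(path)
    hp = ckpt["hyperparams"]                   # architecture hyperparameters
    # model = build_model(hp)                  # hypothetical: rebuild the net
    # Flux.loadmodel!(model, ckpt["model_state"])  # restore weights
    return ckpt["step"], ckpt["best_val_loss"]
end
```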

## Files

| File | Description |
|------|-------------|
| `best_model.jld2` | Best validation loss checkpoint |
| `final_model.jld2` | Final training step checkpoint |
| `checkpoint_latest.jld2` | Latest periodic checkpoint |
| `checkpoint_interrupted.jld2` | Auto-saved on training interruption |

## License

MIT
