# JuliaDistill v2

A ~5M-parameter LLaMA-style student model, the second iteration of knowledge distillation from JuliaFluxGPT. It uses a larger BPE vocabulary (4,000 tokens) than JuliaGPTDistill (2,000 tokens).

## Distillation Lineage

| Model | Params | Vocab | Steps | Val Loss |
|-------|--------|-------|-------|----------|
| JuliaFluxGPT (teacher) | ~10M | 2,000 BPE | 30,382 | 6.62 |
| JuliaGPTDistill | ~5M | 2,000 BPE | 4,089 | 7.44 |
| juliadistill-v2 | ~5M | 4,000 BPE | 104,961 | 7.66 |

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) |
| Embedding dim | 256 |
| Layers | 4 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 64 |
| Context length | 256 tokens |
| Vocabulary | 4,000 (ByteLevel BPE) |
| Dropout | 0.1 |
| Weight tying | Yes |
| Framework | Julia + Flux.jl |
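With 4 query heads over 2 KV heads, each pair of query heads shares one KV head. A minimal sketch of that grouped-query mapping, using the head counts from the table above (shapes and names are illustrative assumptions, not the repo's implementation):

```julia
# GQA head sharing: 4 query heads, 2 KV heads (ratio 2:1).
# Shapes and names are illustrative, not the repo's actual code.
n_head, n_kv_head, head_dim, seqlen = 4, 2, 64, 256

kv = randn(Float32, head_dim, seqlen, n_kv_head)  # only 2 K (or V) heads stored

group = n_head ÷ n_kv_head                        # 2 query heads per KV head
k_for_head(h) = kv[:, :, (h - 1) ÷ group + 1]     # KV slice used by query head h
```

Query heads 1–2 read KV head 1, and heads 3–4 read KV head 2, halving the KV cache relative to full multi-head attention.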

## Distillation Settings

| Parameter | Value |
|-----------|-------|
| Teacher model | JuliaFluxGPT (512d/8L/8Q/2KV) |
| KD temperature | 4.0 |
| KD alpha | 0.5 |
| Loss | 0.5 * CE + 0.5 * KL(teacher \|\| student) |
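The objective above can be sketched as follows. The function name, the `T^2` scaling of the KL term (a common KD convention), and the per-batch averaging are assumptions here; the repo's actual training code may differ:

```julia
using Flux: logitcrossentropy
using NNlib: softmax, logsoftmax

# Sketch of: loss = (1 - α) * CE + α * KL(teacher || student),
# with temperature-softened distributions (T = 4.0, α = 0.5 per the table).
function kd_loss(student_logits, teacher_logits, targets; T = 4.0f0, α = 0.5f0)
    ce = logitcrossentropy(student_logits, targets)      # hard-label term
    p_t = softmax(teacher_logits ./ T; dims = 1)         # softened teacher probs
    log_p_t = logsoftmax(teacher_logits ./ T; dims = 1)
    log_p_s = logsoftmax(student_logits ./ T; dims = 1)
    kl = sum(p_t .* (log_p_t .- log_p_s)) / size(targets, 2)
    return (1 - α) * ce + α * T^2 * kl                   # T^2 keeps gradient scale
end
```

When student and teacher logits agree, the KL term vanishes, leaving only the weighted cross-entropy.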

## Training

| Parameter | Value |
|-----------|-------|
| Dataset | philosophy-corpus |
| Tokenizer | BPE (4,000 vocab, ByteLevel) |
| Training steps | 104,961 |
| Best val loss | 7.66 |
| Optimizer | AdamW (lr=6e-4, weight_decay=0.01) |
| LR schedule | Cosine decay with 5% warmup |
| Gradient clipping | ClipNorm(1.0) |
| Hardware | NVIDIA RTX 3060 12GB |
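The "cosine decay with 5% warmup" schedule above can be sketched as a step-indexed function; the exact functional form is an assumption (linear warmup, decay to zero), with the peak LR and step count taken from the table:

```julia
# Sketch of the LR schedule: linear warmup over the first 5% of steps,
# then cosine decay to zero. Assumed form; the repo's code may differ.
function lr_at(step; max_lr = 6f-4, total = 104_961,
               warmup = round(Int, 0.05 * total))
    step <= warmup && return max_lr * step / warmup   # linear warmup
    progress = (step - warmup) / (total - warmup)     # 0 → 1 after warmup
    return max_lr * 0.5f0 * (1 + cos(pi * progress))  # cosine decay
end
```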

## Inference Settings

| Parameter | Value |
|-----------|-------|
| vocab_size | 4,000 |
| context_length | 256 |
| temperature | 0.8 |
| top_k | 40 |

Note: This model uses a 4,000-token BPE tokenizer, different from the 2,000-token tokenizer used by JuliaFluxGPT and JuliaGPTDistill. No tokenizer file is included in this repo.
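Temperature 0.8 with top-k 40 can be sketched as below; the function name is illustrative, and this is one common way to combine the two settings, not necessarily the repo's sampler:

```julia
using NNlib: softmax
using StatsBase: sample, Weights

# Sketch: top-k + temperature sampling over a logits vector
# (temperature 0.8, top_k 40 per the settings above).
function sample_token(logits; temperature = 0.8f0, top_k = 40)
    idx = partialsortperm(logits, 1:top_k; rev = true)  # keep the 40 best logits
    probs = softmax(logits[idx] ./ temperature)         # renormalize over top-k
    return idx[sample(1:top_k, Weights(probs))]         # map back to vocab index
end
```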

## Checkpoint Format

JLD2 files containing:

- `model_state` — Flux model weights
- `hyperparams` — `Dict("n_embd" => 256, "n_layer" => 4, "n_head" => 4, "n_kv_head" => 2, "vocab_size" => 4000, "block_size" => 256, "dropout" => 0.1, "kd_temperature" => 4.0, "kd_alpha" => 0.5)`
- `step`, `best_val_loss`, `train_losses`, `val_losses`
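Restoring a checkpoint with these keys might look like the sketch below; `build_model` is a hypothetical constructor, and the repo's actual loading code may differ:

```julia
using JLD2, Flux

# Sketch: restore a checkpoint, assuming the keys listed above.
function load_checkpoint(path)                 # e.g. "best_model.jld2"
    ckpt = JLD2.load(path)
    hp = ckpt["hyperparams"]                   # architecture hyperparameters
    # model = build_model(hp)                  # hypothetical: rebuild the net
    # Flux.loadmodel!(model, ckpt["model_state"])  # restore weights
    return ckpt["step"], ckpt["best_val_loss"]
end
```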

## Files

| File | Description |
|------|-------------|
| `best_model.jld2` | Best validation loss checkpoint |
| `final_model.jld2` | Final training step checkpoint |
| `checkpoint_latest.jld2` | Latest periodic checkpoint |
| `checkpoint_interrupted.jld2` | Auto-saved on training interruption |

## License

MIT
