---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- distillation
- knowledge-distillation
- llama-style
- gqa
- rope
- rmsnorm
- swiglu
- bpe
- philosophy
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaGPTDistill
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: loss
      value: 7.44
      name: Val Loss
      verified: false
---

# JuliaGPTDistill

A **~5M parameter** LLaMA-style student model distilled from [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) (10M parameters). Knowledge distillation with temperature scaling compresses the teacher's knowledge into the smaller student architecture.
## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) |
| Embedding dim | 256 |
| Layers | 4 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 64 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Dropout | 0.1 |
| Weight tying | Yes |
| Framework | Julia + Flux.jl |
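To make the GQA row concrete, here is a minimal NumPy sketch of grouped-query attention with the shapes from the table above (4 query heads, 2 KV heads, head dim 64). This is an illustration of the technique, not the repo's Flux.jl code; all variable names are hypothetical.

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 4, 2, 64, 8
group = n_q_heads // n_kv_heads  # 2 query heads share each KV head (ratio 2:1)

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Repeat each KV head `group` times so every query head has a K/V pair,
# while only storing/projecting half as many K and V heads.
k_rep = np.repeat(k, group, axis=0)  # (4, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep  # (4, seq, 64); heads concat back to the 256-dim stream

print(out.shape)  # (4, 8, 64)
```

With 2 KV heads instead of 4, the KV cache at inference time is halved for the same number of query heads.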
|
## Distillation Settings

| Parameter | Value |
|-----------|-------|
| Teacher model | [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) (512d/8L/8Q/2KV) |
| KD temperature | 4.0 |
| KD alpha | 0.5 |
| Loss | 0.5 * CE + 0.5 * KL(teacher \|\| student) |
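The loss in the table can be sketched as follows: a minimal NumPy illustration of `0.5 * CE + 0.5 * KL(teacher || student)` with temperature `T = 4.0`. This is an assumption-laden sketch, not the repo's Julia training code — in particular, whether the KL term is rescaled by `T^2` (as in the original Hinton et al. formulation) is not stated on this card.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, alpha = 4.0, 0.5  # KD temperature and alpha from the table
rng = np.random.default_rng(0)
student_logits = rng.standard_normal((2, 10))  # (batch, vocab), toy sizes
teacher_logits = rng.standard_normal((2, 10))
labels = np.array([3, 7])  # hard ground-truth token ids

# Hard-label cross-entropy on the student's unscaled logits.
log_p_student = np.log(softmax(student_logits))
ce = -log_p_student[np.arange(2), labels].mean()

# KL(teacher || student) on temperature-softened distributions.
p_teacher = softmax(teacher_logits / T)
log_p_soft = np.log(softmax(student_logits / T))
kl = (p_teacher * (np.log(p_teacher) - log_p_soft)).sum(axis=-1).mean()

loss = alpha * ce + (1 - alpha) * kl
print(float(loss))
```

With `alpha = 0.5` the student is pulled equally toward the ground-truth tokens and toward the teacher's softened distribution; the high temperature spreads probability mass over non-argmax tokens so the student can learn the teacher's relative preferences.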
|
## Training

| Parameter | Value |
|-----------|-------|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Tokenizer | BPE (2,000 vocab, ByteLevel) |
| Training steps | 4,089 |
| Best val loss | 7.44 |
| Hardware | NVIDIA RTX 3060 12GB |
|
## Inference Settings

| Parameter | Value |
|-----------|-------|
| vocab_size | 2,000 |
| context_length | 256 |
| temperature | 0.8 |
| top_k | 40 |

**Note:** This model requires the same BPE tokenizer used by JuliaFluxGPT. No tokenizer file is included in this repo — use the tokenizer from [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT/blob/main/tokenizer.json).
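The sampling settings above (temperature 0.8, top-k 40) can be sketched in NumPy. The function name `sample_next_token` is illustrative and not from the repo; the actual inference code is Julia/Flux.jl.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, rng=None):
    """Temperature-scale the logits, keep the top-k, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    logits = logits / temperature  # <1.0 sharpens the distribution
    if top_k < logits.shape[-1]:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)  # mask the rest
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = rng.standard_normal(2000)  # vocab_size = 2,000
token = sample_next_token(logits, rng=rng)
print(0 <= token < 2000)  # True
```

Temperature is applied before the top-k cut, but since dividing by a positive constant is monotone, the same 40 tokens survive either way.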

## Checkpoint Format

JLD2 files containing:

- `model_state` — Flux model weights
- `hyperparams` — `Dict("n_embd"=>256, "n_layer"=>4, "n_head"=>4, "n_kv_head"=>2, "vocab_size"=>2000, "block_size"=>256, "dropout"=>0.1, "kd_temperature"=>4.0, "kd_alpha"=>0.5)`
- `step`, `best_val_loss`, `train_losses`, `val_losses`
|
## Files

| File | Description |
|------|-------------|
| `best_model.jld2` | Best validation loss checkpoint |
| `final_model.jld2` | Final training step checkpoint |
| `checkpoint_latest.jld2` | Latest periodic checkpoint |
|
## Provenance

- **Author**: LisaMegaWatts
- **Source code**: [DavinciDreams/JuliaGPT](https://github.com/DavinciDreams/JuliaGPT)
- **Teacher model**: [LisaMegaWatts/JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT)
|
## License

MIT