LisaMegaWatts committed b2f06c9 (verified; parent: baed924): Add proper model card: 256d/4L/4H/2KV, vocab=2000, distilled from JuliaFluxGPT

---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- distillation
- knowledge-distillation
- llama-style
- gqa
- rope
- rmsnorm
- swiglu
- bpe
- philosophy
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaGPTDistill
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: loss
      value: 7.44
      name: Val Loss
      verified: false
---

# JuliaGPTDistill

A **~5M parameter** LLaMA-style student model distilled from [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) (10M parameters). Uses knowledge distillation with temperature scaling to compress the teacher's knowledge into a smaller architecture.

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) |
| Embedding dim | 256 |
| Layers | 4 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 64 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Dropout | 0.1 |
| Weight tying | Yes |
| Framework | Julia + Flux.jl |

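The "~5M parameter" figure can be sanity-checked from the table above with a bit of arithmetic. A minimal sketch in plain Julia, assuming an FFN hidden dim of `4 * n_embd` (the card does not state the SwiGLU hidden size, so that value is an assumption):

```julia
# Rough parameter-count estimate for the architecture table above.
# ffn_hidden is NOT stated in the card; 4 * n_embd is assumed here
# (LLaMA-style models often use ~8/3 * n_embd instead, which would give fewer).
n_embd, n_layer, n_head, n_kv_head = 256, 4, 4, 2
head_dim   = 64
vocab_size = 2000
ffn_hidden = 4 * n_embd                        # assumption

embed = vocab_size * n_embd                    # tied with the output head, so counted once
attn  = n_embd * (n_head * head_dim) +         # Wq
        2 * n_embd * (n_kv_head * head_dim) +  # Wk, Wv (GQA: fewer KV heads)
        (n_head * head_dim) * n_embd           # Wo
ffn   = 3 * n_embd * ffn_hidden                # SwiGLU: gate, up, down projections
norms = 2 * n_layer * n_embd + n_embd          # two RMSNorms per block + final norm

total = embed + n_layer * (attn + ffn) + norms
println(total)  # ≈ 4.4M under these assumptions, i.e. "~5M"
```

Note how GQA shows up directly in the count: `Wk` and `Wv` project to `n_kv_head * head_dim = 128` dims instead of 256, halving the KV projection weights.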
## Distillation Settings

| Parameter | Value |
|-----------|-------|
| Teacher model | [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) (512d/8L/8Q/2KV) |
| KD temperature | 4.0 |
| KD alpha | 0.5 |
| Loss | 0.5 * CE + 0.5 * KL(teacher \|\| student) |

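The loss in the table can be sketched in plain Julia (no Flux needed). `T` and `alpha` follow the card; note that many KD setups additionally scale the KL term by `T^2` to keep gradient magnitudes comparable, and the card does not state whether that is done here:

```julia
# Minimal sketch of the distillation loss: alpha * CE + (1 - alpha) * KL(teacher || student),
# with the KL computed on temperature-softened distributions.
softmax(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

function kd_loss(student_logits, teacher_logits, target::Int; T = 4.0, alpha = 0.5)
    # Hard-label cross-entropy on the student's unscaled logits
    ce = -log(softmax(student_logits)[target])
    # KL(teacher || student) on logits softened by temperature T
    p  = softmax(teacher_logits ./ T)   # soft teacher targets
    q  = softmax(student_logits ./ T)
    kl = sum(p .* log.(p ./ q))
    return alpha * ce + (1 - alpha) * kl
end

loss = kd_loss(randn(10), randn(10), 3)
```

When student and teacher logits agree exactly, the KL term vanishes and only the hard-label cross-entropy remains, which is a quick way to check the implementation.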
## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Tokenizer | BPE (2,000 vocab, ByteLevel) |
| Training steps | 4,089 |
| Best val loss | 7.44 |
| Hardware | NVIDIA RTX 3060 12GB |

## Inference Settings

| Parameter | Value |
|-----------|-------|
| vocab_size | 2,000 |
| context_length | 256 |
| temperature | 0.8 |
| top_k | 40 |

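The sampling settings above (temperature scaling followed by top-k filtering) can be sketched in plain Julia. This is an illustrative decode step, not the repo's actual inference code; a real generation loop would call it once per token over the model's logits:

```julia
# One decode step: keep the top_k largest logits, soften them by temperature,
# then sample a token index from the resulting distribution.
softmax(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

function sample_topk(logits::Vector{Float64}; temperature = 0.8, top_k = 40)
    k   = min(top_k, length(logits))
    idx = partialsortperm(logits, 1:k; rev = true)   # indices of the k largest logits
    probs = softmax(logits[idx] ./ temperature)
    # Inverse-CDF sampling over the k surviving tokens
    r, acc = rand(), 0.0
    for (i, p) in pairs(probs)
        acc += p
        acc >= r && return idx[i]
    end
    return idx[end]
end

logits = randn(2000)        # vocab_size = 2,000
token  = sample_topk(logits)
```

Lower temperatures sharpen the softened distribution toward the argmax; with `top_k = 1` the step becomes greedy decoding.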
**Note:** This model requires the same BPE tokenizer used by JuliaFluxGPT. No tokenizer file is included in this repo — use the tokenizer from [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT/blob/main/tokenizer.json).

## Checkpoint Format

JLD2 files containing:

- `model_state` — Flux model weights
- `hyperparams` — `Dict("n_embd"=>256, "n_layer"=>4, "n_head"=>4, "n_kv_head"=>2, "vocab_size"=>2000, "block_size"=>256, "dropout"=>0.1, "kd_temperature"=>4.0, "kd_alpha"=>0.5)`
- `step`, `best_val_loss`, `train_losses`, `val_losses`

## Files

| File | Description |
|------|-------------|
| `best_model.jld2` | Best validation loss checkpoint |
| `final_model.jld2` | Final training step checkpoint |
| `checkpoint_latest.jld2` | Latest periodic checkpoint |

## Provenance

- **Author**: LisaMegaWatts
- **Source code**: [DavinciDreams/JuliaGPT](https://github.com/DavinciDreams/JuliaGPT)
- **Teacher model**: [LisaMegaWatts/JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT)

## License

MIT