LisaMegaWatts committed b2f06c9 (verified; parent: baed924): Add proper model card: 256d/4L/4H/2KV, vocab=2000, distilled from JuliaFluxGPT

---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- distillation
- knowledge-distillation
- llama-style
- gqa
- rope
- rmsnorm
- swiglu
- bpe
- philosophy
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaGPTDistill
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: loss
      value: 7.44
      name: Val Loss
      verified: false
---

# JuliaGPTDistill

A **~5M parameter** LLaMA-style student model distilled from [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) (10M parameters). Uses knowledge distillation with temperature scaling to compress the teacher's knowledge into a smaller architecture.

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) |
| Embedding dim | 256 |
| Layers | 4 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 64 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Dropout | 0.1 |
| Weight tying | Yes |
| Framework | Julia + Flux.jl |

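The "~5M parameter" figure can be sanity-checked from the table above with a bit of arithmetic. A minimal sketch in plain Julia, assuming an FFN hidden dim of `4 * n_embd` (the card does not state the SwiGLU hidden size, so that value is an assumption):

```julia
# Rough parameter-count estimate for the architecture table above.
# ffn_hidden is NOT stated in the card; 4 * n_embd is assumed here
# (LLaMA-style models often use ~8/3 * n_embd instead, which would give fewer).
n_embd, n_layer, n_head, n_kv_head = 256, 4, 4, 2
head_dim   = 64
vocab_size = 2000
ffn_hidden = 4 * n_embd                        # assumption

embed = vocab_size * n_embd                    # tied with the output head, so counted once
attn  = n_embd * (n_head * head_dim) +         # Wq
        2 * n_embd * (n_kv_head * head_dim) +  # Wk, Wv (GQA: fewer KV heads)
        (n_head * head_dim) * n_embd           # Wo
ffn   = 3 * n_embd * ffn_hidden                # SwiGLU: gate, up, down projections
norms = 2 * n_layer * n_embd + n_embd          # two RMSNorms per block + final norm

total = embed + n_layer * (attn + ffn) + norms
println(total)  # ≈ 4.4M under these assumptions, i.e. "~5M"
```

Note how GQA shows up directly in the count: `Wk` and `Wv` project to `n_kv_head * head_dim = 128` dims instead of 256, halving the KV projection weights.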
## Distillation Settings

| Parameter | Value |
|-----------|-------|
| Teacher model | [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) (512d/8L/8Q/2KV) |
| KD temperature | 4.0 |
| KD alpha | 0.5 |
| Loss | 0.5 * CE + 0.5 * KL(teacher \|\| student) |

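The loss in the table can be sketched in plain Julia (no Flux needed). `T` and `alpha` follow the card; note that many KD setups additionally scale the KL term by `T^2` to keep gradient magnitudes comparable, and the card does not state whether that is done here:

```julia
# Minimal sketch of the distillation loss: alpha * CE + (1 - alpha) * KL(teacher || student),
# with the KL computed on temperature-softened distributions.
softmax(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

function kd_loss(student_logits, teacher_logits, target::Int; T = 4.0, alpha = 0.5)
    # Hard-label cross-entropy on the student's unscaled logits
    ce = -log(softmax(student_logits)[target])
    # KL(teacher || student) on logits softened by temperature T
    p  = softmax(teacher_logits ./ T)   # soft teacher targets
    q  = softmax(student_logits ./ T)
    kl = sum(p .* log.(p ./ q))
    return alpha * ce + (1 - alpha) * kl
end

loss = kd_loss(randn(10), randn(10), 3)
```

When student and teacher logits agree exactly, the KL term vanishes and only the hard-label cross-entropy remains, which is a quick way to check the implementation.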
## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Tokenizer | BPE (2,000 vocab, ByteLevel) |
| Training steps | 4,089 |
| Best val loss | 7.44 |
| Hardware | NVIDIA RTX 3060 12GB |

## Inference Settings

| Parameter | Value |
|-----------|-------|
| vocab_size | 2,000 |
| context_length | 256 |
| temperature | 0.8 |
| top_k | 40 |

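The sampling settings above (temperature scaling followed by top-k filtering) can be sketched in plain Julia. This is an illustrative decode step, not the repo's actual inference code; a real generation loop would call it once per token over the model's logits:

```julia
# One decode step: keep the top_k largest logits, soften them by temperature,
# then sample a token index from the resulting distribution.
softmax(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

function sample_topk(logits::Vector{Float64}; temperature = 0.8, top_k = 40)
    k   = min(top_k, length(logits))
    idx = partialsortperm(logits, 1:k; rev = true)   # indices of the k largest logits
    probs = softmax(logits[idx] ./ temperature)
    # Inverse-CDF sampling over the k surviving tokens
    r, acc = rand(), 0.0
    for (i, p) in pairs(probs)
        acc += p
        acc >= r && return idx[i]
    end
    return idx[end]
end

logits = randn(2000)        # vocab_size = 2,000
token  = sample_topk(logits)
```

Lower temperatures sharpen the softened distribution toward the argmax; with `top_k = 1` the step becomes greedy decoding.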
**Note:** This model requires the same BPE tokenizer used by JuliaFluxGPT. No tokenizer file is included in this repo — use the tokenizer from [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT/blob/main/tokenizer.json).

## Checkpoint Format

JLD2 files containing:

- `model_state` — Flux model weights
- `hyperparams` — `Dict("n_embd"=>256, "n_layer"=>4, "n_head"=>4, "n_kv_head"=>2, "vocab_size"=>2000, "block_size"=>256, "dropout"=>0.1, "kd_temperature"=>4.0, "kd_alpha"=>0.5)`
- `step`, `best_val_loss`, `train_losses`, `val_losses`

## Files

| File | Description |
|------|-------------|
| `best_model.jld2` | Best validation loss checkpoint |
| `final_model.jld2` | Final training step checkpoint |
| `checkpoint_latest.jld2` | Latest periodic checkpoint |

## Provenance

- **Author**: LisaMegaWatts
- **Source code**: [DavinciDreams/JuliaGPT](https://github.com/DavinciDreams/JuliaGPT)
- **Teacher model**: [LisaMegaWatts/JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT)

## License

MIT