---
license: mit
language:
- en
tags:
- julia
- lux
- transformer
- language-model
- chinchilla
- bpe
datasets:
- LisaMegaWatts/philosophy-corpus
pipeline_tag: text-generation
---

# Julia SLM - Small Language Models in Pure Julia

Transformer language models built entirely in Julia using [Lux.jl](https://github.com/LuxDL/Lux.jl), trained on the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.

## Models

### 5M Chinchilla (`5m-chinchilla/`)

A 5.04M-parameter transformer trained to the Chinchilla-optimal budget of ~100M tokens (20 tokens per parameter).

| Param | Value |
|-------|-------|
| Parameters | 5,037,312 |
| Architecture | Decoder-only Transformer |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN multiplier | 4x (SwiGLU) |
| Context length | 256 |
| Vocab size | 2,000 (BPE) |
| Weight tying | Yes |
| Normalization | RMSNorm (pre-norm) |
| Positional encoding | RoPE |
| Bias | None |

**Training details:**

| Metric | Value |
|--------|-------|
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
| Schedule | Cosine decay with 500-step warmup |
| Batch size | 32 |
| Training steps | 12,305 |
| Tokens processed | ~100M |
| Training time | 66 min on RTX 3060 12GB |
| Throughput | ~26K tok/s |
| Final val loss | 3.54 |
| Final val PPL | 34.5 |
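
The schedule row above describes a linear warmup for 500 steps followed by cosine decay from 6e-4 down to 6e-5. The training script's exact code is not shown in this card, so the following is a sketch assuming the standard warmup-plus-cosine form with the hyperparameters from the table:

```julia
# Warmup + cosine LR schedule (standard form; a sketch, not the repo's exact code)
function lr_schedule(step; lr=6e-4, min_lr=6e-5, warmup=500, total=12_305)
    step <= warmup && return lr * step / warmup               # linear warmup to peak lr
    t = (step - warmup) / (total - warmup)                    # decay progress in [0, 1]
    return min_lr + 0.5 * (lr - min_lr) * (1 + cos(pi * t))   # cosine decay to min_lr
end
```

With these values, `lr_schedule(500)` returns the peak 6e-4 and `lr_schedule(12_305)` the floor 6e-5.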

**Loss curve:**

| Step | Train Loss | Val Loss | Val PPL |
|------|-----------|----------|---------|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |
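
The headline numbers are mutually consistent and easy to check: tokens per step is batch size times context length (32 × 256 = 8,192), so 12,305 steps covers about 100.8M tokens, which is roughly 20 tokens per parameter for a 5.04M-parameter model, and the PPL column is just `exp` of the loss column:

```julia
tokens_per_step  = 32 * 256                     # batch size × context length = 8,192
total_tokens     = tokens_per_step * 12_305     # ≈ 100.8M, the ~100M Chinchilla budget
tokens_per_param = total_tokens / 5_037_312     # ≈ 20 tokens per parameter
val_ppl          = exp(3.54)                    # perplexity = exp(cross-entropy) ≈ 34.5
```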

## Architecture

```
JuliaGPTModel
├── tok_emb: Embedding(2000 → 256)          # weight-tied with output head
├── rope: RotaryPositionalEncoding(256)
├── blocks × 6:
│   ├── ln1: RMSNorm(256)
│   ├── attn: MultiHeadAttention(4 heads, 64 dim each)
│   │   ├── wq, wk, wv: Dense(256 → 256)
│   │   └── wo: Dense(256 → 256)
│   ├── ln2: RMSNorm(256)
│   └── ffn: SwiGLU(256 → 1024 → 256)
│       ├── w1: Dense(256 → 1024)           # gate
│       ├── v: Dense(256 → 1024)            # value
│       └── w2: Dense(1024 → 256)           # down-project
├── ln_f: RMSNorm(256)
└── head: TiedEmbeddingHead → (2000,)       # shares tok_emb weights
```
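
The two less-common blocks in the tree, RMSNorm and the SwiGLU feed-forward, are compact enough to sketch in plain Julia. Random weights stand in for trained parameters here; the actual layers live in the repo's Lux model, so this is illustrative only:

```julia
using Random, Statistics

# RMSNorm: rescale by the root-mean-square of the features (no mean subtraction, no bias)
rmsnorm(x, g; eps=1f-5) = g .* x ./ sqrt.(mean(abs2, x) + eps)

# SwiGLU FFN: swish-gated up-projection to 1024, elementwise gate, down-projection to 256
swish(x) = x .* (1 ./ (1 .+ exp.(-x)))
swiglu(x, w1, v, w2) = w2 * (swish(w1 * x) .* (v * x))   # gate .* value, then down-project

rng = MersenneTwister(0)
x  = randn(rng, Float32, 256)                   # one token's activations
g  = ones(Float32, 256)                         # RMSNorm gain
w1 = randn(rng, Float32, 1024, 256) .* 0.02f0   # gate projection
v  = randn(rng, Float32, 1024, 256) .* 0.02f0   # value projection
w2 = randn(rng, Float32, 256, 1024) .* 0.02f0   # down-projection
y  = swiglu(rmsnorm(x, g), w1, v, w2)           # 256-dim output, same shape as x
```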

## Usage

### Load and generate

```julia
using Pkg; Pkg.activate("julia-slm")

include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA, LuxCUDA

# Load tokenizer
tok = BPETokenizer("path/to/vocab.json", "path/to/merges.txt")

# Load checkpoint
device = Lux.gpu_device()  # or Lux.cpu_device()
ps, st, _, step, val_loss = load_checkpoint("5m-chinchilla/final.jld2"; device)

# Create model (must match checkpoint architecture)
model = create_model(ModelConfig(;
    vocab_size=vocab_size(tok), embed_dim=256, n_layers=6,
    n_heads=4, head_dim=64, ffn_mult=4, context_length=256,
    weight_tying=true,
))

# Generate
text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```
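
`generate` above draws each token with temperature 0.8 and top-k 40. The decoding rule itself is standard and can be sketched independently of the model; this is an illustrative re-implementation, not the repo's `generate`:

```julia
using Random

# Sample one token id from a logits vector with temperature scaling and top-k filtering
function sample_token(rng, logits; temperature=0.8, top_k=40)
    order  = partialsortperm(logits, 1:min(top_k, length(logits)); rev=true)
    scaled = logits[order] ./ temperature          # sharpen (<1) or flatten (>1) the distribution
    probs  = exp.(scaled .- maximum(scaled))       # numerically stable softmax over top-k only
    probs ./= sum(probs)
    r, c = rand(rng), 0.0
    for (i, p) in enumerate(probs)                 # inverse-CDF draw from the truncated distribution
        c += p
        c >= r && return order[i]
    end
    return order[end]
end
```

Lower temperature concentrates mass on the highest logits; `top_k=1` reduces to greedy decoding.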

### Resume training

```bash
julia --project scripts/train.jl --config config/5m.toml --resume 5m-chinchilla/final.jld2
```

## Dataset

Trained on [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus), a curated collection of 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.

- **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- **Tokenizer**: ByteLevel BPE, 2,000 vocab (also available: 4K variant)
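
A byte-level BPE tokenizer splits text into bytes and repeatedly merges the adjacent pair with the lowest rank in a learned merge list. A minimal sketch of that merge loop, using a toy merge table (the real ranks come from the `merges.txt` shipped with the model):

```julia
# Apply ranked BPE merges greedily: always merge the lowest-ranked adjacent pair first
function bpe_merge(tokens::Vector{String}, ranks::Dict{Tuple{String,String},Int})
    tokens = copy(tokens)
    while length(tokens) > 1
        pairs = [(tokens[i], tokens[i+1]) for i in 1:length(tokens)-1]
        best  = argmin([get(ranks, p, typemax(Int)) for p in pairs])
        get(ranks, pairs[best], typemax(Int)) == typemax(Int) && break   # no mergeable pair left
        tokens = vcat(tokens[1:best-1], [tokens[best] * tokens[best+1]], tokens[best+2:end])
    end
    return tokens
end
```

For example, with ranks `("t","h") => 1` and `("th","e") => 2`, the byte sequence `["t","h","e"]` collapses to the single token `"the"`.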

## Framework

Built with:
- [Lux.jl](https://github.com/LuxDL/Lux.jl) - Explicit-parameter neural networks
- [Zygote.jl](https://github.com/FluxML/Zygote.jl) - Automatic differentiation
- [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) - GPU acceleration
- [Optimisers.jl](https://github.com/FluxML/Optimisers.jl) - AdamW with cosine LR
- [NNlib.jl](https://github.com/FluxML/NNlib.jl) - Softmax, activations
- [OneHotArrays.jl](https://github.com/FluxML/OneHotArrays.jl) - GPU-compatible cross-entropy

## Files

```
5m-chinchilla/
├── config.toml       # Training config (TOML)
├── final.jld2        # Final checkpoint (step 12305)
└── step_12000.jld2   # Intermediate checkpoint
```

Checkpoints are saved in JLD2 format and contain: model parameters (`ps`), model state (`st`), optimizer state, step number, and best validation loss.

## License

MIT