SGD Optimized

AXL-Micro-8M

SGD baseline with 12.8M parameters and a perplexity (PPL) of 3.13.

| Parameters | Perplexity | Training | GGUF |
| --- | --- | --- | --- |
| 13M | 3.13 | 10 min | 15 MB |

| Property | Value |
| --- | --- |
| Architecture | Multi-Scale Transformer |
| d_model | ? |
| Attention Heads | ? |
| Layers per Scale | ? |
| Context Window | 256 bytes |
| Downsample Factors | [1, 2, 4] |
| Vocab Size | 258 (byte-level) |
| Optimizer | AdamW |

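
The context window, downsample factors, and byte-level vocabulary listed here can be made concrete with a little arithmetic. The sketch below assumes that each scale sees the 256-byte window shortened by its downsample factor, and that the two vocabulary entries beyond the 256 byte values are special tokens. The page does not say what those two entries are, so `BOS_ID` and `EOS_ID` are hypothetical names, as are the helper functions.

```python
# Values taken from the architecture properties on this page.
CONTEXT_WINDOW = 256
DOWNSAMPLE_FACTORS = [1, 2, 4]
VOCAB_SIZE = 258

# Assumption: ids 256 and 257 are special tokens (hypothetical names).
BOS_ID = 256
EOS_ID = 257


def encode(text: str) -> list[int]:
    """Byte-level encoding: one id per UTF-8 byte, wrapped in the assumed special tokens."""
    return [BOS_ID] + list(text.encode("utf-8")) + [EOS_ID]


def scale_lengths(context: int, factors: list[int]) -> list[int]:
    """Sequence length at each scale, assuming plain integer downsampling."""
    return [context // f for f in factors]


ids = encode("def fibonacci():")
# Every id must fit inside the 258-entry vocabulary.
print(len(ids), max(ids) < VOCAB_SIZE)
print(scale_lengths(CONTEXT_WINDOW, DOWNSAMPLE_FACTORS))
```

Under these assumptions the coarsest scale attends over a quarter as many positions as the finest one, which is where a multi-scale design saves compute.
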
Trained with SGD for 10 minutes (1723 steps) on Shakespeare. The multi-scale architecture helps even with SGD.

| Metric | Value |
| --- | --- |
| Final Loss | 0.0210 |
| Perplexity | 3.13 |
| Training Steps | 1723 |
| Training Time | 10 min |

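
For a byte-level model, perplexity is often easier to read as bits per byte. The sketch below assumes the reported perplexity is the exponential of a mean per-byte cross-entropy measured in nats. The page does not state how the figure was computed, and the helper names are illustrative. It also derives the average step rate from the reported step count and training time.

```python
import math

# Figures reported on this page.
REPORTED_PPL = 3.13
TRAINING_STEPS = 1723
TRAINING_SECONDS = 10 * 60


def perplexity_to_nats(ppl: float) -> float:
    """Mean cross-entropy in nats, assuming ppl = exp(cross_entropy)."""
    return math.log(ppl)


def perplexity_to_bits_per_byte(ppl: float) -> float:
    """The same quantity in bits; per byte because the vocabulary is byte-level."""
    return math.log2(ppl)


nats = perplexity_to_nats(REPORTED_PPL)
bits = perplexity_to_bits_per_byte(REPORTED_PPL)
steps_per_second = TRAINING_STEPS / TRAINING_SECONDS

print(f"{nats:.3f} nats/byte, {bits:.3f} bits/byte, {steps_per_second:.2f} steps/s")
```
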

Usage

ollama create axl-micro-8m -f Modelfile
ollama run axl-micro-8m "def fibonacci():"
This is the SGD baseline: the multi-scale architecture helps even without the Lion optimizer.
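
The `ollama create` command reads a Modelfile that this page does not show. As a minimal sketch, the Python snippet below writes one. The GGUF file name is a hypothetical placeholder, and setting `num_ctx` to 256 is an assumption based on the 256-byte context window listed for this model.

```python
from pathlib import Path


def build_modelfile(gguf_path: str, num_ctx: int = 256) -> str:
    """Return Modelfile text that points Ollama at a local GGUF file."""
    return (
        f"FROM {gguf_path}\n"
        # Assumption: match the model's 256-byte context window.
        f"PARAMETER num_ctx {num_ctx}\n"
    )


if __name__ == "__main__":
    # Hypothetical file name; substitute the GGUF you actually downloaded.
    Path("Modelfile").write_text(build_modelfile("./axl-micro-8m-q4_k_m.gguf"))
```

With the Modelfile in the current directory, the two `ollama` commands in the Usage section can be run as written.
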

| File | Size | Format |
| --- | --- | --- |
| F16 GGUF | 15 MB | Full precision |
| Q4_K_M GGUF | 15 MB | 4-bit quantized |

GGUF files work with Ollama and llama.cpp. Q4_K_M is about 3x smaller than F16.