SGD Optimized

AXL-Code-1B

SGD baseline: 318M parameters, perplexity 31.22, 256-byte context, trained in 30 minutes, 636 MB F16 GGUF.

| Property | Value |
|---|---|
| Architecture | Multi-Scale Transformer |
| d_model | ? |
| Attention Heads | ? |
| Layers per Scale | ? |
| Context Window | 256 bytes |
| Downsample Factors | [1, 2, 4] |
| Vocab Size | 258 (byte-level) |
| Optimizer | SGD |
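Several hyperparameters are undocumented in the table above. The sketch below is a minimal, hypothetical configuration object that records the known values and leaves the unknowns as placeholders; the class and field names are illustrative and not taken from the released code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AXLCodeConfig:
    """Hypothetical config for the multi-scale byte-level transformer.

    Only the documented values are filled in; d_model, attention heads,
    and layers per scale are not published, so they stay as None.
    """
    vocab_size: int = 258                     # byte-level: 256 bytes + 2 specials (assumed split)
    context_window: int = 256                 # bytes
    downsample_factors: List[int] = field(default_factory=lambda: [1, 2, 4])
    d_model: Optional[int] = None             # undocumented
    n_heads: Optional[int] = None             # undocumented
    n_layers_per_scale: Optional[int] = None  # undocumented
    optimizer: str = "sgd"
```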
Trained with vanilla SGD on 50 MB of Python code for 1012 steps (about 30 minutes). This run serves as the baseline for the Lion optimizer comparison; a sketch of the setup follows the metrics table below.
| Metric | Value |
|---|---|
| Final Loss | 2.9391 |
| Perplexity | 31.22 |
| Training Steps | 1012 |
| Training Time | 30 min |
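A minimal sketch of what this baseline amounts to in PyTorch: a plain SGD loop (no momentum, no weight decay, no scheduler) over byte-level next-token prediction. The model, dataloader, and learning rate here are placeholders, not values from the actual training run.

```python
import torch
import torch.nn.functional as F

def train_sgd(model, train_loader, steps=1012, lr=0.1, device="cpu"):
    """Vanilla SGD baseline loop (illustrative, not the released code).

    Assumes each batch is a LongTensor of byte IDs with shape
    (batch, seq_len) and the model returns logits of shape
    (batch, seq_len, vocab_size).
    """
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr)

    step = 0
    while step < steps:
        for batch in train_loader:
            batch = batch.to(device)
            inputs, targets = batch[:, :-1], batch[:, 1:]
            logits = model(inputs)
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return loss.item()
```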

Usage

```bash
ollama create axl-code-1b -f Modelfile
ollama run axl-code-1b "def fibonacci():"
```
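The model can also be queried programmatically once it has been created in Ollama. A minimal sketch using the ollama Python package, assuming the package is installed and a local Ollama server is running:

```python
import ollama

# Assumes `ollama create axl-code-1b -f Modelfile` has already been run.
response = ollama.generate(
    model="axl-code-1b",
    prompt="def fibonacci():",
)
print(response["response"])  # generated completion text
```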
This model is the SGD baseline; AXL-Code-1B-Lion achieves 16x better perplexity.
| File | Size | Format |
|---|---|---|
| F16 GGUF | 636 MB | Full precision |
| Q4_K_M GGUF | 197 MB | 4-bit quantized |
GGUF files work with Ollama and llama.cpp. Q4_K_M is about 3x smaller than F16.
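For direct use with llama.cpp, the GGUF files can also be loaded through the llama-cpp-python bindings. A minimal sketch, using an illustrative file name for the Q4_K_M quant (substitute the actual path):

```python
from llama_cpp import Llama

# n_ctx=256 matches the model's 256-byte context window.
llm = Llama(model_path="axl-code-1b.Q4_K_M.gguf", n_ctx=256)

out = llm("def fibonacci():", max_tokens=64)
print(out["choices"][0]["text"])
```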