---
license: cc0-1.0
language:
- en
---

# Atto: Extreme Intelligence Density Research

Atto is an exploration into the fundamental limits of **Intelligence Density**: how much knowledge and generative capability can be packed into a neural network with a strictly limited parameter budget.

This project focuses on the "sub-kiloparameter" and "low-kiloparameter" regime, training models to generate Shakespearean text with as few as 64 parameters.

## The Atto Series

| Model | Parameters | Context | Weights Size (JSON) | Val Loss |
| :--- | :--- | :--- | :--- | :--- |
| **atto-64** | 64 | 3 | 1.8 KB | 2.59 |
| **atto-128** | 128 | 7 | 3.5 KB | 2.83 |
| **atto-256** | 256 | 8 | 6.0 KB | 2.33 |
| **atto-512** | 512 | 16 | 11.8 KB | 2.44 |
| **atto-1024** | 1,024 | 8 | 22.3 KB | 2.11 |
| **atto-2048** | 2,048 | 24 | 44.3 KB | 2.15 |
| **atto-4096** | 4,096 | 56 | 86.4 KB | 2.40 |
| **atto-8192** | 8,192 | 28 | 172.7 KB | 1.91 |
| **atto-16384** | 16,384 | 60 | ~640 KB | 2.11 |

## Research Findings: Intelligence Density

1. **Architecture Matters**: At the sub-1000 parameter scale, standard Transformers are highly inefficient due to the overhead of Attention and LayerNorm. Our custom **Neural N-Gram (AttoLM)** architecture ensures that every single parameter directly participates in character prediction.
2. **The Embedding Threshold**: We found that moving from 8-dimensional to 16-dimensional embeddings (at 8,192 parameters) creates a significant jump in coherence, allowing the model to represent complex character relationships.
3. **Context vs. Width**: In extremely small models, there is a sharp trade-off between the context window (memory) and embedding dimensionality (representation). Our 8,192 and 16,384 models prioritize a balance that favors realistic word formation.

## Next Steps

This is just a **first step** in making intelligence very dense.
By optimizing weight initialization, exploring custom activation functions, and applying even more extreme parameter tying, we believe it is possible to achieve "readable Shakespeare" with even fewer than 1,000 parameters.

## Usage

### Training

To train the base series, run:

```bash
python3 train_atto.py
```

### Sampling

To evaluate all trained models:

```bash
python3 sample.py
```

The models are exported as dependency-free JSON files in the `models/` directory, ready for client-side inference in a web browser.

### Sample Generations

```
============================================================
atto-8192 | 8192 params | embd=16 ctx=28 vocab=64
============================================================
prompt="the": Math Laer axfourith tipht's gord me hour hace (remaat ond, I'll wore ser ar now pre's for word to styous the mall, stpoul folthis yow apt and be a
prompt="to be": CPon. How gue. O- whut feathent. Thou the in ap bast. gos A thing of be rith nosset? [Tiths that hintend kyele in younk hore; Gat sgees wis
prompt="Ham": . HaCleata, Wlotsef yow preerant fore thipe matte of iche in you? And spour, the tang offe herees welr then[foritr her veut arve id for houn w
```
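To make the Neural N-Gram idea from the Research Findings more concrete, here is a minimal pure-Python sketch of a character-level model in which every weight feeds directly into the next-character logits. It is an illustration under stated assumptions, not the AttoLM implementation: the weight layout (one embedding table reused as the output projection, plus one mixing weight per context position and dimension), the function names `init_model`, `count_params`, `next_char_probs`, and `sample`, and the dictionary keys `emb` and `pos` are all hypothetical. The resulting parameter counts are not expected to match the table above, and the layout says nothing about the JSON files in `models/`.

```python
import math
import random


def init_model(vocab_size, embd, ctx, seed=0):
    """Build a hypothetical tied-weight neural n-gram model with random weights."""
    rng = random.Random(seed)
    # Embedding table, also reused as the output projection (weight tying).
    emb = [[rng.gauss(0.0, 0.1) for _ in range(embd)] for _ in range(vocab_size)]
    # One mixing weight per (context position, embedding dimension).
    pos = [[rng.gauss(0.0, 0.1) for _ in range(embd)] for _ in range(ctx)]
    return {"emb": emb, "pos": pos}


def count_params(model):
    """Count every scalar weight in the model."""
    return sum(len(row) for table in model.values() for row in table)


def next_char_probs(model, context_ids):
    """Return a probability distribution over the next character id."""
    emb, pos = model["emb"], model["pos"]
    embd, ctx = len(emb[0]), len(pos)
    recent = context_ids[-ctx:]      # keep only the last ctx characters
    offset = ctx - len(recent)       # right-align short contexts
    hidden = [0.0] * embd
    for i, char_id in enumerate(recent):
        for d in range(embd):
            hidden[d] += pos[offset + i][d] * emb[char_id][d]
    # Tied output layer: score each character by a dot product with its embedding.
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in emb]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def sample(model, prompt_ids, n_chars, seed=0):
    """Autoregressively sample n_chars character ids after the prompt."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(n_chars):
        probs = next_char_probs(model, ids)
        ids.append(rng.choices(range(len(probs)), weights=probs)[0])
    return ids


if __name__ == "__main__":
    model = init_model(vocab_size=64, embd=16, ctx=28)
    # 64*16 embedding weights + 28*16 position weights = 1472
    print(count_params(model))
    print(sample(model, [1, 2, 3], n_chars=10))
```

In this assumed layout the budget is `vocab_size * embd + ctx * embd`, so widening the embedding costs `vocab_size + ctx` weights per extra dimension while extending the context costs only `embd` weights per extra position. This is one simple way to see the context-versus-width trade-off; the actual Atto models may divide their budget differently.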