| --- |
| license: cc0-1.0 |
| language: |
| - en |
| --- |
| # Atto: Extreme Intelligence Density Research |
|
|
| Atto is an exploration into the fundamental limits of **Intelligence Density**: how much knowledge and generative capability can be packed into a neural network with a strictly limited parameter budget. |
|
|
| This project focuses on the "sub-kiloparameter" and "low-kiloparameter" regimes, training character-level models to generate Shakespearean text with as few as 64 parameters. |
|
|
| ## The Atto Series |
|
|
| | Model | Parameters | Context | Weights Size (JSON) | Val Loss | |
| | :--- | :--- | :--- | :--- | :--- | |
| | **atto-64** | 64 | 3 | 1.8 KB | 2.59 | |
| | **atto-128** | 128 | 7 | 3.5 KB | 2.83 | |
| | **atto-256** | 256 | 8 | 6.0 KB | 2.33 | |
| | **atto-512** | 512 | 16 | 11.8 KB | 2.44 | |
| | **atto-1024** | 1,024 | 8 | 22.3 KB | 2.11 | |
| | **atto-2048** | 2,048 | 24 | 44.3 KB | 2.15 | |
| | **atto-4096** | 4,096 | 56 | 86.4 KB | 2.40 | |
| | **atto-8192** | 8,192 | 28 | 172.7 KB | 1.91 | |
| | **atto-16384** | 16,384 | 60 | ~640 KB | 2.11 | |
|
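To put the Val Loss column in context, the sketch below converts a loss value into perplexity and bits per character and compares it with a uniform-guess baseline. It assumes the reported loss is a mean cross-entropy in nats per character, which the table does not state. It also assumes the 64-symbol vocabulary shown for atto-8192 in the sample output applies to the rest of the series. The function names are illustrative and not part of the project.

```python
import math


def loss_to_perplexity(loss_nats: float) -> float:
    """Perplexity is exp(mean cross-entropy), with the loss given in nats."""
    return math.exp(loss_nats)


def loss_to_bits_per_char(loss_nats: float) -> float:
    """Convert nats per character to bits per character."""
    return loss_nats / math.log(2)


def uniform_baseline(vocab_size: int) -> float:
    """Loss (in nats) of a model that guesses every symbol with equal probability."""
    return math.log(vocab_size)


if __name__ == "__main__":
    # Values copied from the table above.
    val_losses = {"atto-64": 2.59, "atto-1024": 2.11, "atto-8192": 1.91}
    # Assumption: vocab=64 is only reported for atto-8192 in the sample output.
    vocab_size = 64
    print(f"uniform baseline: {uniform_baseline(vocab_size):.3f} nats")
    for name, loss in val_losses.items():
        print(
            f"{name}: perplexity={loss_to_perplexity(loss):.2f}, "
            f"bits/char={loss_to_bits_per_char(loss):.2f}"
        )
```

If those assumptions hold, any loss below `uniform_baseline(64)` means the model does better than guessing uniformly at random.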
|
| ## Research Findings: Intelligence Density |
|
|
| 1. **Architecture Matters**: At the sub-1,000-parameter scale, standard Transformers are highly inefficient because Attention and LayerNorm consume a large share of the parameter budget. Our custom **Neural N-Gram (AttoLM)** architecture is designed so that every parameter participates directly in character prediction. |
| 2. **The Embedding Threshold**: Moving from 8-dimensional to 16-dimensional embeddings (at 8,192 parameters) produced a significant jump in coherence, allowing the model to represent more complex character relationships. atto-8192 also has the lowest validation loss in the series (1.91). |
| 3. **Context vs. Width**: In extremely small models, there is a sharp trade-off between the context window (memory) and embedding dimensionality (representation): under a fixed budget, parameters spent on a longer context are not available for wider embeddings. Our 8,192- and 16,384-parameter models strike a balance chosen to favor realistic word formation. |
|
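The context-versus-width trade-off can be made concrete with a parameter count. The sketch below is a minimal illustration that assumes a generic neural n-gram layout: an embedding table, one hidden layer over the concatenated context embeddings, and an output layer. This is not AttoLM's actual layout, which this README does not specify, and the function names and the `hidden` width are hypothetical.

```python
def ngram_param_count(vocab: int, embd: int, ctx: int, hidden: int) -> int:
    """Parameter count of a generic neural n-gram model (assumed layout)."""
    embedding = vocab * embd                  # one vector per symbol
    hidden_layer = ctx * embd * hidden + hidden  # weights + biases over the concatenated context
    output_layer = hidden * vocab + vocab     # weights + biases over the vocabulary
    return embedding + hidden_layer + output_layer


def max_hidden_width(budget: int, vocab: int, embd: int, ctx: int) -> int:
    """Largest hidden width that keeps the assumed layout within the budget."""
    fixed = vocab * embd + vocab          # cost that does not depend on hidden width
    per_unit = ctx * embd + 1 + vocab     # cost of adding one hidden unit
    return max((budget - fixed) // per_unit, 0)


if __name__ == "__main__":
    budget, vocab = 8192, 64
    # Sweep context length and embedding width under the same budget.
    for ctx in (8, 16, 28):
        for embd in (8, 16):
            hidden = max_hidden_width(budget, vocab, embd, ctx)
            used = ngram_param_count(vocab, embd, ctx, hidden)
            print(f"ctx={ctx:2d} embd={embd:2d} -> hidden={hidden:3d} ({used} params)")
```

Under this assumed layout, each extra position of context costs `embd * hidden` parameters, so a longer context or a wider embedding leaves room for fewer hidden units within the same budget.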
|
| ## Next Steps |
|
|
| This is only a **first step** toward denser intelligence. By optimizing weight initialization, designing custom activation functions, and applying even more extreme parameter tying, we believe it is possible to achieve "readable Shakespeare" with fewer than 1,000 parameters. |
|
|
| ## Usage |
|
|
| ### Training |
| To train the base series, run: |
| ```bash |
| python3 train_atto.py |
| ``` |
|
|
| ### Sampling |
| To evaluate all trained models: |
| ```bash |
| python3 sample.py |
| ``` |
|
|
| The models are exported as dependency-free JSON files in the `models/` directory, ready for client-side inference in a web browser. |
|
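Because the exported files are plain JSON, a parameter budget can be checked without any dependencies. The sketch below counts every numeric leaf in a parsed JSON document. It deliberately makes no assumption about the schema of the exported files, which is not documented here, so any numeric metadata stored alongside the weights is counted too and the result is an upper bound. The helper name `count_parameters` is hypothetical.

```python
import json
import sys
from numbers import Number


def count_parameters(obj) -> int:
    """Recursively count numeric leaves in a parsed JSON value."""
    if isinstance(obj, bool):
        return 0  # JSON true/false are not weights (bool is an int subclass in Python)
    if isinstance(obj, Number):
        return 1
    if isinstance(obj, dict):
        return sum(count_parameters(value) for value in obj.values())
    if isinstance(obj, list):
        return sum(count_parameters(item) for item in obj)
    return 0  # strings and null


if __name__ == "__main__":
    # Pass one or more JSON files from the models/ directory as arguments.
    for path in sys.argv[1:]:
        with open(path, "r", encoding="utf-8") as handle:
            weights = json.load(handle)
        print(f"{path}: {count_parameters(weights)} numeric values")
```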
|
| ### Sample generations |
| ``` |
| |
| ============================================================ |
| atto-8192 | 8192 params | embd=16 ctx=28 vocab=64 |
| ============================================================ |
| prompt="the": |
| Math Laer axfourith tipht's gord me hour hace (remaat ond, |
| I'll wore ser ar now pre's for word to styous the mall, stpoul folthis yow apt and be a |
| |
| prompt="to be": |
| CPon. How gue. O- whut feathent. Thou the in ap bast. gos A thing of be rith nosset? |
| [Tiths that hintend kyele in younk hore; |
| Gat sgees wis |
| |
| prompt="Ham": |
| . HaCleata, |
| Wlotsef yow preerant fore thipe matte of iche in you? |
| And spour, the tang offe herees welr then[foritr her veut arve id for houn w |
| |
| |
| ``` |