---
license: cc0-1.0
language:
- en
---
# Atto: Extreme Intelligence Density Research

Atto explores the fundamental limits of **Intelligence Density**: how much knowledge and generative capability can be packed into a neural network with a strictly limited parameter budget.

This project focuses on the "sub-kiloparameter" and "low-kiloparameter" regime, training models to generate Shakespearean text with as few as 64 parameters.

## The Atto Series

| Model | Parameters | Context | Weights Size (JSON) | Val Loss |
| :--- | :--- | :--- | :--- | :--- |
| **atto-64** | 64 | 3 | 1.8 KB | 2.59 |
| **atto-128** | 128 | 7 | 3.5 KB | 2.83 |
| **atto-256** | 256 | 8 | 6.0 KB | 2.33 |
| **atto-512** | 512 | 16 | 11.8 KB | 2.44 |
| **atto-1024** | 1,024 | 8 | 22.3 KB | 2.11 |
| **atto-2048** | 2,048 | 24 | 44.3 KB | 2.15 |
| **atto-4096** | 4,096 | 56 | 86.4 KB | 2.40 |
| **atto-8192** | 8,192 | 28 | 172.7 KB | 1.91 |
| **atto-16384** | 16,384 | 60 | ~640 KB | 2.11 |
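
The table is easier to compare across rows once it is converted to per-character perplexity and bytes per parameter. The sketch below does that for three rows copied from the table. It assumes that Val Loss is mean cross-entropy in nats per character, that 1 KB is 1,024 bytes, and that the vocabulary has 64 symbols (the value printed for atto-8192 in the sample output). None of these is stated explicitly in this README, and the helper names are illustrative.

```python
import math


def perplexity(val_loss_nats: float) -> float:
    """Per-character perplexity, assuming the loss is cross-entropy in nats."""
    return math.exp(val_loss_nats)


def bytes_per_param(size_kb: float, params: int) -> float:
    """JSON bytes spent per parameter, assuming 1 KB = 1024 bytes."""
    return size_kb * 1024 / params


# Loss of a model that guesses uniformly over an assumed 64-symbol vocabulary.
UNIFORM_LOSS_64 = math.log(64)

# (name, parameters, JSON size in KB, val loss) copied from the table above.
rows = [
    ("atto-64", 64, 1.8, 2.59),
    ("atto-1024", 1024, 22.3, 2.11),
    ("atto-8192", 8192, 172.7, 1.91),
]

for name, params, size_kb, loss in rows:
    print(
        f"{name:>10}  perplexity={perplexity(loss):6.2f}  "
        f"bytes/param={bytes_per_param(size_kb, params):5.1f}  "
        f"gain over uniform={UNIFORM_LOSS_64 - loss:4.2f} nats"
    )
```

Under these assumptions, a lower perplexity means the model is effectively choosing among fewer candidate characters at each step.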

## Research Findings: Intelligence Density

1.  **Architecture Matters**: Below roughly 1,000 parameters, standard Transformers are highly inefficient because the fixed overhead of Attention and LayerNorm consumes a large share of the budget. Our custom **Neural N-Gram (AttoLM)** architecture is designed so that every parameter participates directly in character prediction.
2.  **The Embedding Threshold**: Moving from 8-dimensional to 16-dimensional embeddings, which happens at 8,192 parameters, produces a marked jump in coherence and lets the model represent more complex character relationships. atto-8192 also has the lowest validation loss in the table (1.91).
3.  **Context vs. Width**: In extremely small models, the context window (memory) and the embedding dimensionality (representation) compete for the same parameters. The table is consistent with this: validation loss does not fall monotonically with size, and atto-4096 (context 56) scores 2.40 while atto-1024 (context 8) scores 2.11. Our 8,192 and 16,384 models are balanced to favor realistic word formation.
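
This README does not spell out the layers of AttoLM. One layout that happens to reproduce the atto-8192 budget from the header in the sample output (embd=16, ctx=28, vocab=64) is a tied embedding table plus one embd x embd matrix per context position: 64 * 16 + 28 * 16 * 16 = 8,192. The sketch below builds a hypothetical model with that layout to show how context length and width draw on the same budget. It is not the project's code: `param_count`, `TinyNGram`, the tanh nonlinearity, and the weight tying are all assumptions made for illustration.

```python
import math
import random


def param_count(vocab: int, embd: int, ctx: int) -> int:
    """Parameter count for the ASSUMED layout (not confirmed by the project)."""
    # Tied input/output embedding table + one embd x embd matrix per context slot.
    return vocab * embd + ctx * embd * embd


class TinyNGram:
    """Hypothetical neural n-gram in which every weight feeds the logits."""

    def __init__(self, vocab: int, embd: int, ctx: int, seed: int = 0):
        rng = random.Random(seed)
        self.vocab, self.embd, self.ctx = vocab, embd, ctx
        # E doubles as the output projection (weight tying).
        self.E = [[rng.gauss(0.0, 0.1) for _ in range(embd)] for _ in range(vocab)]
        # W[slot] mixes the embedding of the character `slot` steps back.
        self.W = [[[rng.gauss(0.0, 0.1) for _ in range(embd)] for _ in range(embd)]
                  for _ in range(ctx)]

    def num_params(self) -> int:
        return (sum(len(row) for row in self.E)
                + sum(len(row) for mat in self.W for row in mat))

    def probs(self, context: list[int]) -> list[float]:
        """Next-character distribution given token ids (most recent last)."""
        recent = context[-self.ctx:]
        hidden = [0.0] * self.embd
        for slot, tok in enumerate(reversed(recent)):
            emb, mat = self.E[tok], self.W[slot]
            for i in range(self.embd):
                hidden[i] += sum(mat[i][j] * emb[j] for j in range(self.embd))
        hidden = [math.tanh(h) for h in hidden]
        logits = [sum(h * e for h, e in zip(hidden, row)) for row in self.E]
        top = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [x / total for x in exps]


model = TinyNGram(vocab=64, embd=16, ctx=28)
next_char_probs = model.probs([5, 17, 3])
print(param_count(64, 16, 28), model.num_params(), len(next_char_probs))
```

In this layout, doubling `embd` quadruples the cost of every context slot, while each extra slot costs only `embd * embd` parameters. That is one concrete form the context-versus-width trade-off can take.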

## Next Steps

This is a **first step** toward much denser models. By optimizing weight initialization, designing custom activation functions, and applying more aggressive parameter tying, we believe "readable Shakespeare" is achievable with fewer than 1,000 parameters.

## Usage

### Training
To train the base series, run:
```bash
python3 train_atto.py
```

### Sampling
To generate samples from every trained model, run:
```bash
python3 sample.py
```

The models are exported as dependency-free JSON files in the `models/` directory, ready for client-side inference in a web browser.
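
Because the weights are plain JSON, the parameter count in the table can be checked with the standard library alone. The sketch below counts every numeric leaf in a JSON document without assuming a particular schema. The file name `models/atto-8192.json` in the comment is a guess at the naming scheme, and any numeric metadata stored in the file (for example a context length) would be counted too, so treat the result as an upper bound.

```python
import json
from pathlib import Path


def count_numbers(node) -> int:
    """Recursively count numeric leaves in parsed JSON (booleans excluded)."""
    if isinstance(node, bool):
        return 0
    if isinstance(node, (int, float)):
        return 1
    if isinstance(node, dict):
        return sum(count_numbers(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_numbers(item) for item in node)
    return 0  # strings and null carry no weights


def count_file(path: Path) -> int:
    """Count numeric leaves in a JSON file on disk."""
    with path.open(encoding="utf-8") as handle:
        return count_numbers(json.load(handle))


# Toy stand-in for an exported model; the real key names are not documented here.
toy = json.loads(
    '{"name": "toy", "E": [[0.1, -0.2], [0.3, 0.4]], "W": [[[1.0, 0.0], [0.0, 1.0]]]}'
)
toy_total = count_numbers(toy)
print(toy_total)

# Hypothetical usage against a real export:
# print(count_file(Path("models/atto-8192.json")))
```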

### Sample generations
```

============================================================
  atto-8192  |  8192 params  |  embd=16  ctx=28  vocab=64
============================================================
  prompt="the":
    Math Laer axfourith tipht's gord me hour hace (remaat ond,
    I'll wore ser ar now pre's for word to styous the mall, stpoul folthis yow apt and be a

  prompt="to be":
     CPon. How gue. O- whut feathent. Thou the in ap bast.  gos A thing of be rith nosset?
    [Tiths that hintend kyele in younk hore;
    Gat sgees wis 

  prompt="Ham":
    . HaCleata,
    Wlotsef yow preerant fore thipe matte of iche in you?
    And spour, the tang offe herees welr then[foritr her veut arve id for houn w


```