	---
	license: cc0-1.0
	language:
	- en
	---
	# Atto: Extreme Intelligence Density Research

	Atto explores the limits of intelligence density: how much knowledge and generative capability can be packed into a neural network under a strict parameter budget.

	The project focuses on the "sub-kiloparameter" and "low-kiloparameter" regimes, training models to generate Shakespearean text with as few as 64 parameters.

	## The Atto Series

	| Model | Parameters | Context | Weights Size (JSON) | Val Loss |
	| :--- | :--- | :--- | :--- | :--- |
	| atto-64 | 64 | 3 | 1.8 KB | 2.59 |
	| atto-128 | 128 | 7 | 3.5 KB | 2.83 |
	| atto-256 | 256 | 8 | 6.0 KB | 2.33 |
	| atto-512 | 512 | 16 | 11.8 KB | 2.44 |
	| atto-1024 | 1,024 | 8 | 22.3 KB | 2.11 |
	| atto-2048 | 2,048 | 24 | 44.3 KB | 2.15 |
	| atto-4096 | 4,096 | 56 | 86.4 KB | 2.40 |
	| atto-8192 | 8,192 | 28 | 172.7 KB | 1.91 |
	| atto-16384 | 16,384 | 60 | ~640 KB | 2.11 |
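
One way to read the Val Loss column is to convert it to perplexity and bits per character. The sketch below assumes the reported loss is mean cross-entropy in nats per character, which this card does not state, and assumes a 64-character vocabulary for every model, a figure that appears here only for atto-8192. The names `VAL_LOSS`, `ASSUMED_VOCAB`, `perplexity`, and `bits_per_char` are illustrative and not part of the repository.

```python
import math

# Validation losses copied from the table in this card.
VAL_LOSS = {
    "atto-64": 2.59,
    "atto-128": 2.83,
    "atto-256": 2.33,
    "atto-512": 2.44,
    "atto-1024": 2.11,
    "atto-2048": 2.15,
    "atto-4096": 2.40,
    "atto-8192": 1.91,
    "atto-16384": 2.11,
}

# Assumption: vocab=64 is only shown for atto-8192; reused here for all models.
ASSUMED_VOCAB = 64


def perplexity(loss_nats: float) -> float:
    """Effective number of equally likely next characters (loss assumed in nats)."""
    return math.exp(loss_nats)


def bits_per_char(loss_nats: float) -> float:
    """Convert a loss in nats per character to bits per character."""
    return loss_nats / math.log(2)


# Loss of a model that guesses uniformly over the assumed vocabulary.
uniform_loss = math.log(ASSUMED_VOCAB)
print(f"uniform baseline: loss={uniform_loss:.2f}")

for name, loss in VAL_LOSS.items():
    print(
        f"{name:>10}  loss={loss:.2f}  "
        f"perplexity={perplexity(loss):5.2f}  "
        f"bits/char={bits_per_char(loss):.2f}"
    )
```

Under these assumptions, every model in the table scores below the uniform baseline of ln(64), roughly 4.16 nats.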

	## Research Findings: Intelligence Density

	1. Architecture Matters: At the sub-1,000-parameter scale, standard Transformers are inefficient because Attention and LayerNorm spend much of the budget on overhead. Our custom Neural N-Gram (AttoLM) architecture is designed so that every parameter directly participates in character prediction.
	2. The Embedding Threshold: We found that moving from 8-dimensional to 16-dimensional embeddings (at 8,192 parameters) produces a marked jump in coherence, letting the model represent more complex character relationships. atto-8192 also has the lowest validation loss in the series (1.91).
	3. Context vs. Width: In extremely small models, the context window (memory) and the embedding dimensionality (representation) compete for the same parameter budget, so the trade-off between them is sharp. Our 8,192- and 16,384-parameter models choose a balance that favors realistic word formation.
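
The context-versus-width trade-off can be made concrete with a small budget calculation. This card does not spell out the AttoLM layer layout, so the formula below is a guess: one tied `vocab x embd` embedding table plus one `embd x embd` matrix per context position. It happens to reproduce the 8,192 total for the atto-8192 configuration listed in the sample output (embd=16, ctx=28, vocab=64), but treat it as an illustration only. `param_count` and `max_context` are hypothetical helpers, not functions from `train_atto.py`.

```python
def param_count(vocab: int, embd: int, ctx: int) -> int:
    """Parameter total under the ASSUMED layout (not confirmed by the source):
    a tied vocab x embd embedding plus one embd x embd matrix per context slot."""
    return vocab * embd + ctx * embd * embd


def max_context(budget: int, vocab: int, embd: int) -> int:
    """Longest context that fits in `budget` parameters for a given width."""
    remaining = budget - vocab * embd
    return max(remaining // (embd * embd), 0)


# The one configuration this card states in full: atto-8192.
print(param_count(vocab=64, embd=16, ctx=28))

# Hold the budget at 8,192 and vary the width: context shrinks quadratically.
for embd in (4, 8, 16, 32):
    print(f"embd={embd:>2}  max ctx={max_context(8192, 64, embd)}")
```

Under this assumed layout, doubling the embedding width from 16 to 32 at a fixed 8,192-parameter budget cuts the affordable context from 28 characters to 6, which is the kind of sharp trade-off the third finding describes.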

	## Next Steps

	This is a first step toward very high intelligence density. By tuning weight initialization, designing custom activation functions, and applying even more aggressive parameter tying, we believe it is possible to achieve "readable Shakespeare" with fewer than 1,000 parameters.

	## Usage

	### Training
	To train the base series, run:
	```bash
	python3 train_atto.py
	```

	### Sampling
	To generate samples from all trained models, run:
	```bash
	python3 sample.py
	```

	The models are exported as dependency-free JSON files in the `models/` directory, ready for client-side inference in a web browser.
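
Because the exports are plain JSON, a file can be sanity-checked without any framework. The sketch below counts every numeric value in a file and reports its size; it deliberately assumes nothing about the JSON schema, which this card does not document. As a consequence, numeric metadata stored alongside the weights (for example a context length) would be counted too, so the result is an upper bound on the parameter count. `count_numbers` and `report` are hypothetical helper names, and the `*.json` file pattern is an assumption.

```python
import json
from pathlib import Path


def count_numbers(node) -> int:
    """Recursively count numeric leaves in parsed JSON (booleans excluded)."""
    if isinstance(node, bool):
        return 0
    if isinstance(node, (int, float)):
        return 1
    if isinstance(node, list):
        return sum(count_numbers(item) for item in node)
    if isinstance(node, dict):
        return sum(count_numbers(value) for value in node.values())
    return 0  # strings and null carry no parameters


def report(models_dir: str = "models") -> None:
    """Print numeric-value count and file size for each JSON file found."""
    for path in sorted(Path(models_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        size_kib = path.stat().st_size / 1024
        print(f"{path.name}: {count_numbers(data)} numeric values, {size_kib:.1f} KiB")
```

Calling `report()` from the repository root after training should list one line per exported model, which can be compared against the Parameters column in the table.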

	### Sample generations
	```

	============================================================
	atto-8192 | 8192 params | embd=16 ctx=28 vocab=64
	============================================================
	prompt="the":
	Math Laer axfourith tipht's gord me hour hace (remaat ond,
	I'll wore ser ar now pre's for word to styous the mall, stpoul folthis yow apt and be a

	prompt="to be":
	CPon. How gue. O- whut feathent. Thou the in ap bast. gos A thing of be rith nosset?
	[Tiths that hintend kyele in younk hore;
	Gat sgees wis

	prompt="Ham":
	. HaCleata,
	Wlotsef yow preerant fore thipe matte of iche in you?
	And spour, the tang offe herees welr then[foritr her veut arve id for houn w


	```