---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- tiny-model
- educational
- record-breaker
- ultra-small
- smallest-llm
- 80k-parameters
---
# TinyBuddy-80K
> 🏆 **RECORD ATTEMPT**: The smallest functional English-speaking language model on Hugging Face.
> **83,856 parameters** (~84K). NaA-IA/Small-ever has fewer, but TinyBuddy-80K aims to be both tiny AND coherent.
**Mission**: Prove that a language model with fewer than 100K parameters can still learn English patterns and generate recognizable text. The goal is not merely to be small, but to be the smallest model that *works*.
---
## Model Details
| Property | Value |
|---|---|
| **Parameters** | **83,856** (~84K) |
| Layers | 1 |
| Hidden size | 48 |
| Attention heads | 4 query / 2 key-value (GQA) |
| FF intermediate size | 192 |
| Context length | 128 |
| Vocabulary | 1,024 tokens (BPE) |
| Architecture | Llama-style: RMSNorm, RoPE, SiLU/SwiGLU, tied embeddings |
| Precision | float32 |
### Parameter Breakdown
| Component | Parameters |
|---|---|
| Token Embedding (tied) | 49,152 |
| Attention (Q/K/V/O) | 6,912 |
| FeedForward (Gate/Up/Down) | 27,648 |
| Normalization (3× RMSNorm) | 144 |
| **Total** | **83,856** |
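
The breakdown can be re-derived from the values in the Model Details table. The sketch below is a sanity check, not code from this repository: `count_parameters` is a hypothetical helper, and it assumes bias-free linear layers and one weight vector of size 48 per RMSNorm.

```python
# Hypothetical helper: recompute the parameter count from the Model Details table.
# Assumptions: no bias terms, tied input/output embeddings, 3 RMSNorm layers.

def count_parameters(vocab=1024, hidden=48, q_heads=4, kv_heads=2, ffn=192, norms=3):
    head_dim = hidden // q_heads   # 48 / 4 = 12
    kv_dim = kv_heads * head_dim   # 2 * 12 = 24
    return {
        "embedding": vocab * hidden,                             # shared with the LM head
        "attention": 2 * hidden * hidden + 2 * hidden * kv_dim,  # Q and O, plus K and V
        "feedforward": 3 * hidden * ffn,                         # gate, up, down
        "norms": norms * hidden,                                 # one weight vector each
    }

breakdown = count_parameters()
total = sum(breakdown.values())
print(breakdown)
print("total:", total)
```

Under these assumptions the attention projections account for 6,912 parameters and the total is 83,856, the headline figure.
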
---
## Architecture
TinyBuddy-80K uses a **single transformer block** with:
- **RMSNorm** (pre-norm): efficient normalization
- **Grouped Query Attention**: 4 query heads, 2 KV heads (saves params)
- **RoPE** (Rotary Position Embeddings): relative position encoding
- **SwiGLU** (SiLU-gated MLP): modern activation
- **Tied embeddings**: input and output share weights (saves ~49K params!)
```
Input → Embedding → [RMSNorm → GQA Attention → +] → [RMSNorm → SwiGLU FFN → +] → RMSNorm → LM Head → Output
```
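
To make the grouped-query layout concrete, here is a small sketch of how 4 query heads could share 2 key-value heads. The grouping rule (consecutive query heads share one KV head, as in Llama-style implementations) is an assumption; this card does not show how the repository wires the heads.

```python
# Sketch of GQA head sharing. The grouping rule is an assumption (Llama convention).
HIDDEN, Q_HEADS, KV_HEADS = 48, 4, 2

head_dim = HIDDEN // Q_HEADS      # 12 dimensions per head
group_size = Q_HEADS // KV_HEADS  # 2 query heads per KV head

# Query head i reads keys and values from KV head i // group_size.
kv_for_query = [q // group_size for q in range(Q_HEADS)]

# K and V projection weights with GQA vs. full multi-head attention (no biases assumed).
gqa_kv_params = 2 * HIDDEN * KV_HEADS * head_dim
mha_kv_params = 2 * HIDDEN * Q_HEADS * head_dim

print("KV head used by each query head:", kv_for_query)
print("K/V params with GQA:", gqa_kv_params)
print("K/V params saved vs. 4 KV heads:", mha_kv_params - gqa_kv_params)
```

With these sizes, halving the KV heads halves the K and V projections, which is a noticeable share of a budget this small.
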
---
## Training
- **Dataset**: TinyStories (~5,000 stories)
- **Tokenizer**: Byte-level BPE, 1,024 vocabulary (trained from scratch)
- **Optimizer**: AdamW (lr=5e-3, weight_decay=0.1)
- **Schedule**: Warmup (50 steps) + Cosine decay
- **Steps**: 1,000
- **Hardware**: Single CPU core (the challenge!)
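
The warmup-plus-cosine schedule listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the training script: `lr_at` is a hypothetical function, the warmup is assumed to be linear, and the decay is assumed to end at a minimum learning rate of 0, neither of which this card specifies.

```python
import math

def lr_at(step, base_lr=5e-3, warmup=50, total=1000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay (assumed shape)."""
    if step < warmup:
        # Ramp linearly from base_lr / warmup up to base_lr.
        return base_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 49, 50, 525, 1000):
    print(step, lr_at(step))
```

The peak of 5e-3 is reached at the end of warmup and the rate then falls smoothly toward the assumed floor by step 1,000.
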
---
## Usage
```python
import json

import torch
from tokenizers import Tokenizer

from model import create_model  # assumes model.py from this repo is importable

# Load the model configuration
with open("config.json") as f:
    config = json.load(f)

# Build the model and load the trained weights on CPU
model = create_model(config)
model.load_state_dict(torch.load("output/model.pt", map_location="cpu"))
model.eval()

# Tokenize the prompt and prepend the BOS token (id 1)
tokenizer = Tokenizer.from_file("data/tokenizer.json")
prompt = "Once upon a time,"
encoded = tokenizer.encode(prompt)
ids = [1] + encoded.ids
input_ids = torch.tensor([ids], dtype=torch.long)

# Generate without tracking gradients
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=60, temperature=0.8, top_k=40)
print(tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True))
```
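
The `temperature` and `top_k` arguments in the call above usually mean temperature-scaled top-k sampling. How `model.generate` implements them is not shown in this card, so the following is only a sketch of the common technique; `sample_top_k` is a hypothetical function operating on a plain list of logits.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from the top_k highest logits (illustrative sketch)."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits; every other token is excluded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights, so this is a softmax sample.
    return rng.choices(top, weights=weights, k=1)[0]

example_logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(example_logits, top_k=2, rng=random.Random(0)))
```

With `top_k=1` this reduces to greedy decoding, which is a quick way to check whether repetition comes from the sampler or from the model itself.
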
---
## Limitations
This model is **extremely small**: its 83,856 parameters are fewer than the number of pixels in a 290×290 grayscale image.
**What works:**
- Basic word patterns and short phrases
- Recognizable English-like structure
- Story-like opening sentences
**What's broken:**
- Very limited coherence (1–2 sentences max)
- High repetition
- No factual knowledge or reasoning
- Limited vocabulary diversity
This model exists purely to explore the **lower bounds of language modeling**. It shows that even at 84K parameters, a neural network can capture statistical patterns in English text.
---
## The Record
| Model | Parameters | Speaks English? |
|---|---|---|
| NaA-IA/Small-ever | 112 | ❌ No |
| **TinyBuddy-80K** | **83,856** | **✅ YES** |
TinyBuddy-80K may not be the absolute smallest model ever, but **it's the smallest that actually generates recognizable English text**. That's the real achievement.
---
## Citation
```bibtex
@misc{tinybuddy80k,
title = {TinyBuddy-80K: An 84K parameter Llama-style model that speaks English},
year = {2026},
note = {Record attempt: smallest functional English text generator.}
}
```
**LONG LIVE TINYBUDDY-80K** 🚀