	---
	language:
	- en
	license: mit
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- tiny-model
	- educational
	- record-breaker
	- ultra-small
	- smallest-llm
	- 80k-parameters
	---

	# TinyBuddy-80K

	> 🏆 RECORD ATTEMPT: The smallest functional English-speaking language model on Hugging Face.
	> 83,856 parameters (~84K). NaA-IA/Small-ever is smaller at 112 parameters but does not produce English; TinyBuddy-80K aims to be the smallest model that is both tiny and coherent.

	Mission: Show that a language model with fewer than 100K parameters can still learn English patterns and generate recognizable text. The goal is not the lowest parameter count outright, but the smallest model that works.

	---

	## Model Details

	\| Property \| Value \|
	\|---\|---\|
	\| Parameters \| 83,856 (~84K) \|
	\| Layers \| 1 \|
	\| Hidden size \| 48 \|
	\| Attention heads \| 4 (query) / 2 (key-value) = GQA \|
	\| FF intermediate size \| 192 \|
	\| Context length \| 128 \|
	\| Vocabulary \| 1,024 tokens (BPE) \|
	\| Architecture \| Llama-style: RMSNorm, RoPE, SiLU/SwiGLU, tied embeddings \|
	\| Precision \| float32 \|

	### Parameter Breakdown

	\| Component \| Parameters \|
	\|---\|---\|
	\| Token Embedding (tied) \| 49,152 \|
	\| Attention (Q/K/V/O) \| 6,912 \|
	\| FeedForward (Gate/Up/Down) \| 27,648 \|
	\| RMSNorm (3×) \| 144 \|
	\| Total \| 83,856 \|
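
The breakdown can be re-derived from the values in the Model Details table. The following is a minimal sketch, assuming bias-free linear layers and a head dimension of hidden size / query heads = 12; the function name `count_parameters` is illustrative and not part of this repository.

```python
def count_parameters(vocab=1024, hidden=48, q_heads=4, kv_heads=2, ffn=192, norms=3):
    """Count weights for a single-block Llama-style model with tied embeddings.

    Assumption: no bias terms in any projection (not confirmed by the repo).
    """
    head_dim = hidden // q_heads      # width of one attention head
    kv_dim = kv_heads * head_dim      # K and V are narrower than Q under GQA

    counts = {
        "embedding": vocab * hidden,  # shared with the LM head, counted once
        "attention": hidden * hidden  # Q projection
        + 2 * hidden * kv_dim         # K and V projections
        + hidden * hidden,            # output projection
        "feed_forward": 3 * hidden * ffn,  # gate, up, and down projections
        "rms_norm": norms * hidden,        # one gain vector per norm
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    for name, value in count_parameters().items():
        print(f"{name:>12}: {value:,}")
```

Under these assumptions the total comes to 83,856, matching the headline figure.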

	---

	## Architecture

	TinyBuddy-80K uses a single transformer block with:

	- RMSNorm (pre-norm) — efficient normalization
	- Grouped Query Attention — 4 query heads, 2 KV heads (saves params)
	- RoPE (Rotary Position Embeddings) — relative position encoding
	- SwiGLU (SiLU-gated MLP) — modern activation
	- Tied embeddings — input and output share weights (saves ~49K params!)

	```
	Input → Embedding → [RMSNorm → GQA Attention → +] → [RMSNorm → SwiGLU FFN → +] → RMSNorm → LM Head → Output
	```
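
Two of the components above are small enough to sketch in plain Python. This is an illustration only, not the repository's code: the `eps` value, the function names, and the convention that consecutive query heads share one key-value head (common in Llama-style implementations) are all assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-element gain.

    eps is an assumed value that guards against division by zero.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def kv_head_for(query_head, q_heads=4, kv_heads=2):
    """Return which key-value head a given query head reads from.

    Assumes consecutive query heads are grouped onto the same KV head.
    """
    group_size = q_heads // kv_heads
    return query_head // group_size


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
    print([kv_head_for(h) for h in range(4)])
```

With 4 query heads and 2 KV heads, each KV head serves two query heads, which is why the K and V projections are half the width of the Q projection.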

	---

	## Training

	- Dataset: TinyStories (~5,000 stories)
	- Tokenizer: Byte-level BPE, 1,024 vocabulary (trained from scratch)
	- Optimizer: AdamW (lr=5e-3, weight_decay=0.1)
	- Schedule: Warmup (50 steps) + Cosine decay
	- Steps: 1,000 on CPU
	- Hardware: Single CPU core (the challenge!)
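
The warmup plus cosine decay schedule listed above can be sketched as a plain function of the step number. This is a minimal sketch under assumptions: linear warmup, decay to a floor of `min_lr=0.0`, and the function name `learning_rate` are illustrative choices, since the training script itself is not shown here.

```python
import math


def learning_rate(step, peak_lr=5e-3, warmup_steps=50, total_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    The warmup shape and min_lr are assumptions; only peak_lr, warmup_steps,
    and total_steps come from the training summary.
    """
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 49, 50, 525, 1000):
        print(step, learning_rate(step))
```

A function of this shape can be wrapped in a per-step scheduler or applied by setting the optimizer's learning rate manually before each update.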

	---

	## Usage

	```python
	import json

	import torch
	from tokenizers import Tokenizer

	from model import create_model  # create_model comes from the repository's model.py

	# Load the model configuration
	with open("config.json") as f:
	    config = json.load(f)

	# Build the model and load the trained weights
	model = create_model(config)
	model.load_state_dict(torch.load("output/model.pt", map_location="cpu"))
	model.eval()

	# Tokenize the prompt and prepend the BOS token (id 1)
	tokenizer = Tokenizer.from_file("data/tokenizer.json")
	prompt = "Once upon a time,"
	encoded = tokenizer.encode(prompt)
	ids = [1] + encoded.ids
	input_ids = torch.tensor([ids], dtype=torch.long)

	# Generate without tracking gradients
	with torch.no_grad():
	    output_ids = model.generate(input_ids, max_new_tokens=60, temperature=0.8, top_k=40)
	print(tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True))
	```
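
The `temperature` and `top_k` arguments passed to `generate` control how the next token is drawn. The repository's implementation is not shown here, so the following is only a sketch of the usual technique (temperature scaling followed by top-k filtering), written in plain Python; `sample_top_k` is a hypothetical helper name.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Draw one token id from logits using temperature scaling and top-k filtering.

    Illustrative only: this may differ from what the model's generate() does.
    """
    k = min(top_k, len(logits))
    # Keep the ids of the k highest-scoring tokens.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Unnormalized softmax weights, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(sample_top_k([0.1, 5.0, 0.3, 2.0], top_k=2))
```

Setting `top_k=1` reduces this to greedy decoding, since only the highest-scoring token survives the filter.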

	---

	## Limitations

	This model is extremely small: its 83,856 parameters are fewer than the pixels in a 300×300 grayscale image (90,000).

	What works:
	- Basic word patterns and short phrases
	- Recognizable English-like structure
	- Story-like opening sentences

	What's broken:
	- Very limited coherence (1–2 sentences max)
	- High repetition
	- No factual knowledge or reasoning
	- Limited vocabulary diversity

	This model exists purely to explore the lower bounds of language modeling. It shows that even at ~84K parameters, a neural network can capture some statistical patterns in English text.

	---

	## The Record

	\| Model \| Parameters \| Speaks English? \|
	\|---\|---\|---\|
	\| NaA-IA/Small-ever \| 112 \| ❌ No \|
	\| TinyBuddy-80K \| 83,856 \| ✅ YES \|

	TinyBuddy-80K is not the absolute smallest model ever, but it aims to be the smallest that actually generates recognizable English text. That's the real achievement.

	---

	## Citation

	```bibtex
	@misc{tinybuddy80k,
	  title = {TinyBuddy-80K: An 84K parameter Llama-style model that speaks English},
	  year = {2026},
	  note = {Record attempt: smallest functional English text generator.}
	}
	```

	LONG LIVE TINYBUDDY-80K 🚀