# ZeroNet-1 (28M)

A 28-million-parameter decoder-only causal Transformer language model built entirely from scratch: no pretrained weights, no fine-tuned checkpoints, no API wrappers.
Every component was written from the ground up: tokenizer, attention mechanism, positional encoding, training loop, and generation pipeline.
## Model Description
ZeroNet-1 is a small but architecturally modern language model designed to demonstrate genuine understanding of Transformer internals. It is trained on the TinyStories dataset and generates coherent short-form English text.
This is not a fine-tuned model. This is not an API wrapper. Every weight was initialized randomly and trained from scratch.
| Property | Value |
|---|---|
| Model Type | Decoder-only causal Transformer |
| Parameters | ~28M |
| Language | English |
| License | MIT |
| Trained From Scratch | Yes (zero pretrained weights) |
## Architecture
| Component | Specification |
|---|---|
| Type | Decoder-only (GPT-style) |
| Layers | 6 |
| Attention Heads | 8 |
| Hidden Dimension (d_model) | 512 |
| Head Dimension | 64 |
| Context Length | 256 tokens |
| MLP Ratio | 4x |
| Vocabulary Size | ~32,000 (BPE) |
| Positional Encoding | RoPE (Rotary Positional Embeddings) |
| Activation Function | SwiGLU |
| Normalization | Pre-norm LayerNorm |
| Weight Tying | Yes (embedding and output head share one matrix) |
| Dropout | 0.1 |
## Architecture Diagram

```text
Input Token IDs
       │
       ▼
┌──────────────┐
│    Token     │
│  Embedding   │   (no positional embedding: RoPE handles position)
└──────┬───────┘
       │
       ▼
┌────────────────────────────────────┐
│       Transformer Block (×6)       │
│                                    │
│   ┌───────────────────────────┐    │
│   │ LayerNorm                 │    │
│   │ Causal Self-Attention     │    │
│   │   ├─ RoPE on Q, K         │    │
│   │   └─ Causal Mask          │    │
│   │ + Residual Connection     │    │
│   ├───────────────────────────┤    │
│   │ LayerNorm                 │    │
│   │ SwiGLU Feed-Forward       │    │
│   │ + Residual Connection     │    │
│   └───────────────────────────┘    │
└────────────────┬───────────────────┘
                 │
                 ▼
        ┌──────────────────┐
        │ Final LayerNorm  │
        └────────┬─────────┘
                 │
                 ▼
        ┌────────────────────┐
        │    Linear Head     │   (weight-tied with embedding)
        │   → Vocab Logits   │
        └────────────────────┘
```
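The architecture table maps naturally onto a small config object. A minimal sketch, assuming a dataclass-style config; the class and field names here are illustrative, not the training script's actual config:

```python
from dataclasses import dataclass


@dataclass
class ZeroNetConfig:
    # Field names are illustrative; values mirror the architecture table.
    vocab_size: int = 32000
    n_layers: int = 6
    n_heads: int = 8
    d_model: int = 512
    context_length: int = 256
    mlp_ratio: int = 4
    dropout: float = 0.1


cfg = ZeroNetConfig()
head_dim = cfg.d_model // cfg.n_heads  # 64, matching the table
```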
## Training Details

### Phase 1: Base Language Model
| Setting | Value |
|---|---|
| Dataset | TinyStories |
| Training Samples | Up to 500,000 |
| Validation Samples | 5,000 |
| Tokenizer | Custom BPE (trained from scratch) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning Rate | 3e-4 (peak) |
| LR Schedule | Cosine annealing with linear warmup |
| Warmup Steps | 500 |
| Weight Decay | 0.1 |
| Batch Size | 8 |
| Epochs | 5 |
| Precision | FP16 mixed precision |
| Gradient Clipping | Max norm 1.0 |
| Objective | Next-token prediction (cross-entropy) |
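The warmup-plus-cosine schedule from the table can be sketched as a `LambdaLR` multiplier on the peak learning rate. A minimal sketch; `total_steps` is illustrative, since the real value depends on dataset size, batch size, and epochs:

```python
import math

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model
opt = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps = 500
total_steps = 10_000  # illustrative value


def lr_lambda(step):
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```

Each training step then calls `opt.step()` followed by `sched.step()`.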
### Phase 2: Chat Fine-Tuning
| Setting | Value |
|---|---|
| Format | `<\|user\|>` / `<\|assistant\|>` delimiters |
| Learning Rate | 1e-4 |
| Epochs | 3 |
| Loss Masking | Loss computed on assistant tokens only |
| Data | Curated QA pairs |
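Computing the loss on assistant tokens only can be sketched with `cross_entropy`'s `ignore_index`. The six-token sequence below is a toy stand-in for a real chat example:

```python
import torch
import torch.nn.functional as F

# Toy batch: one sequence of 6 tokens where the first 3 are the user turn.
# Labels for non-assistant tokens are set to -100, which cross_entropy
# skips via ignore_index, so the loss covers assistant tokens only.
vocab_size = 10
logits = torch.randn(1, 6, vocab_size)
labels = torch.tensor([[3, 5, 1, 4, 2, 7]])

assistant_mask = torch.tensor([[0, 0, 0, 1, 1, 1]], dtype=torch.bool)
masked_labels = labels.masked_fill(~assistant_mask, -100)

loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch * seq, vocab)
    masked_labels.view(-1),       # (batch * seq,)
    ignore_index=-100,
)
```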
### Phase 3: Quantization
| Setting | Value |
|---|---|
| Method | Post-training dynamic quantization |
| Target Layers | `nn.Linear` |
| Precision | INT8 |
| Compression | ~2-4x smaller |
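Post-training dynamic quantization of the `nn.Linear` layers can be reproduced with PyTorch's built-in helper. The two-layer model below is a stand-in; for ZeroNet-1 the full model would be passed instead:

```python
import torch
import torch.nn as nn

# Stand-in model with the same d_model / 4x MLP shapes as one block.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
model.eval()

# Replace every nn.Linear with a dynamically quantized INT8 version:
# weights are stored as INT8, activations quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

Only the weights shrink, which is why the compression is roughly 2-4x rather than a full 4x.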
## Tokenizer
| Property | Value |
|---|---|
| Type | Byte-Pair Encoding (BPE) |
| Library | tokenizers (Hugging Face) |
| Vocabulary Size | ~32,000 |
| Trained On | TinyStories training split |
| Special Tokens | `<pad>`, `<unk>`, `<bos>`, `<eos>`, `<\|user\|>`, `<\|assistant\|>` |
The tokenizer was trained from scratch on the dataset. No pretrained tokenizer was used.
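Training a comparable BPE tokenizer with the `tokenizers` library looks roughly like this. The corpus file is a tiny stand-in for the TinyStories split, and the exact pre-tokenizer and trainer settings are assumptions, not the project's actual configuration:

```python
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Tiny stand-in corpus; in practice this would be the TinyStories split.
corpus = Path("tiny_corpus.txt")
corpus.write_text("once upon a time there was a little dog named rex\n" * 200)

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=32000,  # the trainer stops early if the corpus is too small
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>", "<|user|>", "<|assistant|>"],
)
tokenizer.train([str(corpus)], trainer)
tokenizer.save("tokenizer.json")
```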
## Usage

### Loading the Model
```python
import torch
from tokenizers import Tokenizer

# Load the tokenizer trained alongside the model
tokenizer = Tokenizer.from_file("zeronet-1-28M/tokenizer.json")

# Define the model class (ZeroNet1 from the training script), then:
model = ZeroNet1(config)
model.load_state_dict(torch.load("zeronet-1-28M/pytorch_model.bin", map_location="cpu"))
model.eval()
```
### Text Generation

```python
prompt = "Once upon a time"
encoded = tokenizer.encode(prompt)
input_ids = torch.tensor([encoded.ids])

with torch.no_grad():
    for _ in range(100):
        logits, _ = model(input_ids)
        next_token_logits = logits[0, -1, :] / 0.8  # temperature
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=-1)

output = tokenizer.decode(input_ids[0].tolist())
print(output)
```
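The loop above uses plain temperature sampling; the top-k and top-p (nucleus) sampling the model supports can be added as a logits filter before the softmax. A minimal sketch, assuming 1-D per-step logits:

```python
import torch


def filter_logits(logits, top_k=50, top_p=0.9):
    # Keep only the top_k highest logits, then drop tokens outside the
    # smallest set whose cumulative probability exceeds top_p.
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = probs.cumsum(dim=-1)
    remove = cum - probs > top_p  # shift by one so the boundary token survives
    mask = torch.zeros_like(remove)
    mask.scatter_(-1, sorted_idx, remove)  # map back to the original order
    return logits.masked_fill(mask, float("-inf"))
```

In the generation loop, `next_token_logits` would pass through `filter_logits` before `torch.softmax`.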
### Chat Mode

```python
prompt = "<|user|>\nWhat is the sun?\n<|assistant|>\n"
# Encode the prompt and generate exactly as in the text-generation loop above
```
## What This Model Demonstrates

This project demonstrates practical understanding of:

- ✅ Transformer architecture (not just API calls)
- ✅ Custom tokenizer training (BPE from scratch)
- ✅ Rotary Positional Embeddings (RoPE)
- ✅ SwiGLU activation functions
- ✅ Pre-norm architecture
- ✅ Weight tying
- ✅ Causal masking for autoregressive generation
- ✅ Mixed precision training (FP16)
- ✅ Cosine learning rate scheduling with warmup
- ✅ Gradient clipping
- ✅ Next-token prediction objective
- ✅ Top-k and top-p (nucleus) sampling
- ✅ Chat fine-tuning with masked loss
- ✅ Post-training INT8 quantization
- ✅ End-to-end ML pipeline
## Limitations

An honest account of what this model is and is not.

**What it IS:**

- An educational project demonstrating LLM internals
- A proof of concept for from-scratch training
- A portfolio piece showing engineering competence
- Capable of generating coherent short stories (TinyStories domain)

**What it is NOT:**

- ❌ A production-ready language model
- ❌ Factually reliable
- ❌ Safe for deployment without content filters
- ❌ Comparable to GPT-4, LLaMA, or any large-scale model
- ❌ Trained with RLHF or safety alignment
- ❌ Suitable for real-world applications

**Known Limitations:**

- Small context window (256 tokens)
- Limited to simple English text (TinyStories domain)
- May produce repetitive or nonsensical output
- No safety filtering or content moderation
- Chat capability is minimal (trained on ~20 QA pairs)
## Technical Notes

**Why RoPE instead of absolute positional embeddings?**

- Encodes relative position, not absolute
- Generalizes better to unseen sequence lengths
- Used in LLaMA, Mistral, Qwen, and GPT-NeoX
- Applied to Q and K only (never V)
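A minimal RoPE sketch follows. It pairs channel `i` with channel `i + head_dim/2`, which is one common convention; the training code's exact pairing may differ:

```python
import torch


def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim). Rotate each channel pair by a
    # position-dependent angle; applied to Q and K before attention scores.
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()  # each (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because it is a pure rotation, RoPE preserves vector norms and leaves position 0 unchanged, and the Q·K dot products end up depending only on relative offsets.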
**Why SwiGLU instead of GELU/ReLU?**

- Better gradient flow during training
- Used in LLaMA, PaLM, and Mistral
- Slightly more parameters, but measurably better performance
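A minimal SwiGLU feed-forward sketch in PyTorch. The hidden width simply applies the 4x ratio from the architecture table, and the `w1`/`w2`/`w3` naming follows the LLaMA convention; both are assumptions about this project's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    # FFN(x) = W2(silu(W1 x) * W3 x): a gated MLP where the silu branch
    # modulates the value branch elementwise.
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

The extra `w3` matrix is where the "slightly more parameters" comes from relative to a plain two-matrix MLP.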
**Why Pre-norm instead of Post-norm?**

- Easier to train (more stable gradients)
- Standard in modern LLMs (GPT-2 onward, LLaMA, etc.)
- The original Transformer (Vaswani et al., 2017) used post-norm, but the field has since moved to pre-norm
**Why Weight Tying?**

- Reduces parameter count
- The embedding and output head share the same weight matrix
- Standard practice since GPT-2
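Weight tying is a one-line sketch in PyTorch: point the output head's weight at the embedding matrix so one tensor serves both roles.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Share one (vocab_size, d_model) matrix between input embedding and
# output head, saving vocab_size * d_model parameters.
lm_head.weight = embedding.weight
```

With a ~32,000-token vocabulary and d_model of 512, that is roughly 16M parameters saved, a large fraction of a 28M model.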
## Citation
If you use this model or code for educational purposes:
## Hardware

| Component | Specification |
|---|---|
| Training Device | NVIDIA GPU (also runs on CPU and Apple MPS, more slowly) |
| VRAM Required | ~4 GB minimum |
| Training Time | 1-3 hours (GPU), 12-24 hours (CPU) |
## Contact
For questions about the architecture or training process, open an issue in the repository.
Built from scratch. No shortcuts. No pretrained weights. No API wrappers.
---
## What Each Metadata Field Maps To
| HF Metadata Field | Value in YAML | Purpose |
|---|---|---|
| `license` | `mit` | Shows license badge |
| `datasets` | `roneneldan/TinyStories` | Links to dataset page |
| `language` | `en` | Shows language tag |
| `metrics` | `perplexity` | Shows evaluation metric |
| `pipeline_tag` | `text-generation` | Enables inference widget category |
| `library_name` | `pytorch` | Shows framework badge |
| `tags` | list of tags | Searchable tags on HF Hub |
| `model-index` | eval results block | Populates eval results table |
## Important
**Before committing, update these placeholder values:**
| Placeholder | Replace With |
|---|---|
| `Your Name` in citation | Your actual name |
| `value: 0.0` in metrics | Your actual perplexity score from training |
| Contact section | Your actual contact method |