# ZeroNet-1 (28M)

A 28 million parameter decoder-only causal Transformer language model built entirely from scratch β€” no pretrained weights, no fine-tuned checkpoints, no API wrappers.

Every component was written from the ground up: tokenizer, attention mechanism, positional encoding, training loop, and generation pipeline.


## Model Description

ZeroNet-1 is a small but architecturally modern language model designed to demonstrate genuine understanding of Transformer internals. It is trained on the TinyStories dataset and generates coherent short-form English text.

This is not a fine-tuned model. This is not an API wrapper. Every weight was initialized randomly and trained from scratch.

| Property | Value |
|---|---|
| Model Type | Decoder-only causal Transformer |
| Parameters | ~28M |
| Language | English |
| License | MIT |
| Trained From Scratch | Yes β€” zero pretrained weights |

## Architecture

| Component | Specification |
|---|---|
| Type | Decoder-only (GPT-style) |
| Layers | 6 |
| Attention Heads | 8 |
| Hidden Dimension (d_model) | 512 |
| Head Dimension | 64 |
| Context Length | 256 tokens |
| MLP Ratio | 4x |
| Vocabulary Size | ~32,000 (BPE) |
| Positional Encoding | RoPE (Rotary Positional Embeddings) |
| Activation Function | SwiGLU |
| Normalization | Pre-norm LayerNorm |
| Weight Tying | Yes (embedding ↔ output head) |
| Dropout | 0.1 |

### Architecture Diagram

```text
          Input Token IDs
                 β”‚
                 β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚   Token     β”‚
          β”‚  Embedding  β”‚  (no positional embedding β€” RoPE handles position)
          β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      Transformer Block (Γ—6)      β”‚
β”‚                                  β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚   β”‚ LayerNorm               β”‚   β”‚
β”‚   β”‚ Causal Self-Attention   β”‚   β”‚
β”‚   β”‚  └─ RoPE on Q, K        β”‚   β”‚
β”‚   β”‚  └─ Causal Mask         β”‚   β”‚
β”‚   β”‚ + Residual Connection   β”‚   β”‚
β”‚   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€   β”‚
β”‚   β”‚ LayerNorm               β”‚   β”‚
β”‚   β”‚ SwiGLU Feed-Forward     β”‚   β”‚
β”‚   β”‚ + Residual Connection   β”‚   β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚ Final LayerNorm  β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚   Linear Head    β”‚  (weight-tied with embedding)
        β”‚  β†’ Vocab Logits  β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```


## Training Details

### Phase 1: Base Language Model

| Setting | Value |
|---|---|
| Dataset | TinyStories |
| Training Samples | Up to 500,000 |
| Validation Samples | 5,000 |
| Tokenizer | Custom BPE (trained from scratch) |
| Optimizer | AdamW (Ξ²1=0.9, Ξ²2=0.95) |
| Learning Rate | 3e-4 (peak) |
| LR Schedule | Cosine annealing with linear warmup |
| Warmup Steps | 500 |
| Weight Decay | 0.1 |
| Batch Size | 8 |
| Epochs | 5 |
| Precision | FP16 mixed precision |
| Gradient Clipping | Max norm 1.0 |
| Objective | Next-token prediction (cross-entropy) |
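The warmup-plus-cosine schedule can be sketched as a plain function. The 500 warmup steps and 3e-4 peak come from the table above; the total step count and minimum LR here are illustrative placeholders, not values from the training run:

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=500, total_steps=10_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a training loop this would typically be wired up via `torch.optim.lr_scheduler.LambdaLR` or by writing `lr_at(step)` into each optimizer param group every step.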

### Phase 2: Chat Fine-Tuning

| Setting | Value |
|---|---|
| Format | `<\|user\|>` / `<\|assistant\|>` delimiters |
| Learning Rate | 1e-4 |
| Epochs | 3 |
| Loss Masking | Loss computed on assistant tokens only |
| Data | Curated QA pairs |
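Assistant-only loss masking is commonly implemented by setting non-assistant label positions to `cross_entropy`'s `ignore_index`. A minimal sketch of that idea (the function name and the shift convention are illustrative; the actual training script may differ in detail):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def masked_lm_loss(logits, input_ids, assistant_mask):
    """Next-token loss computed only where the target token is an assistant token.

    logits:         (batch, seq, vocab)
    input_ids:      (batch, seq)
    assistant_mask: (batch, seq) bool, True on assistant tokens
    """
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX        # mask user/prompt tokens
    # Shift: the logits at position t predict the token at position t+1
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=IGNORE_INDEX)
```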

### Phase 3: Quantization

| Setting | Value |
|---|---|
| Method | Post-training dynamic quantization |
| Target Layers | nn.Linear |
| Precision | INT8 |
| Compression | ~2-4x smaller |
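PyTorch's post-training dynamic quantization is a single call over an eval-mode model. A sketch on a stand-in module (the `nn.Sequential` below is a placeholder, not the real ZeroNet1 class; the actual call targets its `nn.Linear` layers the same way):

```python
import torch
import torch.nn as nn

# Stand-in for the trained model (illustrative only)
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).eval()

quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # which layer types to quantize
    dtype=torch.qint8,  # INT8 weights; activations quantized on the fly
)

out = quantized(torch.randn(1, 512))
```

Only the weights are stored in INT8; activations are quantized dynamically per batch, which is why no calibration dataset is needed.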

## Tokenizer

| Property | Value |
|---|---|
| Type | Byte-Pair Encoding (BPE) |
| Library | tokenizers (Hugging Face) |
| Vocabulary Size | ~32,000 |
| Trained On | TinyStories training split |
| Special Tokens | `<pad>`, `<unk>`, `<bos>`, `<eos>`, `<\|user\|>`, `<\|assistant\|>` |

The tokenizer was trained from scratch on the dataset. No pretrained tokenizer was used.
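Training such a BPE tokenizer with the Hugging Face `tokenizers` library takes only a few lines. A sketch with an in-memory toy corpus (the real run trained on the TinyStories text files; the two-sentence corpus here is purely illustrative, and the byte-level pre-tokenizer is an assumption about the setup):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>", "<|user|>", "<|assistant|>"],
)

# Any iterator of strings works; the real run used the TinyStories training split.
corpus = ["Once upon a time there was a little dog.", "The dog liked to play."]
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("Once upon a time").ids
```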


## Usage

### Loading the Model

```python
import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("zeronet-1-28M/tokenizer.json")

# Define the model class (ZeroNet1 from the training script), then:
model = ZeroNet1(config)
model.load_state_dict(torch.load("zeronet-1-28M/pytorch_model.bin", map_location="cpu"))
model.eval()
```

### Text Generation

```python
prompt = "Once upon a time"
encoded = tokenizer.encode(prompt)
input_ids = torch.tensor([encoded.ids])

with torch.no_grad():
    for _ in range(100):
        logits, _ = model(input_ids)
        next_token_logits = logits[0, -1, :] / 0.8  # temperature
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=-1)
        # Note: truncate input_ids to the last 256 tokens if generation
        # approaches the model's context length.

output = tokenizer.decode(input_ids[0].tolist())
print(output)
```

### Chat Mode

```python
prompt = "<|user|>\nWhat is the sun?\n<|assistant|>\n"
# Feed to the model and generate as above
```

## What This Model Demonstrates
This project demonstrates practical understanding of:

- βœ… Transformer architecture (not just API calls)
- βœ… Custom tokenizer training (BPE from scratch)
- βœ… Rotary Positional Embeddings (RoPE)
- βœ… SwiGLU activation functions
- βœ… Pre-norm architecture
- βœ… Weight tying
- βœ… Causal masking for autoregressive generation
- βœ… Mixed precision training (FP16)
- βœ… Cosine learning rate scheduling with warmup
- βœ… Gradient clipping
- βœ… Next-token prediction objective
- βœ… Top-k and top-p (nucleus) sampling
- βœ… Chat fine-tuning with masked loss
- βœ… Post-training INT8 quantization
- βœ… End-to-end ML pipeline
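Of the sampling techniques listed, nucleus (top-p) filtering is the least obvious. A minimal sketch over a 1-D logits vector (the threshold value is illustrative; this is a common formulation, not necessarily the exact one in the training script):

```python
import torch

def top_p_filter(logits, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability exceeds p;
    set everything else to -inf so softmax + multinomial ignores it."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    # Drop tokens where the cumulative mass *before* them already exceeds p,
    # so the highest-probability token is always kept.
    remove = cumulative - probs > p
    sorted_logits[remove] = float("-inf")
    filtered = torch.full_like(logits, float("-inf"))
    filtered[sorted_idx] = sorted_logits
    return filtered
```

In the generation loop above, this would be applied to `next_token_logits` before the softmax.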
## Limitations

An honest accounting of what this model is and is not.

**What it IS:**

- An educational project demonstrating LLM internals
- A proof of concept for from-scratch training
- A portfolio piece showing engineering competence
- Capable of generating coherent short stories (TinyStories domain)

**What it is NOT:**

- ❌ A production-ready language model
- ❌ Factually reliable
- ❌ Safe for deployment without content filters
- ❌ Comparable to GPT-4, LLaMA, or any large-scale model
- ❌ Trained with RLHF or safety alignment
- ❌ Suitable for real-world applications

**Known Limitations:**

- Small context window (256 tokens)
- Limited to simple English text (TinyStories domain)
- May produce repetitive or nonsensical output
- No safety filtering or content moderation
- Minimal chat capability (fine-tuned on ~20 QA pairs)

## Technical Notes

### Why RoPE instead of absolute positional embeddings?

- Encodes relative position, not absolute
- Generalizes better to unseen sequence lengths
- Used in LLaMA, Mistral, Qwen, and GPT-NeoX
- Applied to Q and K only (never V)
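The core of RoPE is rotating paired query/key channels by a position-dependent angle. A minimal sketch using the common half-split pairing (one of several equivalent layouts; the actual implementation may pair channels differently):

```python
import torch

def apply_rope(x, base=10000.0):
    """x: (batch, heads, seq, head_dim). Applied to Q and K, never to V."""
    *_, seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies: base^(-2i/dim) for i in [0, dim/2)
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq, half), broadcasts below
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because only angles (not learned vectors) encode position, the dot product between a rotated query and key depends only on their relative offset, which is what gives RoPE its extrapolation behavior.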
### Why SwiGLU instead of GELU/ReLU?

- Better gradient flow during training
- Used in LLaMA, PaLM, Mistral
- Slightly more parameters, but measurably better performance
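SwiGLU replaces the single hidden projection of a GELU/ReLU MLP with a gated pair of projections. A minimal sketch using the dimensions from the architecture table; the bias-free layers are an assumption in the LLaMA style, and the real module may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model=512, hidden=4 * 512):
        super().__init__()
        self.w_gate = nn.Linear(d_model, hidden, bias=False)  # gate branch
        self.w_up   = nn.Linear(d_model, hidden, bias=False)  # value branch
        self.w_down = nn.Linear(hidden, d_model, bias=False)  # project back

    def forward(self, x):
        # SiLU(x W_gate) elementwise-gates (x W_up), then projects back down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

The extra parameters come from the second hidden projection (`w_up`); this is the trade-off the bullet above refers to.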
### Why Pre-norm instead of Post-norm?

- Easier to train (more stable gradients)
- Standard in modern LLMs (GPT-2 onward, LLaMA, etc.)
- The original Vaswani et al. (2017) Transformer used post-norm, but the field has since moved on
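The difference is only where LayerNorm sits relative to the residual connection. A sketch with the attention and MLP sublayers abstracted as injected callables (an illustrative structure, not the project's actual block class):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: x + sublayer(norm(x)) β€” normalize *before* each sublayer."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # post-norm would instead be ln1(x + attn(x))
        x = x + self.mlp(self.ln2(x))
        return x
```

Because the residual path is never normalized, gradients flow through an identity branch from the loss to every layer, which is the stability argument above.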
### Why Weight Tying?

- Reduces parameter count
- The embedding and output head share the same weight matrix
- Standard practice since GPT-2
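In PyTorch, tying is a single assignment that makes both modules share one parameter tensor (variable names here are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32_000, 512
embedding = nn.Embedding(vocab_size, d_model)       # weight: (vocab, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # weight: (vocab, d_model)

lm_head.weight = embedding.weight  # one tensor, two roles

# Gradients from both the input embedding and the output projection
# now accumulate into the same (vocab_size x d_model) matrix.
```

This works because `nn.Linear` stores its weight as (out_features, in_features), which matches the embedding table's shape exactly.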
## Citation

If you use this model or code for educational purposes:

## Hardware

| Component | Specification |
|---|---|
| Training Device | NVIDIA GPU |
| Also Runs On | CPU (slower), Apple MPS |
| VRAM Required | ~4 GB minimum |
| Training Time | 1-3 hours (GPU), 12-24 hours (CPU) |
## Contact
For questions about the architecture or training process, open an issue in the repository.

Built from scratch. No shortcuts. No pretrained weights. No API wrappers.



---

## What Each Metadata Field Maps To

| HF Metadata Field | Value in YAML | Purpose |
|---|---|---|
| `license` | `mit` | Shows license badge |
| `datasets` | `roneneldan/TinyStories` | Links to dataset page |
| `language` | `en` | Shows language tag |
| `metrics` | `perplexity` | Shows evaluation metric |
| `pipeline_tag` | `text-generation` | Enables inference widget category |
| `library_name` | `pytorch` | Shows framework badge |
| `tags` | list of tags | Searchable tags on HF Hub |
| `model-index` | eval results block | Populates eval results table |

## Important

**Before committing, update these placeholder values:**

| Placeholder | Replace With |
|---|---|
| `Your Name` in citation | Your actual name |
| `value: 0.0` in metrics | Your actual perplexity score from training |
| Contact section | Your actual contact method |