# ZeroNet-1 (28M)

A 28-million-parameter decoder-only causal Transformer language model built entirely from scratch: no pretrained weights, no fine-tuned checkpoints, no API wrappers.
Every component was written from the ground up: tokenizer, attention mechanism, positional encoding, training loop, and generation pipeline.
## Model Description
ZeroNet-1 is a small but architecturally modern language model designed to demonstrate genuine understanding of Transformer internals. It is trained on the TinyStories dataset and generates coherent short-form English text.
This is not a fine-tuned model. This is not an API wrapper. Every weight was initialized randomly and trained from scratch.
| Property | Value |
|---|---|
| Model Type | Decoder-only causal Transformer |
| Parameters | ~28M |
| Language | English |
| License | MIT |
| Trained From Scratch | Yes (zero pretrained weights) |
## Architecture
| Component | Specification |
|---|---|
| Type | Decoder-only (GPT-style) |
| Layers | 6 |
| Attention Heads | 8 |
| Hidden Dimension (d_model) | 512 |
| Head Dimension | 64 |
| Context Length | 256 tokens |
| MLP Ratio | 4x |
| Vocabulary Size | ~32,000 (BPE) |
| Positional Encoding | RoPE (Rotary Positional Embeddings) |
| Activation Function | SwiGLU |
| Normalization | Pre-norm LayerNorm |
| Weight Tying | Yes (embedding and output head share one matrix) |
| Dropout | 0.1 |
## Architecture Diagram

```text
Input Token IDs
       │
       ▼
┌──────────────┐
│    Token     │
│  Embedding   │   (no positional embedding: RoPE handles position)
└──────┬───────┘
       │
       ▼
┌────────────────────────────────────┐
│       Transformer Block (×6)       │
│                                    │
│   ┌───────────────────────────┐    │
│   │ LayerNorm                 │    │
│   │ Causal Self-Attention     │    │
│   │   ├─ RoPE on Q, K         │    │
│   │   └─ Causal Mask          │    │
│   │ + Residual Connection     │    │
│   ├───────────────────────────┤    │
│   │ LayerNorm                 │    │
│   │ SwiGLU Feed-Forward       │    │
│   │ + Residual Connection     │    │
│   └───────────────────────────┘    │
└────────────────┬───────────────────┘
                 │
                 ▼
        ┌──────────────────┐
        │ Final LayerNorm  │
        └────────┬─────────┘
                 │
                 ▼
        ┌────────────────────┐
        │    Linear Head     │   (weight-tied with embedding)
        │   → Vocab Logits   │
        └────────────────────┘
```
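The architecture table maps naturally onto a small config object. A minimal sketch, assuming a dataclass-style config; the class and field names here are illustrative, not the training script's actual config:

```python
from dataclasses import dataclass


@dataclass
class ZeroNetConfig:
    # Field names are illustrative; values mirror the architecture table.
    vocab_size: int = 32000
    n_layers: int = 6
    n_heads: int = 8
    d_model: int = 512
    context_length: int = 256
    mlp_ratio: int = 4
    dropout: float = 0.1


cfg = ZeroNetConfig()
head_dim = cfg.d_model // cfg.n_heads  # 64, matching the table
```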
## Training Details

### Phase 1: Base Language Model
| Setting | Value |
|---|---|
| Dataset | TinyStories |
| Training Samples | Up to 500,000 |
| Validation Samples | 5,000 |
| Tokenizer | Custom BPE (trained from scratch) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning Rate | 3e-4 (peak) |
| LR Schedule | Cosine annealing with linear warmup |
| Warmup Steps | 500 |
| Weight Decay | 0.1 |
| Batch Size | 8 |
| Epochs | 5 |
| Precision | FP16 mixed precision |
| Gradient Clipping | Max norm 1.0 |
| Objective | Next-token prediction (cross-entropy) |
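The warmup-plus-cosine schedule from the table can be sketched as a `LambdaLR` multiplier on the peak learning rate. A minimal sketch; `total_steps` is illustrative, since the real value depends on dataset size, batch size, and epochs:

```python
import math

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model
opt = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps = 500
total_steps = 10_000  # illustrative value


def lr_lambda(step):
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```

Each training step then calls `opt.step()` followed by `sched.step()`.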
### Phase 2: Chat Fine-Tuning
| Setting | Value |
|---|---|
| Format | `<\|user\|>` / `<\|assistant\|>` delimiters |
| Learning Rate | 1e-4 |
| Epochs | 3 |
| Loss Masking | Loss computed on assistant tokens only |
| Data | Curated QA pairs |
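Computing the loss on assistant tokens only can be sketched with `cross_entropy`'s `ignore_index`. The six-token sequence below is a toy stand-in for a real chat example:

```python
import torch
import torch.nn.functional as F

# Toy batch: one sequence of 6 tokens where the first 3 are the user turn.
# Labels for non-assistant tokens are set to -100, which cross_entropy
# skips via ignore_index, so the loss covers assistant tokens only.
vocab_size = 10
logits = torch.randn(1, 6, vocab_size)
labels = torch.tensor([[3, 5, 1, 4, 2, 7]])

assistant_mask = torch.tensor([[0, 0, 0, 1, 1, 1]], dtype=torch.bool)
masked_labels = labels.masked_fill(~assistant_mask, -100)

loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch * seq, vocab)
    masked_labels.view(-1),       # (batch * seq,)
    ignore_index=-100,
)
```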
### Phase 3: Quantization
| Setting | Value |
|---|---|
| Method | Post-training dynamic quantization |
| Target Layers | `nn.Linear` |
| Precision | INT8 |
| Compression | ~2-4x smaller |
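Post-training dynamic quantization of the `nn.Linear` layers can be reproduced with PyTorch's built-in helper. The two-layer model below is a stand-in; for ZeroNet-1 the full model would be passed instead:

```python
import torch
import torch.nn as nn

# Stand-in model with the same d_model / 4x MLP shapes as one block.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
model.eval()

# Replace every nn.Linear with a dynamically quantized INT8 version:
# weights are stored as INT8, activations quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

Only the weights shrink, which is why the compression is roughly 2-4x rather than a full 4x.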
## Tokenizer
| Property | Value |
|---|---|
| Type | Byte-Pair Encoding (BPE) |
| Library | tokenizers (Hugging Face) |
| Vocabulary Size | ~32,000 |
| Trained On | TinyStories training split |
| Special Tokens | `<pad>`, `<unk>`, `<bos>`, `<eos>`, `<\|user\|>`, `<\|assistant\|>` |
The tokenizer was trained from scratch on the dataset. No pretrained tokenizer was used.
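Training a comparable BPE tokenizer with the `tokenizers` library looks roughly like this. The corpus file is a tiny stand-in for the TinyStories split, and the exact pre-tokenizer and trainer settings are assumptions, not the project's actual configuration:

```python
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Tiny stand-in corpus; in practice this would be the TinyStories split.
corpus = Path("tiny_corpus.txt")
corpus.write_text("once upon a time there was a little dog named rex\n" * 200)

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=32000,  # the trainer stops early if the corpus is too small
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>", "<|user|>", "<|assistant|>"],
)
tokenizer.train([str(corpus)], trainer)
tokenizer.save("tokenizer.json")
```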
## Usage

### Loading the Model
```python
import torch
from tokenizers import Tokenizer

# Load the tokenizer trained alongside the model
tokenizer = Tokenizer.from_file("zeronet-1-28M/tokenizer.json")

# Define the model class (ZeroNet1 from the training script), then:
model = ZeroNet1(config)
model.load_state_dict(torch.load("zeronet-1-28M/pytorch_model.bin", map_location="cpu"))
model.eval()
```
### Text Generation

```python
prompt = "Once upon a time"
encoded = tokenizer.encode(prompt)
input_ids = torch.tensor([encoded.ids])

with torch.no_grad():
    for _ in range(100):
        logits, _ = model(input_ids)
        next_token_logits = logits[0, -1, :] / 0.8  # temperature
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=-1)

output = tokenizer.decode(input_ids[0].tolist())
print(output)
```
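The loop above uses plain temperature sampling; the top-k and top-p (nucleus) sampling the model supports can be added as a logits filter before the softmax. A minimal sketch, assuming 1-D per-step logits:

```python
import torch


def filter_logits(logits, top_k=50, top_p=0.9):
    # Keep only the top_k highest logits, then drop tokens outside the
    # smallest set whose cumulative probability exceeds top_p.
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = probs.cumsum(dim=-1)
    remove = cum - probs > top_p  # shift by one so the boundary token survives
    mask = torch.zeros_like(remove)
    mask.scatter_(-1, sorted_idx, remove)  # map back to the original order
    return logits.masked_fill(mask, float("-inf"))
```

In the generation loop, `next_token_logits` would pass through `filter_logits` before `torch.softmax`.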
### Chat Mode

```python
prompt = "<|user|>\nWhat is the sun?\n<|assistant|>\n"
# Encode the prompt and generate exactly as in the text-generation loop above
```
## What This Model Demonstrates

This project demonstrates practical understanding of:

- ✅ Transformer architecture (not just API calls)
- ✅ Custom tokenizer training (BPE from scratch)
- ✅ Rotary Positional Embeddings (RoPE)
- ✅ SwiGLU activation functions
- ✅ Pre-norm architecture
- ✅ Weight tying
- ✅ Causal masking for autoregressive generation
- ✅ Mixed precision training (FP16)
- ✅ Cosine learning rate scheduling with warmup
- ✅ Gradient clipping
- ✅ Next-token prediction objective
- ✅ Top-k and top-p (nucleus) sampling
- ✅ Chat fine-tuning with masked loss
- ✅ Post-training INT8 quantization
- ✅ End-to-end ML pipeline
## Limitations

An honest account of what this model is and is not.

**What it IS:**

- An educational project demonstrating LLM internals
- A proof of concept for from-scratch training
- A portfolio piece showing engineering competence
- Capable of generating coherent short stories (TinyStories domain)

**What it is NOT:**

- ❌ A production-ready language model
- ❌ Factually reliable
- ❌ Safe for deployment without content filters
- ❌ Comparable to GPT-4, LLaMA, or any large-scale model
- ❌ Trained with RLHF or safety alignment
- ❌ Suitable for real-world applications

**Known Limitations:**

- Small context window (256 tokens)
- Limited to simple English text (TinyStories domain)
- May produce repetitive or nonsensical output
- No safety filtering or content moderation
- Chat capability is minimal (trained on ~20 QA pairs)
## Technical Notes

**Why RoPE instead of absolute positional embeddings?**

- Encodes relative position, not absolute
- Generalizes better to unseen sequence lengths
- Used in LLaMA, Mistral, Qwen, and GPT-NeoX
- Applied to Q and K only (never V)
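A minimal RoPE sketch follows. It pairs channel `i` with channel `i + head_dim/2`, which is one common convention; the training code's exact pairing may differ:

```python
import torch


def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim). Rotate each channel pair by a
    # position-dependent angle; applied to Q and K before attention scores.
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()  # each (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because it is a pure rotation, RoPE preserves vector norms and leaves position 0 unchanged, and the Q·K dot products end up depending only on relative offsets.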
**Why SwiGLU instead of GELU/ReLU?**

- Better gradient flow during training
- Used in LLaMA, PaLM, and Mistral
- Slightly more parameters, but measurably better performance
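A minimal SwiGLU feed-forward sketch in PyTorch. The hidden width simply applies the 4x ratio from the architecture table, and the `w1`/`w2`/`w3` naming follows the LLaMA convention; both are assumptions about this project's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    # FFN(x) = W2(silu(W1 x) * W3 x): a gated MLP where the silu branch
    # modulates the value branch elementwise.
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

The extra `w3` matrix is where the "slightly more parameters" comes from relative to a plain two-matrix MLP.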
**Why Pre-norm instead of Post-norm?**

- Easier to train (more stable gradients)
- Standard in modern LLMs (GPT-2 onward, LLaMA, etc.)
- The original Transformer (Vaswani et al., 2017) used post-norm, but the field has since moved to pre-norm
**Why Weight Tying?**

- Reduces parameter count
- The embedding and output head share the same weight matrix
- Standard practice since GPT-2
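Weight tying is a one-line sketch in PyTorch: point the output head's weight at the embedding matrix so one tensor serves both roles.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Share one (vocab_size, d_model) matrix between input embedding and
# output head, saving vocab_size * d_model parameters.
lm_head.weight = embedding.weight
```

With a ~32,000-token vocabulary and d_model of 512, that is roughly 16M parameters saved, a large fraction of a 28M model.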
## Citation
If you use this model or code for educational purposes:
## Hardware

| Component | Specification |
|---|---|
| Training Device | NVIDIA GPU (also runs on CPU and Apple MPS, more slowly) |
| VRAM Required | ~4 GB minimum |
| Training Time | 1-3 hours (GPU), 12-24 hours (CPU) |
## Contact
For questions about the architecture or training process, open an issue in the repository.
Built from scratch. No shortcuts. No pretrained weights. No API wrappers.
---
## What Each Metadata Field Maps To
| HF Metadata Field | Value in YAML | Purpose |
|---|---|---|
| `license` | `mit` | Shows license badge |
| `datasets` | `roneneldan/TinyStories` | Links to dataset page |
| `language` | `en` | Shows language tag |
| `metrics` | `perplexity` | Shows evaluation metric |
| `pipeline_tag` | `text-generation` | Enables inference widget category |
| `library_name` | `pytorch` | Shows framework badge |
| `tags` | list of tags | Searchable tags on HF Hub |
| `model-index` | eval results block | Populates eval results table |
## Important
**Before committing, update these placeholder values:**
| Placeholder | Replace With |
|---|---|
| `Your Name` in citation | Your actual name |
| `value: 0.0` in metrics | Your actual perplexity score from training |
| Contact section | Your actual contact method |