# aarav-gpt-zg2-base
A 336M-parameter decoder-only language model trained from scratch using a modern Llama-style architecture.
## Model Details
| Property | Value |
|---|---|
| Architecture | Llama-style (Pre-RMSNorm + RoPE + SwiGLU + GQA) |
| Parameters | 336.1M |
| Layers | 24 |
| Hidden dim | 1024 |
| Attention heads | 16 (query) / 4 (KV) |
| Context length | 1024 tokens |
| Vocab size | 32,000 |
| Tokenizer | SentencePiece (32K vocab) |
| Training tokens | 11.80B |
| Training steps | 96,000 |
| Validation loss | 2.8572 |
| Validation perplexity | 17.4 |
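
As a sanity check, the reported perplexity is just the exponential of the validation loss:

```python
import math

# Perplexity is exp(cross-entropy loss): exp(2.8572) ≈ 17.41
print(round(math.exp(2.8572), 1))  # 17.4
```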
## Architecture
This model uses the architecture that became the consensus for open LLMs over 2023-2025 (two of the components are sketched in code after this list):
- RMSNorm (pre-normalization) for training stability
- Rotary Position Embeddings (RoPE) instead of learned position embeddings
- SwiGLU activation in feed-forward layers (~8/3 expansion ratio)
- Grouped Query Attention (GQA) with 4:1 query-to-KV head ratio
- QK-normalization for attention stability
- No bias terms throughout the model
- Z-loss regularization during training
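
As a minimal sketch of two of these components, assuming PyTorch and an `intermediate_size` of 2816 (a guess at how the ~8/3 × 1024 expansion is rounded; this is illustrative code, not the repository's actual modules):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block with no bias terms (hypothetical sketch)."""

    def __init__(self, n_embd: int = 1024, intermediate_size: int = 2816):
        super().__init__()
        self.w_gate = nn.Linear(n_embd, intermediate_size, bias=False)
        self.w_up = nn.Linear(n_embd, intermediate_size, bias=False)
        self.w_down = nn.Linear(intermediate_size, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate with SiLU, multiply by the up projection, project back down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_loss_weight: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus z-loss, which keeps logit magnitudes from drifting."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    z = torch.logsumexp(logits.float(), dim=-1)  # log-partition per position
    return ce + z_loss_weight * (z ** 2).mean()
```

The `z_loss_weight` default matches the `0.0001` in the training configuration below.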
## Training Data
Trained on a diverse mix of two sources (a sampling sketch follows the list):
- 70% C4 (Common Crawl, cleaned)
- 30% Wikipedia (English, November 2023)
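
In per-document terms this means each training example is drawn from C4 with probability 0.7 and from Wikipedia otherwise. A hypothetical sampler (the `sample_source` helper is illustrative, not the actual training code):

```python
import random

def sample_source(rng: random.Random, c4_weight: float = 0.7) -> str:
    """Pick which corpus the next training document comes from."""
    return "c4" if rng.random() < c4_weight else "wikipedia"

rng = random.Random(0)
counts = {"c4": 0, "wikipedia": 0}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly {'c4': 7000, 'wikipedia': 3000}
```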
## Usage

### With PyTorch (custom code)
```python
import torch
import sentencepiece as spm
from modern_llm_model import ModernGPT, ModelConfig

# Load the checkpoint and rebuild the model from its saved config
checkpoint = torch.load("pytorch_model.pt", map_location="cuda")
config = ModelConfig(**checkpoint["config"])
model = ModernGPT(config).cuda()
model.load_state_dict(checkpoint["model"])
model.eval()

# Load the SentencePiece tokenizer shipped with the model
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode a prompt and generate a completion
tokens = sp.encode("The future of AI is")
x = torch.tensor([tokens], dtype=torch.long, device="cuda")
with torch.no_grad():
    output = model.generate(x, max_new_tokens=100, temperature=0.7, top_k=40)
print(sp.decode(output[0].tolist()))
```
## Training Configuration

(An `intermediate_size` of 0 appears to mean the SwiGLU width is derived automatically from the ~8/3 expansion ratio noted above.)
```json
{
  "vocab_size": 32000,
  "block_size": 1024,
  "n_layer": 24,
  "n_head": 16,
  "n_kv_head": 4,
  "n_embd": 1024,
  "intermediate_size": 0,
  "dropout": 0.0,
  "bias": false,
  "rope_theta": 10000.0,
  "qk_norm": true,
  "tie_word_embeddings": false,
  "max_steps": 100000,
  "batch_size": 8,
  "grad_accum_steps": 15,
  "lr": 0.0003,
  "min_lr": 3e-05,
  "warmup_steps": 2000,
  "weight_decay": 0.1,
  "beta1": 0.9,
  "beta2": 0.95,
  "grad_clip": 1.0,
  "z_loss_weight": 0.0001,
  "eval_interval": 1000,
  "eval_iters": 50,
  "checkpoint_interval": 5000,
  "checkpoint_dir": "modern_checkpoints",
  "patience": 20,
  "min_delta": 0.001,
  "use_amp": true,
  "use_gradient_checkpointing": false,
  "use_flash_attention": true,
  "compile_model": true,
  "data_dir": "diverse_data",
  "c4_weight": 0.7,
  "wiki_weight": 0.3,
  "num_data_workers": 2,
  "wandb_project": "llm-training",
  "wandb_run_name": "aarav-gpt-zg2-base",
  "wandb_enabled": true,
  "log_interval": 50,
  "csv_log_file": "training_metrics.csv"
}
```
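
The headline numbers in the model details table follow from this config: each optimizer step processes `batch_size * grad_accum_steps * block_size` tokens, and the run stopped at 96,000 of the configured 100,000 max steps (presumably via the patience-based early stopping):

```python
batch_size, grad_accum_steps, block_size = 8, 15, 1024
tokens_per_step = batch_size * grad_accum_steps * block_size
print(tokens_per_step)           # 122880 tokens per optimizer step
print(tokens_per_step * 96_000)  # 11796480000 ≈ 11.80B training tokens
```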
## Limitations
- This is a base model (not instruction-tuned); it does text completion, not conversation
- Trained on English data only
- 336M parameters; smaller than production LLMs and intended for research and education
- May produce factually incorrect, biased, or nonsensical text
## License
Apache 2.0