# aarav-gpt-zg2-base
A 336M-parameter decoder-only language model trained from scratch using a modern Llama-style architecture.
## Model Details
| Property | Value |
|---|---|
| Architecture | Llama-style (Pre-RMSNorm + RoPE + SwiGLU + GQA) |
| Parameters | 336.1M |
| Layers | 24 |
| Hidden dim | 1024 |
| Attention heads | 16 (query) / 4 (KV) |
| Context length | 1024 tokens |
| Vocab size | 32,000 |
| Tokenizer | SentencePiece (32K vocab) |
| Training tokens | 11.80B |
| Training steps | 96,000 |
| Validation loss | 2.8572 |
| Validation perplexity | 17.4 |
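
As a sanity check, the reported perplexity is just the exponential of the validation loss:

```python
import math

# Perplexity is exp(cross-entropy loss): exp(2.8572) ≈ 17.41
print(round(math.exp(2.8572), 1))  # 17.4
```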
## Architecture
This model uses the architecture that became the consensus for open LLMs over 2023-2025 (two of the components are sketched in code after this list):
- RMSNorm (pre-normalization) for training stability
- Rotary Position Embeddings (RoPE) instead of learned position embeddings
- SwiGLU activation in feed-forward layers (~8/3 expansion ratio)
- Grouped Query Attention (GQA) with 4:1 query-to-KV head ratio
- QK-normalization for attention stability
- No bias terms throughout the model
- Z-loss regularization during training
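
As a minimal sketch of two of these components, assuming PyTorch and an `intermediate_size` of 2816 (a guess at how the ~8/3 × 1024 expansion is rounded; this is illustrative code, not the repository's actual modules):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block with no bias terms (hypothetical sketch)."""

    def __init__(self, n_embd: int = 1024, intermediate_size: int = 2816):
        super().__init__()
        self.w_gate = nn.Linear(n_embd, intermediate_size, bias=False)
        self.w_up = nn.Linear(n_embd, intermediate_size, bias=False)
        self.w_down = nn.Linear(intermediate_size, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate with SiLU, multiply by the up projection, project back down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_loss_weight: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus z-loss, which keeps logit magnitudes from drifting."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    z = torch.logsumexp(logits.float(), dim=-1)  # log-partition per position
    return ce + z_loss_weight * (z ** 2).mean()
```

The `z_loss_weight` default matches the `0.0001` in the training configuration below.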
## Training Data
Trained on a diverse mix of two sources (a sampling sketch follows the list):
- 70% C4 (Common Crawl, cleaned)
- 30% Wikipedia (English, November 2023)
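
In per-document terms this means each training example is drawn from C4 with probability 0.7 and from Wikipedia otherwise. A hypothetical sampler (the `sample_source` helper is illustrative, not the actual training code):

```python
import random

def sample_source(rng: random.Random, c4_weight: float = 0.7) -> str:
    """Pick which corpus the next training document comes from."""
    return "c4" if rng.random() < c4_weight else "wikipedia"

rng = random.Random(0)
counts = {"c4": 0, "wikipedia": 0}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly {'c4': 7000, 'wikipedia': 3000}
```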
## Usage

### With PyTorch (custom code)
```python
import torch
import sentencepiece as spm
from modern_llm_model import ModernGPT, ModelConfig

# Load the checkpoint and rebuild the model from its saved config
checkpoint = torch.load("pytorch_model.pt", map_location="cuda")
config = ModelConfig(**checkpoint["config"])
model = ModernGPT(config).cuda()
model.load_state_dict(checkpoint["model"])
model.eval()

# Load the SentencePiece tokenizer shipped with the model
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode a prompt and generate a completion
tokens = sp.encode("The future of AI is")
x = torch.tensor([tokens], dtype=torch.long, device="cuda")
with torch.no_grad():
    output = model.generate(x, max_new_tokens=100, temperature=0.7, top_k=40)
print(sp.decode(output[0].tolist()))
```
## Training Configuration

(An `intermediate_size` of 0 appears to mean the SwiGLU width is derived automatically from the ~8/3 expansion ratio noted above.)
```json
{
  "vocab_size": 32000,
  "block_size": 1024,
  "n_layer": 24,
  "n_head": 16,
  "n_kv_head": 4,
  "n_embd": 1024,
  "intermediate_size": 0,
  "dropout": 0.0,
  "bias": false,
  "rope_theta": 10000.0,
  "qk_norm": true,
  "tie_word_embeddings": false,
  "max_steps": 100000,
  "batch_size": 8,
  "grad_accum_steps": 15,
  "lr": 0.0003,
  "min_lr": 3e-05,
  "warmup_steps": 2000,
  "weight_decay": 0.1,
  "beta1": 0.9,
  "beta2": 0.95,
  "grad_clip": 1.0,
  "z_loss_weight": 0.0001,
  "eval_interval": 1000,
  "eval_iters": 50,
  "checkpoint_interval": 5000,
  "checkpoint_dir": "modern_checkpoints",
  "patience": 20,
  "min_delta": 0.001,
  "use_amp": true,
  "use_gradient_checkpointing": false,
  "use_flash_attention": true,
  "compile_model": true,
  "data_dir": "diverse_data",
  "c4_weight": 0.7,
  "wiki_weight": 0.3,
  "num_data_workers": 2,
  "wandb_project": "llm-training",
  "wandb_run_name": "aarav-gpt-zg2-base",
  "wandb_enabled": true,
  "log_interval": 50,
  "csv_log_file": "training_metrics.csv"
}
```
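
The headline numbers in the model details table follow from this config: each optimizer step processes `batch_size * grad_accum_steps * block_size` tokens, and the run stopped at 96,000 of the configured 100,000 max steps (presumably via the patience-based early stopping):

```python
batch_size, grad_accum_steps, block_size = 8, 15, 1024
tokens_per_step = batch_size * grad_accum_steps * block_size
print(tokens_per_step)           # 122880 tokens per optimizer step
print(tokens_per_step * 96_000)  # 11796480000 ≈ 11.80B training tokens
```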
## Limitations
- This is a base model (not instruction-tuned); it does text completion, not conversation
- Trained on English data only
- 336M parameters; smaller than production LLMs and intended for research and education
- May produce factually incorrect, biased, or nonsensical text
## License
Apache 2.0