🏌️ Parameter Golf Submission – tbukuai

Minimize Bits Per Byte (BPB) on the FineWeb validation split under a ~15–16 MB artifact size limit.

| Metric | Value |
|---|---|
| SOTA to beat | 1.1056 BPB (cmpatino-8, 8×H100) |
| Baseline | 1.2244 BPB (9L/512d/SP1024) |
| Our config | SP4096, 11L, 512d, Muon + EMA, depth recurrence, int6 quantization |

Architecture

  • Tokenizer: SP4096 BPE (SentencePiece, vocab=4096)
  • Model: 11 layers, 512 dim, 8 heads, 4 KV heads (GQA), MLP×4
  • Tied embeddings: shared input/output embedding (saves artifact space)
  • Parallel residuals: layers 7–10 (GPT-J/Falcon style)
  • Depth recurrence: layers 3–5 looped ×2 extra (activates at 35% of training)
  • U-Net skip connections: encoder/decoder halves with learned skip weights
  • Activation: LeakyReLU² (negative_slope=0.5)
  • Logit softcapping: tanh(logits/30) × 30 (Gemma-style)
  • Positional encoding: Partial RoPE (25% of head dim)
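
The two less common pieces above, the squared LeakyReLU activation and Gemma-style logit softcapping, are easy to state directly. The PyTorch sketch below is illustrative only; in particular, plain squaring of the LeakyReLU output is an assumption, and the real code lives in train_gpt.py:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    # "LeakyReLU²": plain squaring of the LeakyReLU output is assumed here; check
    # train_gpt.py for the exact convention (e.g. whether the sign is preserved).
    return F.leaky_relu(x, negative_slope=negative_slope).square()

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Gemma-style softcapping: tanh(logits / cap) * cap keeps every logit in (-cap, cap)
    # while remaining smooth and differentiable.
    return torch.tanh(logits / cap) * cap
```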

Optimizer

  • Matrix weights (Q/K/V/O/MLP): Muon optimizer (lr=0.022, momentum=0.95, wd=0.095, 5 Newton-Schulz steps)
  • Embeddings: Adam (lr=0.05 tied, fused)
  • Scalars/norms: Adam (lr=0.04, fused)
  • Schedule: 20-step warmup → trapezoidal → 5000-step warmdown
  • EMA: decay=0.9965, starts at step 100
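
For reference, a minimal sketch of the trapezoidal schedule and the EMA update implied by the settings above, assuming linear ramps on both ends (the actual implementation is in train_gpt.py):

```python
import torch

def trapezoidal_lr(step: int, total_steps: int, warmup: int = 20, warmdown: int = 5000) -> float:
    # Multiplier on the base learning rate: linear warmup, flat plateau, linear warmdown.
    if step < warmup:
        return (step + 1) / warmup
    if step >= total_steps - warmdown:
        return max(0.0, (total_steps - step) / warmdown)
    return 1.0

@torch.no_grad()
def ema_update(ema_params, model_params, decay: float = 0.9965):
    # Weight EMA with decay 0.9965, started at step 100 in this config.
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```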

Quantization & Compression

  • Weights: int6 GPTQ-style with per-row SDClip (k=12.85 × std)
  • Embeddings: int8 (more sensitive, k=20.0)
  • Small tensors (<65K params): kept as float16
  • Control tensors (scales, gains, skip weights): kept as float32
  • Compression: Brotli quality=11 (fallback: zlib-9)
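
A minimal sketch of the per-row quantization described above, assuming "SDClip" means clipping each row to k × its standard deviation before symmetric uniform quantization; the GPTQ-style error-compensation pass and the Brotli packing step are omitted:

```python
import numpy as np

def quantize_rows(w: np.ndarray, bits: int = 6, k: float = 12.85):
    # Per-row symmetric quantization with std-based clipping.
    qmax = 2 ** (bits - 1) - 1                       # 31 levels each side for int6
    clip = k * w.std(axis=1, keepdims=True) + 1e-12  # per-row clipping threshold
    scale = clip / qmax                              # per-row scale, kept in float32
    q = np.round(np.clip(w, -clip, clip) / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_rows(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```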

Data

  • Source: InAbsentia/parameter-golf-sp4096 on HF Hub
  • Train: 106 shards × 190MB each (10B tokens)
  • Val: 1 shard (86.7MB)
  • Tokenizer: tokenizers/fineweb_4096_bpe.model (295.6KB)
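
One way to fetch the shards manually (the launchers below handle this automatically; repo_type and the file patterns here are assumptions):

```python
from huggingface_hub import snapshot_download

# Fetch the pre-tokenized shards listed above.
local_dir = snapshot_download(
    repo_id="InAbsentia/parameter-golf-sp4096",
    repo_type="dataset",
    allow_patterns=["*.bin", "tokenizers/*"],  # illustrative patterns
)
print("data downloaded to", local_dir)
```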

Files

| File | Description |
|---|---|
| train_gpt.py | Core training script – model, Muon optimizer, EMA, quantization, BPB eval, checkpoint save/resume |
| launch_training_competitive.py | Competitive launcher – downloads data, sets optimized hyperparams, runs torchrun |
| launch_training.py | Original launcher (requires private openai/parameter-golf repo) |
| parameter_golf_colab.ipynb | Colab notebook – train on a free 1×T4 GPU |
| parameter_golf_kaggle.ipynb | Kaggle notebook – train on free 2×T4 GPUs (recommended!) |
| sota_train_gpt.py | LZMA-compressed submission artifact |
| claude.md | Detailed technical notes |
| roadmap.md | Experiment plan & hyperparameter tuning roadmap |

How to Run

🥇 Free: Kaggle (2×T4, 30h/week) – Recommended

The best free option. 2 GPUs = ~2× faster, 30h/week quota, stable sessions.

Quick start:

  1. Go to kaggle.com/code → New Notebook
  2. Upload parameter_golf_kaggle.ipynb
  3. Settings (right panel):
    • Accelerator → GPU T4 ×2
    • Internet → ON
  4. Add HF token: Add-ons → Secrets → name: HF_TOKEN, value: your token from hf.co/settings/tokens
  5. Run all cells
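
Inside the notebook, the token from step 4 is typically read via the standard Kaggle Secrets API; the actual cell may differ slightly, but the pattern looks like this:

```python
import os
from kaggle_secrets import UserSecretsClient

# Read the HF token stored under Add-ons -> Secrets and expose it to huggingface_hub.
os.environ["HF_TOKEN"] = UserSecretsClient().get_secret("HF_TOKEN")
```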

What the notebook does:

  • Downloads data (selectable: 10-106 train shards)
  • Patches Flash Attention → mem-efficient SDP (T4 compatible; see the sketch after this list)
  • Runs torchrun --nproc_per_node=2 for 2-GPU distributed training
  • grad_accum_steps auto-adjusts: 8 → 4 (effective batch stays the same)
  • Checkpoints to HF Hub every 1000 steps
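
The Flash Attention patch mentioned above amounts to disabling the flash SDP backend so PyTorch falls back to the memory-efficient kernel on T4s. The notebook's actual patch may differ; this is one standard way to do it:

```python
import torch

# T4s (sm_75) have no FlashAttention-2 kernels, so turn the flash backend off and let
# scaled_dot_product_attention use the memory-efficient implementation instead.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)  # last-resort fallback
```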

Kaggle presets:

| Preset | Wallclock | Iters | Warmdown | Shards | Est. steps |
|---|---|---|---|---|---|
| Quick test | 600s (10 min) | 2000 | 350 | 10 | ~200 |
| 1-hour run | 3600s | 8000 | 1500 | 20 | ~3000 |
| Full 12h session | 39600s (11h) | 30000 | 5000 | 106 | ~20000 |

Maximizing 30h/week:

  • Run 3 sessions × 10h across the week
  • Set RESUME_FROM_HUB=1 in sessions 2 & 3 to continue from last checkpoint
  • Total: ~30K steps on the full dataset, enough for a competitive BPB

🆓 Free: Google Colab (1×T4, 16GB)

  1. Open parameter_golf_colab.ipynb in Colab
  2. Runtime → Change runtime type → T4 GPU
  3. Run all cells

Colab presets:

| Preset | Wallclock | Iters | Warmdown | Purpose |
|---|---|---|---|---|
| Quick test | 600s (10 min) | 2000 | 350 | Verify everything works |
| 1-hour run | 3600s | 5000 | 1000 | Get a baseline BPB |
| Max session | 36000s (10h) | 20000 | 3500 | Best from free Colab |

Colab vs Kaggle

| | Colab Free | Kaggle |
|---|---|---|
| GPUs | 1× T4 | 2× T4 |
| VRAM | 16GB | 16GB × 2 |
| Quota | Throttled, no guarantee | 30h/week guaranteed |
| Session | ~12h (can disconnect) | 12h (stable) |
| Disk | ~100GB shared | 73GB |
| Speed | 1× | ~2× |
| Distributed | ❌ single GPU | ✅ torchrun DDP |
| Best for | Quick tests | Serious training |

💻 Multi-GPU (8×H100, 6 hours)

python launch_training_competitive.py

⏩ Resume from checkpoint

RESUME_FROM_HUB=1 python launch_training_competitive.py

🔧 Quick test (1×GPU, 10 min)

MAX_WALLCLOCK_SECONDS=600 ITERATIONS=2000 WARMDOWN_ITERS=350 \
  python launch_training_competitive.py

All hyperparameters are overridable via environment variables (see Hyperparameters class in train_gpt.py).
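
As a sketch of that pattern (the field names below are illustrative, not the actual Hyperparameters fields; only the env-var naming mirrors the commands above):

```python
import os
from dataclasses import dataclass, fields

@dataclass
class Hyperparameters:
    # Illustrative subset only; the real class in train_gpt.py has many more fields.
    iterations: int = 30000
    warmdown_iters: int = 5000
    max_wallclock_seconds: int = 21600

def apply_env_overrides(hp: Hyperparameters) -> Hyperparameters:
    # Upper-cased field name wins over the default, matching how ITERATIONS,
    # WARMDOWN_ITERS and MAX_WALLCLOCK_SECONDS are used in the commands above.
    for f in fields(hp):
        raw = os.environ.get(f.name.upper())
        if raw is not None:
            setattr(hp, f.name, type(getattr(hp, f.name))(raw))
    return hp
```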


Free GPU Resources

| Platform | GPU | VRAM | Free Quota | Session Limit | Best For |
|---|---|---|---|---|---|
| Kaggle | 2× T4 | 16GB each | 30h/week | 12h | 🥇 Best free option |
| Google Colab | T4 | 16GB | Unlimited (throttled) | ~12h | Quick tests |
| Lightning.ai | T4 / A10G | 16-24GB | 22 GPU-hrs/month | Varies | Persistent storage |
| Paperspace Gradient | M4000 | 8GB | Free tier | 6h | Light experiments |
| Oracle Cloud Free | A10G | 24GB | 1 always-free VM | Persistent | Best if you get capacity |
| Google Cloud Trial | T4/V100/A100 | 16-80GB | $300 credit (90 days) | You manage | Competitive runs |

BPB Metric

BPB = cross_entropy_nats / (ln(2) × avg_bytes_per_token)

Where avg_bytes_per_token is the actual UTF-8 byte count of decoded tokens (accounting for SentencePiece leading-space tokens). A larger vocabulary yields more bytes per token, so the same per-token loss is spread over more bytes, which lowers BPB.
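
In code, with made-up numbers purely to show the arithmetic (not results from this repo):

```python
import math

def bits_per_byte(cross_entropy_nats: float, avg_bytes_per_token: float) -> float:
    # Convert nats/token to bits/token, then normalize by the UTF-8 bytes each token covers.
    return cross_entropy_nats / (math.log(2) * avg_bytes_per_token)

print(bits_per_byte(3.20, 4.2))  # ~1.10 BPB
```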


Strategy to Beat SOTA

High-confidence improvements

  1. Train longer: 6+ hours on 8Γ—H100
  2. More iterations: 30K steps with 5K warmdown
  3. Use all 106 train shards: More data = better generalization

Medium-risk experiments

  1. Increase recurrence: layers 3–5 × 3 instead of × 2
  2. Larger MLP: MLP×6 with smaller model_dim
  3. Better warmdown schedule: cosine instead of linear (see the sketch after this list)
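
A sketch of what swapping the linear warmdown for a cosine one could look like (illustrative only, not code from train_gpt.py):

```python
import math

def warmdown_multiplier(step: int, total_steps: int, warmdown: int, cosine: bool = True) -> float:
    # LR multiplier over the final `warmdown` steps: linear is what the current config
    # uses, cosine is the proposed alternative.
    if step <= total_steps - warmdown:
        return 1.0
    frac = (total_steps - step) / warmdown          # decays 1 -> 0 across the warmdown
    return frac if not cosine else 0.5 * (1.0 - math.cos(math.pi * frac))
```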

High-risk / high-reward

  1. SP8192 tokenizer: Better bytes/token ratio, but larger embedding table
  2. Mixed quantization: int4 for MLP, int6 for attention, int8 for embeddings
  3. Distillation: Train larger model, distill into size-limited one