Goedel-Baseline-1B

Standard transformer baseline for comparison with Goedel-mHC-1B.

This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B but replaces modified HyperConnections with standard pre-norm residual connections. See Goedel-mHC-1B for the primary model and full discussion.

Architecture

| Component | Details |
|---|---|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667x expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |
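
As a sanity check, the parameter count in the table can be approximated from the listed dimensions. The sketch below is back-of-envelope arithmetic; tied embeddings, ignored norm parameters, and `int(dim * 2.667)` for the FFN width are my assumptions, not documented choices:

```python
# Back-of-envelope parameter count from the architecture table.
# Assumptions: tied embeddings, norm params ignored, FFN width = int(dim * 2.667).
dim, n_layers, vocab = 2048, 24, 50304
n_heads, n_kv_heads, head_dim = 16, 4, 128

embed = vocab * dim                                  # token embedding (tied LM head)
attn = (dim * (n_heads * head_dim)                   # Q projection
        + 2 * dim * (n_kv_heads * head_dim)          # K and V projections
        + (n_heads * head_dim) * dim)                # O projection
ffn_hidden = int(dim * 2.667)
ffn = 3 * dim * ffn_hidden                           # SwiGLU: gate, up, down
total = embed + n_layers * (attn + ffn)

print(f"{total / 1e6:.0f}M")  # lands within a few percent of the listed 1,185M
```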

Training

  • Data: 20B tokens of FineWeb-Edu, GPT-2 tokenizer
  • Hardware: 8x H200 SXM on Vast.ai
  • Sequence length: 4096
  • Effective batch: 384 sequences (8 GPUs x 8 micro-batch x 6 grad accum = 1.57M tokens/step)
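
The effective-batch arithmetic in the bullets above can be checked directly; the implied optimizer-step count is my own derivation, not a figure from the card:

```python
# Verify tokens per step and derive the implied number of optimizer steps.
gpus, micro_batch, grad_accum, seq_len = 8, 8, 6, 4096
sequences_per_step = gpus * micro_batch * grad_accum   # 384 sequences
tokens_per_step = sequences_per_step * seq_len         # 1,572,864 ~= 1.57M tokens

total_tokens = 20_000_000_000
steps = total_tokens // tokens_per_step                # ~12,715 optimizer steps
```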

Results

| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|---|---|---|
| BPB (wikitext-2) | 1.130 | 1.087 |
| val_loss (FineWeb-Edu) | 2.686 | 2.645 |
| HellaSwag | 36.2% | 39.7% |
| ARC-Easy | 52.8% | 57.8% |
| ARC-Challenge | 23.9% | 24.3% |
| WinoGrande | 53.1% | 54.9% |

The mHC variant outperforms the baseline on every benchmark despite having roughly 15% fewer parameters.
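
For intuition, the FineWeb-Edu validation losses translate into perplexities. This assumes val_loss is mean per-token cross-entropy in nats, which the card does not state explicitly:

```python
import math

# Perplexity = exp(cross-entropy); assumes val_loss is nats per token.
baseline_ppl = math.exp(2.686)   # ~14.7
mhc_ppl = math.exp(2.645)        # ~14.1
```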

Full Config

```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304

attention:
  type: gqa
  num_heads: 16
  num_kv_heads: 4
  head_dim: 128
  qk_norm: true
  rope_theta: 10000

ffn:
  type: swiglu
  intermediate_mult: 2.667

residual:
  type: prenorm

optim:
  type: adamw
  lr: 3.0e-4
  scheduler: cosine
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0

training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 6
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs

data:
  shard_dir: data/fineweb_edu
```

Limitations

  • Undertrained: 20B tokens is below the Chinchilla-optimal heuristic of roughly 20 tokens per parameter (~24B tokens for 1.2B parameters).
  • English only: Trained exclusively on English web text.
  • No instruction tuning: Base model only; not suitable for chat or instruction-following without fine-tuning.
  • Custom codebase required: Weights are saved as raw PyTorch state dicts and require this project's model code to load.
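
Because the weights ship as raw state dicts, loading them follows the standard `torch.load` / `load_state_dict` pattern. The sketch below substitutes a tiny stand-in module, since the actual model class and checkpoint filename live in this project's code and are assumptions here:

```python
import torch
import torch.nn as nn

# Stand-in for the project's model class; the real class must be imported
# from this repo and constructed from the config above.
model = nn.Linear(8, 8)

# Round-trip a raw PyTorch state dict, the format the released weights use.
torch.save(model.state_dict(), "goedel_demo_ckpt.pt")
state = torch.load("goedel_demo_ckpt.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
```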

Purpose

This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see Goedel-mHC-1B.

License

Apache 2.0
