# Goedel-Baseline-1B
Standard transformer baseline for comparison with Goedel-mHC-1B.
This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B but replaces modified HyperConnections with standard pre-norm residual connections. See Goedel-mHC-1B for the primary model and full discussion.
## Architecture
| Component | Details |
|---|---|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667x expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |
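The table's parameter count can be reproduced from the listed dimensions. This is a back-of-the-envelope sketch, not the project's code: tied input/output embeddings and a SwiGLU hidden width rounded up to a multiple of 256 are assumptions, since the card does not state either.

```python
import math

dim, n_layers, vocab = 2048, 24, 50304
n_heads, n_kv_heads, head_dim = 16, 4, 128

embed = vocab * dim                             # token embeddings (assumed tied with output)
attn = dim * n_heads * head_dim                 # Q projection
attn += 2 * dim * n_kv_heads * head_dim         # K and V projections (GQA)
attn += n_heads * head_dim * dim                # output projection
ffn_inner = 256 * math.ceil(dim * 2.667 / 256)  # 5632 under the rounding assumption
ffn = 3 * dim * ffn_inner                       # gate, up, and down matrices (SwiGLU)
total = embed + n_layers * (attn + ffn)

print(f"{total / 1e6:.0f}M parameters")  # -> 1185M, matching the table
```

Under these assumptions the count lands on 1,185M exactly; with untied embeddings or a different rounding rule the total shifts by a few percent.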
## Training
- Data: 20B tokens of FineWeb-Edu, GPT-2 tokenizer
- Hardware: 8x H200 SXM on Vast.ai
- Sequence length: 4096
- Effective batch: 384 sequences (8 GPUs x 8 micro-batch x 6 grad accum), i.e. 384 x 4096 ≈ 1.57M tokens per optimizer step
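The batch arithmetic above checks out as follows; the total optimizer-step count is derived here, not stated in the card.

```python
gpus, micro_batch, grad_accum, seq_len = 8, 8, 6, 4096

seqs_per_step = gpus * micro_batch * grad_accum   # 384 sequences per optimizer step
tokens_per_step = seqs_per_step * seq_len         # 1,572,864 ~= 1.57M tokens
total_steps = 20_000_000_000 // tokens_per_step   # steps to reach the 20B-token budget

print(seqs_per_step, tokens_per_step, total_steps)  # -> 384 1572864 12715
```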
## Results
| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|---|---|---|
| BPB, WikiText-2 (lower is better) | 1.130 | 1.087 |
| Validation loss, FineWeb-Edu (lower is better) | 2.686 | 2.645 |
| HellaSwag | 36.2% | 39.7% |
| ARC-Easy | 52.8% | 57.8% |
| ARC-Challenge | 23.9% | 24.3% |
| WinoGrande | 53.1% | 54.9% |
The mHC variant outperforms the baseline on every benchmark while using roughly 15% fewer parameters.
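The parameter-reduction figure follows directly from the two header counts:

```python
baseline_params, mhc_params = 1_185e6, 1_009e6

reduction = (baseline_params - mhc_params) / baseline_params
print(f"{reduction:.1%}")  # -> 14.9%, i.e. roughly 15% fewer parameters
```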
## Full Config
```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304
  attention:
    type: gqa
    num_heads: 16
    num_kv_heads: 4
    head_dim: 128
    qk_norm: true
    rope_theta: 10000
  ffn:
    type: swiglu
    intermediate_mult: 2.667
  residual:
    type: prenorm
optim:
  type: adamw
  lr: 3.0e-4
  scheduler: cosine
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0
training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 6
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs
data:
  shard_dir: data/fineweb_edu
```
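The `optim` block implies a linear-warmup-then-cosine learning-rate curve. A minimal sketch follows; the decay floor (`min_lr=0.0`) and the total step count (~12.7k, from 20B tokens at ~1.57M tokens/step) are assumptions not stated in the config.

```python
import math

def lr_at(step, peak_lr=3.0e-4, warmup_steps=500, total_steps=12_715, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Warmup ramps up over the first 500 steps, then the rate decays toward zero.
print(lr_at(0), lr_at(499), lr_at(12_714))
```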
## Limitations
- Undertrained: 20B tokens is below the Chinchilla-optimal budget of roughly 20 tokens per parameter (~24B tokens for a 1.2B-parameter model).
- English only: Trained exclusively on English web text.
- No instruction tuning: Base model only; not suitable for chat or instruction-following without fine-tuning.
- Custom codebase required: Weights are saved as raw PyTorch state dicts and require this project's model code to load.
## Purpose
This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see Goedel-mHC-1B.
## License
Apache 2.0