# Goedel-Baseline-1B
Standard transformer baseline for comparison with Goedel-mHC-1B.
This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B but replaces modified HyperConnections with standard pre-norm residual connections. See Goedel-mHC-1B for the primary model and full discussion.
## Architecture
| Component | Details |
|---|---|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667x expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |
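The table's parameter count can be reproduced from the listed dimensions. This is a back-of-the-envelope sketch, not the project's code: tied input/output embeddings and a SwiGLU hidden width rounded up to a multiple of 256 are assumptions, since the card does not state either.

```python
import math

dim, n_layers, vocab = 2048, 24, 50304
n_heads, n_kv_heads, head_dim = 16, 4, 128

embed = vocab * dim                             # token embeddings (assumed tied with output)
attn = dim * n_heads * head_dim                 # Q projection
attn += 2 * dim * n_kv_heads * head_dim         # K and V projections (GQA)
attn += n_heads * head_dim * dim                # output projection
ffn_inner = 256 * math.ceil(dim * 2.667 / 256)  # 5632 under the rounding assumption
ffn = 3 * dim * ffn_inner                       # gate, up, and down matrices (SwiGLU)
total = embed + n_layers * (attn + ffn)

print(f"{total / 1e6:.0f}M parameters")  # -> 1185M, matching the table
```

Under these assumptions the count lands on 1,185M exactly; with untied embeddings or a different rounding rule the total shifts by a few percent.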
## Training
- Data: 20B tokens of FineWeb-Edu, GPT-2 tokenizer
- Hardware: 8x H200 SXM on Vast.ai
- Sequence length: 4096
- Effective batch: 384 sequences (8 GPUs x 8 micro-batch x 6 grad accum), i.e. 384 x 4096 ≈ 1.57M tokens per optimizer step
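The batch arithmetic above checks out as follows; the total optimizer-step count is derived here, not stated in the card.

```python
gpus, micro_batch, grad_accum, seq_len = 8, 8, 6, 4096

seqs_per_step = gpus * micro_batch * grad_accum   # 384 sequences per optimizer step
tokens_per_step = seqs_per_step * seq_len         # 1,572,864 ~= 1.57M tokens
total_steps = 20_000_000_000 // tokens_per_step   # steps to reach the 20B-token budget

print(seqs_per_step, tokens_per_step, total_steps)  # -> 384 1572864 12715
```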
## Results
| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|---|---|---|
| BPB, WikiText-2 (lower is better) | 1.130 | 1.087 |
| Validation loss, FineWeb-Edu (lower is better) | 2.686 | 2.645 |
| HellaSwag | 36.2% | 39.7% |
| ARC-Easy | 52.8% | 57.8% |
| ARC-Challenge | 23.9% | 24.3% |
| WinoGrande | 53.1% | 54.9% |
The mHC variant outperforms the baseline on every benchmark while using roughly 15% fewer parameters.
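The parameter-reduction figure follows directly from the two header counts:

```python
baseline_params, mhc_params = 1_185e6, 1_009e6

reduction = (baseline_params - mhc_params) / baseline_params
print(f"{reduction:.1%}")  # -> 14.9%, i.e. roughly 15% fewer parameters
```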
## Full Config
```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304
  attention:
    type: gqa
    num_heads: 16
    num_kv_heads: 4
    head_dim: 128
    qk_norm: true
    rope_theta: 10000
  ffn:
    type: swiglu
    intermediate_mult: 2.667
  residual:
    type: prenorm
optim:
  type: adamw
  lr: 3.0e-4
  scheduler: cosine
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0
training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 6
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs
data:
  shard_dir: data/fineweb_edu
```
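The `optim` block implies a linear-warmup-then-cosine learning-rate curve. A minimal sketch follows; the decay floor (`min_lr=0.0`) and the total step count (~12.7k, from 20B tokens at ~1.57M tokens/step) are assumptions not stated in the config.

```python
import math

def lr_at(step, peak_lr=3.0e-4, warmup_steps=500, total_steps=12_715, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Warmup ramps up over the first 500 steps, then the rate decays toward zero.
print(lr_at(0), lr_at(499), lr_at(12_714))
```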
## Limitations
- Undertrained: 20B tokens is below the Chinchilla-optimal budget of roughly 20 tokens per parameter (~24B tokens for a 1.2B-parameter model).
- English only: Trained exclusively on English web text.
- No instruction tuning: Base model only; not suitable for chat or instruction-following without fine-tuning.
- Custom codebase required: Weights are saved as raw PyTorch state dicts and require this project's model code to load.
## Purpose
This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see Goedel-mHC-1B.
## License
Apache 2.0