---
license: apache-2.0
language:
- en
tags:
- baseline
- gqa
- swiglu
pipeline_tag: text-generation
---

# Goedel-Baseline-1B

Standard transformer baseline for comparison with [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B, but replaces modified HyperConnections with standard pre-norm residual connections. See [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B) for the primary model and full discussion.

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667x expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |

## Training

- **Data:** 20B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), GPT-2 tokenizer
- **Hardware:** 8x H200 SXM on Vast.ai
- **Sequence length:** 4096
- **Effective batch:** 384 sequences per optimizer step (8 GPUs x 8 micro-batch x 6 grad accum), i.e. ~1.57M tokens/step at 4096 tokens per sequence

## Results

| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|-----------|------------------------------|--------------------------|
| BPB (wikitext-2) | 1.130 | **1.087** |
| val_loss (FineWeb-Edu) | 2.686 | **2.645** |
| HellaSwag | 36.2% | **39.7%** |
| ARC-Easy | 52.8% | **57.8%** |
| ARC-Challenge | 23.9% | **24.3%** |
| WinoGrande | 53.1% | **54.9%** |

The mHC variant wins on all benchmarks with **15% fewer parameters**.
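The effective-batch arithmetic above can be sanity-checked with a few lines of pure Python (no dependencies; the ~12.7k-step figure in the comments is derived from the 20B-token budget, not stated in the card):

```python
# Effective batch size and tokens per optimizer step,
# from the training setup described above.
gpus = 8          # H200 SXM GPUs
micro_batch = 8   # sequences per GPU per forward pass
grad_accum = 6    # gradient accumulation steps
seq_len = 4096    # tokens per sequence

sequences_per_step = gpus * micro_batch * grad_accum
tokens_per_step = sequences_per_step * seq_len

print(sequences_per_step)  # 384 sequences
print(tokens_per_step)     # 1572864 tokens (~1.57M)

# Implied optimizer steps for the 20B-token budget:
print(round(20_000_000_000 / tokens_per_step))  # ~12716 steps
```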
## Full Config

```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304
  attention:
    type: gqa
    num_heads: 16
    num_kv_heads: 4
    head_dim: 128
    qk_norm: true
    rope_theta: 10000
  ffn:
    type: swiglu
    intermediate_mult: 2.667
  residual:
    type: prenorm
optim:
  type: adamw
  lr: 3.0e-4
  scheduler: cosine
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0
training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 6
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs
data:
  shard_dir: data/fineweb_edu
```

## Limitations

- **Undertrained:** 20B tokens is well below Chinchilla-optimal for 1.2B parameters.
- **English only:** Trained exclusively on English web text.
- **No instruction tuning:** Base model only; not suitable for chat or instruction-following without fine-tuning.
- **Custom codebase required:** Weights are saved as raw PyTorch state dicts and require this project's model code to load.

## Purpose

This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

## License

Apache 2.0
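As a sanity check on the parameter count, the config above admits a rough back-of-envelope estimate. This is a sketch under stated assumptions (no biases, tied input/output embeddings, norm parameters ignored, SwiGLU hidden size rounded from `dim * intermediate_mult`) — the exact rounding and tying choices are not confirmed by the card, so expect the result to land in the same ballpark as 1,185M rather than match it exactly:

```python
# Rough parameter-count estimate from the config above.
# Assumptions (not confirmed by the model card): no biases,
# tied embeddings, norm/scale parameters ignored.
dim = 2048
n_layers = 24
vocab = 50304
n_heads, n_kv_heads, head_dim = 16, 4, 128
ffn_hidden = round(dim * 2.667)  # 5462; exact rounding is a guess

embed = vocab * dim                          # token embedding (tied with LM head)
attn = dim * (n_heads * head_dim)            # Q projection
attn += 2 * dim * (n_kv_heads * head_dim)    # K and V projections (GQA: 4 KV heads)
attn += (n_heads * head_dim) * dim           # output projection
ffn = 3 * dim * ffn_hidden                   # SwiGLU: gate, up, down matrices
per_layer = attn + ffn

total = embed + n_layers * per_layer
print(f"{total / 1e6:.0f}M")  # 1160M — same order as the 1,185M reported
```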