---
license: apache-2.0
language:
- en
tags:
- baseline
- gqa
- swiglu
pipeline_tag: text-generation
---

# Goedel-Baseline-1B

Standard transformer baseline for comparison with [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B but replaces modified HyperConnections with standard pre-norm residual connections. See [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B) for the primary model and full discussion.

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667× expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |
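
As a sanity check, the parameter count can be roughly decomposed from the dimensions in the table. This is a back-of-envelope sketch, not the project's counting code: the FFN width rounding, the absence of biases, and whether embeddings are tied are all assumptions here, so the totals bracket rather than reproduce the reported 1,185M.

```python
# Rough parameter decomposition from the architecture table.
# Assumptions (not confirmed by this card): FFN width = int(dim * 2.667),
# no biases, norm parameters ignored; tied and untied embeddings both shown.
dim, n_layers, vocab = 2048, 24, 50304
n_heads, n_kv_heads, head_dim = 16, 4, 128
ffn_hidden = int(dim * 2.667)  # ~5461; real code may round to a multiple of 64

attn = dim * (n_heads * head_dim)           # Q projection
attn += 2 * dim * (n_kv_heads * head_dim)   # K and V (GQA: fewer KV heads)
attn += (n_heads * head_dim) * dim          # output projection

ffn = 3 * dim * ffn_hidden                  # SwiGLU: gate, up, and down matrices

per_layer = attn + ffn
embed = vocab * dim

tied = n_layers * per_layer + embed         # shared input/output embedding
untied = tied + embed                       # separate lm_head

print(f"per layer: {per_layer/1e6:.1f}M, tied: {tied/1e6:.0f}M, untied: {untied/1e6:.0f}M")
```

The reported 1,185M falls between the tied (~1,160M) and untied (~1,263M) estimates, which is consistent with the table once norm parameters and the exact FFN width are accounted for.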

## Training

- **Data:** 20B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), GPT-2 tokenizer
- **Hardware:** 8x H200 SXM on Vast.ai
- **Sequence length:** 4096
- **Effective batch:** 384 sequences per step (8 GPUs × 8 micro-batch × 6 grad accum), i.e. 384 × 4096 ≈ 1.57M tokens/step

## Results

| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|-----------|------------------------------|--------------------------|
| BPB (WikiText-2) | 1.130 | **1.087** |
| val_loss (FineWeb-Edu) | 2.686 | **2.645** |
| HellaSwag | 36.2% | **39.7%** |
| ARC-Easy | 52.8% | **57.8%** |
| ARC-Challenge | 23.9% | **24.3%** |
| WinoGrande | 53.1% | **54.9%** |

The mHC variant outperforms the baseline on every benchmark with **15% fewer parameters**.

## Full Config

```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304

attention:
  type: gqa
  num_heads: 16
  num_kv_heads: 4
  head_dim: 128
  qk_norm: true
  rope_theta: 10000

ffn:
  type: swiglu
  intermediate_mult: 2.667

residual:
  type: prenorm

optim:
  type: adamw
  lr: 3.0e-4
  scheduler: cosine
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0

training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 6
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs

data:
  shard_dir: data/fineweb_edu
```

## Limitations

- **Undertrained:** 20B tokens is below the Chinchilla-optimal budget (~20 tokens per parameter, roughly 24B tokens for this model).
- **English only:** Trained exclusively on English web text.
- **No instruction tuning:** Base model only; not suitable for chat or instruction following without fine-tuning.
- **Custom codebase required:** Weights are saved as raw PyTorch state dicts and require this project's model code to load.
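
Since the weights are raw state dicts rather than a `transformers`-compatible checkpoint, loading follows the usual PyTorch pattern. The sketch below uses a stand-in module and a placeholder checkpoint path; the real model class comes from this project's codebase.

```python
# Sketch of the raw state-dict round trip. "TinyStandIn" and "ckpt.pt" are
# placeholders for illustration -- the real model class lives in this
# project's codebase, and the real checkpoint holds the 1B-parameter weights.
import torch
import torch.nn as nn

class TinyStandIn(nn.Module):
    """Stand-in for the real model class (hypothetical)."""
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(8, 8, bias=False)

model = TinyStandIn()
torch.save(model.state_dict(), "ckpt.pt")          # how a raw state dict is written

restored = TinyStandIn()
state = torch.load("ckpt.pt", map_location="cpu")  # load onto CPU first
restored.load_state_dict(state, strict=True)       # keys must match the model code
assert torch.equal(model.proj.weight, restored.proj.weight)
```

With `strict=True`, `load_state_dict` raises if the checkpoint keys do not match the module's parameters, which is why the matching model code is required.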

## Purpose

This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

## License

Apache 2.0