---
license: apache-2.0
language:
- en
tags:
- baseline
- gqa
- swiglu
pipeline_tag: text-generation
---

# Goedel-Baseline-1B

Standard transformer baseline for comparison with [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B, but replaces modified HyperConnections with standard pre-norm residual connections. See [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B) for the primary model and full discussion.

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667x expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |

## Training

- **Data:** 20B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), GPT-2 tokenizer
- **Hardware:** 8x H200 SXM on Vast.ai
- **Sequence length:** 4096
- **Effective batch:** 384 sequences per optimizer step (8 GPUs x 8 micro-batch x 6 grad accum), i.e. ~1.57M tokens/step at 4096 tokens per sequence

## Results

| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|-----------|------------------------------|--------------------------|
| BPB (wikitext-2) | 1.130 | **1.087** |
| val_loss (FineWeb-Edu) | 2.686 | **2.645** |
| HellaSwag | 36.2% | **39.7%** |
| ARC-Easy | 52.8% | **57.8%** |
| ARC-Challenge | 23.9% | **24.3%** |
| WinoGrande | 53.1% | **54.9%** |

The mHC variant wins on all benchmarks with **15% fewer parameters**.
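The effective-batch arithmetic above can be sanity-checked with a few lines of pure Python (no dependencies; the ~12.7k-step figure in the comments is derived from the 20B-token budget, not stated in the card):

```python
# Effective batch size and tokens per optimizer step,
# from the training setup described above.
gpus = 8          # H200 SXM GPUs
micro_batch = 8   # sequences per GPU per forward pass
grad_accum = 6    # gradient accumulation steps
seq_len = 4096    # tokens per sequence

sequences_per_step = gpus * micro_batch * grad_accum
tokens_per_step = sequences_per_step * seq_len

print(sequences_per_step)  # 384 sequences
print(tokens_per_step)     # 1572864 tokens (~1.57M)

# Implied optimizer steps for the 20B-token budget:
print(round(20_000_000_000 / tokens_per_step))  # ~12716 steps
```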
## Full Config

```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304
  attention:
    type: gqa
    num_heads: 16
    num_kv_heads: 4
    head_dim: 128
    qk_norm: true
    rope_theta: 10000
  ffn:
    type: swiglu
    intermediate_mult: 2.667
  residual:
    type: prenorm
optim:
  type: adamw
  lr: 3.0e-4
  scheduler: cosine
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0
training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 6
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs
data:
  shard_dir: data/fineweb_edu
```

## Limitations

- **Undertrained:** 20B tokens is well below Chinchilla-optimal for 1.2B parameters.
- **English only:** Trained exclusively on English web text.
- **No instruction tuning:** Base model only; not suitable for chat or instruction-following without fine-tuning.
- **Custom codebase required:** Weights are saved as raw PyTorch state dicts and require this project's model code to load.

## Purpose

This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

## License

Apache 2.0
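As a sanity check on the parameter count, the config above admits a rough back-of-envelope estimate. This is a sketch under stated assumptions (no biases, tied input/output embeddings, norm parameters ignored, SwiGLU hidden size rounded from `dim * intermediate_mult`) — the exact rounding and tying choices are not confirmed by the card, so expect the result to land in the same ballpark as 1,185M rather than match it exactly:

```python
# Rough parameter-count estimate from the config above.
# Assumptions (not confirmed by the model card): no biases,
# tied embeddings, norm/scale parameters ignored.
dim = 2048
n_layers = 24
vocab = 50304
n_heads, n_kv_heads, head_dim = 16, 4, 128
ffn_hidden = round(dim * 2.667)  # 5462; exact rounding is a guess

embed = vocab * dim                          # token embedding (tied with LM head)
attn = dim * (n_heads * head_dim)            # Q projection
attn += 2 * dim * (n_kv_heads * head_dim)    # K and V projections (GQA: 4 KV heads)
attn += (n_heads * head_dim) * dim           # output projection
ffn = 3 * dim * ffn_hidden                   # SwiGLU: gate, up, down matrices
per_layer = attn + ffn

total = embed + n_layers * per_layer
print(f"{total / 1e6:.0f}M")  # 1160M — same order as the 1,185M reported
```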