---
license: apache-2.0
language:
- en
tags:
- baseline
- gqa
- swiglu
pipeline_tag: text-generation
---
# Goedel-Baseline-1B
Standard transformer baseline for comparison with [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).
This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B but replaces modified HyperConnections with standard pre-norm residual connections. See [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B) for the primary model and full discussion.
## Architecture
| Component | Details |
|-----------|---------|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667x expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |
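The GQA split above (16 query heads sharing 4 KV heads) can be sanity-checked with some back-of-the-envelope arithmetic. This is a sketch derived purely from the table's numbers, assuming bias-free projection layers; it is not code from this repository:

```python
# GQA arithmetic from the architecture table (pure arithmetic, bias-free
# linear projections assumed).
dim = 2048
n_heads = 16        # query heads
n_kv_heads = 4      # shared key/value heads (GQA)
head_dim = 128
group_size = n_heads // n_kv_heads  # 4 query heads share each KV head

# Per-layer attention projection parameters
q_params = dim * n_heads * head_dim             # 4,194,304
kv_params = 2 * dim * n_kv_heads * head_dim     # K and V: 2,097,152
o_params = n_heads * head_dim * dim             # 4,194,304

# KV-cache entries per token per layer, vs. full multi-head attention
kv_cache_gqa = 2 * n_kv_heads * head_dim        # 1,024 values/token/layer
kv_cache_mha = 2 * n_heads * head_dim           # 4,096 values/token/layer

print(group_size)                       # 4
print(kv_cache_mha // kv_cache_gqa)     # 4x smaller KV cache than MHA
```

With 4 query heads per KV head, the KV cache (and the K/V projection weights) shrink 4x relative to full multi-head attention at the same head count.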
## Training
- **Data:** 20B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), GPT-2 tokenizer
- **Hardware:** 8x H200 SXM on Vast.ai
- **Sequence length:** 4096
- **Effective batch:** 384 sequences (8 GPUs x 8 micro-batch x 6 grad accum; 384 x 4096 ≈ 1.57M tokens/step)
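The batch and token-budget figures above check out arithmetically; a quick sketch (pure arithmetic, no training code):

```python
# Sanity-check the effective-batch numbers quoted above.
gpus = 8
micro_batch = 8
grad_accum = 6
seq_len = 4096

sequences_per_step = gpus * micro_batch * grad_accum   # 384
tokens_per_step = sequences_per_step * seq_len         # 1,572,864 ≈ 1.57M

total_tokens = 20_000_000_000
steps = total_tokens // tokens_per_step                # ~12.7k optimizer steps
print(sequences_per_step, tokens_per_step, steps)      # 384 1572864 12715
```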
## Results
| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|-----------|------------------------------|--------------------------|
| BPB (WikiText-2) | 1.130 | **1.087** |
| val_loss (FineWeb-Edu) | 2.686 | **2.645** |
| HellaSwag | 36.2% | **39.7%** |
| ARC-Easy | 52.8% | **57.8%** |
| ARC-Challenge | 23.9% | **24.3%** |
| WinoGrande | 53.1% | **54.9%** |
The mHC variant outperforms this baseline on every benchmark while using roughly **15% fewer parameters** (1,009M vs. 1,185M).
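For readers comparing the two metric types: val_loss is cross-entropy in nats per token, which converts to bits per token by dividing by ln 2. Bits-per-byte (BPB) additionally divides by the eval set's bytes-per-token ratio under the tokenizer, which this card does not report, so the figure below uses a typical GPT-2/English value purely as an illustrative assumption (note also that BPB here is measured on WikiText-2, val_loss on FineWeb-Edu, so the two numbers are not directly comparable):

```python
import math

# Baseline val_loss from the table, in nats per token (FineWeb-Edu).
val_loss_nats = 2.686
bits_per_token = val_loss_nats / math.log(2)
print(round(bits_per_token, 3))   # ≈ 3.875

# BPB = bits/token divided by bytes/token. ~4.3 bytes/token is a common
# ballpark for GPT-2 on English text -- an assumption, not a reported value.
bytes_per_token = 4.3
print(round(bits_per_token / bytes_per_token, 3))
```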
## Full Config
```yaml
model:
dim: 2048
n_layers: 24
vocab_size: 50304
attention:
type: gqa
num_heads: 16
num_kv_heads: 4
head_dim: 128
qk_norm: true
rope_theta: 10000
ffn:
type: swiglu
intermediate_mult: 2.667
residual:
type: prenorm
optim:
type: adamw
lr: 3.0e-4
scheduler: cosine
warmup_steps: 500
weight_decay: 0.1
max_grad_norm: 1.0
training:
tokens: 20_000_000_000
batch_size: 8
seq_len: 4096
grad_accum_steps: 6
liger: true
compile: true
compile_mode: max-autotune-no-cudagraphs
data:
shard_dir: data/fineweb_edu
```
## Limitations
- **Undertrained:** 20B tokens is below the Chinchilla-optimal budget (~20 tokens per parameter, i.e. roughly 24B tokens for 1.2B parameters).
- **English only:** Trained exclusively on English web text.
- **No instruction tuning:** Base model only; not suitable for chat or instruction-following without fine-tuning.
- **Custom codebase required:** Weights are saved as raw PyTorch state dicts and require this project's model code to load.
## Purpose
This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).
## License
Apache 2.0