---
license: apache-2.0
language:
- en
tags:
- baseline
- gqa
- swiglu
pipeline_tag: text-generation
---

# Goedel-Baseline-1B

Standard transformer baseline for comparison with [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

This model exists as an ablation reference. It uses the same data, compute budget, and hyperparameters as Goedel-mHC-1B but replaces modified HyperConnections with standard pre-norm residual connections. See [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B) for the primary model and full discussion.

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 1,185M |
| Dimensions | 2048 |
| Layers | 24 |
| Vocab size | 50304 |
| Attention | GQA — 16 heads, 4 KV heads, 128 head dim, QK-norm |
| FFN | SwiGLU, 2.667× expansion |
| Residual | Standard pre-norm (RMSNorm) |
| Optimizer | AdamW (LR 3e-4), cosine schedule, 500-step warmup |
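
As a sanity check, the parameter count can be roughly decomposed from the dimensions in the table. This is a back-of-envelope sketch, not the project's counting code: the FFN width rounding, the absence of biases, and whether embeddings are tied are all assumptions here, so the totals bracket rather than reproduce the reported 1,185M.

```python
# Rough parameter decomposition from the architecture table.
# Assumptions (not confirmed by this card): FFN width = int(dim * 2.667),
# no biases, norm parameters ignored; tied and untied embeddings both shown.
dim, n_layers, vocab = 2048, 24, 50304
n_heads, n_kv_heads, head_dim = 16, 4, 128
ffn_hidden = int(dim * 2.667)  # ~5461; real code may round to a multiple of 64

attn = dim * (n_heads * head_dim)           # Q projection
attn += 2 * dim * (n_kv_heads * head_dim)   # K and V (GQA: fewer KV heads)
attn += (n_heads * head_dim) * dim          # output projection

ffn = 3 * dim * ffn_hidden                  # SwiGLU: gate, up, and down matrices

per_layer = attn + ffn
embed = vocab * dim

tied = n_layers * per_layer + embed         # shared input/output embedding
untied = tied + embed                       # separate lm_head

print(f"per layer: {per_layer/1e6:.1f}M, tied: {tied/1e6:.0f}M, untied: {untied/1e6:.0f}M")
```

The reported 1,185M falls between the tied (~1,160M) and untied (~1,263M) estimates, which is consistent with the table once norm parameters and the exact FFN width are accounted for.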

## Training

- **Data:** 20B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), GPT-2 tokenizer
- **Hardware:** 8x H200 SXM on Vast.ai
- **Sequence length:** 4096
- **Effective batch:** 384 sequences per step (8 GPUs × 8 micro-batch × 6 grad accum), i.e. 384 × 4096 ≈ 1.57M tokens/step

## Results

| Benchmark | Goedel-Baseline-1B (1,185M) | Goedel-mHC-1B (1,009M) |
|-----------|------------------------------|--------------------------|
| BPB (WikiText-2) | 1.130 | **1.087** |
| val_loss (FineWeb-Edu) | 2.686 | **2.645** |
| HellaSwag | 36.2% | **39.7%** |
| ARC-Easy | 52.8% | **57.8%** |
| ARC-Challenge | 23.9% | **24.3%** |
| WinoGrande | 53.1% | **54.9%** |

The mHC variant outperforms the baseline on every benchmark with **15% fewer parameters**.

## Full Config

```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304

attention:
  type: gqa
  num_heads: 16
  num_kv_heads: 4
  head_dim: 128
  qk_norm: true
  rope_theta: 10000

ffn:
  type: swiglu
  intermediate_mult: 2.667

residual:
  type: prenorm

optim:
  type: adamw
  lr: 3.0e-4
  scheduler: cosine
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0

training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 6
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs

data:
  shard_dir: data/fineweb_edu
```

## Limitations

- **Undertrained:** 20B tokens is below the Chinchilla-optimal budget (~20 tokens per parameter, roughly 24B tokens for this model).
- **English only:** Trained exclusively on English web text.
- **No instruction tuning:** Base model only; not suitable for chat or instruction following without fine-tuning.
- **Custom codebase required:** Weights are saved as raw PyTorch state dicts and require this project's model code to load.
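
Since the weights are raw state dicts rather than a `transformers`-compatible checkpoint, loading follows the usual PyTorch pattern. The sketch below uses a stand-in module and a placeholder checkpoint path; the real model class comes from this project's codebase.

```python
# Sketch of the raw state-dict round trip. "TinyStandIn" and "ckpt.pt" are
# placeholders for illustration -- the real model class lives in this
# project's codebase, and the real checkpoint holds the 1B-parameter weights.
import torch
import torch.nn as nn

class TinyStandIn(nn.Module):
    """Stand-in for the real model class (hypothetical)."""
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(8, 8, bias=False)

model = TinyStandIn()
torch.save(model.state_dict(), "ckpt.pt")          # how a raw state dict is written

restored = TinyStandIn()
state = torch.load("ckpt.pt", map_location="cpu")  # load onto CPU first
restored.load_state_dict(state, strict=True)       # keys must match the model code
assert torch.equal(model.proj.weight, restored.proj.weight)
```

With `strict=True`, `load_state_dict` raises if the checkpoint keys do not match the module's parameters, which is why the matching model code is required.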

## Purpose

This model exists solely as an ablation reference to isolate the impact of modified HyperConnections. For the primary model, see [Goedel-mHC-1B](https://huggingface.co/GoedelMachines/Goedel-mHC-1B).

## License

Apache 2.0