---
license: apache-2.0
language:
- en
tags:
- hyperconnections
- mhc
- gated-gqa
- relu2
- muon
pipeline_tag: text-generation
---

# Goedel-mHC-1B

A 1B-parameter language model built on **multi-stream Hyperconnections (mHC)**, Gated GQA, and ReLU² FFN. This is the first open 1B+ LLM using mHC as its residual connection mechanism.

This is an architecture research release. The model is trained on 20B tokens of FineWeb-Edu to validate that mHC, combined with modern attention and FFN innovations from the NanoGPT speedrun community, produces better scaling behavior than a standard transformer at equivalent compute. It does: **3.8% better bits-per-byte with 15% fewer parameters** compared to a standard GQA + SwiGLU + PreNorm + AdamW baseline trained identically.

## Architecture

**Parameters:** 1,009M

| Component | Design | Details |
|-----------|--------|---------|
| **Attention** | Gated GQA | 16 query heads, 4 KV heads, 128 head dim, QK-norm, sigmoid output gate |
| **FFN** | ReLU² | `relu(x)²` activation, 2.667x expansion (hidden dim rounded up to a multiple of 256) |
| **Residual** | mHC | 4 parallel streams, Sinkhorn-constrained mixing matrices per layer |
| **Norm** | RMSNorm | Pre-norm within mHC streams |
| **Positional** | RoPE | θ = 10,000 |
| **Vocab** | 50,304 | GPT-2 tokenizer, padded to multiple of 64 |

### Gated GQA

Grouped Query Attention with a learned sigmoid output gate, following Qwen3 ([arXiv:2505.09388](https://arxiv.org/abs/2505.09388)). After computing standard GQA attention output, an additional linear projection produces a gate of the same shape, and the output is element-wise multiplied by `sigmoid(gate)`. This eliminates attention sink tokens and prevents bf16 loss spikes that occur with standard attention at scale.
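
The gating step can be sketched as follows (a minimal numpy version; `w_gate` is an illustrative single-matrix projection, whereas the real implementation operates on multi-head tensors):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_attention_output(attn_out, x, w_gate):
    """Element-wise sigmoid gate on the attention output.

    attn_out: (B, S, D) standard GQA output
    x:        (B, S, D) the attention block's input, used to compute the gate
    w_gate:   (D, D) learned gate projection
    """
    gate = x @ w_gate                  # gate has the same shape as attn_out
    return attn_out * sigmoid(gate)    # gated output
```

Note that at `w_gate = 0` the gate is `sigmoid(0) = 0.5`, so the output is simply halved; the gate only becomes selective as the projection is learned.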

### ReLU²

From the NanoGPT speedrun lineage. The FFN applies `relu(x * W_up)²` followed by `W_down`. Squared ReLU produces sparser activations than SwiGLU while being simpler and more fusible. The intermediate dimension is `dim * 2.667`, rounded up to the nearest multiple of 256 for hardware alignment.

### Multi-stream Hyperconnections (mHC)

Instead of a single residual stream `x + sublayer(norm(x))`, mHC maintains `n` parallel streams of the full hidden dimension. Between layers, streams are mixed via a learned doubly-stochastic matrix (enforced by Sinkhorn-Knopp iterations on a `4x4` logit matrix). A learned `h_pre` vector combines streams into a single input for each sublayer, and a learned `h_post` vector distributes the sublayer output back across streams.

At initialization, mHC exactly recovers standard pre-norm residual connections. During training, the model learns to route information through multiple parallel pathways, which empirically improves gradient flow and representation capacity.

The expanded hidden state between blocks has shape `(B, S, n*D)` where `n=4` streams and `D=2048`, so the inter-block representation is 8,192-dimensional. `expand()` replicates the embedding into streams after the embedding layer; `contract()` averages across streams before the final norm.
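
The mixing constraint and the stream expand/contract steps can be sketched as follows (a minimal numpy version; the iteration count and the exact expand/contract layout are assumptions):

```python
import numpy as np

def sinkhorn(logits, n_iters=30):
    """Project a logit matrix onto (approximately) doubly-stochastic
    matrices by alternating row/column normalization (Sinkhorn-Knopp)."""
    m = np.exp(logits - logits.max())        # positive, numerically stable
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)    # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)    # columns sum to 1
    return m

def expand(x, n=4):
    """Replicate the embedding into n streams: (B, S, D) -> (B, S, n, D)."""
    return np.repeat(x[..., None, :], n, axis=-2)

def contract(x):
    """Average streams back to a single state: (B, S, n, D) -> (B, S, D)."""
    return x.mean(axis=-2)
```

Because the mixing matrix is doubly stochastic, it neither amplifies nor attenuates the total signal across streams, which is what lets mHC recover the standard residual at initialization.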

Reference: [Zhu et al., 2024](https://arxiv.org/abs/2409.19606), [Xie et al., 2025](https://arxiv.org/abs/2512.24880).

### Optimizer: NorMuon

A split optimizer: **Muon** (LR 0.007) for all 2D weight matrices, **Adam** (LR 3e-4) for 1D parameters and embeddings. Muon applies Nesterov momentum to each weight's gradient, then orthogonalizes the momentum update via Newton-Schulz iterations, which significantly accelerates training of large weight matrices. The "NorMuon" variant adds a normalization step with beta2 = 0.95 for additional stability.
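
The orthogonalization step can be sketched with a simple cubic Newton-Schulz iteration (the production Muon kernel uses a tuned quintic polynomial in bf16; this is an illustrative numpy version):

```python
import numpy as np

def newton_schulz_orthogonalize(g, n_iters=30):
    """Approximately map a matrix to the nearest (semi-)orthogonal matrix.

    Normalizing by the Frobenius norm puts all singular values in (0, 1];
    the iteration X <- 1.5 X - 0.5 X X^T X then drives them toward 1
    without ever computing an explicit SVD.
    """
    x = g / np.linalg.norm(g)
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x
```

In Muon this is applied to the Nesterov momentum buffer of each 2D weight, and the orthogonalized result is scaled and used as the update in place of the raw gradient.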

LR schedule: trapezoidal with 500-step linear warmup, constant phase, and 45% linear cooldown.
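
The schedule can be written out as follows (the edge behavior at step boundaries is an assumption; only the warmup steps and cooldown fraction are reported above):

```python
def trapezoidal_lr(step, total_steps, base_lr,
                   warmup_steps=500, cooldown_fraction=0.45):
    """Linear warmup -> constant plateau -> linear cooldown to zero."""
    cooldown_start = int(total_steps * (1.0 - cooldown_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warmup
    if step < cooldown_start:
        return base_lr                               # constant phase
    return base_lr * (total_steps - step) / (total_steps - cooldown_start)
```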

## Training

| Setting | Value |
|---------|-------|
| **Data** | 20B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
| **Tokenizer** | GPT-2 (50,257 tokens, vocab padded to 50,304) |
| **Hardware** | 8x NVIDIA H200 SXM (Vast.ai) |
| **Sequence length** | 4,096 |
| **Per-GPU batch** | 8 sequences |
| **Gradient accumulation** | 4 steps |
| **Effective batch** | 256 sequences (8 GPUs x 8 seq x 4 accum = 1.05M tokens/step) |
| **Precision** | bf16 |
| **Compilation** | `torch.compile` with `max-autotune-no-cudagraphs` |
| **Fused CE loss** | Liger kernel (never materializes full logit tensor) |
| **Weight tying** | Embedding and LM head share weights |
| **Wall-clock time** | ~21 hours |
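
The effective-batch arithmetic in the table checks out as follows (the derived optimizer-step count is an estimate from these numbers, not a reported figure):

```python
gpus, per_gpu_batch, grad_accum, seq_len = 8, 8, 4, 4096

effective_batch = gpus * per_gpu_batch * grad_accum   # 256 sequences per step
tokens_per_step = effective_batch * seq_len           # 1,048,576 ~ 1.05M tokens
total_steps = 20_000_000_000 // tokens_per_step       # ~19,073 optimizer steps
```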

## Results

| Benchmark | Goedel-mHC-1B (1,009M) | Baseline (1,185M) |
|-----------|------------------------|-------------------|
| **BPB** (wikitext-2) | **1.087** | 1.130 |
| **val_loss** (FineWeb-Edu) | **2.645** | 2.686 |
| **HellaSwag** | **39.7%** | 36.2% |
| **ARC-Easy** | **57.8%** | 52.8% |
| **ARC-Challenge** | **24.3%** | 23.9% |
| **WinoGrande** | **54.9%** | 53.1% |

Both models trained on 20B tokens of FineWeb-Edu, 8xH200. The baseline uses GQA + SwiGLU + PreNorm + AdamW and has 15% more parameters.

**Key result:** Goedel-mHC-1B achieves 3.8% better BPB with 15% fewer parameters than the baseline, demonstrating that the combination of mHC + Gated GQA + ReLU² + NorMuon meaningfully improves parameter efficiency.

## Full Config

The complete resolved configuration used for training:

```yaml
model:
  dim: 2048
  n_layers: 24
  vocab_size: 50304

attention:
  type: gated_gqa
  num_heads: 16
  num_kv_heads: 4
  head_dim: 128
  qk_norm: true
  rope_theta: 10000

ffn:
  type: relu2
  intermediate_mult: 2.667

residual:
  type: mhc
  n_streams: 4

optim:
  type: muon
  lr: 3.0e-4
  muon_lr: 0.007
  normuon: true
  normuon_beta2: 0.95
  scheduler: trapezoidal
  cooldown_fraction: 0.45
  warmup_steps: 500
  weight_decay: 0.1
  max_grad_norm: 1.0

training:
  tokens: 20_000_000_000
  batch_size: 8
  seq_len: 4096
  grad_accum_steps: 4
  liger: true
  compile: true
  compile_mode: max-autotune-no-cudagraphs

data:
  shard_dir: data/fineweb_edu
```

## Limitations

- **Undertrained.** 20B tokens is far below modern standards. Comparable 1B models typically train on 1--4T tokens. This model exists to validate architecture choices, not to compete on downstream benchmarks.
- **English only.** FineWeb-Edu is English-language web text filtered for educational content.
- **Base model only.** No instruction tuning, RLHF, or alignment. The model will not follow instructions reliably and may generate harmful or incorrect text.
- **Custom architecture required.** mHC, Gated GQA, and ReLU² are not standard HuggingFace `transformers` architectures. You cannot load this model with `AutoModelForCausalLM`. Loading requires our custom codebase. Code release is coming.
- **Not suitable for production use.** This is a research artifact for architecture exploration.

## What's Next

- **100B-token continued pretraining** on FineWeb-HQ is currently in progress, which will bring this architecture closer to a properly trained model.
- **Full code release** and technical writeup will accompany the 100B results.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{goedel2026mhc1b,
  title={Goedel-mHC-1B: First Open 1B+ Language Model with Multi-Stream Hyperconnections},
  author={Goedel Machines},
  year={2026},
  url={https://huggingface.co/GoedelMachines/Goedel-mHC-1B}
}
```

This work builds on the Hyper-Connections papers:

```bibtex
@article{zhu2024hyperconnections,
  title={Hyper-Connections},
  author={Defa Zhu and Hongzhi Huang and Zihao Huang and Yutao Zeng and Yunyao Mao and Banggu Wu and Qiyang Min and Xun Zhou},
  journal={arXiv preprint arXiv:2409.19606},
  year={2024}
}

@article{xie2025mhc,
  title={Manifold-Constrained Hyper-Connections},
  author={Zhenda Xie and Yixuan Wei and Huanqi Cao and others},
  journal={arXiv preprint arXiv:2512.24880},
  year={2025}
}
```