---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- safetensors
- text-generation
- small-llm
- custom-architecture
- linear-attention
- gated-deltanet
- test-time-training
- hybrid-attention
- research
library_name: genesis-llm
datasets:
- HuggingFaceTB/smol-smoltalk
base_model: []
---
# 🧬 Genesis-152M-Instruct

**A Research-Oriented Small Language Model with Hybrid Linear Attention**
---
## Table of Contents
- [Overview](#overview)
- [Model Summary](#model-summary)
- [Architecture Deep Dive](#architecture-deep-dive)
- [Hybrid Attention Layout](#hybrid-attention-layout)
- [Gated DeltaNet (GLA)](#gated-deltanet-gla)
- [Forgetting Attention (FoX)](#forgetting-attention-fox)
- [Test-Time Training (TTT)](#test-time-training-ttt)
- [Selective Activation](#selective-activation)
- [Additional Components](#additional-components)
- [Comparison with Other Architectures](#comparison-with-other-architectures)
- [Training Details](#training-details)
- [Pre-training](#pre-training)
- [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
- [Limitations](#limitations)
- [Citation](#citation)
- [License](#license)
---
## Overview
**Genesis-152M-Instruct** is an experimental small language model that combines recent advances in efficient attention mechanisms into a single architecture. It serves as a research platform for exploring:
- **Hybrid attention**: Mixing O(n) linear attention with O(n²) softmax attention
- **Efficient inference**: Sub-quadratic complexity for most layers
- **Adaptive computation**: Test-time training for dynamic model adaptation
> ⚠️ **Experimental Model**: This is a research artifact, not a production-ready model. It demonstrates architectural innovations but has limitations typical of small models.
---
## Model Summary
| Property | Value |
|----------|-------|
| **Parameters** | 151.8M total (~122.8M non-embedding) |
| **Architecture** | Hybrid GLA + FoX Attention |
| **Context Length** | 2,048 tokens |
| **Vocab Size** | 50,279 (GPT-NeoX + ChatML tokens) |
| **Pre-training Data** | 2B tokens |
| **SFT Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **License** | Apache 2.0 |
### Files in this Repository
```
├── genesis_152m_instruct.safetensors  # Model weights
├── README.md                          # This model card
└── LICENSE                            # Apache 2.0
```
---
## Architecture Deep Dive
Genesis follows a **"deep-and-thin"** design philosophy inspired by [SmolLM2](https://arxiv.org/abs/2502.02737) and [MobileLLM](https://arxiv.org/abs/2402.14905), which has proven effective for small language models.
### Core Configuration
| Component | Value | Rationale |
|-----------|-------|-----------|
| Layers | 30 | Deep architecture for better representation |
| Hidden Size | 576 | Optimal width for 150M scale |
| Attention Heads | 9 | Query heads |
| KV Heads | 3 | 3:1 GQA ratio for memory efficiency |
| Head Dimension | 64 | Standard for efficient attention |
| FFN Size | 1,440 | 2.5× expansion (SwiGLU-efficient) |
| Weight Tying | ✅ | Embeddings tied with LM head |
---
### Hybrid Attention Layout
Genesis employs a **hybrid attention layout** inspired by [Qwen3-Next](https://huggingface.co/docs/transformers/main/en/model_doc/qwen3_next), alternating between linear and full attention:
```
Layer Distribution (30 layers):
├── 23 layers: GLA (Gated DeltaNet)     - O(n) linear attention
└──  7 layers: FoX (Forgetting Attention) - O(n²) softmax with forget gate
Ratio: ~75% Linear / ~25% Full Attention (23:7)
```
**Why hybrid?** Pure linear attention struggles with precise retrieval tasks (e.g., copying, in-context learning). Interleaving full attention layers restores this capability while maintaining overall efficiency.
> 📖 **Reference**: The hybrid approach is validated by Qwen3-Next (2025) and research showing that [3:1 to 6:1 linear-to-full ratios](https://arxiv.org/abs/2507.06457) optimize the efficiency-quality tradeoff.
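For concreteness, here is a minimal sketch of one plausible interleaving. The exact layer placement used by Genesis is not specified in this card; the pattern below (full attention at every 4th layer) is an illustrative assumption that happens to reproduce the 23/7 split:

```python
# Hypothetical layer layout: full attention every 4th layer.
NUM_LAYERS = 30

layer_types = [
    "fox" if (i + 1) % 4 == 0 else "gla"
    for i in range(NUM_LAYERS)
]

assert layer_types.count("gla") == 23
assert layer_types.count("fox") == 7
```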
---
### Gated DeltaNet (GLA)
The primary attention mechanism (75% of layers) is **Gated DeltaNet**, a state-of-the-art O(n) linear attention mechanism from NVIDIA.
#### Key Features
| Feature | Description | Paper Reference |
|---------|-------------|-----------------|
| **Delta Rule** | Online learning rule for recurrent state updates | [Schlag et al., 2021](https://arxiv.org/abs/2102.11174) |
| **Gated Forget** | Mamba-style data-dependent forgetting | [Gu & Dao, 2023](https://arxiv.org/abs/2312.00752) |
| **Short Convolution** | 1D conv on Q, K, V for local context | [Gu et al., 2022](https://arxiv.org/abs/2212.14052) |
| **L2 QK-Norm** | Stabilizes attention scores | Standard practice |
#### Mathematical Formulation
The delta rule update enables the model to selectively write to and erase from a recurrent state:
```
S_t = α_t * S_{t-1} + β_t * (v_t ⊗ k_t - (S_{t-1} @ k_t) ⊗ k_t)
o_t = S_t @ q_t
```
Where:
- `S_t`: Recurrent state matrix
- `α_t`: Forget gate (data-dependent)
- `β_t`: Learning rate gate (per-token)
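A per-token reference implementation of this recurrence is shown below. Note that the production kernel uses a chunked parallel form (`gla_chunk_size: 64`); this sequential loop is for illustration only, and the function name and shapes are assumptions:

```python
import torch

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of the gated delta rule, following the recurrence above.

    S:     (d_v, d_k) recurrent state matrix
    q, k:  (d_k,) query / key (assumed L2-normalized, per qk_norm="l2")
    v:     (d_v,) value
    alpha: scalar forget gate in (0, 1), data-dependent
    beta:  scalar per-token learning-rate gate
    """
    # Prediction error: what the current state would read out for key k
    error = v - S @ k                          # (d_v,)
    # Decay old memories, then write the correction as a rank-1 update
    S = alpha * S + beta * torch.outer(error, k)
    o = S @ q                                  # (d_v,)
    return S, o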
> 📄 **Paper**: [Gated Delta Networks: Improving Mamba2 with Delta Rule](https://arxiv.org/abs/2412.06464) (ICLR 2025)
>
> 📦 **Code**: [NVlabs/GatedDeltaNet](https://github.com/NVlabs/GatedDeltaNet)
#### Configuration in Genesis
```python
gla_expand_k: 0.75 # Key expansion ratio
gla_expand_v: 1.5 # Value expansion ratio (asymmetric)
gla_gate_fn: "swish" # Gating activation
gla_use_short_conv: True
gla_conv_size: 4
gla_chunk_size: 64 # For chunked parallel training
gla_use_delta_rule: True
gla_qk_norm: "l2"
gla_use_mamba_gate: True
```
---
### Forgetting Attention (FoX)
The full attention layers (25%) use **FoX (Forgetting Transformer)**, which augments standard softmax attention with a learnable forget gate.
#### Why FoX over Standard Attention?
| Aspect | Standard Attention | FoX |
|--------|-------------------|-----|
| Position Encoding | Requires RoPE/ALiBi | **NoPE** (implicit via forget gate) |
| Long-range Decay | Uniform attention | Data-dependent decay |
| Length Extrapolation | Poor | Better generalization |
#### Mechanism
FoX modifies attention scores with cumulative forget gates:
```
attn[i,j] = softmax(q_i @ k_j / √d + Σ_{k=j}^{i} log(f_k))
```
Where `f_k = sigmoid(W_f @ x_k)` is a learned forget gate that naturally down-weights distant tokens.
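The cumulative-sum structure of the bias makes it cheap to compute. Below is a minimal sketch that naively materializes the full score matrix (real implementations fuse the bias into a FlashAttention-style kernel); the function name and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fox_scores(q, k, log_f):
    """Causal attention with the cumulative forget-gate bias above.

    q, k:  (T, d) queries / keys
    log_f: (T,)  per-token log forget gates, log(sigmoid(W_f @ x))
    """
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5                  # (T, T)
    c = torch.cumsum(log_f, dim=0)                 # c[i] = sum_{k<=i} log f_k
    # Bias: sum_{k=j}^{i} log f_k = c[i] - c[j] + log_f[j]
    bias = c[:, None] - c[None, :] + log_f[None, :]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = (scores + bias).masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1)
```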
> 📄 **Paper**: [Forgetting Transformer: Softmax Attention with a Forget Gate](https://arxiv.org/abs/2503.02130) (ICLR 2025)
>
> 📦 **Code**: [zhixuan-lin/forgetting-transformer](https://github.com/zhixuan-lin/forgetting-transformer)
#### FoX "Pro" Design
Genesis uses the enhanced "Pro" block design:
| Component | Purpose |
|-----------|---------|
| Output Gate | Controls information flow (like GLA) |
| QK-Norm | Training stability |
| Short Convolution | Local context on K, V |
| FusedRMSNormSwishGate | Efficient fused operations |
---
### Test-Time Training (TTT)
Genesis includes an experimental **TTT metacognition layer** that adapts the model during inference.
#### Concept
Traditional models have **fixed weights** at inference. TTT layers have a small set of **fast weights** that update based on the input sequence, allowing the model to "learn" from context.
```
Standard: y = f(x; θ_fixed)
TTT:      y = f(x; θ_fixed, θ_fast(x))
```
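To make the fast-weight idea concrete, here is a heavily simplified sketch of one inner-loop adaptation step using low-rank fast weights. The self-supervised reconstruction objective follows the TTT paper; everything else (function name, shapes, LoRA-style initialization) is an illustrative assumption, not the actual Genesis layer:

```python
import torch

def ttt_fast_forward(x, W_slow, inner_lr=0.01, rank=4):
    """One inner TTT step with low-rank fast weights (sketch).

    x:      (T, d) token representations for the current sequence
    W_slow: (d, d) frozen slow weights
    """
    d = x.shape[-1]
    # Fast weights live only for this sequence. LoRA-style init:
    # B starts at zero, so the first inner step writes into B.
    A = (0.02 * torch.randn(d, rank)).requires_grad_()
    B = torch.zeros(rank, d, requires_grad=True)

    def f(h):
        return h @ (W_slow + A @ B)

    # Inner loop: adapt fast weights to reconstruct the input
    loss = ((f(x) - x) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        A -= inner_lr * A.grad
        B -= inner_lr * B.grad
    return f(x).detach()
```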
#### Implementation Details
| Parameter | Value | Description |
|-----------|-------|-------------|
| `ttt_rank` | 4 | Low-rank adaptation dimension |
| `ttt_inner_lr` | 0.01 | Learning rate for fast weights |
| `ttt_mode` | "dual" | Parallel dual-form computation |
| `ttt_chunk_size` | 64 | Chunking for efficiency |
The "dual form" enables fully parallel gradient computation:
```python
# Instead of sequential updates:
# W_1 = W_0 - lr * grad_0
# W_2 = W_1 - lr * grad_1
# ...
# Dual form computes all at once:
# W_t = W_0 - lr * Σ_{i<t} grad_i
```

> 📄 **Paper**: [Learning to (Learn at Test Time): RNNs with Expressive Hidden States](https://arxiv.org/abs/2407.04620) (ICML 2024)
>
> 📦 **Code**: [test-time-training/ttt-lm-pytorch](https://github.com/test-time-training/ttt-lm-pytorch)
#### When TTT Activates
TTT is designed for **inference-time adaptation** and runs only during `model.eval()`. During training, it's disabled to avoid overhead.
---
### Selective Activation
The FFN layers use **SwiGLU** with optional top-k sparsity masking.
#### SwiGLU FFN
```python
FFN(x) = (Swish(W_gate @ x) ⊙ (W_up @ x)) @ W_down
```
> 📄 **Paper**: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) (Shazeer, 2020)
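A minimal PyTorch module matching this formula, with the dimensions from the configuration table above (576 hidden, 1,440 FFN). This is a sketch of the standard SwiGLU block, not necessarily Genesis's exact implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU FFN as in the formula above (bias-free projections)."""

    def __init__(self, d_model=576, d_ffn=1440):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        # Swish(z) = z * sigmoid(z), a.k.a. SiLU
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```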
#### Selective Activation (Experimental)
| Parameter | Value |
|-----------|-------|
| `selective_k_ratio` | 0.85 (keeps top 85%) |
| `selective_use_soft_mask` | True |
**Important**: This is a **regularization technique**, not a speedup mechanism. Real sparse acceleration requires specialized kernels (e.g., Triton sparse GEMM).
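One way such a soft mask could look is sketched below: activations above the per-position top-85% magnitude cutoff pass through nearly unchanged, while the rest are smoothly suppressed. The exact gating form and the `temperature` parameter are assumptions for illustration:

```python
import torch

def selective_soft_mask(h, k_ratio=0.85, temperature=1.0):
    """Soft top-k masking of FFN activations (regularization sketch).

    h: (..., d_ffn) post-activation hidden states
    """
    k = int(k_ratio * h.shape[-1])
    # Magnitude of the k-th largest activation per position = cutoff
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]
    # Smooth gate: ~1 above the cutoff, ~0 well below it
    gate = torch.sigmoid((h.abs() - thresh) / temperature)
    return h * gate
```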
> 📄 **Related**: [ReLU Strikes Back](https://arxiv.org/abs/2310.04564) (Apple, ICLR 2024) shows natural activation sparsity can be exploited for inference.
---
### Additional Components
#### Grouped Query Attention (GQA)
Genesis uses 3:1 GQA (9 query heads, 3 KV heads) for memory efficiency during inference.
> 📄 **Paper**: [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245) (Google, 2023)
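With 3:1 GQA, each KV head is shared by 3 query heads, so the KV cache is one third the size. A common way to realize this (a sketch; the helper name is illustrative) is to repeat KV heads up to the query-head count:

```python
import torch

def repeat_kv(kv, n_rep=3):
    """Expand KV heads to match query heads: 3 KV heads x 3 = 9 Q heads.

    kv: (T, n_kv_heads, head_dim) -> (T, n_kv_heads * n_rep, head_dim)
    """
    return kv.repeat_interleave(n_rep, dim=1)

k = torch.randn(2048, 3, 64)          # 3 KV heads, head_dim 64
assert repeat_kv(k).shape == (2048, 9, 64)
```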
#### Rotary Position Embeddings (RoPE)
Partial RoPE (50% rotation) is applied in GLA layers for position awareness.
> 📄 **Paper**: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) (Su et al., 2021)
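"Partial" here means only the first half of each head dimension is rotated; the rest passes through unchanged. A sketch of this (interleaved-pair convention assumed; implementations vary in how they pair dimensions):

```python
import torch

def partial_rope(x, positions, rotary_frac=0.5, base=10000.0):
    """Apply RoPE to the first `rotary_frac` of head dims (sketch).

    x:         (T, n_heads, head_dim)
    positions: (T,) integer token positions
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_frac)          # 32 of 64 dims rotated
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    inv_freq = base ** (-torch.arange(0, rot_dim, 2).float() / rot_dim)
    angles = positions[:, None].float() * inv_freq[None, :]  # (T, rot_dim/2)
    cos = angles.cos()[:, None, :]                 # broadcast over heads
    sin = angles.sin()[:, None, :]

    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack(
        (x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1
    ).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)
```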
#### µP (Maximal Update Parametrization)
Hyperparameters were tuned using µP for potential scaling transfer.
> 📄 **Paper**: [Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer](https://arxiv.org/abs/2203.03466) (Yang et al., 2022)
>
> 📖 **Guide**: [The Practitioner's Guide to µP](https://www.cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization) (Cerebras)
#### Zero-Centered RMSNorm
Used throughout for better weight decay compatibility with µP.
---
## Comparison with Other Architectures
### vs. SmolLM2-135M (HuggingFace)
| Aspect | Genesis-152M | SmolLM2-135M |
|--------|--------------|--------------|
| **Attention** | Hybrid GLA + FoX | Standard Multi-Head |
| **Complexity** | O(n) for 75% layers | O(n²) all layers |
| **Position Encoding** | RoPE (GLA) / NoPE (FoX) | RoPE |
| **TTT** | ✅ Experimental | ❌ |
| **Pre-training** | 2B tokens | 2T tokens |
| **Architecture** | 30L Ć 576 | 30L Ć 576 |
> SmolLM2 uses 1000× more training tokens, making direct benchmark comparison unfair. Genesis demonstrates architectural innovations, not data scaling.
### vs. Qwen3-Next
| Aspect | Genesis-152M | Qwen3-Next-80B-A3B |
|--------|--------------|---------------------|
| **Scale** | 152M | 80B (3B active) |
| **Linear Attention** | GLA (same) | GLA |
| **Full Attention** | FoX | Standard |
| **Hybrid Ratio** | 75/25 | Similar |
| **MoE** | ❌ | ✅ |
Genesis can be seen as a **miniature research version** of the hybrid attention approach that Qwen3-Next uses at scale.
### vs. Mamba / Mamba-2
| Aspect | Genesis-152M | Mamba-2 |
|--------|--------------|---------|
| **Architecture** | Hybrid (Linear + Softmax) | Pure SSM |
| **Retrieval** | Strong (FoX layers) | Limited |
| **Implementation** | PyTorch + Optional Triton | Requires CUDA |
| **Flexibility** | Modular | Monolithic |
---
## Training Details
### Pre-training
| Parameter | Value |
|-----------|-------|
| **Tokens** | 2 billion |
| **Dataset Mix** | FineWeb-Edu (51%), DCLM (22%), FineMath (12%), Stack-Edu (8%), Cosmopedia (5%), Synth (2%) |
| **Context Length** | 2,048 |
| **Batch Size** | 128 |
| **Learning Rate** | 1e-3 (WSD schedule) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Weight Decay** | 0.1 |
| **Warmup** | 5% of steps |
| **Hardware** | Single A100 80GB |
#### Learning Rate Schedule
**WSD (Warmup-Stable-Decay)**:
- Warmup: 5% of training (linear ramp)
- Stable: 85% of training (constant LR)
- Decay: 10% of training (cosine to min_lr)
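As a step-to-LR function, the schedule can be sketched as follows (the `min_lr` floor is an assumption; the card does not state its value):

```python
import math

def wsd_lr(step, total_steps, peak_lr=1e-3, min_lr=1e-4,
           warmup_frac=0.05, decay_frac=0.10):
    """Warmup-Stable-Decay schedule as described above."""
    warmup = int(warmup_frac * total_steps)
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup:                       # linear ramp
        return peak_lr * step / max(warmup, 1)
    if step < decay_start:                  # constant plateau
        return peak_lr
    # cosine decay from peak_lr to min_lr
    t = (step - decay_start) / max(total_steps - decay_start, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```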
### Supervised Fine-Tuning (SFT)
| Parameter | Value |
|-----------|-------|
| **Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **Samples** | ~485K conversations |
| **Epochs** | 1 |
| **Learning Rate** | 1e-3 |
| **Batch Size** | 32 (effective: 128 with grad accum) |
#### smol-smoltalk Composition
The SFT dataset is the same one used to train SmolLM2-135M-Instruct:
| Subset | Purpose |
|--------|---------|
| smol-magpie-ultra-short | Instruction following |
| everyday-conversations | Multi-turn dialogue |
| smol-rewrite | Text editing |
| smol-summarize | Summarization |
| openhermes-100k | Knowledge & reasoning |
| systemchats-30k | System prompt following |
> This dataset was specifically curated for small models (<1B params) and avoids issues like `<think>` tags from reasoning models.
---
## Usage
### Installation
```bash
pip install genesis-llm
```
### Download Weights
```bash
pip install "huggingface-hub>=0.20"
huggingface-cli download guiferrarib/genesis-152m-instruct genesis_152m_instruct.safetensors --local-dir .
```
### Interactive Chat
```bash
genesis --model ./genesis_152m_instruct.safetensors
```
### Python API
```python
import json
import torch
from safetensors import safe_open
from safetensors.torch import load_file
from genesis import Genesis, GenesisConfig, get_tokenizer
# 1. Load config from checkpoint metadata
model_path = "./genesis_152m_instruct.safetensors"
with safe_open(model_path, framework="pt", device="cpu") as f:
metadata = f.metadata() or {}
config_dict = json.loads(metadata.get("genesis_config_json", "{}"))
config = GenesisConfig(**config_dict) if config_dict else GenesisConfig.genesis_147m()
# 2. Load model weights
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
state_dict = load_file(model_path, device=device)
model = Genesis(config).to(device)
model.load_state_dict(state_dict, strict=False)
model.eval()
# 3. Setup tokenizer (GPT-NeoX + ChatML tokens)
tokenizer = get_tokenizer("neox")
tokenizer.add_chat_tokens()
# 4. Build ChatML prompt
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain what linear attention is in simple terms.
<|im_end|>
<|im_start|>assistant
"""
# 5. Generate
input_ids = torch.tensor([tokenizer.encode(prompt)], device=device)
with torch.no_grad():
output_ids = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:].tolist())
print(response)
```
### Prompt Format
Genesis uses **ChatML** format:
```
<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
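A small helper for assembling this format (purely illustrative; not part of the genesis-llm API):

```python
def build_chatml(messages):
    """Assemble a ChatML prompt from (role, content) pairs,
    ending with an open assistant turn for generation."""
    parts = [
        f"<|im_start|>{role}\n{content}\n<|im_end|>\n"
        for role, content in messages
    ]
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = build_chatml([
    ("system", "You are a helpful assistant."),
    ("user", "Explain what linear attention is in simple terms."),
])
```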
---
## Benchmarks
Evaluated using LightEval on MPS (Apple Silicon).
### Results
| Task | Metric | Score | Stderr |
|------|--------|-------|--------|
| **ARC-Easy** (25-shot) | acc_norm | 44.02% | ±1.02 |
| **ARC-Challenge** (25-shot) | acc_norm | 24.66% | ±1.26 |
| **BoolQ** (0-shot) | acc_norm | 56.30% | ±0.87 |
| **HellaSwag** (10-shot) | acc_norm | 30.19% | ±0.46 |
| **Winogrande** (5-shot) | acc | 49.09% | ±1.41 |
| **CommonsenseQA** (0-shot) | acc_norm | 29.16% | ±1.30 |
| **OpenBookQA** (0-shot) | acc_norm | 28.60% | ±2.02 |
| **SciQ** (0-shot) | acc_norm | 46.80% | ±1.58 |
### Interpretation
| Task | Random Baseline | Genesis | Signal |
|------|-----------------|---------|--------|
| ARC-Easy | 25% | 44% | ✅ **Strong** |
| BoolQ | 50% | 56% | ✅ Learning |
| HellaSwag | ~25% | 30% | ✅ Learning |
| Winogrande | 50% | 49% | ⚠️ At baseline |
| ARC-Challenge | ~25% | 25% | ⚠️ Too hard for size |
> **Note**: With only 2B pre-training tokens (vs. 2T for SmolLM2), benchmarks primarily reflect architectural capacity rather than world knowledge.
---
## Limitations
### Known Issues
1. **Hallucinations**: Frequent factual errors due to limited pre-training data
2. **Math**: Unreliable arithmetic and multi-step reasoning
3. **Instruction Following**: Can be brittle with strict constraints
4. **TTT Overhead**: Metacognition layer adds latency (can be disabled)
### Not Suitable For
- Production deployments requiring reliability
- Tasks requiring factual accuracy
- Complex multi-step reasoning
- Safety-critical applications
### Best Use Cases
- Architecture research and ablation studies
- Efficient attention mechanism exploration
- Small model behavior analysis
- Educational purposes
---
## Citation
If you use Genesis in your research, please cite:
```bibtex
@misc{genesis2025,
title={Genesis: A Hybrid Linear Attention Architecture for Small Language Models},
author={Ferrari Brescia, Guilherme},
year={2025},
url={https://huggingface.co/guiferrarib/genesis-152m-instruct}
}
```
### Related Papers
```bibtex
@inproceedings{yang2024gated,
title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
author={Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Keutzer, Kurt},
booktitle={ICLR},
year={2025}
}
@inproceedings{lin2025forgetting,
title={Forgetting Transformer: Softmax Attention with a Forget Gate},
author={Lin, Zhixuan and others},
booktitle={ICLR},
year={2025}
}
@inproceedings{sun2024learning,
title={Learning to (Learn at Test Time): RNNs with Expressive Hidden States},
author={Sun, Yu and others},
booktitle={ICML},
year={2024}
}
@article{allal2025smollm2,
title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
author={Allal, Loubna Ben and others},
journal={arXiv preprint arXiv:2502.02737},
year={2025}
}
```
---
## License
| Component | License |
|-----------|---------|
| **Model Weights** | Apache 2.0 |
| **Code** | Apache 2.0 |
| **Training Data** | Various (see dataset cards) |
---
Built with 🧬 by the Orch-Mind team