🧬 Genesis-152M-Instruct

A Research-Oriented Small Language Model with Hybrid Linear Attention

Overview

Genesis-152M-Instruct is an experimental small language model that combines recent advances in efficient attention mechanisms into a single architecture. It serves as a research platform for exploring:

  • Hybrid attention: Mixing O(n) linear attention with O(n²) softmax attention
  • Efficient inference: Sub-quadratic complexity for most layers
  • Adaptive computation: Test-time training for dynamic model adaptation

⚠️ Experimental Model: This is a research artifact, not a production-ready model. It demonstrates architectural innovations but has limitations typical of small models.


Model Summary

| Property | Value |
|---|---|
| Parameters | 151.8M total (~122.8M non-embedding) |
| Architecture | Hybrid GLA + FoX Attention |
| Context Length | 2,048 tokens |
| Vocab Size | 50,279 (GPT-NeoX + ChatML tokens) |
| Pre-training Data | 2B tokens |
| SFT Dataset | smol-smoltalk |
| License | Apache 2.0 |

Files in this Repository

├── genesis_152m_instruct.safetensors  # Model weights
├── README.md                          # This model card
└── LICENSE                            # Apache 2.0

Architecture Deep Dive

Genesis follows a "deep-and-thin" design philosophy inspired by SmolLM2 and MobileLLM, which has proven effective for small language models.

Core Configuration

| Component | Value | Rationale |
|---|---|---|
| Layers | 30 | Deep architecture for better representation |
| Hidden Size | 576 | Optimal width for the 150M scale |
| Attention Heads | 9 | Query heads |
| KV Heads | 3 | 3:1 GQA ratio for memory efficiency |
| Head Dimension | 64 | Standard for efficient attention |
| FFN Size | 1,440 | 2.5× expansion (SwiGLU-efficient) |
| Weight Tying | ✓ | Embeddings tied with the LM head |

Hybrid Attention Layout

Genesis employs a hybrid attention layout inspired by Qwen3-Next, alternating between linear and full attention:

Layer Distribution (30 layers):
├── 23 layers: GLA (Gated DeltaNet) - O(n) linear attention
└──  7 layers: FoX (Forgetting Attention) - O(n²) softmax with forget gate

Ratio: 75% Linear / 25% Full Attention

Why hybrid? Pure linear attention struggles with precise retrieval tasks (e.g., copying, in-context learning). Interleaving full attention layers restores this capability while maintaining overall efficiency.

📖 Reference: The hybrid approach is validated by Qwen3-Next (2025) and research showing that 3:1 to 6:1 linear-to-full ratios optimize the efficiency-quality tradeoff.
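
As a concrete sketch of how such a layout can be expressed (the every-4th-layer placement below is an assumption for illustration; Genesis's actual interleaving pattern may differ):

```python
# Hypothetical layout helper: placing a FoX layer at every 4th position
# yields exactly the 23 GLA / 7 FoX split described above.
def layer_types(n_layers: int = 30, full_every: int = 4) -> list[str]:
    return ["fox" if i % full_every == full_every - 1 else "gla"
            for i in range(n_layers)]

types = layer_types()
assert types.count("gla") == 23 and types.count("fox") == 7
```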


Gated DeltaNet (GLA)

The primary attention mechanism (75% of layers) is Gated DeltaNet, a state-of-the-art O(n) linear attention mechanism from NVIDIA.

Key Features

| Feature | Description | Paper Reference |
|---|---|---|
| Delta Rule | Online learning rule for recurrent state updates | Schlag et al., 2021 |
| Gated Forget | Mamba-style data-dependent forgetting | Gu & Dao, 2023 |
| Short Convolution | 1D conv on Q, K, V for local context | Gu et al., 2022 |
| L2 QK-Norm | Stabilizes attention scores | Standard practice |

Mathematical Formulation

The delta rule update enables the model to selectively write to and erase from a recurrent state:

S_t = α_t * S_{t-1} + β_t * (v_t ⊗ k_t - S_{t-1} @ k_t ⊗ k_t)
o_t = S_t @ q_t

Where:

  • S_t: Recurrent state matrix
  • α_t: Forget gate (data-dependent)
  • β_t: Learning rate gate (per-token)
  • q_t, k_t, v_t: Query, key, and value vectors at step t
  • o_t: Output at step t
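
In code, one recurrent step is a direct transcription of the update above (a minimal PyTorch sketch, ignoring batching, the short convolution, QK-norm, and the chunked parallel form used in training):

```python
import torch

def gated_delta_step(S, q, k, v, alpha, beta):
    """One gated delta rule step. S: (d_v, d_k) state; q, k: (d_k,);
    v: (d_v,); alpha, beta: per-token scalar gates."""
    # Erase the old association along k, then write the new (k -> v) pair.
    S = alpha * S + beta * (torch.outer(v, k) - S @ torch.outer(k, k))
    return S, S @ q  # updated state S_t and output o_t
```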

📖 Paper: Gated Delta Networks: Improving Mamba2 with Delta Rule (ICLR 2025)

📦 Code: NVlabs/GatedDeltaNet

Configuration in Genesis

gla_expand_k: 0.75      # Key expansion ratio
gla_expand_v: 1.5       # Value expansion ratio (asymmetric)
gla_gate_fn: "swish"    # Gating activation
gla_use_short_conv: True
gla_conv_size: 4
gla_chunk_size: 64      # For chunked parallel training
gla_use_delta_rule: True
gla_qk_norm: "l2"
gla_use_mamba_gate: True

Forgetting Attention (FoX)

The full attention layers (25%) use FoX (Forgetting Transformer), which augments standard softmax attention with a learnable forget gate.

Why FoX over Standard Attention?

| Aspect | Standard Attention | FoX |
|---|---|---|
| Position Encoding | Requires RoPE/ALiBi | NoPE (implicit via forget gate) |
| Long-range Decay | Uniform attention | Data-dependent decay |
| Length Extrapolation | Poor | Better generalization |

Mechanism

FoX modifies attention scores with cumulative forget gates:

attn[i,j] = softmax(q_i @ k_j / √d + Σ_{l=j+1}^{i} log(f_l))

Where f_l = sigmoid(W_f @ x_l) is a learned forget gate that naturally down-weights distant tokens.
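
A single-head causal sketch of this biasing (batching, the output gate, and QK-norm are omitted; W_f follows the formula above):

```python
import torch
import torch.nn.functional as F

def fox_attention(q, k, v, x, W_f):
    """q, k: (T, d); v: (T, d_v); x: (T, d_model); W_f: (d_model,)."""
    T, d = q.shape
    log_f = F.logsigmoid(x @ W_f)              # log f_l per position, (T,)
    c = torch.cumsum(log_f, dim=0)             # c_i = Σ_{l<=i} log f_l
    bias = c.unsqueeze(1) - c.unsqueeze(0)     # bias[i, j] = Σ_{l=j+1}^{i} log f_l
    scores = (q @ k.T) / d ** 0.5 + bias
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```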

📖 Paper: Forgetting Transformer: Softmax Attention with a Forget Gate (ICLR 2025)

📦 Code: zhixuan-lin/forgetting-transformer

FoX "Pro" Design

Genesis uses the enhanced "Pro" block design:

| Component | Purpose |
|---|---|
| Output Gate | Controls information flow (like GLA) |
| QK-Norm | Training stability |
| Short Convolution | Local context on K, V |
| FusedRMSNormSwishGate | Efficient fused operations |

Test-Time Training (TTT)

Genesis includes an experimental TTT metacognition layer that adapts the model during inference.

Concept

Traditional models have fixed weights at inference. TTT layers have a small set of fast weights that update based on the input sequence, allowing the model to "learn" from context.

Standard: y = f(x; θ_fixed)
TTT:      y = f(x; θ_fixed, θ_fast(x))

Implementation Details

| Parameter | Value | Description |
|---|---|---|
| ttt_rank | 4 | Low-rank adaptation dimension |
| ttt_inner_lr | 0.01 | Learning rate for fast weights |
| ttt_mode | "dual" | Parallel dual-form computation |
| ttt_chunk_size | 64 | Chunking for efficiency |

The "dual form" enables fully parallel gradient computation:

# Instead of sequential updates:
# W_1 = W_0 - lr * grad_0
# W_2 = W_1 - lr * grad_1
# ...

# Dual form computes all at once:
# W_t = W_0 - lr * Σ_{i<t} grad_i  (via cumsum)
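
A minimal PyTorch sketch of that cumulative-sum trick (names are hypothetical, not the repository's actual API; the real TTT layer also applies chunking and low-rank structure):

```python
import torch

def dual_form_weights(W0, grads, lr=0.01):
    """W0: (r, r) initial fast weights; grads: (T, r, r) per-token gradients.
    Returns the (T, r, r) weights each token sees under sequential SGD."""
    cum = torch.cumsum(grads, dim=0)
    # Shift by one so token t only sees gradients from tokens i < t.
    shifted = torch.cat([torch.zeros_like(cum[:1]), cum[:-1]], dim=0)
    return W0 - lr * shifted
```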

📖 Paper: Learning to (Learn at Test Time): RNNs with Expressive Hidden States (ICML 2024)

📦 Code: test-time-training/ttt-lm-pytorch

When TTT Activates

TTT is designed for inference-time adaptation and runs only during model.eval(). During training, it's disabled to avoid overhead.


Selective Activation

The FFN layers use SwiGLU with optional top-k sparsity masking.

SwiGLU FFN

FFN(x) = (Swish(W_gate @ x) ⊙ (W_up @ x)) @ W_down

📖 Paper: GLU Variants Improve Transformer (Shazeer, 2020)

Selective Activation (Experimental)

| Parameter | Value |
|---|---|
| selective_k_ratio | 0.85 (keeps top 85% of activations) |
| selective_use_soft_mask | True |

Important: This is a regularization technique, not a speedup mechanism. Real sparse acceleration requires specialized kernels (e.g., Triton sparse GEMM).

📖 Related: ReLU Strikes Back (Apple, ICLR 2024) shows natural activation sparsity can be exploited for inference.
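
One plausible reading of these two settings, as a sketch (the exact masking function is not specified here, so treat the soft mask below as an assumption):

```python
import torch
import torch.nn.functional as F

def swiglu_selective(x, W_gate, W_up, W_down, k_ratio=0.85):
    """SwiGLU FFN with a soft top-k activation mask (illustrative)."""
    h = F.silu(x @ W_gate) * (x @ W_up)                # SwiGLU activations
    k = max(1, int(k_ratio * h.shape[-1]))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]  # k-th largest |activation|
    soft_mask = torch.sigmoid(h.abs() - thresh)        # ≈1 above the cutoff, ≈0 below
    return (h * soft_mask) @ W_down
```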


Additional Components

Grouped Query Attention (GQA)

Genesis uses 3:1 GQA (9 query heads, 3 KV heads) for memory efficiency during inference.

📖 Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Google, 2023)
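
A common way to implement this sharing is to broadcast the KV heads up to the query heads at attention time (illustrative shapes, not the repository's internals):

```python
import torch

q = torch.randn(1, 9, 128, 64)        # (batch, 9 query heads, seq, head_dim)
kv = torch.randn(1, 3, 128, 64)       # (batch, 3 KV heads, seq, head_dim)
k = kv.repeat_interleave(3, dim=1)    # each KV head serves 3 query heads
scores = (q @ k.transpose(-2, -1)) / 64 ** 0.5   # scaled dot-product scores
```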

Rotary Position Embeddings (RoPE)

Partial RoPE (50% rotation) is applied in GLA layers for position awareness.

📖 Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
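
A sketch of 50% partial rotation (the interleaved pairing convention is an assumption; cos and sin are the usual precomputed rotary tables):

```python
import torch

def partial_rope(x, cos, sin, rotate_frac=0.5):
    """x: (seq, head_dim); cos, sin: (seq, head_dim * rotate_frac / 2).
    Rotates the first half of each head dimension, passes the rest through."""
    r = int(x.shape[-1] * rotate_frac)
    x_rot, x_pass = x[..., :r], x[..., r:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat([rotated, x_pass], dim=-1)
```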

µP (Maximal Update Parametrization)

Hyperparameters were tuned using µP for potential scaling transfer.

📖 Paper: Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Yang et al., 2022)

📖 Guide: The Practitioner's Guide to µP (Cerebras)

Zero-Centered RMSNorm

Used throughout for better weight decay compatibility with µP.
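
The idea, as a minimal sketch: store the learnable scale as a zero-initialized offset from 1, so weight decay pulls the effective scale toward identity rather than toward zero.

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose learnable scale is stored as (1 + w), w initialized to 0."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered scale
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.weight)
```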


Comparison with Other Architectures

vs. SmolLM2-135M (HuggingFace)

| Aspect | Genesis-152M | SmolLM2-135M |
|---|---|---|
| Attention | Hybrid GLA + FoX | Standard multi-head |
| Complexity | O(n) for 75% of layers | O(n²) for all layers |
| Position Encoding | RoPE (GLA) / NoPE (FoX) | RoPE |
| TTT | ✓ (experimental) | ✗ |
| Pre-training | 2B tokens | 2T tokens |
| Architecture | 30L × 576 | 30L × 576 |

SmolLM2 uses 1000× more training tokens, making direct benchmark comparison unfair. Genesis demonstrates architectural innovations, not data scaling.

vs. Qwen3-Next

| Aspect | Genesis-152M | Qwen3-Next-80B-A3B |
|---|---|---|
| Scale | 152M | 80B (3B active) |
| Linear Attention | GLA (same) | GLA |
| Full Attention | FoX | Standard |
| Hybrid Ratio | 75/25 | Similar |
| MoE | ✗ | ✓ |

Genesis can be seen as a miniature research version of the hybrid attention approach that Qwen3-Next uses at scale.

vs. Mamba / Mamba-2

| Aspect | Genesis-152M | Mamba-2 |
|---|---|---|
| Architecture | Hybrid (Linear + Softmax) | Pure SSM |
| Retrieval | Strong (FoX layers) | Limited |
| Implementation | PyTorch + optional Triton | Requires CUDA |
| Flexibility | Modular | Monolithic |

Training Details

Pre-training

| Parameter | Value |
|---|---|
| Tokens | 2 billion |
| Dataset Mix | FineWeb-Edu (51%), DCLM (22%), FineMath (12%), Stack-Edu (8%), Cosmopedia (5%), Synth (2%) |
| Context Length | 2,048 |
| Batch Size | 128 |
| Learning Rate | 1e-3 (WSD schedule) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight Decay | 0.1 |
| Warmup | 5% of steps |
| Hardware | Single A100 80GB |

Learning Rate Schedule

WSD (Warmup-Stable-Decay), sketched in code after this list:

  • Warmup: 5% of training (linear ramp)
  • Stable: 85% of training (constant LR)
  • Decay: 10% of training (cosine to min_lr)
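
A minimal sketch of this schedule (min_lr here is an assumed value, not taken from the training config):

```python
import math

def wsd_lr(step, total_steps, peak=1e-3, min_lr=1e-4,
           warmup_frac=0.05, decay_frac=0.10):
    warmup = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup:                        # linear ramp
        return peak * step / max(1, warmup)
    if step < decay_start:                   # constant plateau
        return peak
    # cosine decay from peak to min_lr over the final 10%
    t = (step - decay_start) / max(1, total_steps - decay_start)
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * t))
```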

Supervised Fine-Tuning (SFT)

| Parameter | Value |
|---|---|
| Dataset | smol-smoltalk |
| Samples | ~485K conversations |
| Epochs | 1 |
| Learning Rate | 1e-3 |
| Batch Size | 32 (effective 128 with gradient accumulation) |

smol-smoltalk Composition

The SFT dataset is the same one used to train SmolLM2-135M-Instruct:

| Subset | Purpose |
|---|---|
| smol-magpie-ultra-short | Instruction following |
| everyday-conversations | Multi-turn dialogue |
| smol-rewrite | Text editing |
| smol-summarize | Summarization |
| openhermes-100k | Knowledge & reasoning |
| systemchats-30k | System prompt following |

This dataset was specifically curated for small models (<1B params) and avoids issues like `<think>` tags from reasoning models.


Usage

Installation

pip install genesis-llm

Download Weights

pip install "huggingface-hub>=0.20"
huggingface-cli download guiferrarib/genesis-152m-instruct genesis_152m_instruct.safetensors --local-dir .

Interactive Chat

genesis --model ./genesis_152m_instruct.safetensors

Python API

import json
import torch
from safetensors import safe_open
from safetensors.torch import load_file
from genesis import Genesis, GenesisConfig, get_tokenizer

# 1. Load config from checkpoint metadata
model_path = "./genesis_152m_instruct.safetensors"
with safe_open(model_path, framework="pt", device="cpu") as f:
    metadata = f.metadata() or {}
    config_dict = json.loads(metadata.get("genesis_config_json", "{}"))
    config = GenesisConfig(**config_dict) if config_dict else GenesisConfig.genesis_147m()

# 2. Load model weights
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
state_dict = load_file(model_path, device=device)
model = Genesis(config).to(device)
model.load_state_dict(state_dict, strict=False)
model.eval()

# 3. Setup tokenizer (GPT-NeoX + ChatML tokens)
tokenizer = get_tokenizer("neox")
tokenizer.add_chat_tokens()

# 4. Build ChatML prompt
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain what linear attention is in simple terms.
<|im_end|>
<|im_start|>assistant
"""

# 5. Generate
input_ids = torch.tensor([tokenizer.encode(prompt)], device=device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:].tolist())
print(response)

Prompt Format

Genesis uses ChatML format:

<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
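
A small helper (illustrative, not part of the genesis package) that renders a message list in this layout, ending with an open assistant turn for generation:

```python
def to_chatml(messages):
    """messages: list of {'role': ..., 'content': ...} dicts."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n"
             for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what linear attention is in simple terms."},
])
```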

Benchmarks

Evaluated using LightEval on MPS (Apple Silicon).

Results

| Task | Metric | Score | Stderr |
|---|---|---|---|
| ARC-Easy (25-shot) | acc_norm | 44.02% | ±1.02 |
| ARC-Challenge (25-shot) | acc_norm | 24.66% | ±1.26 |
| BoolQ (0-shot) | acc_norm | 56.30% | ±0.87 |
| HellaSwag (10-shot) | acc_norm | 30.19% | ±0.46 |
| Winogrande (5-shot) | acc | 49.09% | ±1.41 |
| CommonsenseQA (0-shot) | acc_norm | 29.16% | ±1.30 |
| OpenBookQA (0-shot) | acc_norm | 28.60% | ±2.02 |
| SciQ (0-shot) | acc_norm | 46.80% | ±1.58 |

Interpretation

| Task | Random Baseline | Genesis | Signal |
|---|---|---|---|
| ARC-Easy | 25% | 44% | ✅ Strong |
| BoolQ | 50% | 56% | ✅ Learning |
| HellaSwag | ~25% | 30% | ✅ Learning |
| Winogrande | 50% | 49% | ⚠️ At baseline |
| ARC-Challenge | ~25% | 25% | ⚠️ Too hard for this size |

Note: With only 2B pre-training tokens (vs. 2T for SmolLM2), benchmarks primarily reflect architectural capacity rather than world knowledge.


Limitations

Known Issues

  1. Hallucinations: Frequent factual errors due to limited pre-training data
  2. Math: Unreliable arithmetic and multi-step reasoning
  3. Instruction Following: Can be brittle with strict constraints
  4. TTT Overhead: Metacognition layer adds latency (can be disabled)

Not Suitable For

  • Production deployments requiring reliability
  • Tasks requiring factual accuracy
  • Complex multi-step reasoning
  • Safety-critical applications

Best Use Cases

  • Architecture research and ablation studies
  • Efficient attention mechanism exploration
  • Small model behavior analysis
  • Educational purposes

Citation

If you use Genesis in your research, please cite:

@misc{genesis2025,
  title={Genesis: A Hybrid Linear Attention Architecture for Small Language Models},
  author={Ferrari Brescia, Guilherme},
  year={2025},
  url={https://huggingface.co/guiferrarib/genesis-152m-instruct}
}

Related Papers

@inproceedings{yang2024gated,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Keutzer, Kurt},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{lin2025forgetting,
  title={Forgetting Transformer: Softmax Attention with a Forget Gate},
  author={Lin, Zhixuan and others},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{sun2024learning,
  title={Learning to (Learn at Test Time): RNNs with Expressive Hidden States},
  author={Sun, Yu and others},
  booktitle={ICML},
  year={2024}
}

@article{allal2025smollm2,
  title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
  author={Allal, Loubna Ben and others},
  journal={arXiv preprint arXiv:2502.02737},
  year={2025}
}

License

| Component | License |
|---|---|
| Model Weights | Apache 2.0 |
| Code | Apache 2.0 |
| Training Data | Various (see dataset cards) |

Built with 🧬 by the Orch-Mind team
