🧬 Genesis-152M-Instruct

A Research-Oriented Small Language Model with Hybrid Linear Attention

Overview

Genesis-152M-Instruct is an experimental small language model that combines recent advances in efficient attention mechanisms into a single architecture. It serves as a research platform for exploring:

  • Hybrid attention: Mixing O(n) linear attention with O(n²) softmax attention
  • Efficient inference: Sub-quadratic complexity for most layers
  • Adaptive computation: Test-time training for dynamic model adaptation

⚠️ Experimental Model: This is a research artifact, not a production-ready model. It demonstrates architectural innovations but has limitations typical of small models.


Model Summary

| Property | Value |
|---|---|
| Parameters | 151.8M total (~122.8M non-embedding) |
| Architecture | Hybrid GLA + FoX Attention |
| Context Length | 2,048 tokens |
| Vocab Size | 50,279 (GPT-NeoX + ChatML tokens) |
| Pre-training Data | 2B tokens |
| SFT Dataset | smol-smoltalk |
| License | Apache 2.0 |

Files in this Repository

├── genesis_152m_instruct.safetensors  # Model weights
├── README.md                          # This model card
└── LICENSE                            # Apache 2.0

Architecture Deep Dive

Genesis follows a "deep-and-thin" design philosophy inspired by SmolLM2 and MobileLLM, which has proven effective for small language models.

Core Configuration

| Component | Value | Rationale |
|---|---|---|
| Layers | 30 | Deep architecture for better representation |
| Hidden Size | 576 | Optimal width for the 150M scale |
| Attention Heads | 9 | Query heads |
| KV Heads | 3 | 3:1 GQA ratio for memory efficiency |
| Head Dimension | 64 | Standard for efficient attention |
| FFN Size | 1,440 | 2.5× expansion (SwiGLU-efficient) |
| Weight Tying | ✓ | Embeddings tied with the LM head |

Hybrid Attention Layout

Genesis employs a hybrid attention layout inspired by Qwen3-Next, alternating between linear and full attention:

Layer Distribution (30 layers):
├── 23 layers: GLA (Gated DeltaNet) - O(n) linear attention
└──  7 layers: FoX (Forgetting Attention) - O(n²) softmax with forget gate

Ratio: 75% Linear / 25% Full Attention

Why hybrid? Pure linear attention struggles with precise retrieval tasks (e.g., copying, in-context learning). Interleaving full attention layers restores this capability while maintaining overall efficiency.

📖 Reference: The hybrid approach is validated by Qwen3-Next (2025) and research showing that 3:1 to 6:1 linear-to-full ratios optimize the efficiency-quality tradeoff.
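
As a concrete sketch of how such a layout can be expressed (the every-4th-layer placement below is an assumption for illustration; Genesis's actual interleaving pattern may differ):

```python
# Hypothetical layout helper: placing a FoX layer at every 4th position
# yields exactly the 23 GLA / 7 FoX split described above.
def layer_types(n_layers: int = 30, full_every: int = 4) -> list[str]:
    return ["fox" if i % full_every == full_every - 1 else "gla"
            for i in range(n_layers)]

types = layer_types()
assert types.count("gla") == 23 and types.count("fox") == 7
```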


Gated DeltaNet (GLA)

The primary attention mechanism (75% of layers) is Gated DeltaNet, a state-of-the-art O(n) linear attention mechanism from NVIDIA.

Key Features

| Feature | Description | Paper Reference |
|---|---|---|
| Delta Rule | Online learning rule for recurrent state updates | Schlag et al., 2021 |
| Gated Forget | Mamba-style data-dependent forgetting | Gu & Dao, 2023 |
| Short Convolution | 1D conv on Q, K, V for local context | Gu et al., 2022 |
| L2 QK-Norm | Stabilizes attention scores | Standard practice |

Mathematical Formulation

The delta rule update enables the model to selectively write to and erase from a recurrent state:

S_t = α_t * S_{t-1} + β_t * (v_t ⊗ k_t - S_{t-1} @ k_t ⊗ k_t)
o_t = S_t @ q_t

Where:

  • S_t: Recurrent state matrix
  • α_t: Forget gate (data-dependent)
  • β_t: Learning rate gate (per-token)
  • q_t, k_t, v_t: Query, key, and value vectors at step t
  • o_t: Output at step t
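
In code, one recurrent step is a direct transcription of the update above (a minimal PyTorch sketch, ignoring batching, the short convolution, QK-norm, and the chunked parallel form used in training):

```python
import torch

def gated_delta_step(S, q, k, v, alpha, beta):
    """One gated delta rule step. S: (d_v, d_k) state; q, k: (d_k,);
    v: (d_v,); alpha, beta: per-token scalar gates."""
    # Erase the old association along k, then write the new (k -> v) pair.
    S = alpha * S + beta * (torch.outer(v, k) - S @ torch.outer(k, k))
    return S, S @ q  # updated state S_t and output o_t
```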

📖 Paper: Gated Delta Networks: Improving Mamba2 with Delta Rule (ICLR 2025)

📦 Code: NVlabs/GatedDeltaNet

Configuration in Genesis

gla_expand_k: 0.75      # Key expansion ratio
gla_expand_v: 1.5       # Value expansion ratio (asymmetric)
gla_gate_fn: "swish"    # Gating activation
gla_use_short_conv: True
gla_conv_size: 4
gla_chunk_size: 64      # For chunked parallel training
gla_use_delta_rule: True
gla_qk_norm: "l2"
gla_use_mamba_gate: True

Forgetting Attention (FoX)

The full attention layers (25%) use FoX (Forgetting Transformer), which augments standard softmax attention with a learnable forget gate.

Why FoX over Standard Attention?

| Aspect | Standard Attention | FoX |
|---|---|---|
| Position Encoding | Requires RoPE/ALiBi | NoPE (implicit via forget gate) |
| Long-range Decay | Uniform attention | Data-dependent decay |
| Length Extrapolation | Poor | Better generalization |

Mechanism

FoX modifies attention scores with cumulative forget gates:

attn[i,j] = softmax(q_i @ k_j / √d + Σ_{l=j+1}^{i} log(f_l))

Where f_l = sigmoid(W_f @ x_l) is a learned forget gate that naturally down-weights distant tokens.
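
A single-head causal sketch of this biasing (batching, the output gate, and QK-norm are omitted; W_f follows the formula above):

```python
import torch
import torch.nn.functional as F

def fox_attention(q, k, v, x, W_f):
    """q, k: (T, d); v: (T, d_v); x: (T, d_model); W_f: (d_model,)."""
    T, d = q.shape
    log_f = F.logsigmoid(x @ W_f)              # log f_l per position, (T,)
    c = torch.cumsum(log_f, dim=0)             # c_i = Σ_{l<=i} log f_l
    bias = c.unsqueeze(1) - c.unsqueeze(0)     # bias[i, j] = Σ_{l=j+1}^{i} log f_l
    scores = (q @ k.T) / d ** 0.5 + bias
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```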

📖 Paper: Forgetting Transformer: Softmax Attention with a Forget Gate (ICLR 2025)

📦 Code: zhixuan-lin/forgetting-transformer

FoX "Pro" Design

Genesis uses the enhanced "Pro" block design:

| Component | Purpose |
|---|---|
| Output Gate | Controls information flow (like GLA) |
| QK-Norm | Training stability |
| Short Convolution | Local context on K, V |
| FusedRMSNormSwishGate | Efficient fused operations |

Test-Time Training (TTT)

Genesis includes an experimental TTT metacognition layer that adapts the model during inference.

Concept

Traditional models have fixed weights at inference. TTT layers have a small set of fast weights that update based on the input sequence, allowing the model to "learn" from context.

Standard: y = f(x; θ_fixed)
TTT:      y = f(x; θ_fixed, θ_fast(x))

Implementation Details

| Parameter | Value | Description |
|---|---|---|
| ttt_rank | 4 | Low-rank adaptation dimension |
| ttt_inner_lr | 0.01 | Learning rate for fast weights |
| ttt_mode | "dual" | Parallel dual-form computation |
| ttt_chunk_size | 64 | Chunking for efficiency |

The "dual form" enables fully parallel gradient computation:

# Instead of sequential updates:
# W_1 = W_0 - lr * grad_0
# W_2 = W_1 - lr * grad_1
# ...

# Dual form computes all at once:
# W_t = W_0 - lr * Σ_{i<t} grad_i  (via cumsum)
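
A minimal PyTorch sketch of that cumulative-sum trick (names are hypothetical, not the repository's actual API; the real TTT layer also applies chunking and low-rank structure):

```python
import torch

def dual_form_weights(W0, grads, lr=0.01):
    """W0: (r, r) initial fast weights; grads: (T, r, r) per-token gradients.
    Returns the (T, r, r) weights each token sees under sequential SGD."""
    cum = torch.cumsum(grads, dim=0)
    # Shift by one so token t only sees gradients from tokens i < t.
    shifted = torch.cat([torch.zeros_like(cum[:1]), cum[:-1]], dim=0)
    return W0 - lr * shifted
```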

📖 Paper: Learning to (Learn at Test Time): RNNs with Expressive Hidden States (ICML 2024)

📦 Code: test-time-training/ttt-lm-pytorch

When TTT Activates

TTT is designed for inference-time adaptation and runs only during model.eval(). During training, it's disabled to avoid overhead.


Selective Activation

The FFN layers use SwiGLU with optional top-k sparsity masking.

SwiGLU FFN

FFN(x) = (Swish(W_gate @ x) ⊙ (W_up @ x)) @ W_down

📖 Paper: GLU Variants Improve Transformer (Shazeer, 2020)

Selective Activation (Experimental)

| Parameter | Value |
|---|---|
| selective_k_ratio | 0.85 (keeps top 85% of activations) |
| selective_use_soft_mask | True |

Important: This is a regularization technique, not a speedup mechanism. Real sparse acceleration requires specialized kernels (e.g., Triton sparse GEMM).

📖 Related: ReLU Strikes Back (Apple, ICLR 2024) shows natural activation sparsity can be exploited for inference.
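
One plausible reading of these two settings, as a sketch (the exact masking function is not specified here, so treat the soft mask below as an assumption):

```python
import torch
import torch.nn.functional as F

def swiglu_selective(x, W_gate, W_up, W_down, k_ratio=0.85):
    """SwiGLU FFN with a soft top-k activation mask (illustrative)."""
    h = F.silu(x @ W_gate) * (x @ W_up)                # SwiGLU activations
    k = max(1, int(k_ratio * h.shape[-1]))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]  # k-th largest |activation|
    soft_mask = torch.sigmoid(h.abs() - thresh)        # ≈1 above the cutoff, ≈0 below
    return (h * soft_mask) @ W_down
```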


Additional Components

Grouped Query Attention (GQA)

Genesis uses 3:1 GQA (9 query heads, 3 KV heads) for memory efficiency during inference.

📖 Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Google, 2023)
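
A common way to implement this sharing is to broadcast the KV heads up to the query heads at attention time (illustrative shapes, not the repository's internals):

```python
import torch

q = torch.randn(1, 9, 128, 64)        # (batch, 9 query heads, seq, head_dim)
kv = torch.randn(1, 3, 128, 64)       # (batch, 3 KV heads, seq, head_dim)
k = kv.repeat_interleave(3, dim=1)    # each KV head serves 3 query heads
scores = (q @ k.transpose(-2, -1)) / 64 ** 0.5   # scaled dot-product scores
```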

Rotary Position Embeddings (RoPE)

Partial RoPE (50% rotation) is applied in GLA layers for position awareness.

📖 Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
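
A sketch of 50% partial rotation (the interleaved pairing convention is an assumption; cos and sin are the usual precomputed rotary tables):

```python
import torch

def partial_rope(x, cos, sin, rotate_frac=0.5):
    """x: (seq, head_dim); cos, sin: (seq, head_dim * rotate_frac / 2).
    Rotates the first half of each head dimension, passes the rest through."""
    r = int(x.shape[-1] * rotate_frac)
    x_rot, x_pass = x[..., :r], x[..., r:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat([rotated, x_pass], dim=-1)
```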

µP (Maximal Update Parametrization)

Hyperparameters were tuned using µP for potential scaling transfer.

📖 Paper: Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Yang et al., 2022)

📖 Guide: The Practitioner's Guide to µP (Cerebras)

Zero-Centered RMSNorm

Used throughout for better weight decay compatibility with µP.
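
The idea, as a minimal sketch: store the learnable scale as a zero-initialized offset from 1, so weight decay pulls the effective scale toward identity rather than toward zero.

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose learnable scale is stored as (1 + w), w initialized to 0."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered scale
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.weight)
```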


Comparison with Other Architectures

vs. SmolLM2-135M (HuggingFace)

| Aspect | Genesis-152M | SmolLM2-135M |
|---|---|---|
| Attention | Hybrid GLA + FoX | Standard multi-head |
| Complexity | O(n) for 75% of layers | O(n²) for all layers |
| Position Encoding | RoPE (GLA) / NoPE (FoX) | RoPE |
| TTT | ✓ (experimental) | ✗ |
| Pre-training | 2B tokens | 2T tokens |
| Architecture | 30L × 576 | 30L × 576 |

SmolLM2 uses 1000× more training tokens, making direct benchmark comparison unfair. Genesis demonstrates architectural innovations, not data scaling.

vs. Qwen3-Next

| Aspect | Genesis-152M | Qwen3-Next-80B-A3B |
|---|---|---|
| Scale | 152M | 80B (3B active) |
| Linear Attention | GLA (same) | GLA |
| Full Attention | FoX | Standard |
| Hybrid Ratio | 75/25 | Similar |
| MoE | ✗ | ✓ |

Genesis can be seen as a miniature research version of the hybrid attention approach that Qwen3-Next uses at scale.

vs. Mamba / Mamba-2

| Aspect | Genesis-152M | Mamba-2 |
|---|---|---|
| Architecture | Hybrid (Linear + Softmax) | Pure SSM |
| Retrieval | Strong (FoX layers) | Limited |
| Implementation | PyTorch + optional Triton | Requires CUDA |
| Flexibility | Modular | Monolithic |

Training Details

Pre-training

| Parameter | Value |
|---|---|
| Tokens | 2 billion |
| Dataset Mix | FineWeb-Edu (51%), DCLM (22%), FineMath (12%), Stack-Edu (8%), Cosmopedia (5%), Synth (2%) |
| Context Length | 2,048 |
| Batch Size | 128 |
| Learning Rate | 1e-3 (WSD schedule) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight Decay | 0.1 |
| Warmup | 5% of steps |
| Hardware | Single A100 80GB |

Learning Rate Schedule

WSD (Warmup-Stable-Decay), sketched in code after this list:

  • Warmup: 5% of training (linear ramp)
  • Stable: 85% of training (constant LR)
  • Decay: 10% of training (cosine to min_lr)
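
A minimal sketch of this schedule (min_lr here is an assumed value, not taken from the training config):

```python
import math

def wsd_lr(step, total_steps, peak=1e-3, min_lr=1e-4,
           warmup_frac=0.05, decay_frac=0.10):
    warmup = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup:                        # linear ramp
        return peak * step / max(1, warmup)
    if step < decay_start:                   # constant plateau
        return peak
    # cosine decay from peak to min_lr over the final 10%
    t = (step - decay_start) / max(1, total_steps - decay_start)
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * t))
```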

Supervised Fine-Tuning (SFT)

| Parameter | Value |
|---|---|
| Dataset | smol-smoltalk |
| Samples | ~485K conversations |
| Epochs | 1 |
| Learning Rate | 1e-3 |
| Batch Size | 32 (effective 128 with gradient accumulation) |

smol-smoltalk Composition

The SFT dataset is the same one used to train SmolLM2-135M-Instruct:

| Subset | Purpose |
|---|---|
| smol-magpie-ultra-short | Instruction following |
| everyday-conversations | Multi-turn dialogue |
| smol-rewrite | Text editing |
| smol-summarize | Summarization |
| openhermes-100k | Knowledge & reasoning |
| systemchats-30k | System prompt following |

This dataset was specifically curated for small models (<1B params) and avoids issues like `<think>` tags from reasoning models.


Usage

Installation

pip install genesis-llm

Download Weights

pip install "huggingface-hub>=0.20"
huggingface-cli download guiferrarib/genesis-152m-instruct genesis_152m_instruct.safetensors --local-dir .

Interactive Chat

genesis --model ./genesis_152m_instruct.safetensors

Python API

import json
import torch
from safetensors import safe_open
from safetensors.torch import load_file
from genesis import Genesis, GenesisConfig, get_tokenizer

# 1. Load config from checkpoint metadata
model_path = "./genesis_152m_instruct.safetensors"
with safe_open(model_path, framework="pt", device="cpu") as f:
    metadata = f.metadata() or {}
    config_dict = json.loads(metadata.get("genesis_config_json", "{}"))
    config = GenesisConfig(**config_dict) if config_dict else GenesisConfig.genesis_147m()

# 2. Load model weights
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
state_dict = load_file(model_path, device=device)
model = Genesis(config).to(device)
model.load_state_dict(state_dict, strict=False)
model.eval()

# 3. Setup tokenizer (GPT-NeoX + ChatML tokens)
tokenizer = get_tokenizer("neox")
tokenizer.add_chat_tokens()

# 4. Build ChatML prompt
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain what linear attention is in simple terms.
<|im_end|>
<|im_start|>assistant
"""

# 5. Generate
input_ids = torch.tensor([tokenizer.encode(prompt)], device=device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:].tolist())
print(response)

Prompt Format

Genesis uses ChatML format:

<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
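
A small helper (illustrative, not part of the genesis package) that renders a message list in this layout, ending with an open assistant turn for generation:

```python
def to_chatml(messages):
    """messages: list of {'role': ..., 'content': ...} dicts."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n"
             for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what linear attention is in simple terms."},
])
```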

Benchmarks

Evaluated using LightEval on MPS (Apple Silicon).

Results

| Task | Metric | Score | Stderr |
|---|---|---|---|
| ARC-Easy (25-shot) | acc_norm | 44.02% | ±1.02 |
| ARC-Challenge (25-shot) | acc_norm | 24.66% | ±1.26 |
| BoolQ (0-shot) | acc_norm | 56.30% | ±0.87 |
| HellaSwag (10-shot) | acc_norm | 30.19% | ±0.46 |
| Winogrande (5-shot) | acc | 49.09% | ±1.41 |
| CommonsenseQA (0-shot) | acc_norm | 29.16% | ±1.30 |
| OpenBookQA (0-shot) | acc_norm | 28.60% | ±2.02 |
| SciQ (0-shot) | acc_norm | 46.80% | ±1.58 |

Interpretation

| Task | Random Baseline | Genesis | Signal |
|---|---|---|---|
| ARC-Easy | 25% | 44% | ✅ Strong |
| BoolQ | 50% | 56% | ✅ Learning |
| HellaSwag | ~25% | 30% | ✅ Learning |
| Winogrande | 50% | 49% | ⚠️ At baseline |
| ARC-Challenge | ~25% | 25% | ⚠️ Too hard for this size |

Note: With only 2B pre-training tokens (vs. 2T for SmolLM2), benchmarks primarily reflect architectural capacity rather than world knowledge.


Limitations

Known Issues

  1. Hallucinations: Frequent factual errors due to limited pre-training data
  2. Math: Unreliable arithmetic and multi-step reasoning
  3. Instruction Following: Can be brittle with strict constraints
  4. TTT Overhead: Metacognition layer adds latency (can be disabled)

Not Suitable For

  • Production deployments requiring reliability
  • Tasks requiring factual accuracy
  • Complex multi-step reasoning
  • Safety-critical applications

Best Use Cases

  • Architecture research and ablation studies
  • Efficient attention mechanism exploration
  • Small model behavior analysis
  • Educational purposes

Citation

If you use Genesis in your research, please cite:

@misc{genesis2025,
  title={Genesis: A Hybrid Linear Attention Architecture for Small Language Models},
  author={Ferrari Brescia, Guilherme},
  year={2025},
  url={https://huggingface.co/guiferrarib/genesis-152m-instruct}
}

Related Papers

@inproceedings{yang2024gated,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Keutzer, Kurt},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{lin2025forgetting,
  title={Forgetting Transformer: Softmax Attention with a Forget Gate},
  author={Lin, Zhixuan and others},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{sun2024learning,
  title={Learning to (Learn at Test Time): RNNs with Expressive Hidden States},
  author={Sun, Yu and others},
  booktitle={ICML},
  year={2024}
}

@article{allal2025smollm2,
  title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
  author={Allal, Loubna Ben and others},
  journal={arXiv preprint arXiv:2502.02737},
  year={2025}
}

License

| Component | License |
|---|---|
| Model Weights | Apache 2.0 |
| Code | Apache 2.0 |
| Training Data | Various (see dataset cards) |

Built with 🧬 by the Orch-Mind team
