---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- safetensors
- text-generation
- small-llm
- custom-architecture
- linear-attention
- gated-deltanet
- test-time-training
- hybrid-attention
- research
library_name: genesis-llm
datasets:
- HuggingFaceTB/smol-smoltalk
base_model: []
---

<div align="center">
<h1>🧬 Genesis-152M-Instruct</h1>
<p><em>A Research-Oriented Small Language Model with Hybrid Linear Attention</em></p>

<p>
<a href="#architecture"><img alt="Architecture" src="https://img.shields.io/badge/Architecture-Hybrid_GLA%2BFoX-blue"></a>
<a href="#training"><img alt="Training" src="https://img.shields.io/badge/Pre--training-2B_tokens-green"></a>
<a href="#license"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-orange"></a>
</p>
</div>

---

## Table of Contents

- [Overview](#overview)
- [Model Summary](#model-summary)
- [Architecture Deep Dive](#architecture-deep-dive)
  - [Hybrid Attention Layout](#hybrid-attention-layout)
  - [Gated DeltaNet (GLA)](#gated-deltanet-gla)
  - [Forgetting Attention (FoX)](#forgetting-attention-fox)
  - [Test-Time Training (TTT)](#test-time-training-ttt)
  - [Selective Activation](#selective-activation)
  - [Additional Components](#additional-components)
- [Comparison with Other Architectures](#comparison-with-other-architectures)
- [Training Details](#training-details)
  - [Pre-training](#pre-training)
  - [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
- [Limitations](#limitations)
- [Citation](#citation)
- [License](#license)

---

## Overview

**Genesis-152M-Instruct** is an experimental small language model that combines recent advances in efficient attention mechanisms into a single architecture. It serves as a research platform for exploring:

- **Hybrid attention**: Mixing O(n) linear attention with O(n²) softmax attention
- **Efficient inference**: Sub-quadratic complexity for most layers
- **Adaptive computation**: Test-time training for dynamic model adaptation

> ⚠️ **Experimental Model**: This is a research artifact, not a production-ready model. It demonstrates architectural innovations but has limitations typical of small models.

---

## Model Summary

| Property | Value |
|----------|-------|
| **Parameters** | 151.8M total (~122.8M non-embedding) |
| **Architecture** | Hybrid GLA + FoX Attention |
| **Context Length** | 2,048 tokens |
| **Vocab Size** | 50,279 (GPT-NeoX + ChatML tokens) |
| **Pre-training Data** | 2B tokens |
| **SFT Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **License** | Apache 2.0 |

### Files in this Repository

```
├── genesis_152m_instruct.safetensors   # Model weights
├── README.md                           # This model card
└── LICENSE                             # Apache 2.0
```

---

## Architecture Deep Dive

Genesis follows a **"deep-and-thin"** design philosophy inspired by [SmolLM2](https://arxiv.org/abs/2502.02737) and [MobileLLM](https://arxiv.org/abs/2402.14905), which has proven effective for small language models.

### Core Configuration

| Component | Value | Rationale |
|-----------|-------|-----------|
| Layers | 30 | Deep architecture for better representation |
| Hidden Size | 576 | Narrow width suited to the ~150M scale |
| Attention Heads | 9 | Query heads |
| KV Heads | 3 | 3:1 GQA ratio for memory efficiency |
| Head Dimension | 64 | 2.5× expansion would go here; standard for efficient attention |
| FFN Size | 1,440 | 2.5× expansion (offsets SwiGLU's third weight matrix) |
| Weight Tying | ✓ | Embeddings tied with LM head |

---

### Hybrid Attention Layout

Genesis employs a **hybrid attention layout** inspired by [Qwen3-Next](https://huggingface.co/docs/transformers/main/en/model_doc/qwen3_next), alternating between linear and full attention:

```
Layer Distribution (30 layers):
├── 23 layers: GLA (Gated DeltaNet)       - O(n) linear attention
└──  7 layers: FoX (Forgetting Attention) - O(n²) softmax with forget gate

Ratio: 23:7, i.e. ~77% linear / ~23% full attention
```

**Why hybrid?** Pure linear attention struggles with precise retrieval tasks (e.g., copying, in-context learning). Interleaving full attention layers restores this capability while maintaining overall efficiency.

> 📖 **Reference**: The hybrid approach is validated by Qwen3-Next (2025) and research showing that [3:1 to 6:1 linear-to-full ratios](https://arxiv.org/abs/2507.06457) optimize the efficiency-quality tradeoff.
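
As an illustration, such an interleaving can be expressed as a simple layer-type schedule. The every-4th-layer placement below is an assumption for the sketch; the exact positions of the FoX layers in Genesis are not documented here:

```python
def layer_types(n_layers: int = 30, full_every: int = 4) -> list[str]:
    """Hypothetical layout: every 4th block is full (FoX) attention,
    the rest are linear (GLA) -> 7 FoX + 23 GLA for 30 layers."""
    return ["fox" if (i + 1) % full_every == 0 else "gla" for i in range(n_layers)]

assert layer_types().count("fox") == 7  # matches the 23:7 split above
```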

---

### Gated DeltaNet (GLA)

The primary attention mechanism (23 of the 30 layers) is **Gated DeltaNet**, a state-of-the-art O(n) linear attention mechanism from NVIDIA.

#### Key Features

| Feature | Description | Paper Reference |
|---------|-------------|-----------------|
| **Delta Rule** | Online learning rule for recurrent state updates | [Schlag et al., 2021](https://arxiv.org/abs/2102.11174) |
| **Gated Forget** | Mamba-style data-dependent forgetting | [Gu & Dao, 2023](https://arxiv.org/abs/2312.00752) |
| **Short Convolution** | 1D conv on Q, K, V for local context | [Fu et al., 2022](https://arxiv.org/abs/2212.14052) |
| **L2 QK-Norm** | Stabilizes attention scores | Standard practice |

#### Mathematical Formulation

The delta rule update enables the model to selectively write to and erase from a recurrent state:

```
S_t = α_t * S_{t-1} + β_t * (v_t ⊗ k_t - S_{t-1} @ k_t ⊗ k_t)
o_t = S_t @ q_t
```

Where:
- `S_t`: Recurrent state matrix
- `α_t`: Forget gate (data-dependent)
- `β_t`: Learning rate gate (per-token)
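
A minimal sequential sketch of this recurrence (for clarity only; training uses a chunked parallel form, and the L2 QK-normalization is omitted here):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in (0, 1)."""
    S = torch.zeros(v.shape[1], k.shape[1])  # recurrent state S_t
    outputs = []
    for t in range(q.shape[0]):
        # forget with alpha_t, then write the prediction error (delta rule)
        S = alpha[t] * S + beta[t] * torch.outer(v[t] - S @ k[t], k[t])
        outputs.append(S @ q[t])             # o_t = S_t @ q_t
    return torch.stack(outputs)              # (T, d_v)
```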

> 📖 **Paper**: [Gated Delta Networks: Improving Mamba2 with Delta Rule](https://arxiv.org/abs/2412.06464) (ICLR 2025)
>
> 📦 **Code**: [NVlabs/GatedDeltaNet](https://github.com/NVlabs/GatedDeltaNet)

#### Configuration in Genesis

```yaml
gla_expand_k: 0.75      # Key expansion ratio
gla_expand_v: 1.5       # Value expansion ratio (asymmetric)
gla_gate_fn: "swish"    # Gating activation
gla_use_short_conv: true
gla_conv_size: 4
gla_chunk_size: 64      # For chunked parallel training
gla_use_delta_rule: true
gla_qk_norm: "l2"
gla_use_mamba_gate: true
```

---

### Forgetting Attention (FoX)

The full attention layers (the remaining 7 of 30) use **FoX (Forgetting Transformer)**, which augments standard softmax attention with a learnable forget gate.

#### Why FoX over Standard Attention?

| Aspect | Standard Attention | FoX |
|--------|-------------------|-----|
| Position Encoding | Requires RoPE/ALiBi | **NoPE** (implicit via forget gate) |
| Long-range Decay | Uniform attention | Data-dependent decay |
| Length Extrapolation | Poor | Better generalization |

#### Mechanism

FoX modifies attention scores with cumulative forget gates:

```
attn[i, j] = softmax(q_i @ k_j / √d + Σ_{l=j+1}^{i} log(f_l))
```

Where `f_l = sigmoid(W_f @ x_l)` is a learned forget gate that naturally down-weights distant tokens.
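
A naive dense reference of this scoring rule (a sketch for clarity; the official implementation fuses the gate bias into a FlashAttention-style kernel):

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, f):
    """q, k: (T, d); v: (T, d_v); f: (T,) forget gates in (0, 1)."""
    T, d = q.shape
    c = torch.cumsum(torch.log(f), dim=0)   # c[i] = Σ_{l<=i} log f_l
    bias = c[:, None] - c[None, :]          # bias[i, j] = Σ_{l=j+1..i} log f_l
    scores = q @ k.T / d**0.5 + bias
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return F.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1) @ v
```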

> 📖 **Paper**: [Forgetting Transformer: Softmax Attention with a Forget Gate](https://arxiv.org/abs/2503.02130) (ICLR 2025)
>
> 📦 **Code**: [zhixuan-lin/forgetting-transformer](https://github.com/zhixuan-lin/forgetting-transformer)

#### FoX "Pro" Design

Genesis uses the enhanced "Pro" block design:

| Component | Purpose |
|-----------|---------|
| Output Gate | Controls information flow (like GLA) |
| QK-Norm | Training stability |
| Short Convolution | Local context on K, V |
| FusedRMSNormSwishGate | Efficient fused operations |

---

### Test-Time Training (TTT)

Genesis includes an experimental **TTT metacognition layer** that adapts the model during inference.

#### Concept

Traditional models have **fixed weights** at inference. TTT layers have a small set of **fast weights** that update based on the input sequence, allowing the model to "learn" from context.

```
Standard: y = f(x; θ_fixed)
TTT:      y = f(x; θ_fixed, θ_fast(x))
```

#### Implementation Details

| Parameter | Value | Description |
|-----------|-------|-------------|
| `ttt_rank` | 4 | Low-rank adaptation dimension |
| `ttt_inner_lr` | 0.01 | Learning rate for fast weights |
| `ttt_mode` | "dual" | Parallel dual-form computation |
| `ttt_chunk_size` | 64 | Chunking for efficiency |

The "dual form" enables fully parallel gradient computation:

```python
# Instead of sequential updates:
#   W_1 = W_0 - lr * grad_0
#   W_2 = W_1 - lr * grad_1
#   ...

# The dual form computes all steps at once:
#   W_t = W_0 - lr * Σ_{i<t} grad_i   (via cumsum)
```
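
As a toy illustration of that idea for a linear fast-weight layer with a self-reconstruction inner loss (the batch-gradient variant where all gradients are taken at `W_0`; the low-rank and chunked details of the real layer are omitted):

```python
import torch

def ttt_dual_form(x, W0, lr=0.01):
    """x: (T, d) token features; W0: (d, d) initial fast weights."""
    err = x @ W0 - x                               # inner-loss residual at W_0
    grads = x.unsqueeze(2) * err.unsqueeze(1)      # (T, d, d) per-token outer products
    cum = torch.cumsum(grads, dim=0)               # Σ_{i<=t} grad_i
    cum = torch.cat([torch.zeros_like(cum[:1]), cum[:-1]])  # shift to Σ_{i<t}
    W_t = W0 - lr * cum                            # every W_t materialized at once
    return torch.einsum("td,tde->te", x, W_t)      # apply each token's fast weights
```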

> 📖 **Paper**: [Learning to (Learn at Test Time): RNNs with Expressive Hidden States](https://arxiv.org/abs/2407.04620) (ICML 2024)
>
> 📦 **Code**: [test-time-training/ttt-lm-pytorch](https://github.com/test-time-training/ttt-lm-pytorch)

#### When TTT Activates

TTT is designed for **inference-time adaptation** and runs only during `model.eval()`. During training, it's disabled to avoid overhead.

---

### Selective Activation

The FFN layers use **SwiGLU** with optional top-k sparsity masking.

#### SwiGLU FFN

```
FFN(x) = (Swish(W_gate @ x) ⊙ (W_up @ x)) @ W_down
```
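
In PyTorch terms, a minimal sketch using the dimensions from the configuration table (the bias-free projections are an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gate and up projections, elementwise product, down projection."""
    def __init__(self, dim: int = 576, hidden: int = 1440):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))  # Swish == SiLU
```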

> 📖 **Paper**: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) (Shazeer, 2020)

#### Selective Activation (Experimental)

| Parameter | Value |
|-----------|-------|
| `selective_k_ratio` | 0.85 (keeps top 85%) |
| `selective_use_soft_mask` | True |

**Important**: This is a **regularization technique**, not a speedup mechanism. Real sparse acceleration requires specialized kernels (e.g., Triton sparse GEMM).
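
One way such a soft mask could look (purely illustrative; the actual masking function in Genesis and its sharpness are not documented here):

```python
import torch

def soft_topk_mask(h: torch.Tensor, k_ratio: float = 0.85, sharpness: float = 10.0):
    """Softly suppress all but the top k_ratio fraction of activations per
    token (by magnitude), so gradients still flow through masked units."""
    k = max(1, int(h.shape[-1] * k_ratio))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]      # per-token cutoff
    mask = torch.sigmoid((h.abs() - thresh) * sharpness)   # ≈1 above, ≈0 below
    return h * mask
```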

> 📖 **Related**: [ReLU Strikes Back](https://arxiv.org/abs/2310.04564) (Apple, ICLR 2024) shows natural activation sparsity can be exploited for inference.

---

### Additional Components

#### Grouped Query Attention (GQA)

Genesis uses 3:1 GQA (9 query heads, 3 KV heads) for memory efficiency during inference.
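
A minimal sketch of the KV-head sharing, assuming the usual repeat-based implementation:

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int = 3) -> torch.Tensor:
    """Expand 3 KV heads to match 9 query heads: (B, 3, T, D) -> (B, 9, T, D)."""
    b, h_kv, t, d = kv.shape
    return kv[:, :, None].expand(b, h_kv, n_rep, t, d).reshape(b, h_kv * n_rep, t, d)
```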

> 📖 **Paper**: [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245) (Google, 2023)

#### Rotary Position Embeddings (RoPE)

Partial RoPE (50% rotation) is applied in GLA layers for position awareness.
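
A sketch of the partial rotation, assuming the common convention of rotating the first half of each head's dimensions (`cos`/`sin` are the standard per-position tables of shape `(T, d_rot / 2)`):

```python
import torch

def partial_rope(x, cos, sin, rotary_frac=0.5):
    """x: (..., T, d_head). Rotate the first rotary_frac of dims, pass the rest."""
    d_rot = int(x.shape[-1] * rotary_frac)
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```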

> 📖 **Paper**: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) (Su et al., 2021)

#### µP (Maximal Update Parametrization)

Hyperparameters were tuned using µP for potential scaling transfer.

> 📖 **Paper**: [Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer](https://arxiv.org/abs/2203.03466) (Yang et al., 2022)
>
> 📖 **Guide**: [The Practitioner's Guide to µP](https://www.cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization) (Cerebras)

#### Zero-Centered RMSNorm

Used throughout for better weight decay compatibility with µP.
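
A sketch of the idea, assuming the usual formulation where the scale is stored as `1 + g` with `g` initialized to zero, so that weight decay pulls the effective scale toward 1 rather than 0:

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.zeros(dim))   # decays toward 0 under weight decay
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.g)           # effective scale stays near 1
```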

---

## Comparison with Other Architectures

### vs. SmolLM2-135M (HuggingFace)

| Aspect | Genesis-152M | SmolLM2-135M |
|--------|--------------|--------------|
| **Attention** | Hybrid GLA + FoX | Standard Multi-Head |
| **Complexity** | O(n) for 23 of 30 layers | O(n²) all layers |
| **Position Encoding** | RoPE (GLA) / NoPE (FoX) | RoPE |
| **TTT** | ✓ Experimental | ✗ |
| **Pre-training** | 2B tokens | 2T tokens |
| **Architecture** | 30L × 576 | 30L × 576 |

> SmolLM2 uses 1,000× more training tokens, making direct benchmark comparison unfair. Genesis demonstrates architectural innovations, not data scaling.

### vs. Qwen3-Next

| Aspect | Genesis-152M | Qwen3-Next-80B-A3B |
|--------|--------------|---------------------|
| **Scale** | 152M | 80B (3B active) |
| **Linear Attention** | Gated DeltaNet | Gated DeltaNet |
| **Full Attention** | FoX | Standard |
| **Hybrid Ratio** | ~3:1 linear:full (23:7) | 3:1 |
| **MoE** | ✗ | ✓ |

Genesis can be seen as a **miniature research version** of the hybrid attention approach that Qwen3-Next uses at scale.

### vs. Mamba / Mamba-2

| Aspect | Genesis-152M | Mamba-2 |
|--------|--------------|---------|
| **Architecture** | Hybrid (Linear + Softmax) | Pure SSM |
| **Retrieval** | Strong (FoX layers) | Limited |
| **Implementation** | PyTorch + Optional Triton | Requires CUDA |
| **Flexibility** | Modular | Monolithic |

---

## Training Details

### Pre-training

| Parameter | Value |
|-----------|-------|
| **Tokens** | 2 billion |
| **Dataset Mix** | FineWeb-Edu (51%), DCLM (22%), FineMath (12%), Stack-Edu (8%), Cosmopedia (5%), Synth (2%) |
| **Context Length** | 2,048 |
| **Batch Size** | 128 |
| **Learning Rate** | 1e-3 (WSD schedule) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Weight Decay** | 0.1 |
| **Warmup** | 5% of steps |
| **Hardware** | Single A100 80GB |

#### Learning Rate Schedule

**WSD (Warmup-Stable-Decay)**:
- Warmup: 5% of training (linear ramp)
- Stable: 85% of training (constant LR)
- Decay: 10% of training (cosine to min_lr)
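
A concrete sketch of the schedule (the peak LR comes from the table above; the `min_lr` value is an assumption, as it is not stated):

```python
import math

def wsd_lr(step, total_steps, max_lr=1e-3, min_lr=1e-4,
           warmup_frac=0.05, decay_frac=0.10):
    """Warmup-Stable-Decay: linear ramp, constant plateau, cosine decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps           # linear warmup
    if step < decay_start:
        return max_lr                                       # stable plateau
    progress = (step - decay_start) / (total_steps - decay_start)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```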

### Supervised Fine-Tuning (SFT)

| Parameter | Value |
|-----------|-------|
| **Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **Samples** | ~485K conversations |
| **Epochs** | 1 |
| **Learning Rate** | 1e-3 |
| **Batch Size** | 32 (effective: 128 with grad accum) |

#### smol-smoltalk Composition

The SFT dataset is the same one used to train SmolLM2-135M-Instruct:

| Subset | Purpose |
|--------|---------|
| smol-magpie-ultra-short | Instruction following |
| everyday-conversations | Multi-turn dialogue |
| smol-rewrite | Text editing |
| smol-summarize | Summarization |
| openhermes-100k | Knowledge & reasoning |
| systemchats-30k | System prompt following |

> This dataset was specifically curated for small models (<1B params) and avoids issues like `<think>` tags from reasoning models.

---

## Usage

### Installation

```bash
pip install genesis-llm
```

### Download Weights

```bash
pip install "huggingface-hub>=0.20"
huggingface-cli download guiferrarib/genesis-152m-instruct genesis_152m_instruct.safetensors --local-dir .
```

### Interactive Chat

```bash
genesis --model ./genesis_152m_instruct.safetensors
```

### Python API

```python
import json
import torch
from safetensors import safe_open
from safetensors.torch import load_file
from genesis import Genesis, GenesisConfig, get_tokenizer

# 1. Load config from checkpoint metadata
model_path = "./genesis_152m_instruct.safetensors"
with safe_open(model_path, framework="pt", device="cpu") as f:
    metadata = f.metadata() or {}
config_dict = json.loads(metadata.get("genesis_config_json", "{}"))
config = GenesisConfig(**config_dict) if config_dict else GenesisConfig.genesis_147m()

# 2. Load model weights
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
state_dict = load_file(model_path, device=device)
model = Genesis(config).to(device)
model.load_state_dict(state_dict, strict=False)
model.eval()

# 3. Setup tokenizer (GPT-NeoX + ChatML tokens)
tokenizer = get_tokenizer("neox")
tokenizer.add_chat_tokens()

# 4. Build ChatML prompt
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain what linear attention is in simple terms.
<|im_end|>
<|im_start|>assistant
"""

# 5. Generate
input_ids = torch.tensor([tokenizer.encode(prompt)], device=device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:].tolist())
print(response)
```

### Prompt Format

Genesis uses **ChatML** format:

```
<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
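
For multi-turn use, a small helper can render a message list into this format (a hypothetical convenience function, not part of the `genesis` API):

```python
def build_chatml(messages: list[dict]) -> str:
    """Render [{'role': ..., 'content': ...}, ...] as ChatML,
    ending with an open assistant turn for generation."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n" for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"
```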

---

## Benchmarks

Evaluated using LightEval on MPS (Apple Silicon).

### Results

| Task | Metric | Score | Stderr |
|------|--------|-------|--------|
| **ARC-Easy** (25-shot) | acc_norm | 44.02% | ±1.02 |
| **ARC-Challenge** (25-shot) | acc_norm | 24.66% | ±1.26 |
| **BoolQ** (0-shot) | acc_norm | 56.30% | ±0.87 |
| **HellaSwag** (10-shot) | acc_norm | 30.19% | ±0.46 |
| **Winogrande** (5-shot) | acc | 49.09% | ±1.41 |
| **CommonsenseQA** (0-shot) | acc_norm | 29.16% | ±1.30 |
| **OpenBookQA** (0-shot) | acc_norm | 28.60% | ±2.02 |
| **SciQ** (0-shot) | acc_norm | 46.80% | ±1.58 |

### Interpretation

| Task | Random Baseline | Genesis | Signal |
|------|-----------------|---------|--------|
| ARC-Easy | 25% | 44% | ✅ **Strong** |
| BoolQ | 50% | 56% | ✅ Learning |
| HellaSwag | ~25% | 30% | ✅ Learning |
| Winogrande | 50% | 49% | ⚠️ At baseline |
| ARC-Challenge | ~25% | 25% | ⚠️ Too hard for this size |

> **Note**: With only 2B pre-training tokens (vs. 2T for SmolLM2), benchmarks primarily reflect architectural capacity rather than world knowledge.

---

## Limitations

### Known Issues

1. **Hallucinations**: Frequent factual errors due to limited pre-training data
2. **Math**: Unreliable arithmetic and multi-step reasoning
3. **Instruction Following**: Can be brittle with strict constraints
4. **TTT Overhead**: Metacognition layer adds latency (can be disabled)

### Not Suitable For

- Production deployments requiring reliability
- Tasks requiring factual accuracy
- Complex multi-step reasoning
- Safety-critical applications

### Best Use Cases

- Architecture research and ablation studies
- Efficient attention mechanism exploration
- Small model behavior analysis
- Educational purposes

---

## Citation

If you use Genesis in your research, please cite:

```bibtex
@misc{genesis2025,
  title={Genesis: A Hybrid Linear Attention Architecture for Small Language Models},
  author={Ferrari Brescia, Guilherme},
  year={2025},
  url={https://huggingface.co/guiferrarib/genesis-152m-instruct}
}
```

### Related Papers

```bibtex
@inproceedings{yang2024gated,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{lin2025forgetting,
  title={Forgetting Transformer: Softmax Attention with a Forget Gate},
  author={Lin, Zhixuan and others},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{sun2024learning,
  title={Learning to (Learn at Test Time): RNNs with Expressive Hidden States},
  author={Sun, Yu and others},
  booktitle={ICML},
  year={2024}
}

@article{allal2025smollm2,
  title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
  author={Allal, Loubna Ben and others},
  journal={arXiv preprint arXiv:2502.02737},
  year={2025}
}
```

---

## License

| Component | License |
|-----------|---------|
| **Model Weights** | Apache 2.0 |
| **Code** | Apache 2.0 |
| **Training Data** | Various (see dataset cards) |

---

<div align="center">
<p><em>Built with 🧬 by the Orch-Mind team</em></p>
<p>
<a href="https://pypi.org/project/genesis-llm/">PyPI</a>
</p>
</div>