---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- safetensors
- text-generation
- small-llm
- custom-architecture
- linear-attention
- gated-deltanet
- test-time-training
- hybrid-attention
- research
library_name: genesis-llm
datasets:
- HuggingFaceTB/smol-smoltalk
base_model: []
---

# 🧬 Genesis-152M-Instruct

A Research-Oriented Small Language Model with Hybrid Linear Attention


---

## Table of Contents

- [Overview](#overview)
- [Model Summary](#model-summary)
- [Architecture Deep Dive](#architecture-deep-dive)
  - [Hybrid Attention Layout](#hybrid-attention-layout)
  - [Gated DeltaNet (GLA)](#gated-deltanet-gla)
  - [Forgetting Attention (FoX)](#forgetting-attention-fox)
  - [Test-Time Training (TTT)](#test-time-training-ttt)
  - [Selective Activation](#selective-activation)
  - [Additional Components](#additional-components)
- [Comparison with Other Architectures](#comparison-with-other-architectures)
- [Training Details](#training-details)
  - [Pre-training](#pre-training)
  - [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
- [Limitations](#limitations)
- [Citation](#citation)
- [License](#license)

---

## Overview

**Genesis-152M-Instruct** is an experimental small language model that combines recent advances in efficient attention mechanisms into a single architecture. It serves as a research platform for exploring:

- **Hybrid attention**: Mixing O(n) linear attention with O(n²) softmax attention
- **Efficient inference**: Sub-quadratic complexity for most layers
- **Adaptive computation**: Test-time training for dynamic model adaptation

> ⚠️ **Experimental Model**: This is a research artifact, not a production-ready model. It demonstrates architectural innovations but has limitations typical of small models.

---

## Model Summary

| Property | Value |
|----------|-------|
| **Parameters** | 151.8M total (~122.8M non-embedding) |
| **Architecture** | Hybrid GLA + FoX Attention |
| **Context Length** | 2,048 tokens |
| **Vocab Size** | 50,279 (GPT-NeoX + ChatML tokens) |
| **Pre-training Data** | 2B tokens |
| **SFT Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **License** | Apache 2.0 |

### Files in this Repository

```
├── genesis_152m_instruct.safetensors   # Model weights
├── README.md                           # This model card
└── LICENSE                             # Apache 2.0
```

---

## Architecture Deep Dive

Genesis follows a **"deep-and-thin"** design philosophy inspired by [SmolLM2](https://arxiv.org/abs/2502.02737) and [MobileLLM](https://arxiv.org/abs/2402.14905), which has proven effective for small language models.

### Core Configuration

| Component | Value | Rationale |
|-----------|-------|-----------|
| Layers | 30 | Deep architecture for better representation |
| Hidden Size | 576 | Optimal width for the ~150M scale |
| Attention Heads | 9 | Query heads |
| KV Heads | 3 | 3:1 GQA ratio for memory efficiency |
| Head Dimension | 64 | Standard for efficient attention |
| FFN Size | 1,440 | 2.5× expansion (SwiGLU-efficient) |
| Weight Tying | ✓ | Embeddings tied with LM head |

---

### Hybrid Attention Layout

Genesis employs a **hybrid attention layout** inspired by [Qwen3-Next](https://huggingface.co/docs/transformers/main/en/model_doc/qwen3_next), alternating between linear and full attention:

```
Layer Distribution (30 layers):
├── 23 layers: GLA (Gated DeltaNet)       - O(n) linear attention
└──  7 layers: FoX (Forgetting Attention) - O(n²) softmax with forget gate

Ratio: ~75% Linear / ~25% Full Attention
```

**Why hybrid?** Pure linear attention struggles with precise retrieval tasks (e.g., copying, in-context learning). Interleaving full attention layers restores this capability while maintaining overall efficiency.

> 📖 **Reference**: The hybrid approach is validated by Qwen3-Next (2025) and research showing that [3:1 to 6:1 linear-to-full ratios](https://arxiv.org/abs/2507.06457) optimize the efficiency-quality tradeoff.
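The card does not specify exactly where the seven FoX layers sit in the stack. As a rough illustration only, a uniform interleaving with one full-attention layer every four layers reproduces the 23/7 split (the `FULL_ATTN_EVERY` spacing below is a hypothetical choice, not the actual Genesis schedule):

```python
# Illustrative hybrid layer schedule for a 30-layer stack.
# The real placement of FoX layers in Genesis may differ.
NUM_LAYERS = 30
FULL_ATTN_EVERY = 4  # hypothetical spacing: one FoX layer per four layers

layer_types = [
    "fox" if (i + 1) % FULL_ATTN_EVERY == 0 else "gla"
    for i in range(NUM_LAYERS)
]

assert layer_types.count("gla") == 23 and layer_types.count("fox") == 7
```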
---

### Gated DeltaNet (GLA)

The primary attention mechanism (75% of layers) is **Gated DeltaNet**, a state-of-the-art O(n) linear attention mechanism from NVIDIA.

#### Key Features

| Feature | Description | Paper Reference |
|---------|-------------|-----------------|
| **Delta Rule** | Online learning rule for recurrent state updates | [Schlag et al., 2021](https://arxiv.org/abs/2102.11174) |
| **Gated Forget** | Mamba-style data-dependent forgetting | [Gu & Dao, 2023](https://arxiv.org/abs/2312.00752) |
| **Short Convolution** | 1D conv on Q, K, V for local context | [Fu et al., 2022](https://arxiv.org/abs/2212.14052) |
| **L2 QK-Norm** | Stabilizes attention scores | Standard practice |

#### Mathematical Formulation

The delta rule update enables the model to selectively write to and erase from a recurrent state:

```
S_t = α_t * S_{t-1} + β_t * (v_t ⊗ k_t - S_{t-1} @ k_t ⊗ k_t)
o_t = S_t @ q_t
```

Where:

- `S_t`: Recurrent state matrix
- `α_t`: Forget gate (data-dependent)
- `β_t`: Learning rate gate (per-token)

> 📖 **Paper**: [Gated Delta Networks: Improving Mamba2 with Delta Rule](https://arxiv.org/abs/2412.06464) (ICLR 2025)
>
> 📦 **Code**: [NVlabs/GatedDeltaNet](https://github.com/NVlabs/GatedDeltaNet)

#### Configuration in Genesis

```python
gla_expand_k: 0.75        # Key expansion ratio
gla_expand_v: 1.5         # Value expansion ratio (asymmetric)
gla_gate_fn: "swish"      # Gating activation
gla_use_short_conv: True
gla_conv_size: 4
gla_chunk_size: 64        # For chunked parallel training
gla_use_delta_rule: True
gla_qk_norm: "l2"
gla_use_mamba_gate: True
```

---

### Forgetting Attention (FoX)

The full attention layers (25%) use **FoX (Forgetting Transformer)**, which augments standard softmax attention with a learnable forget gate.

#### Why FoX over Standard Attention?

| Aspect | Standard Attention | FoX |
|--------|--------------------|-----|
| Position Encoding | Requires RoPE/ALiBi | **NoPE** (implicit via forget gate) |
| Long-range Decay | Uniform attention | Data-dependent decay |
| Length Extrapolation | Poor | Better generalization |

#### Mechanism

FoX modifies attention scores with cumulative forget gates:

```
attn[i,j] = softmax(q_i @ k_j / √d + Σ_{t=j+1}^{i} log(f_t))
```

Where `f_t = sigmoid(W_f @ x_t)` is a learned forget gate that naturally down-weights distant tokens.

> 📖 **Paper**: [Forgetting Transformer: Softmax Attention with a Forget Gate](https://arxiv.org/abs/2503.02130) (ICLR 2025)
>
> 📦 **Code**: [zhixuan-lin/forgetting-transformer](https://github.com/zhixuan-lin/forgetting-transformer)
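For intuition, here is a naive single-head PyTorch sketch of the forget-gate bias described above (an O(n²) reference form; the actual layers use fused kernels plus the "Pro" extras below, and the function and tensor names are illustrative rather than the genesis-llm API):

```python
import torch

def fox_attention_reference(q, k, v, f_logits):
    """Naive forgetting attention for a single head.

    q, k, v: (seq_len, head_dim); f_logits: (seq_len,) pre-sigmoid forget gates.
    """
    d = q.shape[-1]
    log_f = torch.nn.functional.logsigmoid(f_logits)   # log f_t, always <= 0
    cum = torch.cumsum(log_f, dim=0)                    # prefix sums of log f_t
    # bias[i, j] = sum_{t=j+1..i} log f_t  (zero on the diagonal)
    bias = cum.unsqueeze(1) - cum.unsqueeze(0)
    scores = q @ k.T / d**0.5 + bias
    # Causal mask: position i may only attend to positions j <= i.
    causal = torch.tril(torch.ones_like(scores, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Because the bias is built from prefix sums of `log f_t`, distant tokens accumulate a more negative bias and are smoothly down-weighted without any explicit position encoding.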
#### FoX "Pro" Design

Genesis uses the enhanced "Pro" block design:

| Component | Purpose |
|-----------|---------|
| Output Gate | Controls information flow (like GLA) |
| QK-Norm | Training stability |
| Short Convolution | Local context on K, V |
| FusedRMSNormSwishGate | Efficient fused operations |

---

### Test-Time Training (TTT)

Genesis includes an experimental **TTT metacognition layer** that adapts the model during inference.

#### Concept

Traditional models have **fixed weights** at inference. TTT layers have a small set of **fast weights** that update based on the input sequence, allowing the model to "learn" from context:

```
Standard: y = f(x; θ_fixed)
TTT:      y = f(x; θ_fixed, θ_fast(x))
```

#### Implementation Details

| Parameter | Value | Description |
|-----------|-------|-------------|
| `ttt_rank` | 4 | Low-rank adaptation dimension |
| `ttt_inner_lr` | 0.01 | Learning rate for fast weights |
| `ttt_mode` | "dual" | Parallel dual-form computation |
| `ttt_chunk_size` | 64 | Chunking for efficiency |

The "dual form" enables fully parallel gradient computation:

```python
# Instead of sequential updates:
#   W_1 = W_0 - lr * grad_0
#   W_2 = W_1 - lr * grad_1
#   ...
# the dual form computes all steps at once:
#   W_t = W_0 - lr * Σ_{i<t} grad_i
```

> 📖 **Paper**: [Learning to (Learn at Test Time): RNNs with Expressive Hidden States](https://arxiv.org/abs/2407.04620) (ICML 2024)
>
> 📦 **Code**: [test-time-training/ttt-lm-pytorch](https://github.com/test-time-training/ttt-lm-pytorch)

#### When TTT Activates

TTT is designed for **inference-time adaptation** and runs only during `model.eval()`. During training, it is disabled to avoid overhead.

---

### Selective Activation

The FFN layers use **SwiGLU** with optional top-k sparsity masking.

#### SwiGLU FFN

```python
FFN(x) = (Swish(W_gate @ x) ⊙ (W_up @ x)) @ W_down
```

> 📖 **Paper**: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) (Shazeer, 2020)

#### Selective Activation (Experimental)

| Parameter | Value |
|-----------|-------|
| `selective_k_ratio` | 0.85 (keeps top 85%) |
| `selective_use_soft_mask` | True |

**Important**: This is a **regularization technique**, not a speedup mechanism. Real sparse acceleration requires specialized kernels (e.g., Triton sparse GEMM).

> 📖 **Related**: [ReLU Strikes Back](https://arxiv.org/abs/2310.04564) (Apple, ICLR 2024) shows natural activation sparsity can be exploited for inference.

---

### Additional Components

#### Grouped Query Attention (GQA)

Genesis uses 3:1 GQA (9 query heads, 3 KV heads) for memory efficiency during inference.

> 📖 **Paper**: [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245) (Google, 2023)

#### Rotary Position Embeddings (RoPE)

Partial RoPE (50% rotation) is applied in the GLA layers for position awareness.

> 📖 **Paper**: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) (Su et al., 2021)

#### µP (Maximal Update Parametrization)

Hyperparameters were tuned using µP for potential scaling transfer.

> 📖 **Paper**: [Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer](https://arxiv.org/abs/2203.03466) (Yang et al., 2022)
>
> 📖 **Guide**: [The Practitioner's Guide to µP](https://www.cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization) (Cerebras)

#### Zero-Centered RMSNorm

Zero-centered RMSNorm is used throughout for better weight decay compatibility with µP.
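The card does not show the normalization code, but a zero-centered RMSNorm is typically parameterized so that the learnable gain is an offset around zero, which keeps weight decay from dragging the effective scale toward zero. A minimal sketch under that assumption (not necessarily the exact Genesis implementation):

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose gain is stored as an offset around zero (effective scale = 1 + weight)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered, so weight decay pulls it toward 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)
```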
---

## Comparison with Other Architectures

### vs. SmolLM2-135M (HuggingFace)

| Aspect | Genesis-152M | SmolLM2-135M |
|--------|--------------|--------------|
| **Attention** | Hybrid GLA + FoX | Standard Multi-Head |
| **Complexity** | O(n) for 75% of layers | O(n²) for all layers |
| **Position Encoding** | RoPE (GLA) / NoPE (FoX) | RoPE |
| **TTT** | ✓ Experimental | ✗ |
| **Pre-training** | 2B tokens | 2T tokens |
| **Architecture** | 30L × 576 | 30L × 576 |

> SmolLM2 uses 1000× more training tokens, making direct benchmark comparison unfair. Genesis demonstrates architectural innovations, not data scaling.

### vs. Qwen3-Next

| Aspect | Genesis-152M | Qwen3-Next-80B-A3B |
|--------|--------------|--------------------|
| **Scale** | 152M | 80B (3B active) |
| **Linear Attention** | GLA (same) | GLA |
| **Full Attention** | FoX | Standard |
| **Hybrid Ratio** | 75/25 | Similar |
| **MoE** | ✗ | ✓ |

Genesis can be seen as a **miniature research version** of the hybrid attention approach that Qwen3-Next uses at scale.

### vs. Mamba / Mamba-2

| Aspect | Genesis-152M | Mamba-2 |
|--------|--------------|---------|
| **Architecture** | Hybrid (Linear + Softmax) | Pure SSM |
| **Retrieval** | Strong (FoX layers) | Limited |
| **Implementation** | PyTorch + Optional Triton | Requires CUDA |
| **Flexibility** | Modular | Monolithic |

---

## Training Details

### Pre-training

| Parameter | Value |
|-----------|-------|
| **Tokens** | 2 billion |
| **Dataset Mix** | FineWeb-Edu (51%), DCLM (22%), FineMath (12%), Stack-Edu (8%), Cosmopedia (5%), Synth (2%) |
| **Context Length** | 2,048 |
| **Batch Size** | 128 |
| **Learning Rate** | 1e-3 (WSD schedule) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Weight Decay** | 0.1 |
| **Warmup** | 5% of steps |
| **Hardware** | Single A100 80GB |

#### Learning Rate Schedule

**WSD (Warmup-Stable-Decay)**:

- Warmup: 5% of training (linear ramp)
- Stable: 85% of training (constant LR)
- Decay: 10% of training (cosine to min_lr)

### Supervised Fine-Tuning (SFT)

| Parameter | Value |
|-----------|-------|
| **Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **Samples** | ~485K conversations |
| **Epochs** | 1 |
| **Learning Rate** | 1e-3 |
| **Batch Size** | 32 (effective: 128 with gradient accumulation) |

#### smol-smoltalk Composition

The SFT dataset is the same one used to train SmolLM2-135M-Instruct:

| Subset | Purpose |
|--------|---------|
| smol-magpie-ultra-short | Instruction following |
| everyday-conversations | Multi-turn dialogue |
| smol-rewrite | Text editing |
| smol-summarize | Summarization |
| openhermes-100k | Knowledge & reasoning |
| systemchats-30k | System prompt following |

> This dataset was specifically curated for small models (<1B params) and avoids issues like `<think>` tags from reasoning models.

---

## Usage

### Installation

```bash
pip install genesis-llm
```

### Download Weights

```bash
pip install "huggingface-hub>=0.20"
huggingface-cli download guiferrarib/genesis-152m-instruct genesis_152m_instruct.safetensors --local-dir .
```

### Interactive Chat

```bash
genesis --model ./genesis_152m_instruct.safetensors
```

### Python API

```python
import json

import torch
from safetensors import safe_open
from safetensors.torch import load_file

from genesis import Genesis, GenesisConfig, get_tokenizer

# 1. Load config from checkpoint metadata
model_path = "./genesis_152m_instruct.safetensors"
with safe_open(model_path, framework="pt", device="cpu") as f:
    metadata = f.metadata() or {}
config_dict = json.loads(metadata.get("genesis_config_json", "{}"))
config = GenesisConfig(**config_dict) if config_dict else GenesisConfig.genesis_147m()

# 2. Load model weights
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
state_dict = load_file(model_path, device=device)
model = Genesis(config).to(device)
model.load_state_dict(state_dict, strict=False)
model.eval()

# 3. Setup tokenizer (GPT-NeoX + ChatML tokens)
tokenizer = get_tokenizer("neox")
tokenizer.add_chat_tokens()

# 4. Build ChatML prompt
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain what linear attention is in simple terms.
<|im_end|>
<|im_start|>assistant
"""

# 5. Generate
input_ids = torch.tensor([tokenizer.encode(prompt)], device=device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, temperature=0.7)

response = tokenizer.decode(output_ids[0][input_ids.shape[1]:].tolist())
print(response)
```

### Prompt Format

Genesis uses the **ChatML** format:

```
<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
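If you build prompts programmatically, a small helper can assemble this format from a list of messages and leave the prompt open for the assistant's turn (a convenience sketch; `build_chatml_prompt` is not part of the genesis-llm API):

```python
def build_chatml_prompt(messages):
    """Render [{'role': ..., 'content': ...}, ...] into the ChatML layout shown above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what linear attention is in simple terms."},
])
```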
---

## Benchmarks

Evaluated using LightEval on MPS (Apple Silicon).

### Results

| Task | Metric | Score | Stderr |
|------|--------|-------|--------|
| **ARC-Easy** (25-shot) | acc_norm | 44.02% | ±1.02 |
| **ARC-Challenge** (25-shot) | acc_norm | 24.66% | ±1.26 |
| **BoolQ** (0-shot) | acc_norm | 56.30% | ±0.87 |
| **HellaSwag** (10-shot) | acc_norm | 30.19% | ±0.46 |
| **Winogrande** (5-shot) | acc | 49.09% | ±1.41 |
| **CommonsenseQA** (0-shot) | acc_norm | 29.16% | ±1.30 |
| **OpenBookQA** (0-shot) | acc_norm | 28.60% | ±2.02 |
| **SciQ** (0-shot) | acc_norm | 46.80% | ±1.58 |

### Interpretation

| Task | Random Baseline | Genesis | Signal |
|------|-----------------|---------|--------|
| ARC-Easy | 25% | 44% | ✅ **Strong** |
| BoolQ | 50% | 56% | ✅ Learning |
| HellaSwag | ~25% | 30% | ✅ Learning |
| Winogrande | 50% | 49% | ⚠️ At baseline |
| ARC-Challenge | ~25% | 25% | ⚠️ Too hard for this size |

> **Note**: With only 2B pre-training tokens (vs. 2T for SmolLM2), benchmarks primarily reflect architectural capacity rather than world knowledge.

---

## Limitations

### Known Issues

1. **Hallucinations**: Frequent factual errors due to limited pre-training data
2. **Math**: Unreliable arithmetic and multi-step reasoning
3. **Instruction Following**: Can be brittle with strict constraints
4. **TTT Overhead**: The metacognition layer adds latency (it can be disabled)

### Not Suitable For

- Production deployments requiring reliability
- Tasks requiring factual accuracy
- Complex multi-step reasoning
- Safety-critical applications

### Best Use Cases

- Architecture research and ablation studies
- Efficient attention mechanism exploration
- Small model behavior analysis
- Educational purposes

---

## Citation

If you use Genesis in your research, please cite:

```bibtex
@misc{genesis2025,
  title={Genesis: A Hybrid Linear Attention Architecture for Small Language Models},
  author={Ferrari Brescia, Guilherme},
  year={2025},
  url={https://huggingface.co/guiferrarib/genesis-152m-instruct}
}
```

### Related Papers

```bibtex
@inproceedings{yang2024gated,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{lin2025forgetting,
  title={Forgetting Transformer: Softmax Attention with a Forget Gate},
  author={Lin, Zhixuan and others},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{sun2024learning,
  title={Learning to (Learn at Test Time): RNNs with Expressive Hidden States},
  author={Sun, Yu and others},
  booktitle={ICML},
  year={2024}
}

@article{allal2025smollm2,
  title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
  author={Allal, Loubna Ben and others},
  journal={arXiv preprint arXiv:2502.02737},
  year={2025}
}
```

---

## License

| Component | License |
|-----------|---------|
| **Model Weights** | Apache 2.0 |
| **Code** | Apache 2.0 |
| **Training Data** | Various (see dataset cards) |

---

Built with 🧬 by the Orch-Mind team
