---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- safetensors
- text-generation
- small-llm
- custom-architecture
- linear-attention
- gated-deltanet
- test-time-training
- hybrid-attention
- research
library_name: genesis-llm
datasets:
- HuggingFaceTB/smol-smoltalk
base_model: []
---

<div align="center">
<h1>🧬 Genesis-152M-Instruct</h1>
<p><em>A Research-Oriented Small Language Model with Hybrid Linear Attention</em></p>

<p>
<a href="#architecture"><img alt="Architecture" src="https://img.shields.io/badge/Architecture-Hybrid_GLA%2BFoX-blue"></a>
<a href="#training"><img alt="Training" src="https://img.shields.io/badge/Pre--training-2B_tokens-green"></a>
<a href="#license"><img alt="License" src="https://img.shields.io/badge/License-Apache_2.0-orange"></a>
</p>
</div>

---

## Table of Contents

- [Overview](#overview)
- [Model Summary](#model-summary)
- [Architecture Deep Dive](#architecture-deep-dive)
  - [Hybrid Attention Layout](#hybrid-attention-layout)
  - [Gated DeltaNet (GLA)](#gated-deltanet-gla)
  - [Forgetting Attention (FoX)](#forgetting-attention-fox)
  - [Test-Time Training (TTT)](#test-time-training-ttt)
  - [Selective Activation](#selective-activation)
  - [Additional Components](#additional-components)
- [Comparison with Other Architectures](#comparison-with-other-architectures)
- [Training Details](#training-details)
  - [Pre-training](#pre-training)
  - [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
- [Limitations](#limitations)
- [Citation](#citation)
- [License](#license)

---

## Overview

**Genesis-152M-Instruct** is an experimental small language model that combines recent advances in efficient attention mechanisms into a single architecture. It serves as a research platform for exploring:

- **Hybrid attention**: Mixing O(n) linear attention with O(n²) softmax attention
- **Efficient inference**: Sub-quadratic complexity for most layers
- **Adaptive computation**: Test-time training for dynamic model adaptation

> ⚠️ **Experimental Model**: This is a research artifact, not a production-ready model. It demonstrates architectural innovations but has limitations typical of small models.

---

## Model Summary

| Property | Value |
|----------|-------|
| **Parameters** | 151.8M total (~122.8M non-embedding) |
| **Architecture** | Hybrid GLA + FoX Attention |
| **Context Length** | 2,048 tokens |
| **Vocab Size** | 50,279 (GPT-NeoX + ChatML tokens) |
| **Pre-training Data** | 2B tokens |
| **SFT Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **License** | Apache 2.0 |

### Files in this Repository

```
├── genesis_152m_instruct.safetensors   # Model weights
├── README.md                           # This model card
└── LICENSE                             # Apache 2.0
```

---

## Architecture Deep Dive

Genesis follows a **"deep-and-thin"** design philosophy inspired by [SmolLM2](https://arxiv.org/abs/2502.02737) and [MobileLLM](https://arxiv.org/abs/2402.14905), which has proven effective for small language models.

### Core Configuration

| Component | Value | Rationale |
|-----------|-------|-----------|
| Layers | 30 | Deep architecture for better representation |
| Hidden Size | 576 | Narrow width suited to the ~150M scale |
| Attention Heads | 9 | Query heads |
| KV Heads | 3 | 3:1 GQA ratio for memory efficiency |
| Head Dimension | 64 | 2.5× expansion would go here; standard for efficient attention |
| FFN Size | 1,440 | 2.5× expansion (offsets SwiGLU's third weight matrix) |
| Weight Tying | ✓ | Embeddings tied with LM head |

---

### Hybrid Attention Layout

Genesis employs a **hybrid attention layout** inspired by [Qwen3-Next](https://huggingface.co/docs/transformers/main/en/model_doc/qwen3_next), alternating between linear and full attention:

```
Layer Distribution (30 layers):
├── 23 layers: GLA (Gated DeltaNet)       - O(n) linear attention
└──  7 layers: FoX (Forgetting Attention) - O(n²) softmax with forget gate

Ratio: 23:7, i.e. ~77% linear / ~23% full attention
```

**Why hybrid?** Pure linear attention struggles with precise retrieval tasks (e.g., copying, in-context learning). Interleaving full attention layers restores this capability while maintaining overall efficiency.

> 📖 **Reference**: The hybrid approach is validated by Qwen3-Next (2025) and research showing that [3:1 to 6:1 linear-to-full ratios](https://arxiv.org/abs/2507.06457) optimize the efficiency-quality tradeoff.
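
As an illustration, such an interleaving can be expressed as a simple layer-type schedule. The every-4th-layer placement below is an assumption for the sketch; the exact positions of the FoX layers in Genesis are not documented here:

```python
def layer_types(n_layers: int = 30, full_every: int = 4) -> list[str]:
    """Hypothetical layout: every 4th block is full (FoX) attention,
    the rest are linear (GLA) -> 7 FoX + 23 GLA for 30 layers."""
    return ["fox" if (i + 1) % full_every == 0 else "gla" for i in range(n_layers)]

assert layer_types().count("fox") == 7  # matches the 23:7 split above
```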

---

### Gated DeltaNet (GLA)

The primary attention mechanism (23 of the 30 layers) is **Gated DeltaNet**, a state-of-the-art O(n) linear attention mechanism from NVIDIA.

#### Key Features

| Feature | Description | Paper Reference |
|---------|-------------|-----------------|
| **Delta Rule** | Online learning rule for recurrent state updates | [Schlag et al., 2021](https://arxiv.org/abs/2102.11174) |
| **Gated Forget** | Mamba-style data-dependent forgetting | [Gu & Dao, 2023](https://arxiv.org/abs/2312.00752) |
| **Short Convolution** | 1D conv on Q, K, V for local context | [Fu et al., 2022](https://arxiv.org/abs/2212.14052) |
| **L2 QK-Norm** | Stabilizes attention scores | Standard practice |

#### Mathematical Formulation

The delta rule update enables the model to selectively write to and erase from a recurrent state:

```
S_t = α_t * S_{t-1} + β_t * (v_t ⊗ k_t - S_{t-1} @ k_t ⊗ k_t)
o_t = S_t @ q_t
```

Where:
- `S_t`: Recurrent state matrix
- `α_t`: Forget gate (data-dependent)
- `β_t`: Learning rate gate (per-token)
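
A minimal sequential sketch of this recurrence (for clarity only; training uses a chunked parallel form, and the L2 QK-normalization is omitted here):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in (0, 1)."""
    S = torch.zeros(v.shape[1], k.shape[1])  # recurrent state S_t
    outputs = []
    for t in range(q.shape[0]):
        # forget with alpha_t, then write the prediction error (delta rule)
        S = alpha[t] * S + beta[t] * torch.outer(v[t] - S @ k[t], k[t])
        outputs.append(S @ q[t])             # o_t = S_t @ q_t
    return torch.stack(outputs)              # (T, d_v)
```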

> 📖 **Paper**: [Gated Delta Networks: Improving Mamba2 with Delta Rule](https://arxiv.org/abs/2412.06464) (ICLR 2025)
>
> 📦 **Code**: [NVlabs/GatedDeltaNet](https://github.com/NVlabs/GatedDeltaNet)

#### Configuration in Genesis

```yaml
gla_expand_k: 0.75      # Key expansion ratio
gla_expand_v: 1.5       # Value expansion ratio (asymmetric)
gla_gate_fn: "swish"    # Gating activation
gla_use_short_conv: true
gla_conv_size: 4
gla_chunk_size: 64      # For chunked parallel training
gla_use_delta_rule: true
gla_qk_norm: "l2"
gla_use_mamba_gate: true
```

---

### Forgetting Attention (FoX)

The full attention layers (the remaining 7 of 30) use **FoX (Forgetting Transformer)**, which augments standard softmax attention with a learnable forget gate.

#### Why FoX over Standard Attention?

| Aspect | Standard Attention | FoX |
|--------|-------------------|-----|
| Position Encoding | Requires RoPE/ALiBi | **NoPE** (implicit via forget gate) |
| Long-range Decay | Uniform attention | Data-dependent decay |
| Length Extrapolation | Poor | Better generalization |

#### Mechanism

FoX modifies attention scores with cumulative forget gates:

```
attn[i, j] = softmax(q_i @ k_j / √d + Σ_{l=j+1}^{i} log(f_l))
```

Where `f_l = sigmoid(W_f @ x_l)` is a learned forget gate that naturally down-weights distant tokens.
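
A naive dense reference of this scoring rule (a sketch for clarity; the official implementation fuses the gate bias into a FlashAttention-style kernel):

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, f):
    """q, k: (T, d); v: (T, d_v); f: (T,) forget gates in (0, 1)."""
    T, d = q.shape
    c = torch.cumsum(torch.log(f), dim=0)   # c[i] = Σ_{l<=i} log f_l
    bias = c[:, None] - c[None, :]          # bias[i, j] = Σ_{l=j+1..i} log f_l
    scores = q @ k.T / d**0.5 + bias
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return F.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1) @ v
```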

> 📖 **Paper**: [Forgetting Transformer: Softmax Attention with a Forget Gate](https://arxiv.org/abs/2503.02130) (ICLR 2025)
>
> 📦 **Code**: [zhixuan-lin/forgetting-transformer](https://github.com/zhixuan-lin/forgetting-transformer)

#### FoX "Pro" Design

Genesis uses the enhanced "Pro" block design:

| Component | Purpose |
|-----------|---------|
| Output Gate | Controls information flow (like GLA) |
| QK-Norm | Training stability |
| Short Convolution | Local context on K, V |
| FusedRMSNormSwishGate | Efficient fused operations |

---

### Test-Time Training (TTT)

Genesis includes an experimental **TTT metacognition layer** that adapts the model during inference.

#### Concept

Traditional models have **fixed weights** at inference. TTT layers have a small set of **fast weights** that update based on the input sequence, allowing the model to "learn" from context.

```
Standard: y = f(x; θ_fixed)
TTT:      y = f(x; θ_fixed, θ_fast(x))
```

#### Implementation Details

| Parameter | Value | Description |
|-----------|-------|-------------|
| `ttt_rank` | 4 | Low-rank adaptation dimension |
| `ttt_inner_lr` | 0.01 | Learning rate for fast weights |
| `ttt_mode` | "dual" | Parallel dual-form computation |
| `ttt_chunk_size` | 64 | Chunking for efficiency |

The "dual form" enables fully parallel gradient computation:

```python
# Instead of sequential updates:
#   W_1 = W_0 - lr * grad_0
#   W_2 = W_1 - lr * grad_1
#   ...

# The dual form computes all steps at once:
#   W_t = W_0 - lr * Σ_{i<t} grad_i   (via cumsum)
```
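
As a toy illustration of that idea for a linear fast-weight layer with a self-reconstruction inner loss (the batch-gradient variant where all gradients are taken at `W_0`; the low-rank and chunked details of the real layer are omitted):

```python
import torch

def ttt_dual_form(x, W0, lr=0.01):
    """x: (T, d) token features; W0: (d, d) initial fast weights."""
    err = x @ W0 - x                               # inner-loss residual at W_0
    grads = x.unsqueeze(2) * err.unsqueeze(1)      # (T, d, d) per-token outer products
    cum = torch.cumsum(grads, dim=0)               # Σ_{i<=t} grad_i
    cum = torch.cat([torch.zeros_like(cum[:1]), cum[:-1]])  # shift to Σ_{i<t}
    W_t = W0 - lr * cum                            # every W_t materialized at once
    return torch.einsum("td,tde->te", x, W_t)      # apply each token's fast weights
```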

> 📖 **Paper**: [Learning to (Learn at Test Time): RNNs with Expressive Hidden States](https://arxiv.org/abs/2407.04620) (ICML 2024)
>
> 📦 **Code**: [test-time-training/ttt-lm-pytorch](https://github.com/test-time-training/ttt-lm-pytorch)

#### When TTT Activates

TTT is designed for **inference-time adaptation** and runs only during `model.eval()`. During training, it's disabled to avoid overhead.

---

### Selective Activation

The FFN layers use **SwiGLU** with optional top-k sparsity masking.

#### SwiGLU FFN

```
FFN(x) = (Swish(W_gate @ x) ⊙ (W_up @ x)) @ W_down
```
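
In PyTorch terms, a minimal sketch using the dimensions from the configuration table (the bias-free projections are an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gate and up projections, elementwise product, down projection."""
    def __init__(self, dim: int = 576, hidden: int = 1440):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))  # Swish == SiLU
```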

> 📖 **Paper**: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) (Shazeer, 2020)

#### Selective Activation (Experimental)

| Parameter | Value |
|-----------|-------|
| `selective_k_ratio` | 0.85 (keeps top 85%) |
| `selective_use_soft_mask` | True |

**Important**: This is a **regularization technique**, not a speedup mechanism. Real sparse acceleration requires specialized kernels (e.g., Triton sparse GEMM).
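
One way such a soft mask could look (purely illustrative; the actual masking function in Genesis and its sharpness are not documented here):

```python
import torch

def soft_topk_mask(h: torch.Tensor, k_ratio: float = 0.85, sharpness: float = 10.0):
    """Softly suppress all but the top k_ratio fraction of activations per
    token (by magnitude), so gradients still flow through masked units."""
    k = max(1, int(h.shape[-1] * k_ratio))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]      # per-token cutoff
    mask = torch.sigmoid((h.abs() - thresh) * sharpness)   # ≈1 above, ≈0 below
    return h * mask
```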

> 📖 **Related**: [ReLU Strikes Back](https://arxiv.org/abs/2310.04564) (Apple, ICLR 2024) shows natural activation sparsity can be exploited for inference.

---

### Additional Components

#### Grouped Query Attention (GQA)

Genesis uses 3:1 GQA (9 query heads, 3 KV heads) for memory efficiency during inference.
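
A minimal sketch of the KV-head sharing, assuming the usual repeat-based implementation:

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int = 3) -> torch.Tensor:
    """Expand 3 KV heads to match 9 query heads: (B, 3, T, D) -> (B, 9, T, D)."""
    b, h_kv, t, d = kv.shape
    return kv[:, :, None].expand(b, h_kv, n_rep, t, d).reshape(b, h_kv * n_rep, t, d)
```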

> 📖 **Paper**: [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245) (Google, 2023)

#### Rotary Position Embeddings (RoPE)

Partial RoPE (50% rotation) is applied in GLA layers for position awareness.
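
A sketch of the partial rotation, assuming the common convention of rotating the first half of each head's dimensions (`cos`/`sin` are the standard per-position tables of shape `(T, d_rot / 2)`):

```python
import torch

def partial_rope(x, cos, sin, rotary_frac=0.5):
    """x: (..., T, d_head). Rotate the first rotary_frac of dims, pass the rest."""
    d_rot = int(x.shape[-1] * rotary_frac)
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```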

> 📖 **Paper**: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) (Su et al., 2021)

#### µP (Maximal Update Parametrization)

Hyperparameters were tuned using µP for potential scaling transfer.

> 📖 **Paper**: [Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer](https://arxiv.org/abs/2203.03466) (Yang et al., 2022)
>
> 📖 **Guide**: [The Practitioner's Guide to µP](https://www.cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization) (Cerebras)

#### Zero-Centered RMSNorm

Used throughout for better weight decay compatibility with µP.
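
A sketch of the idea, assuming the usual formulation where the scale is stored as `1 + g` with `g` initialized to zero, so that weight decay pulls the effective scale toward 1 rather than 0:

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.zeros(dim))   # decays toward 0 under weight decay
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.g)           # effective scale stays near 1
```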

---

## Comparison with Other Architectures

### vs. SmolLM2-135M (HuggingFace)

| Aspect | Genesis-152M | SmolLM2-135M |
|--------|--------------|--------------|
| **Attention** | Hybrid GLA + FoX | Standard Multi-Head |
| **Complexity** | O(n) for 23 of 30 layers | O(n²) all layers |
| **Position Encoding** | RoPE (GLA) / NoPE (FoX) | RoPE |
| **TTT** | ✓ Experimental | ✗ |
| **Pre-training** | 2B tokens | 2T tokens |
| **Architecture** | 30L × 576 | 30L × 576 |

> SmolLM2 uses 1,000× more training tokens, making direct benchmark comparison unfair. Genesis demonstrates architectural innovations, not data scaling.

### vs. Qwen3-Next

| Aspect | Genesis-152M | Qwen3-Next-80B-A3B |
|--------|--------------|---------------------|
| **Scale** | 152M | 80B (3B active) |
| **Linear Attention** | Gated DeltaNet | Gated DeltaNet |
| **Full Attention** | FoX | Standard |
| **Hybrid Ratio** | ~3:1 linear:full (23:7) | 3:1 |
| **MoE** | ✗ | ✓ |

Genesis can be seen as a **miniature research version** of the hybrid attention approach that Qwen3-Next uses at scale.

### vs. Mamba / Mamba-2

| Aspect | Genesis-152M | Mamba-2 |
|--------|--------------|---------|
| **Architecture** | Hybrid (Linear + Softmax) | Pure SSM |
| **Retrieval** | Strong (FoX layers) | Limited |
| **Implementation** | PyTorch + Optional Triton | Requires CUDA |
| **Flexibility** | Modular | Monolithic |

---

## Training Details

### Pre-training

| Parameter | Value |
|-----------|-------|
| **Tokens** | 2 billion |
| **Dataset Mix** | FineWeb-Edu (51%), DCLM (22%), FineMath (12%), Stack-Edu (8%), Cosmopedia (5%), Synth (2%) |
| **Context Length** | 2,048 |
| **Batch Size** | 128 |
| **Learning Rate** | 1e-3 (WSD schedule) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
| **Weight Decay** | 0.1 |
| **Warmup** | 5% of steps |
| **Hardware** | Single A100 80GB |

#### Learning Rate Schedule

**WSD (Warmup-Stable-Decay)**:
- Warmup: 5% of training (linear ramp)
- Stable: 85% of training (constant LR)
- Decay: 10% of training (cosine to min_lr)
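
A concrete sketch of the schedule (the peak LR comes from the table above; the `min_lr` value is an assumption, as it is not stated):

```python
import math

def wsd_lr(step, total_steps, max_lr=1e-3, min_lr=1e-4,
           warmup_frac=0.05, decay_frac=0.10):
    """Warmup-Stable-Decay: linear ramp, constant plateau, cosine decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps           # linear warmup
    if step < decay_start:
        return max_lr                                       # stable plateau
    progress = (step - decay_start) / (total_steps - decay_start)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```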

### Supervised Fine-Tuning (SFT)

| Parameter | Value |
|-----------|-------|
| **Dataset** | [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **Samples** | ~485K conversations |
| **Epochs** | 1 |
| **Learning Rate** | 1e-3 |
| **Batch Size** | 32 (effective: 128 with grad accum) |

#### smol-smoltalk Composition

The SFT dataset is the same one used to train SmolLM2-135M-Instruct:

| Subset | Purpose |
|--------|---------|
| smol-magpie-ultra-short | Instruction following |
| everyday-conversations | Multi-turn dialogue |
| smol-rewrite | Text editing |
| smol-summarize | Summarization |
| openhermes-100k | Knowledge & reasoning |
| systemchats-30k | System prompt following |

> This dataset was specifically curated for small models (<1B params) and avoids issues like `<think>` tags from reasoning models.

---

## Usage

### Installation

```bash
pip install genesis-llm
```

### Download Weights

```bash
pip install "huggingface-hub>=0.20"
huggingface-cli download guiferrarib/genesis-152m-instruct genesis_152m_instruct.safetensors --local-dir .
```

### Interactive Chat

```bash
genesis --model ./genesis_152m_instruct.safetensors
```

### Python API

```python
import json
import torch
from safetensors import safe_open
from safetensors.torch import load_file
from genesis import Genesis, GenesisConfig, get_tokenizer

# 1. Load config from checkpoint metadata
model_path = "./genesis_152m_instruct.safetensors"
with safe_open(model_path, framework="pt", device="cpu") as f:
    metadata = f.metadata() or {}
config_dict = json.loads(metadata.get("genesis_config_json", "{}"))
config = GenesisConfig(**config_dict) if config_dict else GenesisConfig.genesis_147m()

# 2. Load model weights
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
state_dict = load_file(model_path, device=device)
model = Genesis(config).to(device)
model.load_state_dict(state_dict, strict=False)
model.eval()

# 3. Setup tokenizer (GPT-NeoX + ChatML tokens)
tokenizer = get_tokenizer("neox")
tokenizer.add_chat_tokens()

# 4. Build ChatML prompt
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain what linear attention is in simple terms.
<|im_end|>
<|im_start|>assistant
"""

# 5. Generate
input_ids = torch.tensor([tokenizer.encode(prompt)], device=device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:].tolist())
print(response)
```

### Prompt Format

Genesis uses **ChatML** format:

```
<|im_start|>system
{system_message}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
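
For multi-turn use, a small helper can render a message list into this format (a hypothetical convenience function, not part of the `genesis` API):

```python
def build_chatml(messages: list[dict]) -> str:
    """Render [{'role': ..., 'content': ...}, ...] as ChatML,
    ending with an open assistant turn for generation."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n" for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"
```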

---

## Benchmarks

Evaluated using LightEval on MPS (Apple Silicon).

### Results

| Task | Metric | Score | Stderr |
|------|--------|-------|--------|
| **ARC-Easy** (25-shot) | acc_norm | 44.02% | ±1.02 |
| **ARC-Challenge** (25-shot) | acc_norm | 24.66% | ±1.26 |
| **BoolQ** (0-shot) | acc_norm | 56.30% | ±0.87 |
| **HellaSwag** (10-shot) | acc_norm | 30.19% | ±0.46 |
| **Winogrande** (5-shot) | acc | 49.09% | ±1.41 |
| **CommonsenseQA** (0-shot) | acc_norm | 29.16% | ±1.30 |
| **OpenBookQA** (0-shot) | acc_norm | 28.60% | ±2.02 |
| **SciQ** (0-shot) | acc_norm | 46.80% | ±1.58 |

### Interpretation

| Task | Random Baseline | Genesis | Signal |
|------|-----------------|---------|--------|
| ARC-Easy | 25% | 44% | ✅ **Strong** |
| BoolQ | 50% | 56% | ✅ Learning |
| HellaSwag | ~25% | 30% | ✅ Learning |
| Winogrande | 50% | 49% | ⚠️ At baseline |
| ARC-Challenge | ~25% | 25% | ⚠️ Too hard for this size |

> **Note**: With only 2B pre-training tokens (vs. 2T for SmolLM2), benchmarks primarily reflect architectural capacity rather than world knowledge.

---

## Limitations

### Known Issues

1. **Hallucinations**: Frequent factual errors due to limited pre-training data
2. **Math**: Unreliable arithmetic and multi-step reasoning
3. **Instruction Following**: Can be brittle with strict constraints
4. **TTT Overhead**: Metacognition layer adds latency (can be disabled)

### Not Suitable For

- Production deployments requiring reliability
- Tasks requiring factual accuracy
- Complex multi-step reasoning
- Safety-critical applications

### Best Use Cases

- Architecture research and ablation studies
- Efficient attention mechanism exploration
- Small model behavior analysis
- Educational purposes

---

## Citation

If you use Genesis in your research, please cite:

```bibtex
@misc{genesis2025,
  title={Genesis: A Hybrid Linear Attention Architecture for Small Language Models},
  author={Ferrari Brescia, Guilherme},
  year={2025},
  url={https://huggingface.co/guiferrarib/genesis-152m-instruct}
}
```

### Related Papers

```bibtex
@inproceedings{yang2024gated,
  title={Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author={Yang, Songlin and Kautz, Jan and Hatamizadeh, Ali},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{lin2025forgetting,
  title={Forgetting Transformer: Softmax Attention with a Forget Gate},
  author={Lin, Zhixuan and others},
  booktitle={ICLR},
  year={2025}
}

@inproceedings{sun2024learning,
  title={Learning to (Learn at Test Time): RNNs with Expressive Hidden States},
  author={Sun, Yu and others},
  booktitle={ICML},
  year={2024}
}

@article{allal2025smollm2,
  title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
  author={Allal, Loubna Ben and others},
  journal={arXiv preprint arXiv:2502.02737},
  year={2025}
}
```

---

## License

| Component | License |
|-----------|---------|
| **Model Weights** | Apache 2.0 |
| **Code** | Apache 2.0 |
| **Training Data** | Various (see dataset cards) |

---

<div align="center">
<p><em>Built with 🧬 by the Orch-Mind team</em></p>
<p>
<a href="https://pypi.org/project/genesis-llm/">PyPI</a>
</p>
</div>