# HYDRA: Hybrid Dynamic Recurrent Architecture

A novel non-transformer language model built from scratch and trained on CPU. The custom architecture combines Mamba's selective state spaces, Griffin's Real-Gated Linear Recurrence (RG-LRU), and RWKV's channel mixing, with zero attention layers.
## Architecture
| Component | Source | Paper / Notes |
|---|---|---|
| Selective State Spaces | Mamba | arxiv:2312.00752 |
| Real-Gated Linear Recurrence | Griffin | arxiv:2402.19427 |
| Time/Channel Mixing | RWKV | arxiv:2305.13048 |
| Multi-Scale Compression | Novel | Parallel recurrences at different timescales |
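The multi-scale idea in the last row can be illustrated with a toy sketch: run one linear recurrence per timescale in parallel, where a fast decay tracks local context and a slow decay compresses long-range context. This is an illustrative reimplementation of the concept, not the repo's actual code; names and the `(1 - a)` input scaling are assumptions.

```python
import numpy as np

def multiscale(x, decays=(0.5, 0.95)):
    """Toy multi-scale recurrence: one state per timescale, updated in
    parallel. A fast decay (0.5) reacts quickly to recent tokens; a slow
    decay (0.95) integrates information over a longer horizon."""
    T = len(x)
    a = np.array(decays)
    states = np.zeros(len(decays))          # one scalar state per timescale
    out = np.empty((T, len(decays)))
    for t in range(T):
        states = a * states + (1.0 - a) * x[t]
        out[t] = states
    return out

# Feed a constant input: both channels converge toward 1.0, but at
# very different speeds, which is the point of parallel timescales.
y = multiscale(np.ones(32))
print(y[-1])
```

After 32 steps of constant input, the fast channel has essentially converged while the slow channel is still integrating.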
## Key Properties
- NOT a transformer: no attention whatsoever
- O(n) time: linear in sequence length (vs. O(n²) for transformers)
- Constant memory at inference (no KV cache)
- Content-aware selective gating (Mamba + Griffin fusion)
- Multi-scale temporal processing
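The constant-memory property can be made concrete with a back-of-envelope comparison: a recurrent model carries only a fixed-size state per layer, while a transformer's KV cache grows with every generated token. The numbers below reuse d_model=256 and state dim 16 from the specs; the per-layer, float64 accounting is a simplifying assumption for illustration.

```python
def recurrent_state_bytes(seq_len, d_model=256, state_dim=16):
    """Recurrent inference keeps one fixed-size hidden state per layer;
    note the result does not depend on seq_len at all."""
    return d_model * state_dim * 8      # one float64 state matrix

def kv_cache_bytes(seq_len, d_model=256):
    """A transformer caches keys and values for every past token, so
    memory grows linearly with the number of tokens processed."""
    return 2 * seq_len * d_model * 8    # K and V, float64

for n in (128, 1024, 8192):
    print(n, recurrent_state_bytes(n), kv_cache_bytes(n))
```

The recurrent figure is identical at every sequence length; the KV-cache figure scales by 64x when the sequence does.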
## Architecture Diagram
```
Token Embedding → N × HydraBlock → RMSNorm → LM Head

HydraBlock:
├── RMSNorm → SelectiveGatedRecurrence → + residual
└── RMSNorm → GatedChannelMixing (GeGLU) → + residual

SelectiveGatedRecurrence (per timescale):
├── Input projection (2 branches)
├── Branch 1: Separable Conv1D → SiLU → Selective B,C projection
├── Input gate + Recurrence gate (from Griffin RG-LRU)
├── Gated recurrence: h_t = a_t·h_{t-1} + √(1 − a_t²)·(i_t·B_t)
└── Gated merge + output projection
```
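The gated recurrence line can be sketched in NumPy. The decay parameterization `a_t = exp(-c · sigmoid(r_t))` follows Griffin's RG-LRU in spirit, and the `sqrt(1 - a_t²)` factor keeps the state's magnitude roughly constant; all names here are illustrative assumptions, not this repo's actual API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_recurrence(x, gate_r, gate_i, c=8.0):
    """Sketch of h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * B_t).
    gate_r drives the per-channel decay a_t in (0, 1); gate_i is a
    sigmoid input gate. x stands in for the selective B_t projection."""
    T, d = x.shape
    h = np.zeros(d)
    out = np.empty((T, d))
    for t in range(T):
        a_t = np.exp(-c * sigmoid(gate_r[t]))   # decay, strictly in (0, 1)
        i_t = sigmoid(gate_i[t])                # input gate in (0, 1)
        h = a_t * h + np.sqrt(1.0 - a_t**2) * (i_t * x[t])
        out[t] = h
    return out

rng = np.random.default_rng(0)
T, d = 16, 4
y = gated_recurrence(rng.normal(size=(T, d)),
                     rng.normal(size=(T, d)),
                     rng.normal(size=(T, d)))
print(y.shape)
```

Because `a_t² + (√(1 − a_t²))² = 1`, the update mixes old state and new input with unit total weight, so the state stays well-scaled for long sequences.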
## Specs
- Parameters: 19,274,816 (19.3M)
- d_model: 256
- Layers: 6
- State dim: 16
- Timescales: 2
- Context length: 128
## Training
- Dataset: TinyStories (5,000 stories for quick training)
- From scratch: Random initialization, no pretrained components
- Best val_loss: 3.7988
- Val_ppl: 44.6
- Hardware: CPU only
- Optimizer: AdamW (β₁=0.9, β₂=0.95)
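The two validation numbers above are consistent with each other: perplexity is just the exponential of the mean per-token cross-entropy loss.

```python
import math

# Perplexity = exp(mean cross-entropy loss), linking the two
# validation metrics reported above.
val_loss = 3.7988
val_ppl = math.exp(val_loss)
print(round(val_ppl, 1))  # 44.6
```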
## Generated Samples
**Prompt: "Once upon a time"**

> Once upon a time, there was a little girl named Lily. She loved to play outside. She had a big, feeling very excited! She couldn't wait for a big, but she said. The little boy was so happy to make the bird who loved

**Prompt: "A little dog"**

> A little dog who ran to be happy. She was very excited that the park. She wanted to play with the window. "I'm sorry, Lily. It is a voice?" She was so happy! Ben did not give it and

**Prompt: "A girl named Lily"**

> A girl named Lily. She was very happy he was very tall. She saw what he could not want to the forest. She says, "I'm sorry, we can have to go to the water. She was very very excited and happy. Lily

**Prompt: "One day a boy"**

> One day a boy named Lily liked to play with it. He said, "Don't be careful." "We are happy, we have to go to the man that he decided to play with his friends. He was very happy that the dog had a man
## Usage
```python
import torch, json
from model import HydraModel, HydraConfig
from transformers import AutoTokenizer

config = HydraConfig(**json.load(open("config.json")))
model = HydraModel(config)
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "Once upon a time"
ids = torch.tensor([tokenizer.encode(prompt)])
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=50, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
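For reference, the `temperature` and `top_k` arguments passed to `generate` above correspond to a standard sampling scheme, sketched here in NumPy. This is an illustrative reimplementation of the general technique, not the repo's actual decoding code.

```python
import numpy as np

def sample_top_k(logits, temperature=0.8, top_k=40, rng=None):
    """Temperature + top-k sampling: sharpen the distribution with the
    temperature, keep only the top_k highest logits, renormalize, sample."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]            # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())           # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = rng.normal(size=100)
tok = sample_top_k(logits, top_k=5, rng=rng)
print(tok)
```

Lower temperatures and smaller `top_k` make the output more deterministic; the masked tokens get exactly zero probability.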
## Model Files
- `model.pt`: trained weights
- `config.json`: model configuration
- `model.py`: full architecture source code
## Research References
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", arXiv:2312.00752, 2023
- De et al., "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models", arXiv:2402.19427, 2024
- Peng et al., "RWKV: Reinventing RNNs for the Transformer Era", arXiv:2305.13048, 2023
- Eldan & Li, "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?", arXiv:2305.07759, 2023