HYDRA: Hybrid Dynamic Recurrent Architecture

A novel non-transformer language model built from scratch. Trained on CPU using a custom architecture that combines Mamba's selective state spaces, Griffin's Real-Gated Linear Recurrence (RG-LRU), and RWKV's channel mixing, with zero attention layers.

Architecture

Component                      Source    Paper / Notes
Selective State Spaces         Mamba     arXiv:2312.00752
Real-Gated Linear Recurrence   Griffin   arXiv:2402.19427
Time/Channel Mixing            RWKV      arXiv:2305.13048
Multi-Scale Compression        Novel     Parallel recurrences at different timescales (sketch below)
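
The multi-scale row is the novel piece. Below is a minimal sketch of parallel recurrences at different timescales, assuming a power-of-two update schedule per scale; the class name, the dilation scheme, and the merge layer are illustrative guesses, not the model.py implementation:

import torch
import torch.nn as nn

class MultiScaleRecurrence(nn.Module):
    """Run one linear recurrence per timescale in parallel, then merge.

    Hypothetical sketch: scale k updates its state every 2**k steps,
    so slower scales integrate information over longer horizons.
    """
    def __init__(self, d_model: int, num_scales: int = 2):
        super().__init__()
        self.num_scales = num_scales
        # One learned decay per scale and channel, squashed into (0, 1).
        self.decay_logits = nn.Parameter(torch.zeros(num_scales, d_model))
        self.merge = nn.Linear(num_scales * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        outs = []
        for k in range(self.num_scales):
            a = torch.sigmoid(self.decay_logits[k])       # (D,)
            h = x.new_zeros(B, D)
            ys = []
            for t in range(T):
                if t % (2 ** k) == 0:                     # slower updates for larger k
                    h = a * h + (1 - a) * x[:, t]
                ys.append(h)
            outs.append(torch.stack(ys, dim=1))           # (B, T, D)
        return self.merge(torch.cat(outs, dim=-1))        # back to (B, T, D)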

Key Properties

  • NOT a transformer: no attention anywhere in the stack
  • O(n) time: linear in sequence length, vs. O(n²) for transformers
  • Constant memory at inference: no KV cache (see the sketch after this list)
  • Content-aware selective gating (Mamba + Griffin fusion)
  • Multi-scale temporal processing
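
Constant memory follows directly from the recurrent formulation: generation carries a fixed-size state instead of a KV cache that grows with every token. A toy decoding loop, with an assumed state shape and a stand-in constant gate rather than the real step function:

import torch

d_model, d_state = 256, 16
state = torch.zeros(d_model, d_state)        # fixed size, independent of sequence length

def step(token_emb: torch.Tensor, state: torch.Tensor):
    """One hypothetical recurrent step: decay the state, write in the new token."""
    a = torch.full_like(state, 0.9)          # stand-in for a learned, content-aware gate
    state = a * state + token_emb.unsqueeze(-1)
    y = state.sum(dim=-1)                    # readout, shape (d_model,)
    return y, state

for t in range(128):                         # memory use stays flat across all steps
    y, state = step(torch.randn(d_model), state)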

Architecture Diagram

Token Embedding → N × HydraBlock → RMSNorm → LM Head

HydraBlock:
  ├── RMSNorm → SelectiveGatedRecurrence → + residual
  └── RMSNorm → GatedChannelMixing (GeGLU) → + residual
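
A minimal PyTorch sketch of this block wiring, assuming a GeGLU-style channel mixer and pre-norm residuals; class and argument names are guesses, and the real definitions live in model.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedChannelMixing(nn.Module):
    """GeGLU-style mixer: a value branch modulated by a GELU-activated gate."""
    def __init__(self, d_model: int, expand: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, 2 * expand * d_model)
        self.down = nn.Linear(expand * d_model, d_model)

    def forward(self, x):
        v, g = self.up(x).chunk(2, dim=-1)
        return self.down(v * F.gelu(g))

class HydraBlock(nn.Module):
    """Pre-norm residual block: temporal sublayer, then channel sublayer."""
    def __init__(self, d_model: int, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)     # nn.RMSNorm needs PyTorch >= 2.4
        self.norm2 = nn.RMSNorm(d_model)
        self.mixer = mixer                   # e.g. a SelectiveGatedRecurrence
        self.mlp = GatedChannelMixing(d_model)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))    # recurrence sublayer + residual
        x = x + self.mlp(self.norm2(x))      # channel mixing + residual
        return x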

SelectiveGatedRecurrence (per timescale):
  ├── Input projection (2 branches)
  ├── Branch 1: Separable Conv1D → SiLU → Selective B,C projection
  ├── Input gate + Recurrence gate (from Griffin's RG-LRU)
  ├── Gated recurrence: h_t = a_t·h_{t-1} + √(1 - a_t²)·(i_t·B_t)
  └── Gated merge + output projection
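
The recurrence line is the Griffin-flavored update: the √(1 - a_t²) factor scales down the input contribution as the gate a_t approaches 1, keeping the state magnitude bounded. A sequential reference implementation under assumed (B, T, D) shapes; the real layer also applies the selective B, C projections, omitted here for brevity:

import torch

def selective_gated_recurrence(x, a_gate, i_gate):
    """h_t = a_t * h_{t-1} + sqrt(1 - a_t**2) * (i_t * x_t), elementwise.

    x      : input branch after conv + SiLU, shape (B, T, D)
    a_gate : recurrence gate in (0, 1),      shape (B, T, D)
    i_gate : input gate in (0, 1),           shape (B, T, D)
    """
    B, T, D = x.shape
    h = x.new_zeros(B, D)
    ys = []
    for t in range(T):
        a = a_gate[:, t]
        h = a * h + torch.sqrt(1 - a * a) * (i_gate[:, t] * x[:, t])
        ys.append(h)
    return torch.stack(ys, dim=1)  # (B, T, D)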

Specs

  • Parameters: 19,274,816 (19.3M)
  • d_model: 256
  • Layers: 6
  • State dim: 16
  • Timescales: 2
  • Context length: 128
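
For reference, these specs would map onto a config dict like the one below. The field names are hypothetical; the authoritative schema is whatever HydraConfig in model.py defines, so check config.json rather than copying this.

# Hypothetical field names; see config.json for the real schema.
config = {
    "vocab_size": 50257,      # GPT-2 BPE tokenizer (see Usage below)
    "d_model": 256,
    "n_layers": 6,
    "d_state": 16,
    "num_timescales": 2,
    "context_length": 128,
}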

Training

  • Dataset: TinyStories (a 5,000-story subset, for quick training)
  • From scratch: random initialization, no pretrained components
  • Best val loss: 3.7988
  • Val perplexity: 44.6 (exp(3.7988) ≈ 44.6)
  • Hardware: CPU only
  • Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95)
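
A minimal training-step sketch matching this setup. Only the AdamW betas come from the card; the learning rate, weight decay, and batching are assumed values, and `model` is taken to be the HydraModel from the Usage section, returning logits of shape (B, T, vocab):

import torch
import torch.nn.functional as F

# `model` is assumed to be a HydraModel instance (see Usage below).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

def train_step(batch):                        # batch: (B, T+1) token ids
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)                    # (B, T, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()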

Generated Samples

Four prompts and the model's verbatim continuations:

Prompt: "Once upon a time"
Output: Once upon a time, there was a little girl named Lily. She loved to play outside. She had a big, feeling very excited! She couldn't wait for a big, but she said. The little boy was so happy to make the bird who loved

Prompt: "A little dog"
Output: A little dog who ran to be happy. She was very excited that the park. She wanted to play with the window. "I'm sorry, Lily. It is a voice?" She was so happy! Ben did not give it and

Prompt: "A girl named Lily"
Output: A girl named Lily. She was very happy he was very tall. She saw what he could not want to the forest. She says, "I'm sorry, we can have to go to the water. She was very very excited and happy. Lily

Prompt: "One day a boy"
Output: One day a boy named Lily liked to play with it. He said, "Don't be careful." "We are happy, we have to go to the man that he decided to play with his friends. He was very happy that the dog had a man

Usage

import torch, json
from model import HydraModel, HydraConfig
from transformers import AutoTokenizer

# Rebuild the architecture from the saved config, then load the trained weights.
config = HydraConfig(**json.load(open("config.json")))
model = HydraModel(config)
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()

# The model was trained on GPT-2 BPE tokens, so reuse the GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "Once upon a time"
ids = torch.tensor([tokenizer.encode(prompt)])
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=50, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Model Files

  • model.pt: trained weights
  • config.json: model configuration
  • model.py: full architecture source code

Research References

  1. Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", arXiv:2312.00752, 2023
  2. De et al., "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models", arXiv:2402.19427, 2024
  3. Peng et al., "RWKV: Reinventing RNNs for the Transformer Era", arXiv:2305.13048, 2023
  4. Eldan & Li, "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?", arXiv:2305.07759, 2023