new architecture
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
Paper • 2401.02994 • Published • 52
MambaByte: Token-free Selective State Space Model
Paper • 2401.13660 • Published • 59
Repeat After Me: Transformers are Better than State Space Models at Copying
Paper • 2402.01032 • Published • 24
BlackMamba: Mixture of Experts for State-Space Models
Paper • 2402.01771 • Published • 25
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Paper • 2402.04248 • Published • 32
KAN: Kolmogorov-Arnold Networks
Paper • 2404.19756 • Published • 116
Zamba: A Compact 7B SSM Hybrid Model
Paper • 2405.16712 • Published • 25
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Paper • 2405.21060 • Published • 68
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Paper • 2406.02657 • Published • 41
Breaking the Attention Bottleneck
Paper • 2406.10906 • Published • 4
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Paper • 2407.04620 • Published • 34
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Paper • 2408.12570 • Published • 32
A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
Paper • 2410.02362 • Published • 18
Differential Transformer
Paper • 2410.05258 • Published • 182
GPT or BERT: why not both?
Paper • 2410.24159 • Published • 14
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Paper • 2410.20672 • Published • 7
SambaMixer: State of Health Prediction of Li-ion Batteries using Mamba State Space Models
Paper • 2411.00233 • Published • 7
Hymba: A Hybrid-head Architecture for Small Language Models
Paper • 2411.13676 • Published • 48
Gated Delta Networks: Improving Mamba2 with Delta Rule
Paper • 2412.06464 • Published • 16
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 108
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper • 2503.14456 • Published • 154
Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks
Paper • 2508.21172 • Published • 2
Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling
Paper • 2509.00605 • Published • 43
Less is More: Recursive Reasoning with Tiny Networks
Paper • 2510.04871 • Published • 513
Paper • 2601.00417 • Published • 34
Nested Learning: The Illusion of Deep Learning Architectures
Paper • 2512.24695 • Published • 45
Recursive Language Models
Paper • 2512.24601 • Published • 94
Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
Paper • 2603.07475 • Published • 3
Effective Distillation to Hybrid xLSTM Architectures
Paper • 2603.15590 • Published • 32
Residual Stream Duality in Modern Transformer Architectures
Paper • 2603.16039 • Published • 4