kaizuberbuehler's Collections: LM Architectures
Megalodon: Efficient LLM Pretraining and Inference with Unlimited
Context Length
Paper
• 2404.08801
• Published • 66
RecurrentGemma: Moving Past Transformers for Efficient Open Language
Models
Paper
• 2404.07839
• Published • 48
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Paper
• 2404.05892
• Published • 40
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Paper
• 2312.00752
• Published • 150
Multi-Head Mixture-of-Experts
Paper
• 2404.15045
• Published • 60
Jamba: A Hybrid Transformer-Mamba Language Model
Paper
• 2403.19887
• Published • 112
KAN: Kolmogorov-Arnold Networks
Paper
• 2404.19756
• Published • 116
Better & Faster Large Language Models via Multi-token Prediction
Paper
• 2404.19737
• Published • 80
Contextual Position Encoding: Learning to Count What's Important
Paper
• 2405.18719
• Published • 5
Transformers are SSMs: Generalized Models and Efficient Algorithms
Through Structured State Space Duality
Paper
• 2405.21060
• Published • 68
An Image is Worth More Than 16x16 Patches: Exploring Transformers on
Individual Pixels
Paper
• 2406.09415
• Published • 51
Alleviating Distortion in Image Generation via Multi-Resolution
Diffusion Models
Paper
• 2406.09416
• Published • 29
Transformers meet Neural Algorithmic Reasoners
Paper
• 2406.09308
• Published • 44
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context
Language Modeling
Paper
• 2406.07522
• Published • 40
Explore the Limits of Omni-modal Pretraining at Scale
Paper
• 2406.09412
• Published • 11
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo
Tree Self-refine with LLaMa-3 8B
Paper
• 2406.07394
• Published • 29
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
• 2406.11816
• Published • 26
Mixture of A Million Experts
Paper
• 2407.04153
• Published • 5
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Paper
• 2407.12854
• Published • 31
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware
Experts
Paper
• 2407.21770
• Published • 22
Transformer Explainer: Interactive Learning of Text-Generative Models
Paper
• 2408.04619
• Published • 175
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Paper
• 2408.12570
• Published • 32
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
• 2408.12528
• Published • 51
LLMs + Persona-Plug = Personalized LLMs
Paper
• 2409.11901
• Published • 35
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper
• 2409.16280
• Published • 18
Differential Transformer
Paper
• 2410.05258
• Published • 182
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper
• 2412.09871
• Published • 108
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper
• 2412.01169
• Published • 13
Monet: Mixture of Monosemantic Experts for Transformers
Paper
• 2412.04139
• Published • 13
MH-MoE: Multi-Head Mixture-of-Experts
Paper
• 2411.16205
• Published • 26
Hymba: A Hybrid-head Architecture for Small Language Models
Paper
• 2411.13676
• Published • 47
SageAttention2 Technical Report: Accurate 4 Bit Attention for
Plug-and-play Inference Acceleration
Paper
• 2411.10958
• Published • 57
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Paper
• 2411.04965
• Published • 69
Mixture-of-Transformers: A Sparse and Scalable Architecture for
Multi-Modal Foundation Models
Paper
• 2411.04996
• Published • 50
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep
Thinking
Paper
• 2501.04519
• Published • 290
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper
• 2501.08313
• Published • 302
Tensor Product Attention Is All You Need
Paper
• 2501.06425
• Published • 90
Transformer^2: Self-adaptive LLMs
Paper
• 2501.06252
• Published • 55
Learnings from Scaling Visual Tokenizers for Reconstruction and
Generation
Paper
• 2501.09755
• Published • 35
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
• 2501.09747
• Published • 29
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative
Textual Feedback
Paper
• 2501.12895
• Published • 61
Autonomy-of-Experts Models
Paper
• 2501.13074
• Published • 44
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for
Mixture-of-Experts Language Models
Paper
• 2501.12370
• Published • 11
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Paper
• 2501.16975
• Published • 32
PixelWorld: Towards Perceiving Everything as Pixels
Paper
• 2501.19339
• Published • 17
Scaling Embedding Layers in Language Models
Paper
• 2502.01637
• Published • 24
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
Paper
• 2502.01068
• Published • 18
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth
Approach
Paper
• 2502.05171
• Published • 154
CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
Paper
• 2502.04416
• Published • 12
The Curse of Depth in Large Language Models
Paper
• 2502.05795
• Published • 40
Paper
• 2502.06049
• Published • 31
TransMLA: Multi-head Latent Attention Is All You Need
Paper
• 2502.07864
• Published • 61
LLM Pretraining with Continuous Concepts
Paper
• 2502.08524
• Published • 30
Large Language Diffusion Models
Paper
• 2502.09992
• Published • 127
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention
Paper
• 2502.11089
• Published • 169
Continuous Diffusion Model for Language Modeling
Paper
• 2502.11564
• Published • 53
Multimodal Mamba: Decoder-only Multimodal State Space Model via
Quadratic to Linear Distillation
Paper
• 2502.13145
• Published • 38
MoBA: Mixture of Block Attention for Long-Context LLMs
Paper
• 2502.13189
• Published • 17
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published • 96
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
• 2503.01743
• Published • 89
Transformers without Normalization
Paper
• 2503.10622
• Published • 172
Block Diffusion: Interpolating Between Autoregressive and Diffusion
Language Models
Paper
• 2503.09573
• Published • 77
Forgetting Transformer: Softmax Attention with a Forget Gate
Paper
• 2503.02130
• Published • 32
OmniMamba: Efficient and Unified Multimodal Understanding and Generation
via State Space Models
Paper
• 2503.08686
• Published • 19
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper
• 2503.14456
• Published • 154
Technologies on Effectiveness and Efficiency: A Survey of State Spaces
Models
Paper
• 2503.11224
• Published • 28
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language
Models
Paper
• 2503.16257
• Published • 27
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Paper
• 2503.11579
• Published • 21
I Have Covered All the Bases Here: Interpreting Reasoning Features in
Large Language Models via Sparse Autoencoders
Paper
• 2503.18878
• Published • 120
Dita: Scaling Diffusion Transformer for Generalist
Vision-Language-Action Policy
Paper
• 2503.19757
• Published • 51
FFN Fusion: Rethinking Sequential Computation in Large Language Models
Paper
• 2503.18908
• Published • 19
Multi-Token Attention
Paper
• 2504.00927
• Published • 56
ShortV: Efficient Multimodal Large Language Models by Freezing Visual
Tokens in Ineffective Layers
Paper
• 2504.00502
• Published • 26
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization
for Test-Time Expert Re-Mixing
Paper
• 2504.07964
• Published • 62
Scaling Laws for Native Multimodal Models
Paper
• 2504.07951
• Published • 30
TransMamba: Flexibly Switching between Transformer and Mamba
Paper
• 2503.24067
• Published • 21
BitNet b1.58 2B4T Technical Report
Paper
• 2504.12285
• Published • 83
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Paper
• 2504.10449
• Published • 15
EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language
Models
Paper
• 2504.15133
• Published • 26