EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling Paper • 2502.09509 • Published Feb 13, 2025 • 9
YOLOv12: Attention-Centric Real-Time Object Detectors Paper • 2502.12524 • Published Feb 18, 2025 • 12
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20, 2025 • 165
ObjectMover: Generative Object Movement with Video Prior Paper • 2503.08037 • Published Mar 11, 2025 • 5
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models Paper • 2503.09573 • Published Mar 12, 2025 • 77
RWKV-7 "Goose" with Expressive Dynamic State Evolution Paper • 2503.14456 • Published Mar 18, 2025 • 153
TransMamba: Flexibly Switching between Transformer and Mamba Paper • 2503.24067 • Published Mar 31, 2025 • 21
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Paper • 2504.20966 • Published Apr 29, 2025 • 31
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation Paper • 2506.19852 • Published Jun 24, 2025 • 43
Representing Speech Through Autoregressive Prediction of Cochlear Tokens Paper • 2508.11598 • Published Aug 15, 2025 • 18
2D Gaussian Splatting with Semantic Alignment for Image Inpainting Paper • 2509.01964 • Published Sep 2, 2025 • 7
Latent Diffusion Model without Variational Autoencoder Paper • 2510.15301 • Published Oct 17, 2025 • 50
Bolmo: Byteifying the Next Generation of Language Models Paper • 2512.15586 • Published Dec 17, 2025 • 18
ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation Paper • 2601.03955 • Published Jan 7 • 3
ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers Paper • 2601.05741 • Published Jan 9 • 2
Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings Paper • 2512.12167 • Published Dec 13, 2025 • 5
Implicit Neural Representation Facilitates Unified Universal Vision Encoding Paper • 2601.14256 • Published Jan 20 • 7
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding Paper • 2506.16035 • Published Jun 19, 2025 • 89
Scaling Embeddings Outperforms Scaling Experts in Language Models Paper • 2601.21204 • Published Jan 29 • 104
Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection Paper • 2602.03216 • Published Feb 3 • 13
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model Paper • 2603.21986 • Published Mar 23 • 125
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding Paper • 2605.27365 • Published 1 day ago • 88