LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding Paper • 2605.27365 • Published May 26 • 144
MMAudio — generating synchronized audio from video/text 🔊 Space • Generate synchronized audio for videos from text prompts • 969
Video Analysis and Generation via a Semantic Progress Function Paper • 2604.22554 • Published Apr 24 • 64
WildDet3D: Scaling Promptable 3D Detection in the Wild Paper • 2604.08626 • Published Apr 9 • 248
ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks Paper • 2603.27862 • Published Mar 29 • 33
UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation Paper • 2603.23500 • Published Mar 24 • 37
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model Paper • 2603.21986 • Published Mar 23 • 125
VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction Paper • 2602.13294 • Published Feb 9 • 13
VideoMaMa: Mask-Guided Video Matting via Generative Prior Paper • 2601.14255 • Published Jan 20 • 15