ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models Paper • 2412.07012 • Published Dec 9, 2024
m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks Paper • 2403.11085 • Published Mar 17, 2024
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding Paper • 2601.10611 • Published Jan 15
SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding Paper • 2510.09110 • Published Oct 10, 2025
Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos Paper • 2602.23543 • Published Feb 26
WildDet3D: Scaling Promptable 3D Detection in the Wild Paper • 2604.08626 • Published 29 days ago
MolmoAct2: Action Reasoning Models for Real-world Deployment Paper • 2605.02881 • Published 4 days ago
WildDet3D Collection This is the collection of WildDet3D artifacts, including demos, model checkpoints and data. https://github.com/allenai/WildDet3D • 8 items • Updated 24 days ago