MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models Paper • 2511.18373 • Published Nov 23, 2025 • 7
Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO Paper • 2511.13288 • Published Nov 17, 2025 • 19
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens Paper • 2511.19418 • Published Nov 24, 2025 • 29
Temporal Prompting Matters: Rethinking Referring Video Object Segmentation Paper • 2510.07319 • Published Oct 8, 2025 • 3
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe Paper • 2511.16334 • Published Nov 20, 2025 • 94
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents Paper • 2511.13593 • Published Nov 17, 2025 • 28
RynnVLA-002: A Unified Vision-Language-Action and World Model Paper • 2511.17502 • Published Nov 21, 2025 • 28
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models Paper • 2511.11007 • Published Nov 14, 2025 • 15
Depth Anything 3: Recovering the Visual Space from Any Views Paper • 2511.10647 • Published Nov 13, 2025 • 101
LightRAG: Simple and Fast Retrieval-Augmented Generation Paper • 2410.05779 • Published Oct 8, 2024 • 31
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model Paper • 2510.14528 • Published Oct 16, 2025 • 121
TradingAgents: Multi-Agents LLM Financial Trading Framework Paper • 2412.20138 • Published Dec 28, 2024 • 24
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation Paper • 2410.17799 • Published Oct 23, 2024 • 12
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image Paper • 2511.13648 • Published Nov 17, 2025 • 53
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26, 2025 • 152
Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery Paper • 2510.15869 • Published Oct 17, 2025 • 50
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization Paper • 2511.15705 • Published Nov 19, 2025 • 98
FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning Paper • 2510.22543 • Published Oct 26, 2025 • 14
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning Paper • 2511.19900 • Published Nov 25, 2025 • 49
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models Paper • 2512.10867 • Published Dec 11, 2025 • 16
Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities Paper • 2503.04721 • Published Mar 6, 2025 • 4
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild Paper • 2603.17187 • Published 2 days ago • 90
MosaicMem: Hybrid Spatial Memory for Controllable Video World Models Paper • 2603.17117 • Published 2 days ago • 69
AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents Paper • 2603.16496 • Published 2 days ago • 8
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs Paper • 2603.18004 • Published 1 day ago • 5
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence Paper • 2603.13398 • Published 8 days ago • 135
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation Paper • 2603.16669 • Published 2 days ago • 65
SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation Paper • 2603.16864 • Published 2 days ago • 14
Omnilingual MT: Machine Translation for 1,600 Languages Paper • 2603.16309 • Published 3 days ago • 12
ViT-AdaLA: Adapting Vision Transformers with Linear Attention Paper • 2603.16063 • Published 3 days ago • 2
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data Paper • 2603.15594 • Published 3 days ago • 136
NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval Paper • 2603.12824 • Published 7 days ago • 5