SketchVLM: Vision language models can annotate images to explain thoughts and guide users Paper • 2604.22875 • Published 21 days ago • 35
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol Paper • 2603.24943 • Published Mar 26 • 12
RealMaster: Lifting Rendered Scenes into Photorealistic Video Paper • 2603.23462 • Published Mar 24 • 33
RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models Paper • 2603.25502 • Published Mar 26 • 57
SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM Paper • 2603.23386 • Published Mar 24 • 40
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning Paper • 2603.23483 • Published Mar 24 • 62
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence Paper • 2603.13398 • Published Mar 11 • 154
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance Paper • 2305.05176 • Published May 9, 2023 • 7
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22, 2024 • 260
Article CodeGemma - an official Google release for code LLMs +4 pcuenq, osanseviero, reach-vb, philschmid, mishig, loubnabnl • Apr 9, 2024 • 107
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Paper • 2404.02905 • Published Apr 3, 2024 • 74
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward Paper • 2404.01258 • Published Apr 1, 2024 • 12
Improving Text-to-Image Consistency via Automatic Prompt Optimization Paper • 2403.17804 • Published Mar 26, 2024 • 19
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation Paper • 2403.16990 • Published Mar 25, 2024 • 25
WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs Paper • 2403.07944 • Published Mar 10, 2024 • 1