Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2601.22060

Multimodal Agent

about 1 month ago

Gemini Robotics: Bringing AI into the Physical World

Paper • 2503.20020 • Published Mar 25, 2025 • 31
Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published Feb 18, 2025 • 58
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Paper • 2410.23218 • Published Oct 30, 2024 • 49

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Paper • 2602.12036 • Published 26 days ago • 91
Reinforcement Learning for Self-Improving Agent with Skill Library

Paper • 2512.17102 • Published Dec 18, 2025 • 36
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Paper • 2512.23705 • Published Dec 29, 2025 • 45
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

Paper • 2512.19995 • Published Dec 23, 2025 • 16

Vision-DeepResearch

Osilly/Vision-DeepResearch-Toy-SFT-Data

Viewer • Updated Feb 1 • 1k • 140
Osilly/Vision-DeepResearch-Toy-RL-Data

Viewer • Updated Feb 1 • 1k • 47
Osilly/VDR-Bench

Viewer • Updated Feb 1 • 2k • 42
Osilly/VDR-Bench-testmini

Viewer • Updated Feb 1 • 500 • 96

Applications and Uses

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Paper • 2506.09790 • Published Jun 11, 2025 • 53
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Paper • 2506.06444 • Published Jun 6, 2025 • 73
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Paper • 2506.11763 • Published Jun 13, 2025 • 74
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research

Paper • 2502.04644 • Published Feb 7, 2025 • 4

facebook/w2v-bert-2.0

Feature Extraction • 0.6B • Updated Jan 25, 2024 • 3.08M • 206
facebook/metaclip-h14-fullcc2.5b

Zero-Shot Image Classification • 1.0B • Updated Jan 11, 2024 • 7.58k • 49
openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 6.76M • 1.97k
Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 1.04M • 1.45k

Qwen3-TTS Technical Report

Paper • 2601.15621 • Published Jan 22 • 70
PaperBanana: Automating Academic Illustration for AI Scientists

Paper • 2601.23265 • Published Jan 30 • 215
Moonshine: Speech Recognition for Live Transcription and Voice Commands

Paper • 2410.15608 • Published Oct 21, 2024 • 10
PersonaLive! Expressive Portrait Image Animation for Live Streaming

Paper • 2512.11253 • Published Dec 12, 2025 • 39

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Paper • 2601.22060 • Published Jan 29 • 154

GUI-G^2: Gaussian Reward Modeling for GUI Grounding

Paper • 2507.15846 • Published Jul 21, 2025 • 133
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Paper • 2508.05748 • Published Aug 7, 2025 • 141
Mobile-Agent-v3: Foundamental Agents for GUI Automation

Paper • 2508.15144 • Published Aug 21, 2025 • 65
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Paper • 2508.16153 • Published Aug 22, 2025 • 160

Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26, 2025 • 170
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Paper • 2505.22453 • Published May 28, 2025 • 46
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Paper • 2505.23380 • Published May 29, 2025 • 22
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Paper • 2505.21523 • Published May 23, 2025 • 13

Multimodal Agent

about 1 month ago

Gemini Robotics: Bringing AI into the Physical World

Paper • 2503.20020 • Published Mar 25, 2025 • 31
Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published Feb 18, 2025 • 58
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Paper • 2410.23218 • Published Oct 30, 2024 • 49

Qwen3-TTS Technical Report

Paper • 2601.15621 • Published Jan 22 • 70
PaperBanana: Automating Academic Illustration for AI Scientists

Paper • 2601.23265 • Published Jan 30 • 215
Moonshine: Speech Recognition for Live Transcription and Voice Commands

Paper • 2410.15608 • Published Oct 21, 2024 • 10
PersonaLive! Expressive Portrait Image Animation for Live Streaming

Paper • 2512.11253 • Published Dec 12, 2025 • 39

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Paper • 2602.12036 • Published 26 days ago • 91
Reinforcement Learning for Self-Improving Agent with Skill Library

Paper • 2512.17102 • Published Dec 18, 2025 • 36
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Paper • 2512.23705 • Published Dec 29, 2025 • 45
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

Paper • 2512.19995 • Published Dec 23, 2025 • 16

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Paper • 2601.22060 • Published Jan 29 • 154

Vision-DeepResearch

Osilly/Vision-DeepResearch-Toy-SFT-Data

Viewer • Updated Feb 1 • 1k • 140
Osilly/Vision-DeepResearch-Toy-RL-Data

Viewer • Updated Feb 1 • 1k • 47
Osilly/VDR-Bench

Viewer • Updated Feb 1 • 2k • 42
Osilly/VDR-Bench-testmini

Viewer • Updated Feb 1 • 500 • 96

GUI-G^2: Gaussian Reward Modeling for GUI Grounding

Paper • 2507.15846 • Published Jul 21, 2025 • 133
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Paper • 2508.05748 • Published Aug 7, 2025 • 141
Mobile-Agent-v3: Foundamental Agents for GUI Automation

Paper • 2508.15144 • Published Aug 21, 2025 • 65
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Paper • 2508.16153 • Published Aug 22, 2025 • 160

Applications and Uses

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Paper • 2506.09790 • Published Jun 11, 2025 • 53
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Paper • 2506.06444 • Published Jun 6, 2025 • 73
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Paper • 2506.11763 • Published Jun 13, 2025 • 74
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research

Paper • 2502.04644 • Published Feb 7, 2025 • 4

Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26, 2025 • 170
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Paper • 2505.22453 • Published May 28, 2025 • 46
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Paper • 2505.23380 • Published May 29, 2025 • 22
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Paper • 2505.21523 • Published May 23, 2025 • 13

facebook/w2v-bert-2.0

Feature Extraction • 0.6B • Updated Jan 25, 2024 • 3.08M • 206
facebook/metaclip-h14-fullcc2.5b

Zero-Shot Image Classification • 1.0B • Updated Jan 11, 2024 • 7.58k • 49
openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 6.76M • 1.97k
Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 1.04M • 1.45k

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs