EvolveBench

Activity Feed

authored a paper 2 months ago

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Paper • 2605.30621 • Published May 28 • 24

submitted a paper to Daily Papers 2 months ago

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Paper • 2605.30621 • Published May 28 • 24

submitted a paper to Daily Papers 2 months ago

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Paper • 2605.20177 • Published May 19 • 10

authored a paper 3 months ago

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Paper • 2605.20176 • Published May 19 • 12

submitted a paper to Daily Papers 3 months ago

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Paper • 2605.20176 • Published May 19 • 12

authored 3 papers 3 months ago

m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Paper • 2504.00869 • Published Apr 1, 2025 • 10

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Paper • 2605.20177 • Published May 19 • 10

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Paper • 2605.20176 • Published May 19 • 12

authored 4 papers 3 months ago

Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

Paper • 2508.17380 • Published Aug 24, 2025 • 7

Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

Paper • 2604.15706 • Published Apr 17 • 10

Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

Paper • 2604.20200 • Published Apr 22 • 5

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Paper • 2604.21375 • Published Apr 23 • 19

submitted a paper to Daily Papers 4 months ago

Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

Paper • 2604.15706 • Published Apr 17 • 10

authored 3 papers 4 months ago

STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Paper • 2504.01903 • Published Apr 2, 2025 • 1

AHELM: A Holistic Evaluation of Audio-Language Models

Paper • 2508.21376 • Published Aug 29, 2025 • 9

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Paper • 2511.02779 • Published Nov 4, 2025 • 60

authored a paper 4 months ago

MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution

Paper • 2603.18718 • Published Mar 19 • 10

authored 2 papers 4 months ago

Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales

Paper • 2510.10880 • Published Oct 13, 2025

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Paper • 2311.16101 • Published Nov 27, 2023 • 1