arxiv:2605.27492

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Published on May 26

· Submitted by

Ouyang Yipeng on Jun 4

Sun Yat-Sen University


Abstract

Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.


Community

Fernandez-Owen

Paper submitter about 2 hours ago

🚀 Benchmarks are Not Enough: RAMP for Evaluating Agentic Models in Real-World Production Systems

Hi Hugging Face Community! 👋 We are excited to share our latest work that challenges the current paradigm of LLM agent evaluation benchmarks.

While Large Language Models (LLMs) are rapidly evolving into autonomous software engineering systems, existing evaluation methodologies are largely centered on static, isolated, and short-horizon benchmarks. We found that high scores on traditional benchmarks poorly reflect practical capabilities under realistic runtime environments that involve long execution chains, tool interactions, and iterative feedback loops.

To bridge this gap, we introduce RAMP (Runtime Assessment of Models in Production), an infrastructure designed to evaluate agents in continuous, stateful, and resource-constrained engineering workflows.

How Far Can Today’s Agentic Models Go on the Ramp to Real System Development?

🛠️ The Workload: Real-World Compiler Construction

Instead of isolated coding puzzles, RAMP evaluates agents on a 6-stage compiler-construction pipeline (based on YatCC). The tasks range from environment setup (T0) and lexer generation (T1) all the way to LLVM IR optimization (T4) and RV64 assembly generation (T5). Each task consumes the output artifact of its predecessor, creating a strict serial dependency chain.
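To make the serial dependency concrete, here is a minimal sketch of a strict chain with no recovery, where one failed stage leaves every later stage unreachable. Only the stage IDs T0 to T5 come from the post; `run_serial`, `make_solver`, and the artifact strings are hypothetical stand-ins, not RAMP's actual orchestrator API.

```python
from typing import Callable, Dict, Optional

# Stage IDs as named in the post; everything else here is illustrative.
STAGES = ["T0", "T1", "T2", "T3", "T4", "T5"]

Solver = Callable[[Optional[str]], Optional[str]]


def run_serial(solvers: Dict[str, Solver]) -> Dict[str, str]:
    """Run stages in order; each stage consumes its predecessor's artifact."""
    status: Dict[str, str] = {}
    artifact: Optional[str] = None  # T0 has no predecessor artifact
    blocked = False
    for stage in STAGES:
        if blocked:
            # No input artifact exists, so the stage is never attempted.
            status[stage] = "unreachable"
            continue
        artifact = solvers[stage](artifact)
        if artifact is None:
            status[stage] = "failed"
            blocked = True
        else:
            status[stage] = "passed"
    return status


def make_solver(stage: str, ok: bool) -> Solver:
    """Hypothetical agent stub: returns an artifact on success, None on failure."""
    def solve(prev: Optional[str]) -> Optional[str]:
        return f"{stage}-artifact" if ok else None
    return solve


# Synthetic agent that fails only at T2.
solvers = {s: make_solver(s, ok=(s != "T2")) for s in STAGES}
status = run_serial(solvers)
print(status)
```

In this synthetic run, T3 through T5 are reported as unreachable even though the stub agent would have passed them, which is the ambiguity a strict chain creates.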

✨ Core Innovation: The "Resurrection Protocol"

In long-horizon tasks, a failure at an early stage usually invalidates all downstream steps, obscuring the model's true capabilities. To solve this, RAMP introduces a Resurrection Protocol.

When an agent fails an intermediate task, the orchestrator automatically and transparently injects a "golden artifact" (a perfect intermediate state) and lets the agent continue. This allows us to separate "cannot reach" from "cannot solve," providing unprecedented diagnostic granularity.
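The idea can be sketched roughly as follows, assuming a reference artifact exists for every stage. The function name `run_with_resurrection`, the `golden` mapping, and the stub agent are hypothetical and only illustrate the control flow described above; they are not taken from the RAMP codebase.

```python
from typing import Callable, Dict, Optional

STAGES = ["T0", "T1", "T2", "T3", "T4", "T5"]

Solver = Callable[[Optional[str]], Optional[str]]


def run_with_resurrection(
    solvers: Dict[str, Solver], golden: Dict[str, str]
) -> Dict[str, Dict[str, bool]]:
    """Attempt every stage; after a failure, continue from the golden artifact."""
    results: Dict[str, Dict[str, bool]] = {}
    artifact: Optional[str] = None
    for stage in STAGES:
        produced = solvers[stage](artifact)
        if produced is not None:
            results[stage] = {"passed": True, "resurrected": False}
            artifact = produced
        else:
            # The agent failed: record it, then inject the reference artifact
            # so the next stage still starts from a valid intermediate state.
            results[stage] = {"passed": False, "resurrected": True}
            artifact = golden[stage]
    return results


def make_solver(stage: str, ok: bool) -> Solver:
    """Hypothetical agent stub: returns an artifact on success, None on failure."""
    def solve(prev: Optional[str]) -> Optional[str]:
        return f"{stage}-artifact" if ok else None
    return solve


# Synthetic agent that cannot solve T2 or T4.
failing = {"T2", "T4"}
solvers = {s: make_solver(s, ok=(s not in failing)) for s in STAGES}
golden = {s: f"{s}-golden" for s in STAGES}

results = run_with_resurrection(solvers, golden)
cannot_solve = [s for s, r in results.items() if not r["passed"]]
print(cannot_solve)
```

Because every stage is attempted from a valid input, the stages listed in `cannot_solve` are genuine failures rather than stages the agent never reached.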

Figure 3: Long-horizon assessment workloads in the integrated pipeline of RAMP, demonstrating Serial Evolution and the Resurrection Protocol.

📊 Shocking Findings from 15 SOTA Models

We evaluated 15 models (including Opus-4.7, GPT-5.5, DeepSeek-v4-Pro, and Qwen-3.6-Max). The results were eye-opening:

A Clear Capability Ceiling: None of the 15 evaluated models successfully completed the entire pipeline. Even the top-performing model, Opus-4.7, stalled at the IR Generation stage.
The 2525x Efficiency Gap: Process efficiency varied wildly. Total inference costs ranged from $0.05 (Qwen3-Coder) to $126.24 (Opus-4.7) — an extreme 2525x difference.
The "Context" Killer: We mapped a detailed failure taxonomy and discovered that Context Failure is the most prevalent hard-stop reason (accounting for 60.0% of failures), predominantly occurring in the middle stages (T2-T3) as history and code artifacts accumulate.
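The 2525x figure follows directly from the two cost endpoints quoted above, and it is consistent with the "three orders of magnitude" wording in the abstract. A quick check, using only the two dollar amounts from the post:

```python
import math

cheapest = 0.05          # USD, Qwen3-Coder (from the post)
most_expensive = 126.24  # USD, Opus-4.7 (from the post)

ratio = most_expensive / cheapest
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.1f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

The ratio rounds to 2525, and its base-10 logarithm falls between 3 and 4.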

Figure 5: Trade-off of Cost and Performance: Elapsed time and API cost versus mean reward.

⚖️ Beyond Accuracy: The Agent Efficiency Index (AEI)

In production, a model that brute-forces a solution using massive context and time isn't always the best choice. We propose the Agent Efficiency Index (AEI), a composite metric jointly measuring task effectiveness, time, cost, and token utilization.

Under AEI, the rankings flip: GPT-5.5 achieved the highest composite efficiency (AEI 81.57), whereas Opus-4.7, despite having the highest raw task reward, dropped to an AEI of 40.00 due to massive resource overhead.
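The post does not give the AEI formula, so the following is only a hypothetical composite showing how a ranking can flip once resources are counted. The weights, the best-over-own normalization, the function name `efficiency_index`, and both example runs are assumptions made for illustration; they are not the paper's definition or data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Run:
    name: str
    reward: float   # task effectiveness in [0, 1]
    seconds: float  # elapsed time, must be > 0
    dollars: float  # API cost, must be > 0
    tokens: float   # tokens consumed, must be > 0


def efficiency_index(
    runs: List[Run],
    weights: Tuple[float, float, float, float] = (0.4, 0.2, 0.2, 0.2),
) -> Dict[str, float]:
    """Hypothetical 0-100 score: weighted sum of reward and resource thrift.

    Each resource term is best_value / own_value, so the most frugal run
    scores 1.0 on that term and costlier runs score proportionally less.
    """
    min_s = min(r.seconds for r in runs)
    min_d = min(r.dollars for r in runs)
    min_t = min(r.tokens for r in runs)
    scores: Dict[str, float] = {}
    for r in runs:
        parts = (r.reward, min_s / r.seconds, min_d / r.dollars, min_t / r.tokens)
        scores[r.name] = 100 * sum(w * p for w, p in zip(weights, parts))
    return scores


# Synthetic runs, not results from the paper.
runs = [
    Run("model_a", reward=0.9, seconds=36_000, dollars=100.0, tokens=5e7),
    Run("model_b", reward=0.7, seconds=3_600, dollars=5.0, tokens=5e6),
]
scores = efficiency_index(runs)
print(scores)
```

In this synthetic case `model_a` has the higher raw reward but `model_b` gets the higher composite score, which mirrors the kind of reversal the post reports.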

Read the full paper to explore the deep diagnostics of model behavior and why we need to move past static benchmarks!

🔗 Code: https://github.com/Nexa-Language/RAMP
💻 Code & Leaderboard: http://ramp.yatcc-ai.com/

HUANG-XIN

15 minutes ago

The workload is based on the YatCC compiler-construction pipeline, a widely used engineering practice project. Beyond that, the YatCC Platform provides more intelligent services.
More information: https://yatcc-ai.com
