new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 9

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property -- math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output -- constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.

  • 1 authors
·
May 29

MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits

To reduce development overhead and enable seamless integration between potential components comprising any given generative AI application, the Model Context Protocol (MCP) (Anthropic, 2024) has recently been released and subsequently widely adopted. The MCP is an open protocol that standardizes API calls to large language models (LLMs), data sources, and agentic tools. By connecting multiple MCP servers, each defined with a set of tools, resources, and prompts, users are able to define automated workflows fully driven by LLMs. However, we show that the current MCP design carries a wide range of security risks for end users. In particular, we demonstrate that industry-leading LLMs may be coerced into using MCP tools to compromise an AI developer's system through various attacks, such as malicious code execution, remote access control, and credential theft. To proactively mitigate these and related attacks, we introduce a safety auditing tool, MCPSafetyScanner, the first agentic tool to assess the security of an arbitrary MCP server. MCPScanner uses several agents to (a) automatically determine adversarial samples given an MCP server's tools and resources; (b) search for related vulnerabilities and remediations based on those samples; and (c) generate a security report detailing all findings. Our work highlights serious security issues with general-purpose agentic workflows while also providing a proactive tool to audit MCP server safety and address detected vulnerabilities before deployment. The described MCP server auditing tool, MCPSafetyScanner, is freely available at: https://github.com/johnhalloran321/mcpSafetyScanner

  • 2 authors
·
Apr 2, 2025 3

CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects

Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.

  • 7 authors
·
Sep 18, 2025 2

ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. These API-based agents, leveraging the strong autonomy and planning capabilities of LLMs, can efficiently solve problems requiring multi-step actions. However, their ability to handle multi-dimensional difficulty levels, diverse task types, and real-world demands through APIs remains unknown. In this paper, we introduce ShortcutsBench, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving tasks with varying levels of difficulty, diverse task types, and real-world demands. ShortcutsBench includes a wealth of real APIs from Apple Inc.'s operating systems, refined user queries from shortcuts, human-annotated high-quality action sequences from shortcut developers, and accurate parameter filling values about primitive parameter types, enum parameter types, outputs from previous actions, and parameters that need to request necessary information from the system or user. Our extensive evaluation of agents built with 5 leading open-source (size >= 57B) and 4 closed-source LLMs (e.g. Gemini-1.5-Pro and GPT-3.5) reveals significant limitations in handling complex queries related to API selection, parameter filling, and requesting necessary information from systems and users. These findings highlight the challenges that API-based agents face in effectively fulfilling real and complex user queries. All datasets, code, and experimental results will be available at https://github.com/eachsheep/shortcutsbench.

  • 8 authors
·
Jun 28, 2024

Securing the Model Context Protocol (MCP): Risks, Controls, and Governance

The Model Context Protocol (MCP) replaces static, developer-controlled API integrations with more dynamic, user-driven agent systems, which also introduces new security risks. As MCP adoption grows across community servers and major platforms, organizations encounter threats that existing AI governance frameworks (such as NIST AI RMF and ISO/IEC 42001) do not yet cover in detail. We focus on three types of adversaries that take advantage of MCP s flexibility: content-injection attackers that embed malicious instructions into otherwise legitimate data; supply-chain attackers who distribute compromised servers; and agents who become unintentional adversaries by over-stepping their role. Based on early incidents and proof-of-concept attacks, we describe how MCP can increase the attack surface through data-driven exfiltration, tool poisoning, and cross-system privilege escalation. In response, we propose a set of practical controls, including per-user authentication with scoped authorization, provenance tracking across agent workflows, containerized sandboxing with input/output checks, inline policy enforcement with DLP and anomaly detection, and centralized governance using private registries or gateway layers. The aim is to help organizations ensure that unvetted code does not run outside a sandbox, tools are not used beyond their intended scope, data exfiltration attempts are detectable, and actions can be audited end-to-end. We close by outlining open research questions around verifiable registries, formal methods for these dynamic systems, and privacy-preserving agent operations.

  • 3 authors
·
Nov 24, 2025

ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files

As Large Language Model (LLM) agents increasingly execute complex, autonomous software engineering tasks, developers rely on natural language instruction files such as AGENTS.md to express project-specific coding conventions, tooling restrictions, and architectural boundaries. However, because these instructions remain passive text, agents frequently violate documented constraints due to context window saturation or conflicting local context. In autonomous settings without real-time human supervision, such violations rapidly compound into technical debt. To ground autonomous agents in repository constraints, we introduce ContextCov, a framework that transforms passive natural language instructions into executable guardrails. Unlike prompt-only or reflection-only compliance approaches, ContextCov compiles documented constraints into three complementary checks: static AST queries for code patterns, runtime shell shims that intercept prohibited commands, and architectural validators that enforce structural rules. Acting as an automated, continuous reviewer, ContextCov intercepts agent actions and returns immediate, reproducible violation traces, enabling self-correction before non-compliant changes are finalized. We evaluate ContextCov on SWE-bench Lite (12 repositories, 300 tasks). Compared to prompt-only and LLM reflection baselines, ContextCov achieves 88.3% constraint compliance (vs. 67.0% and 50.3%) with 3.4x lower feedback cost, while maintaining functional correctness. The source code and evaluation results are available at https://github.com/reSHARMA/ContextCov.

  • 1 authors
·
May 3

A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification

In this paper we present a novel solution that combines the capabilities of Large Language Models (LLMs) with Formal Verification strategies to verify and automatically repair software vulnerabilities. Initially, we employ Bounded Model Checking (BMC) to locate the software vulnerability and derive a counterexample. The counterexample provides evidence that the system behaves incorrectly or contains a vulnerability. The counterexample that has been detected, along with the source code, are provided to the LLM engine. Our approach involves establishing a specialized prompt language for conducting code debugging and generation to understand the vulnerability's root cause and repair the code. Finally, we use BMC to verify the corrected version of the code generated by the LLM. As a proof of concept, we create ESBMC-AI based on the Efficient SMT-based Context-Bounded Model Checker (ESBMC) and a pre-trained Transformer model, specifically gpt-3.5-turbo, to detect and fix errors in C programs. Our experimentation involved generating a dataset comprising 1000 C code samples, each consisting of 20 to 50 lines of code. Notably, our proposed method achieved an impressive success rate of up to 80% in repairing vulnerable code encompassing buffer overflow and pointer dereference failures. We assert that this automated approach can effectively incorporate into the software development lifecycle's continuous integration and deployment (CI/CD) process.

  • 6 authors
·
May 24, 2023

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. To tackle these issues, our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at 0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of 0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.

  • 9 authors
·
Jun 12, 2025 2

The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection

Prompt injection defenses are often framed as semantic understanding problems and delegated to increasingly large neural detectors. For the first screening layer, however, the requirements are different: the detector runs on every request and therefore must be fast, deterministic, non-promptable, and auditable. We introduce Mirror, a data-curation design pattern that organizes prompt injection corpora into matched positive and negative cells so that a classifier learns control-plane attack mechanics rather than incidental corpus shortcuts. Using 5,000 strictly curated open-source samples -- the largest corpus supportable under our public-data validity contract -- we define a 32-cell mirror topology, fill 31 of those cells with public data, train a sparse character n-gram linear SVM, compile its weights into a static Rust artifact, and obtain 95.97\% recall and 92.07\% F1 on a 524-case holdout at sub-millisecond latency with no external model runtime dependencies. On the same holdout, our next line of defense, a 22-million-parameter Prompt Guard~2 model reaches 44.35\% recall and 59.14\% F1 at 49\,ms median and 324\,ms p95 latency. Linear models still leave residual semantic ambiguities such as use-versus-mention for later pipeline layers, but within that scope our results show that for L1 prompt injection screening, strict data geometry can matter more than model scale.

  • 1 authors
·
Mar 12

IryoNLP at MEDIQA-CORR 2024: Tackling the Medical Error Detection & Correction Task On the Shoulders of Medical Agents

In natural language processing applied to the clinical domain, utilizing large language models has emerged as a promising avenue for error detection and correction on clinical notes, a knowledge-intensive task for which annotated data is scarce. This paper presents MedReAct'N'MedReFlex, which leverages a suite of four LLM-based medical agents. The MedReAct agent initiates the process by observing, analyzing, and taking action, generating trajectories to guide the search to target a potential error in the clinical notes. Subsequently, the MedEval agent employs five evaluators to assess the targeted error and the proposed correction. In cases where MedReAct's actions prove insufficient, the MedReFlex agent intervenes, engaging in reflective analysis and proposing alternative strategies. Finally, the MedFinalParser agent formats the final output, preserving the original style while ensuring the integrity of the error correction process. One core component of our method is our RAG pipeline based on our ClinicalCorp corpora. Among other well-known sources containing clinical guidelines and information, we preprocess and release the open-source MedWiki dataset for clinical RAG application. Our results demonstrate the central role of our RAG approach with ClinicalCorp leveraged through the MedReAct'N'MedReFlex framework. It achieved the ninth rank on the MEDIQA-CORR 2024 final leaderboard.

  • 1 authors
·
Apr 23, 2024

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model's response can strongly influence the user's next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.

  • 27 authors
·
Sep 1, 2025

LLM-Driven Multi-step Translation from C to Rust using Static Analysis

Translating software written in legacy languages to modern languages, such as C to Rust, has significant benefits in improving memory safety while maintaining high performance. However, manual translation is cumbersome, error-prone, and produces unidiomatic code. Large language models (LLMs) have demonstrated promise in producing idiomatic translations, but offer no correctness guarantees as they lack the ability to capture all the semantics differences between the source and target languages. To resolve this issue, we propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool using a two-step translation methodology: an "unidiomatic" step to translate C into Rust while preserving semantics, and an "idiomatic" step to refine the code to follow Rust's semantic standards. SACTOR utilizes information provided by static analysis of the source C program to address challenges such as pointer semantics and dependency resolution. To validate the correctness of the translated result from each step, we use end-to-end testing via the foreign function interface to embed our translated code segment into the original code. We evaluate the translation of 200 programs from two datasets and two case studies, comparing the performance of GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3 70B and DeepSeek-R1 in SACTOR. Our results demonstrate that SACTOR achieves high correctness and improved idiomaticity, with the best-performing model (DeepSeek-R1) reaching 93% and (GPT-4o, Claude 3.5, DeepSeek-R1) reaching 84% correctness (on each dataset, respectively), while producing more natural and Rust-compliant translations compared to existing methods.

  • 6 authors
·
Mar 16, 2025

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

On the Tool Manipulation Capability of Open-source Large Language Models

Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs. The industrial adoption of these models is substantially constrained due to the security and robustness risks in exposing information to closed LLM API services. In this paper, we ask can we enhance open-source LLMs to be competitive to leading closed LLM APIs in tool manipulation, with practical amount of human supervision. By analyzing common tool manipulation failures, we first demonstrate that open-source LLMs may require training with usage examples, in-context demonstration and generation style regulation to resolve failures. These insights motivate us to revisit classical methods in LLM literature, and demonstrate that we can adapt them as model alignment with programmatic data generation, system prompts and in-context demonstration retrievers to enhance open-source LLMs for tool manipulation. To evaluate these techniques, we create the ToolBench, a tool manipulation benchmark consisting of diverse software tools for real-world tasks. We demonstrate that our techniques can boost leading open-source LLMs by up to 90% success rate, showing capabilities competitive to OpenAI GPT-4 in 4 out of 8 ToolBench tasks. We show that such enhancement typically requires about one developer day to curate data for each tool, rendering a recipe with practical amount of human supervision.

sambanovasystems SambaNova
·
May 25, 2023

MultiMend: Multilingual Program Repair with Context Augmentation and Multi-Hunk Patch Generation

Context: Bugs in code are inevitable and can lead to severe consequences, ranging from security vulnerabilities to operational failures. Debugging software remains challenging despite advances in testing and verification, often requiring extensive manual effort. Learning-based automated program repair (APR) has shown promise in reducing the time, effort, and cost of manually fixing bugs. However, existing techniques face several challenges, including language-dependent strategies, limited bug context utilization, and difficulties in handling bugs that span multiple locations in the code. Objective: This paper introduces MultiMend, a learning-based APR approach designed to improve repair performance on multiple programming languages with language-independent context augmentation and multi-hunk patch generation. Method: MultiMend fine-tunes a pre-trained encoder-decoder transformer model (CodeT5) to generate bug-fixing patches. It embeds source code lines and applies retrieval-augmented generation to augment the buggy context with relevant lines during patch generation. The approach systematically constructs patches for multi-hunk bugs to reduce the needed patch validations. We evaluate MultiMend on four benchmarks with four programming languages and compare it with state-of-the-art methods. Results: Experimental results show that MultiMend achieves competitive effectiveness and efficiency against compared tools. Across all benchmarks, MultiMend fixes 2,077 bugs, of which 1,455 are identical to the developer's patch, and 106 are for multi-hunk bugs. Both context augmentation and multi-hunk patch generation positively contribute to the results. Conclusion: MultiMend shows promising performance across benchmarks. The findings highlight its applicability to real-world software maintenance and its potential to reduce manual debugging efforts.

  • 3 authors
·
Jan 27, 2025

AIP: Agent Identity Protocol for Verifiable Delegation Across MCP and A2A

AI agents increasingly call tools via the Model Context Protocol (MCP) and delegate to other agents via Agent-to-Agent (A2A), yet neither protocol verifies agent identity. A scan of approximately 2,000 MCP servers found all lacked authentication. In our survey, we did not identify a prior implemented protocol that jointly combines public-key verifiable delegation, holder-side attenuation, expressive chained policy, transport bindings across MCP/A2A/HTTP, and provenance-oriented completion records. We introduce Invocation-Bound Capability Tokens (IBCTs), a primitive that fuses identity, attenuated authorization, and provenance binding into a single append-only token chain. IBCTs operate in two wire formats: compact mode (a signed JWT for single-hop cases) and chained mode (a Biscuit token with Datalog policies for multi-hop delegation). We provide reference implementations in Python and Rust with full cross-language interoperability. Compact mode verification takes 0.049ms (Rust) and 0.189ms (Python), with 0.22ms overhead over no-auth in real MCP-over-HTTP deployment. In a real multi-agent deployment with Gemini 2.5 Flash, AIP adds 2.35ms of overhead (0.086% of total end-to-end latency). Adversarial evaluation across 600 attack attempts shows 100% rejection rate, with two attack categories (delegation depth violation and audit evasion through empty context) uniquely caught by AIP's chained delegation model that neither unsigned nor plain JWT deployments detect.

  • 1 authors
·
Mar 24

ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree

Agent skills extend AI agents with reusable instructions, tools, scripts, references, and workflows, establishing a security boundary distinct from both model safety and traditional package-malware detection. ClawHub Security Signals is a sanitized dataset of 67,453 latest public OpenClaw skill versions. Each row pairs redacted SKILL.md content and sanitized bundled files where present with a final ClawScan registry verdict and evidence from three scanner families: VirusTotal, static heuristic analysis, and NVIDIA SkillSpector. Rather than estimating malicious-skill prevalence, we study scanner disagreement. The three scanners rarely flag the same skills: any pair overlaps on at most 10.4% of their combined positives, only 0.69% of skills are flagged by all three, and 81.9% of flagged skills are identified by a single scanner. The disagreement is structured by attack surface. SkillSpector, which raises semantic agentic-risk advisories rather than malware-reputation signals, is positive for 19,209 of 25,504 suspicious rows (75.3%) but only 14 of 206 malicious rows (6.8%). The malicious-verdict region shows the inverse profile: 150 of 206 malicious rows (72.8%) are VirusTotal-positive, consistent with bundled-code malware evidence. These results show that agent-skill security requires layered governance, not single-scanner allow/block decisions. The corpus is released as a sanitized silver-standard dataset: labels are the registry's automated verdicts, not human-annotated ground truth, and the release represents an early, versioned snapshot intended to support the community while a human-annotated subset is developed. Further research is encouraged, including models tailored for skill-security triage.

OpenClaw OpenClaw
·
May 31 1

Bridging Protocol and Production: Design Patterns for Deploying AI Agents with Model Context Protocol

The Model Context Protocol (MCP) standardizes how AI agents discover and invoke external tools, with over 10,000 active servers and 97 million monthly SDK downloads as of early 2026. Yet MCP does not yet standardize how agents safely operate those tools at production scale. Three protocol-level primitives remain missing: identity propagation, adaptive tool budgeting, and structured error semantics. This paper identifies these gaps through field lessons from an enterprise deployment of an AI agent platform integrated with a major cloud provider's MCP servers (client name redacted). We propose three mechanisms to fill them: (1) the Context-Aware Broker Protocol (CABP), which extends JSON-RPC with identity-scoped request routing via a six-stage broker pipeline; (2) Adaptive Timeout Budget Allocation (ATBA), which frames sequential tool invocation as a budget allocation problem over heterogeneous latency distributions; and (3) the Structured Error Recovery Framework (SERF), which provides machine-readable failure semantics that enable deterministic agent self-correction. We organize production failure modes into five design dimensions (server contracts, user context, timeouts, errors, and observability), document concrete failure vignettes, and present a production readiness checklist. All three algorithms are formalized as testable hypotheses with reproducible experimental methodology. Field observations demonstrate that while MCP provides a solid protocol foundation, reliable agent tool integration requires infrastructure-level mechanisms that the specification does not yet address.

  • 1 authors
·
Mar 11

Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions

The Model Context Protocol (MCP) introduces a standard specification that defines how Foundation Model (FM)-based agents should interact with external systems by invoking tools. However, to understand a tool's purpose and features, FMs rely on natural-language tool descriptions, making these descriptions a critical component in guiding FMs to select the optimal tool for a given (sub)task and to pass the right arguments to the tool. While defects or smells in these descriptions can misguide FM-based agents, their prevalence and consequences in the MCP ecosystem remain unclear. Hence, we examine 856 tools spread across 103 MCP servers empirically, assess their description quality, and their impact on agent performance. We identify six components of tool descriptions from the literature, develop a scoring rubric utilizing these components, and then formalize tool description smells based on this rubric. By operationalizing this rubric through an FM-based scanner, we find that 97.1% of the analyzed tool descriptions contain at least one smell, with 56% failing to state their purpose clearly. While augmenting these descriptions for all components improves task success rates by a median of 5.85 percentage points and improves partial goal completion by 15.12%, it also increases the number of execution steps by 67.46% and regresses performance in 16.67% of cases. These results indicate that achieving performance gains is not straightforward; while execution cost can act as a trade-off, execution context can also impact. Furthermore, component ablations show that compact variants of different component combinations often preserve behavioral reliability while reducing unnecessary token overhead, enabling more efficient use of the FM context window and lower execution costs.

  • 5 authors
·
Feb 16 2

Compiling C to Safe Rust, Formalized

The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages. We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust's type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees" that allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C's struct types that is compatible with Rust's distinction between non-owned and owned allocations. We apply our methodology to existing formally verified C codebases: the HACL* cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust. Our evaluation shows that for the few places that do violate Rust's aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we insert have a negligible performance impact. Of particular note, the application of our approach to HACL* results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms - the first of its kind.

  • 2 authors
·
Dec 19, 2024

GCVE: A Decentralized Model for Vulnerability Identification, Publication, and Operational Enrichment

The Global CVE initiative (GCVE) proposes a decentralized, open, and extensible model for vulnerability identification, publication, and enrichment. It addresses a gap in today's vulnerability ecosystem: centralized systems provide rigorous control and widely recognized identifiers, while many producers publish advisories independently without a shared fabric for discovery, correlation, enrichment, and reuse. This paper presents GCVE as a socio-technical standardization effort combining autonomous GCVE Numbering Authorities, lightweight allocation rules, distributed publication, open Best Current Practices, and practical reference implementations. The model preserves global uniqueness while allowing participants to publish according to their operational needs. It also broadens the concept of a vulnerability record to cover assignments, disclosures, sightings, rejected identifiers, observations, exploited vulnerability information, and enrichment records. The paper describes how the GCVE BCP process supports technical interoperability and amendable operational practice, including practical guidance for vulnerability handling and disclosure. It also examines the extension mechanism, including AI-oriented extensions, as a way to evolve the standard without centralizing control. A particular focus is placed on vulnerability-lookup as the reference implementation. It aggregates multiple sources, supports GCVE publication and consumption, implements distributed Known Exploited Vulnerability data, and enables automatically enriched vulnerability data streams. Building on lessons from the MISP ecosystem, GCVE frames vulnerability coordination not only as identifier allocation, but as open infrastructure for collective security knowledge production.

  • 1 authors
·
May 29

"Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors

Agentic AI coding editors driven by large language models have recently become more popular due to their ability to improve developer productivity during software development. Modern editors such as Cursor are designed not just for code completion, but also with more system privileges for complex coding tasks (e.g., run commands in the terminal, access development environments, and interact with external systems). While this brings us closer to the "fully automated programming" dream, it also raises new security concerns. In this study, we present the first empirical analysis of prompt injection attacks targeting these high-privilege agentic AI coding editors. We show how attackers can remotely exploit these systems by poisoning external development resources with malicious instructions, effectively hijacking AI agents to run malicious commands, turning "your AI" into "attacker's shell". To perform this analysis, we implement AIShellJack, an automated testing framework for assessing prompt injection vulnerabilities in agentic AI coding editors. AIShellJack contains 314 unique attack payloads that cover 70 techniques from the MITRE ATT&CK framework. Using AIShellJack, we conduct a large-scale evaluation on GitHub Copilot and Cursor, and our evaluation results show that attack success rates can reach as high as 84% for executing malicious commands. Moreover, these attacks are proven effective across a wide range of objectives, ranging from initial access and system discovery to credential theft and data exfiltration.

  • 6 authors
·
Sep 26, 2025

GAMMA: Revisiting Template-based Automated Program Repair via Mask Prediction

Automated program repair (APR) aims to fix software bugs without human intervention and template-based APR has been widely investigated with promising results. However, it is challenging for template-based APR to select the appropriate donor code, which is an important repair ingredient for generating candidate patches. Inappropriate donor code may cause plausible but incorrect patch generation even with correct fix patterns, limiting the repair performance. In this paper, we aim to revisit template-based APR, and propose GAMMA, to directly leverage large pre-trained language models for donor code generation. Our main insight is that instead of retrieving donor code in the local buggy file, we can directly predict the correct code tokens based on the context code snippets and repair patterns by a cloze task. Specifically, (1) GAMMA revises a variety of fix templates from state-of-the-art template-based APR techniques (i.e., TBar) and transforms them into mask patterns. (2) GAMMA adopts a pre-trained language model to predict the correct code for masked code as a fill-in-the-blank task. The experimental results demonstrate that GAMMA correctly repairs 82 bugs on Defects4J-v1.2, which achieves 20.59\% (14 bugs) and 26.15\% (17 bugs) improvement over the previous state-of-the-art template-based approach TBar and learning-based one Recoder. Furthermore, GAMMA repairs 45 bugs and 22 bugs from the additional Defects4J-v2.0 and QuixBugs, indicating the generalizability of GAMMA in addressing the dataset overfitting issue. We also prove that adopting other pre-trained language models can provide substantial advancement, e.g., CodeBERT-based and ChatGPT-based GAMMA is able to fix 80 and 67 bugs on Defects4J-v1.2, indicating the scalability of GAMMA. Overall, our study highlights the promising future of adopting pre-trained models to generate correct patches on top of fix patterns.

  • 6 authors
·
Sep 17, 2023

Impact of Code Language Models on Automated Program Repair

Automated program repair (APR) aims to help developers improve software reliability by generating patches for buggy programs. Although many code language models (CLM) are developed and effective in many software tasks such as code completion, there has been little comprehensive, in-depth work to evaluate CLMs' fixing capabilities and to fine-tune CLMs for the APR task. Firstly, this work is the first to evaluate ten CLMs on four APR benchmarks, which shows that surprisingly, the best CLM, as is, fixes 72% more bugs than the state-of-the-art deep-learning (DL)-based APR techniques. Secondly, one of the four APR benchmarks was created by us in this paper to avoid data leaking for a fair evaluation. Thirdly, it is the first work to fine-tune CLMs with APR training data, which shows that fine-tuning brings 31%-1,267% improvement to CLMs and enables them to fix 46%-164% more bugs than existing DL-based APR techniques. Fourthly, this work studies the impact of buggy lines, showing that CLMs, as is, cannot make good use of the buggy lines to fix bugs, yet fine-tuned CLMs could potentially over-rely on buggy lines. Lastly, this work analyzes the size, time, and memory efficiency of different CLMs. This work shows promising directions for the APR domain, such as fine-tuning CLMs with APR-specific designs, and also raises awareness of fair and comprehensive evaluations of CLMs and calls for more transparent reporting of open-source repositories used in the pre-training data to address the data leaking problem.

  • 4 authors
·
Feb 9, 2023

CAISE: Conversational Agent for Image Search and Editing

Demand for image editing has been increasing as users' desire for expression is also increasing. However, for most users, image editing tools are not easy to use since the tools require certain expertise in photo effects and have complex interfaces. Hence, users might need someone to help edit their images, but having a personal dedicated human assistant for every user is impossible to scale. For that reason, an automated assistant system for image editing is desirable. Additionally, users want more image sources for diverse image editing works, and integrating an image search functionality into the editing tool is a potential remedy for this demand. Thus, we propose a dataset of an automated Conversational Agent for Image Search and Editing (CAISE). To our knowledge, this is the first dataset that provides conversational image search and editing annotations, where the agent holds a grounded conversation with users and helps them to search and edit images according to their requests. To build such a system, we first collect image search and editing conversations between pairs of annotators. The assistant-annotators are equipped with a customized image search and editing tool to address the requests from the user-annotators. The functions that the assistant-annotators conduct with the tool are recorded as executable commands, allowing the trained system to be useful for real-world application execution. We also introduce a generator-extractor baseline model for this task, which can adaptively select the source of the next token (i.e., from the vocabulary or from textual/visual contexts) for the executable command. This serves as a strong starting point while still leaving a large human-machine performance gap for useful future work. Our code and dataset are publicly available at: https://github.com/hyounghk/CAISE

  • 6 authors
·
Feb 23, 2022

CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the effectiveness of these models, multiple existing benchmarks are proposed, including only cases of generating a standalone function, i.e., a function that may invoke or access only built-in functions and standard libraries. However, non-standalone functions, which typically are not included in the existing benchmarks, constitute more than 70% of the functions in popular open-source projects, and evaluating models' effectiveness on standalone functions cannot reflect these models' effectiveness on pragmatic code generation scenarios. To help bridge the preceding gap, in this paper, we propose a benchmark named CoderEval, consisting of 230 Python and 230 Java code generation tasks carefully curated from popular real-world open-source projects and a self-contained execution platform to automatically assess the functional correctness of generated code. CoderEval supports code generation tasks from six levels of context dependency, where context refers to code elements such as types, APIs, variables, and consts defined outside the function under generation but within the dependent third-party libraries, current class, file, or project. CoderEval can be used to evaluate the effectiveness of models in generating code beyond only standalone functions. By evaluating three code generation models on CoderEval, we find that the effectiveness of these models in generating standalone functions is substantially higher than that in generating non-standalone functions. Our analysis highlights the current progress and pinpoints future directions to further improve a model's effectiveness by leveraging contextual information for pragmatic code generation.

  • 10 authors
·
Feb 1, 2023

SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.

  • 3 authors
·
Jun 18, 2025

XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants

AI coding assistants are widely used for tasks like code generation. These tools now require large and complex contexts, automatically sourced from various originsx2014across files, projects, and contributorsx2014forming part of the prompt fed to underlying LLMs. This automatic context-gathering introduces new vulnerabilities, allowing attackers to subtly poison input to compromise the assistant's outputs, potentially generating vulnerable code or introducing critical errors. We propose a novel attack, Cross-Origin Context Poisoning (XOXO), that is challenging to detect as it relies on adversarial code modifications that are semantically equivalent. Traditional program analysis techniques struggle to identify these perturbations since the semantics of the code remains correct, making it appear legitimate. This allows attackers to manipulate coding assistants into producing incorrect outputs, while shifting the blame to the victim developer. We introduce a novel, task-agnostic, black-box attack algorithm GCGS that systematically searches the transformation space using a Cayley Graph, achieving a 75.72% attack success rate on average across five tasks and eleven models, including GPT 4.1 and Claude 3.5 Sonnet v2 used by popular AI coding assistants. Furthermore, defenses like adversarial fine-tuning are ineffective against our attack, underscoring the need for new security measures in LLM-powered coding tools.

  • 7 authors
·
Mar 18, 2025

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, τ^2-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900/0.710/0.533/0.557 AUPRC. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based area under the precision-recall curve (AUPRC) that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and τ^2-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas τ^2-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.

S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

Large Language Models have gained considerable attention for their revolutionary capabilities. However, there is also growing concern on their safety implications, making a comprehensive safety evaluation for LLMs urgently needed before model deployment. In this work, we propose S-Eval, a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark. At the core of S-Eval is a novel LLM-based automatic test prompt generation and selection framework, which trains an expert testing LLM Mt combined with a range of test selection strategies to automatically construct a high-quality test suite for the safety evaluation. The key to the automation of this process is a novel expert safety-critique LLM Mc able to quantify the riskiness score of a LLM's response, and additionally produce risk tags and explanations. Besides, the generation process is also guided by a carefully designed risk taxonomy with four different levels, covering comprehensive and multi-dimensional safety risks of concern. Based on these, we systematically construct a new and large-scale safety evaluation benchmark for LLMs consisting of 220,000 evaluation prompts, including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200, 000 corresponding attack prompts derived from 10 popular adversarial instruction attacks against LLMs. Moreover, considering the rapid evolution of LLMs and accompanied safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks and models. S-Eval is extensively evaluated on 20 popular and representative LLMs. The results confirm that S-Eval can better reflect and inform the safety risks of LLMs compared to existing benchmarks. We also explore the impacts of parameter scales, language environments, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs.

  • 10 authors
·
May 23, 2024

MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents

The Model Context Protocol (MCP) standardizes how large language model (LLM) agents discover, describe, and call external tools. While MCP unlocks broad interoperability, it also enlarges the attack surface by making tools first-class, composable objects with natural-language metadata, and standardized I/O. We present MSB (MCP Security Benchmark), the first end-to-end evaluation suite that systematically measures how well LLM agents resist MCP-specific attacks throughout the full tool-use pipeline: task planning, tool invocation, and response handling. MSB contributes: (1) a taxonomy of 12 attacks including name-collision, preference manipulation, prompt injections embedded in tool descriptions, out-of-scope parameter requests, user-impersonating responses, false-error escalation, tool-transfer, retrieval injection, and mixed attacks; (2) an evaluation harness that executes attacks by running real tools (both benign and malicious) via MCP rather than simulation; and (3) a robustness metric that quantifies the trade-off between security and performance: Net Resilient Performance (NRP). We evaluate nine popular LLM agents across 10 domains and 405 tools, producing 2,000 attack instances. Results reveal the effectiveness of attacks against each stage of MCP. Models with stronger performance are more vulnerable to attacks due to their outstanding tool calling and instruction following capabilities. MSB provides a practical baseline for researchers and practitioners to study, compare, and harden MCP agents. Code: https://github.com/dongsenzhang/MSB

  • 6 authors
·
Oct 14, 2025

Autoformalizer with Tool Feedback

Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Efforts in recent work shift from directly prompting large language models to training an end-to-end formalizer model from scratch, achieving remarkable advancements. However, existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency. To address this issue, we propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process. By integrating Lean 4 compilers for syntax corrections and employing a multi-LLMs-as-judge approach for consistency validation, the model is able to adaptively refine generated statements according to the tool feedback, enhancing both syntactic validity and semantic consistency. The training of ATF involves a cold-start phase on synthetic tool-calling data, an expert iteration phase to improve formalization capabilities, and Direct Preference Optimization to alleviate ineffective revisions. Experimental results show that ATF markedly outperforms a range of baseline formalizer models, with its superior performance further validated by human evaluations. Subsequent analysis reveals that ATF demonstrates excellent inference scaling properties. Moreover, we open-source Numina-ATF, a dataset containing 750K synthetic formal statements to facilitate advancements in autoformalization and ATP research.

  • 11 authors
·
Oct 8, 2025

LACUNA: Safe Agents as Recursive Program Holes

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call agent[T](task) that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and τ^2-bench. On BrowseComp-Plus, 8.6% of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches 27.1% accuracy. On τ^2-bench, LACUNA solves 76.0% of 392 tasks across four domains with a capable model, on par with the baseline agent.