[
{
"title": "Scaling Automated Database System Testing",
"authors": [
"Suyang Zhong",
"Manuel Rigger"
],
"abstract": "",
"doi": "10.1145/3779212.3790215",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790215",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790215",
"has_full_text": true
},
{
"title": "Unicorns, Centaurs, and Cyborgs: Co-design Powering the Intelligence Era",
"authors": [
"Parthasarathy Ranganathan"
],
"abstract": "We are in the middle of a new \u201cintelligence revolution\u201d. From consumer products to enterprise cloud offerings, and even foundational scientific research, Artificial Intelligence (AI) is fundamentally transforming all aspects of our life. These amazing advances have been made possible by equally impressive innovations in the underlying computing systems powering AI, but the continued pace of growth and change will require even more. In this talk, we will discuss vertically-integrated optimizations in AI infrastructure that have already birthed a new cottage industry of \u201cunicorns\u201d, ranging from physical data centers & hardware to software and cloud solutions, and identify the exciting opportunities ahead. We will then look forward to future \u201ccentaurs and cyborgs,\u201d symbiotic partnerships between human ingenuity and artificial intelligence, highlighting the opportunities and challenges when we use AI to optimize AI. Co-design and collaboration, across the hardware-software stack, across disciplines, and across communities, will be key to this exciting new future.",
"doi": "10.1145/3779212.3795611",
"url": "https://dl.acm.org/doi/10.1145/3779212.3795611",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3795611",
"has_full_text": true
},
{
"title": "An AI Stack: From Scaling AI Workloads to Evaluating LLMs",
"authors": [
"Ion Stoica"
],
"abstract": "Large language models (LLMs) have taken the world by storm, enabling new applications, intensifying GPU shortages, and raising concerns about the accuracy of their outputs. In this talk, I will present several projects I have worked on to address these challenges. Specifically, I will focus on Ray, a distributed framework for scaling AI workloads, vLLM and SGLang, two high-throughput inference engines for LLMs, and LMArena, a platform for accurate LLM benchmarking. I will conclude with key lessons learned and outline directions for future research.",
"doi": "10.1145/3779212.3795612",
"url": "https://dl.acm.org/doi/10.1145/3779212.3795612",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3795612",
"has_full_text": true
},
{
"title": "Mission-Critical Enterprise Systems... What's a Mission? And What's Critical?: Processor and technology requirements for enterprise computing",
"authors": [
"Hillery Hunter"
],
"abstract": "Behind the scenes, mainframes (IBM Z) and IBM Power Systems fuel banking, manufacturing, healthcare, insurance, transportation, and many other critical industries. While we each touch multiple silicon-based systems daily, enterprise systems are the ones you never see, but always rely on. In this talk, we'll look inside these systems, discuss the silicon technology, microarchitecture, and system elements which come together to deliver eight nines of system-level resiliency (99.999999% uptime). We'll explore the changes in workloads on these systems, and discuss how new drivers like AI for fraud detection, AI for insurance claims processing, and AI-driven system operations are changing enterprise requirements. Bio: Hillery is the General Manager for IBM Power and serves as IBM Infrastructure CTO. In this role, she is focused on setting the global strategy for both Power and Infrastructure Platform in addition to driving cross-functional innovation and growth across IBM's Infrastructure portfolio which includes IBM zSystems, Power, Storage, TLS and IBM Cloud. In her previous roles as the GM, Industry Clouds and CTO of IBM Cloud, she led the business and technical strategy for IBM's industry-contextualized cloud offerings and was responsible for technical strategy for IBM's public cloud. Prior to this role, she served as Director of Accelerated Cognitive Infrastructure in IBM Research, leading a team doing cross-stack (hardware through software) optimization of AI workloads. Her technical interests have always been interdisciplinary, spanning from silicon technology through system software, and she has served in technical and leadership roles in memory technology, Systems for AI and other areas. Hillery was appointed as an IBM Fellow in 2017, and she is a BS, MS and PhD graduate of the University of Illinois at Urbana-Champaign, where she also received the 2024 Alumni Award for Distinguished Service.",
"doi": "10.1145/3779212.3795613",
"url": "https://dl.acm.org/doi/10.1145/3779212.3795613",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3795613",
"has_full_text": true
},
{
"title": "Optimizer-Friendly Instrumentation for Event Quantification with PRUE Algorithm",
"authors": [
"Hao Ling",
"Yiyuan Guo",
"Charles Zhang"
],
"abstract": "Event quantification provides frequency information for runtime events and is widely used to build fast, secure, and reliable software and systems. It is usually achieved through instrumentation that introduces new instructions into programs but has significant runtime overhead. A key challenge in developing efficient instrumentation is that the instrumentation can barely benefit from modern compiler optimizations because the additional instructions introduce side effects that complicate the optimization process. One intuitive idea to mitigate the side effect issue is to use local counters with transparent lifetimes and scopes instead of relying on global counters. Optimizations are more likely to succeed on instructions that manipulate local variables rather than global ones. However, this idea is not fully adequate, as local counters are unobservable to external users of the quantification data. Therefore, global counters are still necessary, and it is essential to determine when to update them based on the local counters. The challenge is that these updates also introduce side effects that hinder optimizations. We address this update decision problem with Partially Redundant Update Elimination (PRUE). We present Zircon, the first optimizer-friendly instrumentation to quantify runtime events. With the help of local counters, Zircon separates the event occurrence and the quantification by moving the instructions with side effects outside the scope where the optimizers apply. However, this separation introduces redundant instructions; the additional instructions would still be executed even without event occurrence. In other words, the lifetime control of the local counters challenges the efficiency of instrumentation. PRUE transforms the program and prunes away paths with redundant updates based on static analysis. The evaluation on SPEC CPU 2017 shows that Zircon is up to 149% faster than state-of-the-art work, including SanCov and Nisse. Unlike existing efforts, Zircon does not compromise accuracy for efficiency; instead, it fully harnesses the power of compilation optimizers and delivers automatic performance gains.",
"doi": "10.1145/3779212.3790196",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790196",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790196",
"has_full_text": true
},
{
"title": "Evaluating Compiler Optimization Impacts on zkVM Performance",
"authors": [
"Thomas Gassmann",
"Stefanos Chaliasos",
"Thodoris Sotiropoulos",
"Zhendong Su"
],
"abstract": "Zero-knowledge proofs (ZKPs) are the cornerstone of programmable cryptography. They enable (1) privacy-preserving and verifiable computation across blockchains, and (2) an expanding range of off-chain applications such as credential schemes. Zero-knowledge virtual machines (zkVMs) lower the barrier by turning ZKPs into a drop-in backend for standard compilation pipelines. This lets developers write proof-generating programs in conventional languages (e.g., Rust or C++) instead of hand-crafting arithmetic circuits. However, these VMs inherit compiler infrastructures tuned for traditional architectures rather than for proof systems. In particular, standard compiler optimizations assume features that are absent in zkVMs, including cache locality, branch prediction, or instruction-level parallelism. Therefore, their impact on proof generation is questionable. We present the first systematic study of the impact of compiler optimizations on zkVMs. We evaluate 64 LLVM passes, six standard optimization levels, and an unoptimized baseline across 58 benchmarks on two RISC-V\u2013based zkVMs (RISC Zero and SP1). While standard LLVM optimization levels do improve zkVM performance (over 40%), their impact is far smaller than on traditional CPUs, since their decisions rely on hardware features rather than proof constraints. Guided by a performance impact analysis, we slightly refine a small set of LLVM passes to be zkVM-aware, improving zkVM execution time by up to 45% (average +4.6% on RISC Zero, +1% on SP1) and achieving consistent proving-time gains. Our work highlights the potential of compiler-level optimizations for zkVM performance and opens new directions for zkVM-specific passes, backends, and superoptimizers.",
"doi": "10.1145/3779212.3790159",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790159",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790159",
"has_full_text": true
},
{
"title": "Mugi: Value Level Parallelism For Efficient LLMs",
"authors": [
"Daniel Price",
"Prabhu Vellaisamy",
"John Paul Shen",
"Di Wu"
],
"abstract": "Value level parallelism (VLP) has been proposed to improve the efficiency of large-batch, low-precision general matrix multiply (GEMM) between symmetric activations and weights. In transformer based large language models (LLMs), there exist more sophisticated operations beyond activation-weight GEMM. In this paper, we explore how VLP benefits LLMs. First, we generalize VLP for nonlinear approximations, outperforming existing nonlinear approximations in end-to-end LLM accuracy, performance, and efficiency. Our VLP approximation follows a value-centric approach, where important values are assigned with greater accuracy. Second, we optimize VLP for small-batch GEMMs with asymmetric inputs efficiently, which leverages timely LLM optimizations, including weight-only quantization, key-value (KV) cache quantization, and group query attention. Finally, we design a new VLP architecture, Mugi, to encapsulate the innovations above and support full LLM workloads, while providing better performance, efficiency and sustainability. Our experimental results show that Mugi can offer significant improvements on throughput and energy efficiency, up to 45\u00d7 and 668\u00d7 for nonlinear softmax operations, and 2.07\u00d7 and 3.11\u00d7 for LLMs, and also decrease operational carbon for LLM operation by 1.45\u00d7 and embodied carbon by 1.48\u00d7.",
"doi": "10.1145/3779212.3790189",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790189",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790189",
"has_full_text": true
},
{
"title": "Understanding and Optimizing Database Pushdown on Disaggregated Storage",
"authors": [
"Hua Zhang",
"Xiao Li",
"Yuebin Bai",
"Ming Liu"
],
"abstract": "Database pushdown is a widely adopted technique under compute-storage disaggregation. The rising network and I/O speeds, coupled with stagnated compute and memory subsystems of a disaggregated storage architecture in the past decade, render state-of-the-art policy-driven pushdown designs ineffective. This is because the query performance bottleneck has shifted from network and I/O to compute, where computing power at the storage layer becomes scarce. This paper rethinks pushdown database design via a systematic characterization and identifies three root causes, i.e., table structure agnostic, lower interference tolerance, and lack of operator scheduling. Based on the gathered insights, we build TapDB, a new pushdown database that targets emerging storage disaggregation. Driven by two key ideas (i.e., lazy evaluation and trading network and I/O for compute), TapDB introduces four new mechanisms: a table-aware operator cost estimator based on in-situ meta-learning and cardinality estimation, an admission control scheme to limit execution concurrency, a ballooning-based DRAM-SSD hybrid table, and a critical path-driven operator scheduler. Our prototype shows 1.3-2.3\u00d7 speedups compared with prior solutions when running SSB and TPCH benchmarks.",
"doi": "10.1145/3779212.3790243",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790243",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790243",
"has_full_text": true
},
{
"title": "Reducing T Gates with Unitary Synthesis",
"authors": [
"Tianyi Hao",
"Amanda Xu",
"Swamit Tannu"
],
"abstract": "Quantum error correction is essential for achieving practical quantum computing but has a significant computational overhead. Among fault-tolerant (FT) gate operations, non-Clifford gates, such as T, are particularly expensive due to their reliance on magic state distillation. These costly T gates appear frequently in FT circuits as many quantum algorithms require arbitrary single-qubit rotations, such as Rx and Rz gates, which must be decomposed into a sequence of T and Clifford gates. In many quantum circuits, Rx and Rz gates can be fused to form a single U3 unitary. However, existing synthesis methods, such as gridsynth, rely on indirect decompositions, requiring separate Rz decompositions that result in a threefold increase in T count. This work presents TensoR-based Arbitrary unitary SYNthesis (TRASYN), a novel FT synthesis algorithm that directly synthesizes arbitrary single-qubit unitaries, avoiding the overhead of separate Rz decompositions. By leveraging tensor network-based search, our approach enables native U3 synthesis, reducing the T count, Clifford gate count, and approximation error. Compared to gridsynth-based circuit synthesis, for 187 representative benchmarks, our design reduces the T count by up to 3.5\u00d7, and Clifford gates by 7\u00d7, resulting in up to 4\u00d7 improvement in overall circuit infidelity.",
"doi": "10.1145/3779212.3790210",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790210",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790210",
"has_full_text": true
},
{
"title": "Detecting Inconsistencies in Arm CCA's Formally Verified Specification",
"authors": [
"Changho Choi",
"Xiang Cheng",
"Bokdeuk Jeong",
"Taesoo Kim"
],
"abstract": "Formal verification offers strong guarantees of correctness, robustness, and security. However, these guarantees depend on specification correctness, and even minor flaws can invalidate proofs and introduce critical vulnerabilities. We present Scope, an automated system that identifies specification inconsistencies by combining formal modeling with rule-based consistency checking. Unlike traditional approaches that rely on implementations, Scope treats the specification as the sole ground truth. It translates the specification into a machine-verifiable model using Verus and SMT solvers, then detects inconsistencies in success/failure conditions, dependency rules, and state transitions. We apply Scope to the Realm Management Monitor (RMM) specifications for Arm's Confidential Compute Architecture (CCA), uncovering 35 previously unknown bugs\u2014including security-critical flaws in ABI semantics and missing state transitions\u2014all confirmed by Arm. Compared to modern LLM-based tools, Scope improves inconsistency-detection precision by 7\u00d7 over GPT-o1 and up to 40\u00d7 over leading chat models (LLaMA 3.1, GPT-4o, Claude 3.7).",
"doi": "10.1145/3779212.3790152",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790152",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790152",
"has_full_text": true
},
{
"title": "PIPM: Partial and Incremental Page Migration for Multi-host CXL Disaggregated Shared Memory",
"authors": [
"Gangqi Huang",
"Heiner Litz",
"Yuanchao Xu"
],
"abstract": "The emerging Compute Express Link (CXL) interconnect supports multi-host cache-coherent disaggregated shared memory (CXL-DSM). However, existing page migration approaches, designed primarily for single-host systems, are inefficient in multi-host CXL-DSM scenarios. To address this, we propose Partial and Incremental Page Migration (PIPM), a hardware-based solution that transparently leverages host-side local memory. PIPM is co-designed with the CXL multi-host coherence protocol, enabling coherent access to data residing in local DRAM. To overcome limitations of existing migration methods, PIPM supports fine-grained data migration and integrates hardware-based monitoring and decision-making mechanisms to optimize data placement. Evaluation results demonstrate that PIPM delivers performance improvements of up to 2.54\u00d7 (1.86\u00d7 on average) over the default multi-host CXL-DSM configuration.",
"doi": "10.1145/3779212.3790203",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790203",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790203",
"has_full_text": true
},
{
"title": "It Takes Two to Entangle",
"authors": [
"Zhanghan Wang",
"Ding Ding",
"Hang Zhu",
"Haibin Lin",
"Aurojit Panda"
],
"abstract": "Distributed machine learning training and inference are common today because today's large models require more memory and compute than can be provided by a single GPU. Distributed models are generally produced by programmers who take a sequential model specification and apply several distribution strategies to distribute state and computation across GPUs. Unfortunately, bugs can be introduced in the process, and a distributed model implementation's outputs might differ from the sequential model's outputs. In this paper, we describe an approach to statically identify such bugs by checking model refinement, that is, can the sequential model's outputs be reconstructed from the distributed model's outputs? Our approach, implemented in Entangle, uses iterative rewriting to prove model refinement. Our approach can scale to today's large models and deployments: we evaluate it using GPT and Llama-3. Further, it provides actionable outputs that aid in bug localization.",
"doi": "10.1145/3779212.3790178",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790178",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790178",
"has_full_text": true
},
{
"title": "QoServe: Breaking the Silos of LLM Inference Serving",
"authors": [
"Kanishk Goel",
"Jayashree Mohan",
"Nipun Kwatra",
"Ravi Shreyas Anupindi",
"Ramachandran Ramjee"
],
"abstract": "The widespread adoption of Large Language Models (LLMs) has enabled diverse applications with very different latency requirements. Existing LLM serving frameworks rely on siloed infrastructure with coarse-grained workload segregation --- interactive and batch --- leading to inefficient resource utilization and limited support for fine-grained Quality-of-Service (QoS) differentiation. We present QoServe, a novel QoS-driven inference serving system that enables efficient co-scheduling of diverse workloads on shared infrastructure. QoServe introduces fine-grained QoS classification allowing applications to specify precise latency requirements, and dynamically adapts scheduling decisions based on real-time system state. Leveraging the predictable execution characteristics of LLM inference, QoServe implements dynamic chunking to improve overall throughput while maintaining strict QoS guarantees. Additionally, QoServe introduces hybrid prioritization to balance fairness and efficiency, and employs selective request relegation for graceful service degradation during overloads. Our evaluation demonstrates that QoServe increases serving capacity by 23% compared to current siloed deployments, while maintaining QoS guarantees on an A100 cluster, and improves per-replica goodput by up to 2.4\u00d7 compared to Sarathi on a shared cluster. Notably, under extreme load, our system reduces SLO violations by an order of magnitude compared to current strategies.",
"doi": "10.1145/3779212.3790206",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790206",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790206",
"has_full_text": true
},
{
"title": "HEPIC: Private Inference over Homomorphic Encryption with Client Intervention",
"authors": [
"Kevin Nam",
"Youyeon Joo",
"Seungjin Ha",
"Hyungon Moon",
"Yunheung Paek"
],
"abstract": "Homomorphic Encryption (HE) enables Private Inference (PI) in Machine Learning as a Service (MLaaS), protecting both client inputs and server-side neural network (NN) parameters. Existing PI techniques are predominantly implemented as either HE-based fire-and-forget methods or MPC-based interactive methods. Recent HE-based PI systems improve the accuracy--performance trade-off via a layer-wise scheme and parameter switching, yet remain bottlenecked by fire-and-forget execution in which the server alone performs costly ciphertext management (e.g., bootstrapping and scheme/parameter conversions). We present HEPIC, an HE-based PI system that explores a different design point by leveraging client interventions for ciphertext managements. In a sense, HEPIC shares a common ground with MPC-based PI of being interactive with the client, but differs in that the client only intervenes for ciphertext managements required in HE operations. Because ciphertext management has identical semantics on the client and the server, HEPIC lets developers decide where and how often to execute it, enabling fine-grained trade-offs among computation, communication, and ciphertext configuration. HEPIC makes such execution practical by overlapping client re-encryption, server computation, and communication via dependency-aware pipelining and streaming-based transfers. We further enhance the performance with a cache-aware task allocator (CATA) and a cost-aware client intervention scheduler (CACIS) to exploit ciphertext-level parallelism and to mitigate stalls under client-server performance disparity. Our evaluation shows that HEPIC achieves up to 2.20--41.93\u00d7 speedup over state-of-the-art fire-and-forget HE-based PI, while maintaining zero loss in inference accuracy.",
"doi": "10.1145/3779212.3790170",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790170",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790170",
"has_full_text": true
},
{
"title": "ICARUS: Criticality and Reuse based Instruction Caching for Datacenter Applications",
"authors": [
"Vedant Kalbande",
"Hrishikesh Jedhe Deshmukh",
"Alberto Ros",
"Biswabandan Panda"
],
"abstract": "Datacenter applications with huge code footprints suffer from front-end CPU bottlenecks even with a decoupled front-end. These applications are composed of complex system stacks with subtle interdependencies. One of the primary contributors to the front-end bottleneck is instruction misses at L2, which cause decode starvation. State-of-the-art L2 cache replacement policies, such as EMISSARY, utilize front-end criticality to identify instruction lines that cause decode starvation and attempt to keep those critical lines in L2. We observe that only 28.32% of the critical lines retain their criticality behavior, and a significant fraction of the critical instruction lines show dynamic behavior. We propose ICARUS, an L2 replacement policy that incorporates branch history as context information to improve critical instruction line detection. We observe that the reuse distance of instruction lines varies based on the branch history that led to the instruction fetch. Next, we enhance the L2 replacement policy by considering both criticality and the reuse of instruction lines at L2, as we observe that the reuse behavior of critical lines differs from that of non-critical lines. On average, across 12 datacenter applications, ICARUS outperforms Tree-based Pseudo LRU (TPLRU) by 5.6% and as high as 51%. The state-of-the-art replacement policy, EMISSARY, on the other hand, provides an improvement of 2.2% over TPLRU. We demonstrate the robustness of ICARUS across various L1I and L2 cache sizes, as well as its effectiveness in the presence of hardware prefetchers.",
"doi": "10.1145/3779212.3790175",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790175",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790175",
"has_full_text": true
},
{
"title": "LPO: Discovering Missed Peephole Optimizations with Large Language Models",
"authors": [
"Zhenyang Xu",
"Hongxu Xu",
"Yongqiang Tian",
"Xintong Zhou",
"Chengnian Sun"
],
"abstract": "Peephole optimization is an essential class of compiler optimizations that targets small, inefficient instruction sequences within programs. By replacing such suboptimal instructions with refined and more optimal sequences, these optimizations not only directly optimize code size and performance, but also enable more transformations in the subsequent optimization pipeline. Despite their importance, discovering new and effective peephole optimizations remains challenging due to the complexity and breadth of instruction sets. Prior approaches either lack scalability or have significant restrictions on the peephole optimizations that they can find. This paper introduces LPO, a novel automated framework to discover missed peephole optimizations. Our key insight is that, Large Language Models (LLMs) are effective at creative exploration but susceptible to hallucinations; conversely, formal verification techniques provide rigorous guarantees but struggle with creative discovery. By synergistically combining the strengths of LLMs and formal verifiers in a closed-loop feedback mechanism, LPO can effectively discover verified peephole optimizations that were previously missed. We comprehensively evaluated LPO within LLVM ecosystems. Our evaluation shows that LPO can successfully identify up to 22 out of 25 previously reported missed optimizations in LLVM. In contrast, the recently proposed superoptimizers for LLVM, Souper and Minotaur detected 15 and 3 of them, respectively. More importantly, within eleven months of development and intermittent testing, LPO found 62 missed peephole optimizations, of which 28 were confirmed and an additional 13 had already been fixed in LLVM. These results demonstrate LPO's strong potential to continuously uncover new optimizations as LLMs' reasoning improves.",
"doi": "10.1145/3779212.3790184",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790184",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790184",
"has_full_text": true
},
{
"title": "Enabling Fast Networking in the Public Cloud",
"authors": [
"Alireza Sanaee",
"Vahab Jabrayilov",
"Ilias Marinos",
"Farbod Shahinfar",
"Divyanshu Saxena",
"Gianni Antichi",
"Kostis Kaffes"
],
"abstract": "Despite a decade of research, most high-performance userspace network stacks remain impractical for public cloud tenants developing their applications atop Virtual Machines (VMs). We identify two root causes: (1) reliance on specialized NIC features (e.g., flow steering, deep buffers) absent in commodity cloud vNICs, and (2) rigid execution models ill-suited to diverse application needs. We present Machnet, a high-performance and flexible userspace network stack designed for public cloud VMs. Machnet uses only a minimal set of vNIC features that any major cloud provider supports. It also relies on a microkernel architecture to enable flexible application execution. We evaluate Machnet across three major public clouds and on production-grade applications, including a key-value store, an HTTP server, and a state-machine replication system. We release Machnet at https://github.com/microsoft/machnet.",
"doi": "10.1145/3779212.3790158",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790158",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790158",
"has_full_text": true
},
{
"title": "Static Analysis for Efficient Streaming Tokenization",
"authors": [
"Angela W. Li",
"Yudi Yang",
"Konstantinos Mamouras"
],
"abstract": "Tokenization, also referred to as lexing or scanning, is the computational task of partitioning an input text into a sequence of substrings called tokens. Tokenization is one of the first stages of program compilation; it is used in natural language processing, and it is also useful for processing unstructured text or semi-structured data such as JSON, CSV, and XML. A tokenizer is typically specified as a list of regular expressions, which is called a tokenization grammar. Each regular expression describes a class of tokens (e.g., integer, floating-point number, variable identifier, string literal). The semantics of tokenization employs the longest match policy to disambiguate among the possible choices. This policy says that we should prefer a longer token over a shorter one. It is also known as the maximal munch policy. Tokenization is an important computational task when processing semi-structured data, as it often precedes parsing, querying, or data transformations. Due to the abundance of large-scale semi-structured data, which can be too large to load in memory, it is desirable to perform tokenization in a streaming fashion with a small memory footprint. First, we observe that some tokenization grammars are inherently more difficult to deal with than others, and we provide a static analysis algorithm for recognizing them. We then propose the StreamTok algorithm, which relies on this analysis to enable efficient tokenization. StreamTok is asymptotically better than the standard algorithm of flex. Our experimental results show that our implementation of StreamTok outperforms state-of-the-art tools for tokenization.",
"doi": "10.1145/3779212.3790227",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790227",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790227",
"has_full_text": true
},
{
"title": "SNIP: An Adaptive Mixed Precision Framework for Subbyte Large Language Model Training",
"authors": [
"Yunjie Pan",
"Yongyi Yang",
"Hanmei Yang",
"Scott Mahlke"
],
"abstract": "Training large language models (LLMs) efficiently while preserving model quality poses significant challenges, particularly with subbyte precision supported by state-of-the-art GPUs. Current mixed-precision training approaches either apply uniform precision to all GEMM operations or rely on heuristic-based methods that fail to generalize during training, leading to suboptimal convergence and instability. To address these challenges, this paper introduces SNIP, a fine-grained adaptive mixed-precision training framework for LLM pretraining that supports subbyte precision. SNIP periodically collects statistics on activations, gradients, and optimizer states to assess the precision loss impact on model quality. We define two key metrics: loss divergence in the forward pass, caused by quantization-induced increases in training loss, and weight divergence in the backward pass, which measures error propagation through gradients affecting model updates. These metrics guide an Integer Linear Programming (ILP) problem that systematically optimizes layerwise precision to minimize overall quality loss while meeting efficiency targets. Experiments on 1B, 3B, 7B and 70B Llama-like models demonstrate that SNIP consistently outperforms existing baselines, reducing FLOPs by up to 80% while preserving model quality across different model sizes and training phases with minimal computational overhead.",
"doi": "10.1145/3779212.3790223",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790223",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790223",
"has_full_text": true
},
{
"title": "Wax: Optimizing Data Center Applications With Stale Profile",
"authors": [
"Tawhid Bhuiyan",
"Sumya Hoque",
"Ang\u00e9lica Aparecida Moreira",
"Tanvir Ahmed Khan"
],
    "abstract": "Data center applications' large instruction footprints cause frequent front-end stalls by overwhelming on-chip micro-architectural structures such as instruction cache (I-cache), instruction translation look-aside buffer (iTLB), and branch target buffer (BTB). To reduce pressure on these structures, data center providers leverage profile-guided optimizations by reordering binary layout along a relatively small number of hot code paths. Such reordering provides the highest benefit if profile collection and optimization happen on the same version of the binary. In practice, companies have to optimize and deploy a fresh version of the binary with a profile from a previous version, making a large fraction of the profile stale. In this paper, we propose Wax, a novel technique to optimize data center applications with stale profiles. We open source our work at https://github.com/ice-rlab/wax. Wax's key insight is to leverage the debug and source code information while optimizing fresh binaries with stale profiles. We evaluate Wax for 5 data center applications to show that Wax provides significant (5.76%-26.46%) performance speedups. Wax achieves 1.20%-7.86% greater speedups than the state of the art, obtaining 65%-93% of fresh profiles' benefits.",
"doi": "10.1145/3779212.3790248",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790248",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790248",
"has_full_text": true
},
{
"title": "SLAWS: Spatial Locality Analysis and Workload Orchestration for Sparse Matrix Multiplication",
"authors": [
"Guoyu Li",
"Zheng Guan",
"Beichen Zhang",
"Jun Yu",
"Kun Wang"
],
"abstract": "Sparse matrix-sparse matrix multiplication (SpMSpM) is widely used in modern scientific applications, including high-performance computing, linear algebra, and graph processing. However, the highly variable distribution of nonzero elements in these matrices presents a significant challenge to computational efficiency. While existing sparse matrix accelerators often rely on specialized architectures tailored for specific dataflow, these designs sacrifice generality and fail to fully exploit potential data reuse opportunities. In this paper, we propose Slaws, an efficient and general-purpose accelerator architecture for SpMSpM. This design accelerates computation by analyzing distinct sparsity patterns and leveraging various data reuse opportunities. We introduce two key strategies to address the challenges of different matrix operand positions. For the left multiplicand, we present a Pass-Aware strategy to analyze matrix structures, identify unique sparsity patterns, and optimize memory access. For the right multiplicand, we introduce a Shuffle-Compare strategy to dynamically balance the multiplication workload across computing units. By approximating the top-K outputs per cycle, this method minimizes idle time and ensures a balanced workload distribution. We implement both strategies in Slaws and conduct extensive experiments. The results show that Slaws achieves 1.46\u00d7 and 1.43\u00d7 speedups over two state-of-the-art accelerators, and 1.55\u00d7 speedup over GPU.",
"doi": "10.1145/3779212.3790222",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790222",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790222",
"has_full_text": true
},
{
"title": "ReliaFHE: Resilient Design for Fully Homomorphic Encryption Accelerators",
"authors": [
"Fan Li",
"Mayank Kumar",
"Ruizhi Zhu",
"Mengxin Zheng",
"Qian Lou",
"Xin Xin"
],
    "abstract": "The significant computational complexity of Fully Homomorphic Encryption (FHE) has prompted numerous accelerator designs. However, existing FHE accelerators often implicitly assume that all computations are executed reliably, overlooking the fact that modern FHE schemes can be highly vulnerable to hardware faults: even a single-bit error in a ciphertext can cascade into widespread plaintext corruption. In this paper, we present ReliaFHE, a resilient framework that integrates both storage- and computation-oriented protection for FHE accelerators. It leverages the large ciphertext polynomials inherent in FHE and employs lightweight, checksum-based schemes to construct efficient codewords. ReliaFHE is tailored to the three dominant arithmetic kernels: number theoretic transform, base conversion, and element-wise operations. Our evaluation shows that ReliaFHE improves reliability by more than 10^4 (in terms of primitive operations), with either ~1.5% performance overhead or ~1.9% area overhead. The parity storage overhead is below 1%.",
"doi": "10.1145/3779212.3790211",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790211",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790211",
"has_full_text": true
},
{
"title": "A Programming Model for Disaggregated Memory over CXL",
"authors": [
"Gal Assa",
"Moritz Lumme",
"Lucas B\u00fcrgi",
"Michal Friedman",
"Ori Lahav"
],
"abstract": "CXL (Compute Express Link) is an emerging open industry-standard interconnect between processing and memory devices that is expected to revolutionize the way systems are designed. It enables cache-coherent, shared memory pools in a disaggregated fashion at unprecedented scales, allowing algorithms to interact with various storage devices using simple loads and stores. While CXL unleashes unique opportunities, it also introduces challenges of data management and crash consistency. For example, CXL currently lacks an adequate programming model, making it impossible to reason about the correctness and behavior of systems on top. In this work, we present CXL0, the first programming model for concurrent programs over CXL. We propose a high-level abstraction for memory accesses and formally define operational semantics. We demonstrate that CXL0 captures a wide range of current and future CXL setups and perform initial measurements on real hardware. To illustrate the usefulness of CXL0, we present a general transformation that enhances any linearizable concurrent algorithm with durability in a distributed partial-crash setting. We believe that this work will serve as a stepping stone for systems design and programming on top of CXL.",
"doi": "10.1145/3779212.3790121",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790121",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790121",
"has_full_text": true
},
{
"title": "Reconfigurable Quantum Instruction Set Computers for High Performance Attainable on Hardware",
"authors": [
"Zhaohui Yang",
"Dawei Ding",
"Qi Ye",
"Cupjin Huang",
"Jianxin Chen",
"Yuan Xie"
],
    "abstract": "Despite remarkable milestones in quantum computing, the performance of current quantum hardware remains limited. One critical path to higher performance is to expand the quantum ISA with basis gates that have higher fidelity and greater synthesis capabilities than the standard CNOT. However, this substantially increases gate calibration overhead and introduces challenges in compiler optimization. Consequently, although more expressive ISAs (even complex, continuous gate sets) have been proposed, they still remain primarily proofs-of-concept and have not been widely adopted. To move beyond these hurdles and unlock the performance gains offered by expressive continuous ISAs, we introduce the concept of ''reconfigurable quantum instruction set computers'' (ReQISC). It incorporates (1) a unified microarchitecture capable of directly implementing arbitrary 2Q gates equivalently, i.e., SU(4) modulo 1Q rotations, with theoretically optimal gate durations given any 2Q coupling Hamiltonian and (2) a compilation framework tailored to ReQISC primitives for end-to-end synthesis and optimization, comprising a program-aware pass that refines high-level representations, a program-agnostic pass for aggressive circuit-level optimization, and an SU(4)-aware routing pass that minimizes hardware mapping overhead. We detail the hardware implementation to demonstrate the feasibility of this superior gate scheme in terms of both pulse control and calibration. By leveraging the expressivity of SU(4) and the time minimality realized by the underlying microarchitecture, the SU(4)-based ISA achieves remarkable performance, with a 4.97-fold reduction in average pulse duration to implement arbitrary 2Q gates, compared to the usual CNOT/CZ scheme on mainstream flux-tunable transmons. Supported by the end-to-end compiler, ReQISC outperforms the conventional CNOT-based ISA, state-of-the-art compiler, and pulse implementation counterparts by significantly reducing 2Q gate count, circuit depth, pulse duration, qubit mapping overhead, and program fidelity losses. For the first time, ReQISC makes the theoretical benefits of continuous ISAs practically feasible.",
"doi": "10.1145/3779212.3790208",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790208",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790208",
"has_full_text": true
},
{
    "title": "Wave: Leveraging Architecture Observation for Privacy-Preserving Model Oversight",
"authors": [
"Haoxuan Xu",
"Chen Gong",
"Beijie Liu",
"Haizhong Zheng",
"Beidi Chen",
"Mengyuan Li"
],
    "abstract": "Large Language Model (LLM) inference increasingly requires mechanisms that provide runtime visibility into what is actually executing, without exposing model weights or code. We present WAVE, a hardware-grounded monitoring framework that leverages GPU performance counters (PMCs) to observe LLM inference. WAVE is built on the insight that legitimate executions of a given model must satisfy hardware-constrained invariants, such as memory accesses, instruction mix, and tensor-core utilization, induced by the model's linear-algebraic structure. WAVE collects lightweight PMC traces and applies a two-stage pipeline: (1) inferring architectural properties (e.g., parameter count, layer depth, hidden dimension, batch size) from the observed traces; and (2) using an SMT-based consistency checker to assess whether the execution aligns with the provisioned compute and the claimed model's constraints. We evaluate WAVE on common open-source LLM architectures, such as LLaMA, GPT, and Qwen, across multiple GPU architectures, including NVIDIA Ada Lovelace, Hopper, and Blackwell. Results show that WAVE recovers key model parameters with an average error of 6.8% and identifies disguised executions under realistic perturbations. By grounding oversight in hardware invariants, WAVE provides a practical avenue for continuous, privacy-preserving runtime monitoring of LLM services.",
"doi": "10.1145/3779212.3790247",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790247",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790247",
"has_full_text": true
},
{
"title": "Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads",
"authors": [
"Mert Hidayetoglu",
"Aurick Qiao",
"Michael Wyatt",
"Jeff Rasley",
"Yuxiong He",
"Samyam Rajbhandari"
],
"abstract": "",
"doi": "10.1145/3779212.3790219",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790219",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790219",
"has_full_text": true
},
{
"title": "Arm Weak Memory Consistency on Apple Silicon: What Is It Good For?",
"authors": [
"Yossi Khayet",
"Adam Morrison"
],
"abstract": "Weak memory models such as the Arm model are perceived as enabling higher performance than strong models such as TSO. We critically test this perception on Apple silicon CPUs, whose runtime-configurable TSO mode enables a direct comparison with native Arm mode. We find that Apple silicon TSO mode preserves Arm weak-memory optimizations, typically yielding execution times within 3% of Arm mode across modern applications and classic benchmarks. Although some applications experience higher TSO slowdowns, we trace these to artifacts of Apple's TSO implementation rather than inherent TSO ordering constraints. Our results challenge the perception that the Arm memory model offers a significant performance advantage over TSO in Apple silicon.",
"doi": "10.1145/3779212.3790129",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790129",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790129",
"has_full_text": true
},
{
"title": "LAIKA: Machine Learning-Assisted In-Kernel APU Acceleration",
"authors": [
"Haoming Zhuo",
"Dingding Li",
"Ronghua Lin",
"Yong Tang"
],
"abstract": "The integration of machine learning (ML) into OS kernels is severely hampered by the high latency of offloading to discrete GPUs (dGPUs), where data transfers across the PCIe bus can consume over 93% of the total execution time. This paper argues that for many latency-sensitive kernel tasks, the solution is not a more powerful dGPU but a fundamental shift to an I/O-efficient architecture: the integrated GPU (iGPU) found in modern APUs. We present LAIKA, a kernel-space acceleration framework that fully exploits the APU's unified memory. LAIKA combines three key mechanisms: a lightweight API proxy (AProxy) to safely bridge the kernel-user boundary, a tri-domain shared memory manager (AShm) for true zero-copy data exchange, and APU Persistent Kernel (APK) to eliminate control-path overhead. Across a suite of representative kernel workloads, our evaluation shows that by eliminating the I/O bottleneck, LAIKA slashes end-to-end inference latency by up to 9.7\u00d7 and reduces total system power to as little as 28.9% of an optimized dGPU baseline. Our work identifies iGPUs as a practical and highly efficient alternative for machine learning acceleration within OS kernels.",
"doi": "10.1145/3779212.3790181",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790181",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790181",
"has_full_text": true
},
{
"title": "CHEHAB RL: Learning to Optimize Fully Homomorphic Encryption Computations",
"authors": [
"Bilel Sefsaf",
"Abderraouf Dandani",
"Abdessamed Seddiki",
"Arab Mohammed",
"Eduardo Chielle",
"Michail Maniatakos",
"Riyadh Baghdadi"
],
"abstract": "",
"doi": "10.1145/3779212.3790138",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790138",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790138",
"has_full_text": true
},
{
"title": "CacheMind: From Miss Rates to Why - Natural-Language, Trace-Grounded Reasoning for Cache Replacement",
"authors": [
"Kaushal Mhapsekar",
"Azam Ghanbari",
"Bita Aslrousta",
"Samira Mirbagher-Ajorpaz"
],
    "abstract": "Cache replacement remains a challenging problem in CPU microarchitecture, often addressed using hand-crafted heuristics that limit cache performance. Cache data analysis requires parsing millions of trace entries with manual filtering, making the process slow and non-interactive. To address this, we introduce CacheMind, a conversational tool that uses Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to enable semantic reasoning over cache traces. Architects can now ask natural language questions like, ''Why is the memory access associated with PC X causing more evictions?'', and receive trace-grounded, human-readable answers linked to program semantics for the first time. To evaluate CacheMind, we present CacheMindBench, the first verified benchmark suite for LLM-based reasoning for the cache replacement problem. Using the Sieve retriever, CacheMind achieves 66.67% on 75 unseen trace-grounded questions and 84.80% on 25 unseen policy-specific reasoning tasks; with Ranger, it achieves 89.33% and 64.80% on the same evaluations. Additionally, with Ranger, CacheMind achieves 100% accuracy on 4 out of 6 categories in the trace-grounded tier of CacheMindBench. Compared to LlamaIndex (10% retrieval success), Sieve achieves 60% and Ranger achieves 90%, demonstrating that existing Retrieval-Augmented Generation (RAG) approaches are insufficient for precise, trace-grounded microarchitectural reasoning.",
"doi": "10.1145/3779212.3790136",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790136",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790136",
"has_full_text": true
},
{
"title": "FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning",
"authors": [
"Hao Mark Chen",
"Zhiwen Mo",
"Guanxi Lu",
"Shuang Liang",
"Lingxiao Ma",
"Wayne Luk",
"Hongxiang Fan"
],
    "abstract": "Recent advances in reasoning Large Language Models (LLMs) are driving the emergence of agentic AI systems. Edge deployment of LLM agents near end users is increasingly necessary to protect data privacy, enable offline use, and provide responsive interaction with local context. However, strict memory constraints on edge devices limit deployment to smaller LLMs, whose reasoning capabilities are much weaker than those of large cloud models, hindering practical deployment of edge agentic AI. Test-Time Scaling (TTS) offers a promising solution by allocating more compute during inference to enhance the reasoning capability of edge LLMs. However, current TTS methods introduce heavy hardware performance overhead on resource-constrained devices, making them impractical for real applications. To address this challenge, we present FastTTS, a serving system that enables fast and efficient TTS for memory-constrained LLM reasoning. After analyzing common patterns across various TTS methods and identifying their performance bottlenecks, we introduce three novel techniques: i) Speculative Beam Extension, which mitigates system stragglers caused by irregular reasoning paths, ii) Asymmetric Multi-Model Memory Allocation, which dynamically balances memory usage between token generation and reasoning-step verification, and iii) Dynamic Prefix-Aware Scheduling, which optimizes reasoning execution to maximize KV-cache reuse across search paths. FastTTS offers a plug-and-play third-party library on top of vLLM, enabling edge LLMs (\u2264 7B) on a single consumer GPU (24 GB) to match cloud-model accuracy and cloud-measured latency. Comprehensive evaluation shows that FastTTS achieves an average 2.2\u00d7 higher goodput and reduces latency by 38%-68% compared to the vLLM baseline; it pushes the boundaries of low-latency Test-Time Scaling on memory-constrained edge devices and highlights the potential for democratizing agentic AI.",
"doi": "10.1145/3779212.3790161",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790161",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790161",
"has_full_text": true
},
{
"title": "Neura: A Unified Framework for Hierarchical and Adaptive CGRAs",
"authors": [
"Cheng Tan",
"Miaomiao Jiang",
"Yuqi Sun",
"Ruihong Yin",
"Yanghui Ou",
"Qing Zhong",
"Lei Ju",
"Jeff Zhang"
],
    "abstract": "Coarse-Grained Reconfigurable Arrays (CGRAs) are a promising solution for energy-efficient acceleration across multiple application domains. Yet, CGRAs face significant scalability challenges that hinder their widespread adoption, stemming from three main concerns: (1) Mapping Scalability \u2014 existing mapping algorithms struggle to find feasible and optimal solutions as the design complexity grows; (2) Architectural Limitations \u2014 rigid mapping granularity and memory access restrict flexibility and performance; and (3) Dynamic Multi-Kernel Support \u2014 dynamic and simultaneous execution of multiple kernels is not thoroughly explored, limiting the applicability of CGRAs in complex multi-kernel scenarios. In this paper, we propose Neura, an open-source unified framework towards scalable CGRAs to address these challenges. Neura introduces a novel hierarchical spatial-temporal CGRA architecture combined with a migration-aware mapping algorithm. This combination uniquely enables dynamic resource allocation and kernel migration, addressing key scalability and utilization challenges in multi-kernel scenarios. Our experimental results demonstrate that Neura achieves a throughput improvement of 1.64\u00d7 to 3.85\u00d7 and significantly higher utilization across different scenarios over a conventional CGRA at the same scale.",
"doi": "10.1145/3779212.3790193",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790193",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790193",
"has_full_text": true
},
{
"title": "PrioriFI: More Informed Fault Injection for Edge Neural Networks",
"authors": [
"Olivia Weng",
"Andres Meza",
"Nhan Tran",
"Ryan Kastner"
],
"abstract": "As neural networks (NNs) are increasingly used to provide edge intelligence, there is a growing need to make the edge devices that run them robust to faults. Edge devices must mitigate the resulting hardware failures while maintaining strict constraints on power, energy, latency, throughput, memory size, and computational resources. Edge NNs require fundamental changes in model architecture, e.g., quantization and fewer, smaller layers. PrioriFI is a more informed fault injection (FI) algorithm that evaluates edge NN robustness by ranking NN bits based on their fault sensitivity. PrioriFI prioritizes finding highly fault-sensitive bits, that is, the bits most critical to an NN's correctness, first. To accomplish this, PrioriFI uses the Hessian for the initial parameter ranking. Then, during an FI campaign, PrioriFI uses the information gained from past FIs as a heuristic so that future FIs target the bits likely to be the next most sensitive. With PrioriFI, designers can quickly evaluate different NNs and better co-design fault-tolerant edge NN systems.",
"doi": "10.1145/3779212.3790204",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790204",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790204",
"has_full_text": true
},
{
"title": "Segment Only Where You Look: Leveraging Human Gaze Behavior for Efficient Computer Vision Applications in Augmented Reality",
"authors": [
"Tianhua Xia",
"Haiyu Wang",
"Sai Qian Zhang"
],
"abstract": "Augmented reality (AR) comprises groundbreaking technologies that are reshaping the landscape of human interaction. Image segmentation, which divides a user front-scene frame into more manageable parts for analysis, is of paramount importance since this technique enables AR systems to extract digital information precisely from the real world by identifying and isolating specific objects in the user's surroundings. Despite its importance, the segmentation task imposes substantial computational demands and processing delays on AR devices, significantly degrading the user experience. In this work, we aim to reduce the high computational costs of the segmentation task in AR by leveraging natural human eye dynamics and focusing on segmenting only where you look (SOLO). This involves co-optimizing image segmentation algorithms with underlying hardware for greater efficiency. We introduce SOLO algorithm, an efficient deep learning framework that takes high-resolution input images and user eye images to effectively segment only the instance of interest. Integrated with the saliency-based sensing (SBS) and SOLO accelerator as a plug-in for SoCs of the AR device, SOLO significantly lowers the computational costs for image segmentation, achieving up to a 12\u00d7 reduction in end-to-end latency compared to other baselines.",
"doi": "10.1145/3779212.3790216",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790216",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790216",
"has_full_text": true
},
{
"title": "Finding Reusable Instructions via E-Graph Anti-Unification",
"authors": [
"Youwei Xiao",
"Chenyun Yin",
"Yitian Sun",
"Yuyang Zou",
"Yun Liang"
],
    "abstract": "Domain-specific accelerators provide an increasingly valuable source of performance for diverse applications. Custom instructions that trigger the execution of dedicated hardware units or accelerators for common application functions become key building blocks in modern computing systems, balancing performance and cost effectiveness. RISC-V, the open and extensible instruction set architecture, is increasingly popularizing this trend. However, exploring custom instructions for an application domain remains challenging. Existing automated approaches suffer from poor reusability and limited performance. They can only identify or merge syntactically similar, scalar instruction sequences while missing semantically equivalent patterns. We present ISAMORE, an end-to-end framework for discovering reusable custom instructions from domain applications. ISAMORE encodes general applications in an e-graph by constructing a structured domain-specific language. Its core methodology, reusable instruction identification (RII), leverages e-graph anti-unification (AU) to identify semantically equivalent common patterns across diverse applications, fully unleashing the potential of custom instructions. RII employs a phase-oriented iterative process with smart heuristics to enhance the scalability when dealing with real-world codebases. Besides, RII introduces the novel pattern vectorization technique, packing common operations from scalar programs into lanes of vectorized custom instructions to exploit data-level parallelism. Moreover, RII's Pareto-optimal pattern selection balances performance gains with area overheads, guided by a profiling-based hardware-aware cost model. Evaluation demonstrates ISAMORE's substantial performance gains, 1.12\u00d7-2.69\u00d7, over baseline approaches. We also demonstrate ISAMORE's practical potential using various case studies, including library analysis for three application domains and hardware specialization for quantized LLM inference and post-quantum cryptography.",
"doi": "10.1145/3779212.3790162",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790162",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790162",
"has_full_text": true
},
{
"title": "Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration",
"authors": [
"Zejia Lin",
"Hongxin Xu",
"Guanyi Chen",
"Zhiguang Chen",
"Yutong Lu",
"Xianwei Zhang"
],
    "abstract": "Modern large language model (LLM) serving systems confront inefficient GPU utilization due to the fundamental mismatch between the compute-intensive prefill phase and the memory-bound decode phase. While current practices attempt to address this by organizing these phases into hybrid batches, such solutions create an inefficient tradeoff that sacrifices either throughput or latency, leaving substantial GPU resources underutilized. For this, we identify two key root causes: 1) the prefill phase suffers from suboptimal compute utilization due to wave quantization and attention bottlenecks, and 2) hybrid batching disproportionately prioritizes latency over throughput, wasting both compute resources and memory bandwidth. To mitigate the issues, we present Bullet, a novel spatial-temporal orchestration system that eliminates these inefficiencies through fine-grained phase coordination. Bullet enables concurrent execution of prefill and decode requests, while dynamically provisioning GPU resources based on real-time performance modeling. By integrating SLO-aware scheduling and adaptive resource allocation, Bullet maximizes GPU utilization without compromising latency targets. Experimental evaluations on real-world workloads demonstrate that Bullet delivers 1.26\u00d7 average throughput gains (up to 1.55\u00d7) over state-of-the-art systems, while consistently meeting latency constraints.",
"doi": "10.1145/3779212.3790135",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790135",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790135",
"has_full_text": true
},
{
"title": "STRAW: Stress-Aware WL-Based Read Disturbance Management for High-Density NAND Flash Memory",
"authors": [
"Myoungjun Chun",
"Jaeyong Lee",
"Inhyuk Choi",
"Jisung Park",
"Myungsuk Kim",
"Jihong Kim"
],
"abstract": "While NAND flash memory has continuously increased its storage density over decades, this progress has exacerbated the read-disturbance problem. In this work, we identify two fundamental limitations of existing read-disturbance management techniques that trigger read reclaim (RR) at the block granularity: (i) they overlook the heterogeneous reliability impact of read disturbance across individual wordlines (WLs), leading to unnecessary RR in many cases; and (ii) they address read disturbance only after disturbance-induced errors have already accumulated, which forces substantial RR-induced copy overheads in read-disturbance-prone modern NAND flash memory. To address these limitations, we propose STRAW (STRess-Aware Wordline-based read-disturbance management), a new technique that minimizes RR overheads through two key ideas: (i) stress-aware WL-based read reclaim, which monitors the accumulated read-disturbance effect on each WL and reclaims only heavily disturbed WLs, and (ii) stress-reduced read, which mitigates disturbance on valid WLs during each read operation by scaling pass-through voltages based on WL validity. Our experimental results using a modern SSD emulator show that STRAW reduces RR-induced page-copy overhead by 88.6% on average compared with the state-of-the-art technique.",
"doi": "10.1145/3779212.3790228",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790228",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790228",
"has_full_text": true
},
{
"title": "Efficient Remote Memory Ordering for Non-Coherent Systems",
"authors": [
"Wei Siew Liew",
"Md Ashfaqur Rahaman",
"Adarsh Patil",
"Ryan Stutsman",
"Vijay Nagarajan"
],
"abstract": "Software using non-coherent interconnects like PCI Express requires fine-grained memory ordering, but current hardware mandates the use of costly source-side serialization. We show that this architectural mismatch severely limits the performance of two critical applications: (1) the transmission of network packets from a CPU to a NIC (requiring write-to-write ordering) and (2) key-value store lookups by an RDMA-enabled NIC (requiring read-to-read ordering). We address this by proposing a new destination-based ordering model and the hardware-software co-design comprising PCIe extensions and ISA extensions that allow software to express ordering intent efficiently. Novel microarchitecture at the Root Complex enforces these expressed semantics, eliminating source-side stalls. Our approach significantly improves the throughput of these application kernels and enables new, simpler protocols that outperform the state-of-the-art.",
"doi": "10.1145/3779212.3790156",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790156",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790156",
"has_full_text": true
},
{
"title": "Architecting Scalable Trapped Ion Quantum Computers using Surface Codes",
"authors": [
"Scott Jones",
"Prakash Murali"
],
"abstract": "",
"doi": "10.1145/3779212.3790128",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790128",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790128",
"has_full_text": true
},
{
"title": "SpecProto: A Parallelizing Compiler for Speculative Decoding of Large Protocol Buffers Data",
"authors": [
"Zhijie Wang",
"Chales Hong",
"Dhruv Parmar",
"Shengbo Ma",
"Zhijia Zhao",
"Qidong Zhao",
"Xu Liu"
],
    "abstract": "Protobuf is a widely used data serialization format, especially in cloud environments. However, existing compilers generate only serial decoders, limiting scalability for large datasets. While parallel parsing has been studied for textual formats (e.g., XML), parallel decoding of binary formats like Protobuf remains unexplored and presents unique opportunities. We propose two techniques to enable parallel Protobuf decoding. First, we leverage the length-prefixed encoding of Protobuf to ''skim'' the binary and identify decoding tasks. To address inefficiencies caused by many small or imbalanced fields, we further introduce speculative parallelization, which partitions a binary into even segments and predicts decoding states across boundaries. We implement these techniques in SpecProto, a parallelizing compiler that generates parallel decoders for a given Protobuf schema. Experiments on real-world and synthetic datasets show that SpecProto achieves significant speedups by leveraging multiple CPU cores.",
"doi": "10.1145/3779212.3790225",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790225",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790225",
"has_full_text": true
},
{
"title": "PACT: A Criticality-First Design for Tiered Memory",
"authors": [
"Hamid Hadian",
"Jinshu Liu",
"Hanchen Xu",
"Hansen Idden",
"Huaicheng Li"
],
"abstract": "Tiered memory systems typically place pages based on access frequency (hotness), yet frequency alone fails to capture the true performance impact. We present PACT, an online, page-granular tiered memory design that elevates performance criticality to a first-class design principle. At its core is Per-page Access Criticality (PAC), a fine-grained metric that quantifies each page's contribution to application performance rather than merely counting accesses. PACT profiles PAC online using a lightweight analytical model that uniquely decomposes per-tier memory-level parallelism via hardware queue occupancy counters, enabling direct CPU stall attribution to individual pages. To handle highly skewed PAC distributions, PACT employs PAC-centric migration policies: eager demotion and adaptive promotion, to dynamically place performance-critical pages in DRAM. Across 13 workloads, PACT achieves up to 61% performance improvement over the best of 7 state-of-the-art tiering designs with up to 50\u00d7 fewer migrations.",
"doi": "10.1145/3779212.3790198",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790198",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790198",
"has_full_text": true
},
{
"title": "C\n <scp>xlalloc</scp>\n : Safe and Efficient Memory Allocation for a CXL Pod",
"authors": [
"Newton Ni",
"Yan Sun",
"Zhiting Zhu",
"Emmett Witchel"
],
"abstract": "A Compute Express Link (CXL) pod is a group of hosts that share CXL-attached memory. A memory allocator for a CXL pod faces novel challenges: (1) CXL devices may not fully support inter-host hardware cache coherence (HWcc), (2) the allocator may be concurrently accessed from different processes, and (3) with more hosts, failures become more likely. We present cxlalloc, a user-space memory allocator that addresses these challenges through careful metadata layout and new protocols to maintain cache coherence in software, coordinate memory mappings across processes, and recover from crashes. Cxlalloc uses compare-and-swap (CAS) for efficient synchronization; to support CXL devices with no HWcc, we present a memory-based CAS (mCAS) primitive implemented in an FPGA. Experiments with in-memory key-value store workloads demonstrate that cxlalloc retains competitive performance while enabling new use-cases. Experiments with a commercial CXL device show that cxlalloc can achieve 80% of its maximum allocation throughput using mCAS.",
"doi": "10.1145/3779212.3790149",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790149",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790149",
"has_full_text": true
},
{
"title": "LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training",
"authors": [
"Xinyi Liu",
"Yujie Wang",
"Fangcheng Fu",
"Xuefeng Xiao",
"Huixia Li",
"Jiashi Li",
"Bin Cui"
],
"abstract": "Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models, enabling different devices to host distinct experts, with each device processing different input data. However, during expert parallel training, dynamic routing results in significant load imbalance among experts: a handful of overloaded experts hinder overall iteration, emerging as a training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices and restores partial experts at expert granularity through All-to-All communication during training. This allows for flexible re-layout of expert parameters during training to enhance load balancing. In particular, we perform fine-grained scheduling of communication operations to minimize communication overhead. Additionally, we develop a load balancing planner to formulate re-layout strategies of experts and routing schemes for tokens during training. We perform experiments on an A100 cluster, and the results indicate that our system achieves up to 1.69x acceleration compared to the current state-of-the-art training systems. Source code available at https://github.com/PKU-DAIR/Hetu-Galvatron/tree/laer-moe.",
"doi": "10.1145/3779212.3790180",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790180",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790180",
"has_full_text": true
},
{
"title": "BlendServe: Optimizing Offline Inference with Resource-Aware Batching",
"authors": [
"Yilong Zhao",
"Shuo Yang",
"Kan Zhu",
"Lianmin Zheng",
"Baris Kasikci",
"Yifan Qiao",
"Yang Zhou",
"Jiarong Xing",
"Ion Stoica"
],
"abstract": "Offline batch inference is gaining popularity as a cost-effective solution for latency-insensitive tasks, such as model evaluation and data curation. As the latency objective is highly relaxed, maximizing throughput is the primary goal in offline inference. Previous studies focused solely on throughput optimization within a batch. However, the diverse resource demands (compute-intensive vs. memory-intensive) across a wide range of applications make these approaches less effective, as imbalanced resource demands between batches restrict optimization opportunities. Our insight is to create batches with mixed compute- and memory-intensive requests through request reordering to maximize resource overlapping. However, such a request schedule can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing suboptimal inference throughput. In this paper, we first build a performance model to analyze request resource demands. Based on it, we design BlendServe, which harmonizes both resource overlapping and prefix sharing to maximize throughput. BlendServe organizes all requests using a resource-aware prefix tree and proposes a dual scanning algorithm to obtain the request schedule. Our evaluation on various models and workloads shows that BlendServe can achieve up to 90% of the optimal throughput.",
"doi": "10.1145/3779212.3790133",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790133",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790133",
"has_full_text": true
},
{
"title": "FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations",
"authors": [
"Zhihao Shu",
"Md Musfiqur Rahman Sanim",
"Hangyu Zheng",
"Kunxiong Zhu",
"Miao Yin",
"Gagan Agrawal",
"Wei Niu"
],
"abstract": "The increasing size and complexity of modern deep neural networks (DNNs) pose significant challenges for on-device inference on mobile GPUs, with limited memory and computational resources. Existing DNN acceleration frameworks primarily deploy a weight preloading strategy, where all model parameters are loaded into memory before execution on mobile GPUs. We posit that this approach is not adequate for modern DNN workloads that comprise very large model(s) and possibly execution of several distinct models in succession. In this work, we introduce FlashMem, a memory streaming framework designed to efficiently execute large-scale modern DNNs and multi-DNN workloads while minimizing memory consumption and reducing inference latency. Instead of fully preloading weights, FlashMem statically determines model loading schedules and dynamically streams them on demand, leveraging 2.5D texture memory to minimize data transformations and improve execution efficiency. Experimental results on 11 models demonstrate that FlashMem achieves 2.0\u00d7 to 8.4\u00d7 memory reduction and 1.7\u00d7 to 75.0\u00d7 speedup compared to existing frameworks, enabling efficient execution of large-scale models and multi-DNN support on resource-constrained mobile GPUs.",
"doi": "10.1145/3779212.3790164",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790164",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790164",
"has_full_text": true
},
{
"title": "Signal Breaker: Fuzzing Digital Signal Processors",
"authors": [
"Cameron Santiago Garcia",
"Matthew Hicks"
],
"abstract": "Fuzzing is one of the most effective techniques for discovering software vulnerabilities. Fuzzers use feedback from prior executions to generate new test cases via random mutation,executing these inputs to uncover bugs. Fuzzing has been successfully applied to applications, operating systems, processors, and network protocols, making it one of the most widely adopted software testing methodologies. Despite this success, fuzzing has seen little adoption for Digital Signal Processor (DSP) software. DSPs occupy a unique position at the boundary of hardware and software: they ingest signals from the physical world, execute software instructions, and are tightly integrated into data-processing pipelines. Many safety- and security-critical domains, including telecommunications, transportation, and defense, rely heavily on DSPs, making robust DSP testing essential. To address this gap, we introduce SBFUZZ, a coverage-guided fuzzer designed specifically for DSP software. SBFUZZ is driven by three key observations: (1) DSPs expose limited and high-latency execution control and communication interfaces, and (2) DSPs have unique architectures that necessitate new instrumentation and mutation routines, and (3) DSP fuzzing must detect both traditional software bugs manifested as crashes and hardware-style bugs manifested as divergent yet continuing execution. Based on these insights, SBFUZZ advocates a DSP-centric fuzzer decomposition, where the DSP executes most fuzzing tasks autonomously while periodically leveraging a more powerful host for coordination, analysis, and storage. This design allows a single host to concurrently fuzz multiple end devices and supports both physical DSPs and simulated DSPs for re-hosted fuzzing. 
We implement SBFUZZ on a Texas Instruments TMS320C5515 DSP and evaluate it on 15 DSP benchmark programs.Our results show that SBFUZZ achieves 17.4x higher throughput and 2.6x greater code coverage than prior embedded fuzzing approaches applied to DSPs, uncovering 2491 unique crashes, yielding 34 unique bugs",
"doi": "10.1145/3779212.3790220",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790220",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790220",
"has_full_text": true
},
{
"title": "CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting",
"authors": [
"Hexu Zhao",
"Xiwen Min",
"Xiaoteng Liu",
"Moonjun Gong",
"Yiming Li",
"Ang Li",
"Saining Xie",
"Jinyang Li",
"Aurojit Panda"
],
"abstract": "",
"doi": "10.1145/3779212.3790140",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790140",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790140",
"has_full_text": true
},
{
"title": "RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators",
"authors": [
"Xinsheng Tang",
"Yangcheng Li",
"Nan Wang",
"Zhiyi Shu",
"Xingyu Ling",
"Junna Xing",
"Peng Zhou",
"Qiang Liu"
],
"abstract": "Operator fusion, as a key performance optimization technique in the deployment of AI models, significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations involving multiple loops with inter-loop data dependencies, such as the safe softmax followed by GEMM within attention mechanisms, existing compilers lack effective automated fusion and kernel generation capabilities. Although some works have addressed specific instances through hand-crafted fusion strategies, their solutions are limited in generality and difficult to extend to other similar structures. Given the prevalence of such computational patterns in deep learning models, there remains significant untapped potential in achieving general and automated fusion optimization. In this paper, we present a formal theoretical methodology for analyzing cascaded reductions which can fuse them into a single loop and introduce an incremental computation form. Based on this methodology, we design Red uction Fuser (RedFuser), a framework that automatically identifies supported cascaded reduction patterns and generates optimized fused kernels. Experiments show that RedFuser successfully fuses diverse workloads, achieving up to 2\u00d7 to 5\u00d7 speedup over state-of-the-art AI compilers and matching the performance of highly optimized hand-written kernels.",
"doi": "10.1145/3779212.3790209",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790209",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790209",
"has_full_text": true
},
{
"title": "CXLMC: Model Checking CXL Shared Memory Programs",
"authors": [
"Simon Guo",
"Conan Truong",
"Brian Demsky"
],
"abstract": "Compute Express Link (CXL) shared memory is an emerging industry standard that will allow for cache coherent sharing of remote memory between many machines. Memory devices will contain large amounts of DRAM that can be shared by many machines in a CXL cluster. This will enable software running on clusters of computers to use shared memory to communicate more efficiently and to share important data between these machines. As CXL clusters grow larger, machine failures will become a significant risk. Software will need to tolerate machine failures. A key challenge is that CXL uses caching of remote memory to hide latency. If a machine fails before it has flushed dirty cache lines back to the CXL shared memory device, the latest stores to those cache lines will be lost. Data structures have been developed that combine crash-consistent designs with flush and fence instructions to ensure that the data structures remain consistent even in the presence of failures. However, developing such crash-consistent data structures is error prone. It is easy to make a design or implementation error. Such crash consistency errors are hard to detect with testing. We propose CXLMC, a model checker that systematically explores crashing executions for the x86-CXL shared memory platform. We have evaluated CXLMC and found 24 bugs in 8 applications including 7 new bugs.",
"doi": "10.1145/3779212.3790150",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790150",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790150",
"has_full_text": true
},
{
"title": "CounterPoint: Using Hardware Event Counters to Refute and Refine Microarchitectural Assumptions",
"authors": [
"Nick Lindsay",
"Caroline Trippel",
"Anurag Khandelwal",
"Abhishek Bhattacharjee"
],
"abstract": "",
"doi": "10.1145/3779212.3790145",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790145",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790145",
"has_full_text": true
},
{
"title": "Falcon: Algorithm-Hardware Co-Design for Efficient Fully Homomorphic Encryption Accelerator",
"authors": [
"Liang Kong",
"Xianglong Deng",
"Guang Fan",
"Shengyu Fan",
"Lei Chen",
"Yilan Zhu",
"Geng Yang",
"Yisong Chang",
"Shoumeng Yan",
"Mingzhe Zhang"
],
"abstract": "Fully homomorphic encryption (FHE) enables computation on encrypted data without compromising privacy, positioning it as a promising solution for secure cloud computing. However, its substantial computational overhead impedes practical deployment, prompting the development of dedicated hardware accelerators. In practice, when deploying cryptographic algorithm optimizations on FHE accelerators, hardware constraints typically such as limited memory capacity, often lead to a disparity between theoretical algorithmic advantage and achievable hardware efficiency. We present Falcon, an algorithm-hardware co-design for efficient FHE acceleration. We first conduct computational analysis of the prevalent minimum key-switching method, and propose hardware-oriented algorithmic optimizations that substantially reduce computational overhead. We then introduce further refinements to decrease the inherent inter-cluster communication overhead in multi-cluster architectures, while preserving computational advantages. By characterizing the arithmetic patterns intrinsic to primary primitives, we devise hardware-specific arithmetic fusion and functional reuse across computational units. Finally, we present a memory-adaptive strategy to effectively deploy Falcon to prior accelerators under constrained on-chip memory budgets. Experimental results demonstrate that applying our algorithm\u2013hardware co-design delivers up to a 1.48\u00d7 speedup with a negligible area overhead (only a 0.8% increase).",
"doi": "10.1145/3779212.3790160",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790160",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790160",
"has_full_text": true
},
{
"title": "Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter",
"authors": [
"Qinghao Hu",
"Shang Yang",
"Junxian Guo",
"Xiaozhe Yao",
"Yujun Lin",
"Yuxian Gu",
"Han Cai",
"Chuang Gan",
"Ana Klimovic",
"Song Han"
],
"abstract": "",
"doi": "10.1145/3779212.3790231",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790231",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790231",
"has_full_text": true
},
{
"title": "Co-Exploration of RISC-V Processor Microarchitectures and FreeRTOS Extensions for Lower Context-Switch Latency",
"authors": [
"Markus Scheck",
"Tammo M\u00fcrmann",
"Andreas Koch"
],
"abstract": "Embedded real-time systems must respond to external events within tightly bounded timeframes to ensure safety, correctness, and reliability. While Real-Time Operating Systems (RTOSes) ease the development of complex applications by providing abstractions for multi-tasked execution, they introduce overheads in the form of task switching latency and jitter, which impact timing predictability. Minimizing these effects is essential for reducing response times and enabling robust worst-case timing analysis. We present RTOSUnit, a configurable hardware acceleration unit designed to reduce context-switch latency and jitter in embedded real-time systems. By integrating the RTOSUnit into three RISC-V cores of varying complexity, we demonstrate its portability. Through a range of configurations, from lightweight scheduling acceleration to full context-switch and -preloading support, RTOSUnit achieves up to 76% reduction in mean context-switch latency and can be configured to completely eliminate jitter on selected cores. Area overheads in 22nm ASIC implementations range from negligible (within EDA tool heuristics noise) to 44\\,%, with all configurations maintaining viable operating frequencies and power envelopes suitable for embedded systems. RTOSUnit offers a flexible and efficient, open-source. https://github.com/esa-tu-darmstadt/RTOSUnit_Integration foundation for hardware-assisted real-time scheduling, paving the way for broader integration into future embedded SoCs.",
"doi": "10.1145/3779212.3790141",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790141",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790141",
"has_full_text": true
},
{
"title": "Understanding Query Optimization Bugs in Graph Database Systems",
"authors": [
"Yuyu Chen",
"Zhongxing Yu"
],
"abstract": "Recent years have witnessed an ever-growing usage of graph database management systems (GDBMSs) in various data-driven applications. Query optimization aims to improve the performance of database queries by identifying the most efficient way to execute them, and is an important stage of GDBMS workflow. Like other sophisticated systems, such as compilers, the query optimization process is complex and its implementation is prone to bugs. This paper conducts the first characteristic study of query optimization bugs in GDBMSs, including the root causes, manifestation methods, and fix strategies, and delivers 10 novel and important findings about them. Based on the characteristic study, we also developed a testing tool tailored to uncover GDBMS query optimization bugs, and the tool found 20 unique GDBMS bugs, 10 of which are query optimization bugs.",
"doi": "10.1145/3779212.3790244",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790244",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790244",
"has_full_text": true
},
{
"title": "SEVI: Silent Data Corruption of Vector Instructions in Hyper-Scale Datacenters",
"authors": [
"Yixuan Mei",
"Shreya Varshini",
"Harish Dixit",
"Sriram Sankar",
"K. V. Rashmi"
],
"abstract": "Silent Data Corruption (SDC) poses a reliability threat in modern datacenters. These insidious errors evade detections and propagate incorrect results throughout the system. Companies including Google, Meta, and Alibaba have reported SDC incidents affecting their production. In this paper, we present the first comprehensive instruction- and application-level analysis of vector instruction SDCs in hyper-scale datacenters using a two-stage approach. We perform over 78 trillion test rounds in more than 14 billion CPU seconds. Our observations reveal undocumented SDC patterns that provide insights into possible underlying causes and inspire new mitigation strategies. Based on these findings, we propose a low-overhead SDC detection mechanism leveraging in-application algorithm-based fault tolerance. Our method achieves 88% to 100% SDC machine detection rate with a time overhead of only 1.35% even for modestly sized inputs.",
"doi": "10.1145/3779212.3790217",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790217",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790217",
"has_full_text": true
},
{
"title": "T\n <scp>EE</scp>\n M\u00b3: Core-Independent and Cooperating Trusted Execution Environments",
"authors": [
"Nils Asmussen",
"Sebastian Haas",
"Carsten Weinhold",
"Nicholas Gordon",
"Stephan Gerhold",
"Friedrich Pauls",
"Nilanjana Das",
"Michael Roitzsch"
],
"abstract": "Trusted Execution Environments (TEEs) enable secure code execution on machines that are not fully trusted by the user who runs the workload. However, existing TEE solutions mostly target CPUs and are typically tied to one specific instruction set architecture. Although some accelerators also provide support for TEEs, this leads to multiple, different TEE implementations on the same system, increasing its complexity and trusted computing base (TCB). This challenge becomes particularly apparent when workloads span heterogeneous processing units, because the diversity of TEE implementations complicates the creation of secure communication channels between the individual TEEs. In this paper, we present TeeM\u00b3, a trusted execution architecture with discrete, out-of-core enforcement, which is core-independent and better-suited to heterogeneous systems than existing approaches. We build upon M\u00b3 and extend both the hardware platform and operating system (OS). On the OS side, we add a root of trust (RoT) for remote attestation and a unikernel-like library for TEEs. On the hardware side, we add two lightweight isolation mechanisms to protect both TEEs and the RoT from other components in the system. Furthermore, TeeM\u00b3 inherits uniform communication channels from M\u00b3 and protects them to support communicating groups of TEEs. We show in the evaluation that our approach has low performance overhead, modest additional hardware costs, and reduces the hardware TCB by a factor of 1.8 and the software TCB by a factor of 3.42.",
"doi": "10.1145/3779212.3790232",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790232",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790232",
"has_full_text": true
},
{
"title": "Performance Predictability in Heterogeneous Memory",
"authors": [
"Jinshu Liu",
"Hanchen Xu",
"Daniel S. Berger",
"Marcos K. Aguilera",
"Huaicheng Li"
],
"abstract": "Heterogeneous memory combining DRAM and CXL exhibits variable performance, yet existing metrics correlate weakly with actual slowdown. We present CAMP, a principled framework for predicting CXL-induced slowdown. Our key insight is that a DRAM run (plus a CXL run for bandwidth-bound workloads) exposes the causal microarchitectural pressure points where CXL latency translates into additional processor stall cycles. CAMP captures these signals using 12 performance counters to analytically decompose slowdown into three orthogonal components: demand reads, cache/prefetching, and stores. CAMP also introduces a closed-form model for software-based weighted interleaving that predicts performance across DRAM--CXL ratios. Across 265 workloads on NUMA and three CXL devices, CAMP achieves 91--97% prediction accuracy within 10% absolute error. We demonstrate that these models enable practical system policies, including ''Best-shot'' interleaving and colocated workload placement, improving performance by up to 21% and 23% over existing tiering and colocation approaches.",
"doi": "10.1145/3779212.3790201",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790201",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790201",
"has_full_text": true
},
{
"title": "Reconfigurable Torus Fabrics for Multi-tenant ML",
"authors": [
"Abhishek Vijaya Kumar",
"Eric Ding",
"Arjun Devraj",
"Darius Bunandar",
"Rachee Singh"
],
"abstract": "We develop Morphlux, a server-scale programmable photonic fabric to interconnect accelerators within servers. We show that augmenting state-of-the-art torus-based ML datacenters with Morphlux can improve the bandwidth of tenant compute allocations by up to 66%, reduce compute fragmentation by up to 70%, and minimize the blast radius of accelerator failures. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits which translate to 1.72x improvement in finetuning throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can replace a failed accelerator with a healthy one in 1.2 seconds.",
"doi": "10.1145/3779212.3790238",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790238",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790238",
"has_full_text": true
},
{
"title": "Toasty: Speeding Up Network I/O with Cache-Warm Buffers",
"authors": [
"Preeti",
"Nitish Bhat",
"Ashwin Kumar",
"Mythili Vutukuru"
],
"abstract": "Modern NICs DMA packets directly to the LLC using technologies like DDIO, reducing access latencies for networking applications. Prior work has observed several performance issues with DDIO when the working set of packet buffers does not fit into LLC. For example, the leaky DMA problem arises when incoming packets evict older packets that have not yet been processed by the application from LLC, causing them to be fetched again from main memory. While using a smaller pool of packet buffers that fits in cache is an obvious solution, this may result in the NIC running out of buffers to DMA packets into when a burst of packets arrives. This paper proposes Toasty, a system that mitigates this tradeoff between high throughput and resilience to packet loss that arises when sizing the network packet buffer pool. While prior work has proposed hardware-based solutions to this problem, Toasty is a software-only solution that can be deployed on commodity NIC hardware. Toasty manages the packet buffer pool as a LIFO stack instead of a FIFO queue, and adapts the number of buffers populated into the NIC hardware RX ring based on incoming packet load and application processing rate. Together, these changes enable Toasty to recirculate a small working set of cache-warm buffers in steady state, while falling back to a larger pool of buffers during traffic bursts. We implement Toasty over the AF_XDP kernel bypass framework, and our evaluation shows that Toasty improves network throughput for a variety of network functions by up to 78% over the default buffer pool implementation of AF_XDP. We also show that Toasty matches the performance of a small buffer pool that fits in cache, while being more resilient to traffic bursts.",
"doi": "10.1145/3779212.3790235",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790235",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790235",
"has_full_text": true
},
{
"title": "Z\n <scp>ip</scp>\n S\n <scp>erv</scp>\n : Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression",
"authors": [
"Ruibo Fan",
"Xiangrui Yu",
"Xinglin Pan",
"Zeyu Li",
"Weile Luo",
"Qiang Wang",
"Wei Wang",
"Xiaowen Chu"
],
"abstract": "Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This '' load-compressed, compute-decompressed '' design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21\u00d7 kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22\u00d7 over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.",
"doi": "10.1145/3779212.3790250",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790250",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790250",
"has_full_text": true
},
{
"title": "Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators",
"authors": [
"Maolin Sun",
"Yibiao Yang",
"Yuming Zhou"
],
"abstract": "Satisfiability Modulo Theory (SMT) solvers are foundational to modern systems and programming languages research, providing the foundation for tasks like symbolic execution and automated verification. Because these solvers sit on the critical path, their correctness is essential, and high-quality test formulas are key to uncovering bugs. However, while prior testing techniques performed well on earlier solver versions, they struggle to keep pace with rapidly evolving features. Recent approaches based on Large Language Models (LLMs) show promise in exploring advanced solver capabilities, but two obstacles remain: nearly half of the generated formulas are syntactically invalid, and iterative interactions with LLMs introduce substantial computational overhead. In this study, we present Once4All, a novel LLM-assisted fuzzing framework that addresses both issues by shifting from direct formula generation to the synthesis of generators for reusable terms (i.e., logical expressions). Specifically, Once4All uses LLMs to (1) automatically extract context-free grammars (CFGs) for SMT theories, including solverspecific extensions, from documentation, and (2) synthesize composable Boolean term generators that adhere to these grammars. During fuzzing, Once4All populates structural skeletons derived from existing formulas with the terms iteratively produced by the LLM-synthesized generators. This design ensures syntactic validity while promoting semantic diversity. Notably, Once4All requires only one-time LLM interaction investment, dramatically reducing runtime cost. We evaluated Once4All on two leading SMT solvers: Z3 and cvc5. Our experiments show that Once4All has identified 43 confirmed bugs, 40 of which have already been fixed by developers.",
"doi": "10.1145/3779212.3790195",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790195",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790195",
"has_full_text": true
},
{
"title": "TreeVQA: A Tree-Structured Execution Framework for Shot Reduction in Variational Quantum Algorithms",
"authors": [
"Yuewen Hou",
"Dhanvi Bharadwaj",
"Gokul Subramanian Ravi"
],
"abstract": "",
"doi": "10.1145/3779212.3790239",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790239",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790239",
"has_full_text": true
},
{
"title": "Nebula: Infinite-Scale 3D Gaussian Splatting in VR via Collaborative Rendering and Accelerated Stereo Rasterization",
"authors": [
"He Zhu",
"Zheng Liu",
"Xingyang Li",
"Anbang Wu",
"Jieru Zhao",
"Fangxin Liu",
"Yiming Gan",
"Jingwen Leng",
"Yu Feng"
],
"abstract": "3D Gaussian splatting (3DGS) has drawn significant attention in the architectural community recently. However, current architectural designs often overlook 3DGS scalability, making them fragile for extremely large-scale 3DGS. Meanwhile, the VR bandwidth requirement makes it impossible to deliver high-fidelity and smooth VR content from the cloud. We present Nebula, a coherent acceleration framework for large-scale 3DGS collaborative rendering. Instead of streaming videos, Nebula streams intermediate results after the LoD search, reducing data communication between the cloud and the client by 19\u201325%. To further enhance the motion-to-photon experience, we introduce a temporal-aware LoD search in the cloud that tames the irregular memory access and reduces redundant data access by exploiting temporal coherence across frames. On the client side, we propose a novel stereo rasterization that enables two eyes to share most computations during the stereo rendering with bit-accurate quality. With minimal hardware augmentations, Nebula achieves 2.7\u00d7 motion-to-photon speedup and reduces bandwidth by 19\u201325% over lossy video streaming.",
"doi": "10.1145/3779212.3790190",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790190",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790190",
"has_full_text": true
},
{
"title": "Chips Need DIP: Time-Proportional Per-Instruction Cycle Stacks at Dispatch",
"authors": [
"Silvio Campelo de Santana",
"Joseph Rogers",
"Lieven Eeckhout",
"Magnus Jahre"
],
"abstract": "Despite continuous accelerator improvements, high single-thread performance remains crucial due to Amdahl's Law. With technology scaling slowing, developers must produce code that is as performant as possible, which often involves instruction-level performance analysis. Such analysis can be visualized as Per-Instruction Cycle Stacks (PICS), which, when created at the commit stage of the processor, report each static instruction's contribution to overall execution time and capture the performance events it was subject to. Sadly, PICS at commit (PICSC) often cannot explain performance because they solely capture how effectively instructions egress from the processor core's out-of-order execution window. In contrast, developers must typically also gain insight into how efficiently instructions ingress into the execution window to fully understand application performance. PICSC must hence be complemented by PICS at dispatch (PICSD) because execution window ingress issues result in specific instructions exhibiting high dispatch latencies. We therefore propose dispatch-time profiling (DIP), which combines time-proportional at-dispatch attribution policies with statistical sampling to accurately report each static instruction's contribution to dispatch time as well as list the reason(s) for delayed ingress. DIP is simple to implement and incurs low overhead, i.e., storage and execution time overheads of 49 bytes and ~1%, respectively, while delivering high accuracy (average profile error of 5.2%). This is a significant improvement over the 26.9% error of state-of-the-art dispatch-tagging as implemented in AMD IBS, Arm SPE, and IBM RIS. We demonstrate that needing PICSD and PICSC is the common case by showing that 18 out of our 22 SPEC2017 benchmarks simultaneously exhibit both ingress and egress issues. Additionally, we leverage DIP's PICSD to optimize the fotonik3d benchmark, improving its performance by 8.5%.",
"doi": "10.1145/3779212.3790139",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790139",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790139",
"has_full_text": true
},
{
"title": "Graphiti: Formally Verified Out-of-Order Execution in Dataflow Circuits",
"authors": [
"Yann Herklotz",
"Ayatallah Elakhras",
"Martina Camaioni",
"Paolo Ienne",
"Lana Josipovi\u0107",
"Thomas Bourgeat"
],
"abstract": "High-level synthesis (HLS) tools automatically synthesise hardware from imperative programs and have seen a significant rise in adoption in both industry and academia. To deliver high-quality hardware designs for increasingly general-purpose programs, HLS compilers have to become more aggressive. For the most irregular programs, HLS tools generating dataflow circuits show promising performance by adapting and specializing key ideas from processor architectures, like out-of-order execution and speculation. However, the complexity of these transformations makes them difficult to reason about, increasing the risk of subtle bugs and potentially delaying their adoption in a conservative industry where bugs can be extremely costly. This paper introduces Graphiti, a framework embedded in the Lean 4 proof assistant designed to formally reason about and manipulate dataflow circuits at the core of these HLS tools. We develop a metatheory of graph refinement that allows us to verify a general-purpose dataflow circuit rewriting algorithm. Using this framework, we formally verify a loop rewrite that introduces out-of-order execution into a dataflow circuit. Our evaluation shows that the resulting verified optimization pipeline achieves a 2.1\u00d7 speedup over the in-order HLS flow and a 5.8\u00d7 speedup over a verified HLS tool generating a static state machine. We also show that it achieves the same performance as an existing unverified approach that introduces out-of-order execution. Graphiti is a step toward a fully-verified HLS flow targeting dataflow circuits. In the interim, it can serve as an extensible, verified, optimizing engine that can be integrated into existing dataflow HLS compilers.",
"doi": "10.1145/3779212.3790166",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790166",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790166",
"has_full_text": true
},
{
"title": "Borrowing Dirty Qubits in Quantum Programs",
"authors": [
"Bonan Su",
"Li Zhou",
"Yuan Feng",
"Mingsheng Ying"
],
"abstract": "",
"doi": "10.1145/3779212.3790134",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790134",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790134",
"has_full_text": true
},
{
"title": "Anvil: A General-Purpose Timing-Safe Hardware Description Language",
"authors": [
"Jason Zhijingcheng Yu",
"Aditya Ranjan Jha",
"Umang Mathur",
"Trevor E. Carlson",
"Prateek Saxena"
],
"abstract": "",
"doi": "10.1145/3779212.3790125",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790125",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790125",
"has_full_text": true
},
{
"title": "Parameterized Hardware Design with Latency-Abstract Interfaces",
"authors": [
"Rachit Nigam",
"Ethan Gabizon",
"Edmund Lam",
"Carolyn Zech",
"Jonathan Balkind",
"Adrian Sampson"
],
"abstract": "Hardware designs must use latency-insensitive (LI) interfaces when timing is input-dependent. When timing is input-independent, designs should use latency-sensitive (LS) interfaces for maximum performance. However, designs commonly use LI interfaces to integrate with externally generated LS modules--from, e.g., IP generators, high-level synthesis, or domain-specific languages. In every fully integrated design, such uses of LI represent pure overhead. The challenge is that generators can dramatically change timing interfaces of the modules to meet performance objectives, and LI interfaces act as a useful design abstraction and enable timing adaptation. We define latency-abstract (LA) interfaces, a new design abstraction, which provide the timing adaptability of LI interfaces at design-time and the efficient integration of LS interfaces. LA interfaces use output parameters, a novel compile-time mechanism for child modules to return values to parent modules, to abstract and encapsulate timing behaviors at design time. During design elaboration, LA interfaces are compiled into efficient LS interfaces based on parameter values. While an attractive option, LA interfaces inherit the complexities of parameterized hardware design: the user must reason how parameters influence timing behaviors of modules and ensure that designs adapt to interface changes. To address this challenge and demonstrate the utility of LA interfaces, we design Lilac, a parameterized HDL that uses a type system to track the influence of parameters on timing behaviors and formally guarantee that every parameterization of an LA design results in a circuit without structural hazards. We demonstrate Lilac's efficacy by using it to implement parameterized designs and integrate designs generated from external tools. We show that LA designs use 26--33% fewer chip resources and achieve 6.8% better maximum frequencies than comparable LI implementations.",
"doi": "10.1145/3779212.3790199",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790199",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790199",
"has_full_text": true
},
{
"title": "MSCCL++: Rethinking GPU Communication Abstractions for AI Inference",
"authors": [
"Changho Hwang",
"Peng Cheng",
"Roshan Dathathri",
"Abhinav Jangda",
"Saeed Maleki",
"Madan Musuvathi",
"Olli Saarikivi",
"Aashaka Shah",
"Ziyue Yang",
"Binyang Li",
"Caio Rocha",
"Qinghua Zhou",
"Mahdieh Ghazimirsaeed",
"Sreevatsa Anantharamu",
"Jithin Jose"
],
"abstract": "AI applications increasingly run on fast-evolving, heterogeneous hardware to maximize performance, but general-purpose libraries lag in supporting these features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable. This paper introduces MSCCL++, a design methodology for developing high-performance, portable communication kernels. It provides (1) a low-level, performance-preserving primitive interface that exposes minimal hardware abstractions while hiding the complexities of synchronization and consistency, (2) a higher-level DSL for application developers to implement workload-specific communication algorithms, and (3) a library of efficient algorithms implementing the standard collective API, enabling adoption by users with minimal expertise. Compared to state-of-the-art baselines, MSCCL++ achieves geomean speedups of 1.7\u00d7 (up to 5.4\u00d7) for collective communication and 1.2\u00d7 (up to 1.38\u00d7) for AI inference workloads. MSCCL++ is in production use in multiple AI services provided by Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open source and available at https://github.com/microsoft/mscclpp. Our two years of experience with MSCCL++ suggests that its abstractions are robust, enabling support for new hardware features, such as multimem, within weeks of development.",
"doi": "10.1145/3779212.3790188",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790188",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790188",
"has_full_text": true
},
{
"title": "gShare: Efficient GPU Sharing with Aggressive Scheduling in Multi-tenant FaaS platform",
"authors": [
"Yanan Yang",
"Zhengxiong Jiang",
"Meiqi Zhu",
"Hongqiang Xu",
"Yujun Wang",
"Liang Li",
"Jiansong Zhang",
"Jie Wu"
],
"abstract": "Serving ML models with serverless computing has become increasingly popular in recent years. Many of today's cloud vendors have provided GPU functions to meet the performance requirements of different ML scenarios. However, existing production FaaS platforms suffer from GPU under-utilization and high cloud costs due to poor GPU resource management. In this paper, we propose gShare, an on-demand and efficient GPU function management policy in FaaS platforms. gShare provides a fine-grained GPU virtualization solution for a VM-based multi-tenant FaaS environment. It further decouples the GPU resource from the existing CPU-oriented function management paradigm, enabling flexible GPU sharing across tenants. With a user-transparent vGPU remapping design and aggressive request scheduling policy, gShare can significantly improve the cost-efficiency of GPU functions without causing appreciable function performance degradation. Experimental results show that gShare can reduce GPU usage by 43%\u201363% compared to the baseline while meeting more than 95% of user latency targets. Compared with the state-of-the-art method, it can also reduce cloud costs by 24%\u201358% while maintaining better function performance, benefiting both the cloud provider and users.",
"doi": "10.1145/3779212.3790168",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790168",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790168",
"has_full_text": true
},
{
"title": "CREST: High-Performance Contention Resolution for Disaggregated Transactions",
"authors": [
"Qihan Kang",
"Mi Zhang",
"Patrick P. C. Lee",
"Yongkang Hu"
],
"abstract": "Distributed transaction systems can leverage memory disaggregation for efficient resource scaling, yet they experience significant performance degradation under high-contention workloads. We present CREST, a disaggregated transaction system that efficiently manages high-contention transaction workloads in disaggregated memory architectures via three key techniques: (i) cell-level concurrency control, which achieves more fine-grained transaction concurrency than existing record-level approaches and reduces remote access latencies using a metadata-aggregated record structure; (ii) localized execution, which allows compute nodes to operate on local uncommitted results to reduce blocking time; and (iii) parallel commits, which parallelize commit operations under transaction dependencies. Evaluation shows that CREST achieves a throughput gain of up to 1.92\u00d7 over state-of-the-art systems under high-contention workloads.",
"doi": "10.1145/3779212.3790148",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790148",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790148",
"has_full_text": true
},
{
"title": "T-Control: An Efficient Dynamic Tensor Rematerialization System for DNN Training",
"authors": [
"Zehua Wang",
"Junmin Xiao",
"Xiaochuan Deng",
"Huibing Wang",
"Hui Ma",
"Mingyi Li",
"Yunfei Pang",
"Guangming Tan"
],
"abstract": "With the continuous growth of model and batch sizes, DNN model training increasingly suffers from excessive memory consumption. Tensor rematerialization has emerged as an effective technique to enable training under limited memory constraints. However, dynamic rematerialization methods often underperform static approaches, primarily due to their greedy tensor eviction schedules and runtime overhead. In this work, we develop T-Control, a dynamic tensor rematerialization system for DNN model training. T-Control integrates the topology of the traced tensor dependency graph with real-time memory usage to make informed, adaptive tensor retention decisions, preserving critical tensors and reducing eviction-induced recomputation. Furthermore, by extending PyTorch's native memory manager with our fine-grained memory management strategy, T-Control effectively improves memory utilization, thereby reducing unnecessary tensor eviction. Experimental results demonstrate that T-Control boosts throughput by up to 1.58\u00d7 and 1.91\u00d7 over state-of-the-art static and dynamic tensor rematerialization systems, respectively.",
"doi": "10.1145/3779212.3790230",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790230",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790230",
"has_full_text": true
},
{
"title": "An MLIR Lowering Pipeline for Stencils at Wafer-Scale",
"authors": [
"Nicolai Stawinoga",
"David Katz",
"Anton Lydike",
"Justs Zarins",
"Nick Brown",
"George Bisbas",
"Tobias Grosser"
],
"abstract": "The Cerebras Wafer-Scale Engine (WSE) delivers performance at an unprecedented scale of over 900,000 compute units, all connected via a single-wafer on-chip interconnect. Initially designed for AI, the WSE architecture is also well-suited for High Performance Computing (HPC). However, its distributed asynchronous programming model diverges significantly from the simple sequential or bulk-synchronous programs that one would typically derive for a given mathematical program description. Targeting the WSE requires a bespoke re-implementation when porting existing code. The absence of WSE support in compiler frameworks such as MLIR meant that there was little hope for automating this process. Stencils are ubiquitous in HPC, and in this paper we explore the hypothesis that domain specific information about stencils can be leveraged by the compiler to automatically target the WSE without requiring application-level code changes. We present a compiler pipeline that transforms stencil-based kernels into highly optimized CSL code for the WSE, bridging the semantic gap between the mathematical representation of the problem and the WSE's asynchronous execution model. Based upon five benchmarks across three HPC programming technologies, running on both the Cerebras WSE2 and WSE3, our approach delivers comparable, if not slightly better, performance than manually optimized code. Furthermore, without requiring any application level code changes, performance on the WSE3 is around 14 times faster than 128 Nvidia A100 GPUs and 20 times faster than 128 nodes of a CPU-based Cray-EX supercomputer when using our approach.",
"doi": "10.1145/3779212.3790124",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790124",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790124",
"has_full_text": true
},
{
"title": "LOOPRAG: Enhancing Loop Transformation Optimization with Retrieval-Augmented Large Language Models",
"authors": [
"Yijie Zhi",
"Yayu Cao",
"Jianhua Dai",
"Xiaoyang Han",
"Jingwen Pu",
"Qinran Wu",
"Sheng Cheng",
"Ming Cai"
],
"abstract": "",
"doi": "10.1145/3779212.3790183",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790183",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790183",
"has_full_text": true
},
{
"title": "RTeAAL Sim: Using Tensor Algebra to Represent and Accelerate RTL Simulation",
"authors": [
"Yan Zhu",
"Boru Chen",
"Christopher W. Fletcher",
"Nandeeka Nayak"
],
"abstract": "RTL simulation on CPUs remains a persistent bottleneck in hardware design. State-of-the-art simulators embed the circuit directly into the simulation binary, resulting in long compilation times and execution that is fundamentally CPU frontend-bound, with severe instruction-cache pressure. This work proposes RTeAAL Sim, which reformulates RTL simulation as a sparse tensor algebra problem. By representing RTL circuits as tensors and simulation as a sparse tensor algebra kernel, RTeAAL Sim decouples simulation behavior from binary size and makes RTL simulation amenable to well-studied tensor algebra optimizations. We demonstrate that a prototype of our tensor-based simulator, even with a subset of these optimizations, already mitigates the compilation overhead and frontend pressure and achieves performance competitive with the highly optimized Verilator simulator across multiple CPUs and ISAs.",
"doi": "10.1145/3779212.3790214",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790214",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790214",
"has_full_text": true
},
{
"title": "Trinity: Three-Dimensional Tensor Program Optimization via Tile-level Equality Saturation",
"authors": [
"Jaehyeong Park",
"Youngchan Kim",
"Haechan An",
"Gieun Jeong",
"Jeehoon Kang",
"Dongsu Han"
],
"abstract": "Modern tensor program optimizers operate at two separate levels: graph-level optimizations (operator fusion, algebraic rewrites) and operator-level scheduling (tiling, parallelization). This separation prevents them from discovering cross-operator, tile-level optimizations that make hand-tuned kernels like FlashAttention effective. We present Trinity, the first tensor program optimizer that achieves scalable joint optimization through tile-level equality saturation. Our key insight is that optimal performance requires simultaneously optimizing three interdependent dimensions -- algebraic equivalence, memory I/O, and compute orchestration. To enable this, Trinity introduces a novel fine-grained IR that exposes all three axes as first-class, rewritable entities and applies equality saturation to perform scalable joint optimization. As a result, Trinity automatically discovers complex optimizations that require coordinated reasoning across all three dimensions. Across diverse Transformer variants, Trinity achieves up to 2.09\u00d7 speedup over TensorRT and 2.35\u00d7 over TorchInductor, both state-of-the-art production compilers.",
"doi": "10.1145/3779212.3790240",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790240",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790240",
"has_full_text": true
},
{
"title": "oFFN: Outlier and Neuron-aware Structured FFN for Fast yet Accurate LLM Inference",
"authors": [
"Geunsoo Song",
"Hoeseok Yang",
"Youngmin Yi"
],
"abstract": "With the advent of large-scale language models (LLMs), various optimization techniques have been proposed to enable efficient inference. Among these, methods that aggressively exploit output activation sparsity have attracted significant attention: they leverage ReLU-fied LLMs and skip both the memory accesses and the computation for an output element if it is predicted to be sparse. Achieving fast and accurate prediction of output activation sparsity is crucial to enhancing inference efficiency. However, in practice, phenomena such as activation outliers and hot and cold neurons, which significantly affect the exploitation of sparsity during LLM inference, have either been addressed individually or not structurally integrated in existing work. In this paper, we reveal that these two phenomena are closely related and propose a novel FFN architecture called oFFN (Outlier- and Neuron-aware Structured FFN) that effectively exploits them simultaneously. The proposed method rearranges the FFN weights in both row and column dimensions to enable efficient and accurate prediction of output sparsity in the presence of outliers, and to enable separation of hot and cold neurons, computing each with its optimal operations. The proposed method allows for the optimal computation path for each neuron, even when the batch size dynamically changes. Compared to existing sparsity prediction techniques, our method achieves the fastest speed with negligible accuracy loss. Experimental results show that it delivers up to 2.01\u00d7 faster end-to-end inference speed compared to dense inference, and up to 5.46\u00d7 acceleration in FFN layers under autoregressive decoding in ReLU-fied LLMs.",
"doi": "10.1145/3779212.3790194",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790194",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790194",
"has_full_text": true
},
{
"title": "I/O Analysis is All You Need: An I/O Analysis for Long-Sequence Attention",
"authors": [
"Xiaoyang Lu",
"Boyu Long",
"Xiaoming Chen",
"Yinhe Han",
"Xian-He Sun"
],
"abstract": "As GPUs and other accelerators become increasingly popular, optimizing I/O operations between on-chip and off-chip memory is increasingly critical. I/O analysis, however, is complex, requiring a deep understanding of application dataflow and memory hierarchy. Developing a practical I/O analysis methodology remains a timely challenge. Self-attention is employed extensively in transformer models, but its quadratic memory complexity poses significant challenges to modern memory systems. In this study, we explore how to use I/O analysis to develop optimal solutions for accelerating exact long-sequence self-attention. We first introduce a novel I/O analysis for tall-and-skinny matrix-matrix multiplication, which captures the dominant data movement behavior of long-sequence self-attention. Guided by systematic I/O analysis, we develop AttenIO, an I/O-driven accelerator for exact long-sequence self-attention with three key optimizations: (1) an analytically derived I/O-optimal tiling and scheduling to minimize I/O operations, (2) fine-grained three-level communication-computation overlapping to hide I/O stalls, and (3) parallel execution patterns for efficient softmax. Our evaluation shows that AttenIO achieves a 1.6\u00d7-8.8\u00d7 speedup over the state-of-the-art solutions. Although AttenIO is designed for self-attention, it also highlights the broader potential of I/O analysis as a principled foundation for guiding high-performance I/O optimizations.",
"doi": "10.1145/3779212.3790174",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790174",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790174",
"has_full_text": true
},
{
"title": "STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems",
"authors": [
"Zehao Fan",
"Yunzhen Liu",
"Garrett Gagnon",
"Zhenyu Liu",
"Yayue Hou",
"Hadjer Benmeziane",
"Kaoutar El Maghraoui",
"Liu Liu"
],
"abstract": "Serving large language models (LLMs) places significant pressure on memory systems due to frequent accesses and growing key\u2013value (KV) caches as context lengths increase. Processing-in-memory (PIM) architectures offer high internal bandwidth and near-data compute parallelism, but current designs target dense attention and perform poorly under the irregular access patterns of dynamic KV cache sparsity. To mitigate this limitation, we propose STARC, a sparsity-optimized data mapping scheme for efficient LLM decoding on PIM. STARC clusters semantically similar KV pairs and co-locates them contiguously within PIM banks, enabling retrieval at cluster granularity by matching queries against precomputed centroids. This bridges the gap between fine-grained sparse attention and row-level PIM operations, improving utilization while minimizing overhead. On a simulated HBM-PIM system, under constrained KV budgets, STARC achieves up to 78% and 65% reductions in attention-layer latency and energy over token-wise sparsity methods, and up to 93% and 92% reductions relative to full attention, while preserving model accuracy.",
"doi": "10.1145/3779212.3790226",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790226",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790226",
"has_full_text": true
},
{
"title": "SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs",
"authors": [
"Jiaming Xu",
"Jiayi Pan",
"Hanzhen Wang",
"Yongkang Zhou",
"Jiancai Ye",
"Yu Wang",
"Guohao Dai"
],
"abstract": "",
"doi": "10.1145/3779212.3790224",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790224",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790224",
"has_full_text": true
},
{
"title": "DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline",
"authors": [
"Zhenliang Xue",
"Hanpeng Hu",
"Xing Chen",
"Yimin Jiang",
"Yixin Song",
"Zeyu Mi",
"Yibo Zhu",
"Daxin Jiang",
"Yubin Xia",
"Haibo Chen"
],
"abstract": "Large multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks with various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multimodal data. In this paper, we present DIP, a dynamic and modality-aware pipeline scheduling framework designed for LMM training. DIP tackles the challenge of dynamic imbalance via two key techniques: (1) separating computations of different modalities into dedicated pipeline segments to balance workloads within a continuous set of stages; (2) dynamically splitting input data into finer-grained, modality-specific sub-microbatches to balance workloads across these segments. By asynchronously generating pipeline schedules on idle CPU resources during training, DIP dynamically tailors stage executions to each input batch without stalling the training process. We validate DIP on a diverse set of five LMMs, ranging from 12B to 94B parameters and including vision-language and diffusion models. Experimental results show that our system achieves up to 97.3% higher throughput compared to state-of-the-art systems, demonstrating strong adaptability to fluctuating multimodal training workloads.",
"doi": "10.1145/3779212.3790154",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790154",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790154",
"has_full_text": true
},
{
"title": "GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading",
"authors": [
"Donghyun Lee",
"Dawoon Jeong",
"Jae W. Lee",
"Hongil Yoon"
],
"abstract": "",
"doi": "10.1145/3779212.3790167",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790167",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790167",
"has_full_text": true
},
{
"title": "Asynchrony and GPUs: Bridging this Dichotomy for I/O with AGIO",
"authors": [
"Jihoon Han",
"Anand Sivasubramaniam",
"Chia-Hao Chang",
"Vikram Sharma Mailthody",
"Zaid Qureshi",
"Wen-Mei Hwu"
],
"abstract": "GPUs rely on a largely synchronous programming and execution model. With increasing need to access data residing on SSDs, GPU threads can incur significant latencies for such accesses when using blocking/synchronous I/O mechanisms. There is little hardware/systems support today to perform non-blocking/asynchronous operations from GPU threads directly to tolerate microsecond-level latencies incurred in SSD accesses. To fix this dichotomy, this paper presents the design, implementation and evaluation of AGIO, which provides APIs and a runtime environment for GPU threads to directly perform asynchronous I/O operations (fully GPU-orchestrated without CPU involvement). AGIO decouples, both in time and space, the I/O initiation from its completion, to allow useful computation in-between, in order to hide much of the I/O latency. This is particularly useful in applications with access patterns known at compile time, where similar to prefetching, AGIO I/Os can be introduced ahead of need, to yield 65% better performance than its synchronous counterpart. A non-intuitive benefit of AGIO, particularly in applications with data dependent accesses, is the ability to allow threads to proceed beyond I/O initiation, towards initiating more I/O, even if there is little compute to overlap. Such pro-active I/O issuance increases the I/O parallelism to more fully utilize underlying bandwidths, yielding 32% better performance than the synchronous alternative in data dependent executions. Decoupling initiation from completion makes AGIO more adaptive to dataset characteristics, with the programmer not needing a priori knowledge of inputs for effective performance. We also show that AGIO can meet (or better) the performance of its synchronous counterparts using a GPU with fewer than half the compute engines.",
"doi": "10.1145/3779212.3790130",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790130",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790130",
"has_full_text": true
},
{
"title": "REPA: Reconfigurable PIM for the Joint Acceleration of KV Cache Offloading and Processing",
"authors": [
"Yang Hong",
"Junlong Yang",
"Bo Peng",
"Jianguo Yao"
],
"abstract": "The use of KV cache in LLM inference leads to large memory footprint and sub-optimal decoding performance. Prior studies typically address one of these two limitations by either offloading or stage-split inference. In this paper, we explore and reveal the possibility of a joint solution, and propose REPA, a GPU-PIM hybrid system to prototype this idea. We leverage reconfigurable ReRAM PIM to achieve fast KV cache persistence, and balance the requirement of processing speed and memory capacity. To fully unleash the parallelization potential of REPA, we propose optimizations in (1) architecture, (2) data mapping and (3) pipelining: (1) We propose bulk-wise memory instructions and multi-level controllers to enable finer-grained parallelism in the PIM device. (2) We propose locality-aware data mapping to make the best of the aforementioned architectural optimization, and reduce long-range data transfer on chip. (3) We adopt sub-batch pipelining to reduce idleness in batches, and propose transfer overlapping to hide the KV cache transfer behind computation. Experimental results show that REPA exhibits high inference speed, energy efficiency, and ease of integration. It is 1.5--6.5\u00d7 faster and 8--10\u00d7 more energy-efficient than an NVIDIA A100. It also outperforms state-of-the-art DRAM PIM systems by up to 1.4\u00d7 for long context inference. When integrated into existing offloading systems, REPA achieves 1.4--2.0\u00d7 faster offloading and a 1.2--1.4\u00d7 end-to-end speedup, showcasing its high potential for fast KV cache offloading and processing.",
"doi": "10.1145/3779212.3790212",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790212",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790212",
"has_full_text": true
},
{
"title": "Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context",
"authors": [
"Hao Wu",
"Qidong Zhao",
"Songqing Chen",
"Yang Chen",
"Yueming Hao",
"Tony CW Liu",
"Sijia Chen",
"Adnan Aziz",
"Keren Zhou"
],
"abstract": "Memory access errors remain one of the most pervasive bugs in GPU programming. Existing GPU sanitizers such as compute-sanitizer detect memory access errors by instrumenting every memory instruction in low-level IRs or binaries, which imposes high overhead and provides minimal memory access error diagnostic context for fixing problems. We present Triton-Sanitizer, the first device-agnostic memory sanitizer designed for Triton, a domain-specific language for developing portable, efficient GPU kernels for deep learning workloads. Triton-Sanitizer leverages Triton's tile-oriented semantics to construct symbolic expressions for memory addresses and masks, verifies them with an SMT solver, and selectively falls back to eager simulation for indirect accesses. This hybrid analysis enables precise detection of memory access errors without false positives while avoiding the cost of per-access instrumentation. Beyond detection, Triton-Sanitizer generates rich diagnostic reports that attribute violations to the tensors nearest to the violated addresses, track the complete call path, and expose the symbolic operations responsible for incorrect addresses. Evaluated on seven widely used open-source repositories of Triton kernels, Triton-Sanitizer uncovered 24 previously unknown memory access errors, of which 8 have already been fixed and upstreamed by us. Compared to compute-sanitizer, Triton-Sanitizer achieves speedups ranging from 1.07\u00d7 to 14.66\u00d7, with an average improvement of 1.62\u00d7, demonstrating its ability to enhance performance, precision, and usability in memory access error detection.",
"doi": "10.1145/3779212.3790241",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790241",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790241",
"has_full_text": true
},
{
"title": "JOSer: Just-In-Time Object Serialization for Heavy Java Serialization Workloads",
"authors": [
"Chaokun Yang",
"Pengbo Nie",
"Ziyi Lin",
"Weipeng Wang",
"Qianwei Yu",
"Chengcheng Wan",
"He Jiang",
"Yuting Chen"
],
"abstract": "Object serialization is critical in Java, which preserves objects in memory and transfers them among software systems if needed. However, the serialization techniques of modern Java systems are usually inflexible and inefficient under heavy serialization workloads, as they rely on manually-defined object schemas or omni-functional serializers. To tackle the above problem, we reveal a novel, serialization-specific optimization opportunity in Java. Based on it, we develop JOSer (Just-in-time Object SERializer), an efficient, Just-in-Time (JIT) object serialization technique. At runtime, JOSer generates a set of class-specific, JIT-friendly object serializers (i.e., serialization code), and then continuously optimizes them with the JIT compiler of Java Virtual Machine (JVM). JOSer also shares the metadata of objects under serialization and the serializers under optimization. We evaluate JOSer against six Java serialization techniques including OpenJDK's built-in serialization technique. Overall, JOSer improves the throughput by up to 20~83\u00d7 in serialization and 43~229\u00d7 in deserialization. JOSer has been successfully deployed in real-world products, reducing serialization CPU usage of Flink by 35.32~41.14% and latency of search recommendations by 30+ ms. JOSer grounds Apache Fory\u2122, an open-source serialization framework available at https://fory.apache.org/.",
"doi": "10.1145/3779212.3790179",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790179",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790179",
"has_full_text": true
},
{
"title": "TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference",
"authors": [
"Xiaojuan Tang",
"Fanxu Meng",
"Pingzhi Tang",
"Yuxuan Wang",
"Di Yin",
"Xing Sun",
"Muhan Zhang"
],
"abstract": "",
"doi": "10.1145/3779212.3790237",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790237",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790237",
"has_full_text": true
},
{
"title": "Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge",
"authors": [
"Shvetank Prakash",
"Andrew Cheng",
"Olof Kindgren",
"Ashiq Ahamed",
"Graham Knight",
"Jedrzej Kufel",
"Francisco Rodriguez",
"Arya Tschand",
"David Kong",
"Mariam Elgamal",
"Jerry Huang",
"Emma Chen",
"Gage Hills",
"Richard Price",
"Emre Ozer",
"Vijay Janapa Reddi"
],
"abstract": "We present FlexiFlow, a lifetime-aware design framework for item-level intelligence (ILI) where computation is integrated directly into disposable products like food packaging and medical patches. Our framework leverages natively flexible electronics which offer significantly lower costs than silicon but are limited to kHz speeds and several thousands of gates. Our insight is that unlike traditional computing with more uniform deployment patterns, ILI applications exhibit 1000\u00d7 variation in operational lifetime, fundamentally changing optimal architectural design decisions when considering trillion-item deployment scales. To enable holistic design and optimization, we model the trade-offs between embodied carbon footprint and operational carbon footprint based on application-specific lifetimes. The framework includes: (1) FlexiBench, a workload suite targeting sustainability applications from spoilage detection to health monitoring; (2) FlexiBits, area-optimized RISC-V cores with 1/4/8-bit datapaths achieving 2.65\u00d7 to 3.50\u00d7 better energy efficiency per workload execution; and (3) a carbon-aware model that selects optimal architectures based on deployment characteristics. We show that lifetime-aware microarchitectural design can reduce carbon footprint by 1.62\u00d7, while algorithmic decisions can reduce carbon footprint by 14.5\u00d7. We validate our approach through the first tape-out using a PDK for flexible electronics with fully open-source tools, achieving 30.9 kHz operation. FlexiFlow enables exploration of computing at the Extreme Edge where conventional design methodologies must be reevaluated to account for new constraints and considerations. FlexiFlow is available at https://github.com/harvard-edge/FlexiFlow.",
"doi": "10.1145/3779212.3790182",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790182",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790182",
"has_full_text": true
},
{
"title": "SwiftSpec: Disaggregated Speculative Decoding and Fused Kernels for Low-Latency LLM Inference",
"authors": [
"Ziyi Zhang",
"Ziheng Jiang",
"Chengquan Jiang",
"Menghan Yu",
"Size Zheng",
"Haibin Lin",
"Xin Liu",
"Henry Hoffmann"
],
"abstract": "Low-latency, single-request decoding of large language models is critical for interactive systems with tight SLA demands. Prior work reduces latency through speculative decoding (combining a small draft model with a larger target model), but the draft model remains on the critical path, and communication overhead limits scaling across GPUs due to the small batch size associated with single-request decoding. To address these limitations, this paper introduces SwiftSpec: a system architecture that disaggregates draft and target models across homogeneous GPUs within a single node and utilizes NCCL-low-latency primitives directly to improve the performance of core GEMM and attention kernels. Our implementation includes 3k lines of custom CUDA for fused kernels and an evolving tree cache for KV-cache consistency and maximized reuse between draft and target models. On a single 8\u00d7H800 GPU node, SwiftSpec achieves 347 tokens/s for Llama-3-70B---1.3\u00d7 faster than NVIDIA's own benchmarks on a higher-performance 8\u00d7H200 setup---and averages 1.75\u00d7 faster decoding than state-of-the-art speculative decoding across five model families and six datasets. Specifically, we find that for Llama-3-70B SwiftSpec is significantly faster across all 480 tested queries, showing 1.7\u00d7 speedup over the best open-source baseline for 95th percentile requests. Code for SwiftSpec will be available at https://github.com/ByteDance-Seed/SwiftSpec",
"doi": "10.1145/3779212.3790246",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790246",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790246",
"has_full_text": true
},
{
"title": "Rage Against the State Machine: Type-Stated Hardware Peripherals for Increased Driver Correctness",
"authors": [
"Tyler Potyondy",
"Anthony Tarbinian",
"Leon Schuermann",
"Eric Mugnier",
"Adin Ackerman",
"Amit Levy",
"Pat Pannuto"
],
"abstract": "Hardware provides driver authors both a strict specification of the operations a driver is allowed to do, and a highly permissive interface full of operations a driver can do. Authoring drivers that adhere to the provided hardware device protocol is challenged by dynamic definitions of what a driver should do based on the hardware's state. This is further complicated by increasingly capable hardware which may transition between states concurrently and independently from the software driver. We present Abacus, a framework that statically prevents drivers from violating device protocols. Abacus refines type-states to model hardware-software concurrency and presents a formalization of hardware states into two families: stable and transient states. The Abacus framework provides a domain specific language for developers to encode device protocol invariants in tens of lines of code, and, using the generated Abacus type-states, statically prevents device protocol bugs. We demonstrate the Abacus framework's practicality by integrating it into drivers in two Rust OSes. We find that Abacus imposes minimal to no overhead in code-size and runtime performance, statically detects device protocol violations, and enables the usage of hardware features that would otherwise be prohibitively complex.",
"doi": "10.1145/3779212.3790207",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790207",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790207",
"has_full_text": true
},
{
"title": "Highly Automated Verification of Security Properties for Unmodified System Software",
"authors": [
"Ganxiang Yang",
"Wei Qiang",
"Yi Rong",
"Xuheng Li",
"Fanqi Yu",
"Jason Nieh",
"Ronghui Gu"
],
"abstract": "System software is often complex and hides exploitable security vulnerabilities. Formal verification promises bug-free software but comes with a prohibitive proof cost. We present Spoq2, the first verification framework to highly automate security verification of unmodified system software. Spoq2 is based on the observation that many security properties, such as noninterference, can be reduced to establishing inductive invariants on individual transitions of a transition system that models system software. However, directly verifying such invariants for real system code overwhelms existing SMT solvers. Spoq2 makes this possible by automatically reducing verification complexity. It decomposes transitions into individual execution paths, extends cone-of-influence analysis to the individual transition level, and eliminates irrelevant machine states, clauses, and control-flow paths before invoking the SMT solver. Spoq2 further optimizes how pointer operations are modeled and verified through pointer abstractions that eliminate expensive bit-wise operations from SMT queries. We demonstrate the effectiveness of Spoq2 by verifying security properties of four unmodified, real-world system codebases with minimal manual effort.",
"doi": "10.1145/3779212.3790171",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790171",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790171",
"has_full_text": true
},
{
"title": "Ouroboros: Wafer-Scale SRAM CIM with Token-Grained Pipelining for Large Language Model Inference",
"authors": [
"Yiqi Liu",
"Yudong Pan",
"Mengdi Wang",
"Shixin Zhao",
"Haonan Zhu",
"Yinhe Han",
"Lei Zhang",
"Ying Wang"
],
"abstract": "Large language model (LLM) inference demands vast memory capacity and hierarchical memory structures, but conventional architectures suffer from excessive energy and latency costs due to frequent data movement across deep memory tiers. To address this, we propose a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that performs all LLM operations in situ within the first-level SRAM, eliminating off-chip data migration and achieving unprecedented energy efficiency. However, wafer-scale SRAM CIM presents multiple challenges due to the limited first-level memory capacity, which requires efficient compute-memory resource allocation. To enable efficient LLM execution on this architecture, we propose three key innovations: Token-Grained Pipelining \u2013 Conventional sequence-level pipelining suffers from underutilization due to varying input sequence lengths and batching policies. We introduce a fine-grained token-level pipeline that mitigates sequence length variations, enhancing the utilization of CIM cores while minimizing the storage capacity required for activation. Distributed Dynamic KV Cache Management \u2013 KV cache storage occupies significant memory in LLM inference. We optimize on-chip KV caching by decoupling CIM memory from compute assignment, leveraging fragmented SRAM CIM memory within already-allocated cores for efficient KV storage, and reducing dedicated memory overhead. Communication-Aware and Fault-Tolerant Core Mapping \u2013 Efficient execution on wafer-scale CIM requires optimal mapping from transformer blocks to CIM cores to minimize inter-(pipeline) stage communication while ensuring intra-stage compute locality. We design a network-on-wafer-aware mapping strategy that places pipeline stages in close proximity while also distributing large layers efficiently across multiple cores. This mapping accounts for core-level defects, improving fault tolerance in wafer-scale deployment. Experimental results demonstrate that Ouroboros achieves 4.1\u00d7 average throughput improvement and 4.2\u00d7 average energy efficiency gain over state-of-the-art systems, peaking at 9.1\u00d7 throughput and 17\u00d7 energy efficiency for the 13B model.",
"doi": "10.1145/3779212.3790197",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790197",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790197",
"has_full_text": true
},
{
"title": "A Framework for Developing and Optimizing Fully Homomorphic Encryption Programs on GPUs",
"authors": [
"Jianyu Zhao",
"Xueyu Wu",
"Guang Fan",
"Mingzhe Zhang",
"Shoumeng Yan",
"Lei Ju",
"Zhuoran Ji"
],
"abstract": "In sensitive domains such as healthcare and finance, machine learning increasingly employs Fully Homomorphic Encryption (FHE) to secure both user data and models. Although FHE's intrinsic parallelism naturally aligns with GPU architectures, optimizing GPU kernels alone remains insufficient for efficient end-to-end FHE application development. The inherent complexity of FHE schemes and intricate GPU-specific details impede developers from focusing on high-level program logic. Additionally, FHE's high memory requirements, fine-grained memory operations, and redundant computations introduce further optimization challenges, resulting in inefficiencies even when GPU kernels are individually optimized. This paper introduces EasyFHE, a framework designed to simplify the development and optimization of GPU-accelerated FHE applications. Similar to PyTorch, EasyFHE provides high-level interfaces for defining computational logic while automatically handling low-level tasks, such as implementation selection and memory management. Furthermore, it incorporates an optimization framework that systematically addresses performance bottlenecks by applying tailored optimization passes during the lowering from high-level FHE programs to GPU kernels. Compared to state-of-the-art open-source GPU FHE libraries, EasyFHE uniquely supports FHE programs with memory requirements exceeding typical GPU capacities, achieving an average speedup of 2.88\u00d7 with a peak of 4.39\u00d7.",
"doi": "10.1145/3779212.3790120",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790120",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790120",
"has_full_text": true
},
{
"title": "Arancini: A Hybrid Binary Translator for Weak Memory Model Architectures",
"authors": [
"Sebastian Reimers",
"Dennis Sprokholt",
"Martin Fink",
"Theofilos Augoustis",
"Simon Kammermeier",
"Rodrigo C. O. Rocha",
"Tom Spink",
"Redha Gouicem",
"Soham Chakraborty",
"Pramod Bhatotia"
],
"abstract": "Binary translation is a powerful approach to support cross-architecture emulation of unmodified binaries in increasingly heterogeneous computing environments. However, binary translation systems face correctness issues, due to the strong-on-weak memory model mismatch (e.g., from x86-64 to Arm/RISC-V) for concurrent programs. Besides, the current landscape of binary translation systems is fundamentally limited in terms of completeness for static systems and performance for dynamic ones. To address these limitations, we propose Arancini, a hybrid binary translator system designed and implemented from the ground up that strives for correct, complete, and efficient emulation for weak memory model architectures. Our system makes three foundational contributions to achieve these design goals: ArancinIR, a unified intermediate representation for static and dynamic binary translators; a formalization of ArancinIR's memory model and formally verified mapping schemes from x86-64 to Arm and RISC-V, to ensure strong-on-weak correctness; and Arancini, a complete and performant hybrid binary translator, implementing the verified mapping schemes for correctness. We evaluate Arancini using a multi-threaded benchmark suite with two backends (Arm and RISC-V), and show that Arancini can be up to 5\u00d7 faster than QEMU-based translators while ensuring correctness and completeness. To our knowledge, Arancini is the first hybrid binary translator whose implementation is guided by formal proofs, to ensure correct execution of strong memory guests on weak memory hosts. It is also the first translator to address mixed-sized accesses for Arm targets.",
"doi": "10.1145/3779212.3790127",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790127",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790127",
"has_full_text": true
},
{
"title": "TetriServe: Efficiently Serving Mixed DiT Workloads",
"authors": [
"Runyu Lu",
"Shiqi He",
"Wenxuan Tan",
"Shenggui Li",
"Ruofan Wu",
"Jeff J. Ma",
"Ang Chen",
"Mosharaf Chowdhury"
],
"abstract": "Diffusion Transformer (DiT) models excel at generating high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging due to their high computational cost, particularly at larger resolutions. Existing serving systems use fixed-degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism to dynamically adjust the degree of parallelism of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment by (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level and minimizing GPU hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality. TetriServe is available at https://github.com/DiT-Serving/TetriServe.",
"doi": "10.1145/3779212.3790233",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790233",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790233",
"has_full_text": true
},
{
"title": "Accelerating Computation in Quantum LDPC Code",
"authors": [
"Jungmin Cho",
"Hyeonseong Jeong",
"Junpyo Kim",
"Junhyuk Choi",
"Juwon Hong",
"Jangwoo Kim"
],
"abstract": "Fault-tolerant quantum computing (FTQC) uses quantum error correction (QEC) codes to execute large-scale quantum programs on noisy quantum computers. Quantum low-density parity-check (qLDPC) codes are promising as they use an order of magnitude fewer qubits than widely used surface codes. However, as qLDPC codes support only a limited set of operations, they require programs to be decomposed into many qLDPC-supported operations. This prohibitively increases the execution time of FTQC applications to tens of days, hindering the practical viability of qLDPC codes. In this paper, we propose ACQC, a software\u2013hardware co-design approach to accelerate computation in qLDPC codes, realizing the practical use of qLDPC codes for FTQC. To achieve this goal, we first introduce a novel decomposition\u2013layout co-design that significantly reduces execution time at the cost of qubit overhead. Then, we reduce the qubit overhead by exploiting characteristics of qLDPC codes and our decomposition technique. Lastly, we reduce the qubit overhead of magic state distillation by designing an optimal qLDPC code layout. As there is currently no available hardware for qLDPC codes, we evaluate ACQC with a comprehensive simulation. The results show that ACQC achieves 4.4\u00d7 speedup over the baseline qLDPC code FTQC and 18.3\u00d7 qubit reduction over the surface code FTQC on average.",
"doi": "10.1145/3779212.3790122",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790122",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790122",
"has_full_text": true
},
{
"title": "Towards High-Goodput LLM Serving with Prefill-decode Multiplexing",
"authors": [
"Yukang Chen",
"Weihao Cui",
"Han Zhao",
"Ziyi Xu",
"Xiaoze Fan",
"Xusheng Chen",
"Yangjie Zhou",
"Shixuan Sun",
"Bingsheng He",
"Quan Chen"
],
"abstract": "Large Language Model (LLM) serving must meet stringent Service Level Objectives (SLOs) for both the prefill and decode phases. Some existing solutions disaggregate the two phases, causing potential resource idleness or compute redundancy. Others split the prefill phase into chunks and fuse it with decode iteration, creating a dilemma between SLO compliance and high utilization. To address these issues, an efficient serving system should dynamically adapt compute allocation, decouple compute from memory management, and execute prefill and decode independently. We present MuxWise, an LLM serving framework that adopts a new paradigm, intra-GPU prefill-decode multiplexing, to meet these requirements. To fully exploit the paradigm, MuxWise integrates a bubble-less multiplex engine, a contention-tolerant estimator, and an SLO-aware dispatcher. Evaluation shows that MuxWise improves peak throughput under SLO guarantees by an average of 2.20x (up to 3.06x) over state-of-the-art baselines.",
"doi": "10.1145/3779212.3790236",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790236",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790236",
"has_full_text": true
},
{
"title": "vCXLGen: Automated Synthesis and Verification of CXL Bridges for Heterogeneous Architectures",
"authors": [
"Anatole Lefort",
"Julian Pritzi",
"Nicol\u00f2 Carpentieri",
"David Schall",
"Simon Dittrich",
"Soham Chakraborty",
"Nicolai Oswald",
"Pramod Bhatotia"
],
"abstract": "Compute Express Link (CXL) offers byte-addressable, cache-coherent remote memory accesses across multiple hosts. Unfortunately, the CXL specification lacks mechanisms to ensure safe interoperability between heterogeneous host architectures with diverse cache coherence (CC) protocols and memory consistency models (MCMs). This semantic gap poses fundamental challenges and a significant barrier to adopting CXL in modern heterogeneous data centers. We propose CXL bridges, an abstraction that interposes between hosts and CXL to reconcile differences in CC protocols and MCMs. We present vCXLGen, the first system that automatically synthesizes and verifies these CXL bridges. We make two core contributions: (1) a fully automated approach to synthesize CXL bridges from machine-readable CC protocol specifications, and (2) a compositional formal verification approach for scalable model-checking of liveness properties. Our evaluation shows that vCXLGen is general, i.e., it supports diverse protocols (both SWMR and relaxed consistency) and is easily extensible when integrating new protocols, such as CXL.mem. Our performance evaluations indicate that synthesised bridges achieve comparable results to manually designed homogeneous protocols. Finally, for correctness, our formal verification rigorously proves the safety and liveness of synthesized bridges, all while achieving significant scalability in liveness verification of complex heterogeneous systems.",
"doi": "10.1145/3779212.3790245",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790245",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790245",
"has_full_text": true
},
{
"title": "DARTH-PUM: A Hybrid Processing-Using-Memory Architecture",
"authors": [
"Ryan Wong",
"Ben Feinberg",
"Saugata Ghose"
],
"abstract": "Analog processing-using-memory (PUM; a.k.a. in-memory computing) makes use of electrical interactions inside memory arrays to perform bulk matrix\u2013vector multiplication (MVM) operations. However, many popular matrix-based kernels need to execute non-MVM operations, which analog PUM cannot directly perform. To retain its energy efficiency, analog PUM architectures augment memory arrays with CMOS-based domain-specific fixed-function hardware to provide complete kernel functionality, but the difficulty of integrating such specialized CMOS logic with memory arrays has largely limited analog PUM to being an accelerator for machine learning inference, or for closely related kernels. An opportunity exists to harness analog PUM for general-purpose computation: recent works have shown that memory arrays can also perform Boolean PUM operations, albeit with very different supporting hardware and electrical signals than analog PUM. We propose DARTH-PUM, a general-purpose hybrid PUM architecture that tackles key hardware and software challenges to integrating analog PUM and digital PUM. We propose optimized peripheral circuitry, coordinating hardware to manage and interface between both types of PUM, an easy-to-use programming interface, and low-cost support for flexible data widths. These design elements allow us to build a practical PUM architecture that can execute kernels fully in memory, and can scale easily to cater to domains ranging from embedded applications to large-scale data-driven computing. We show how three popular applications (AES encryption, convolutional neural networks, large-language models) can map to and benefit from DARTH-PUM, with speedups of 59.4x, 14.8x, and 40.8x over an analog+CPU baseline.",
"doi": "10.1145/3779212.3790151",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790151",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790151",
"has_full_text": true
},
{
"title": "EARTH: An Efficient MoE Accelerator with Entropy-Aware Speculative Prefetch and Result Reuse",
"authors": [
"Fangxin Liu",
"Ning Yang",
"Jingkui Yang",
"Zongwu Wang",
"Chenyang Guan",
"Yu Feng",
"Li Jiang",
"Haibing Guan"
],
"abstract": "Mixture-of-Experts (MoE) models significantly reduce computation in large language models by activating only a subset of experts per input token, but they introduce severe memory bottlenecks due to the large number of expert parameters. Existing offloading and prefetching strategies either incur accuracy loss, prohibitively high memory traffic, or high decoding overhead, limiting deployment on resource-constrained hardware. In this work, we present EARTH, a hardware\u2013software co-design that addresses these challenges through three key innovations. First, we propose a dual-entropy encoding scheme that decomposes each expert into a high-information base and a delta component, enabling compact storage while preserving accuracy via adaptive precision management. Second, we introduce a delta-aware speculative prefetching and reuse mechanism that preloads base components of predicted experts and selectively fetches deltas, reusing previously computed delta patterns to reduce memory traffic and redundant computation. Third, we design a hardware accelerator that is co-designed to efficiently support this encoding and prefetching strategy, optimizing execution order, parallelism, and memory utilization. Across representative MoE workloads, EARTH reduces data movement overhead, improves prefetch efficiency, and achieves up to 2.10\u00d7 speedup compared to state-of-the-art baselines, while maintaining high model accuracy.",
"doi": "10.1145/3779212.3790155",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790155",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790155",
"has_full_text": true
},
{
"title": "PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel",
"authors": [
"Jinjun Yi",
"Zhixin Zhao",
"Yitao Hu",
"Ke Yan",
"Weiwei Sun",
"Hao Wang",
"Laiping Zhao",
"Yuhao Zhang",
"Wenxin Li",
"Keqiu Li"
],
"abstract": "LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention. This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses, runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge performs online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 53.5% on average and TPOT by 17.0-93.1% under the same configurations against state-of-the-art attention kernels. PAT's source code is publicly available at https://github.com/flashserve/PAT.",
"doi": "10.1145/3779212.3790200",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790200",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790200",
"has_full_text": true
},
{
"title": "Compass: Navigating the Design Space of Taint Schemes for RTL Security Verification",
"authors": [
"Yuheng Yang",
"Qinhan Tan",
"Thomas Bourgeat",
"Sharad Malik",
"Mengjia Yan"
],
"abstract": "Hardware information flow tracking (IFT) using taint analysis provides a methodology to check whether a hardware design satisfies certain security properties. Previous work has shown a broad trade-off space between precision and complexity when using different taint analysis schemes. A careful investigation of this space has led to the insight that applying different taint schemes to different components of a hardware design can improve overall efficiency. We present Compass, a systematic framework to guide users in designing appropriate taint schemes that are as lightweight as possible while still sufficient to accomplish their security verification goals. We first establish a unified terminology to comprehensively capture existing taint schemes. We then apply counterexample-guided abstraction refinement (CEGAR) for taint refinement to iteratively improve the taint scheme. We evaluated Compass on a set of open-source RISCV processors to verify the information flow properties for speculative execution vulnerabilities, and demonstrate that Compass significantly improves both simulation speed and formal-verification scalability of taint analysis.",
"doi": "10.1145/3779212.3790144",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790144",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790144",
"has_full_text": true
},
{
"title": "Fine-grained and Non-intrusive LLM Training Monitoring via Microsecond-level Traffic Measurement",
"authors": [
"Yibo Xiao",
"Hao Zheng",
"Haifeng Sun",
"Qingkai Meng",
"Jiong Duan",
"Xiaohe Hu",
"Rong Gu",
"Guihai Chen",
"Chen Tian"
],
"abstract": "Large language model (LLM) training is prone to anomalies due to its long duration and large scale, which can lead to significant performance degradation or even training crashes. Due to the synchronous nature of LLM training, anomalies exhibit a cascading effect, making their diagnosis challenging. Existing approaches rely on collecting communication operator information via code instrumentation, which yields only coarse-grained monitoring data and requires modifications to training code or communication libraries. We propose Pulse, a fine-grained, non-intrusive, and easy-to-deploy monitoring system. Our key idea is to enable fine-grained monitoring via traffic measurement. Pulse conducts microsecond-level RDMA traffic measurement on NICs, and transforms flow-level measurements into communication operator measurements, thereby enabling fine-grained and non-intrusive monitoring. We deploy Pulse on a testbed with 64 H200 GPUs and evaluate its anomaly localization capability under common failure scenarios. Pulse achieves machine-level localization in 10 out of 12 scenarios, while existing methods succeed in only 4 and even misdiagnose 2 of the remaining scenarios. Additionally, Pulse achieves over 90% precision and 100% recall, supports up to 2000 concurrent RDMA flow measurements per NIC, and imposes negligible overhead on training performance, making it a practical solution for real-world LLM training environments.",
"doi": "10.1145/3779212.3790163",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790163",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790163",
"has_full_text": true
},
{
"title": "MoE-APEX: An Efficient MoE Inference System with Adaptive Precision Expert Offloading",
"authors": [
"Peng Tang",
"Jiacheng Liu",
"Xiaofeng Hou",
"Yifei Pu",
"Jing Wang",
"Pheng-Ann Heng",
"Chao Li",
"Minyi Guo"
],
"abstract": "Mixture-of-experts (MoE) architectures enable scalable Large Language Models (LLMs) with reduced computational overhead, yet their deployment on memory-constrained edge devices is hindered by substantial memory demands. Traditional expert-offloading techniques mitigate memory constraints but often significantly increase inference latency. We introduce MoE-APEX, an Adaptive Precision EXpert offloading system that optimizes MoE inference for edge architectures by dynamically managing expert precision. Our core innovation is to replace less critical cache-miss experts with low-precision variants, reducing loading latency while maintaining accuracy. MoE-APEX introduces three innovative techniques that map onto the natural hierarchy of MoE computation: (1) a token-level dynamic expert loading mechanism, (2) a layer-level adaptive expert prefetching technique, and (3) a sequence-level cost-aware expert caching policy. These innovations enable MoE-APEX to fully leverage the benefits of mixed-precision expert inference. Implemented atop Llama.cpp, MoE-APEX achieves decoding speedups ranging from 1.34x to 9.75x compared to state-of-the-art MoE offloading systems across diverse edge devices, offering a robust solution for efficient MoE deployment in resource-constrained environments.",
"doi": "10.1145/3779212.3790187",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790187",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790187",
"has_full_text": true
},
{
"title": "AlphaSyndrome: Tackling the Syndrome Measurement Circuit Scheduling Problem for QEC Codes",
"authors": [
"Yuhao Liu",
"Shuohao Ping",
"Junyu Zhou",
"Ethan Decker",
"Justin Kalloor",
"Mathias Weiden",
"Kean Chen",
"Yunong Shi",
"Ali Javadi-Abhari",
"Costin Iancu",
"Gushu Li"
],
"abstract": "",
"doi": "10.1145/3779212.3790123",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790123",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790123",
"has_full_text": true
},
{
"title": "BitRed: Taming Non-Uniform Bit-Level Sparsity with a Programmable RISC-V ISA for DNN Acceleration",
"authors": [
"Yanhuan Liu",
"Wenming Li",
"Kunming Zhang",
"Yuqun Liu",
"Siao Wen",
"Lexin Wang",
"Tianyu Liu",
"Haibin Wu",
"Zhihua Fan",
"Xiaochun Ye",
"Dongrui Fan",
"Xuejun An"
],
"abstract": "The non-uniform and dynamic nature of Bit-Level Sparsity (BLS) poses a critical load-imbalance challenge for parallel hardware accelerators. While the Bit-Interleaving paradigm, represented by state-of-the-art accelerators like Bitlet, shows promise, it is fundamentally constrained by a rigid datapath and severe inter-channel load imbalance. This paper introduces BitRed, an accelerator that embodies a new ''programmable adaptive bit-interleaving'' philosophy. Rather than a monolithic design, BitRed's core is an Adaptive-Sparse Processing Unit (ASPU) that deconstructs the acceleration process into a set of orthogonal RISC-V ISA extensions for pre-processing (cal.pre), adaptive distillation with dynamic load balancing (cal.adis), and PDP-optimal reduction (cal.red). By transforming a rigid hardware problem into a flexible scheduling problem, this ISA-based approach provides a fundamentally more adaptable and extensible solution. Empirical studies on a broad set of benchmarks highlight the following results (normalized to a SCNN baseline): (1) up to 9.4\u00d7 speedup over Bitlet, and 5.6\u00d7 over the latest bit-serial SOTA, BitWave; (2) up to 7.6\u00d7 higher inference efficiency than Bitlet on representative models; (3) 5.072 mm\u00b2 area and scalable power consumption from 550.43mW (float32) to 495.12mW (16b) and 457.90mW (8b) at 28nm TSMC; and (4) high versatility across precisions, and up to 18.9\u00d7 and 13.5\u00d7 higher efficiency than NVIDIA A100 and Jetson Orin 32GB, respectively, demonstrating significant competitiveness against GPUs.",
"doi": "10.1145/3779212.3790132",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790132",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790132",
"has_full_text": true
},
{
"title": "Hitchhike: Efficient Request Submission via Deferred Enforcement of Address Contiguity",
"authors": [
"Xuda Zheng",
"Jian Zhou",
"Shuhan Bai",
"Runjin Wu",
"Xianlin Tang",
"Zhiyuan Li",
"Hong Jiang",
"Fei Wu"
],
"abstract": "Modern storage systems operate under high concurrency, making large volumes of outstanding I/Os the norm. However, current I/O submission logic requires requests assigned to the same CPU core to be processed in a serialized manner, turning the software stack into a bottleneck due to high per-request overhead. We argue that the primary cause of this inefficiency lies in the end-to-end enforcement of the strict address contiguity validation -- the constraint that each I/O request must access a contiguous range of addresses (e.g., file offsets, sectors, and logical block addresses). Through system-level analysis, we find that while the address contiguity requirement is crucial at the device level as defined by the NVMe protocol, it is unnecessarily enforced throughout the entire I/O stack. Based on this insight, we propose Hitchhike, an efficient request submission logic that defers address contiguity validation to the device driver. Unlike solutions that resort to kernel-bypass or aggressive polling for performance, Hitchhike achieves high efficiency within the standard OS stack by allowing a single request to encapsulate multiple non-contiguous address ranges. This mechanism drastically reduces the number of individual requests traversing the stack, thereby amortizing the kernel's per-request overhead. We apply Hitchhike to a graph engine, a B-tree store, and FIO. Experimental results show that Hitchhike effectively improves throughput and reduces CPU overhead, while preserving compatibility with existing kernel semantics and hardware interfaces.",
"doi": "10.1145/3779212.3790173",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790173",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790173",
"has_full_text": true
},
{
"title": "WorksetEnclave: Towards Optimizing Cold Starts in Confidential Serverless with Workset-Based Enclave Restore",
"authors": [
"Xiaolong Yan",
"Qihang Zhou",
"Zisen Wan",
"Feifan Qian",
"Wentao Yao",
"Weijuan Zhang",
"Xiaoqi Jia"
],
"abstract": "Serverless computing has become a popular cloud computing paradigm. However, the increasing demand for data security in serverless applications necessitates the use of Trusted Execution Environments (TEEs) such as Intel Software Guard Extensions (SGX). Despite the promise of SGX for secure computation, its adoption in serverless environments is hindered by high startup latencies and excessive Enclave Page Cache (EPC) consumption, particularly during cold starts. This paper identifies the key challenges of SGX in serverless workloads and proposes WorksetEnclave, an efficient optimization method designed to address these issues. WorksetEnclave leverages a snapshot-based approach to optimize both startup time and enclave memory usage. By tracking workset pages used during execution, WorksetEnclave minimizes enclave memory footprints and significantly accelerates enclave restore time during secure checkpointing and recovery. We have implemented two separate prototypes, each based on a different LibOS: Gramine and Occlum. Our evaluation shows that WorksetEnclave accelerates cold start times by 1.9--54\u00d7 and reduces enclave memory consumption by 13.37--94.87%. Our findings demonstrate that WorksetEnclave significantly improves the performance of confidential serverless computing.",
"doi": "10.1145/3779212.3790249",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790249",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790249",
"has_full_text": true
},
{
"title": "Insum: Sparse GPU Kernels Simplified and Optimized with Indirect Einsums",
"authors": [
"Jaeyeon Won",
"Willow Ahrens",
"Saman Amarasinghe",
"Joel S. Emer"
],
"abstract": "",
"doi": "10.1145/3779212.3790176",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790176",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790176",
"has_full_text": true
},
{
"title": "PF-LLM: Large Language Model Hinted Hardware Prefetching",
"authors": [
"Ceyu Xu",
"Xiangfeng Sun",
"Weihang Li",
"Chen Bai",
"Bangyan Wang",
"Mengming Li",
"Zhiyao Xie",
"Yuan Xie"
],
"abstract": "Hardware data prefetching is a critical technique for mitigating memory latency in modern processors. While sophisticated hardware prefetching algorithms exist, their exclusive reliance on runtime information limits their ability to adapt quickly and comprehend broader program context. Our key insight is that the optimal prefetching strategy for a load instruction is often discernible from its static code context -- a task at which experienced developers excel. This motivates our central question: can a Large Language Model (LLM) be trained to perform this analysis automatically? We introduce PF-LLM, an LLM fine-tuned to analyze the assembly context surrounding a load instruction and generate prefetching hints. These offline-generated hints are consumed at runtime by LMHint Prefetcher, a lightweight hardware prefetcher ensemble designed to leverage this static guidance. Our approach boosts the performance of the on-chip hardware prefetcher by moving the hard ''when, how, and how aggressively to prefetch'' decisions out of the runtime hardware and into an offline LLM-powered analysis. This turns the on-chip prefetcher into a zero-latency, oracle-level system that always follows the best prefetching policy for every single load instruction. Our evaluation shows that our approach achieves a 9.8% instruction-per-cycle (IPC) improvement on average for memory-intensive SPEC 2017 benchmarks over state-of-the-art hardware prefetching baselines and 18.9% improvement on average over state-of-the-art ensemble methods, demonstrating the significant potential of leveraging LLMs to guide microarchitectural decisions.",
"doi": "10.1145/3779212.3790202",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790202",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790202",
"has_full_text": true
},
{
"title": "Efficient Temporal Graph Network Training via Unified Redundancy Elimination",
"authors": [
"Yiqing Wang",
"Hailong Yang",
"Kejie Ma",
"Enze Yu",
"Pengbo Wang",
"Xin You",
"Qingxiao Sun",
"Chenhao Xie",
"Zhongzhi Luan",
"Yi Liu",
"Depei Qian"
],
"abstract": "Temporal Graph Network (TGN) is increasingly adopted to model evolving relationships in dynamic graphs. However, the training pipeline is plagued by pervasive redundancy in computation, storage, and data loading. These redundancies harm computational efficiency, exacerbate memory pressure, and induce excessive CPU-GPU data transfers. We present PULSE, an end-to-end TGN training framework that systematically eliminates redundancies guided by a unified minimal-unit principle. To realize this principle, PULSE defines three synergistic units: 1) the Minimal Input Unit (MIU) for component-wise deduplication and operator-level reconstruction of redundant computations, 2) the Minimal Storage Unit (MSU) for dependency-guided message reconstruction, only preserving irreproducible entries while enabling on-demand recovery of others, and 3) the Minimal Reuse Unit (MRU) for GPU memory management, combining a BlockPool-based buffer allocator with a bipartite temporal reuse strategy to mitigate fragmentation and exploit inter-batch locality. Experimental results on representative benchmarks demonstrate that PULSE improves training throughput by up to 6.67\u00d7 over the state-of-the-art baselines.",
"doi": "10.1145/3779212.3790157",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790157",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790157",
"has_full_text": true
},
{
"title": "CPU-Oblivious Offloading of Failure-Atomic Transactions for Disaggregated Memory",
"authors": [
"Cheng Chen",
"Chencheng Ye",
"Yuanchao Xu",
"Xipeng Shen",
"Xiaofei Liao",
"Hai Jin",
"Wenbin Jiang",
"Yan Solihin"
],
"abstract": "Memory disaggregation introduces new challenges for application reliability, as compute server or interconnection failures can interrupt execution and lead to data inconsistency in the memory server. This paper presents Fanmem, a novel failure-atomic transaction system designed specifically for disaggregated memory architectures. Fanmem ensures data consistency in the presence of failures, drawing inspiration from persistent memory transactions while tailored for memory disaggregation. The key innovations of Fanmem include an asynchronous transaction model and the integration of a processing unit within the switch, enabling the offloading of time-consuming log persistency operations to the switch processing unit and significantly reducing the overhead on the compute servers. Evaluation confirms the effectiveness of Fanmem on two representative memory-disaggregated architectures. Compared to the state-of-the-art persistent memory transaction system, Fanmem achieves an average performance improvement of 1.2X and 1.7X on the respective architectures.",
"doi": "10.1145/3779212.3790146",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790146",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790146",
"has_full_text": true
},
{
"title": "TierX: A Simulation Framework for Multi-tier BCI System Design Evaluation and Exploration",
"authors": [
"Seunghyun Song",
"Yeongwoo Jang",
"Daye Jung",
"Kyungsoo Park",
"Donghan Kim",
"Gwangjin Kim",
"Hunjun Lee",
"Jerald Yoo",
"Jangwoo Kim"
],
"abstract": "Brain-computer interfaces (BCIs) have made remarkable progress in recent years, driven by advances in neuroscience and clinical applications. For practical use, underlying processing systems must meet strict latency and power budgets. However, existing BCI systems typically rely on a single processing node to handle the entire workload, making it difficult to satisfy these budgets across diverse applications. In this work, we present TierX, the first simulation framework for design space exploration of multi-tier BCI systems. TierX models heterogeneous processing nodes across tiers, including implanted processors, body-attached devices, and external servers, together with diverse communication and powering methods. It navigates the extensive design space to identify optimal (1) workload partitioning options and (2) system configurations that leverage the strengths of each tier. We validate TierX on representative system configurations and demonstrate its effectiveness across diverse use cases.",
"doi": "10.1145/3779212.3790234",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790234",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790234",
"has_full_text": true
},
{
"title": "History Doesn't Repeat Itself but Rollouts Rhyme: Accelerating Reinforcement Learning with RhymeRL",
"authors": [
"Jingkai He",
"Tianjian Li",
"Erhu Feng",
"Dong Du",
"Qian Liu",
"Tao Liu",
"Yubin Xia",
"Haibo Chen"
],
"abstract": "With the rapid advancement of large language models (LLMs), reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of LLMs. Unlike traditional pre-training approaches, RL encompasses multiple stages: rollout, reward, and training, which necessitates collaboration among various worker types. However, current RL systems continue to grapple with substantial GPU underutilization, due to two primary factors: (1) The rollout stage dominates the overall RL process due to test-time scaling; (2) Imbalances in rollout lengths (within the same batch) result in GPU bubbles. While prior solutions like asynchronous execution and truncation offer partial relief, they may compromise training accuracy for efficiency. Our key insight stems from a previously overlooked observation: rollout responses exhibit remarkable similarity across adjacent training epochs. Based on the insight, we introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine that utilizes the similarity of historical rollout token sequences to obtain accurate drafts. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy that leverages the similarity of historical rollout distributions to balance workload among rollout workers. Experimental results demonstrate that RhymeRL achieves up to a 2.6x performance improvement over existing methods, without compromising accuracy or modifying the RL paradigm.",
"doi": "10.1145/3779212.3790172",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790172",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790172",
"has_full_text": true
},
{
"title": "Maverick: Rethinking TFHE Bootstrapping on GPUs via Algorithm-Hardware Co-Design",
"authors": [
"Zhiwei Wang",
"Haoqi He",
"Lutan Zhao",
"Qingyun Niu",
"Dan Meng",
"Rui Hou"
],
"abstract": "Fully homomorphic encryption (FHE) enables arbitrary computation over encrypted data (ciphertext) without compromising confidentiality. Within this family, TFHE features versatile bootstrapping mechanisms that are attractive for security-critical applications. However, its prohibitive computational cost severely limits practical deployment. While hardware acceleration is promising, mere compute scaling fails to overcome the inherent barriers. In particular, the combination of limited algorithmic parallelism and inadequate understanding of hardware behaviors prevents full exploitation of the available performance headroom. In this paper, we present Maverick, a GPU-based solution that rethinks TFHE bootstrapping acceleration through joint algorithm\u2013hardware co-design. At the algorithmic level, we reconstruct blind rotation by shifting test vectors from pre-loading to post-injection. This reformulation turns the original length-n single chain into a multi-chain structure with \u221an sub-chains of depth \u221an, raising external-product (EP) parallelism from 1 to the theoretical bound of \u221an. At the hardware level, we introduce a partial-domain transformation strategy that terminates the NTT early in a stage-aware manner and selectively shifts computation to downstream operators, redefining operator boundaries to balance the workload across EP operators. Experimental results show that Maverick delivers state-of-the-art GPU acceleration. It outperforms the best existing CPU and GPU baselines by 331.2\u00d7 and 3.4\u00d7, respectively, on programmable bootstrapping, and achieves up to 108.5\u00d7 speedup over CPU-based circuit bootstrapping. Together, these results highlight the substantial potential of algorithm\u2013hardware co-design for advancing FHE performance on general-purpose platforms.",
"doi": "10.1145/3779212.3790186",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790186",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790186",
"has_full_text": true
},
{
"title": "APT: Securing Against DRAM Read Disturbance via Adaptive Probabilistic In-DRAM Trackers",
"authors": [
"Runjin Wu",
"Meng Zhang",
"You Zhou",
"Changsheng Xie",
"Fei Wu"
],
"abstract": "With exacerbated DRAM read disturbance, transparent in-DRAM defenses require space (to track the aggressor rows) and time (to perform more mitigations). To reduce storage overhead, recent works have developed probabilistic row-sampling techniques with several entries. To address the time issue, these techniques employ Refresh Management (RFM) commands introduced in DDR5. However, probabilistic defenses with RFM face two critical challenges: (i) fixed-probability sampling under dynamic activation patterns can cause row-sampling misses, allowing attacks to evade mitigation, and (ii) RFM causes timing variations that can be exploited for side and covert channels to leak sensitive information. The goal of this paper is to design a low-cost and secure in-DRAM defense that overcomes these challenges. Our solution, APT (Adaptive Probabilistic In-DRAM Tracker), builds on two key insights. First, by ensuring that mitigation probabilities across activations always sum to 1, row-sampling misses can be prevented and the likelihood of capturing attacked rows can be improved. Second, as RFM is insecure, mitigation should exploit the unutilized slack within the refresh cycle (tRFC) while minimizing per-mitigation latency. APT adaptively allots equal sampling chances to all activations, based on real-time activation counts, avoiding sampling misses. To increase the mitigation rate without relying on RFM, we develop customizable Step Mitigation, which securely protects against transitive attacks with only two victim refreshes. Step Mitigation enables APT to perform up to three mitigations (APT-3) per periodic refresh (REF). We analytically derive the minimum RowHammer threshold tolerated by APT across all patterns. APT achieves a worst-case threshold of 694, which can be reduced to 228 using secure Timing-Based RFM (TB-RFM) at negligible overhead. We also show that APT compares favorably to TPRAC.",
"doi": "10.1145/3779212.3790126",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790126",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790126",
"has_full_text": true
},
{
"title": "RowArmor: Efficient and Comprehensive Protection Against DRAM Disturbance Attacks",
"authors": [
"Minbok Wi",
"Yoonyul Yoo",
"Yoojin Kim",
"Jaeho Shin",
"Jumin Kim",
"Yesin Ryu",
"Saeid Gorgin",
"Jung Ho Ahn",
"Jungrae Kim"
],
"abstract": "Shrinking process technologies have made DRAM increasingly vulnerable to disturbance attacks, such as RowHammer, which can compromise data integrity or induce a Denial-of-Service (DoS) state. Existing solutions focus on preventing errors through activation monitoring and extra refreshes, but they often incur substantial performance overhead and can unintentionally exacerbate DoS risks. This paper introduces RowArmor, a novel defense mechanism that addresses both data corruption and DoS attacks. RowArmor adopts a reactive approach, correcting disturbance errors as they occur via address scrambling and enhanced error correcting codes. This strategy avoids the high costs of preventive methods while addressing diverse attack patterns, including RowPress. Our evaluation demonstrates that RowArmor effectively defends against data corruption and DoS attacks with a negligible performance overhead of up to 0.7%.",
"doi": "10.1145/3779212.3790213",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790213",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790213",
"has_full_text": true
},
{
"title": "M\u00b2XFP: A Metadata-Augmented Microscaling Data Format for Efficient Low-bit Quantization",
"authors": [
"Weiming Hu",
"Zihan Zhang",
"Haoyan Zhang",
"Chen Zhang",
"Cong Guo",
"Yu Feng",
"Tianchi Hu",
"Guanglin Li",
"Guipeng Hu",
"Junsong Wang",
"Jingwen Leng"
],
"abstract": "",
"doi": "10.1145/3779212.3790185",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790185",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790185",
"has_full_text": true
},
{
"title": "Streaming Tensor Programs: A Streaming Abstraction for Dynamic Parallelism",
"authors": [
"Gina Sohn",
"Genghan Zhang",
"Konstantin Hossfeld",
"Jungwoo Kim",
"Nathan Sobotka",
"Nathan Zhang",
"Olivia Hsu",
"Kunle Olukotun"
],
"abstract": "Dynamic behaviors are becoming prevalent in tensor applications, like machine learning, where many widely used models contain data-dependent tensor shapes and control flow. However, the limited expressiveness of prior programming abstractions for spatial dataflow accelerators (SDAs) forces these dynamic behaviors to be implemented statically and/or unoptimized. To address these challenges, we present Streaming Tensor Programs (STeP), a streaming abstraction that enables dynamic tensor workloads to run efficiently on SDAs. STeP introduces flexible routing operators, an explicit memory hierarchy, and symbolic-shape semantics that expose dynamic data rates and tensor dimensions. These capabilities unlock new optimizations, like dynamic tiling, dynamic parallelization, and configuration time-multiplexing, that adapt SDA execution to dynamic behaviors while preserving dataflow efficiency. Using a cycle-approximate simulator on representative LLM layers and a full model with real-world traces, STeP enables: dynamic tiling that breaks the Pareto-optimal frontier from prior work, dynamic parallelization that improves latency by ~2.72x, and configuration time-multiplexing that increases compute utilization by ~2.64x over prior SDA abstractions and their implementations.",
"doi": "10.1145/3779212.3790229",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790229",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790229",
"has_full_text": true
},
{
"title": "FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow",
"authors": [
"Rubens Lacouture",
"Nathan Zhang",
"Ritvik Sharma",
"Marco Siracusa",
"Fredrik Kjolstad",
"Kunle Olukotun",
"Olivia Hsu"
],
"abstract": "As deep learning models scale, sparse deep learning (DL) models that exploit sparsity in weights, activations, or inputs and specialized dataflow hardware have emerged as powerful solutions to address efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations like parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (entire cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models\u2014fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using FuseFlow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.",
"doi": "10.1145/3779212.3790165",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790165",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790165",
"has_full_text": true
},
{
"title": "Bat: Efficient Generative Recommender Serving with Bipartite Attention",
"authors": [
"Jie Sun",
"Shaohang Wang",
"Zimo Zhang",
"Zhengyu Liu",
"Yunlong Xu",
"Peng Sun",
"Bo Zhao",
"Bingsheng He",
"Fei Wu",
"Zeke Wang"
],
"abstract": "Generative Recommenders (GRs) have recently emerged as promising alternatives to traditional Deep Learning Recommendation Models (DLRMs). Despite their potential, GRs remain computationally expensive in inference, exhibiting compute-bound characteristics similar to the prefill stage of Large Language Model (LLM) inference. Prefix caching can reduce redundant computation by reusing previously constructed KV caches. However, the unique properties of GRs, i.e., highly personalized user profiles and real-time item retrieval, make cache reuse across queries challenging, resulting in limited computational savings. To address these challenges, we present Bat, an efficient serving system for GRs. The key observation is that the semantics between user and item tokens are permutation-invariant. Building on this, we propose Bipartite Attention, a novel attention mechanism that enables adaptive selection of either the user or the item as the prompt prefix without compromising accuracy, thereby unlocking new opportunities for KV cache reuse. We further co-design a disaggregated KV cache pool to proactively manage user-prefix and item-prefix caches as separate components. Since introducing item caches incurs additional memory overhead, we develop a hot-replicated cold-sharded item cache placement strategy that minimizes memory usage and maintains low communication overheads. Finally, we introduce a hotness-aware prompt scheduling strategy to optimize prefix selection under memory constraints. Extensive experiments on multiple recommendation datasets demonstrate that Bat improves serving throughput by up to 1.6x over the conventional user-as-prefix approach, while reducing total computation by up to 58%.",
"doi": "10.1145/3779212.3790131",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790131",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790131",
"has_full_text": true
},
{
"title": "COMPAS: A Distributed Multi-Party SWAP Test for Parallel Quantum Algorithms",
"authors": [
"Brayden Goldstein-Gelb",
"Kun Liu",
"John M. Martyn",
"Hengyun (Harry) Zhou",
"Yongshan Ding",
"Yuan Liu"
],
"abstract": "The limited number of qubits per chip remains a critical bottleneck in quantum computing, motivating the use of distributed architectures that interconnect multiple quantum processing units (QPUs). However, executing quantum algorithms across distributed systems requires careful co-design of algorithmic primitives and hardware architectures to manage circuit depth and entanglement overhead. We identify multivariate trace estimation as a key subroutine that is naturally suited for distribution, and broadly useful in tasks such as estimating R\u00e9nyi entropies, virtual cooling and distillation, and certain applications of quantum signal processing. In this work, we introduce COMPAS, an architecture that realizes multivariate trace estimation across a multi-party network of interconnected modular and distributed QPUs by leveraging pre-shared entangled Bell pairs as resources. COMPAS adds only a constant depth overhead and consumes Bell pairs at a rate linear in circuit width, making it suitable for near-term hardware. Unlike other schemes, which must choose between asymptotic optimality in circuit depth or GHZ width, COMPAS achieves both at once. Additionally, we analyze network-level errors and simulate the effects of circuit-level noise on the architecture.",
"doi": "10.1145/3779212.3790143",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790143",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790143",
"has_full_text": true
},
{
"title": "PropHunt: Automated Optimization of Quantum Syndrome Measurement Circuits",
"authors": [
"Joshua Viszlai",
"Satvik Maurya",
"Swamit Tannu",
"Margaret Martonosi",
"Frederic T. Chong"
],
"abstract": "Fault-Tolerant Quantum Computing (FTQC) relies on Quantum Error Correction (QEC) codes to reach error rates necessary for large scale quantum applications. At a physical level, QEC codes perform parity checks on data qubits, producing syndrome information, through Syndrome Measurement (SM) circuits. These circuits define a code's logical error rate and must be run repeatedly throughout the entire program. The performance of SM circuits is therefore critical to the success of a FTQC system. While ultimately implemented as physical circuits, SM circuits have challenges that are not addressed by existing circuit optimization tools. Importantly, inside SM circuits themselves errors are expected to occur, and how errors propagate through SM circuits directly impacts which errors are detectable and correctable, defining the code's logical error rate. This is not modeled in NISQ-era tools, which instead optimize for targets such as gate depth or gate count to mitigate the chance that any error occurs. This gap leaves key questions unanswered about the expected real-world effectiveness of QEC codes. In this work we address this gap and present PropHunt, an automated tool for optimizing SM circuits for CSS codes. We evaluate PropHunt on a suite of relevant QEC codes and demonstrate PropHunt's ability to iteratively improve performance and recover existing hand-designed circuits automatically. We also propose a near-term QEC application, Hook-ZNE, which leverages PropHunt's fine-grained control over logical error rate to improve Zero-Noise Extrapolation (ZNE), a promising error mitigation strategy.",
"doi": "10.1145/3779212.3790205",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790205",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790205",
"has_full_text": true
},
{
"title": "DFVG: A Heterogeneous Architecture for Speculative Decoding with Draft-on-FPGA and Verify-on-GPU",
"authors": [
"Shaoqiang Lu",
"Yangbo Wei",
"Junhong Qian",
"Dongge Qin",
"Shiji Gao",
"Yizhi Ding",
"Qifan Wang",
"Chen Wu",
"Xiao Shi",
"Lei He"
],
"abstract": "Speculative decoding is a promising paradigm that accelerates LLM inference by generating drafts and performing verification. However, such systems still face three major challenges: (1) The imbalance in resource requirements between draft and verification models results in low utilization and energy inefficiency when deployed together. (2) Fixed-pattern token trees produce many candidates but few valid paths, resulting in redundant drafts due to the lack of full leverage of the inherent confidence in dynamic generation. (3) Asynchronous execution with frequent alternation between the two stages suffers from idle waiting and rollback overhead. To address these issues, we propose DFVG, a heterogeneous speculative decoding architecture that offloads draft generation to FPGAs and verification to GPUs, exploiting their complementary strengths. We introduce three key contributions: (1) Heterogeneous architecture design that partitions speculative decoding into FPGA-based drafting and GPU-based verification, exploiting complementary hardware strengths with an overlap processor for high-throughput execution; (2) Hardware-aware dynamic draft generation that dynamically predicts speculative branches and token lengths based on model confidence while considering hardware parallelism limits; (3) Tightly-coupled heterogeneous pipeline with stage-decoupled scheduling that allocates execution windows between stages, combined with lightweight cross-device alignment and rollback prediction strategies. Comprehensive evaluation on mainstream models (OPT, LLaMA, Qwen) demonstrates DFVG achieves up to 3.26\u00d7 speedup and 5.8\u00d7 energy efficiency improvement over existing approaches. The source code is available at: https://github.com/ShaoqiangLu/DFVG",
"doi": "10.1145/3779212.3790153",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790153",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790153",
"has_full_text": true
},
{
"title": "CoGraf: Fully Accelerating Graph Applications with Fine-Grained PIM",
"authors": [
"Ali Semi Yenimol",
"Anirban Nag",
"Chang Hyun Park",
"David Black-Schaffer"
],
"abstract": "Processing-in-Memory (PIM) delivers enormous performance by taking advantage of internal DRAM bandwidth and parallelism. However, graph applications are difficult to adapt to PIM due to their irregular access patterns. We present the first Fine-Grained PIM (FGPIM) design that fully accelerates vertex-centric push-based graph applications by accelerating both their update (computing vertex updates) and apply (summing up the updates) phases. For the update phase, we design a tuple-based LLC that can coalesce at different granularities to group graph updates together and propose multi-DRAM column processing FGPIM instructions to match the cache coalescing to the row-level parallelism of the FGPIM. With this acceleration, the apply phase becomes the bottleneck, and we propose bank-parallel FGPIM instructions with predicates to allow FGPIM to accelerate the conditional updates as well. We achieve an average speedup in the region of interest of 1.8\u00d7/3\u00d7 compared to naive FGPIM and 4.4\u00d7/9.8\u00d7 compared to state-of-the-art non-PIM baseline (HBM2/DDR4), and DRAM energy reduction of 67%/86% and 88%/94%. These results show the importance of providing a complete solution that accelerates both the update and apply phases.",
"doi": "10.1145/3779212.3790142",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790142",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790142",
"has_full_text": true
},
{
"title": "CEMU: Enabling Full-System Emulation of Computational Storage Beyond Hardware Limits",
"authors": [
"Qiuyang Zhang",
"Jiapin Wang",
"You Zhou",
"Peng Xu",
"Kai Lu",
"Jiguang Wan",
"Fei Wu",
"Tao Lu"
],
"abstract": "Computational storage drives (CSDs) present a promising approach to improve system performance through near data processing in SSDs. However, current research platforms are fragmented and inadequate to explore the full design space of CSD systems. Existing hardware and emulator platforms are constrained by physical compute resources, while simulators lack full-system fidelity. To address the problems, we introduce CEMU, a new software-based CSD emulation platform that enables full-system research. It consists of a CSD device emulator and a CSD-oriented software stack. Through a novel virtual machine freezing mechanism, CSD emulation achieves high configurability. While the CSD can utilize the host CPU to physically perform computation to preserve full-system behaviors, the computational delay can be modeled separately to emulate CSDs with CPU-unbounded high computing power. The software stack is designed with two principles, adhering to recent industry CSD standards and being compatible with the existing I/O stack, which is achieved via a newly developed file system FDMFS. We verify CEMU's emulation fidelity across a range of applications by benchmarking against actual CSD hardware, demonstrating average end-to-end performance accuracy of 95% or higher. We also use two case studies on large language model training and LevelDB to demonstrate that CEMU is effective in exploring CSD system research and can uncover insights that have not been discovered in previous research platforms.",
"doi": "10.1145/3779212.3790137",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790137",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790137",
"has_full_text": true
},
{
"title": "SG-IOV: Socket-Granular I/O Virtualization for SmartNIC-Based Container Networks",
"authors": [
"Chenxingyu Zhao",
"Hongtao Zhang",
"Jaehong Min",
"Shengkai Lin",
"Wei Zhang",
"Kaiyuan Zhang",
"Ming Liu",
"Arvind Krishnamurthy"
],
"abstract": "I/O Virtualization (IOV) is a cornerstone of cloud computing, with container networking as a critical form of IOV in modern cloud paradigms. While container networks serve as feature-rich infrastructure, they incur a high CPU tax yet leave room for efficiency improvement. A natural idea is to offload container networks onto hardware such as SmartNICs via IOV interfaces. However, existing IOV mechanisms, such as SR-IOV, are misaligned with container requirements: limited device scalability versus high container density, packet-layer abstraction versus application-layer processing demands, and coarse-grained virtualization versus fine-grained container workloads. In this work, we propose Socket-Granular I/O Virtualization (SG-IOV), a new IOV mechanism to offload container networks efficiently. SG-IOV provisions socket-level devices, supports size-varying transformations over a message-stream abstraction, and enables granular virtualization. Realizing this vision is non-trivial: Finer granularity stresses device scalability, while size-varying message processing complicates IO buffer management. Additionally, many fine-grained devices strain hardware virtualization enforcement. To overcome these challenges, the core principle of SG-IOV is that software mediates signals (e.g., descriptors) while accelerators touch the payload. This separation enables several innovations: resource multiplexing to support scalability, adaptive handling of size-varying tasks, and intermediate queuing to enforce virtualization. We prototype SG-IOV on the latest NVIDIA BlueField-3 and deliver an end-to-end container network solution. Our evaluations show that SG-IOV scales beyond 4K devices (10x more than VFs). Compared to Cilium, SG-IOV saves up to 1.9 cores per 10Gbps of traffic. Beyond core savings, SG-IOV achieves 53% higher bandwidth and reduces latency by up to 48%.",
"doi": "10.1145/3779212.3790218",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790218",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790218",
"has_full_text": true
},
{
"title": "Nemo: A Low-Write-Amplification Cache for Tiny Objects on Log-Structured Flash Devices",
"authors": [
"Xufeng Yang",
"Tingting Tan",
"Jingxin Hu",
"Congming Gao",
"Mingyang Liu",
"Tianyang Jiang",
"Jian Chen",
"Linbo Long",
"Yina Lv",
"Jiwu Shu"
],
"abstract": "Modern storage systems predominantly use flash-based SSDs as a cache layer due to their favorable performance and cost efficiency. However, in tiny-object workloads, existing flash cache designs still suffer from high write amplification. Even when deploying advanced log-structured flash devices (e.g., Zoned Namespace SSDs and Flexible Data Placement SSDs) with low device-level write amplification, application-level write amplification still dominates. This work proposes Nemo, which enhances set-associative cache design by increasing hash collision probability to improve set fill rate, thereby reducing application-level write amplification. To satisfy caching requirements, including high memory efficiency and low miss ratio, we introduce a Bloom filter-based indexing mechanism that significantly reduces memory overhead, and adopt a hybrid hotness tracking to achieve low miss ratio without losing memory efficiency. Experimental results show that Nemo simultaneously achieves three key objectives for flash cache: low write amplification, high memory efficiency, and low miss ratio.",
"doi": "10.1145/3779212.3790191",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790191",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790191",
"has_full_text": true
},
{
"title": "iSwitch: QEC on Demand via In-Situ Encoding of Bare Qubits for Ion Trap Architectures",
"authors": [
"Keyi Yin",
"Xiang Fang",
"Zhuo Chen",
"David Hayes",
"Eneet Kaur",
"Reza Nejabati",
"Hartmut Haeffner",
"Wes Campbell",
"Eric Hudson",
"Jens Palsberg",
"Travis Humble",
"Yufei Ding"
],
"abstract": "Recent advances in quantum hardware and error correction have paved the way for early fault-tolerant (EFT) quantum computing. We propose iSwitch, a hybrid system architecture for trapped-ion quantum computers (TIQC) that exploits ultra-high-fidelity single-qubit gates and efficient logical CNOTs enabled by ion shuttling. iSwitch employs bare qubits for single-qubit operations and QEC-encoded logical qubits for two-qubit gates, avoiding full logical encoding, gate synthesis, and magic state distillation. To enable this selective encoding, we develop a low-noise conversion protocol between bare and logical qubits, a hybrid instruction set tailored to 2D TIQC layouts, and a compiler that minimizes conversion overhead and optimizes scheduling. Evaluations on variational quantum algorithm benchmarks show that iSwitch achieves comparable fidelity to conventional QEC methods, while reducing qubit and operation counts by roughly 33\u201350%, offering a practical, resource-efficient path toward EFT quantum computing on trapped-ion platforms.",
"doi": "10.1145/3779212.3790177",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790177",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790177",
"has_full_text": true
},
{
"title": "Neo: Real-Time On-Device 3D Gaussian Splatting with Reuse-and-Update Sorting Acceleration",
"authors": [
"Changhun Oh",
"Seongryong Oh",
"Jinwoo Hwang",
"Yoonsung Kim",
"Hardik Sharma",
"Jongse Park"
],
"abstract": "",
"doi": "10.1145/3779212.3790192",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790192",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790192",
"has_full_text": true
},
{
"title": "CREATE: Cross-Layer Resilience Characterization and Optimization for Efficient yet Reliable Embodied AI Systems",
"authors": [
"Tong Xie",
"Yijiahao Qi",
"Jinqi Wen",
"Zishen Wan",
"Yanchi Dong",
"Zihao Wang",
"Shaofei Cai",
"Yitao Liang",
"Tianyu Jia",
"Yuan Wang",
"Runsheng Wang",
"Meng Li"
],
"abstract": "Embodied Artificial Intelligence (AI) has recently attracted significant attention as it bridges AI with the physical world. Modern embodied AI systems often combine a Large Language Model (LLM)-based planner for high-level task planning and a reinforcement learning (RL)-based controller for low-level action generation, enabling embodied agents to tackle complex tasks in real-world environments. However, deploying embodied agents remains challenging due to their high computation requirements, especially for battery-powered local devices. Although techniques like lowering operating voltage can improve energy efficiency, they can introduce bit errors and result in task failures. In this work, we propose CREATE, a general design principle that leverages heterogeneous resilience at different layers for synergistic energy-reliability co-optimization. For the first time, we conduct a comprehensive error injection study on modern embodied AI systems and observe an inherent but heterogeneous fault tolerance. Building upon these insights, we develop an anomaly detection and clearance mechanism at the circuit level to eliminate outlier errors. At the model level, we propose a weight-rotation-enhanced planning algorithm to improve the fault tolerance of the LLM-based planner. Furthermore, we introduce an application-level technique, autonomy-adaptive voltage scaling, to dynamically adjust the operating voltage of the controllers. The voltage scaling circuit is co-designed to enable online voltage adjustment. Extensive experiments demonstrate that without compromising task quality, CREATE achieves 40.6% computational energy savings on average over nominal-voltage baselines and 35.0% over prior-art techniques. This further leads to 29.5% to 37.3% chip-level energy savings and approximately a 15% to 30% improvement in battery life.",
"doi": "10.1145/3779212.3790147",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790147",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790147",
"has_full_text": true
},
{
"title": "Trust-V: Toward Secure and Reliable Storage for Trusted Execution Environments",
"authors": [
"SeungKyun Han",
"Jiyeon Yang",
"Jinsoo Jang"
],
"abstract": "Trusted execution environments (TEEs) provide strong isolation for security-critical applications (enclaves) and their runtime data from potentially malicious operating systems. While TEEs also support secure storage by sealing data and storing it persistently on media, they do not ensure the integrity of sealed data, leaving it vulnerable to deletion or manipulation by an untrusted OS. In this work, we present Trusted Storage, a runtime storage isolation mechanism for TEEs that guarantees the integrity of TEE persistent data. The core design partitions storage into trusted and untrusted regions and enforces access control by locking MMIO regions and monitoring block I/O. To ensure portability, Trusted Storage requires no hardware modifications or additional hardware. We implemented a prototype, Trust-V, on the RISC-V platform. In particular, to support secure operation on legacy devices where security features are limited, Trust-V introduces a Virtual-M mode, a sandboxed privileged execution environment in machine mode, for securely hosting the monitor. Our evaluation demonstrates that, although Trust-V incurs up to 3.86\u00d7 system overhead, it allows TEE software to reliably store and retrieve persistent data.",
"doi": "10.1145/3779212.3790242",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790242",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790242",
"has_full_text": true
},
{
"title": "A Cost-Effective Near-Storage Processing Solution for Offline Inference of Long-Context LLMs",
"authors": [
"Hongsun Jang",
"Jaeyong Song",
"Changmin Shin",
"Si Ung Noh",
"Jaewon Jung",
"Jisung Park",
"Jinho Lee"
],
"abstract": "The computational and memory demands of large language models for generative inference present significant challenges for practical deployment. One promising solution targeting offline inference is offloading-based batched inference, which extends the GPU's memory hierarchy with host memory and storage. However, it often suffers from substantial I/O overhead, primarily due to the large KV cache sizes that scale with batch size and context window length. In this paper, we introduce HILOS, a framework that boosts offline inference throughput using near-storage processing. The core of HILOS is attention near storage, which offloads memory-intensive attention operations to near-storage accelerators, reducing traffic across the system interconnect. Building on attention near storage, HILOS incorporates three additional optimizations. First, cooperative X-cache minimizes KV cache I/O by exploiting available host resources after offloading. Second, delayed KV cache writeback hides storage write latency and mitigates storage write amplification. Finally, a memory-efficient attention accelerator sustains high throughput for long sequences within the resource constraints of NSP devices. We implemented and evaluated HILOS on a real system equipped with 16 SmartSSDs. Compared to state-of-the-art offloading-based inference frameworks, HILOS achieves up to 7.86x throughput while reducing energy consumption by up to 85%. The source code for HILOS is available at https://github.com/hongsunjang/HILOS.",
"doi": "10.1145/3779212.3790119",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790119",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790119",
"has_full_text": true
},
{
"title": "Hardwired-Neuron Language Processing Units as General-Purpose Cognitive Substrates",
"authors": [
"Yang Liu",
"Yi Chen",
"Yongwei Zhao",
"Yifan Hao",
"Zifu Zheng",
"Weihao Kong",
"Zhangmai Li",
"Dongchen Jiang",
"Ruiyang Xia",
"Zhihong Ma",
"Zisheng Liu",
"Zhaoyong Wan",
"Yunqi Lu",
"Ximing Liu",
"Hongrui Guo",
"Zhihao Yang",
"Zhe Wang",
"Tianrui Ma",
"Mo Zou",
"Rui Zhang",
"Ling Li",
"Xing Hu",
"Zidong Du",
"Zhiwei Xu",
"Qi Guo",
"Tianshi Chen",
"Yunji Chen"
],
"abstract": "The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. A straightforward hardwiring of GPT-OSS-120B would require fabricating photomask sets valued at over 6 billion dollars, rendering this straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15\u00d7 increase in density, and (2) 60 out of 70 photomask layers are homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112\u00d7, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555\u00d7/85\u00d7 that of GPU/WSE), 36 tokens/J (1,047\u00d7/283\u00d7 that of GPU/WSE), 13,232 mm2 total die area, 59.46 M-123.5 M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 41.7-80.4\u00d7 improvement in cost-effectiveness and 357\u00d7 reduction in carbon footprint compared to OpenAI-scale H100 clusters, under an annual weight updating assumption.",
"doi": "10.1145/3779212.3790169",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790169",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790169",
"has_full_text": true
},
{
"title": "Skyler: Static Analysis for Predicting API-Driven Costs in Serverless Applications",
"authors": [
"Bernardo Ribeiro",
"Mafalda Ferreira",
"Jos\u00e9 Fragoso Santos",
"Rodrigo Bruno",
"Nuno Santos"
],
"abstract": "Unpredictable costs are a growing concern in serverless computing, where applications rely on cloud APIs with complex tiered pricing models. In many deployments, API calls dominate expenses, and a single overlooked design choice can escalate costs by thousands of dollars. Existing tools fall short: provider calculators need unrealistic manual estimates, and dynamic profilers only work post-deployment. We present Skyler, a static analysis framework for pre-deployment cost estimation of API invocations in serverless workflows. Skyler models control flow behavior and pricing semantics to construct symbolic cost expressions using SMT formulas, exposing economic sinks, i.e., code paths where API usage disproportionately impacts cost. This enables developers to identify hotspots and prevent costly architectural errors early. Skyler supports JavaScript-based serverless applications across AWS Lambda, Google Cloud Functions, and Azure Functions, achieving high accuracy (mean absolute percentage error <1% for AWS and Google, 4.5% for Azure).",
"doi": "10.1145/3779212.3790221",
"url": "https://dl.acm.org/doi/10.1145/3779212.3790221",
"pdf_url": "https://dl.acm.org/doi/pdf/10.1145/3779212.3790221",
"has_full_text": true
}
]