When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems
Abstract
Hybrid multi-agent systems combining large and small language models offer flexible inference trade-offs, but optimal architecture depends heavily on specific tasks and performance metrics.
The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantial cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled. In the absence of general design principles, hybrid components are not yet the most prevalent choice, and where they are used they are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.
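To make the notion of a Pareto frontier over accuracy, cost, and energy concrete, the following is a minimal sketch of how non-dominated configurations could be identified. The configuration names and all numeric values are made up purely for illustration and are not results from the paper; the functions `dominates` and `pareto_front` are hypothetical helpers, not part of the authors' code.

```python
from typing import Dict, List, Tuple

# (accuracy, monetary cost, edge energy): higher accuracy is better,
# lower cost and lower energy are better.
Point = Tuple[float, float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if configuration a is at least as good as b on every
    metric and strictly better on at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(configs: Dict[str, Point]) -> List[str]:
    """Return the names of configurations not dominated by any other."""
    return [
        name
        for name, point in configs.items()
        if not any(
            dominates(other_point, point)
            for other_name, other_point in configs.items()
            if other_name != name
        )
    ]


# Illustrative, invented numbers only (not measurements from the paper).
example_configs: Dict[str, Point] = {
    "edge_only": (0.40, 0.0, 5.0),
    "cloud_only": (0.80, 10.0, 0.5),
    "hybrid_a": (0.70, 3.0, 4.0),
    "hybrid_b": (0.65, 4.0, 4.5),
}

front = pareto_front(example_configs)
```

In this toy example, `hybrid_b` is dropped because `hybrid_a` is at least as good on all three metrics; the other three configurations each represent a different trade-off and remain on the frontier.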
Community
If you power your agentic system with an edge-device-sized or self-hosted LM, you will usually observe subpar performance; cloud-based frontier models, on the other hand, can deliver satisfactory performance but come with potentially high API costs.
In this paper, we explore how this dilemma can be worked around by putting a multi-agent spin on the idea of Hybrid AI. In our system, an Executor agent running on the device receives periodic assistance from a Supervisor agent running in the cloud. We explore the design space of such a system and make several non-trivial observations: edge-sized Executors can indeed benefit from cloud assistance, outperforming an edge-only setup at lower API cost than a cloud-only setup; the best-performing multi-agent architecture depends on the nature of the task; and our Hybrid MAS is fundamentally different from a routing system.
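The Executor/Supervisor arrangement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the fixed consultation interval `assist_every`, the `run_hybrid` function, the `RunLog` type, and the stub agents are all hypothetical, and the real system may decide when to call the Supervisor quite differently.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical agent signatures: both are plain callables standing in for
# model calls. The Executor would be an on-device SLM, the Supervisor a
# cloud-hosted LLM.
Executor = Callable[[str, List[str], str], str]   # (task, history, hint) -> action
Supervisor = Callable[[str, List[str]], str]      # (task, history) -> hint


@dataclass
class RunLog:
    steps: List[str] = field(default_factory=list)
    cloud_calls: int = 0  # proxy for API cost


def run_hybrid(task: str, executor: Executor, supervisor: Supervisor,
               max_steps: int = 6, assist_every: int = 3) -> RunLog:
    """Run the on-device Executor, consulting the cloud Supervisor every
    `assist_every` steps (an assumed schedule, chosen for simplicity)."""
    log = RunLog()
    hint = ""
    for step in range(max_steps):
        if step % assist_every == 0:
            hint = supervisor(task, log.steps)  # periodic cloud assistance
            log.cloud_calls += 1
        action = executor(task, log.steps, hint)
        log.steps.append(action)
        if action == "DONE":
            break
    return log


# Stub agents so the sketch runs without any model backend.
def stub_executor(task: str, history: List[str], hint: str) -> str:
    return "DONE" if len(history) >= 3 else f"act{len(history)}"


def stub_supervisor(task: str, history: List[str]) -> str:
    return f"hint after {len(history)} steps"


result = run_hybrid("demo task", stub_executor, stub_supervisor)
```

Note that every step here is still executed on-device; the Supervisor only supplies guidance. That is one way to see the difference from a routing system, which would instead send each whole request to either the small or the large model.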