GGML-VirtGPU Backend
The GGML-VirtGPU backend enables GGML applications to run machine learning computations on host hardware while the application itself runs inside a virtual machine. It uses host-guest shared memory to efficiently share data buffers between the two sides.
This backend relies on the virtio-gpu device and on the VirglRenderer API Remoting (APIR) component. The backend is split into two libraries:
- a GGML backend implementation (the "remoting frontend"), running in the guest and interacting with the virtgpu device
- a VirglRenderer APIR-compatible library (the "remoting backend"), running in the host and interacting with VirglRenderer and an actual GGML device backend.
OS support
| OS | Status | Backend | CI testing | Notes |
|---|---|---|---|---|
| macOS 14 | Supported | ggml-metal | X | Working when compiled on macOS 14 |
| macOS 15 | Supported | ggml-metal | X | Working when compiled on macOS 14 or macOS 15 |
| macOS 26 | Not tested | | | |
| Linux | Under development | ggml-vulkan | Not working | Working locally; CI running into deadlocks |
Architecture Overview
The GGML-VirtGPU backend consists of three main components:
```mermaid
graph TD
    %% Nodes
    subgraph GuestVM ["Guest VM - Frontend"]
        App([GGML Application<br/>llama.cpp, etc.])
        direction TB
        Interface[GGML Backend Interface]
        Comm["GGML-VirtGPU<br/>(hypercalls + shared mem)"]
        App --> Interface
        Interface --> Comm
    end

    API[virtio-gpu / virglrenderer API]

    subgraph HostSystem [Host System - Backend]
        direction TB
        Dispatcher[GGML-VirtGPU-Backend]
        BackendLib[GGML Backend library<br/>Metal / Vulkan / CPU / ...]
        Dispatcher --> BackendLib
    end

    %% Connections
    Comm --> API
    API --> HostSystem
```
Key Components
- Guest-side Frontend (`ggml-virtgpu/`): Implements the GGML backend interface and forwards operations to the host
- Host-side Backend (`ggml-virtgpu/backend/`): Receives forwarded operations and executes them on actual hardware backends
- Communication Layer: Uses virtio-gpu hypercalls and shared memory for efficient data transfer
Features
- Dynamic backend loading on the host side (CPU, CUDA, Metal, etc.), as sketched after this list
- Zero-copy data transfer via host-guest shared memory pages
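The host-side dispatcher resolves the backend library at runtime. The sketch below shows the general shape of such loading with `dlopen`, assuming the library path comes from `GGML_VIRTGPU_BACKEND_LIBRARY` (see Configuration); the helper name and error handling are illustrative, not the actual implementation.

```c
// Minimal sketch of runtime backend loading on the host side.
// The helper name and error handling are illustrative only.
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void * load_backend_library(void) {
    // GGML_VIRTGPU_BACKEND_LIBRARY points at e.g. libggml-metal or
    // libggml-vulkan (see the Configuration section).
    const char * path = getenv("GGML_VIRTGPU_BACKEND_LIBRARY");
    if (path == NULL) {
        fprintf(stderr, "GGML_VIRTGPU_BACKEND_LIBRARY is not set\n");
        return NULL;
    }
    // RTLD_NOW resolves all symbols up front, so a broken library
    // fails at load time rather than mid-computation.
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "failed to load '%s': %s\n", path, dlerror());
    }
    return handle;
}
```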
Communication Protocol
Hypercalls and Shared Memory
The backend uses two primary communication mechanisms:
- Hypercalls (`DRM_IOCTL_VIRTGPU_EXECBUFFER`): Trigger remote execution from guest to host (sketched below)
- Shared Memory Pages: Zero-copy data transfer for tensors and parameters
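From the guest, a hypercall boils down to an `ioctl` on the virtio-gpu DRM device. The following is a minimal sketch of that path, assuming the command has already been serialized by the APIR layer; the function name is hypothetical and error handling is elided.

```c
// Illustrative guest-side hypercall: submit a serialized APIR command
// to the host through the virtio-gpu DRM device. The function name is
// hypothetical; the real code also manages replies and fencing.
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/virtgpu_drm.h>

static int submit_apir_command(int drm_fd, void * cmd, uint32_t cmd_size) {
    struct drm_virtgpu_execbuffer exec;
    memset(&exec, 0, sizeof(exec));
    exec.command = (uintptr_t) cmd; // pointer to the serialized command
    exec.size    = cmd_size;        // command size, in bytes

    // The host-side dispatcher decodes the command, executes it, and
    // writes its reply into the shared reply buffer.
    return ioctl(drm_fd, DRM_IOCTL_VIRTGPU_EXECBUFFER, &exec);
}
```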
Shared Memory Layout
Each connection uses two fixed shared memory buffers, plus dynamically allocated data buffers:
- Data Buffer (24 MiB): For command/response data and tensor transfers
- Reply Buffer (16 KiB): For command replies and status information
- Data Buffers: Dynamically allocated host-guest shared buffers, served to the application as GGML buffers
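In code, the per-connection layout might look like the following sketch; only the buffer sizes come from the description above, while the type and field names are hypothetical.

```c
// Hypothetical view of the per-connection shared memory regions.
// Only the sizes are taken from the description above.
#include <stdint.h>
#include <stddef.h>

#define APIR_DATA_BUFFER_SIZE  (24u * 1024u * 1024u) // 24 MiB
#define APIR_REPLY_BUFFER_SIZE (16u * 1024u)         // 16 KiB

struct apir_connection {
    uint8_t * data;  // command/response data and tensor transfers
    uint8_t * reply; // command replies and status information
    // GGML buffers are served from additional host-guest shared
    // buffers, allocated dynamically as the application requests them.
};
```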
APIR Protocol
The VirglRenderer API Remoting protocol defines three command types:
- `HANDSHAKE`: Protocol version negotiation and capability discovery
- `LOADLIBRARY`: Dynamic loading of backend libraries on the host
- `FORWARD`: API function call forwarding
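Expressed as a C enum, the command set could look like this; the identifiers and values below are illustrative, and the authoritative definitions live in the VirglRenderer APIR patches.

```c
// Illustrative encoding of the three APIR command types; the actual
// identifiers and values are defined by the VirglRenderer APIR code.
enum apir_command_type {
    APIR_COMMAND_HANDSHAKE   = 0, // version negotiation, capability discovery
    APIR_COMMAND_LOADLIBRARY = 1, // load a GGML backend library on the host
    APIR_COMMAND_FORWARD     = 2, // forward a GGML API function call
};
```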
Binary Serialization
Commands and data are serialized using a custom binary protocol with:
- Fixed-size encoding for basic types
- Variable-length arrays with size prefixes
- Buffer bounds checking
- Error recovery mechanisms
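The sketch below conveys the flavor of such a serializer: fixed-size scalars, size-prefixed arrays, and a bounds check on every write so that an oversized command is caught instead of corrupting the shared buffer. All names are hypothetical.

```c
// Illustrative binary encoder: fixed-size scalars, size-prefixed
// arrays, bounds checking on every write. All names are hypothetical.
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct encoder {
    uint8_t * buf;
    size_t    cap;   // total buffer capacity
    size_t    pos;   // current write offset
    bool      error; // set when a write would overflow the buffer
};

static void enc_write(struct encoder * e, const void * data, size_t size) {
    if (e->error || size > e->cap - e->pos) {
        e->error = true; // overflow: poison the encoder, check once at the end
        return;
    }
    memcpy(e->buf + e->pos, data, size);
    e->pos += size;
}

static void enc_u64(struct encoder * e, uint64_t v) {
    enc_write(e, &v, sizeof v); // fixed-size encoding for basic types
}

static void enc_array(struct encoder * e, const void * data,
                      uint64_t count, size_t elem_size) {
    enc_u64(e, count);                     // size prefix ...
    enc_write(e, data, count * elem_size); // ... then the elements
}
```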
Supported Operations
Device Operations
- Device enumeration and capability queries
- Memory information (total/free)
- Backend type detection
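These queries surface through the standard GGML device API. The sketch below shows how a guest application might list devices and their memory, using the public `ggml-backend.h` entry points; for a virtgpu device, the answers are forwarded from the host.

```c
// Enumerate devices and query their memory through the standard GGML
// API; for a virtgpu device, the answers are forwarded from the host.
#include <stdio.h>
#include "ggml-backend.h"

static void list_devices(void) {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free, total;
        ggml_backend_dev_memory(dev, &free, &total);
        printf("%s: %zu of %zu bytes free\n",
               ggml_backend_dev_name(dev), free, total);
    }
}
```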
Buffer Operations
- Buffer allocation and deallocation
- Tensor data transfer (host ↔ guest), as illustrated after this list
- Memory copying and clearing
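Tensor transfers use the regular GGML backend API, so applications need no virtgpu-specific code; for example (a sketch with a hypothetical helper name):

```c
// Upload tensor data with the standard GGML API; with ggml-virtgpu,
// the copy travels through the shared memory pages described above.
#include "ggml.h"
#include "ggml-backend.h"

static void upload_weights(struct ggml_tensor * t,
                           const float * data, size_t n_floats) {
    // Copy n_floats values into the tensor's buffer, starting at
    // offset 0; the frontend forwards the transfer to the host.
    ggml_backend_tensor_set(t, data, 0, n_floats * sizeof(float));
}
```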
Computation Operations
- Graph execution forwarding
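Graph execution likewise goes through the standard entry point; the frontend serializes the graph and forwards its execution to the host backend. A sketch:

```c
// Run a compute graph with the standard GGML API; with ggml-virtgpu,
// the execution is forwarded to the host backend (Metal, Vulkan, ...).
#include "ggml.h"
#include "ggml-backend.h"

static enum ggml_status run_graph(ggml_backend_t backend,
                                  struct ggml_cgraph * graph) {
    return ggml_backend_graph_compute(backend, graph);
}
```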
Build Requirements
Guest-side Dependencies
- `libdrm` for DRM/virtio-gpu communication
- C++20-compatible compiler
- CMake 3.14+
Host-side Dependencies
- virglrenderer with APIR support (pending upstream review)
- Target backend libraries (libggml-metal, libggml-vulkan, etc.)
Configuration
Environment Variables
- `GGML_VIRTGPU_BACKEND_LIBRARY`: Path to the host-side backend library
- `GGML_VIRTGPU_DEBUG`: Enable debug logging
Build Options
- `GGML_VIRTGPU`: Enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`)
- `GGML_VIRTGPU_BACKEND`: Build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`)
System Requirements
- VM with virtio-gpu support
- VirglRenderer with APIR patches
- Compatible backend libraries on host
Limitations
- VM-specific: Only works in virtual machines with virtio-gpu support
- Host dependency: Requires properly configured host-side backend
- Latency: Each forwarded operation incurs a small VM-exit overhead
- Shared-memory size: With the `libkrun` hypervisor, the addressable RAM + VRAM is limited to 64 GB, so the maximum GPU memory is `64 GB - RAM`, regardless of the hardware VRAM size. For example, a guest with 16 GB of RAM can use at most 48 GB of GPU memory.
This work is pending upstream changes in the VirglRenderer project.
- The backend can be tested with VirglRenderer compiled from source using this merge request: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590
This work is also pending changes in the VMM/hypervisor running the virtual machine, which needs to know how to route the newly introduced APIR capset.
- The environment variable `VIRGL_ROUTE_VENUS_TO_APIR=1` allows using the Venus capset until the relevant hypervisors have been patched. However, setting this flag breaks the normal Vulkan/Venus behavior.
- The environment variable `GGML_REMOTING_USE_APIR_CAPSET` tells the `ggml-virtgpu` backend to use the APIR capset. This will become the default once the relevant hypervisors have been patched.
This work focused on improving the performance of llama.cpp running in macOS containers, and it is mainly tested on that platform. Linux support (via `krun`) is in progress.