# GGML-VirtGPU Backend

The GGML-VirtGPU backend enables GGML applications to run machine
learning computations on host hardware while the application itself
runs inside a virtual machine. It uses host-guest shared memory to
share data buffers efficiently between the two sides.

This backend relies on the virtio-gpu device and the VirglRenderer API
Remoting (APIR) component. The backend is split into two libraries:

- a GGML backend implementation (the "remoting frontend"), running in the
  guest and interacting with the virtio-gpu device
- a VirglRenderer APIR-compatible library (the "remoting backend"),
  running in the host and interacting with VirglRenderer and an actual
  GGML device backend
## OS support

| OS       | Status            | Backend     | CI testing  | Notes |
| -------- | ----------------- | ----------- | ----------- | ----- |
| macOS 14 | Supported         | ggml-metal  | X           | Working when compiled on macOS 14 |
| macOS 15 | Supported         | ggml-metal  | X           | Working when compiled on macOS 14 or macOS 15 |
| macOS 26 | Not tested        |             |             | |
| Linux    | Under development | ggml-vulkan | not working | Working locally; CI runs into deadlocks |
## Architecture Overview

The GGML-VirtGPU backend consists of three main components:

```mermaid
graph TD
    %% Nodes
    subgraph GuestVM ["Guest VM - Frontend"]
        App([GGML Application<br/>llama.cpp, etc.])
        direction TB
        Interface[GGML Backend Interface]
        Comm["GGML-VirtGPU<br/>(hypercalls + shared mem)"]

        App --> Interface
        Interface --> Comm
    end

    API[virtio-gpu / virglrenderer API]

    subgraph HostSystem [Host System - Backend]
        direction TB
        Dispatcher[GGML-VirtGPU-Backend]
        BackendLib[GGML Backend library<br/>Metal / Vulkan / CPU / ...]

        Dispatcher --> BackendLib
    end

    %% Connections
    Comm --> API
    API --> HostSystem
```
### Key Components

1. **Guest-side Frontend** (`ggml-virtgpu/`): implements the GGML backend interface and forwards operations to the host
2. **Host-side Backend** (`ggml-virtgpu/backend/`): receives forwarded operations and executes them on actual hardware backends
3. **Communication Layer**: uses virtio-gpu hypercalls and shared memory for efficient data transfer
## Features

- **Dynamic backend loading** on the host side (CPU, CUDA, Metal, etc.)
- **Zero-copy data transfer** via host-guest shared memory pages
## Communication Protocol

### Hypercalls and Shared Memory

The backend uses two primary communication mechanisms:

1. **Hypercalls (`DRM_IOCTL_VIRTGPU_EXECBUFFER`)**: trigger remote execution from guest to host
2. **Shared Memory Pages**: zero-copy data transfer for tensors and parameters

#### Shared Memory Layout

Each connection uses two fixed shared-memory buffers, plus dynamically
allocated buffers for tensor data:

- **Command Buffer** (24 MiB): command/response data and tensor transfers
- **Reply Buffer** (16 KiB): command replies and status information
- **Data Buffers**: dynamically allocated host-guest shared buffers,
  exposed to the application as GGML buffers
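As a rough illustration, a command exchange over the two fixed buffers could look like the sketch below. The buffer sizes come from the layout above; the command header (command id, flags, payload size) and the helper names are hypothetical, not the backend's actual wire format.

```python
import struct

# Sizes from the shared-memory layout above.
DATA_BUF_SIZE = 24 * 1024 * 1024   # command/response data and tensor bytes
REPLY_BUF_SIZE = 16 * 1024         # command replies and status

# Hypothetical command header: command id, flags, payload size.
# The real wire format is internal to the backend.
HDR = struct.Struct("<III")

def write_command(data_buf: bytearray, cmd_id: int, payload: bytes) -> int:
    """Pack a header + payload into the data buffer; return bytes written."""
    total = HDR.size + len(payload)
    if total > len(data_buf):
        raise ValueError("command does not fit in the data buffer")
    HDR.pack_into(data_buf, 0, cmd_id, 0, len(payload))
    data_buf[HDR.size:total] = payload
    return total

def read_reply_status(reply_buf: bytes) -> int:
    """Read a 32-bit status code from the start of the reply buffer."""
    (status,) = struct.unpack_from("<I", reply_buf, 0)
    return status

data_buf = bytearray(DATA_BUF_SIZE)
written = write_command(data_buf, 3, b"tensor-bytes")  # 12-byte header + 12-byte payload
```

The guest would fill the data buffer, issue the hypercall, and then read the status from the reply buffer once the host signals completion.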
### APIR Protocol

The VirglRenderer API Remoting protocol defines three command types:

- `HANDSHAKE`: protocol version negotiation and capability discovery
- `LOADLIBRARY`: dynamic loading of backend libraries on the host
- `FORWARD`: API function call forwarding
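For illustration, the three command types and a `HANDSHAKE`-style version negotiation might be modeled as below. The numeric command values and the negotiate-to-the-minimum rule are assumptions made for this sketch; the real values and semantics are defined by the APIR protocol headers.

```python
from enum import IntEnum

class ApirCommand(IntEnum):
    # Numeric values are illustrative; the APIR headers define the real ones.
    HANDSHAKE = 0
    LOADLIBRARY = 1
    FORWARD = 2

def negotiate_version(guest_version: int, host_version: int) -> int:
    """HANDSHAKE sketch: agree on the highest mutually supported version."""
    return min(guest_version, host_version)

def describe(cmd: ApirCommand) -> str:
    """Map each command type to its role in the protocol."""
    return {
        ApirCommand.HANDSHAKE: "version negotiation and capability discovery",
        ApirCommand.LOADLIBRARY: "load a backend library on the host",
        ApirCommand.FORWARD: "forward an API function call",
    }[cmd]
```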
### Binary Serialization

Commands and data are serialized using a custom binary protocol with:

- Fixed-size encoding for basic types
- Variable-length arrays with size prefixes
- Buffer bounds checking
- Error recovery mechanisms
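The size-prefixed array encoding with bounds checking can be sketched as follows; the 32-bit little-endian prefix is an assumption chosen for the example, not a statement about the protocol's exact field widths.

```python
import struct

def encode_array(items: bytes) -> bytes:
    """Variable-length array: 32-bit little-endian size prefix, then the bytes."""
    return struct.pack("<I", len(items)) + items

def decode_array(buf: bytes, offset: int = 0) -> tuple[bytes, int]:
    """Bounds-checked decode: validate the prefix before slicing.

    Returns the decoded bytes and the offset just past the array.
    """
    if offset + 4 > len(buf):
        raise ValueError("truncated size prefix")
    (size,) = struct.unpack_from("<I", buf, offset)
    end = offset + 4 + size
    if end > len(buf):
        raise ValueError("size prefix exceeds buffer bounds")
    return bytes(buf[offset + 4:end]), end

# Round trip: two arrays packed back to back.
wire = encode_array(b"ggml") + encode_array(b"")
first, off = decode_array(wire)        # (b"ggml", 8)
second, off = decode_array(wire, off)  # (b"", 12)
```

Checking the prefix against the buffer length before slicing is what keeps a corrupted or malicious size field from reading past the shared buffer.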
## Supported Operations

### Device Operations

- Device enumeration and capability queries
- Memory information (total/free)
- Backend type detection

### Buffer Operations

- Buffer allocation and deallocation
- Tensor data transfer (host ↔ guest)
- Memory copying and clearing

### Computation Operations

- Graph execution forwarding
## Build Requirements

### Guest-side Dependencies

- `libdrm` for DRM/virtio-gpu communication
- C++20 compatible compiler
- CMake 3.14+

### Host-side Dependencies

- virglrenderer with APIR support (pending upstream review)
- Target backend libraries (libggml-metal, libggml-vulkan, etc.)
## Configuration

### Environment Variables

- `GGML_VIRTGPU_BACKEND_LIBRARY`: path to the host-side backend library
- `GGML_VIRTGPU_DEBUG`: enable debug logging

### Build Options

- `GGML_VIRTGPU`: enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`)
- `GGML_VIRTGPU_BACKEND`: build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`)

### System Requirements

- VM with virtio-gpu support
- VirglRenderer with APIR patches
- Compatible backend libraries on the host
## Limitations

- **VM-specific**: only works in virtual machines with virtio-gpu support
- **Host dependency**: requires a properly configured host-side backend
- **Latency**: small overhead from escaping the VM for each operation
- **Shared-memory size**: with the `libkrun` hypervisor, the RAM + VRAM
  addressable memory is limited to 64 GB, so the maximum GPU memory
  is `64 GB - RAM`, regardless of the hardware VRAM size.
- This work is pending upstream changes in the VirglRenderer project.
  - The backend can be tested with virglrenderer compiled from source
    using this merge request:
    https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590
- This work is pending changes in the VMM/hypervisor running the
  virtual machine, which needs to know how to route the newly
  introduced APIR capset.
  - The environment variable `VIRGL_ROUTE_VENUS_TO_APIR=1` allows
    using the Venus capset until the relevant hypervisors have been
    patched. However, setting this flag breaks the normal Vulkan/Venus
    behavior.
  - The environment variable `GGML_REMOTING_USE_APIR_CAPSET` tells the
    `ggml-virtgpu` backend to use the APIR capset. This will become
    the default once the relevant hypervisors have been patched.
- This work focused on improving the performance of llama.cpp running
  in macOS containers, and is mainly tested on that platform. Linux
  support (via `krun`) is in progress.
## See Also

- [Development and Testing](VirtGPU/development.md)
- [Backend configuration](VirtGPU/configuration.md)