# GGML-VirtGPU Backend

The GGML-VirtGPU backend enables GGML applications to run machine
learning computations on host hardware while the application itself
runs inside a virtual machine.  It uses host-guest shared memory to
efficiently share data buffers between the two sides.

This backend relies on virtio-gpu and the VirglRenderer API Remoting
(APIR) component. The backend is split into two libraries:
- a GGML implementation (the "remoting frontend"), running in the
  guest and interacting with the virtgpu device
- a VirglRenderer APIR-compatible library (the "remoting backend"),
  running on the host and interacting with VirglRenderer and an actual
  GGML device backend.

## OS support

| OS       | Status            | Backend     | CI testing  | Notes |
| -------- | ----------------- | ----------- | ----------- | ----- |
| macOS 14 | Supported         | ggml-metal  | X           | Working when compiled on macOS 14 |
| macOS 15 | Supported         | ggml-metal  | X           | Working when compiled on macOS 14 or macOS 15 |
| macOS 26 | Not tested        |             |             | |
| Linux    | Under development | ggml-vulkan | not working | Working locally, CI running into deadlocks |


## Architecture Overview

The GGML-VirtGPU backend consists of three main components:

```mermaid
graph TD
    %% Nodes

    subgraph GuestVM ["Guest VM - Frontend"]
        App([GGML Application<br/>llama.cpp, etc.])

        direction TB
        Interface[GGML Backend Interface]
        Comm["GGML-VirtGPU<br/>(hypercalls + shared mem)"]

        App --> Interface
        Interface --> Comm
    end

    API[virtio-gpu / virglrenderer API]

    subgraph HostSystem [Host System - Backend]
        direction TB
        Dispatcher[GGML-VirtGPU-Backend]
        BackendLib[GGML Backend library<br/>Metal / Vulkan / CPU / ...]

        Dispatcher --> BackendLib
    end

    %% Connections
    Comm --> API
    API --> HostSystem
```

### Key Components

1. **Guest-side Frontend** (`ggml-virtgpu/`): Implements the GGML backend interface and forwards operations to the host
2. **Host-side Backend** (`ggml-virtgpu/backend/`): Receives forwarded operations and executes them on actual hardware backends
3. **Communication Layer**: Uses virtio-gpu hypercalls and shared memory for efficient data transfer

## Features

- **Dynamic backend loading** on the host side (CPU, CUDA, Metal, etc.)
- **Zero-copy data transfer** via host-guest shared memory pages

## Communication Protocol

### Hypercalls and Shared Memory

The backend uses two primary communication mechanisms:

1. **Hypercalls (`DRM_IOCTL_VIRTGPU_EXECBUFFER`)**: Trigger remote execution from guest to host
2. **Shared Memory Pages**: Zero-copy data transfer for tensors and parameters



#### Shared Memory Layout

Each connection uses two fixed-size shared memory buffers, along with
dynamically allocated data buffers:

- **Data Buffer** (24 MiB): For command/response data and tensor transfers
- **Reply Buffer** (16 KiB): For command replies and status information
- **Data Buffers**: Dynamically allocated host-guest shared buffers
  that serve as GGML buffers.
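
To make the fixed sizes above concrete, here is a minimal sketch that expresses them as constants and checks whether a payload fits in the data buffer. The names `kDataBufferSize`, `kReplyBufferSize`, and `fits_in_data_buffer` are hypothetical and are not taken from the backend sources.

```cpp
#include <cstddef>

// Sizes quoted in this document; the constant names are illustrative only.
constexpr std::size_t kDataBufferSize  = 24u * 1024u * 1024u; // 24 MiB
constexpr std::size_t kReplyBufferSize = 16u * 1024u;         // 16 KiB

// Returns true if `payload_size` bytes starting at `offset` lie entirely
// inside the fixed-size data buffer. The subtraction form avoids overflow.
constexpr bool fits_in_data_buffer(std::size_t offset, std::size_t payload_size) {
    return offset <= kDataBufferSize && payload_size <= kDataBufferSize - offset;
}
```

A payload that does not pass such a check would need another path, for example one of the dynamically allocated shared buffers; how the actual backend handles that case is not described here.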



### APIR Protocol

The VirglRenderer API Remoting protocol defines three command types:

- `HANDSHAKE`: Protocol version negotiation and capability discovery
- `LOADLIBRARY`: Dynamic loading of backend libraries on the host
- `FORWARD`: API function call forwarding
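
The three command types listed above could be modeled as a small enumeration. This is only a sketch: the type name `ApirCommand`, the helper `apir_command_name`, and the numeric values are assumptions made for illustration, and the real protocol may encode commands differently.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical numeric values; the actual wire encoding is not specified here.
enum class ApirCommand : std::uint32_t {
    Handshake   = 0, // protocol version negotiation and capability discovery
    LoadLibrary = 1, // dynamic loading of a backend library on the host
    Forward     = 2, // forwarding of an API function call
};

// Maps a command to the name used in this document, e.g. for log messages.
constexpr std::string_view apir_command_name(ApirCommand cmd) {
    switch (cmd) {
        case ApirCommand::Handshake:   return "HANDSHAKE";
        case ApirCommand::LoadLibrary: return "LOADLIBRARY";
        case ApirCommand::Forward:     return "FORWARD";
    }
    return "UNKNOWN"; // value outside the three documented commands
}
```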



### Binary Serialization

Commands and data are serialized using a custom binary protocol with:

- Fixed-size encoding for basic types
- Variable-length arrays with size prefixes
- Buffer bounds checking
- Error recovery mechanisms
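
The properties listed above can be illustrated with a minimal encoder/decoder pair. This is a sketch under stated assumptions, not the backend's actual wire format: the `Encoder` and `Decoder` classes are hypothetical, values are copied in host byte order (both sides are assumed to share it), array sizes are assumed to be 32-bit prefixes, and "error recovery" is reduced to a sticky failure flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical encoder: appends fixed-size values and size-prefixed arrays.
class Encoder {
public:
    void write_u32(std::uint32_t v) { append(&v, sizeof v); }

    // Variable-length array: a u32 byte count followed by the raw bytes.
    void write_bytes(const std::vector<std::uint8_t>& data) {
        assert(data.size() <= UINT32_MAX);
        write_u32(static_cast<std::uint32_t>(data.size()));
        append(data.data(), data.size());
    }

    const std::vector<std::uint8_t>& buffer() const { return buf_; }

private:
    void append(const void* src, std::size_t n) {
        const auto* p = static_cast<const std::uint8_t*>(src);
        buf_.insert(buf_.end(), p, p + n);
    }
    std::vector<std::uint8_t> buf_;
};

// Hypothetical decoder: every read is bounds-checked against the buffer,
// and the first failure makes all later reads fail as well.
class Decoder {
public:
    Decoder(const std::uint8_t* data, std::size_t size) : data_(data), size_(size) {
        assert(data != nullptr || size == 0);
    }

    bool read_u32(std::uint32_t& out) { return read(&out, sizeof out); }

    bool read_bytes(std::vector<std::uint8_t>& out) {
        std::uint32_t n = 0;
        if (!read_u32(n) || n > size_ - pos_) return fail();
        out.assign(data_ + pos_, data_ + pos_ + n);
        pos_ += n;
        return true;
    }

    bool ok() const { return !failed_; }

private:
    bool read(void* dst, std::size_t n) {
        if (failed_ || n > size_ - pos_) return fail();
        std::memcpy(dst, data_ + pos_, n);
        pos_ += n;
        return true;
    }
    bool fail() { failed_ = true; return false; }

    const std::uint8_t* data_;
    std::size_t size_;
    std::size_t pos_ = 0;
    bool failed_ = false;
};
```

With this design a caller can issue several reads in a row and check `ok()` once at the end, because a read past the end of the buffer never touches memory and marks the decoder as failed.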



## Supported Operations

### Device Operations
- Device enumeration and capability queries
- Memory information (total/free)
- Backend type detection

### Buffer Operations
- Buffer allocation and deallocation
- Tensor data transfer (host ↔ guest)
- Memory copying and clearing

### Computation Operations
- Graph execution forwarding



## Build Requirements

### Guest-side Dependencies
- `libdrm` for DRM/virtio-gpu communication
- C++20-compatible compiler
- CMake 3.14+

### Host-side Dependencies
- VirglRenderer with APIR support (pending upstream review)
- Target backend libraries (libggml-metal, libggml-vulkan, etc.)



## Configuration

### Environment Variables

- `GGML_VIRTGPU_BACKEND_LIBRARY`: Path to the host-side backend library
- `GGML_VIRTGPU_DEBUG`: Enable debug logging
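
As an illustration of how a program might read these two variables, here is a minimal sketch. The `VirtgpuEnv` struct and both helper functions are hypothetical, and the rule that any value other than unset, empty, or `0` enables debug logging is an assumption; the backend's actual parsing may differ.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical container for the two documented variables.
struct VirtgpuEnv {
    std::string backend_library; // empty if GGML_VIRTGPU_BACKEND_LIBRARY is unset
    bool debug = false;          // derived from GGML_VIRTGPU_DEBUG
};

// Assumption: unset, empty, or "0" means disabled; anything else means enabled.
constexpr bool flag_enabled(const char* value) {
    return value != nullptr && value[0] != '\0' &&
           !(value[0] == '0' && value[1] == '\0');
}

// Reads both variables from the process environment.
inline VirtgpuEnv read_virtgpu_env() {
    VirtgpuEnv env;
    if (const char* lib = std::getenv("GGML_VIRTGPU_BACKEND_LIBRARY")) {
        env.backend_library = lib;
    }
    env.debug = flag_enabled(std::getenv("GGML_VIRTGPU_DEBUG"));
    return env;
}
```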



### Build Options

- `GGML_VIRTGPU`: Enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`)
- `GGML_VIRTGPU_BACKEND`: Build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`)

### System Requirements

- VM with virtio-gpu support
- VirglRenderer with APIR patches
- Compatible backend libraries on host



## Limitations

- **VM-specific**: Only works in virtual machines with virtio-gpu support
- **Host dependency**: Requires a properly configured host-side backend
- **Latency**: Each operation incurs a small overhead from exiting the VM to reach the host
- **Shared-memory size**: With the `libkrun` hypervisor, the addressable
  memory (RAM + VRAM) is limited to 64 GB, so the maximum GPU memory
  is `64GB - RAM`, regardless of the hardware VRAM size.
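
The shared-memory limit above is simple arithmetic, sketched below. The function name `max_gpu_memory` is hypothetical, and the sketch assumes that "64 GB" means 64 GiB and that the limit applies exactly as stated.

```cpp
#include <cstdint>

constexpr std::uint64_t kGiB = 1024ull * 1024ull * 1024ull;
// Assumption: the 64 GB limit quoted above is interpreted as 64 GiB.
constexpr std::uint64_t kAddressableLimit = 64 * kGiB;

// Maximum GPU memory available to the guest, given the RAM assigned to it.
// Returns 0 if the guest RAM already reaches or exceeds the limit.
constexpr std::uint64_t max_gpu_memory(std::uint64_t guest_ram_bytes) {
    return guest_ram_bytes >= kAddressableLimit
               ? 0
               : kAddressableLimit - guest_ram_bytes;
}
```

For example, a guest configured with 16 GiB of RAM would be left with at most 48 GiB of GPU memory, even if the host GPU has more.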



* This work is pending upstream changes in the VirglRenderer
  project.
  * The backend can be tested with VirglRenderer compiled from source
    using this merge request:
    https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590
* This work is pending changes in the VMM/hypervisor running the
  virtual machine, which needs to know how to route the newly
  introduced APIR capset.
  * The environment variable `VIRGL_ROUTE_VENUS_TO_APIR=1` allows
    using the Venus capset until the relevant hypervisors have been
    patched. However, setting this flag breaks the normal Vulkan/Venus
    behavior.
  * The environment variable `GGML_REMOTING_USE_APIR_CAPSET` tells the
    `ggml-virtgpu` backend to use the APIR capset. This will become
    the default when the relevant hypervisors have been patched.
* This work focused on improving the performance of llama.cpp running
  in macOS containers, and is mainly tested on this platform. Linux
  support (via `krun`) is in progress.



## See Also

- [Development and Testing](VirtGPU/development.md)
- [Backend configuration](VirtGPU/configuration.md)