## Overview

> [!IMPORTANT]
> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and
> insecure. **Never run the RPC server on an open network or in a sensitive environment!**

The `rpc-server` allows exposing `ggml` devices on a remote host.
The RPC backend communicates with one or more instances of `rpc-server` and offloads computations to them.
This can be used for distributed LLM inference with `llama.cpp` in the following way:

```mermaid
flowchart TD
    rpcb<-->|TCP|srva
    rpcb<-->|TCP|srvb
    rpcb<-.->|TCP|srvn
    subgraph hostn[Host N]
    srvn[rpc-server]<-.->dev4["CUDA0"]
    srvn[rpc-server]<-.->dev5["CPU"]
    end
    subgraph hostb[Host B]
    srvb[rpc-server]<-->dev3["Metal"]
    end
    subgraph hosta[Host A]
    srva[rpc-server]<-->dev["CUDA0"]
    srva[rpc-server]<-->dev2["CUDA1"]
    end
    subgraph host[Main Host]
    local["Local devices"]<-->ggml[llama-cli]
    ggml[llama-cli]<-->rpcb[RPC backend]
    end
    style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5
    classDef devcls fill:#5B9BD5
    class local,dev,dev2,dev3,dev4,dev5 devcls
```

By default, `rpc-server` exposes all available accelerator devices on the host.
If there are no accelerators, it exposes a single `CPU` device.

## Usage

### Remote hosts

On each remote host, build the backends for each accelerator by adding `-DGGML_RPC=ON` to the build options.
For example, to build the `rpc-server` with support for CUDA accelerators:

```bash
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
```

When started, the `rpc-server` will detect and expose all available `CUDA` devices:

```bash
$ bin/rpc-server
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Starting RPC server v3.0.0
  endpoint       : 127.0.0.1:50052
  local cache    : n/a
Devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32109 MiB, 31588 MiB free)
```

You can control the set of exposed CUDA devices with the `CUDA_VISIBLE_DEVICES` environment variable or the `--device` command line option. The following two commands have the same effect:
```bash
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
$ bin/rpc-server --device CUDA0 -p 50052
```

### Main host

On the main host, build `llama.cpp` with the backends for the local devices and add `-DGGML_RPC=ON` to the build options.
Finally, when running `llama-cli` or `llama-server`, use the `--rpc` option to specify the host and port of each `rpc-server`:

```bash
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052
```

By default, llama.cpp distributes model weights and the KV cache across all available devices, both local and remote, in proportion to each device's available memory.
You can override this behavior with the `--tensor-split` option and set custom proportions when splitting tensor data across devices.
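For example, a hypothetical invocation with the two RPC servers above that gives one device twice the share of the others might look like this (the ratios and the device order are illustrative; check the device order your build reports at startup):

```shell
# Hypothetical split: assuming the --tensor-split values follow the order in
# which devices are reported, 1,2,1 assigns half of the tensor data to the
# second device and a quarter to each of the others.
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 \
    --rpc 192.168.88.10:50052,192.168.88.11:50052 \
    --tensor-split 1,2,1
```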

### Local cache

The RPC server can use a local cache to store large tensors and avoid transferring them over the network.
This can speed up model loading significantly, especially when using large models.
To enable the cache, use the `-c` option:

```bash
$ bin/rpc-server -c
```

By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory; the location can be changed via the `LLAMA_CACHE` environment variable.
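For instance, to keep the cache on a different filesystem, set `LLAMA_CACHE` before starting the server (the path below is only an illustration):

```shell
# Hypothetical cache location; rpc-server falls back to
# $HOME/.cache/llama.cpp/rpc when LLAMA_CACHE is unset.
export LLAMA_CACHE=/mnt/fast-ssd/llama-rpc-cache
bin/rpc-server -c
```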

### Troubleshooting

Use the `GGML_RPC_DEBUG` environment variable to enable debug messages from `rpc-server`:
```bash
$ GGML_RPC_DEBUG=1 bin/rpc-server
```