	## Overview

	> [!IMPORTANT]
	> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and
	> insecure. Never run the RPC server on an open network or in a sensitive environment!

	The `rpc-server` allows exposing `ggml` devices on a remote host.
	The RPC backend communicates with one or several instances of `rpc-server` and offloads computations to them.
	This can be used for distributed LLM inference with `llama.cpp` in the following way:

	```mermaid
	flowchart TD
	rpcb<-->|TCP|srva
	rpcb<-->|TCP|srvb
	rpcb<-.->|TCP|srvn
	subgraph hostn[Host N]
	srvn[rpc-server]<-.->dev4["CUDA0"]
	srvn[rpc-server]<-.->dev5["CPU"]
	end
	subgraph hostb[Host B]
	srvb[rpc-server]<-->dev3["Metal"]
	end
	subgraph hosta[Host A]
	srva[rpc-server]<-->dev["CUDA0"]
	srva[rpc-server]<-->dev2["CUDA1"]
	end
	subgraph host[Main Host]
	local["Local devices"]<-->ggml[llama-cli]
	ggml[llama-cli]<-->rpcb[RPC backend]
	end
	style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5
	classDef devcls fill:#5B9BD5
	class local,dev,dev2,dev3,dev4,dev5 devcls
	```

	By default, `rpc-server` exposes all available accelerator devices on the host.
	If there are no accelerators, it exposes a single `CPU` device.

	## Usage

	### Remote hosts

	On each remote host, build `llama.cpp` with the backend for each accelerator on that host and add `-DGGML_RPC=ON` to the build options.
	For example, to build the `rpc-server` with support for CUDA accelerators:

	```bash
	mkdir build-rpc-cuda
	cd build-rpc-cuda
	cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
	cmake --build . --config Release
	```

	When started, the `rpc-server` will detect and expose all available `CUDA` devices:

	```bash
	$ bin/rpc-server
	ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
	ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
	ggml_cuda_init: found 1 CUDA devices:
	Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
	Starting RPC server v3.0.0
	endpoint : 127.0.0.1:50052
	local cache : n/a
	Devices:
	CUDA0: NVIDIA GeForce RTX 5090 (32109 MiB, 31588 MiB free)
	```

	You can control the set of exposed CUDA devices with the `CUDA_VISIBLE_DEVICES` environment variable or the `--device` command line option. The following two commands have the same effect:
	```bash
	$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
	$ bin/rpc-server --device CUDA0 -p 50052
	```

	### Main host

	On the main host, build `llama.cpp` with the backends for the local devices and add `-DGGML_RPC=ON` to the build options.
	Then, when running `llama-cli` or `llama-server`, use the `--rpc` option to specify the host and port of each `rpc-server` as a comma-separated list:

	```bash
	$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052
	```
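With more than a couple of remote hosts, typing the comma-separated endpoint list by hand gets error-prone. The following is a minimal sketch that joins a list of endpoints into the value expected by `--rpc`. The `join_endpoints` helper and the `RPC_ENDPOINTS` variable are hypothetical names used for illustration and are not part of `llama.cpp`; the addresses are the ones from the example above.

```bash
# Hypothetical helper (not part of llama.cpp): join all arguments with commas.
join_endpoints() {
    local IFS=,
    printf '%s\n' "$*"
}

# Example endpoints; replace with the host:port of each rpc-server.
RPC_ENDPOINTS=$(join_endpoints 192.168.88.10:50052 192.168.88.11:50052)
echo "--rpc $RPC_ENDPOINTS"
```

The result can then be passed along, for example as `llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc "$RPC_ENDPOINTS"`.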

	By default, `llama.cpp` distributes model weights and the KV cache across all available devices, both local and remote, in proportion to each device's available memory.
	You can override this behavior with the `--tensor-split` option and set custom proportions when splitting tensor data across devices.
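As a rough sketch of how custom proportions might be derived, the snippet below turns per-device memory budgets into a comma-separated list of fractions. The `tensor_split` helper and the budget values are assumptions made for illustration, and the argument order is assumed to match the order in which `llama.cpp` lists its devices at startup, so verify that order before relying on the result.

```bash
# Hypothetical helper (not part of llama.cpp): convert per-device memory
# budgets (any consistent unit, e.g. MiB) into comma-separated fractions.
tensor_split() {
    printf '%s\n' "$@" | awk '
        { mem[NR] = $1; total += $1 }
        END {
            if (total == 0) exit 1   # avoid dividing by zero
            for (i = 1; i <= NR; i++)
                printf "%s%.2f", (i > 1 ? "," : ""), mem[i] / total
            printf "\n"
        }'
}

# Illustrative budgets: 24000 MiB on the first device, 8000 MiB on the second.
tensor_split 24000 8000   # prints 0.75,0.25
```

The output could then be supplied as `--tensor-split 0.75,0.25` alongside the `--rpc` option.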

	### Local cache

	The RPC server can use a local cache to store large tensors and avoid transferring them over the network.
	This can speed up model loading significantly, especially when using large models.
	To enable the cache, use the `-c` option:

	```bash
	$ bin/rpc-server -c
	```

	By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory; the location can be changed with the `LLAMA_CACHE` environment variable.

	### Troubleshooting

	Use the `GGML_RPC_DEBUG` environment variable to enable debug messages from `rpc-server`:
	```bash
	$ GGML_RPC_DEBUG=1 bin/rpc-server
	```
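Before digging into debug output, it can help to confirm from the main host that the endpoint accepts TCP connections at all. The following is a generic sketch, not an `rpc-server` feature: the `check_rpc_endpoint` helper is a hypothetical name, and it assumes `bash` (built with `/dev/tcp` redirection support) and the coreutils `timeout` command are available.

```bash
# Hypothetical helper (not part of llama.cpp).
# $1 = host, $2 = port; succeeds only if a TCP connection can be opened
# within 3 seconds.
check_rpc_endpoint() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example: probe the default endpoint shown in the rpc-server startup log.
if check_rpc_endpoint 127.0.0.1 50052; then
    echo "endpoint reachable"
else
    echo "endpoint unreachable"
fi
```

A failed check points to a network or bind-address problem rather than to the RPC backend itself.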