Using TEI on AMD Instinct GPUs (ROCm)

Text Embeddings Inference supports AMD Instinct GPUs (MI200, MI300 series) using ROCm.

Prerequisites

  • AMD Instinct GPU (MI200, MI300 series) with ROCm drivers on the host

Option A: Docker (recommended)

The easiest way to run TEI on AMD GPUs is with the pre-built Docker image:

model=BAAI/bge-base-en-v1.5
volume=$PWD/data  # share a volume to avoid re-downloading weights

docker run \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  --ipc=host \
  -p 8080:80 \
  -v $volume:/data \
  --pull always \
  ghcr.io/huggingface/text-embeddings-inference:rocm-latest \
  --model-id $model --dtype bfloat16

Then test it:

curl http://localhost:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?"}'
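
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_embed_request` and `embed` are hypothetical, and the base URL assumes the `-p 8080:80` port mapping from the Docker command above.

```python
import json
import urllib.request


def build_embed_request(text, base_url="http://localhost:8080"):
    """Build the same POST request that the curl command sends to /embed."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, base_url="http://localhost:8080", timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_embed_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `embed("What is Deep Learning?")` is expected to return a JSON array containing one embedding vector per input.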

Option B: Manual setup from source

If you prefer to build from source, use AMD's official ROCm PyTorch image as the base environment.

Step 1: Start the container

docker run -it --device=/dev/kfd --device=/dev/dri \
  --group-add video --shm-size 8g \
  -v $PWD:/workspace \
  rocm/pytorch:latest bash

Inside the container, clone the TEI repository (or mount it via -v) and run the remaining steps from the repo root.

Step 2: Install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

Step 3: Install Python dependencies

PyTorch is already provided by the container image, so install the remaining dependencies without pulling a new torch:

pip install --no-deps -r backends/python/server/requirements-amd.txt
pip install safetensors opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc grpcio-reflection \
    grpc-interceptor einops packaging

Step 4: Generate protobuf stubs

pip install grpcio-tools==1.62.2 mypy-protobuf==3.6.0 types-protobuf

mkdir -p backends/python/server/text_embeddings_server/pb

python -m grpc_tools.protoc \
    -I backends/proto \
    --python_out=backends/python/server/text_embeddings_server/pb \
    --grpc_python_out=backends/python/server/text_embeddings_server/pb \
    --mypy_out=backends/python/server/text_embeddings_server/pb \
    backends/proto/embed.proto

# Fix relative imports in generated files
find backends/python/server/text_embeddings_server/pb/ -name "*.py" \
    -exec sed -i 's/^\(import.*pb2\)/from . \1/g' {} \;

touch backends/python/server/text_embeddings_server/pb/__init__.py
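
To see what the `sed` expression above does, here is a small Python sketch of the same substitution. The sample input is illustrative only (the real generated files are longer), and `fix_relative_imports` is a hypothetical helper name. The rewrite is needed because the generated modules import each other by bare module name, which fails once they live inside the `pb` package.

```python
import re

# Python equivalent of: sed 's/^\(import.*pb2\)/from . \1/g'
PB2_IMPORT = re.compile(r"^(import.*pb2)", flags=re.MULTILINE)


def fix_relative_imports(source):
    """Prefix line-initial `import ..._pb2` statements with `from . `."""
    return PB2_IMPORT.sub(r"from . \1", source)


# Illustrative snippet in the style of a generated *_pb2_grpc.py file.
generated = (
    "import grpc\n"
    "import embed_pb2 as embed__pb2\n"
)
print(fix_relative_imports(generated))
```

Only the line that imports a `pb2` module is rewritten, to `from . import embed_pb2 as embed__pb2`; the `import grpc` line is left unchanged.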

Step 5: Install the Python server package

pip install -e backends/python/server

Step 6: Build the Rust router

cargo build --release \
    --no-default-features \
    --features python,http \
    --bin text-embeddings-router

Step 7: Launch TEI

model=BAAI/bge-base-en-v1.5

./target/release/text-embeddings-router --model-id $model --dtype bfloat16 --port 8080

Once the server is ready, you can test it with a simple embed request:

curl http://localhost:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?"}'
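
Downloading weights and loading the model can take a while, so a script may want to wait before sending the first request. The sketch below polls a health endpoint until it answers with HTTP 200. It assumes the router exposes a `GET /health` route on the port chosen in Step 7; the function names, timeout, and polling interval are illustrative choices, not values taken from TEI.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url="http://localhost:8080/health", timeout_s=300.0,
                     interval_s=2.0, probe=http_probe):
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` returns `True` as soon as the probe succeeds and `False` if the deadline passes first. The `probe` parameter is injectable so the loop can be exercised without a running server.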

Verifying GPU detection

After launch you should see a log line confirming ROCm was detected:

INFO text_embeddings_server::utils::device: ROCm / HIP version: X.Y.Z

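You can also verify from Python. ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API, so the usual CUDA check applies: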

import torch
print(torch.cuda.is_available())  # True
print(torch.version.hip)          # e.g. 6.2.12345-...
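
For use in scripts, the two checks above can be wrapped in a function that also copes with a missing PyTorch install. This is a sketch only: `describe_torch_backend` and the dictionary keys are hypothetical names, and the code assumes that `torch.version.hip` is `None` on builds without ROCm support.

```python
def describe_torch_backend():
    """Summarize which GPU backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return {
            "torch_installed": False,
            "gpu_available": False,
            "hip_version": None,
            "is_rocm_build": False,
        }

    # Assumed to be None on CUDA-only or CPU-only builds.
    hip_version = getattr(torch.version, "hip", None)
    return {
        "torch_installed": True,
        "gpu_available": torch.cuda.is_available(),
        "hip_version": hip_version,
        "is_rocm_build": hip_version is not None,
    }


print(describe_torch_backend())
```

Inside the ROCm PyTorch container with a visible GPU, both `gpu_available` and `is_rocm_build` should be true.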

Notes

ROCm support is a work in progress. Broader model support and optimized operations for AMD GPUs are planned.
