GB10 Workbench Setup
The Two-Machine Picture
Desktop (Yoneda) GPU Box (atom)
Ubuntu, VS Code ←→ NVIDIA Linux, Docker, vLLM
client server, port 8000
The GPU machine is reachable via mDNS. A short SSH config alias makes it seamless:
Host atom
HostName <hostname>.local
User <username>
ServerAliveInterval 60
So ssh atom just works.
The GPU Machine
Hardware: Gigabyte AI TOP (DGX Spark variant) with NVIDIA Grace-Blackwell Superchip. Unified memory — a single 128 GB pool shared between CPU and GPU. No PCIe bottleneck, no VRAM vs RAM boundaries.
Software stack:
- NVIDIA Linux (custom kernel with Grace-Blackwell drivers)
- Docker for all GPU workloads
- vLLM runs inside Docker with
--gpus all --ipc=host
Pulling a Model
Models land in ~/cache/hf/ and ~/models/. Docker containers mount these paths so models persist across restarts.
# Install HF CLI
curl -LsSf https://hf.co/cli/install.sh | bash
# Authenticate
hf auth login
# Pull a model
hf download Qwen/Qwen2.5-7B-Instruct
Serving with vLLM
Quick one-shot Docker command:
docker run -d --name vllm --gpus all --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 \
-v ~/cache/hf:/root/.cache/huggingface \
-v ~/models:/models \
nvcr.io/nvidia/vllm:26.04-py3 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
--host 0.0.0.0 --port 8000 \
--gpu-memory-utilization 0.85 \
--dtype bfloat16 --max-model-len 8192
Check it's up:
docker logs -f vllm
# Look for: "Uvicorn running on http://0.0.0.0:8000"
Smoke test:
curl -s http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"Qwen/Qwen2.5-7B-Instruct",
"messages":[{"role":"user","content":"Hi"}],
"max_tokens":20}'
Config-Driven Harness
Managing raw Docker flags per model gets tedious. A config-driven harness wraps each model recipe in a .env file:
# Start: one command, any model
./scripts/up.sh qwen2.5-7b
# Stop
./scripts/down.sh
# View logs
./scripts/logs.sh
# Smoke test
./scripts/test.sh
What happens under the hood:
- Pulls the Docker image if not cached
- Starts a container named
vllmwith--gpus all, port8000 - Mounts
~/cache/hfand~/modelsfor model access - Model loads into unified GPU memory
Inference from the Desktop
The GPU machine exposes vLLM on port 8000. Three ways to reach it:
Option A — Direct via mDNS:
curl -s http://<hostname>.local:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "Explain backpropagation in one sentence."}],
"max_tokens": 80
}'
Option B — SSH tunnel:
ssh -N -L 8000:localhost:8000 atom
# Then use localhost:8000
Option C — VS Code Remote-SSH: Port 8000 forwards automatically.
What This Enables
This is the loop for everything that follows: pick a model, serve it, query it, benchmark it. The workbench is now ready for deep-dives into inference optimization, fine-tuning, and model internals.