
Deployment Guide

This guide covers all deployment options for the AVP RAG system.

Recommended: Hugging Face Spaces + GitHub Pages

The production deployment uses fully managed hosting with no tunnels or local servers required:

GitHub Pages                              Hugging Face Spaces
huytran088.github.io/avp_rag_system       beefstewbibi-avp-rag-system.hf.space
  React SPA (static)              ──────►   FastAPI + BGE + Anthropic
  auto-deployed via CI/CD                   auto-deployed via CI/CD

Both are auto-deployed on every push to main. See One-Time Setup to configure secrets.


One-Time Setup

GitHub Repo

Go to Settings → Secrets and variables → Actions:

| Name | Type | Value |
|---|---|---|
| HF_TOKEN | Secret | Hugging Face token with write access to the Space |
| VITE_API_BASE_URL | Variable | https://beefstewbibi-avp-rag-system.hf.space |

Hugging Face Space

Go to your Space's Settings:

| Name | Type | Value |
|---|---|---|
| ANTHROPIC_API_KEY | Secret | Your Anthropic API key |
| LLM_PROVIDER | Variable | anthropic |
| CORS_ORIGINS | Variable | https://huytran088.github.io |

After setup, push any commit to main; both workflows trigger automatically.


How the CI/CD Works

Frontend → GitHub Pages (deploy-gh-pages.yml)

On every push to main:

  1. Builds the React frontend with VITE_BASE_PATH=/avp_rag_system/ and VITE_API_BASE_URL baked in
  2. Deploys the static build to GitHub Pages via actions/deploy-pages

The VITE_API_BASE_URL variable is build-time only: Vite inlines it into the JS bundle, so changing it requires re-running the deploy workflow.
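
Because the URL is inlined at build time, one way to confirm which backend a build points at is to search the built JavaScript for it. The sketch below is a minimal check under assumptions: the helper name bundle_mentions and the frontend/dist path in the usage note are illustrative, not part of this repo.

```python
from pathlib import Path


def bundle_mentions(dist_dir: str, api_base_url: str) -> list[str]:
    """Return the built .js files under dist_dir that contain api_base_url."""
    root = Path(dist_dir)
    if not root.is_dir():
        return []
    hits = []
    for js_file in sorted(root.rglob("*.js")):
        # Bundles are minified text; ignore any bytes that fail to decode.
        text = js_file.read_text(encoding="utf-8", errors="ignore")
        if api_base_url in text:
            hits.append(str(js_file))
    return hits
```

For example, bundle_mentions("frontend/dist", "https://beefstewbibi-avp-rag-system.hf.space") should return at least one file after a correct build (the directory name is an assumption; use your actual build output path). An empty list suggests the variable was missing or different when the build ran.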

Backend → HF Spaces (sync-hf-spaces.yml)

On every push to main:

  1. Copies hf-space/README.md (which contains HF Spaces YAML front matter) to README.md
  2. Force-pushes the entire repo to huggingface.co/spaces/BeefStewBibi/avp-rag-system
  3. HF Spaces detects the push, builds Dockerfile, and restarts the container

The backend-only Dockerfile (not Dockerfile.full) is used. It skips the Node.js build stage and listens on port 7860, the app_port declared in the Space's README front matter.

CI (ci.yml)

On every push/PR to main:

  • Backend: uv run pytest tests/
  • Frontend: tsc --noEmit + vite build

Alternative: Local Backend + Tunnel

Use this if you want to run a local GPU model (Qwen3 via Ollama or vLLM) and expose it to the internet for the GitHub Pages frontend.

Step 1: Set Up Local Backend

Option A: Ollama (Recommended)

Ollama manages model downloads and GPU inference with zero Docker config. It exposes an OpenAI-compatible API.

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (RTX 4070 Super, 12 GB VRAM)
ollama pull qwen3:8b      # ~5 GB download, ~6 GB VRAM (quantized)

# Verify
ollama run qwen3:8b "write a hello world function"

Configure .env (the vllm provider settings are reused because Ollama exposes an OpenAI-compatible API):

LLM_PROVIDER=vllm
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
VLLM_API_KEY=ollama

Start:

ollama serve &
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000

Option B: vLLM via Docker Compose

Requires NVIDIA GPU with 16 GB+ VRAM and the NVIDIA Container Toolkit.

docker compose --profile vllm up --build -d

First run downloads ~16 GB of model weights (cached in huggingface_cache volume).

Qwen3 Model Sizes

| Model | Ollama tag | VRAM (quantized) | VRAM (full) |
|---|---|---|---|
| Qwen3-4B | qwen3:4b | ~4 GB | ~8 GB |
| Qwen3-8B | qwen3:8b | ~6 GB | ~16 GB |
| Qwen3-14B | qwen3:14b | ~10 GB | ~28 GB |

For RTX 4070 Super (12 GB), qwen3:8b via Ollama is the sweet spot.
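
The sizing reasoning can be sketched as a small lookup over the table's approximate figures. The 4 GB headroom (for context/KV cache and other GPU users such as the embedding model) is an assumption chosen for illustration, not a measured value, and largest_model_that_fits is a hypothetical helper.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the table above.
QWEN3_VRAM_GB = {
    "qwen3:4b": {"quantized": 4, "full": 8},
    "qwen3:8b": {"quantized": 6, "full": 16},
    "qwen3:14b": {"quantized": 10, "full": 28},
}


def largest_model_that_fits(
    vram_gb: float,
    precision: str = "quantized",
    headroom_gb: float = 4.0,  # assumed safety margin, tune for your workload
) -> Optional[str]:
    """Return the largest Ollama tag whose VRAM need fits within vram_gb minus headroom."""
    budget = vram_gb - headroom_gb
    fitting = [
        (need[precision], tag)
        for tag, need in QWEN3_VRAM_GB.items()
        if need[precision] <= budget
    ]
    # max() compares by VRAM need first, so the biggest fitting model wins.
    return max(fitting)[1] if fitting else None
```

With these assumptions, a 12 GB card selects qwen3:8b for quantized weights, which matches the recommendation above.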

Step 2: Expose Backend to the Internet

Your local backend needs a public HTTPS URL so GitHub Pages can reach it.

Option A: ngrok (Quickest)

  1. Install from ngrok.com/download
  2. ngrok config add-authtoken <your-token>
  3. ngrok http 8000

This gives you a URL like https://abc123.ngrok-free.app. Free URLs change on restart; paid plans ($8/mo) give stable URLs.

Option B: Cloudflare Tunnel (Free, Stable)

# Install cloudflared and authenticate
cloudflared tunnel login

# Create and route
cloudflared tunnel create avp-rag
cloudflared tunnel route dns avp-rag api.yourdomain.com
cloudflared tunnel run --url http://localhost:8000 avp-rag

Step 3: Configure CORS and Frontend

Set CORS_ORIGINS in .env:

CORS_ORIGINS=https://huytran088.github.io

Restart the backend after changing.
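
An origin is only the scheme, host, and optional port, so the /avp_rag_system/ path must not appear in CORS_ORIGINS. The sketch below illustrates that comparison with the standard library. It assumes CORS_ORIGINS may hold a comma-separated list, which this guide does not confirm, and it is not the backend's actual matching code.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Return scheme://host[:port] for a URL, dropping path, query, and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def is_allowed(page_url: str, cors_origins: str) -> bool:
    """Check a page URL against a CORS_ORIGINS value using exact origin matching."""
    # Assumption: multiple origins are separated by commas.
    allowed = {entry.strip() for entry in cors_origins.split(",") if entry.strip()}
    return origin_of(page_url) in allowed
```

Here is_allowed("https://huytran088.github.io/avp_rag_system/", "https://huytran088.github.io") is true, while a CORS_ORIGINS value that includes the path or uses http:// instead of https:// does not match.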

Update the GitHub repo variable VITE_API_BASE_URL to your tunnel URL, then re-run the deploy workflow:

Actions → Deploy to GitHub Pages → Run workflow

Using Anthropic as Fallback

Configure Anthropic Claude as a fallback when your local Ollama/vLLM is unreachable:

LLM_PROVIDER=vllm
LLM_FALLBACK_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b

The system tries the primary provider first and falls back to Anthropic on any error.
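
The try-primary-then-fallback behavior can be sketched as follows. This illustrates the pattern only, not the project's implementation: generate_with_fallback and the Provider type are hypothetical names, and each provider is modeled as a plain function from prompt to text.

```python
from typing import Callable, Optional

# A provider is modeled as any callable that turns a prompt into text.
Provider = Callable[[str], str]


def generate_with_fallback(
    prompt: str,
    primary: Provider,
    fallback: Optional[Provider] = None,
) -> str:
    """Call the primary provider; on any error, retry once with the fallback."""
    try:
        return primary(prompt)
    except Exception:
        if fallback is None:
            # No fallback configured: surface the primary provider's error.
            raise
        return fallback(prompt)
```

Catching every exception mirrors "falls back on any error", but it also hides configuration mistakes in the primary provider, so logging the original error before falling back is worth considering.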


Self-Hosted Docker (Full-Stack)

For teams who want a single-container deployment that serves both frontend and API:

cp .env.example .env
# Set ANTHROPIC_API_KEY or vLLM env vars in .env
docker compose up --build

This uses Dockerfile.full, which builds the React frontend in a Node stage and copies the built assets to static/ in the Python container. The app is served at http://localhost:8000.

The data/ directory is volume-mounted so you can add .avp files and re-ingest without rebuilding the image.


Production Checklist

  • HTTPS on the backend (HF Spaces / ngrok / Cloudflare provide this automatically)
  • CORS_ORIGINS set to your exact frontend origin (e.g., https://huytran088.github.io)
  • .env file is not committed to git; verify with git status (it should not be staged) or git ls-files .env (it should print nothing)
  • VITE_API_BASE_URL set as a GitHub repo variable (not a secret, since it's embedded in the built JS)
  • Backend health check passes before directing users to the frontend
  • Rate limits in api/dependencies.py tuned for expected traffic (defaults: 10 generate/min, 30 retrieve/min)
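
The health-check item can be automated with a small polling loop. This is a sketch under assumptions: the guide does not name a health endpoint, so the URL passed in is up to you, and wait_until_healthy and http_status are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, etc.


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay_s: float = 3.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns 200, giving up after the given number of attempts."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A call such as wait_until_healthy("https://beefstewbibi-avp-rag-system.hf.space/health") assumes a /health route exists; substitute whichever route your backend actually serves.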

Troubleshooting

Frontend renders blank page on GitHub Pages:

  • BrowserRouter must use basename={import.meta.env.BASE_URL} to match the /avp_rag_system/ subpath
  • Verify VITE_BASE_PATH=/avp_rag_system/ was set at build time in the deploy workflow

Frontend loads but API calls fail:

  • Open browser DevTools → Network tab, confirm requests go to the right URL
  • Check CORS: the backend's CORS_ORIGINS must include your exact frontend origin
  • VITE_API_BASE_URL is build-time only, so changing the GitHub variable requires re-running the deploy workflow

HF Space build fails:

  • Check hf-space/README.md has correct YAML front matter (sdk: docker, app_port: 7860)
  • Verify HF_TOKEN secret in GitHub repo has write access to the Space
  • Check Space build logs on huggingface.co
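
The front matter check can be done locally before pushing. The sketch below reads only flat key: value pairs between the leading --- markers. It is deliberately not a full YAML parser, and missing_space_settings is a hypothetical helper name.

```python
def read_front_matter(readme_text: str) -> dict:
    """Parse flat `key: value` pairs between the leading '---' markers."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Skip comments and indented (nested) lines; this sketch handles flat keys only.
        if ":" in line and not line.startswith((" ", "#")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("'\"")
    return {}  # closing marker never found, so treat the block as invalid


def missing_space_settings(readme_text: str) -> list:
    """List the expected Space settings that are absent or different."""
    expected = {"sdk": "docker", "app_port": "7860"}
    found = read_front_matter(readme_text)
    return [f"{key}: {value}" for key, value in expected.items() if found.get(key) != value]
```

Running missing_space_settings on the contents of hf-space/README.md should return an empty list when both settings are present.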

503 "provider is not configured":

  • LLM_PROVIDER=anthropic requires ANTHROPIC_API_KEY in HF Space secrets
  • LLM_PROVIDER=vllm requires VLLM_BASE_URL to point to a running server
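
The two requirements above can be checked before starting the backend. This sketch encodes only the provider-to-variable pairs named in this guide; missing_settings is a hypothetical helper, and treating the fallback provider the same way as the primary is an assumption.

```python
from typing import Mapping

# Settings each provider needs, as described in this guide.
REQUIRED_BY_PROVIDER = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "vllm": ["VLLM_BASE_URL"],
}


def missing_settings(env: Mapping[str, str]) -> list:
    """Return required variables that are unset or blank for the selected providers."""
    missing = []
    for selector in ("LLM_PROVIDER", "LLM_FALLBACK_PROVIDER"):
        provider = env.get(selector, "").strip()
        # Providers not listed above are outside the scope of this sketch.
        for name in REQUIRED_BY_PROVIDER.get(provider, []):
            if not env.get(name, "").strip() and name not in missing:
                missing.append(name)
    return missing
```

Calling missing_settings(os.environ) (after import os) in the same shell or container as the backend lists what still needs to be set. Note that it cannot tell whether VLLM_BASE_URL points to a server that is actually running.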

Ollama: "model not found":

  • Run ollama list to see installed models
  • Model names are case-sensitive: qwen3:8b, not Qwen3:8b
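
The same check can be scripted against Ollama's /api/tags endpoint, which returns the locally installed models. In this sketch, check_model and fetch_tags are hypothetical helper names, and the base URL assumes Ollama's default port as used elsewhere in this guide.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list:
    """Extract model names from an Ollama /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def check_model(configured: str, tags_payload: dict) -> str:
    """Compare the configured model name with what Ollama reports as installed."""
    names = installed_models(tags_payload)
    if configured in names:
        return "ok"
    # Same name apart from letter case: most likely a typo in VLLM_MODEL.
    near = [name for name in names if name.lower() == configured.lower()]
    if near:
        return f"case mismatch: use {near[0]}"
    return "not installed"


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)
```

With Ollama running, check_model("qwen3:8b", fetch_tags()) reports whether the value configured in VLLM_MODEL is available.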

Ollama: out of memory:

  • Try ollama pull qwen3:4b (~4 GB VRAM)
  • Check current usage: nvidia-smi

vLLM container keeps restarting:

  • Check logs: docker compose logs vllm
  • Try Qwen/Qwen3-4B or reduce --max-model-len in docker-compose.yml
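  • Verify GPU access: nvidia-smi on the host checks the driver; docker run --rm --gpus all ubuntu nvidia-smi checks that the NVIDIA Container Toolkit exposes the GPU to containers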

ngrok URL changed:

  • Update VITE_API_BASE_URL in GitHub repo variables
  • Re-run the deploy workflow (Actions → Deploy to GitHub Pages → Run workflow)
  • Update CORS_ORIGINS in .env and restart the backend