Spaces:

BeefStewBibi
/

avp-rag-system

Sleeping

App Files Files Community

avp-rag-system / docs /docker-deploy.md

BeefStewBibi

docs: update README, CLAUDE.md, and deploy guide for HF Spaces

8656155 29 days ago

preview code

raw

history blame contribute delete

7.96 kB

Deployment Guide

This guide covers all deployment options for the AVP RAG system.

Recommended: Hugging Face Spaces + GitHub Pages

The production deployment uses fully managed hosting with no tunnels or local servers required:

GitHub Pages                              Hugging Face Spaces
huytran088.github.io/avp_rag_system       beefstewbibi-avp-rag-system.hf.space
  React SPA (static)              ──────►   FastAPI + BGE + Anthropic
  auto-deployed via CI/CD                   auto-deployed via CI/CD

Both are auto-deployed on every push to main. See One-Time Setup to configure secrets.

One-Time Setup

GitHub Repo

Go to Settings → Secrets and variables → Actions:

Name	Type	Value
`HF_TOKEN`	Secret	Hugging Face token with write access to the Space
`VITE_API_BASE_URL`	Variable	`https://beefstewbibi-avp-rag-system.hf.space`

Hugging Face Space

Go to your Space's Settings:

Name	Type	Value
`ANTHROPIC_API_KEY`	Secret	Your Anthropic API key
`LLM_PROVIDER`	Variable	`anthropic`
`CORS_ORIGINS`	Variable	`https://huytran088.github.io`

After setup, push any commit to main — both workflows trigger automatically.

How the CI/CD Works

Frontend → GitHub Pages (`deploy-gh-pages.yml`)

On every push to main:

Builds the React frontend with VITE_BASE_PATH=/avp_rag_system/ and VITE_API_BASE_URL baked in
Deploys the static build to GitHub Pages via actions/deploy-pages

The VITE_API_BASE_URL variable is build-time only — Vite inlines it into the JS bundle. Changing it requires re-running the deploy workflow.

Backend → HF Spaces (`sync-hf-spaces.yml`)

On every push to main:

Copies hf-space/README.md (which contains HF Spaces YAML front matter) to README.md
Force-pushes the entire repo to huggingface.co/spaces/BeefStewBibi/avp-rag-system
HF Spaces detects the push, builds Dockerfile, and restarts the container

The backend-only Dockerfile (not Dockerfile.full) is used — it skips the Node.js build stage and listens on port 7860 as required by HF Spaces.

CI (`ci.yml`)

On every push/PR to main:

Backend: uv run pytest tests/
Frontend: tsc --noEmit + vite build

Alternative: Local Backend + Tunnel

Use this if you want to run a local GPU model (Qwen3 via Ollama or vLLM) and expose it to the internet for the GitHub Pages frontend.

Step 1: Set Up Local Backend

Option A: Ollama (Recommended)

Ollama manages model downloads and GPU inference with zero Docker config. It exposes an OpenAI-compatible API.

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (RTX 4070 Super, 12 GB VRAM)
ollama pull qwen3:8b      # ~5 GB download, ~6 GB VRAM (quantized)

# Verify
ollama run qwen3:8b "write a hello world function"

Configure .env:

LLM_PROVIDER=vllm
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
VLLM_API_KEY=ollama

Start:

ollama serve &
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000

Option B: vLLM via Docker Compose

Requires NVIDIA GPU with 16 GB+ VRAM and the NVIDIA Container Toolkit.

docker compose --profile vllm up --build -d

First run downloads ~16 GB of model weights (cached in huggingface_cache volume).

Qwen3 Model Sizes

Model	Ollama tag	VRAM (quantized)	VRAM (full)
Qwen3-4B	`qwen3:4b`	~4 GB	~8 GB
Qwen3-8B	`qwen3:8b`	~6 GB	~16 GB
Qwen3-14B	`qwen3:14b`	~10 GB	~28 GB

For RTX 4070 Super (12 GB), qwen3:8b via Ollama is the sweet spot.

Step 2: Expose Backend to the Internet

Your local backend needs a public HTTPS URL so GitHub Pages can reach it.

Option A: ngrok (Quickest)

Install from ngrok.com/download
ngrok config add-authtoken <your-token>
ngrok http 8000

This gives you a URL like https://abc123.ngrok-free.app. Free URLs change on restart; paid plans ($8/mo) give stable URLs.

Option B: Cloudflare Tunnel (Free, Stable)

# Install cloudflared and authenticate
cloudflared tunnel login

# Create and route
cloudflared tunnel create avp-rag
cloudflared tunnel route dns avp-rag api.yourdomain.com
cloudflared tunnel run --url http://localhost:8000 avp-rag

Step 3: Configure CORS and Frontend

Set CORS_ORIGINS in .env:

CORS_ORIGINS=https://huytran088.github.io

Restart the backend after changing.

Update the GitHub repo variable VITE_API_BASE_URL to your tunnel URL, then re-run the deploy workflow:

Actions → Deploy to GitHub Pages → Run workflow

Using Anthropic as Fallback

Configure Anthropic Claude as a fallback when your local Ollama/vLLM is unreachable:

LLM_PROVIDER=vllm
LLM_FALLBACK_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b

The system tries the primary provider first and falls back to Anthropic on any error.

Self-Hosted Docker (Full-Stack)

For teams who want a single-container deployment that serves both frontend and API:

cp .env.example .env
# Set ANTHROPIC_API_KEY or vLLM env vars in .env
docker compose up --build

Uses Dockerfile.full, which builds the React frontend in a Node stage and copies the built assets to static/ in the Python container. Served at http://localhost:8000.

The data/ directory is volume-mounted so you can add .avp files and re-ingest without rebuilding the image.

Production Checklist

HTTPS on the backend (HF Spaces / ngrok / Cloudflare provide this automatically)
CORS_ORIGINS set to your exact frontend origin (e.g., https://huytran088.github.io)
.env file is not committed to git — verify with git status
VITE_API_BASE_URL set as a GitHub repo variable (not secret — it's embedded in the built JS)
Backend health check passes before directing users to the frontend
Rate limits in api/dependencies.py tuned for expected traffic (defaults: 10 generate/min, 30 retrieve/min)

Troubleshooting

Frontend renders blank page on GitHub Pages:

BrowserRouter must use basename={import.meta.env.BASE_URL} to match the /avp_rag_system/ subpath
Verify VITE_BASE_PATH=/avp_rag_system/ was set at build time in the deploy workflow

Frontend loads but API calls fail:

Open browser DevTools → Network tab, confirm requests go to the right URL
Check CORS: the backend's CORS_ORIGINS must include your exact frontend origin
VITE_API_BASE_URL is build-time only — changing the GitHub variable requires re-running the deploy workflow

HF Space build fails:

Check hf-space/README.md has correct YAML front matter (sdk: docker, app_port: 7860)
Verify HF_TOKEN secret in GitHub repo has write access to the Space
Check Space build logs on huggingface.co

503 "provider is not configured":

LLM_PROVIDER=anthropic requires ANTHROPIC_API_KEY in HF Space secrets
LLM_PROVIDER=vllm requires VLLM_BASE_URL to point to a running server

Ollama: "model not found":

Run ollama list to see installed models
Model names are case-sensitive: qwen3:8b, not Qwen3:8b

Ollama: out of memory:

Try ollama pull qwen3:4b (~4 GB VRAM)
Check current usage: nvidia-smi

vLLM container keeps restarting:

Check logs: docker compose logs vllm
Try Qwen/Qwen3-4B or reduce --max-model-len in docker-compose.yml
Verify NVIDIA Container Toolkit: nvidia-smi on the host

ngrok URL changed:

Update VITE_API_BASE_URL in GitHub repo variables
Re-run the deploy workflow (Actions → Deploy to GitHub Pages → Run workflow)
Update CORS_ORIGINS in .env and restart the backend