# Deployment Guide

This guide covers all deployment options for the AVP RAG system.

## Recommended: Hugging Face Spaces + GitHub Pages

The production deployment uses fully managed hosting with no tunnels or local servers required:

```
GitHub Pages                              Hugging Face Spaces
huytran088.github.io/avp_rag_system       beefstewbibi-avp-rag-system.hf.space
  React SPA (static)              ──────►   FastAPI + BGE + Anthropic
  auto-deployed via CI/CD                   auto-deployed via CI/CD
```

Both are auto-deployed on every push to `main`. See [One-Time Setup](#one-time-setup) to configure secrets.

---

## One-Time Setup

### GitHub Repo

Go to **Settings → Secrets and variables → Actions**:

| Name | Type | Value |
|---|---|---|
| `HF_TOKEN` | Secret | Hugging Face token with write access to the Space |
| `VITE_API_BASE_URL` | Variable | `https://beefstewbibi-avp-rag-system.hf.space` |

### Hugging Face Space

Go to your Space's **Settings**:

| Name | Type | Value |
|---|---|---|
| `ANTHROPIC_API_KEY` | Secret | Your Anthropic API key |
| `LLM_PROVIDER` | Variable | `anthropic` |
| `CORS_ORIGINS` | Variable | `https://huytran088.github.io` |

After setup, push any commit to `main` — both workflows trigger automatically.

---

## How the CI/CD Works

### Frontend → GitHub Pages (`deploy-gh-pages.yml`)

On every push to `main`:
1. Builds the React frontend with `VITE_BASE_PATH=/avp_rag_system/` and `VITE_API_BASE_URL` baked in
2. Deploys the static build to GitHub Pages via `actions/deploy-pages`

The `VITE_API_BASE_URL` variable is **build-time only** — Vite inlines it into the JS bundle. Changing it requires re-running the deploy workflow.
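
Because the value is inlined at build time, one way to confirm which backend a given build targets is to search the built JavaScript for the URL. The sketch below assumes Vite's default output layout (a `dist` directory containing `.js` files); the directory path and the helper name `bundle_mentions` are illustrative and not part of this repo.

```python
from pathlib import Path


def bundle_mentions(dist_dir: str, api_url: str) -> list[str]:
    """Return the names of built .js files under dist_dir that contain api_url."""
    hits = []
    for js_file in sorted(Path(dist_dir).rglob("*.js")):
        # Build output is plain text; ignore undecodable bytes just in case.
        text = js_file.read_text(encoding="utf-8", errors="ignore")
        if api_url in text:
            hits.append(js_file.name)
    return hits
```

For example, `bundle_mentions("dist", "https://beefstewbibi-avp-rag-system.hf.space")` returning an empty list after a local build would suggest the variable was not set when the build ran.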

### Backend → HF Spaces (`sync-hf-spaces.yml`)

On every push to `main`:
1. Copies `hf-space/README.md` (which contains HF Spaces YAML front matter) to `README.md`
2. Force-pushes the entire repo to `huggingface.co/spaces/BeefStewBibi/avp-rag-system`
3. HF Spaces detects the push, builds `Dockerfile`, and restarts the container

The backend-only `Dockerfile` (not `Dockerfile.full`) is used — it skips the Node.js build stage and listens on port 7860 as required by HF Spaces.

### CI (`ci.yml`)

On every push/PR to `main`:
- Backend: `uv run pytest tests/`
- Frontend: `tsc --noEmit` + `vite build`

---

## Alternative: Local Backend + Tunnel

Use this if you want to run a local GPU model (Qwen3 via Ollama or vLLM) and expose it to the internet for the GitHub Pages frontend.

### Step 1: Set Up Local Backend

#### Option A: Ollama (Recommended)

Ollama manages model downloads and GPU inference with zero Docker config. It exposes an OpenAI-compatible API.

```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (RTX 4070 Super, 12 GB VRAM)
ollama pull qwen3:8b      # ~5 GB download, ~6 GB VRAM (quantized)

# Verify
ollama run qwen3:8b "write a hello world function"
```

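Configure `.env`. Ollama is addressed through its OpenAI-compatible endpoint, so the provider stays `vllm` and the base URL points at Ollama's default port, 11434: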
```
LLM_PROVIDER=vllm
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
VLLM_API_KEY=ollama
```

Start:
```bash
# Start Ollama (skip this line if it already runs as a background service)
ollama serve &
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000
```
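
Before starting the tunnel, it can help to confirm that the model server answers on the configured base URL. The following is a minimal sketch, assuming the server implements the OpenAI-style `GET /v1/models` endpoint (both Ollama and vLLM expose one). The helper names are hypothetical and not part of this repo.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str, api_key: str = "ollama", timeout: float = 5.0) -> list[str]:
    """Query the server; raises urllib.error.URLError if it is unreachable."""
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return model_ids(json.load(response))
```

With the `.env` values above, `fetch_model_ids("http://localhost:11434/v1")` should include `qwen3:8b` once the model has been pulled.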

#### Option B: vLLM via Docker Compose

Requires NVIDIA GPU with 16 GB+ VRAM and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).

```bash
docker compose --profile vllm up --build -d
```

The first run downloads ~16 GB of model weights, which are cached in the `huggingface_cache` volume.

#### Qwen3 Model Sizes

| Model | Ollama tag | VRAM (quantized) | VRAM (full) |
|---|---|---|---|
| Qwen3-4B | `qwen3:4b` | ~4 GB | ~8 GB |
| Qwen3-8B | `qwen3:8b` | ~6 GB | ~16 GB |
| Qwen3-14B | `qwen3:14b` | ~10 GB | ~28 GB |

For an RTX 4070 Super (12 GB), `qwen3:8b` via Ollama is the best fit: at ~6 GB quantized it leaves headroom on the card, whereas `qwen3:14b` (~10 GB) nearly fills it.
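
The "full" column is consistent with a rule of thumb of 2 bytes per parameter at 16-bit precision. The sketch below reproduces that arithmetic. It assumes the nominal parameter counts implied by the model names and 4-bit quantization for the quantized case, and both are assumptions rather than measured values. It counts weights only, so the quantized figures in the table are higher, presumably because they include KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# Nominal sizes taken from the model names (assumption, not exact counts).
for name, params in [("Qwen3-4B", 4), ("Qwen3-8B", 8), ("Qwen3-14B", 14)]:
    full = weight_memory_gb(params, 16)   # 16-bit weights
    quant = weight_memory_gb(params, 4)   # 4-bit quantized weights
    print(f"{name}: ~{full:.0f} GB at 16-bit, ~{quant:.0f} GB at 4-bit (weights only)")
```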

### Step 2: Expose Backend to the Internet

Your local backend needs a public HTTPS URL so GitHub Pages can reach it.

#### Option A: ngrok (Quickest)

1. Install from [ngrok.com/download](https://ngrok.com/download)
2. `ngrok config add-authtoken <your-token>`
3. `ngrok http 8000`

This gives you a URL like `https://abc123.ngrok-free.app`. Free URLs change on restart; paid plans ($8/mo) give stable URLs.

#### Option B: Cloudflare Tunnel (Free, Stable)

```bash
# Install cloudflared and authenticate
cloudflared tunnel login

# Create and route
cloudflared tunnel create avp-rag
cloudflared tunnel route dns avp-rag api.yourdomain.com
cloudflared tunnel run --url http://localhost:8000 avp-rag
```

### Step 3: Configure CORS and Frontend

Set `CORS_ORIGINS` in `.env`:
```
CORS_ORIGINS=https://huytran088.github.io
```

Restart the backend after changing this value so the new origin takes effect.
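
An origin is the scheme, host, and optional port only; the `/avp_rag_system/` path is not part of it. The sketch below illustrates exact-origin matching under the assumption that `CORS_ORIGINS` is a comma-separated list. How the backend actually parses the variable is not shown in this guide, and the function names here are hypothetical.

```python
from urllib.parse import urlsplit


def to_origin(url: str) -> str:
    """Reduce a URL to its origin: scheme://host[:port], with no path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def parse_origins(raw: str) -> list[str]:
    """Split a comma-separated CORS_ORIGINS value (assumed format) into entries."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def is_allowed(page_url: str, cors_origins: str) -> bool:
    """True if the page's origin exactly matches one configured entry."""
    return to_origin(page_url) in parse_origins(cors_origins)
```

Under exact comparison, a configured value of `https://huytran088.github.io/` (with a trailing slash) would not match the origin a browser sends for the page at `https://huytran088.github.io/avp_rag_system/`.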

Update the GitHub repo variable `VITE_API_BASE_URL` to your tunnel URL, then re-run the deploy workflow:

**Actions → Deploy to GitHub Pages → Run workflow**

### Using Anthropic as Fallback

Configure Anthropic Claude as a fallback when your local Ollama/vLLM is unreachable:

```
LLM_PROVIDER=vllm
LLM_FALLBACK_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
```

The system tries the primary provider first and falls back to Anthropic on any error.
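
The behavior described above can be sketched as a try-primary-then-fallback wrapper. This is an illustration of the pattern, not the project's implementation: the `Provider` signature and the function name are hypothetical.

```python
from typing import Callable

# Hypothetical provider signature: prompt in, completion text out.
Provider = Callable[[str], str]


def generate_with_fallback(prompt: str, primary: Provider, fallback: Provider) -> str:
    """Call the primary provider; on any exception, try the fallback once."""
    try:
        return primary(prompt)
    except Exception as primary_error:
        try:
            return fallback(prompt)
        except Exception as fallback_error:
            # Report both failures so the original cause is not lost.
            raise RuntimeError(
                f"primary failed ({primary_error!r}); fallback failed ({fallback_error!r})"
            ) from fallback_error
```

Falling back on any error keeps the frontend usable when the local GPU machine is off, at the cost of sending those requests to the paid Anthropic API.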

---

## Self-Hosted Docker (Full-Stack)

For teams who want a single-container deployment that serves both frontend and API:

```bash
cp .env.example .env
# Set ANTHROPIC_API_KEY or vLLM env vars in .env
docker compose up --build
```

This uses `Dockerfile.full`, which builds the React frontend in a Node stage and copies the built assets to `static/` in the Python container. The app is served at `http://localhost:8000`.

The `data/` directory is volume-mounted so you can add `.avp` files and re-ingest without rebuilding the image.

---

## Production Checklist

- [ ] HTTPS on the backend (HF Spaces / ngrok / Cloudflare provide this automatically)
- [ ] `CORS_ORIGINS` set to your exact frontend origin (e.g., `https://huytran088.github.io`)
- [ ] `.env` file is **not** committed to git: `git ls-files .env` should print nothing
- [ ] `VITE_API_BASE_URL` set as a GitHub repo **variable** (not secret — it's embedded in the built JS)
- [ ] Backend health check passes before directing users to the frontend
- [ ] Rate limits in `api/dependencies.py` tuned for expected traffic (defaults: 10 generate/min, 30 retrieve/min)
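
To reason about what a per-minute limit means under bursty traffic, a sliding-window limiter can be sketched as follows. This is an illustration only: the contents of `api/dependencies.py` are not shown in this guide, and the real limiter may use a different algorithm or track clients separately. The class name is hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds (single client, in memory)."""

    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the behavior can be tested
        self.calls = deque()    # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False
```

With `SlidingWindowLimiter(10)`, an eleventh call within the same 60 seconds is rejected, and capacity returns as the earliest calls age out.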

---

## Troubleshooting

**Frontend renders blank page on GitHub Pages:**
- `BrowserRouter` must use `basename={import.meta.env.BASE_URL}` to match the `/avp_rag_system/` subpath
- Verify `VITE_BASE_PATH=/avp_rag_system/` was set at build time in the deploy workflow

**Frontend loads but API calls fail:**
- Open browser DevTools → Network tab, confirm requests go to the right URL
- Check CORS: the backend's `CORS_ORIGINS` must include your exact frontend origin
- `VITE_API_BASE_URL` is build-time only — changing the GitHub variable requires re-running the deploy workflow

**HF Space build fails:**
- Check `hf-space/README.md` has correct YAML front matter (`sdk: docker`, `app_port: 7860`)
- Verify `HF_TOKEN` secret in GitHub repo has write access to the Space
- Check Space build logs on huggingface.co
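
The front matter check can be done locally before pushing. The sketch below is a deliberately simple reader for flat `key: value` pairs and not a full YAML parser; the function names are hypothetical, and the expected values are the two named in the list above.

```python
def read_front_matter(readme_text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs from a leading `---` ... `---` block."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        if ":" in line and not line.lstrip().startswith("#"):
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter: treat as missing front matter


def front_matter_problems(readme_text: str) -> list[str]:
    """Compare the front matter against the values this guide expects."""
    expected = {"sdk": "docker", "app_port": "7860"}
    fields = read_front_matter(readme_text)
    return [
        f"{key}: expected {want!r}, found {fields.get(key)!r}"
        for key, want in expected.items()
        if fields.get(key) != want
    ]
```

Running `front_matter_problems` on the text of `hf-space/README.md` should return an empty list when both keys are set as expected.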

**503 "provider is not configured":**
- `LLM_PROVIDER=anthropic` requires `ANTHROPIC_API_KEY` in HF Space secrets
- `LLM_PROVIDER=vllm` requires `VLLM_BASE_URL` to point to a running server
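
The two requirements above can be checked up front with a small helper. This is a sketch under stated assumptions: it covers only the two providers named in this guide, it treats the provider name as case-insensitive, and the function name is hypothetical.

```python
def missing_settings(env: dict[str, str]) -> list[str]:
    """List settings the selected provider needs but `env` lacks or leaves empty."""
    required = {
        "anthropic": ["ANTHROPIC_API_KEY"],
        "vllm": ["VLLM_BASE_URL"],
    }
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in required:
        return [f"LLM_PROVIDER (got {provider!r}, expected one of {sorted(required)})"]
    return [name for name in required[provider] if not env.get(name, "").strip()]
```

Calling `missing_settings(dict(os.environ))` (after `import os`) in the backend's environment returns an empty list when the selected provider has what it needs. It does not check that the vLLM or Ollama server is actually running.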

**Ollama: "model not found":**
- Run `ollama list` to see installed models
- Model names are case-sensitive: `qwen3:8b`, not `Qwen3:8b`

**Ollama: out of memory:**
- Try `ollama pull qwen3:4b` (~4 GB VRAM)
- Check current usage: `nvidia-smi`

**vLLM container keeps restarting:**
- Check logs: `docker compose logs vllm`
- Try `Qwen/Qwen3-4B` or reduce `--max-model-len` in `docker-compose.yml`
- Verify the GPU driver with `nvidia-smi` on the host, and the NVIDIA Container Toolkit with `docker run --rm --gpus all ubuntu nvidia-smi`

**ngrok URL changed:**
- Update `VITE_API_BASE_URL` in GitHub repo variables
- Re-run the deploy workflow (Actions → Deploy to GitHub Pages → Run workflow)
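- `CORS_ORIGINS` does not need to change: it lists the frontend origin (`https://huytran088.github.io`), not the tunnel URL. Confirm it is still set in `.env`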