Spaces:
Sleeping
Sleeping
File size: 7,959 Bytes
a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f 8656155 a533d1f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 | # Deployment Guide
This guide covers all deployment options for the AVP RAG system.
## Recommended: Hugging Face Spaces + GitHub Pages
The production deployment uses fully managed hosting with no tunnels or local servers required:
```
GitHub Pages Hugging Face Spaces
huytran088.github.io/avp_rag_system beefstewbibi-avp-rag-system.hf.space
React SPA (static) βββββββΊ FastAPI + BGE + Anthropic
auto-deployed via CI/CD auto-deployed via CI/CD
```
Both are auto-deployed on every push to `main`. See [One-Time Setup](#one-time-setup) to configure secrets.
---
## One-Time Setup
### GitHub Repo
Go to **Settings β Secrets and variables β Actions**:
| Name | Type | Value |
|---|---|---|
| `HF_TOKEN` | Secret | Hugging Face token with write access to the Space |
| `VITE_API_BASE_URL` | Variable | `https://beefstewbibi-avp-rag-system.hf.space` |
### Hugging Face Space
Go to your Space's **Settings**:
| Name | Type | Value |
|---|---|---|
| `ANTHROPIC_API_KEY` | Secret | Your Anthropic API key |
| `LLM_PROVIDER` | Variable | `anthropic` |
| `CORS_ORIGINS` | Variable | `https://huytran088.github.io` |
After setup, push any commit to `main` β both workflows trigger automatically.
---
## How the CI/CD Works
### Frontend β GitHub Pages (`deploy-gh-pages.yml`)
On every push to `main`:
1. Builds the React frontend with `VITE_BASE_PATH=/avp_rag_system/` and `VITE_API_BASE_URL` baked in
2. Deploys the static build to GitHub Pages via `actions/deploy-pages`
The `VITE_API_BASE_URL` variable is **build-time only** β Vite inlines it into the JS bundle. Changing it requires re-running the deploy workflow.
### Backend β HF Spaces (`sync-hf-spaces.yml`)
On every push to `main`:
1. Copies `hf-space/README.md` (which contains HF Spaces YAML front matter) to `README.md`
2. Force-pushes the entire repo to `huggingface.co/spaces/BeefStewBibi/avp-rag-system`
3. HF Spaces detects the push, builds `Dockerfile`, and restarts the container
The backend-only `Dockerfile` (not `Dockerfile.full`) is used β it skips the Node.js build stage and listens on port 7860 as required by HF Spaces.
### CI (`ci.yml`)
On every push/PR to `main`:
- Backend: `uv run pytest tests/`
- Frontend: `tsc --noEmit` + `vite build`
---
## Alternative: Local Backend + Tunnel
Use this if you want to run a local GPU model (Qwen3 via Ollama or vLLM) and expose it to the internet for the GitHub Pages frontend.
### Step 1: Set Up Local Backend
#### Option A: Ollama (Recommended)
Ollama manages model downloads and GPU inference with zero Docker config. It exposes an OpenAI-compatible API.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (RTX 4070 Super, 12 GB VRAM)
ollama pull qwen3:8b # ~5 GB download, ~6 GB VRAM (quantized)
# Verify
ollama run qwen3:8b "write a hello world function"
```
Configure `.env`:
```
LLM_PROVIDER=vllm
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
VLLM_API_KEY=ollama
```
Start:
```bash
ollama serve &
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000
```
#### Option B: vLLM via Docker Compose
Requires NVIDIA GPU with 16 GB+ VRAM and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
```bash
docker compose --profile vllm up --build -d
```
First run downloads ~16 GB of model weights (cached in `huggingface_cache` volume).
#### Qwen3 Model Sizes
| Model | Ollama tag | VRAM (quantized) | VRAM (full) |
|---|---|---|---|
| Qwen3-4B | `qwen3:4b` | ~4 GB | ~8 GB |
| Qwen3-8B | `qwen3:8b` | ~6 GB | ~16 GB |
| Qwen3-14B | `qwen3:14b` | ~10 GB | ~28 GB |
For RTX 4070 Super (12 GB), `qwen3:8b` via Ollama is the sweet spot.
### Step 2: Expose Backend to the Internet
Your local backend needs a public HTTPS URL so GitHub Pages can reach it.
#### Option A: ngrok (Quickest)
1. Install from [ngrok.com/download](https://ngrok.com/download)
2. `ngrok config add-authtoken <your-token>`
3. `ngrok http 8000`
This gives you a URL like `https://abc123.ngrok-free.app`. Free URLs change on restart; paid plans ($8/mo) give stable URLs.
#### Option B: Cloudflare Tunnel (Free, Stable)
```bash
# Install cloudflared and authenticate
cloudflared tunnel login
# Create and route
cloudflared tunnel create avp-rag
cloudflared tunnel route dns avp-rag api.yourdomain.com
cloudflared tunnel run --url http://localhost:8000 avp-rag
```
### Step 3: Configure CORS and Frontend
Set `CORS_ORIGINS` in `.env`:
```
CORS_ORIGINS=https://huytran088.github.io
```
Restart the backend after changing.
Update the GitHub repo variable `VITE_API_BASE_URL` to your tunnel URL, then re-run the deploy workflow:
**Actions β Deploy to GitHub Pages β Run workflow**
### Using Anthropic as Fallback
Configure Anthropic Claude as a fallback when your local Ollama/vLLM is unreachable:
```
LLM_PROVIDER=vllm
LLM_FALLBACK_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
```
The system tries the primary provider first and falls back to Anthropic on any error.
---
## Self-Hosted Docker (Full-Stack)
For teams who want a single-container deployment that serves both frontend and API:
```bash
cp .env.example .env
# Set ANTHROPIC_API_KEY or vLLM env vars in .env
docker compose up --build
```
Uses `Dockerfile.full`, which builds the React frontend in a Node stage and copies the built assets to `static/` in the Python container. Served at `http://localhost:8000`.
The `data/` directory is volume-mounted so you can add `.avp` files and re-ingest without rebuilding the image.
---
## Production Checklist
- [ ] HTTPS on the backend (HF Spaces / ngrok / Cloudflare provide this automatically)
- [ ] `CORS_ORIGINS` set to your exact frontend origin (e.g., `https://huytran088.github.io`)
- [ ] `.env` file is **not** committed to git β verify with `git status`
- [ ] `VITE_API_BASE_URL` set as a GitHub repo **variable** (not secret β it's embedded in the built JS)
- [ ] Backend health check passes before directing users to the frontend
- [ ] Rate limits in `api/dependencies.py` tuned for expected traffic (defaults: 10 generate/min, 30 retrieve/min)
---
## Troubleshooting
**Frontend renders blank page on GitHub Pages:**
- `BrowserRouter` must use `basename={import.meta.env.BASE_URL}` to match the `/avp_rag_system/` subpath
- Verify `VITE_BASE_PATH=/avp_rag_system/` was set at build time in the deploy workflow
**Frontend loads but API calls fail:**
- Open browser DevTools β Network tab, confirm requests go to the right URL
- Check CORS: the backend's `CORS_ORIGINS` must include your exact frontend origin
- `VITE_API_BASE_URL` is build-time only β changing the GitHub variable requires re-running the deploy workflow
**HF Space build fails:**
- Check `hf-space/README.md` has correct YAML front matter (`sdk: docker`, `app_port: 7860`)
- Verify `HF_TOKEN` secret in GitHub repo has write access to the Space
- Check Space build logs on huggingface.co
**503 "provider is not configured":**
- `LLM_PROVIDER=anthropic` requires `ANTHROPIC_API_KEY` in HF Space secrets
- `LLM_PROVIDER=vllm` requires `VLLM_BASE_URL` to point to a running server
**Ollama: "model not found":**
- Run `ollama list` to see installed models
- Model names are case-sensitive: `qwen3:8b`, not `Qwen3:8b`
**Ollama: out of memory:**
- Try `ollama pull qwen3:4b` (~4 GB VRAM)
- Check current usage: `nvidia-smi`
**vLLM container keeps restarting:**
- Check logs: `docker compose logs vllm`
- Try `Qwen/Qwen3-4B` or reduce `--max-model-len` in `docker-compose.yml`
- Verify NVIDIA Container Toolkit: `nvidia-smi` on the host
**ngrok URL changed:**
- Update `VITE_API_BASE_URL` in GitHub repo variables
- Re-run the deploy workflow (Actions β Deploy to GitHub Pages β Run workflow)
- Update `CORS_ORIGINS` in `.env` and restart the backend
|