# Deployment Guide

This guide covers all deployment options for the AVP RAG system.

## Recommended: Hugging Face Spaces + GitHub Pages

The production deployment uses fully managed hosting with no tunnels or local servers required:

```
GitHub Pages                              Hugging Face Spaces
huytran088.github.io/avp_rag_system       beefstewbibi-avp-rag-system.hf.space
  React SPA (static)              ──────►   FastAPI + BGE + Anthropic
  auto-deployed via CI/CD                   auto-deployed via CI/CD
```

Both are auto-deployed on every push to `main`. See [One-Time Setup](#one-time-setup) to configure secrets.

---

## One-Time Setup

### GitHub Repo

Go to **Settings → Secrets and variables → Actions**:

| Name | Type | Value |
|---|---|---|
| `HF_TOKEN` | Secret | Hugging Face token with write access to the Space |
| `VITE_API_BASE_URL` | Variable | `https://beefstewbibi-avp-rag-system.hf.space` |

### Hugging Face Space

Go to your Space's **Settings**:

| Name | Type | Value |
|---|---|---|
| `ANTHROPIC_API_KEY` | Secret | Your Anthropic API key |
| `LLM_PROVIDER` | Variable | `anthropic` |
| `CORS_ORIGINS` | Variable | `https://huytran088.github.io` |

After setup, push any commit to `main`; both workflows trigger automatically.

---

## How the CI/CD Works

### Frontend → GitHub Pages (`deploy-gh-pages.yml`)

On every push to `main`:
1. Builds the React frontend with `VITE_BASE_PATH=/avp_rag_system/` and `VITE_API_BASE_URL` baked in
2. Deploys the static build to GitHub Pages via `actions/deploy-pages`

The `VITE_API_BASE_URL` variable is **build-time only**: Vite inlines it into the JS bundle. Changing it requires re-running the deploy workflow.

### Backend → HF Spaces (`sync-hf-spaces.yml`)

On every push to `main`:
1. Copies `hf-space/README.md` (which contains HF Spaces YAML front matter) to `README.md`
2. Force-pushes the entire repo to `huggingface.co/spaces/BeefStewBibi/avp-rag-system`
3. HF Spaces detects the push, builds `Dockerfile`, and restarts the container

The backend-only `Dockerfile` (not `Dockerfile.full`) is used. It skips the Node.js build stage and listens on port 7860 as required by HF Spaces.

### CI (`ci.yml`)

On every push/PR to `main`:
- Backend: `uv run pytest tests/`
- Frontend: `tsc --noEmit` + `vite build`

---

## Alternative: Local Backend + Tunnel

Use this if you want to run a local GPU model (Qwen3 via Ollama or vLLM) and expose it to the internet for the GitHub Pages frontend.

### Step 1: Set Up Local Backend

#### Option A: Ollama (Recommended)

Ollama manages model downloads and GPU inference with zero Docker config. It exposes an OpenAI-compatible API.

```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (RTX 4070 Super, 12 GB VRAM)
ollama pull qwen3:8b      # ~5 GB download, ~6 GB VRAM (quantized)

# Verify
ollama run qwen3:8b "write a hello world function"
```

Configure `.env`. The `vllm` provider settings point at Ollama's OpenAI-compatible endpoint:
```
LLM_PROVIDER=vllm
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
VLLM_API_KEY=ollama
```

Start:
```bash
ollama serve &
uv run uvicorn api.main:app --host 0.0.0.0 --port 8000
```
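
Once both processes are up, one way to confirm that the OpenAI-compatible endpoint is reachable is to list its models. The following is a minimal sketch, assuming the values from the `.env` above and that the server answers `GET <base>/models` in the OpenAI response shape. The helper names `models_url` and `list_models` are illustrative and not part of this repo.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url: str, api_key: str = "ollama", timeout: float = 5.0) -> list[str]:
    """Return the model ids reported by the server at base_url."""
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # OpenAI-style responses wrap the entries in a "data" list.
    return [entry["id"] for entry in payload.get("data", [])]
```

With Ollama running, `list_models("http://localhost:11434/v1")` is expected to include the tag set in `VLLM_MODEL`. A connection error here usually means `ollama serve` is not running or the port differs.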

#### Option B: vLLM via Docker Compose

Requires NVIDIA GPU with 16 GB+ VRAM and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).

```bash
docker compose --profile vllm up --build -d
```

The first run downloads ~16 GB of model weights, which are cached in the `huggingface_cache` volume.

#### Qwen3 Model Sizes

| Model | Ollama tag | VRAM (quantized) | VRAM (full) |
|---|---|---|---|
| Qwen3-4B | `qwen3:4b` | ~4 GB | ~8 GB |
| Qwen3-8B | `qwen3:8b` | ~6 GB | ~16 GB |
| Qwen3-14B | `qwen3:14b` | ~10 GB | ~28 GB |

For RTX 4070 Super (12 GB), `qwen3:8b` via Ollama is the sweet spot.
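
The sizing rule behind that recommendation can be sketched as a small lookup. The figures are the approximate quantized numbers from the table above. The 4 GB default headroom is an assumption, meant to leave room for the KV cache and anything else sharing the GPU, and `pick_model` is a hypothetical helper rather than something the repo provides.

```python
from typing import Optional

# Approximate quantized VRAM needs in GB, copied from the table above.
QUANTIZED_VRAM_GB = {
    "qwen3:4b": 4,
    "qwen3:8b": 6,
    "qwen3:14b": 10,
}


def pick_model(vram_gb: float, headroom_gb: float = 4.0) -> Optional[str]:
    """Return the largest tag whose estimate plus headroom fits in vram_gb."""
    candidates = [
        (needed, tag)
        for tag, needed in QUANTIZED_VRAM_GB.items()
        if needed + headroom_gb <= vram_gb
    ]
    if not candidates:
        return None
    # Tuples compare by the VRAM figure first, so max() picks the largest model.
    return max(candidates)[1]
```

Under these assumptions `pick_model(12)` returns `qwen3:8b`. Lowering `headroom_gb` makes the choice more aggressive and increases the risk of out-of-memory errors under long prompts.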

### Step 2: Expose Backend to the Internet

Your local backend needs a public HTTPS URL so GitHub Pages can reach it.

#### Option A: ngrok (Quickest)

1. Install from [ngrok.com/download](https://ngrok.com/download)
2. `ngrok config add-authtoken <your-token>`
3. `ngrok http 8000`

This gives you a URL like `https://abc123.ngrok-free.app`. Free URLs change on restart; paid plans ($8/mo) give stable URLs.

#### Option B: Cloudflare Tunnel (Free, Stable)

```bash
# Install cloudflared and authenticate
cloudflared tunnel login

# Create and route
cloudflared tunnel create avp-rag
cloudflared tunnel route dns avp-rag api.yourdomain.com
cloudflared tunnel run --url http://localhost:8000 avp-rag
```

### Step 3: Configure CORS and Frontend

Set `CORS_ORIGINS` in `.env`:
```
CORS_ORIGINS=https://huytran088.github.io
```

Restart the backend after changing it.
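
Browsers send the `Origin` header as scheme, host, and optional port only, so the value must be `https://huytran088.github.io` and not the page URL with its `/avp_rag_system/` path. The sketch below normalizes a URL to its origin and checks it against a `CORS_ORIGINS` string. It assumes the variable holds a comma-separated list, which may not match how the backend actually parses it, and both function names are illustrative.

```python
from urllib.parse import urlsplit


def to_origin(url: str) -> str:
    """Reduce a URL to its origin: scheme://host[:port], no path or trailing slash."""
    parts = urlsplit(url.strip())
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return f"{parts.scheme}://{parts.netloc}".lower()


def is_allowed(origin: str, cors_origins: str) -> bool:
    """Check an Origin value against a comma-separated CORS_ORIGINS string (assumed format)."""
    allowed = {to_origin(item) for item in cors_origins.split(",") if item.strip()}
    return to_origin(origin) in allowed
```

Running the frontend URL through `to_origin` before pasting it into `.env` avoids the common mistake of including the path or a trailing slash.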

Update the GitHub repo variable `VITE_API_BASE_URL` to your tunnel URL, then re-run the deploy workflow:

**Actions → Deploy to GitHub Pages → Run workflow**

### Using Anthropic as Fallback

Configure Anthropic Claude as a fallback when your local Ollama/vLLM is unreachable:

```
LLM_PROVIDER=vllm
LLM_FALLBACK_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
VLLM_BASE_URL=http://localhost:11434/v1
VLLM_MODEL=qwen3:8b
```

The system tries the primary provider first and falls back to Anthropic on any error.
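
The primary-then-fallback behavior can be pictured with the sketch below. It is not the repo's implementation. Providers are modeled as plain callables that take a prompt and return text, and `generate_with_fallback` is a hypothetical name.

```python
from typing import Callable, Optional

Generate = Callable[[str], str]


def generate_with_fallback(
    prompt: str, primary: Generate, fallback: Optional[Generate] = None
) -> str:
    """Try the primary provider; on any exception, try the fallback if one is set."""
    try:
        return primary(prompt)
    except Exception as primary_error:
        if fallback is None:
            raise
        try:
            return fallback(prompt)
        except Exception as fallback_error:
            # Chain the errors so both failures appear in the traceback.
            raise fallback_error from primary_error
```

Catching every exception keeps the frontend responsive when the tunnel or local model is down, at the cost of silently routing traffic to the paid API, so it is worth logging each fallback.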

---

## Self-Hosted Docker (Full-Stack)

For teams who want a single-container deployment that serves both frontend and API:

```bash
cp .env.example .env
# Set ANTHROPIC_API_KEY or vLLM env vars in .env
docker compose up --build
```

This uses `Dockerfile.full`, which builds the React frontend in a Node stage and copies the built assets to `static/` in the Python container. The app is served at `http://localhost:8000`.

The `data/` directory is volume-mounted so you can add `.avp` files and re-ingest without rebuilding the image.

---

## Production Checklist

- [ ] HTTPS on the backend (HF Spaces / ngrok / Cloudflare provide this automatically)
- [ ] `CORS_ORIGINS` set to your exact frontend origin (e.g., `https://huytran088.github.io`)
- [ ] `.env` file is **not** committed to git (verify with `git ls-files .env`, which should print nothing)
- [ ] `VITE_API_BASE_URL` set as a GitHub repo **variable** (not a secret, since it is embedded in the built JS)
- [ ] Backend health check passes before directing users to the frontend
- [ ] Rate limits in `api/dependencies.py` tuned for expected traffic (defaults: 10 generate/min, 30 retrieve/min)
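
For the health-check item, a small polling loop is often enough, since a freshly restarted container can take a while to answer. This is a sketch under assumptions: the probe is injected as a callable, `http_ok` treats any 2xx response as healthy, and the retry counts are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures, and timeouts.
        return False


def wait_until_healthy(
    probe: Callable[[], bool], attempts: int = 10, delay_s: float = 3.0
) -> bool:
    """Call probe up to `attempts` times, sleeping delay_s between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A typical call would be `wait_until_healthy(lambda: http_ok(base_url + "/health"))`, where `/health` is a placeholder path. Substitute whichever route the API actually exposes for health checks.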

---

## Troubleshooting

**Frontend renders blank page on GitHub Pages:**
- `BrowserRouter` must use `basename={import.meta.env.BASE_URL}` to match the `/avp_rag_system/` subpath
- Verify `VITE_BASE_PATH=/avp_rag_system/` was set at build time in the deploy workflow

**Frontend loads but API calls fail:**
- Open browser DevTools → Network tab, confirm requests go to the right URL
- Check CORS: the backend's `CORS_ORIGINS` must include your exact frontend origin
- `VITE_API_BASE_URL` is build-time only; changing the GitHub variable requires re-running the deploy workflow

**HF Space build fails:**
- Check `hf-space/README.md` has correct YAML front matter (`sdk: docker`, `app_port: 7860`)
- Verify `HF_TOKEN` secret in GitHub repo has write access to the Space
- Check Space build logs on huggingface.co
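
The front matter check can be done locally before pushing. The sketch below reads only flat `key: value` pairs between the leading `---` lines and does not handle nested YAML. Both function names are illustrative, and the expected values are the two settings named above.

```python
def parse_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading pair of --- lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        # Skip indented (nested) lines and comments; keep top-level pairs only.
        if sep and key.strip() and not line.startswith((" ", "\t", "#")):
            fields[key.strip()] = value.strip().strip("'\"")
    return {}  # no closing --- line was found


def missing_space_settings(text: str) -> list[str]:
    """List required settings that are absent or differ from the expected value."""
    expected = {"sdk": "docker", "app_port": "7860"}
    fields = parse_front_matter(text)
    return [f"{key}: {value}" for key, value in expected.items() if fields.get(key) != value]
```

Passing the contents of `hf-space/README.md` to `missing_space_settings` returns an empty list when both settings are present with the expected values.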

**503 "provider is not configured":**
- `LLM_PROVIDER=anthropic` requires `ANTHROPIC_API_KEY` in HF Space secrets
- `LLM_PROVIDER=vllm` requires `VLLM_BASE_URL` to point to a running server

**Ollama: "model not found":**
- Run `ollama list` to see installed models
- Model names are case-sensitive: `qwen3:8b`, not `Qwen3:8b`

**Ollama: out of memory:**
- Try `ollama pull qwen3:4b` (~4 GB VRAM)
- Check current usage: `nvidia-smi`

**vLLM container keeps restarting:**
- Check logs: `docker compose logs vllm`
- Try `Qwen/Qwen3-4B` or reduce `--max-model-len` in `docker-compose.yml`
- Verify NVIDIA Container Toolkit: `nvidia-smi` on the host

**ngrok URL changed:**
- Update `VITE_API_BASE_URL` in GitHub repo variables
- Re-run the deploy workflow (Actions → Deploy to GitHub Pages → Run workflow)
- Update `CORS_ORIGINS` in `.env` and restart the backend