---
pinned: false
title: sphinx-ai-assistant proxy
emoji: πŸ”
colorFrom: blue
colorTo: green
license: bsd-3-clause
short_description: ai
hf_oauth: true
hf_oauth_scopes:
  - inference-api
sdk: docker
app_port: 7860
---

# sphinx-ai-assistant proxy

Thin OpenAI-compatible reverse proxy for the **sphinx-ai-assistant** Sphinx
extension.  Runs as a free CPU Docker Space on HuggingFace.

## What it does

1. Accepts `POST /v1/chat/completions` from the browser (no auth header).
2. Resolves the upstream backend via env vars (see Configuration below).
3. Injects `Authorization: Bearer $HF_TOKEN` when required.
4. Forwards the request body verbatim to the model backend.
5. Returns the response (JSON or SSE stream) with CORS headers.
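
The backend-resolution and header-injection steps above can be sketched as a pure function. This is a minimal illustration, not the code in `app.py`: `resolve_backend` is a hypothetical name, and the sketch assumes the token is attached only on the HF Serverless path. It returns the upstream base only; any path suffix appended after the model name is not reproduced here.

```python
from typing import Dict, Mapping, Tuple

HF_BASE_DEFAULT = "https://api-inference.huggingface.co/models"
DEFAULT_MODEL_FALLBACK = "openai/gpt-oss-20b"


def resolve_backend(env: Mapping[str, str], body: dict) -> Tuple[str, Dict[str, str]]:
    """Return (upstream_base, headers) for one request (hypothetical helper)."""
    headers = {"Content-Type": "application/json"}

    backend_url = env.get("BACKEND_URL", "").strip()
    if backend_url:
        # Custom backend (Path A or Path C): use the URL as given.
        # Assumption: no Authorization header is injected on this path.
        return backend_url, headers

    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is required when BACKEND_URL is unset")

    # Fall back to DEFAULT_MODEL when the request body omits `model`.
    model = body.get("model") or env.get("DEFAULT_MODEL", DEFAULT_MODEL_FALLBACK)
    base = env.get("HF_BASE", HF_BASE_DEFAULT).rstrip("/")
    headers["Authorization"] = f"Bearer {token}"
    return f"{base}/{model}", headers
```

Keeping this logic free of framework imports makes it easy to unit-test without starting the proxy.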

## Files committed to this Space

| File | Purpose |
|---|---|
| `app.py` | FastAPI proxy application |
| `_shared_logic.py` | Pure helpers imported by `app.py` (no pip deps) |
| `Dockerfile` | Container build; copies both Python files |
| `requirements.txt` | FastAPI, uvicorn, httpx |
| `README.md` | This file (HF Space metadata + documentation) |

> **Important:** `_shared_logic.py` **must** be committed alongside `app.py`.
> The Dockerfile copies it into the container. If it is absent, the container
> will crash on startup with `ModuleNotFoundError: No module named '_shared_logic'`.

## Endpoints

| Method | Path | Purpose | Link |
|---|---|---|---|
| `GET`  | `/`                    | Status page / HF health-check | https://scikit-plots-ai.hf.space |
| `GET`  | `/health`              | Liveness probe | https://scikit-plots-ai.hf.space/health |
| `POST` | `/`                    | Backward-compat alias | https://scikit-plots-ai.hf.space |
| `POST` | `/v1/chat/completions` | Primary proxy endpoint | https://scikit-plots-ai.hf.space/v1/chat/completions |

## Configuration

Set in **Space → Settings → Repository secrets**:

| Variable | Required? | Default | Description |
|---|---|---|---|
| `HF_TOKEN` | Yes (unless `BACKEND_URL` set) | (none) | HuggingFace API token. Requires "Make calls to Inference API" permission. |
| `BACKEND_URL` | No | `""` | Custom backend URL. Bypasses HF Serverless API. Use for DMR (Path A) or ZeroGPU Space (Path C). |
| `DEFAULT_MODEL` | No | `openai/gpt-oss-20b` | Fallback model when the request body omits `model`. Must be served by an HF Inference Provider if `BACKEND_URL` is unset. |
| `HF_BASE` | No | `https://api-inference.huggingface.co/models` | HF API base URL. Only used when `BACKEND_URL` is empty. |
| `PROXY_TIMEOUT` | No | `120` | Upstream read timeout (seconds). Set `600` for ZeroGPU cold start. Non-integer values fall back to `120`. |
| `ALLOWED_ORIGINS` | No | `*` | Comma-separated CORS origins. Production: `https://scikit-plots.github.io` |
| `MAX_BODY_BYTES` | No | `10485760` | Maximum request body size (bytes). Non-integer values fall back to `10485760`. |
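
The fallback rules in the table (non-integer values revert to the default, origins are comma-separated) could be implemented along these lines. This is a sketch under assumptions: `env_int` and `parse_origins` are illustrative names, not necessarily the helpers in `_shared_logic.py`, and the sketch does not reject negative integers.

```python
from typing import List, Mapping


def env_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Read an integer setting; fall back to `default` if unset or non-integer."""
    raw = env.get(name, "")
    try:
        return int(raw.strip())
    except ValueError:
        # Covers "", "abc", "12.5", and similar values.
        return default


def parse_origins(env: Mapping[str, str]) -> List[str]:
    """Split ALLOWED_ORIGINS on commas, dropping blanks; default to ["*"]."""
    raw = env.get("ALLOWED_ORIGINS", "*")
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    return origins or ["*"]
```

In the running proxy the mapping passed in would typically be `os.environ`, for example `env_int(os.environ, "PROXY_TIMEOUT", 120)`.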

## Quick path reference

```
Path B - HF Serverless API (original provider models):
  HF_TOKEN      = hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx
  DEFAULT_MODEL = openai/gpt-oss-20b
  PROXY_TIMEOUT = 120

Path C - ZeroGPU Space (any model weights, free GPU):
  BACKEND_URL   = https://scikit-plots-ai-model.hf.space/v1/chat/completions
  PROXY_TIMEOUT = 600   ← cold start for 20B model takes 2 to 10 minutes

Path A - Local Docker Model Runner (set in your shell, not Space secrets):
  export BACKEND_URL=http://localhost:12434/engines/llama.cpp/v1/chat/completions
```

## Verify the deployment

```bash
# Liveness check
curl https://scikit-plots-ai.hf.space/health
# Expected: {"status":"ok","version":"3.1.0"}

# Status page (shows active backend)
curl https://scikit-plots-ai.hf.space/
# Expected: {"status":"ok","service":"sphinx-ai-assistant proxy v3.1.0",...}

# Test a real completion (Path B; replace with your Space URL)
curl https://scikit-plots-ai.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"hi"}]}'
```
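
The same completion test can be run from Python using only the standard library. The sketch below mirrors the `curl` call above; `build_chat_request` and `send` are illustrative names, and `send` assumes a non-streaming JSON response.

```python
import json
import urllib.request

SPACE_URL = "https://scikit-plots-ai.hf.space"  # replace with your Space URL


def build_chat_request(space_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the proxy's chat-completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=space_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # no auth header needed
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON body (non-streaming responses only)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To try it against a live Space, call `send(build_chat_request(SPACE_URL, "openai/gpt-oss-20b", "hi"))`.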

## Why mirror repos fail with HF Serverless API

`scikit-plots/gpt-oss-20b` and `scikit-plots/Qwen2.5-Coder-32B-Instruct` are
**mirror repositories**: they contain weights copied from the original repos
but are **not registered with any HF Inference Provider**.

- `POST` to `api-inference.huggingface.co/models/scikit-plots/gpt-oss-20b/...` → **404 / 503**
- `POST` to `api-inference.huggingface.co/models/openai/gpt-oss-20b/...` → **✓ works**

Use `DEFAULT_MODEL = openai/gpt-oss-20b` for Path B.
Use Path C (ZeroGPU) to run `scikit-plots/*` weights directly.
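
A deployment script could catch this misconfiguration before any request is sent. The guard below is a hypothetical sketch, not part of the proxy: `pick_path` and `MIRROR_NAMESPACE` are made-up names, and it assumes every repo under `scikit-plots/` is a mirror without an Inference Provider.

```python
MIRROR_NAMESPACE = "scikit-plots/"  # assumption: all repos here are mirrors


def pick_path(model: str, backend_url: str = "") -> str:
    """Return which path a model would use, rejecting mirrors on the serverless path."""
    if backend_url.strip():
        # Path A or Path C: a custom backend serves whatever weights it has loaded.
        return "custom"
    if model.startswith(MIRROR_NAMESPACE):
        raise ValueError(
            f"{model!r} is a mirror repo with no Inference Provider; "
            "set BACKEND_URL (Path C) or use the original model ID"
        )
    # Path B: HF Serverless API with an original provider model.
    return "serverless"
```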

## References

- [FREE_PROXY_SOLUTIONS.md](./FREE_PROXY_SOLUTIONS.md): full path decision tree
- [HuggingFace Inference API](https://huggingface.co/docs/api-inference/)
- [ZeroGPU documentation](https://huggingface.co/docs/hub/spaces-zerogpu)