---
pinned: false
title: scikit-plots AI Model Endpoint
emoji: 🤖
colorFrom: blue
colorTo: green
license: bsd-3-clause
short_description: ai-model
hf_oauth: true
hf_oauth_scopes:
  - inference-api
sdk: gradio
sdk_version: 6.15.2
python_version: '3.13'
app_file: app.py
---

# scikit-plots AI Model Endpoint

A ZeroGPU Space that serves scikit-plots model weights through an
OpenAI-compatible REST endpoint. The proxy Space (`scikit-plots/ai`)
calls it via its `BACKEND_URL` environment variable.

## Primary endpoint

```
POST /v1/chat/completions
```

Request body (OpenAI Chat Completions format):

```json
{
  "model": "scikit-plots/Qwen2.5-Coder-7B-Instruct",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 512
}
```
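
A minimal client sketch, using only the Python standard library, of how such a request might be built and its reply read. The URL is the one given in the "Wire the proxy Space" section; the helper names `build_chat_request` and `extract_reply` are hypothetical, and the reply shape (`choices[0].message.content`) is an assumption based on the OpenAI Chat Completions format. Authentication, if the Space requires any, is not handled here.

```python
import json
import urllib.request

# Assumption: the endpoint URL shown in the "Wire the proxy Space" section.
ENDPOINT = "https://scikit-plots-ai-model.hf.space/v1/chat/completions"


def build_chat_request(prompt, model="scikit-plots/Qwen2.5-Coder-7B-Instruct",
                       max_tokens=512, url=ENDPOINT):
    """Build (but do not send) a POST request in Chat Completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style response body (assumed shape)."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]
```

To send the request, pass the result of `build_chat_request("Hello")` to `urllib.request.urlopen` with a generous `timeout` (for example 600 seconds, to allow for a cold start) and feed the response bytes to `extract_reply`.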

## Other endpoints

| Method | Path | Purpose |
|---|---|---|
| `GET` | `/` | Status page |
| `GET` | `/health` | Liveness probe (model loaded check) |
| `GET` | `/ui` | Gradio test UI |
| `POST` | `/v1/chat/completions` | Primary inference endpoint |

## ⚠️ Cold start

The **first request** in a new ZeroGPU session triggers:

1. GPU allocation from the shared pool (1–5 minutes)
2. Model loading from storage to GPU VRAM (3–8 minutes for a model needing about 14 GB of memory, under the 16 GiB hard limit)

**Total cold start: 2–10 minutes for a model of this size.**

Set `PROXY_TIMEOUT=600` in the proxy Space (`scikit-plots/ai`) secrets.
Subsequent requests in the same active session complete in seconds.
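
A caller other than the proxy could apply the same idea: allow a long timeout until one request has succeeded, then switch to a shorter one. The sketch below is one hedged way to do that. The class name `ColdStartAwareCaller`, the injected `send(payload, timeout)` callable, and the 60-second warm timeout are all assumptions for illustration; only the 600-second value comes from this document.

```python
class ColdStartAwareCaller:
    """Use a long timeout until one request succeeds, then a short one."""

    def __init__(self, send, cold_timeout_s=600.0, warm_timeout_s=60.0):
        # `send` is a hypothetical callable: send(payload, timeout) -> result.
        self._send = send
        self._cold_timeout_s = cold_timeout_s
        self._warm_timeout_s = warm_timeout_s  # assumed value, not documented
        self._warmed_up = False

    def call(self, payload):
        timeout = self._warm_timeout_s if self._warmed_up else self._cold_timeout_s
        try:
            result = self._send(payload, timeout)
        except TimeoutError:
            # A timeout may mean the session went idle; treat the next call as cold.
            self._warmed_up = False
            raise
        self._warmed_up = True
        return result
```

Injecting `send` keeps the timeout policy separate from the HTTP library, so the same class works with `urllib`, `requests`, or a test double.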

## Configuration

Set in **Space → Settings → Repository secrets**:

| Variable | Required? | Default | Description |
|---|---|---|---|
| `MODEL_ID` | No | `scikit-plots/Qwen2.5-Coder-7B-Instruct` | Model weights to load. Supports `scikit-plots/*` mirrors. |
| `ALLOWED_ORIGINS` | No | `https://scikit-plots-ai.hf.space` | Comma-separated CORS origins. Add `http://localhost:7860` for local dev. Do not set to `*` in production. |
| `MAX_BODY_BYTES` | No | `10485760` | Maximum request body size (bytes). Non-integer values fall back to default. |
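
The defaults and fallback rules in the table can be sketched as follows. This is an illustration of the documented behavior, not the Space's actual `app.py`; the function name `load_settings` and the returned dictionary keys are hypothetical, and treating an empty value like an unset one is an assumption.

```python
import os

DEFAULT_MODEL_ID = "scikit-plots/Qwen2.5-Coder-7B-Instruct"
DEFAULT_ALLOWED_ORIGINS = "https://scikit-plots-ai.hf.space"
DEFAULT_MAX_BODY_BYTES = 10485760  # 10 MiB


def load_settings(env=None):
    """Read the three documented variables, applying the documented defaults."""
    env = os.environ if env is None else env

    # Assumption: an empty string is treated the same as an unset variable.
    model_id = env.get("MODEL_ID") or DEFAULT_MODEL_ID

    raw_origins = env.get("ALLOWED_ORIGINS") or DEFAULT_ALLOWED_ORIGINS
    # Comma-separated list; surrounding whitespace and empty entries are dropped.
    allowed_origins = [o.strip() for o in raw_origins.split(",") if o.strip()]

    try:
        max_body_bytes = int(env.get("MAX_BODY_BYTES", DEFAULT_MAX_BODY_BYTES))
    except (TypeError, ValueError):
        # Non-integer values fall back to the default, as the table states.
        max_body_bytes = DEFAULT_MAX_BODY_BYTES

    return {
        "model_id": model_id,
        "allowed_origins": allowed_origins,
        "max_body_bytes": max_body_bytes,
    }
```

Passing a plain dictionary as `env` makes the parsing easy to check locally, for example `load_settings({"ALLOWED_ORIGINS": "http://localhost:7860"})`.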

## Why this works with scikit-plots/* mirrors

This ZeroGPU Space downloads raw model weights directly from HuggingFace
storage (Git LFS), bypassing the HuggingFace Serverless Inference API.
Mirror repos (`scikit-plots/*`) have weights but no registered Inference
Provider, so they work here but fail with the HF Serverless API.

## Wire the proxy Space

Add these to `scikit-plots/ai` → Settings → Repository secrets:

```
BACKEND_URL   = https://scikit-plots-ai-model.hf.space/v1/chat/completions
PROXY_TIMEOUT = 600
```

## References

- [FREE_PROXY_SOLUTIONS.md Path C](./FREE_PROXY_SOLUTIONS.md#path-c--new-zerogpu-space-completely-free-gpu)
- [ZeroGPU documentation](https://huggingface.co/docs/hub/spaces-zerogpu)