---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---
# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode), powered by HuggingFace ZeroGPU (NVIDIA H200).

## Features

- **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - Use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - Falls back to HF Serverless when the quota is exhausted
- **SSE streaming** - Real-time token streaming support
- **Authentication** - Requires a valid HuggingFace token

## API Endpoint

```
POST /v1/chat/completions
```

### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```

### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
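Putting the request body and headers together, a minimal client sketch looks like the following (stdlib only; the base URL is the Space endpoint from this README, and `hf_YOUR_TOKEN` is a placeholder for a real token):

```python
import json
import urllib.request

# Space endpoint from this README.
BASE_URL = "https://serenichron-opencode-zerogpu.hf.space/v1"

def build_chat_request(token, model, messages, stream=False, **params):
    """Build an OpenAI-style chat-completions request for the endpoint above."""
    body = {"model": model, "messages": messages, "stream": stream, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a plain `urllib.request.urlopen(req)`; with `stream=True` the response arrives as SSE chunks instead of a single JSON body.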
## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:

```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": {
          "name": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "mistral-7b": {
          "name": "mistralai/Mistral-7B-Instruct-v0.3"
        },
        "qwen-7b": {
          "name": "Qwen/Qwen2.5-7B-Instruct"
        },
        "qwen-14b": {
          "name": "Qwen/Qwen2.5-14B-Instruct"
        }
      }
    }
  }
}
```

Then use `/models` in opencode to select a zerogpu model.
## Supported Models

Any HuggingFace model that fits in ~70GB VRAM. Examples:

| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |

Models larger than 34B are automatically quantized to INT4.
## VRAM Guidelines

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |

*70B models require INT4 quantization. Add ~20% overhead for KV cache.*
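The table rows follow a simple rule of thumb: weights take roughly `params × bits / 8` bytes, and the KV cache adds about 20% on top. A quick sketch of that arithmetic:

```python
def weight_vram_gb(params_b, bits):
    """Approximate VRAM for model weights alone: billions of params x bits/8 (GB)."""
    return params_b * bits / 8

def total_vram_gb(params_b, bits, kv_overhead=0.20):
    """Weights plus the ~20% KV-cache overhead noted above."""
    return weight_vram_gb(params_b, bits) * (1 + kv_overhead)
```

For example, `weight_vram_gb(7, 16)` gives 14.0, matching the 7B FP16 row, and `weight_vram_gb(70, 4)` gives 35.0, matching the 70B INT4 row.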
## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get the highest queue priority
- **Fallback**: When the quota is exhausted, requests fall back to the HF Serverless Inference API
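The fallback behavior can be pictured as a try/except around the two backends. This is an illustrative sketch, not the Space's actual internals; the names here are hypothetical:

```python
class QuotaExhausted(Exception):
    """Raised when the daily ZeroGPU quota is used up (hypothetical)."""

def complete(request, zerogpu_backend, serverless_backend):
    """Try ZeroGPU first; fall back to HF Serverless on quota exhaustion."""
    try:
        return zerogpu_backend(request)
    except QuotaExhausted:
        return serverless_backend(request)
```

From the client's point of view the fallback is transparent: the same request succeeds either way, just on slower hardware once the quota runs out.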
## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |
## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu
cd opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (the ZeroGPU decorator is a no-op outside Spaces)
python app.py
```

## Testing

```bash
# Run the test suite
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```
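With `"stream": true`, the endpoint emits OpenAI-style SSE chunks. A minimal parser, assuming the standard `data: {...}` / `data: [DONE]` framing, looks like:

```python
import json

def parse_sse(lines):
    """Yield the content deltas from OpenAI-style SSE chunk lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta
```

Joining the yielded deltas reconstructs the full completion text as it streams in.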
## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (*the Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |

## License

MIT