---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---

# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode), powered by HuggingFace ZeroGPU (NVIDIA H200).

## Features

- **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - Use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - Falls back to HF Serverless when quota is exhausted
- **SSE streaming** - Real-time token streaming support
- **Authentication** - Requires a valid HuggingFace token

## API Endpoint

```
POST /v1/chat/completions
```

### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```

### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```

## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:

```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": { "name": "meta-llama/Llama-3.1-8B-Instruct" },
        "mistral-7b": { "name": "mistralai/Mistral-7B-Instruct-v0.3" },
        "qwen-7b": { "name": "Qwen/Qwen2.5-7B-Instruct" },
        "qwen-14b": { "name": "Qwen/Qwen2.5-14B-Instruct" }
      }
    }
  }
}
```

Then use `/models` in opencode to select a zerogpu model.

## Supported Models

Any HuggingFace model that fits in ~70GB VRAM.
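Whether a given model fits can be estimated from its parameter count. A minimal sketch, assuming the bytes-per-parameter figures (FP16 ≈ 2, INT8 ≈ 1, INT4 ≈ 0.5) and the ~20% KV-cache overhead quoted in the VRAM guidelines table; the function name and defaults are illustrative, not part of this Space:

```python
# Rough VRAM-fit check based on the guideline figures in this README:
# FP16 ~2 bytes/param, INT8 ~1, INT4 ~0.5, plus ~20% KV-cache overhead.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def fits_in_vram(params_billion: float, quant: str = "fp16",
                 budget_gb: float = 70.0, overhead: float = 0.20) -> bool:
    """Return True if the model's estimated footprint fits the VRAM budget."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb * (1 + overhead) <= budget_gb

print(fits_in_vram(8))           # 8B FP16 ~19 GB  -> True
print(fits_in_vram(70))          # 70B FP16 ~168 GB -> False
print(fits_in_vram(70, "int4"))  # 70B INT4 ~42 GB  -> True
```

This reproduces the table's conclusion that 70B models only fit with INT4 quantization.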
Examples:

| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |

Models larger than 34B are automatically quantized to INT4.

## VRAM Guidelines

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |

*70B models require INT4 quantization. Add ~20% overhead for the KV cache.*

## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get the highest queue priority
- **Fallback**: When quota is exhausted, requests fall back to the HF Serverless Inference API

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |

## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (the ZeroGPU decorator becomes a no-op)
python app.py
```

## Testing

```bash
# Run tests
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (*the Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |

## License

MIT
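
---

For clients consuming `stream: true` responses, a minimal client-side parsing sketch. The SSE `data:` lines are assumed to carry OpenAI-style chunks (`choices[0].delta.content`), per the endpoint's OpenAI-compatibility claim; the parser below is illustrative, not part of this Space:

```python
import json

def parse_sse_chunks(lines):
    """Yield content deltas from OpenAI-style SSE 'data:' lines.

    Stops at the literal '[DONE]' sentinel. The chunk layout
    (choices[0].delta.content) is an assumption based on the
    OpenAI streaming format.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and SSE comments/keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example with a canned stream:
stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_chunks(stream)))  # -> Hello!
```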