---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---
# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for opencode, powered by HuggingFace ZeroGPU (NVIDIA H200).
## Features

- **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - Use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - Falls back to HF Serverless when quota is exhausted
- **SSE streaming** - Real-time token streaming support
- **Authentication** - Requires a valid HuggingFace token
## API Endpoint

```
POST /v1/chat/completions
```
### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```
### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
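Putting the request format and headers together, a minimal sketch of building such a request in Python (the `build_chat_request` helper is illustrative, not part of the Space; POST the returned body to the Space URL with any HTTP client):

```python
import json

API_URL = "https://serenichron-opencode-zerogpu.hf.space/v1/chat/completions"

def build_chat_request(token, model, user_message,
                       system_prompt="You are a helpful assistant.",
                       temperature=0.7, max_tokens=512, stream=True):
    """Assemble the headers and JSON body for /v1/chat/completions."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    return headers, json.dumps(payload)

# Example: build a request for the Llama 8B model
headers, body = build_chat_request("hf_YOUR_TOKEN",
                                   "meta-llama/Llama-3.1-8B-Instruct",
                                   "Hello!")
```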
## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:
```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": {
          "name": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "mistral-7b": {
          "name": "mistralai/Mistral-7B-Instruct-v0.3"
        },
        "qwen-7b": {
          "name": "Qwen/Qwen2.5-7B-Instruct"
        },
        "qwen-14b": {
          "name": "Qwen/Qwen2.5-14B-Instruct"
        }
      }
    }
  }
}
```
Then use `/models` in opencode to select a `zerogpu` model.
## Supported Models

Any HuggingFace model that fits in ~70GB VRAM. Examples:
| Model | Size | Quantization |
|---|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |
Models larger than 34B are automatically quantized to INT4.
## VRAM Guidelines
| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |
70B models require INT4 quantization. Add ~20% overhead for KV cache.
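The table rows follow a simple rule of thumb: bits-per-parameter times parameter count for the weights, plus roughly 20% for the KV cache. A minimal sketch (the helper is illustrative, not part of the Space):

```python
def estimate_vram_gb(params_billions: float, bits: int = 16,
                     kv_overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weights at `bits` per parameter,
    plus ~20% overhead for the KV cache."""
    weight_gb = params_billions * bits / 8  # 1B params at 8 bits ~= 1 GB
    return round(weight_gb * (1 + kv_overhead), 1)

print(estimate_vram_gb(7))      # 16.8 -- FP16 7B with KV cache
print(estimate_vram_gb(70, 4))  # 42.0 -- INT4 70B with KV cache
```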
## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get highest queue priority
- **Fallback**: When quota is exhausted, requests fall back to the HF Serverless Inference API
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |
## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py
```
## Testing

```bash
# Run tests
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```
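With `"stream": true`, the response arrives as Server-Sent Events. A minimal sketch of collecting the token deltas from such a stream (the sample payload is illustrative; field names follow the OpenAI chat-completions chunk format):

```python
import json

def collect_stream_text(sse_lines):
    """Concatenate the content deltas from OpenAI-style SSE chunks."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        data = line[len("data: "):]
        if data.strip() == "[DONE]":      # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Hello!
```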
## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HF_TOKEN` | No* | Token for gated models (* Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |
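A common pattern for reading these with the defaults listed above (a sketch; the boolean parsing of `FALLBACK_ENABLED` is an assumption about `app.py`, not confirmed behavior):

```python
import os

def read_config(env=os.environ):
    """Read the Space's configuration from environment variables,
    applying the defaults from the table above."""
    return {
        "hf_token": env.get("HF_TOKEN"),  # optional; Space has its own
        "fallback_enabled": env.get("FALLBACK_ENABLED", "true").lower()
                            in ("1", "true", "yes"),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }

cfg = read_config({"FALLBACK_ENABLED": "false", "LOG_LEVEL": "debug"})
print(cfg["fallback_enabled"], cfg["log_level"])  # False DEBUG
```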
## License

MIT