---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---

# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for opencode, powered by HuggingFace ZeroGPU (NVIDIA H200).

## Features

- **OpenAI-compatible API** - drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - falls back to HF Serverless when the quota is exhausted
- **SSE streaming** - real-time token streaming support
- **Authentication** - requires a valid HuggingFace token
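
The SSE stream follows the standard OpenAI chunk format: `data: {...}` events carrying `choices[0].delta`, terminated by a `data: [DONE]` sentinel. A minimal parser sketch, assuming that format (the helper name is illustrative, not part of this Space's code):

```python
import json

def parse_sse_chunks(raw: str) -> list[str]:
    """Extract content deltas from an OpenAI-style SSE stream body.

    Each event is a line of the form `data: {...}`; the stream
    ends with the `data: [DONE]` sentinel.
    """
    tokens = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens

raw = (
    'data: {"choices": [{"delta": {"role": "assistant"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo!"}}]}\n\n'
    'data: [DONE]\n\n'
)
print("".join(parse_sse_chunks(raw)))  # -> Hello!
```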

## API Endpoint

```
POST /v1/chat/completions
```

### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```

### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
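
The request format and headers above can be exercised from Python with only the standard library. A minimal client sketch; the `build_request`/`chat` names are illustrative, and a real `HF_TOKEN` is needed for the network call:

```python
import json
import os
import urllib.request

BASE_URL = "https://serenichron-opencode-zerogpu.hf.space/v1"

def build_request(model: str, user_msg: str, stream: bool = False):
    """Assemble the headers and JSON body shown above."""
    headers = {
        "Authorization": f"Bearer {os.environ.get('HF_TOKEN', 'hf_YOUR_TOKEN')}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.7,
        "max_tokens": 512,
        "stream": stream,
    }
    return headers, body

def chat(model: str, user_msg: str) -> str:
    """POST a non-streaming completion and return the reply text."""
    headers, body = build_request(model, user_msg)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("meta-llama/Llama-3.1-8B-Instruct", "Hello!")  # requires a valid HF_TOKEN
```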

## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:

```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": {
          "name": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "mistral-7b": {
          "name": "mistralai/Mistral-7B-Instruct-v0.3"
        },
        "qwen-7b": {
          "name": "Qwen/Qwen2.5-7B-Instruct"
        },
        "qwen-14b": {
          "name": "Qwen/Qwen2.5-14B-Instruct"
        }
      }
    }
  }
}
```

Then use `/models` in opencode to select a `zerogpu` model.

## Supported Models

Any HuggingFace model that fits in ~70GB VRAM. Examples:

| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |

Models larger than 34B are automatically quantized to INT4.

## VRAM Guidelines

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |

70B models require INT4 quantization. Add ~20% overhead for KV cache.
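
The table follows from roughly 2 bytes per parameter at FP16, 1 at INT8, and 0.5 at INT4; a quick estimator sketch that also adds the ~20% KV-cache headroom mentioned above (the function name is illustrative):

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_b: float, precision: str = "fp16",
                     kv_overhead: float = 0.20) -> float:
    """Weight memory in GB for a model of `params_b` billion
    parameters, plus ~20% headroom for the KV cache."""
    weights_gb = params_b * BYTES_PER_PARAM[precision]
    return round(weights_gb * (1 + kv_overhead), 1)

print(estimate_vram_gb(7))           # 16.8  (~14GB weights + 20%)
print(estimate_vram_gb(70, "int4"))  # 42.0  (~35GB weights + 20%)
```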

## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get the highest queue priority
- **Fallback**: when the quota is exhausted, requests fall back to the HF Serverless Inference API

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |

## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu
cd opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (the ZeroGPU decorator is a no-op outside Spaces)
python app.py
```

## Testing

```bash
# Run tests
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (*the Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |

## License

MIT