---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---

# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for opencode, powered by HuggingFace ZeroGPU (NVIDIA H200).

## Features

- **OpenAI-compatible API** - drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - falls back to HF Serverless when the quota is exhausted
- **SSE streaming** - real-time token streaming support
- **Authentication** - requires a valid HuggingFace token
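
The SSE stream follows the standard OpenAI chunk format: `data: {...}` events carrying `choices[0].delta`, terminated by a `data: [DONE]` sentinel. A minimal parser sketch, assuming that format (the helper name is illustrative, not part of this Space's code):

```python
import json

def parse_sse_chunks(raw: str) -> list[str]:
    """Extract content deltas from an OpenAI-style SSE stream body.

    Each event is a line of the form `data: {...}`; the stream
    ends with the `data: [DONE]` sentinel.
    """
    tokens = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens

raw = (
    'data: {"choices": [{"delta": {"role": "assistant"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo!"}}]}\n\n'
    'data: [DONE]\n\n'
)
print("".join(parse_sse_chunks(raw)))  # -> Hello!
```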

## API Endpoint

```
POST /v1/chat/completions
```

### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```

### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
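
The request format and headers above can be exercised from Python with only the standard library. A minimal client sketch; the `build_request`/`chat` names are illustrative, and a real `HF_TOKEN` is needed for the network call:

```python
import json
import os
import urllib.request

BASE_URL = "https://serenichron-opencode-zerogpu.hf.space/v1"

def build_request(model: str, user_msg: str, stream: bool = False):
    """Assemble the headers and JSON body shown above."""
    headers = {
        "Authorization": f"Bearer {os.environ.get('HF_TOKEN', 'hf_YOUR_TOKEN')}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.7,
        "max_tokens": 512,
        "stream": stream,
    }
    return headers, body

def chat(model: str, user_msg: str) -> str:
    """POST a non-streaming completion and return the reply text."""
    headers, body = build_request(model, user_msg)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("meta-llama/Llama-3.1-8B-Instruct", "Hello!")  # requires a valid HF_TOKEN
```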

## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:

```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": {
          "name": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "mistral-7b": {
          "name": "mistralai/Mistral-7B-Instruct-v0.3"
        },
        "qwen-7b": {
          "name": "Qwen/Qwen2.5-7B-Instruct"
        },
        "qwen-14b": {
          "name": "Qwen/Qwen2.5-14B-Instruct"
        }
      }
    }
  }
}
```

Then use `/models` in opencode to select a `zerogpu` model.

## Supported Models

Any HuggingFace model that fits in ~70GB VRAM. Examples:

| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |

Models larger than 34B are automatically quantized to INT4.

## VRAM Guidelines

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |

70B models require INT4 quantization. Add ~20% overhead for KV cache.
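
The table follows from roughly 2 bytes per parameter at FP16, 1 at INT8, and 0.5 at INT4; a quick estimator sketch that also adds the ~20% KV-cache headroom mentioned above (the function name is illustrative):

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_b: float, precision: str = "fp16",
                     kv_overhead: float = 0.20) -> float:
    """Weight memory in GB for a model of `params_b` billion
    parameters, plus ~20% headroom for the KV cache."""
    weights_gb = params_b * BYTES_PER_PARAM[precision]
    return round(weights_gb * (1 + kv_overhead), 1)

print(estimate_vram_gb(7))           # 16.8  (~14GB weights + 20%)
print(estimate_vram_gb(70, "int4"))  # 42.0  (~35GB weights + 20%)
```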

## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get the highest queue priority
- **Fallback**: when the quota is exhausted, requests fall back to the HF Serverless Inference API

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |

## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu
cd opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (the ZeroGPU decorator is a no-op outside Spaces)
python app.py
```

## Testing

```bash
# Run tests
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (*the Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |

## License

MIT