# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.

**Key Features:**

- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF token authentication required
- SSE streaming support

## Architecture

```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)     │
│  (client)   │     │                                              │
└─────────────┘     │  ┌────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)        │  │
                    │  │  └─ /v1/chat/completions               │  │
                    │  │      ├─ auth_middleware (HF token)     │  │
                    │  │      └─ inference_router               │  │
                    │  │           ├─ ZeroGPU (@spaces.GPU)     │  │
                    │  │           └─ HF Serverless (fallback)  │  │
                    │  └────────────────────────────────────────┘  │
                    │                                              │
                    │  ┌──────────────┐  ┌──────────────────────┐  │
                    │  │ models.py    │  │ openai_compat.py     │  │
                    │  │ - load/unload│  │ - request/response   │  │
                    │  │ - quantize   │  │ - streaming format   │  │
                    │  └──────────────┘  └──────────────────────┘  │
                    └──────────────────────────────────────────────┘
```
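The streaming format that `openai_compat.py` is responsible for can be sketched roughly as follows. This is a stdlib-only illustration of the OpenAI chat-completion *chunk* schema, not necessarily the repo's exact code; the `make_chunk` helper name is made up for the example.

```python
import json
import time
import uuid

def make_chunk(model_id, delta_text, finish=None):
    """Build one OpenAI-compatible SSE line for a streamed completion chunk."""
    chunk = {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{
            "index": 0,
            # Intermediate chunks carry text in `delta`; the final chunk is empty.
            "delta": {"content": delta_text} if delta_text else {},
            "finish_reason": finish,
        }],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```

A stream ends with a final chunk carrying `finish_reason: "stop"` followed by the literal sentinel line `data: [DONE]\n\n`.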

## Development Commands

### Local Development (CPU/Mock Mode)

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py

# Run on a specific port
GRADIO_SERVER_PORT=7860 python app.py
```

### Testing

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```

### API Testing

```bash
# Test the chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

### Deployment

```bash
# Push to the HuggingFace Space (after git remote setup)
git push hf main

# Or use the HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```

## Key Files

| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |

## ZeroGPU Patterns

### GPU Decorator Usage

```python
import spaces

# Standard inference (60s default duration)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration computed from the call's arguments
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```
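For plain local runs where the `spaces` package is absent, the "decorator no-ops" behavior mentioned under Development Commands can be emulated with a small shim. This is an illustrative sketch, not part of the repo:

```python
import sys
import types

try:
    import spaces  # the real package, available on the Space
except ImportError:
    # Local fallback: @spaces.GPU and @spaces.GPU(duration=...) become no-ops.
    spaces = types.ModuleType("spaces")

    def _gpu(func=None, duration=None):
        if callable(func):       # bare usage: @spaces.GPU
            return func
        return lambda f: f       # parameterized usage: @spaces.GPU(duration=...)

    spaces.GPU = _gpu
    sys.modules["spaces"] = spaces

@spaces.GPU
def generate(prompt):
    return f"echo: {prompt}"

@spaces.GPU(duration=90)
def generate_large(prompt):
    return prompt.upper()
```

Because the shim registers itself in `sys.modules`, later `import spaces` statements in other modules resolve to the same no-op version.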

### Model Loading Pattern

```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id

    if model_id != current_model_id:
        # Clean up the previous model before loading a new one
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()

        # Load the new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        current_model_id = model_id

    # generate() is the app's own inference helper
    return generate(current_model, prompt)
```
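The swap-and-cleanup logic above can be exercised without a GPU by injecting a fake loader. A sketch of the same single-slot cache as a reusable helper (the class and its names are illustrative, not the repo's actual `models.py` API):

```python
class SingleModelCache:
    """Keeps at most one model resident; swapping evicts the previous one."""

    def __init__(self, loader, cleanup=None):
        self._loader = loader    # model_id -> loaded model object
        self._cleanup = cleanup  # called with the evicted model (e.g. gc + empty_cache)
        self.model = None
        self.model_id = None

    def get(self, model_id):
        if model_id != self.model_id:
            if self.model is not None and self._cleanup is not None:
                self._cleanup(self.model)
            self.model = self._loader(model_id)
            self.model_id = model_id
        return self.model
```

In tests, `loader` and `cleanup` can be stubs; on the Space, `cleanup` would wrap `gc.collect()` and `torch.cuda.empty_cache()`.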

## Important Constraints

1. **ZeroGPU Compatibility**
   - `torch.compile` is not supported; use PyTorch ahead-of-time (AoT) compilation instead
   - Gradio SDK only (no Streamlit)
   - The GPU is allocated only while a `@spaces.GPU`-decorated function runs
2. **Memory Management**
   - The H200 slice provides ~70 GB of VRAM
   - 70B models require INT4 quantization to fit
   - Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`
3. **Quota Awareness**
   - PRO plan: 25 min/day of H200 compute
   - Track usage and fall back to HF Serverless when the quota is exhausted
   - Shorter declared durations get higher queue priority
4. **Authentication**
   - All API requests require an `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API
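The quota fallback in item 3 can be sketched as a small daily tracker. The 25-minute figure comes from the PRO plan above; the class itself is illustrative and not `config.py`'s actual API:

```python
import time

DAILY_QUOTA_SECONDS = 25 * 60  # PRO plan: 25 min/day of H200 compute

class GpuQuota:
    def __init__(self, quota_seconds=DAILY_QUOTA_SECONDS, clock=time.time):
        self.quota = quota_seconds
        self.used = 0.0
        self._clock = clock  # injectable for testing
        self._day = None

    def _roll_day(self):
        today = int(self._clock() // 86400)
        if today != self._day:  # new UTC day: reset the counter
            self._day, self.used = today, 0.0

    def record(self, seconds):
        """Account for GPU time actually consumed by a call."""
        self._roll_day()
        self.used += seconds

    def use_zerogpu(self, estimated_seconds):
        """True if the call fits in today's quota, else route to HF Serverless."""
        self._roll_day()
        return self.used + estimated_seconds <= self.quota
```

The router would consult `use_zerogpu()` before dispatching, and `record()` the measured duration afterwards.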

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (*the Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |
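`config.py`'s handling of these variables might follow a pattern like this. A sketch only; the `Settings`/`load_settings` names are made up, and the optional `env` mapping exists so the parsing is testable without touching the real environment:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Settings:
    hf_token: Optional[str]
    fallback_enabled: bool
    log_level: str

def load_settings(env=None):
    """Read settings from the given mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN"),  # optional: the Space has its own token
        fallback_enabled=env.get("FALLBACK_ENABLED", "true").lower() == "true",
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```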

## Testing Strategy

1. **Unit tests**: model loading, OpenAI format conversion
2. **Integration tests**: full API request/response cycle
3. **Local testing**: CPU-only mode (the decorator no-ops)
4. **Live testing**: deploy to the Space, test via opencode