# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at serenichron/opencode-zerogpu.

Key Features:
- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF token authentication required
- SSE streaming support
## Architecture

```
┌─────────────┐     ┌────────────────────────────────────────────┐
│  opencode   │────▶│ serenichron/opencode-zerogpu (HF Space)    │
│  (client)   │     │                                            │
└─────────────┘     │  ┌──────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)      │  │
                    │  │  └─ /v1/chat/completions             │  │
                    │  │      ├─ auth_middleware (HF token)   │  │
                    │  │      └─ inference_router             │  │
                    │  │          ├─ ZeroGPU (@spaces.GPU)    │  │
                    │  │          └─ HF Serverless (fallback) │  │
                    │  └──────────────────────────────────────┘  │
                    │                                            │
                    │  ┌──────────────┐  ┌────────────────────┐  │
                    │  │ models.py    │  │ openai_compat.py   │  │
                    │  │ - load/unload│  │ - request/response │  │
                    │  │ - quantize   │  │ - streaming format │  │
                    │  └──────────────┘  └────────────────────┘  │
                    └────────────────────────────────────────────┘
```
## Development Commands

### Local Development (CPU/Mock Mode)

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py

# Run in Gradio reload mode on a specific port
# (the gradio CLI reads GRADIO_SERVER_PORT rather than a --server-port flag)
GRADIO_SERVER_PORT=7860 gradio app.py
```
### Testing

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```
### API Testing

```bash
# Test the chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
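When `stream: true` is set, the response arrives as SSE events in the OpenAI chunk format that openai_compat.py produces. A minimal sketch of one chunk's shape (the function name and placeholder `id` are assumptions, not the Space's actual code):

```python
import json
import time

def sse_chunk(model, delta, finish=None):
    """Format one OpenAI-style streaming chunk as an SSE event.

    The field layout mirrors the /v1/chat/completions stream format:
    each event is a `data:` line carrying a chat.completion.chunk object.
    """
    payload = {
        "id": "chatcmpl-local",  # placeholder id (assumption)
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"content": delta} if delta else {},
            "finish_reason": finish,
        }],
    }
    return f"data: {json.dumps(payload)}\n\n"

# The stream ends with a literal sentinel event:
DONE = "data: [DONE]\n\n"
```

A client consuming the curl output above would read `data:` lines, JSON-decode each payload, and concatenate the `delta.content` pieces until it sees `[DONE]`.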
### Deployment

```bash
# Push to the HuggingFace Space (after git remote setup)
git push hf main

# Or use the HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```
## Key Files

| File | Purpose |
|---|---|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |
## ZeroGPU Patterns

### GPU Decorator Usage

```python
import spaces

# Standard inference (60s default)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```
### Model Loading Pattern

```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id
    if model_id != current_model_id:
        # Clean up the previous model before loading a new one
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()
        # Load the new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        current_model_id = model_id
    return generate(current_model, prompt)  # generate() defined elsewhere
```
## Important Constraints

### ZeroGPU Compatibility

- `torch.compile` is NOT supported - use PyTorch ahead-of-time (AoT) compilation instead
- Gradio SDK only (no Streamlit)
- GPU is allocated only during `@spaces.GPU`-decorated functions

### Memory Management

- H200 provides ~70GB VRAM
- 70B models require INT4 quantization
- Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`
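For the INT4 requirement above, one common route is 4-bit loading via `BitsAndBytesConfig`. This is an illustrative config fragment under the assumption that bitsandbytes is available on the Space; the model ID is an example, not a project default.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization config; compute in bfloat16 to match the
# loading pattern shown earlier.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Example only - any ~70B model ID would be passed through here.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```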
### Quota Awareness

- PRO plan: 25 min/day of H200 compute
- Track usage and fall back to HF Serverless when the quota is exhausted
- Shorter `duration` = higher queue priority
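The tracking-and-fallback rule above can be sketched as a small counter; this is a minimal sketch, and the class and method names are assumptions rather than what config.py actually implements.

```python
import time

class GpuQuota:
    """Track daily GPU seconds and decide when to fall back.

    Defaults to the PRO plan's 25 min/day of H200 compute.
    """

    def __init__(self, daily_seconds=25 * 60):
        self.daily_seconds = daily_seconds
        self.used = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def _roll(self):
        # Reset the counter at the day boundary
        today = time.strftime("%Y-%m-%d")
        if today != self.day:
            self.day, self.used = today, 0.0

    def record(self, seconds):
        """Add GPU time consumed by a completed @spaces.GPU call."""
        self._roll()
        self.used += seconds

    def should_fallback(self, estimated_seconds):
        """True when the next call would exceed today's budget."""
        self._roll()
        return self.used + estimated_seconds > self.daily_seconds
```

The router would call `should_fallback()` with the requested `duration` before dispatching, sending the request to HF Serverless instead of ZeroGPU when it returns True.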
### Authentication

- All API requests require an `Authorization: Bearer hf_...` header
- Validate tokens via the HuggingFace Hub API
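The first step of that check is pulling the token out of the header. A minimal sketch (the function name is an assumption; the actual middleware would then validate the extracted token against the HuggingFace Hub API, e.g. via `HfApi().whoami(token=...)`):

```python
def extract_bearer_token(auth_header):
    """Extract the HF token from an Authorization header value.

    Returns None when the header is missing, uses a non-Bearer
    scheme, or carries no token.
    """
    if not auth_header:
        return None
    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()
```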
## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HF_TOKEN` | No* | Token for accessing gated models (*the Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |
## Testing Strategy

- **Unit Tests**: Model loading, OpenAI format conversion
- **Integration Tests**: Full API request/response cycle
- **Local Testing**: CPU-only mode (decorator no-ops)
- **Live Testing**: Deploy to the Space, test via opencode