# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
A HuggingFace ZeroGPU Space that serves as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.
**Key Features:**
- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF Token authentication required
- SSE streaming support
## Architecture
```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)     │
│  (client)   │     │                                              │
└─────────────┘     │  ┌────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)        │  │
                    │  │ └─ /v1/chat/completions                │  │
                    │  │     └─ auth_middleware (HF token)      │  │
                    │  │         └─ inference_router            │  │
                    │  │             ├─ ZeroGPU (@spaces.GPU)   │  │
                    │  │             └─ HF Serverless (fallback)│  │
                    │  └────────────────────────────────────────┘  │
                    │                                              │
                    │  ┌──────────────┐  ┌───────────────────────┐ │
                    │  │ models.py    │  │ openai_compat.py      │ │
                    │  │ - load/unload│  │ - request/response    │ │
                    │  │ - quantize   │  │ - streaming format    │ │
                    │  └──────────────┘  └───────────────────────┘ │
                    └──────────────────────────────────────────────┘
```
## Development Commands
### Local Development (CPU/Mock Mode)
```bash
# Install dependencies
pip install -r requirements.txt
# Run locally (ZeroGPU decorator no-ops)
python app.py
# Run on a specific port (the Gradio server reads GRADIO_SERVER_PORT)
GRADIO_SERVER_PORT=7860 python app.py
```
### Testing
```bash
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_openai_compat.py -v
# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```
### API Testing
```bash
# Test chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
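Client-side, the `stream: true` response arrives as SSE `data:` lines; a small helper like the following (illustrative, not part of this repo) can reassemble the generated text:

```python
# Reassemble the assistant text from OpenAI-style SSE event lines.
import json

def collect_stream(sse_lines):
    """Join content deltas from 'data:' events, stopping at the [DONE] sentinel."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)
```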
### Deployment
```bash
# Push to HuggingFace Space (after git remote setup)
git push hf main
# Or use HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```
## Key Files
| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |
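For reference, one streamed chunk in the format `openai_compat.py` handles can be built like this. This is a sketch with illustrative field values; the module's actual helpers may differ.

```python
# Build one OpenAI-style SSE chunk for a streamed chat completion.
import json
import time
import uuid

def make_stream_chunk(model: str, delta_text: str, finish: bool = False) -> str:
    """Format a token delta (or the final chunk) as an SSE 'data:' event."""
    chunk = {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {} if finish else {"content": delta_text},
            "finish_reason": "stop" if finish else None,
        }],
    }
    return f"data: {json.dumps(chunk)}\n\n"

# Streams terminate with the sentinel the OpenAI API uses:
DONE_EVENT = "data: [DONE]\n\n"
```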
## ZeroGPU Patterns
### GPU Decorator Usage
```python
import spaces

# Standard inference (60s default)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```
### Model Loading Pattern
```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id
    if model_id != current_model_id:
        # Cleanup previous model
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()
        # Load new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        current_model_id = model_id
    return generate(current_model, prompt)
```
## Important Constraints
1. **ZeroGPU Compatibility**
   - `torch.compile` is NOT supported; use PyTorch ahead-of-time (AoT) compilation instead
   - Gradio SDK only (no Streamlit)
   - GPU is allocated only while a `@spaces.GPU`-decorated function runs
2. **Memory Management**
   - H200 provides ~70GB VRAM
   - 70B models require INT4 quantization
   - Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`
3. **Quota Awareness**
   - PRO plan: 25 min/day of H200 compute
   - Track usage and fall back to HF Serverless when the quota is exhausted
   - Shorter `duration` = higher queue priority
4. **Authentication**
   - All API requests require an `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API
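The quota-awareness constraint above can be sketched as a small tracker. This is illustrative; `config.py`'s actual quota tracking may differ, and the 25-minute budget is the PRO figure quoted above.

```python
# Hedged sketch of per-day ZeroGPU quota tracking with Serverless fallback.
import time

class QuotaTracker:
    """Track daily GPU seconds; signal fallback when the budget is spent."""

    def __init__(self, daily_budget_s: float = 25 * 60):
        self.daily_budget_s = daily_budget_s
        self.used_s = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def _roll_day(self):
        today = time.strftime("%Y-%m-%d")
        if today != self.day:  # new day: reset the counter
            self.day, self.used_s = today, 0.0

    def record(self, seconds: float):
        """Add the measured duration of a completed GPU call."""
        self._roll_day()
        self.used_s += seconds

    def should_fallback(self, planned_s: float) -> bool:
        """True when the planned GPU call would exceed today's budget."""
        self._roll_day()
        return self.used_s + planned_s > self.daily_budget_s
```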
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (* Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
| `LOG_LEVEL` | No | Logging verbosity (default: INFO) |
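A hedged sketch of how `config.py` might read these variables (names match the table; defaults are the documented ones):

```python
# Read the documented environment variables with their defaults.
import os

def get_settings(env=os.environ):
    """Return the Space's settings as a plain dict."""
    return {
        "hf_token": env.get("HF_TOKEN"),  # optional: the Space injects its own
        "fallback_enabled": env.get("FALLBACK_ENABLED", "true").lower() == "true",
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```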
## Testing Strategy
1. **Unit Tests**: Model loading, OpenAI format conversion
2. **Integration Tests**: Full API request/response cycle
3. **Local Testing**: CPU-only mode (decorator no-ops)
4. **Live Testing**: Deploy to Space, test via opencode
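A unit test in the style of `tests/test_openai_compat.py` might look like this; the `parse_chat_request` helper is hypothetical and defined inline so the example runs.

```python
# Hypothetical request parser plus a pytest-style unit test for it.
def parse_chat_request(body: dict) -> tuple:
    """Extract model ID and messages from an OpenAI-style request body."""
    if "model" not in body or "messages" not in body:
        raise ValueError("'model' and 'messages' are required")
    return body["model"], body["messages"]

def test_parse_chat_request():
    model, messages = parse_chat_request({
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Hello!"}],
    })
    assert model == "mistralai/Mistral-7B-Instruct-v0.3"
    assert messages[0]["role"] == "user"
```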