# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.

**Key Features:**

- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF token authentication required
- SSE streaming support

## Architecture

```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)     │
│  (client)   │     │                                              │
└─────────────┘     │  ┌────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)        │  │
                    │  │  └─ /v1/chat/completions               │  │
                    │  │      ├─ auth_middleware (HF token)     │  │
                    │  │      └─ inference_router               │  │
                    │  │           ├─ ZeroGPU (@spaces.GPU)     │  │
                    │  │           └─ HF Serverless (fallback)  │  │
                    │  └────────────────────────────────────────┘  │
                    │                                              │
                    │  ┌──────────────┐  ┌──────────────────────┐  │
                    │  │ models.py    │  │ openai_compat.py     │  │
                    │  │ - load/unload│  │ - request/response   │  │
                    │  │ - quantize   │  │ - streaming format   │  │
                    │  └──────────────┘  └──────────────────────┘  │
                    └──────────────────────────────────────────────┘
```
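The streaming format that `openai_compat.py` is responsible for can be sketched roughly as follows. This is a stdlib-only illustration of the OpenAI chat-completion *chunk* schema, not necessarily the repo's exact code; the `make_chunk` helper name is made up for the example.

```python
import json
import time
import uuid

def make_chunk(model_id, delta_text, finish=None):
    """Build one OpenAI-compatible SSE line for a streamed completion chunk."""
    chunk = {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{
            "index": 0,
            # Intermediate chunks carry text in `delta`; the final chunk is empty.
            "delta": {"content": delta_text} if delta_text else {},
            "finish_reason": finish,
        }],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```

A stream ends with a final chunk carrying `finish_reason: "stop"` followed by the literal sentinel line `data: [DONE]\n\n`.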

## Development Commands

### Local Development (CPU/Mock Mode)

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py

# Run on a specific port
GRADIO_SERVER_PORT=7860 python app.py
```

### Testing

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```

### API Testing

```bash
# Test the chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

### Deployment

```bash
# Push to the HuggingFace Space (after git remote setup)
git push hf main

# Or use the HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```

## Key Files

| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |

## ZeroGPU Patterns

### GPU Decorator Usage

```python
import spaces

# Standard inference (60s default duration)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration computed from the call's arguments
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```
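For plain local runs where the `spaces` package is absent, the "decorator no-ops" behavior mentioned under Development Commands can be emulated with a small shim. This is an illustrative sketch, not part of the repo:

```python
import sys
import types

try:
    import spaces  # the real package, available on the Space
except ImportError:
    # Local fallback: @spaces.GPU and @spaces.GPU(duration=...) become no-ops.
    spaces = types.ModuleType("spaces")

    def _gpu(func=None, duration=None):
        if callable(func):       # bare usage: @spaces.GPU
            return func
        return lambda f: f       # parameterized usage: @spaces.GPU(duration=...)

    spaces.GPU = _gpu
    sys.modules["spaces"] = spaces

@spaces.GPU
def generate(prompt):
    return f"echo: {prompt}"

@spaces.GPU(duration=90)
def generate_large(prompt):
    return prompt.upper()
```

Because the shim registers itself in `sys.modules`, later `import spaces` statements in other modules resolve to the same no-op version.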

### Model Loading Pattern

```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id

    if model_id != current_model_id:
        # Clean up the previous model before loading a new one
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()

        # Load the new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        current_model_id = model_id

    # generate() is the app's own inference helper
    return generate(current_model, prompt)
```
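The swap-and-cleanup logic above can be exercised without a GPU by injecting a fake loader. A sketch of the same single-slot cache as a reusable helper (the class and its names are illustrative, not the repo's actual `models.py` API):

```python
class SingleModelCache:
    """Keeps at most one model resident; swapping evicts the previous one."""

    def __init__(self, loader, cleanup=None):
        self._loader = loader    # model_id -> loaded model object
        self._cleanup = cleanup  # called with the evicted model (e.g. gc + empty_cache)
        self.model = None
        self.model_id = None

    def get(self, model_id):
        if model_id != self.model_id:
            if self.model is not None and self._cleanup is not None:
                self._cleanup(self.model)
            self.model = self._loader(model_id)
            self.model_id = model_id
        return self.model
```

In tests, `loader` and `cleanup` can be stubs; on the Space, `cleanup` would wrap `gc.collect()` and `torch.cuda.empty_cache()`.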

## Important Constraints

1. **ZeroGPU Compatibility**
   - `torch.compile` is not supported; use PyTorch ahead-of-time (AoT) compilation instead
   - Gradio SDK only (no Streamlit)
   - The GPU is allocated only while a `@spaces.GPU`-decorated function runs
2. **Memory Management**
   - The H200 slice provides ~70 GB of VRAM
   - 70B models require INT4 quantization to fit
   - Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`
3. **Quota Awareness**
   - PRO plan: 25 min/day of H200 compute
   - Track usage and fall back to HF Serverless when the quota is exhausted
   - Shorter declared durations get higher queue priority
4. **Authentication**
   - All API requests require an `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API
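The quota fallback in item 3 can be sketched as a small daily tracker. The 25-minute figure comes from the PRO plan above; the class itself is illustrative and not `config.py`'s actual API:

```python
import time

DAILY_QUOTA_SECONDS = 25 * 60  # PRO plan: 25 min/day of H200 compute

class GpuQuota:
    def __init__(self, quota_seconds=DAILY_QUOTA_SECONDS, clock=time.time):
        self.quota = quota_seconds
        self.used = 0.0
        self._clock = clock  # injectable for testing
        self._day = None

    def _roll_day(self):
        today = int(self._clock() // 86400)
        if today != self._day:  # new UTC day: reset the counter
            self._day, self.used = today, 0.0

    def record(self, seconds):
        """Account for GPU time actually consumed by a call."""
        self._roll_day()
        self.used += seconds

    def use_zerogpu(self, estimated_seconds):
        """True if the call fits in today's quota, else route to HF Serverless."""
        self._roll_day()
        return self.used + estimated_seconds <= self.quota
```

The router would consult `use_zerogpu()` before dispatching, and `record()` the measured duration afterwards.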

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (*the Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |
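`config.py`'s handling of these variables might follow a pattern like this. A sketch only; the `Settings`/`load_settings` names are made up, and the optional `env` mapping exists so the parsing is testable without touching the real environment:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Settings:
    hf_token: Optional[str]
    fallback_enabled: bool
    log_level: str

def load_settings(env=None):
    """Read settings from the given mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN"),  # optional: the Space has its own token
        fallback_enabled=env.get("FALLBACK_ENABLED", "true").lower() == "true",
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```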

## Testing Strategy

1. **Unit tests**: model loading, OpenAI format conversion
2. **Integration tests**: full API request/response cycle
3. **Local testing**: CPU-only mode (the decorator no-ops)
4. **Live testing**: deploy to the Space, test via opencode