# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.

**Key Features:**

- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF token authentication required
- SSE streaming support

## Architecture

```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)     │
│  (client)   │     │                                              │
└─────────────┘     │  ┌────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)        │  │
                    │  │  └─ /v1/chat/completions               │  │
                    │  │     └─ auth_middleware (HF token)      │  │
                    │  │        └─ inference_router             │  │
                    │  │           ├─ ZeroGPU (@spaces.GPU)     │  │
                    │  │           └─ HF Serverless (fallback)  │  │
                    │  └────────────────────────────────────────┘  │
                    │                                              │
                    │  ┌──────────────┐  ┌──────────────────────┐  │
                    │  │ models.py    │  │ openai_compat.py     │  │
                    │  │ - load/unload│  │ - request/response   │  │
                    │  │ - quantize   │  │ - streaming format   │  │
                    │  └──────────────┘  └──────────────────────┘  │
                    └──────────────────────────────────────────────┘
```

## Development Commands

### Local Development (CPU/Mock Mode)

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py

# Run with a specific port
gradio app.py --server-port 7860
```

### Testing

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```

### API Testing

```bash
# Test the chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

### Deployment

```bash
# Push to HuggingFace Space (after git remote setup)
git push hf main

# Or use the HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```

## Key Files

| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |

## ZeroGPU Patterns

### GPU Decorator Usage

```python
import spaces

# Standard inference (60s default)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```

### Model Loading Pattern

```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id

    if model_id != current_model_id:
        # Clean up the previous model before loading a new one
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()

        # Load the new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        current_model_id = model_id

    return generate(current_model, prompt)
```

## Important Constraints

1. **ZeroGPU Compatibility**
   - `torch.compile` NOT supported; use PyTorch AoT instead
   - Gradio SDK only (no Streamlit)
   - GPU allocated only during `@spaces.GPU`-decorated functions

2. **Memory Management**
   - H200 provides ~70GB VRAM
   - 70B models require INT4 quantization
   - Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`

3. **Quota Awareness**
   - PRO plan: 25 min/day of H200 compute
   - Track usage and fall back to HF Serverless when the quota is exhausted
   - Shorter `duration` values get higher queue priority

4. **Authentication**
   - All API requests require an `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (*the Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |

## Testing Strategy

1. **Unit Tests**: Model loading, OpenAI format conversion
2. **Integration Tests**: Full API request/response cycle
3. **Local Testing**: CPU-only mode (decorator no-ops)
4. **Live Testing**: Deploy to the Space, test via opencode
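
## Appendix: Streaming Format Sketch

The SSE streaming handled by `openai_compat.py` follows the OpenAI convention: each chunk is a `data: {json}` line carrying a `chat.completion.chunk` object, terminated by a `data: [DONE]` sentinel. A minimal sketch of that conversion — function names here are illustrative, not the module's actual API:

```python
import json
import time
import uuid


def format_sse_chunk(model_id: str, delta_text: str, finish_reason=None) -> str:
    """Wrap one token delta in an OpenAI-style chat.completion.chunk SSE line."""
    chunk = {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{
            "index": 0,
            "delta": {"content": delta_text} if delta_text else {},
            "finish_reason": finish_reason,
        }],
    }
    return f"data: {json.dumps(chunk)}\n\n"


def stream_completion(model_id: str, tokens):
    """Yield one SSE line per generated token, then the terminal [DONE] sentinel."""
    for tok in tokens:
        yield format_sse_chunk(model_id, tok)
    # Final chunk carries an empty delta and the finish reason
    yield format_sse_chunk(model_id, "", finish_reason="stop")
    yield "data: [DONE]\n\n"
```

In the real endpoint the token iterator would come from the model's streamer; here a plain list works for local testing, e.g. `list(stream_completion("mistralai/Mistral-7B-Instruct-v0.3", ["Hel", "lo"]))` ends with the `[DONE]` line.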
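
## Appendix: Quota Tracking Sketch

The quota awareness described above (the 25 min/day PRO budget with HF Serverless fallback, tracked in `config.py`) can be sketched as a daily counter that the inference router consults before taking the `@spaces.GPU` path. Class and method names are hypothetical:

```python
import time

DAILY_BUDGET_SECONDS = 25 * 60  # PRO plan: 25 min/day of H200 compute


class QuotaTracker:
    """Tracks ZeroGPU seconds used today; routes to Serverless when exhausted."""

    def __init__(self, budget: int = DAILY_BUDGET_SECONDS):
        self.budget = budget
        self.used = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def _roll_day(self):
        # Reset the counter when the date changes
        today = time.strftime("%Y-%m-%d")
        if today != self.day:
            self.day = today
            self.used = 0.0

    def record(self, seconds: float):
        """Record GPU seconds actually consumed by a completed call."""
        self._roll_day()
        self.used += seconds

    def use_zerogpu(self, estimated_seconds: float) -> bool:
        """True if the call fits in today's budget; otherwise fall back to Serverless."""
        self._roll_day()
        return self.used + estimated_seconds <= self.budget
```

The router would call `tracker.use_zerogpu(duration)` with the same duration passed to `@spaces.GPU`, and `tracker.record(...)` after each GPU call returns.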