# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.

**Key Features:**

- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF token authentication required
- SSE streaming support
## Architecture

```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)     │
│  (client)   │     │                                              │
└─────────────┘     │  ┌────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)        │  │
                    │  │  └─ /v1/chat/completions               │  │
                    │  │      └─ auth_middleware (HF token)     │  │
                    │  │      └─ inference_router               │  │
                    │  │          ├─ ZeroGPU (@spaces.GPU)      │  │
                    │  │          └─ HF Serverless (fallback)   │  │
                    │  └────────────────────────────────────────┘  │
                    │                                              │
                    │  ┌──────────────┐  ┌───────────────────────┐ │
                    │  │ models.py    │  │ openai_compat.py      │ │
                    │  │ - load/unload│  │ - request/response    │ │
                    │  │ - quantize   │  │ - streaming format    │ │
                    │  └──────────────┘  └───────────────────────┘ │
                    └──────────────────────────────────────────────┘
```
## Development Commands

### Local Development (CPU/Mock Mode)

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py

# Run on a specific port
GRADIO_SERVER_PORT=7860 python app.py
```
### Testing

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```
### API Testing

```bash
# Test the chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
### Deployment

```bash
# Push to the HuggingFace Space (after git remote setup)
git push hf main

# Or use the HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```
## Key Files

| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |
## ZeroGPU Patterns

### GPU Decorator Usage

```python
import spaces

# Standard inference (60 s default duration)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```
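On a CPU-only dev machine the `spaces` package may not be installed at all. A minimal sketch of an import-time shim that keeps the module importable in that case (the try/except shape is an assumption; inside a Space, ZeroGPU provides the real decorator, which itself no-ops without a GPU):

```python
# Sketch: fall back to a no-op @spaces.GPU when the package is missing,
# so the same code runs locally and inside the Space.
try:
    import spaces
except ImportError:
    class _NoOpSpaces:
        @staticmethod
        def GPU(func=None, *, duration=None):
            # Support both bare @spaces.GPU and @spaces.GPU(duration=...)
            if callable(func):
                return func
            return lambda f: f
    spaces = _NoOpSpaces()

@spaces.GPU(duration=30)
def generate(prompt):
    return prompt.upper()

print(generate("hello"))  # → HELLO
```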
### Model Loading Pattern

```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id

    if model_id != current_model_id:
        # Clean up the previous model before loading a new one
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()

        # Load the new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        current_model_id = model_id

    return generate(current_model, prompt)
```
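The caching logic above can be factored into a standalone helper. A sketch with a fake loader standing in for `AutoModelForCausalLM.from_pretrained`, so it runs without a GPU (`SingleSlotCache` is illustrative, not the actual `models.py` API):

```python
# Sketch: keep at most one model resident; evict on model-ID change.
class SingleSlotCache:
    def __init__(self, loader, evict=lambda model: None):
        self._loader = loader    # stands in for from_pretrained
        self._evict = evict      # e.g. del + gc.collect + empty_cache
        self._model = None
        self._model_id = None
        self.loads = 0           # how many real loads happened

    def get(self, model_id):
        if model_id != self._model_id:
            if self._model is not None:
                self._evict(self._model)
            self._model = self._loader(model_id)
            self._model_id = model_id
            self.loads += 1
        return self._model

cache = SingleSlotCache(loader=lambda mid: f"<model {mid}>")
cache.get("a")   # real load
cache.get("a")   # cache hit
cache.get("b")   # evicts "a", real load
print(cache.loads)  # → 2
```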
## Important Constraints

1. **ZeroGPU Compatibility**
   - `torch.compile` is NOT supported - use PyTorch ahead-of-time (AoT) compilation instead
   - Gradio SDK only (no Streamlit)
   - GPU is allocated only during `@spaces.GPU`-decorated functions
2. **Memory Management**
   - The H200 provides ~70 GB of VRAM
   - 70B models require INT4 quantization
   - Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`
3. **Quota Awareness**
   - PRO plan: 25 min/day of H200 compute
   - Track usage and fall back to HF Serverless when exhausted
   - Shorter `duration` = higher queue priority
4. **Authentication**
   - All API requests require an `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API
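A minimal sketch of the quota-tracking side of constraint 3, using the 25-minute figure above (the class and method names are illustrative, not `config.py`'s actual API):

```python
# Sketch: decide when to route to HF Serverless instead of ZeroGPU.
DAILY_QUOTA_SECONDS = 25 * 60  # PRO plan: 25 min/day of H200 compute

class QuotaTracker:
    def __init__(self, quota=DAILY_QUOTA_SECONDS):
        self.quota = quota
        self.used = 0.0

    def record(self, seconds):
        """Add a completed GPU call's duration to today's usage."""
        self.used += seconds

    def should_fallback(self, expected_seconds):
        """True when the next call would exceed today's GPU budget."""
        return self.used + expected_seconds > self.quota

tracker = QuotaTracker()
tracker.record(24 * 60)              # 24 minutes already spent today
print(tracker.should_fallback(120))  # → True (would exceed 25 min)
print(tracker.should_fallback(30))   # → False
```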
## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (*the Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |
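A sketch of how `config.py` might read these variables (the `env_bool` helper and its parsing rules are assumptions; the defaults mirror the table):

```python
# Sketch: environment-variable parsing with the table's defaults.
import os

def env_bool(name, default=True):
    """Parse a boolean env var ("true"/"false", case-insensitive)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

HF_TOKEN = os.environ.get("HF_TOKEN")            # optional; Space has its own
FALLBACK_ENABLED = env_bool("FALLBACK_ENABLED")  # default: true
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
```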
## Testing Strategy

1. **Unit Tests**: Model loading, OpenAI format conversion
2. **Integration Tests**: Full API request/response cycle
3. **Local Testing**: CPU-only mode (decorator no-ops)
4. **Live Testing**: Deploy to the Space, test via opencode
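A unit test in the style of item 1 might look like this sketch (`make_chunk` is hypothetical; it just illustrates the OpenAI streaming-chunk shape the conversion layer must produce):

```python
# Sketch: test the shape of a streaming chunk, pytest-style.
def make_chunk(model, content):
    """Hypothetical stand-in for an openai_compat.py chunk builder."""
    return {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [
            {"index": 0, "delta": {"content": content}, "finish_reason": None}
        ],
    }

def test_chunk_shape():
    chunk = make_chunk("mistralai/Mistral-7B-Instruct-v0.3", "Hi")
    assert chunk["object"] == "chat.completion.chunk"
    assert chunk["choices"][0]["delta"]["content"] == "Hi"

test_chunk_shape()
```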