# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.
**Key Features:**
- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF Token authentication required
- SSE streaming support
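Because the endpoint speaks the OpenAI chat-completions dialect, any OpenAI-style client payload works. A minimal sketch of what a client sends (the Space URL below is inferred from the repo name, not confirmed):

```python
import json

# Assumed public URL for the Space; verify against the actual deployment.
BASE_URL = "https://serenichron-opencode-zerogpu.hf.space/v1"

def build_chat_request(model: str, user_message: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completion payload; the model ID passes through to HF."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }

payload = build_chat_request("mistralai/Mistral-7B-Instruct-v0.3", "Hello!", stream=True)
body = json.dumps(payload)  # POST this to f"{BASE_URL}/chat/completions"
```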
## Architecture
```
┌─────────────┐     ┌────────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)       │
│  (client)   │     │                                                │
└─────────────┘     │  ┌──────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)          │  │
                    │  │  ├─ /v1/chat/completions                 │  │
                    │  │  ├─ auth_middleware (HF token)           │  │
                    │  │  └─ inference_router                     │  │
                    │  │      ├─ ZeroGPU (@spaces.GPU)            │  │
                    │  │      └─ HF Serverless (fallback)         │  │
                    │  └──────────────────────────────────────────┘  │
                    │                                                │
                    │  ┌──────────────┐  ┌─────────────────────┐     │
                    │  │ models.py    │  │ openai_compat.py    │     │
                    │  │ - load/unload│  │ - request/response  │     │
                    │  │ - quantize   │  │ - streaming format  │     │
                    │  └──────────────┘  └─────────────────────┘     │
                    └────────────────────────────────────────────────┘
```
## Development Commands
### Local Development (CPU/Mock Mode)
```bash
# Install dependencies
pip install -r requirements.txt
# Run locally (ZeroGPU decorator no-ops)
python app.py
# Run with hot reload on a specific port
GRADIO_SERVER_PORT=7860 gradio app.py
```
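One way to keep `app.py` importable on machines without the `spaces` package is a no-op stub. This is a sketch, not necessarily what `app.py` does; the real decorator also accepts a `duration` argument:

```python
try:
    import spaces  # real decorator, available on the Space
except ImportError:
    class _SpacesStub:
        """Local fallback so @spaces.GPU becomes a no-op on CPU-only machines."""
        @staticmethod
        def GPU(func=None, duration=None):
            if callable(func):       # used as bare @spaces.GPU
                return func
            return lambda f: f       # used as @spaces.GPU(duration=...)
    spaces = _SpacesStub()

@spaces.GPU
def echo(prompt):
    return prompt

@spaces.GPU(duration=120)
def shout(prompt):
    return prompt.upper()
```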
### Testing
```bash
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_openai_compat.py -v
# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```
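A unit test for the format-conversion layer might look like the sketch below; `make_chat_completion` is a stand-in helper, not necessarily the name `openai_compat.py` actually exposes:

```python
import time
import uuid

def make_chat_completion(model: str, text: str) -> dict:
    """Stand-in for openai_compat.py's response builder (hypothetical name)."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }

def test_chat_completion_shape():
    resp = make_chat_completion("mistralai/Mistral-7B-Instruct-v0.3", "Hi!")
    assert resp["object"] == "chat.completion"
    assert resp["choices"][0]["message"]["content"] == "Hi!"
```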
### API Testing
```bash
# Test chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $HF_TOKEN" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
```
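With `"stream": true` the endpoint returns Server-Sent Events in the OpenAI chunk format. A minimal client-side parser (a sketch, not the Space's own code):

```python
import json

def parse_sse_chunks(lines):
    """Collect OpenAI-style SSE `data:` lines into chunk dicts, stopping at [DONE]."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunks.append(json.loads(data))
    return chunks

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"].get("content", "") for c in parse_sse_chunks(sample))
# text == "Hello"
```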
### Deployment
```bash
# Push to HuggingFace Space (after git remote setup)
git push hf main
# Or use HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```
## Key Files
| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |
## ZeroGPU Patterns
### GPU Decorator Usage
```python
import spaces

# Standard inference (60s default)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```
### Model Loading Pattern
```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id
    if model_id != current_model_id:
        # Free the previous model before loading a new one
        if current_model is not None:
            del current_model
            current_model = None
            gc.collect()
            torch.cuda.empty_cache()
        # Load the requested model onto the GPU
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        current_model_id = model_id
    return generate(current_model, prompt)
```
## Important Constraints
1. **ZeroGPU Compatibility**
- `torch.compile` NOT supported - use PyTorch AoT instead
- Gradio SDK only (no Streamlit)
- GPU allocated only during `@spaces.GPU` decorated functions
2. **Memory Management**
- H200 provides ~70GB VRAM
- 70B models require INT4 quantization
- Always cleanup with `gc.collect()` and `torch.cuda.empty_cache()`
3. **Quota Awareness**
- PRO plan: 25 min/day H200 compute
- Track usage, fall back to HF Serverless when exhausted
- Shorter `duration` = higher queue priority
4. **Authentication**
- All API requests require `Authorization: Bearer hf_...` header
- Validate tokens via HuggingFace Hub API
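The quota logic in constraint 3 can be sketched as a small tracker. The names and exact accounting here are assumptions; the real logic lives in `config.py`:

```python
DAILY_QUOTA_SECONDS = 25 * 60  # PRO plan: 25 min/day of H200 compute

class QuotaTracker:
    """Minimal sketch of local quota accounting with a Serverless fallback."""

    def __init__(self, quota: float = DAILY_QUOTA_SECONDS):
        self.quota = quota
        self.used = 0.0

    def record(self, seconds: float) -> None:
        """Add one GPU allocation's duration to today's usage."""
        self.used += seconds

    @property
    def exhausted(self) -> bool:
        return self.used >= self.quota

    def backend(self) -> str:
        """Route to ZeroGPU until quota runs out, then to HF Serverless."""
        return "hf-serverless" if self.exhausted else "zerogpu"

tracker = QuotaTracker()
tracker.record(24 * 60)            # 24 minutes used
assert tracker.backend() == "zerogpu"
tracker.record(120)                # 2 more minutes -> over quota
assert tracker.backend() == "hf-serverless"
```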
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (* Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
| `LOG_LEVEL` | No | Logging verbosity (default: INFO) |
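A minimal reading of these variables, in the spirit of `config.py` (the function name is illustrative):

```python
import os

def get_settings() -> dict:
    """Read Space settings from the environment with the documented defaults."""
    return {
        "hf_token": os.environ.get("HF_TOKEN"),  # optional; the Space has its own
        "fallback_enabled": os.environ.get("FALLBACK_ENABLED", "true").lower() == "true",
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }
```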
## Testing Strategy
1. **Unit Tests**: Model loading, OpenAI format conversion
2. **Integration Tests**: Full API request/response cycle
3. **Local Testing**: CPU-only mode (decorator no-ops)
4. **Live Testing**: Deploy to Space, test via opencode