# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A HuggingFace ZeroGPU Space that serves as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.

**Key Features:**
- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF Token authentication required
- SSE streaming support

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  opencode   │────▢│  serenichron/opencode-zerogpu (HF Space)    β”‚
β”‚  (client)   β”‚     β”‚                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
                    β”‚  β”‚ app.py (Gradio + FastAPI mount)        β”‚  β”‚
                    β”‚  β”‚  └─ /v1/chat/completions               β”‚  β”‚
                    β”‚  β”‚      └─ auth_middleware (HF token)     β”‚  β”‚
                    β”‚  β”‚      └─ inference_router               β”‚  β”‚
                    β”‚  β”‚           β”œβ”€ ZeroGPU (@spaces.GPU)     β”‚  β”‚
                    β”‚  β”‚           └─ HF Serverless (fallback)  β”‚  β”‚
                    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
                    β”‚                                              β”‚
                    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                    β”‚  β”‚ models.py    β”‚  β”‚ openai_compat.py      β”‚ β”‚
                    β”‚  β”‚ - load/unloadβ”‚  β”‚ - request/response    β”‚ β”‚
                    β”‚  β”‚ - quantize   β”‚  β”‚ - streaming format    β”‚ β”‚
                    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
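
The mount itself follows Gradio's standard `mount_gradio_app` pattern; a minimal sketch (the stub handler and status page stand in for the real `auth_middleware`/`inference_router` wiring, which lives in `app.py`):

```python
import gradio as gr
import uvicorn
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()  # OpenAI-format body; parsed but unused in this stub
    # auth_middleware + inference_router would run here
    return {"id": "stub", "object": "chat.completion", "choices": []}

with gr.Blocks() as demo:
    gr.Markdown("opencode-zerogpu status page")

# Mount the Gradio UI on the same FastAPI app; the /v1 routes stay reachable
app = gr.mount_gradio_app(app, demo, path="/")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```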

## Development Commands

### Local Development (CPU/Mock Mode)
```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (the @spaces.GPU decorator no-ops outside a Space)
python app.py

# Run on a specific port
GRADIO_SERVER_PORT=7860 python app.py
```

### Testing
```bash
# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```

### API Testing
```bash
# Test chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
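
The endpoint also works with the official `openai` Python client by pointing `base_url` at the server (the `.hf.space` URL below follows the usual `<owner>-<name>.hf.space` convention for the deployed Space):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:7860/v1",  # or https://serenichron-opencode-zerogpu.hf.space/v1
    api_key="hf_...",  # the HF token is passed through as the Bearer token
)

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```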

### Deployment
```bash
# Push to HuggingFace Space (after git remote setup)
git push hf main

# Or use HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```

## Key Files

| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |
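
For reference, `openai_compat.py`'s streaming side has to emit the OpenAI SSE wire format: one `data: <json>` event per chunk, blank-line separated, terminated by `data: [DONE]`. A minimal serializer sketch (names are illustrative, not the module's actual exports):

```python
import json
import time

def sse_chunk(request_id: str, model: str, delta_text: str) -> str:
    """Serialize one token delta as an OpenAI-style chat.completion.chunk SSE event."""
    chunk = {
        "id": request_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "delta": {"content": delta_text}, "finish_reason": None}],
    }
    return f"data: {json.dumps(chunk)}\n\n"

# Every stream ends with a terminal sentinel:
SSE_DONE = "data: [DONE]\n\n"
```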

## ZeroGPU Patterns

### GPU Decorator Usage
```python
import spaces

# Standard inference (60s default)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input (floor avoids a 0s allocation for tiny requests)
def calc_duration(prompt, max_tokens):
    return max(10, min(120, max_tokens // 10))

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```

### Model Loading Pattern
```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id

    if model_id != current_model_id:
        # Cleanup previous model
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()

        # Load new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        current_model_id = model_id

    return generate(current_model, prompt)
```
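
When a requested model is too large for bf16 on the ~70GB H200 slice (e.g. 70B checkpoints), the same load step can apply INT4 quantization. A sketch, assuming `bitsandbytes` is installed (not necessarily how `models.py` does it):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_quantized(model_id: str):
    """Load a large model in 4-bit NF4 so it fits the H200's ~70GB VRAM."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
```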

## Important Constraints

1. **ZeroGPU Compatibility**
   - `torch.compile` is NOT supported; use PyTorch ahead-of-time (AoT) compilation instead
   - Gradio SDK only (no Streamlit)
   - GPU allocated only during `@spaces.GPU` decorated functions

2. **Memory Management**
   - H200 provides ~70GB VRAM
   - 70B models require INT4 quantization
   - Always cleanup with `gc.collect()` and `torch.cuda.empty_cache()`

3. **Quota Awareness**
   - PRO plan: 25 min/day H200 compute
   - Track usage, fall back to HF Serverless when exhausted
   - Shorter `duration` = higher queue priority

4. **Authentication**
   - All API requests require `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API (see the sketch after this list)
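
A minimal token check against the Hub, assuming `huggingface_hub` is available (the function name is illustrative):

```python
from huggingface_hub import whoami
from huggingface_hub.utils import HfHubHTTPError

def validate_token(authorization: str) -> bool:
    """Check a Bearer token against the HF Hub; whoami raises on a bad token."""
    if not authorization.startswith("Bearer "):
        return False
    token = authorization.removeprefix("Bearer ")
    try:
        whoami(token=token)  # raises HfHubHTTPError on an invalid/expired token
        return True
    except HfHubHTTPError:
        return False
```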

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (* Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
| `LOG_LEVEL` | No | Logging verbosity (default: INFO) |
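
Tying `FALLBACK_ENABLED` to the quota constraint above, the routing decision can be as simple as counting GPU seconds per day and diverting to `InferenceClient` once the budget is spent. A sketch under those assumptions (`zero_gpu_generate` and the tracker are illustrative, not `config.py`'s actual implementation):

```python
import os
import time
from huggingface_hub import InferenceClient

DAILY_BUDGET_S = 25 * 60  # PRO plan: 25 min/day of H200 compute

class QuotaTracker:
    def __init__(self):
        self.day = time.strftime("%Y-%m-%d")
        self.used_s = 0.0

    def record(self, seconds: float):
        today = time.strftime("%Y-%m-%d")
        if today != self.day:  # reset the counter at day rollover
            self.day, self.used_s = today, 0.0
        self.used_s += seconds

    def exhausted(self) -> bool:
        return self.used_s >= DAILY_BUDGET_S

quota = QuotaTracker()

def route(model_id, messages):
    if quota.exhausted() and os.getenv("FALLBACK_ENABLED", "true") == "true":
        client = InferenceClient(token=os.environ.get("HF_TOKEN"))
        return client.chat_completion(messages=messages, model=model_id)
    return zero_gpu_generate(model_id, messages)  # hypothetical ZeroGPU path
```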

## Testing Strategy

1. **Unit Tests**: Model loading, OpenAI format conversion
2. **Integration Tests**: Full API request/response cycle
3. **Local Testing**: CPU-only mode (decorator no-ops)
4. **Live Testing**: Deploy to Space, test via opencode
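
As a starting point for the integration tier, a hedged sketch using FastAPI's `TestClient` (it assumes `app.py` exposes the mounted FastAPI instance as `app`, which may not match the actual module layout):

```python
from fastapi.testclient import TestClient

from app import app  # assumption: app.py exports the FastAPI instance as `app`

def test_missing_auth_is_rejected():
    client = TestClient(app)
    resp = client.post(
        "/v1/chat/completions",
        json={"model": "test", "messages": [{"role": "user", "content": "hi"}]},
    )
    assert resp.status_code == 401  # auth middleware should reject missing tokens
```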