# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.

**Key Features:**

- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF token authentication required
- SSE streaming support

## Architecture

```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)     │
│  (client)   │     │                                              │
└─────────────┘     │  ┌────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)        │  │
                    │  │  └─ /v1/chat/completions               │  │
                    │  │     └─ auth_middleware (HF token)      │  │
                    │  │        └─ inference_router             │  │
                    │  │           ├─ ZeroGPU (@spaces.GPU)     │  │
                    │  │           └─ HF Serverless (fallback)  │  │
                    │  └────────────────────────────────────────┘  │
                    │                                              │
                    │  ┌──────────────┐  ┌──────────────────────┐  │
                    │  │ models.py    │  │ openai_compat.py     │  │
                    │  │ - load/unload│  │ - request/response   │  │
                    │  │ - quantize   │  │ - streaming format   │  │
                    │  └──────────────┘  └──────────────────────┘  │
                    └──────────────────────────────────────────────┘
```

## Development Commands

### Local Development (CPU/Mock Mode)

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py

# Run with a specific port
gradio app.py --server-port 7860
```

### Testing

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```

### API Testing

```bash
# Test the chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

### Deployment

```bash
# Push to HuggingFace Space (after git remote setup)
git push hf main

# Or use the HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```

## Key Files

| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |

## ZeroGPU Patterns

### GPU Decorator Usage

```python
import spaces

# Standard inference (60s default)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```

### Model Loading Pattern

```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id

    if model_id != current_model_id:
        # Clean up the previous model before loading a new one
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()

        # Load the new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        current_model_id = model_id

    return generate(current_model, prompt)
```

## Important Constraints

1. **ZeroGPU Compatibility**
   - `torch.compile` NOT supported; use PyTorch AoT instead
   - Gradio SDK only (no Streamlit)
   - GPU allocated only during `@spaces.GPU`-decorated functions

2. **Memory Management**
   - H200 provides ~70GB VRAM
   - 70B models require INT4 quantization
   - Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`

3. **Quota Awareness**
   - PRO plan: 25 min/day of H200 compute
   - Track usage and fall back to HF Serverless when the quota is exhausted
   - Shorter `duration` values get higher queue priority

4. **Authentication**
   - All API requests require an `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (*the Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |

## Testing Strategy

1. **Unit Tests**: Model loading, OpenAI format conversion
2. **Integration Tests**: Full API request/response cycle
3. **Local Testing**: CPU-only mode (decorator no-ops)
4. **Live Testing**: Deploy to the Space, test via opencode
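
## Appendix: Streaming Format Sketch

The SSE streaming handled by `openai_compat.py` follows the OpenAI convention: each chunk is a `data: {json}` line carrying a `chat.completion.chunk` object, terminated by a `data: [DONE]` sentinel. A minimal sketch of that conversion — function names here are illustrative, not the module's actual API:

```python
import json
import time
import uuid


def format_sse_chunk(model_id: str, delta_text: str, finish_reason=None) -> str:
    """Wrap one token delta in an OpenAI-style chat.completion.chunk SSE line."""
    chunk = {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{
            "index": 0,
            "delta": {"content": delta_text} if delta_text else {},
            "finish_reason": finish_reason,
        }],
    }
    return f"data: {json.dumps(chunk)}\n\n"


def stream_completion(model_id: str, tokens):
    """Yield one SSE line per generated token, then the terminal [DONE] sentinel."""
    for tok in tokens:
        yield format_sse_chunk(model_id, tok)
    # Final chunk carries an empty delta and the finish reason
    yield format_sse_chunk(model_id, "", finish_reason="stop")
    yield "data: [DONE]\n\n"
```

In the real endpoint the token iterator would come from the model's streamer; here a plain list works for local testing, e.g. `list(stream_completion("mistralai/Mistral-7B-Instruct-v0.3", ["Hel", "lo"]))` ends with the `[DONE]` line.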
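
## Appendix: Quota Tracking Sketch

The quota awareness described above (the 25 min/day PRO budget with HF Serverless fallback, tracked in `config.py`) can be sketched as a daily counter that the inference router consults before taking the `@spaces.GPU` path. Class and method names are hypothetical:

```python
import time

DAILY_BUDGET_SECONDS = 25 * 60  # PRO plan: 25 min/day of H200 compute


class QuotaTracker:
    """Tracks ZeroGPU seconds used today; routes to Serverless when exhausted."""

    def __init__(self, budget: int = DAILY_BUDGET_SECONDS):
        self.budget = budget
        self.used = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def _roll_day(self):
        # Reset the counter when the date changes
        today = time.strftime("%Y-%m-%d")
        if today != self.day:
            self.day = today
            self.used = 0.0

    def record(self, seconds: float):
        """Record GPU seconds actually consumed by a completed call."""
        self._roll_day()
        self.used += seconds

    def use_zerogpu(self, estimated_seconds: float) -> bool:
        """True if the call fits in today's budget; otherwise fall back to Serverless."""
        self._roll_day()
        return self.used + estimated_seconds <= self.budget
```

The router would call `tracker.use_zerogpu(duration)` with the same duration passed to `@spaces.GPU`, and `tracker.record(...)` after each GPU call returns.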