# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.

**Key Features:**

- OpenAI-compatible `/v1/chat/completions` endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with HF Serverless fallback
- HF token authentication required
- SSE streaming support
## Architecture

```
┌─────────────┐     ┌──────────────────────────────────────────────┐
│  opencode   │────▶│  serenichron/opencode-zerogpu (HF Space)     │
│  (client)   │     │                                              │
└─────────────┘     │  ┌────────────────────────────────────────┐  │
                    │  │ app.py (Gradio + FastAPI mount)        │  │
                    │  │  └─ /v1/chat/completions               │  │
                    │  │      └─ auth_middleware (HF token)     │  │
                    │  │      └─ inference_router               │  │
                    │  │          ├─ ZeroGPU (@spaces.GPU)      │  │
                    │  │          └─ HF Serverless (fallback)   │  │
                    │  └────────────────────────────────────────┘  │
                    │                                              │
                    │  ┌──────────────┐  ┌───────────────────────┐ │
                    │  │ models.py    │  │ openai_compat.py      │ │
                    │  │ - load/unload│  │ - request/response    │ │
                    │  │ - quantize   │  │ - streaming format    │ │
                    │  └──────────────┘  └───────────────────────┘ │
                    └──────────────────────────────────────────────┘
```
## Development Commands

### Local Development (CPU/Mock Mode)

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (ZeroGPU decorator no-ops)
python app.py

# Run on a specific port
GRADIO_SERVER_PORT=7860 python app.py
```
### Testing

```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_openai_compat.py -v

# Run with coverage
pytest tests/ --cov=. --cov-report=term-missing
```
### API Testing

```bash
# Test the chat completions endpoint
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
### Deployment

```bash
# Push to the HuggingFace Space (after git remote setup)
git push hf main

# Or use the HF CLI
huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
```
## Key Files

| File | Purpose |
|------|---------|
| `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
| `models.py` | Model loading, unloading, quantization, caching |
| `openai_compat.py` | OpenAI request/response format conversion |
| `config.py` | Environment variables, settings, quota tracking |
| `README.md` | HF Space config (YAML frontmatter) + documentation |
## ZeroGPU Patterns

### GPU Decorator Usage

```python
import spaces

# Standard inference (60 s default duration)
@spaces.GPU
def generate(prompt, model_id):
    ...

# Extended duration for large models
@spaces.GPU(duration=120)
def generate_large(prompt, model_id):
    ...

# Dynamic duration based on input
def calc_duration(prompt, max_tokens):
    return min(120, max_tokens // 10)

@spaces.GPU(duration=calc_duration)
def generate_dynamic(prompt, max_tokens):
    ...
```
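On a CPU-only dev machine the `spaces` package may not be installed at all. A minimal sketch of an import-time shim that keeps the module importable in that case (the try/except shape is an assumption; inside a Space, ZeroGPU provides the real decorator, which itself no-ops without a GPU):

```python
# Sketch: fall back to a no-op @spaces.GPU when the package is missing,
# so the same code runs locally and inside the Space.
try:
    import spaces
except ImportError:
    class _NoOpSpaces:
        @staticmethod
        def GPU(func=None, *, duration=None):
            # Support both bare @spaces.GPU and @spaces.GPU(duration=...)
            if callable(func):
                return func
            return lambda f: f
    spaces = _NoOpSpaces()

@spaces.GPU(duration=30)
def generate(prompt):
    return prompt.upper()

print(generate("hello"))  # → HELLO
```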
### Model Loading Pattern

```python
import gc

import spaces
import torch
from transformers import AutoModelForCausalLM

current_model = None
current_model_id = None

@spaces.GPU
def load_and_generate(model_id, prompt):
    global current_model, current_model_id

    if model_id != current_model_id:
        # Clean up the previous model before loading a new one
        if current_model is not None:
            del current_model
            gc.collect()
            torch.cuda.empty_cache()

        # Load the new model
        current_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        current_model_id = model_id

    return generate(current_model, prompt)
```
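The caching logic above can be factored into a standalone helper. A sketch with a fake loader standing in for `AutoModelForCausalLM.from_pretrained`, so it runs without a GPU (`SingleSlotCache` is illustrative, not the actual `models.py` API):

```python
# Sketch: keep at most one model resident; evict on model-ID change.
class SingleSlotCache:
    def __init__(self, loader, evict=lambda model: None):
        self._loader = loader    # stands in for from_pretrained
        self._evict = evict      # e.g. del + gc.collect + empty_cache
        self._model = None
        self._model_id = None
        self.loads = 0           # how many real loads happened

    def get(self, model_id):
        if model_id != self._model_id:
            if self._model is not None:
                self._evict(self._model)
            self._model = self._loader(model_id)
            self._model_id = model_id
            self.loads += 1
        return self._model

cache = SingleSlotCache(loader=lambda mid: f"<model {mid}>")
cache.get("a")   # real load
cache.get("a")   # cache hit
cache.get("b")   # evicts "a", real load
print(cache.loads)  # → 2
```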
## Important Constraints

1. **ZeroGPU Compatibility**
   - `torch.compile` is NOT supported - use PyTorch ahead-of-time (AoT) compilation instead
   - Gradio SDK only (no Streamlit)
   - GPU is allocated only during `@spaces.GPU`-decorated functions
2. **Memory Management**
   - The H200 provides ~70 GB of VRAM
   - 70B models require INT4 quantization
   - Always clean up with `gc.collect()` and `torch.cuda.empty_cache()`
3. **Quota Awareness**
   - PRO plan: 25 min/day of H200 compute
   - Track usage and fall back to HF Serverless when exhausted
   - Shorter `duration` = higher queue priority
4. **Authentication**
   - All API requests require an `Authorization: Bearer hf_...` header
   - Validate tokens via the HuggingFace Hub API
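A minimal sketch of the quota-tracking side of constraint 3, using the 25-minute figure above (the class and method names are illustrative, not `config.py`'s actual API):

```python
# Sketch: decide when to route to HF Serverless instead of ZeroGPU.
DAILY_QUOTA_SECONDS = 25 * 60  # PRO plan: 25 min/day of H200 compute

class QuotaTracker:
    def __init__(self, quota=DAILY_QUOTA_SECONDS):
        self.quota = quota
        self.used = 0.0

    def record(self, seconds):
        """Add a completed GPU call's duration to today's usage."""
        self.used += seconds

    def should_fallback(self, expected_seconds):
        """True when the next call would exceed today's GPU budget."""
        return self.used + expected_seconds > self.quota

tracker = QuotaTracker()
tracker.record(24 * 60)              # 24 minutes already spent today
print(tracker.should_fallback(120))  # → True (would exceed 25 min)
print(tracker.should_fallback(30))   # → False
```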
## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for accessing gated models (*the Space has its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |
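A sketch of how `config.py` might read these variables (the `env_bool` helper and its parsing rules are assumptions; the defaults mirror the table):

```python
# Sketch: environment-variable parsing with the table's defaults.
import os

def env_bool(name, default=True):
    """Parse a boolean env var ("true"/"false", case-insensitive)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

HF_TOKEN = os.environ.get("HF_TOKEN")            # optional; Space has its own
FALLBACK_ENABLED = env_bool("FALLBACK_ENABLED")  # default: true
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
```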
## Testing Strategy

1. **Unit Tests**: Model loading, OpenAI format conversion
2. **Integration Tests**: Full API request/response cycle
3. **Local Testing**: CPU-only mode (decorator no-ops)
4. **Live Testing**: Deploy to the Space, test via opencode
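A unit test in the style of item 1 might look like this sketch (`make_chunk` is hypothetical; it just illustrates the OpenAI streaming-chunk shape the conversion layer must produce):

```python
# Sketch: test the shape of a streaming chunk, pytest-style.
def make_chunk(model, content):
    """Hypothetical stand-in for an openai_compat.py chunk builder."""
    return {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [
            {"index": 0, "delta": {"content": content}, "finish_reason": None}
        ],
    }

def test_chunk_shape():
    chunk = make_chunk("mistralai/Mistral-7B-Instruct-v0.3", "Hi")
    assert chunk["object"] == "chat.completion.chunk"
    assert chunk["choices"][0]["delta"]["content"] == "Hi"

test_chunk_shape()
```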