---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---
# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode), powered by HuggingFace ZeroGPU (NVIDIA H200).

## Features

- **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - Use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - Falls back to HF Serverless when the quota is exhausted
- **SSE streaming** - Real-time token streaming support
- **Authentication** - Requires a valid HuggingFace token

## API Endpoint

```
POST /v1/chat/completions
```

### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```

### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
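Putting the request body and headers together, a minimal client sketch looks like the following (stdlib only; the base URL is the Space endpoint from this README, and `hf_YOUR_TOKEN` is a placeholder for a real token):

```python
import json
import urllib.request

# Space endpoint from this README.
BASE_URL = "https://serenichron-opencode-zerogpu.hf.space/v1"

def build_chat_request(token, model, messages, stream=False, **params):
    """Build an OpenAI-style chat-completions request for the endpoint above."""
    body = {"model": model, "messages": messages, "stream": stream, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a plain `urllib.request.urlopen(req)`; with `stream=True` the response arrives as SSE chunks instead of a single JSON body.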
## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:

```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": {
          "name": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "mistral-7b": {
          "name": "mistralai/Mistral-7B-Instruct-v0.3"
        },
        "qwen-7b": {
          "name": "Qwen/Qwen2.5-7B-Instruct"
        },
        "qwen-14b": {
          "name": "Qwen/Qwen2.5-14B-Instruct"
        }
      }
    }
  }
}
```

Then use `/models` in opencode to select a zerogpu model.
## Supported Models

Any HuggingFace model that fits in ~70GB VRAM. Examples:

| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |

Models larger than 34B are automatically quantized to INT4.
## VRAM Guidelines

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |

*70B models require INT4 quantization. Add ~20% overhead for KV cache.*
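The table rows follow a simple rule of thumb: weights take roughly `params × bits / 8` bytes, and the KV cache adds about 20% on top. A quick sketch of that arithmetic:

```python
def weight_vram_gb(params_b, bits):
    """Approximate VRAM for model weights alone: billions of params x bits/8 (GB)."""
    return params_b * bits / 8

def total_vram_gb(params_b, bits, kv_overhead=0.20):
    """Weights plus the ~20% KV-cache overhead noted above."""
    return weight_vram_gb(params_b, bits) * (1 + kv_overhead)
```

For example, `weight_vram_gb(7, 16)` gives 14.0, matching the 7B FP16 row, and `weight_vram_gb(70, 4)` gives 35.0, matching the 70B INT4 row.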
## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get the highest queue priority
- **Fallback**: When the quota is exhausted, requests fall back to the HF Serverless Inference API
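The fallback behavior can be pictured as a try/except around the two backends. This is an illustrative sketch, not the Space's actual internals; the names here are hypothetical:

```python
class QuotaExhausted(Exception):
    """Raised when the daily ZeroGPU quota is used up (hypothetical)."""

def complete(request, zerogpu_backend, serverless_backend):
    """Try ZeroGPU first; fall back to HF Serverless on quota exhaustion."""
    try:
        return zerogpu_backend(request)
    except QuotaExhausted:
        return serverless_backend(request)
```

From the client's point of view the fallback is transparent: the same request succeeds either way, just on slower hardware once the quota runs out.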
## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |
## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu
cd opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (the ZeroGPU decorator is a no-op outside Spaces)
python app.py
```

## Testing

```bash
# Run the test suite
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```
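With `"stream": true`, the endpoint emits OpenAI-style SSE chunks. A minimal parser, assuming the standard `data: {...}` / `data: [DONE]` framing, looks like:

```python
import json

def parse_sse(lines):
    """Yield the content deltas from OpenAI-style SSE chunk lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta
```

Joining the yielded deltas reconstructs the full completion text as it streams in.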
## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (*the Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: `true`) |
| `LOG_LEVEL` | No | Logging verbosity (default: `INFO`) |

## License

MIT