---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---
# OpenCode ZeroGPU Provider
OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode), powered by HuggingFace ZeroGPU (NVIDIA H200).
## Features
- **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - Use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - Falls back to HF Serverless when the quota is exhausted
- **SSE streaming** - Real-time token streaming support
- **Authentication** - Requires valid HuggingFace token
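Streaming responses follow the standard OpenAI SSE wire format (`data: {...}` events terminated by `data: [DONE]`). A minimal client-side parsing sketch, assuming that format:

```python
import json

def parse_sse_stream(lines):
    """Yield content tokens from OpenAI-style SSE lines.

    Assumes each event is a line of the form 'data: <json>' and the
    stream ends with 'data: [DONE]' (standard OpenAI streaming format).
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blanks and keep-alive comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example with a canned stream:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_stream(sample)))  # -> Hello!
```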
## API Endpoint
```
POST /v1/chat/completions
```
### Request Format
```json
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 512,
"stream": true
}
```
### Headers
```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
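The request above can be assembled programmatically. A sketch with a hypothetical helper (`build_chat_request` is illustrative, not part of this Space's code):

```python
import json

BASE_URL = "https://serenichron-opencode-zerogpu.hf.space"  # this Space's URL

def build_chat_request(token, model, messages, stream=False, **params):
    """Build (url, headers, body) for a /v1/chat/completions call."""
    url = f"{BASE_URL}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages,
                       "stream": stream, **params})
    return url, headers, body

url, headers, body = build_chat_request(
    "hf_YOUR_TOKEN",
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Hello!"}],
    temperature=0.7, max_tokens=512,
)
# Send with e.g. urllib.request:
#   req = urllib.request.Request(url, data=body.encode(), headers=headers)
#   resp = urllib.request.urlopen(req)
```

Because the endpoint is OpenAI-compatible, any OpenAI SDK pointed at `BASE_URL + "/v1"` should also work.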
## Usage with opencode
Configure in `~/.config/opencode/opencode.json`:
```json
{
"providers": {
"zerogpu": {
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
"headers": {
"Authorization": "Bearer hf_YOUR_TOKEN"
}
},
"models": {
"llama-8b": {
"name": "meta-llama/Llama-3.1-8B-Instruct"
},
"mistral-7b": {
"name": "mistralai/Mistral-7B-Instruct-v0.3"
},
"qwen-7b": {
"name": "Qwen/Qwen2.5-7B-Instruct"
},
"qwen-14b": {
"name": "Qwen/Qwen2.5-14B-Instruct"
}
}
}
}
}
```
Then use `/models` in opencode to select a zerogpu model.
## Supported Models
Any HuggingFace model that fits in ~70GB VRAM. Examples:
| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |
Models larger than 34B are automatically quantized to INT4.
## VRAM Guidelines
| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |
*70B models require INT4 quantization. Add ~20% overhead for KV cache.*
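The table reduces to a rule of thumb: weight memory ≈ parameters × bytes per weight, plus roughly 20% for KV cache and activations. A hypothetical helper (not part of this Space's code) applying it:

```python
def estimate_vram_gb(params_billion, bits=16, overhead=0.20):
    """Rough VRAM estimate: weights plus ~20% KV-cache/activation overhead."""
    weights_gb = params_billion * (bits / 8)  # 1B params at 8 bits ~= 1 GB
    return weights_gb * (1 + overhead)

for size in (7, 13, 34, 70):
    print(f"{size}B  FP16: {estimate_vram_gb(size):.1f} GB"
          f"  INT4: {estimate_vram_gb(size, bits=4):.1f} GB")
```

For example, a 70B model at INT4 lands around 42 GB including overhead, which is why it fits while FP16 (~168 GB with overhead) does not.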
## Quota Information
- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get highest queue priority
- **Fallback**: When the quota is exhausted, requests fall back to the HF Serverless Inference API
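The quota-and-fallback behaviour amounts to a simple decision: use ZeroGPU while daily quota remains, otherwise route to Serverless. A hypothetical sketch (the Space's actual logic may differ):

```python
def select_backend(quota_seconds_left, fallback_enabled=True):
    """Pick the inference backend based on remaining ZeroGPU quota."""
    if quota_seconds_left > 0:
        return "zerogpu"        # H200; PRO users get highest queue priority
    if fallback_enabled:
        return "hf-serverless"  # HF Serverless Inference API
    raise RuntimeError("ZeroGPU quota exhausted and fallback disabled")

print(select_backend(25 * 60))  # full PRO quota -> zerogpu
print(select_backend(0))        # exhausted -> hf-serverless
```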
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |
## Local Development
```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu
# Install dependencies
pip install -r requirements.txt
# Run locally (the ZeroGPU decorator is a no-op outside Spaces)
python app.py
```
## Testing
```bash
# Run tests
pytest tests/ -v
# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $HF_TOKEN" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
```
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (* Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
| `LOG_LEVEL` | No | Logging verbosity (default: INFO) |
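These variables might be read in `app.py` along the following lines (a sketch assuming the defaults in the table; the actual parsing may differ):

```python
import os

def load_config(env=os.environ):
    """Read the Space's configuration from environment variables."""
    return {
        "hf_token": env.get("HF_TOKEN"),  # optional; the Space supplies its own
        "fallback_enabled": env.get("FALLBACK_ENABLED", "true").lower() == "true",
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }

print(load_config({}))  # defaults: no token, fallback on, INFO logging
```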
## License
MIT