---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---

# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode), powered by HuggingFace ZeroGPU (NVIDIA H200).

## Features

- **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - Use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - Falls back to HF Serverless when quota exhausted
- **SSE streaming** - Real-time token streaming support
- **Authentication** - Requires valid HuggingFace token

## API Endpoint

```
POST /v1/chat/completions
```

### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```

### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
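Outside of opencode, the endpoint can be called directly from Python. The sketch below builds the request payload shown above and posts it with the standard library; the Space URL is the one used in the opencode configuration further down, and the response is assumed to follow the usual OpenAI `choices[0].message.content` shape.

```python
import json
import urllib.request

# Assumed base URL, taken from the opencode configuration example.
SPACE_URL = "https://serenichron-opencode-zerogpu.hf.space"

def build_chat_request(model, user_content, system_content=None,
                       temperature=0.7, max_tokens=512, stream=False):
    """Assemble an OpenAI-compatible chat completion payload."""
    messages = []
    if system_content:
        messages.append({"role": "system", "content": system_content})
    messages.append({"role": "user", "content": user_content})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }

def chat(payload, token, space_url=SPACE_URL):
    """POST a payload to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{space_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat(build_chat_request("mistralai/Mistral-7B-Instruct-v0.3", "Hello!"), token=os.environ["HF_TOKEN"])` sends the same request as the curl example in the Testing section.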

## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:

```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": {
          "name": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "mistral-7b": {
          "name": "mistralai/Mistral-7B-Instruct-v0.3"
        },
        "qwen-7b": {
          "name": "Qwen/Qwen2.5-7B-Instruct"
        },
        "qwen-14b": {
          "name": "Qwen/Qwen2.5-14B-Instruct"
        }
      }
    }
  }
}
```

Then use `/models` in opencode to select a zerogpu model.

## Supported Models

In principle, any HuggingFace model that fits in roughly 70GB of VRAM. Examples:

| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |

Models larger than 34B are automatically quantized to INT4.

## VRAM Guidelines

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |

*70B models require INT4 quantization. Add ~20% overhead for KV cache.*
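The table's rule of thumb is simple arithmetic: parameter count times bytes per weight, plus roughly 20% headroom for the KV cache. A small illustrative helper (not code from this Space):

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion, precision="fp16", kv_overhead=0.20):
    """Estimated VRAM in GB: weights at the given precision plus ~20% KV-cache headroom."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return round(weights_gb * (1 + kv_overhead), 1)
```

For instance, `estimate_vram_gb(70, "int4")` gives 42.0 GB, which fits the ~70GB budget — the reason 70B models run only with INT4 quantization.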

## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get highest queue priority
- **Fallback**: When quota exhausted, falls back to HF Serverless Inference API

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |

## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (the ZeroGPU decorator is a no-op outside Spaces)
python app.py
```

## Testing

```bash
# Run tests
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```
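With `"stream": true`, tokens arrive as Server-Sent Events: lines of the form `data: {...}` terminated by `data: [DONE]`. A minimal consumer for that framing, assuming the standard OpenAI delta-chunk format (`extract_deltas` is an illustrative helper, not part of this repo):

```python
import json

def extract_deltas(sse_lines):
    """Yield content fragments from OpenAI-style SSE chat-completion chunks."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Feed it the raw lines of a streaming response (e.g. line iteration over the HTTP body) and reassemble the reply with `"".join(extract_deltas(lines))`.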

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (* Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
| `LOG_LEVEL` | No | Logging verbosity (default: INFO) |

## License

MIT