---
title: OpenCode ZeroGPU Provider
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.16.1
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
---

# OpenCode ZeroGPU Provider

OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode), powered by HuggingFace ZeroGPU (NVIDIA H200).

## Features

- **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
- **Pass-through model selection** - Use any HuggingFace model ID
- **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
- **Automatic fallback** - Falls back to HF Serverless when quota exhausted
- **SSE streaming** - Real-time token streaming support
- **Authentication** - Requires valid HuggingFace token

## API Endpoint

```
POST /v1/chat/completions
```

### Request Format

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "stream": true
}
```

### Headers

```
Authorization: Bearer hf_YOUR_TOKEN
Content-Type: application/json
```
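Outside of opencode, the endpoint can be called directly from Python. The sketch below builds the request payload shown above and posts it with the standard library; the Space URL is the one used in the opencode configuration further down, and the response is assumed to follow the usual OpenAI `choices[0].message.content` shape.

```python
import json
import urllib.request

# Assumed base URL, taken from the opencode configuration example.
SPACE_URL = "https://serenichron-opencode-zerogpu.hf.space"

def build_chat_request(model, user_content, system_content=None,
                       temperature=0.7, max_tokens=512, stream=False):
    """Assemble an OpenAI-compatible chat completion payload."""
    messages = []
    if system_content:
        messages.append({"role": "system", "content": system_content})
    messages.append({"role": "user", "content": user_content})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }

def chat(payload, token, space_url=SPACE_URL):
    """POST a payload to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{space_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat(build_chat_request("mistralai/Mistral-7B-Instruct-v0.3", "Hello!"), token=os.environ["HF_TOKEN"])` sends the same request as the curl example in the Testing section.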

## Usage with opencode

Configure in `~/.config/opencode/opencode.json`:

```json
{
  "providers": {
    "zerogpu": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
        "headers": {
          "Authorization": "Bearer hf_YOUR_TOKEN"
        }
      },
      "models": {
        "llama-8b": {
          "name": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "mistral-7b": {
          "name": "mistralai/Mistral-7B-Instruct-v0.3"
        },
        "qwen-7b": {
          "name": "Qwen/Qwen2.5-7B-Instruct"
        },
        "qwen-14b": {
          "name": "Qwen/Qwen2.5-14B-Instruct"
        }
      }
    }
  }
}
```

Then use `/models` in opencode to select a zerogpu model.

## Supported Models

In principle, any HuggingFace model that fits in roughly 70GB of VRAM. Examples:

| Model | Size | Quantization |
|-------|------|--------------|
| `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
| `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
| `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
| `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |

Models larger than 34B are automatically quantized to INT4.

## VRAM Guidelines

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|------------|-----------|-----------|-----------|
| 7B | ~14GB | ~7GB | ~3.5GB |
| 13B | ~26GB | ~13GB | ~6.5GB |
| 34B | ~68GB | ~34GB | ~17GB |
| 70B | ~140GB | ~70GB | ~35GB |

*70B models require INT4 quantization. Add ~20% overhead for KV cache.*
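The table's rule of thumb is simple arithmetic: parameter count times bytes per weight, plus roughly 20% headroom for the KV cache. A small illustrative helper (not code from this Space):

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion, precision="fp16", kv_overhead=0.20):
    """Estimated VRAM in GB: weights at the given precision plus ~20% KV-cache headroom."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return round(weights_gb * (1 + kv_overhead), 1)
```

For instance, `estimate_vram_gb(70, "int4")` gives 42.0 GB, which fits the ~70GB budget — the reason 70B models run only with INT4 quantization.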

## Quota Information

- **PRO plan**: 25 minutes/day of H200 GPU compute
- **Priority**: PRO users get highest queue priority
- **Fallback**: When quota exhausted, falls back to HF Serverless Inference API

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/models` | GET | List loaded models |
| `/health` | GET | Health check and quota status |

## Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu

# Install dependencies
pip install -r requirements.txt

# Run locally (the ZeroGPU decorator is a no-op outside Spaces)
python app.py
```

## Testing

```bash
# Run tests
pytest tests/ -v

# Test the API locally
curl -X POST http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```
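With `"stream": true`, tokens arrive as Server-Sent Events: lines of the form `data: {...}` terminated by `data: [DONE]`. A minimal consumer for that framing, assuming the standard OpenAI delta-chunk format (`extract_deltas` is an illustrative helper, not part of this repo):

```python
import json

def extract_deltas(sse_lines):
    """Yield content fragments from OpenAI-style SSE chat-completion chunks."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Feed it the raw lines of a streaming response (e.g. line iteration over the HTTP body) and reassemble the reply with `"".join(extract_deltas(lines))`.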

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | No* | Token for gated models (* Space uses its own token) |
| `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
| `LOG_LEVEL` | No | Logging verbosity (default: INFO) |

## License

MIT