serenichron commited on
Commit
adcb9bd
·
0 Parent(s):

Initial implementation of ZeroGPU OpenCode Provider

Browse files

- OpenAI-compatible /v1/chat/completions endpoint
- Pass-through model selection (any HF model ID)
- ZeroGPU H200 inference with automatic fallback to HF Serverless
- HF Token authentication required
- SSE streaming support
- Automatic INT4 quantization for 70B+ models

Files changed (13) hide show
  1. .env.template +22 -0
  2. .gitignore +60 -0
  3. CLAUDE.md +186 -0
  4. README.md +171 -0
  5. app.py +524 -0
  6. config.py +159 -0
  7. models.py +335 -0
  8. openai_compat.py +269 -0
  9. requirements.txt +32 -0
  10. tests/__init__.py +1 -0
  11. tests/conftest.py +116 -0
  12. tests/test_models.py +150 -0
  13. tests/test_openai_compat.py +263 -0
.env.template ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # HuggingFace ZeroGPU Space - Environment Variables
2
+ # Copy to .env and fill in values
3
+
4
+ # HuggingFace Token (for gated models access)
5
+ # When deployed to HF Space, the Space's own token is used automatically
6
+ # This is mainly for local development with gated models
7
+ HF_TOKEN=
8
+
9
+ # Fallback Configuration
10
+ # Enable HF Serverless Inference API fallback when ZeroGPU quota exhausted
11
+ FALLBACK_ENABLED=true
12
+
13
+ # Logging
14
+ LOG_LEVEL=INFO
15
+
16
+ # Model Loading
17
+ # Default quantization for large models (none, int8, int4)
18
+ DEFAULT_QUANTIZATION=none
19
+
20
+ # Maximum model size to load without quantization (in billions of parameters)
21
+ # Models larger than this are automatically quantized (INT8; models over ~65B use INT4)
22
+ AUTO_QUANTIZE_THRESHOLD_B=34
.gitignore ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Virtual environments
24
+ .env
25
+ .venv
26
+ env/
27
+ venv/
28
+ ENV/
29
+
30
+ # IDE
31
+ .idea/
32
+ .vscode/
33
+ *.swp
34
+ *.swo
35
+ *~
36
+
37
+ # Testing
38
+ .pytest_cache/
39
+ .coverage
40
+ htmlcov/
41
+ .tox/
42
+ .nox/
43
+
44
+ # Gradio
45
+ flagged/
46
+
47
+ # HuggingFace
48
+ *.bin
49
+ *.safetensors
50
+ *.pt
51
+ *.pth
52
+ .cache/
53
+
54
+ # Logs
55
+ *.log
56
+ logs/
57
+
58
+ # OS
59
+ .DS_Store
60
+ Thumbs.db
CLAUDE.md ADDED
@@ -0,0 +1,186 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CLAUDE.md
2
+
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
+
5
+ ## Project Overview
6
+
7
+ HuggingFace ZeroGPU Space serving as an OpenAI-compatible inference provider for opencode. Deployed at `serenichron/opencode-zerogpu`.
8
+
9
+ **Key Features:**
10
+ - OpenAI-compatible `/v1/chat/completions` endpoint
11
+ - Pass-through model selection (any HF model ID)
12
+ - ZeroGPU H200 inference with HF Serverless fallback
13
+ - HF Token authentication required
14
+ - SSE streaming support
15
+
16
+ ## Architecture
17
+
18
+ ```
19
+ ┌─────────────┐ ┌──────────────────────────────────────────────┐
20
+ │ opencode │────▶│ serenichron/opencode-zerogpu (HF Space) │
21
+ │ (client) │ │ │
22
+ └─────────────┘ │ ┌────────────────────────────────────────┐ │
23
+ │ │ app.py (Gradio + FastAPI mount) │ │
24
+ │ │ └─ /v1/chat/completions │ │
25
+ │ │ └─ auth_middleware (HF token) │ │
26
+ │ │ └─ inference_router │ │
27
+ │ │ ├─ ZeroGPU (@spaces.GPU) │ │
28
+ │ │ └─ HF Serverless (fallback) │ │
29
+ │ └────────────────────────────────────────┘ │
30
+ │ │
31
+ │ ┌──────────────┐ ┌───────────────────────┐ │
32
+ │ │ models.py │ │ openai_compat.py │ │
33
+ │ │ - load/unload│ │ - request/response │ │
34
+ │ │ - quantize │ │ - streaming format │ │
35
+ │ └──────────────┘ └───────────────────────┘ │
36
+ └──────────────────────────────────────────────┘
37
+ ```
38
+
39
+ ## Development Commands
40
+
41
+ ### Local Development (CPU/Mock Mode)
42
+ ```bash
43
+ # Install dependencies
44
+ pip install -r requirements.txt
45
+
46
+ # Run locally (ZeroGPU decorator no-ops)
47
+ python app.py
48
+
49
+ # Run with specific port
50
+ gradio app.py --server-port 7860
51
+ ```
52
+
53
+ ### Testing
54
+ ```bash
55
+ # Run all tests
56
+ pytest tests/ -v
57
+
58
+ # Run specific test file
59
+ pytest tests/test_openai_compat.py -v
60
+
61
+ # Run with coverage
62
+ pytest tests/ --cov=. --cov-report=term-missing
63
+ ```
64
+
65
+ ### API Testing
66
+ ```bash
67
+ # Test chat completions endpoint
68
+ curl -X POST http://localhost:7860/v1/chat/completions \
69
+ -H "Content-Type: application/json" \
70
+ -H "Authorization: Bearer $HF_TOKEN" \
71
+ -d '{
72
+ "model": "mistralai/Mistral-7B-Instruct-v0.3",
73
+ "messages": [{"role": "user", "content": "Hello!"}],
74
+ "stream": true
75
+ }'
76
+ ```
77
+
78
+ ### Deployment
79
+ ```bash
80
+ # Push to HuggingFace Space (after git remote setup)
81
+ git push hf main
82
+
83
+ # Or use HF CLI
84
+ huggingface-cli upload serenichron/opencode-zerogpu . --repo-type space
85
+ ```
86
+
87
+ ## Key Files
88
+
89
+ | File | Purpose |
90
+ |------|---------|
91
+ | `app.py` | Main Gradio app with FastAPI mount for OpenAI endpoints |
92
+ | `models.py` | Model loading, unloading, quantization, caching |
93
+ | `openai_compat.py` | OpenAI request/response format conversion |
94
+ | `config.py` | Environment variables, settings, quota tracking |
95
+ | `README.md` | HF Space config (YAML frontmatter) + documentation |
96
+
97
+ ## ZeroGPU Patterns
98
+
99
+ ### GPU Decorator Usage
100
+ ```python
101
+ import spaces
102
+
103
+ # Standard inference (60s default)
104
+ @spaces.GPU
105
+ def generate(prompt, model_id):
106
+ ...
107
+
108
+ # Extended duration for large models
109
+ @spaces.GPU(duration=120)
110
+ def generate_large(prompt, model_id):
111
+ ...
112
+
113
+ # Dynamic duration based on input
114
+ def calc_duration(prompt, max_tokens):
115
+ return min(120, max_tokens // 10)
116
+
117
+ @spaces.GPU(duration=calc_duration)
118
+ def generate_dynamic(prompt, max_tokens):
119
+ ...
120
+ ```
121
+
122
+ ### Model Loading Pattern
123
+ ```python
124
+ import gc
125
+ import torch
126
+
127
+ current_model = None
128
+ current_model_id = None
129
+
130
+ @spaces.GPU
131
+ def load_and_generate(model_id, prompt):
132
+ global current_model, current_model_id
133
+
134
+ if model_id != current_model_id:
135
+ # Cleanup previous model
136
+ if current_model:
137
+ del current_model
138
+ gc.collect()
139
+ torch.cuda.empty_cache()
140
+
141
+ # Load new model
142
+ current_model = AutoModelForCausalLM.from_pretrained(
143
+ model_id,
144
+ torch_dtype=torch.bfloat16,
145
+ device_map="auto"
146
+ )
147
+ current_model_id = model_id
148
+
149
+ return generate(current_model, prompt)
150
+ ```
151
+
152
+ ## Important Constraints
153
+
154
+ 1. **ZeroGPU Compatibility**
155
+ - `torch.compile` NOT supported - use PyTorch AoT instead
156
+ - Gradio SDK only (no Streamlit)
157
+ - GPU allocated only during `@spaces.GPU` decorated functions
158
+
159
+ 2. **Memory Management**
160
+ - H200 provides ~70GB VRAM
161
+ - 70B models require INT4 quantization
162
+ - Always cleanup with `gc.collect()` and `torch.cuda.empty_cache()`
163
+
164
+ 3. **Quota Awareness**
165
+ - PRO plan: 25 min/day H200 compute
166
+ - Track usage, fall back to HF Serverless when exhausted
167
+ - Shorter `duration` = higher queue priority
168
+
169
+ 4. **Authentication**
170
+ - All API requests require `Authorization: Bearer hf_...` header
171
+ - Validate tokens via HuggingFace Hub API
172
+
173
+ ## Environment Variables
174
+
175
+ | Variable | Required | Description |
176
+ |----------|----------|-------------|
177
+ | `HF_TOKEN` | No* | Token for accessing gated models (* Space has its own token) |
178
+ | `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
179
+ | `LOG_LEVEL` | No | Logging verbosity (default: INFO) |
180
+
181
+ ## Testing Strategy
182
+
183
+ 1. **Unit Tests**: Model loading, OpenAI format conversion
184
+ 2. **Integration Tests**: Full API request/response cycle
185
+ 3. **Local Testing**: CPU-only mode (decorator no-ops)
186
+ 4. **Live Testing**: Deploy to Space, test via opencode
README.md ADDED
@@ -0,0 +1,171 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: OpenCode ZeroGPU Provider
3
+ emoji: 🚀
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: 4.44.0
8
+ app_file: app.py
9
+ pinned: false
10
+ license: mit
11
+ hardware: zero-a10g
12
+ ---
13
+
14
+ # OpenCode ZeroGPU Provider
15
+
16
+ OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode), powered by HuggingFace ZeroGPU (NVIDIA H200).
17
+
18
+ ## Features
19
+
20
+ - **OpenAI-compatible API** - Drop-in replacement for OpenAI endpoints
21
+ - **Pass-through model selection** - Use any HuggingFace model ID
22
+ - **ZeroGPU H200 inference** - 25 min/day of H200 GPU compute (PRO plan)
23
+ - **Automatic fallback** - Falls back to HF Serverless when quota exhausted
24
+ - **SSE streaming** - Real-time token streaming support
25
+ - **Authentication** - Requires valid HuggingFace token
26
+
27
+ ## API Endpoint
28
+
29
+ ```
30
+ POST /v1/chat/completions
31
+ ```
32
+
33
+ ### Request Format
34
+
35
+ ```json
36
+ {
37
+ "model": "meta-llama/Llama-3.1-8B-Instruct",
38
+ "messages": [
39
+ {"role": "system", "content": "You are a helpful assistant."},
40
+ {"role": "user", "content": "Hello!"}
41
+ ],
42
+ "temperature": 0.7,
43
+ "max_tokens": 512,
44
+ "stream": true
45
+ }
46
+ ```
47
+
48
+ ### Headers
49
+
50
+ ```
51
+ Authorization: Bearer hf_YOUR_TOKEN
52
+ Content-Type: application/json
53
+ ```
54
+
55
+ ## Usage with opencode
56
+
57
+ Configure in `~/.config/opencode/opencode.json`:
58
+
59
+ ```json
60
+ {
61
+ "providers": {
62
+ "zerogpu": {
63
+ "npm": "@ai-sdk/openai-compatible",
64
+ "options": {
65
+ "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
66
+ "headers": {
67
+ "Authorization": "Bearer hf_YOUR_TOKEN"
68
+ }
69
+ },
70
+ "models": {
71
+ "llama-8b": {
72
+ "name": "meta-llama/Llama-3.1-8B-Instruct"
73
+ },
74
+ "mistral-7b": {
75
+ "name": "mistralai/Mistral-7B-Instruct-v0.3"
76
+ },
77
+ "qwen-7b": {
78
+ "name": "Qwen/Qwen2.5-7B-Instruct"
79
+ },
80
+ "qwen-14b": {
81
+ "name": "Qwen/Qwen2.5-14B-Instruct"
82
+ }
83
+ }
84
+ }
85
+ }
86
+ }
87
+ ```
88
+
89
+ Then use `/models` in opencode to select a zerogpu model.
90
+
91
+ ## Supported Models
92
+
93
+ Any HuggingFace model that fits in ~70GB VRAM. Examples:
94
+
95
+ | Model | Size | Quantization |
96
+ |-------|------|--------------|
97
+ | `meta-llama/Llama-3.1-8B-Instruct` | 8B | None |
98
+ | `mistralai/Mistral-7B-Instruct-v0.3` | 7B | None |
99
+ | `Qwen/Qwen2.5-7B-Instruct` | 7B | None |
100
+ | `Qwen/Qwen2.5-14B-Instruct` | 14B | None |
101
+ | `Qwen/Qwen2.5-32B-Instruct` | 32B | None |
102
+ | `meta-llama/Llama-3.1-70B-Instruct` | 70B | INT4 (auto) |
103
+
104
+ Models larger than 34B are automatically quantized (INT8 up to ~65B; larger models use INT4).
105
+
106
+ ## VRAM Guidelines
107
+
108
+ | Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
109
+ |------------|-----------|-----------|-----------|
110
+ | 7B | ~14GB | ~7GB | ~3.5GB |
111
+ | 13B | ~26GB | ~13GB | ~6.5GB |
112
+ | 34B | ~68GB | ~34GB | ~17GB |
113
+ | 70B | ~140GB | ~70GB | ~35GB |
114
+
115
+ *70B models require INT4 quantization. Add ~20% overhead for KV cache.*
116
+
117
+ ## Quota Information
118
+
119
+ - **PRO plan**: 25 minutes/day of H200 GPU compute
120
+ - **Priority**: PRO users get highest queue priority
121
+ - **Fallback**: When quota exhausted, falls back to HF Serverless Inference API
122
+
123
+ ## API Endpoints
124
+
125
+ | Endpoint | Method | Description |
126
+ |----------|--------|-------------|
127
+ | `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
128
+ | `/v1/models` | GET | List loaded models |
129
+ | `/health` | GET | Health check and quota status |
130
+
131
+ ## Local Development
132
+
133
+ ```bash
134
+ # Clone the repo
135
+ git clone https://huggingface.co/spaces/serenichron/opencode-zerogpu
136
+
137
+ # Install dependencies
138
+ pip install -r requirements.txt
139
+
140
+ # Run locally (ZeroGPU decorator no-ops)
141
+ python app.py
142
+ ```
143
+
144
+ ## Testing
145
+
146
+ ```bash
147
+ # Run tests
148
+ pytest tests/ -v
149
+
150
+ # Test the API locally
151
+ curl -X POST http://localhost:7860/v1/chat/completions \
152
+ -H "Content-Type: application/json" \
153
+ -H "Authorization: Bearer $HF_TOKEN" \
154
+ -d '{
155
+ "model": "mistralai/Mistral-7B-Instruct-v0.3",
156
+ "messages": [{"role": "user", "content": "Hello!"}],
157
+ "stream": false
158
+ }'
159
+ ```
160
+
161
+ ## Environment Variables
162
+
163
+ | Variable | Required | Description |
164
+ |----------|----------|-------------|
165
+ | `HF_TOKEN` | No* | Token for gated models (* Space uses its own token) |
166
+ | `FALLBACK_ENABLED` | No | Enable HF Serverless fallback (default: true) |
167
+ | `LOG_LEVEL` | No | Logging verbosity (default: INFO) |
168
+
169
+ ## License
170
+
171
+ MIT
app.py ADDED
@@ -0,0 +1,524 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ HuggingFace ZeroGPU Space - OpenAI-compatible inference provider for opencode.
3
+
4
+ This Gradio app provides:
5
+ - OpenAI-compatible /v1/chat/completions endpoint
6
+ - Pass-through model selection (any HF model ID)
7
+ - ZeroGPU H200 inference with HF Serverless fallback
8
+ - HF Token authentication
9
+ - SSE streaming support
10
+ """
11
+
12
+ import logging
13
+ import time
14
+ from contextlib import asynccontextmanager
15
+ from typing import Optional
16
+
17
+ import gradio as gr
18
+ import httpx
19
+ from fastapi import FastAPI, Header, HTTPException, Request
20
+ from fastapi.responses import StreamingResponse, JSONResponse
21
+ from huggingface_hub import HfApi
22
+
23
# Import spaces conditionally: on a ZeroGPU Space the real package exists;
# anywhere else we install a do-nothing stand-in so @spaces.GPU still works.
try:
    import spaces
    ZEROGPU_AVAILABLE = True
except ImportError:
    ZEROGPU_AVAILABLE = False

    class spaces:
        """Local stand-in so ``@spaces.GPU`` decorates without effect off-Space."""

        @staticmethod
        def GPU(fn=None, duration=60):
            # Supports both @spaces.GPU and @spaces.GPU(duration=...) forms.
            return fn if fn is not None else (lambda f: f)
37
+
38
+ from config import get_config, get_quota_tracker
39
+ from models import (
40
+ apply_chat_template,
41
+ generate_text,
42
+ generate_text_stream,
43
+ get_current_model,
44
+ )
45
+ from openai_compat import (
46
+ ChatCompletionRequest,
47
+ InferenceParams,
48
+ create_chat_response,
49
+ create_error_response,
50
+ estimate_tokens,
51
+ stream_response_generator,
52
+ )
53
+
54
logger = logging.getLogger(__name__)

# Process-wide singletons shared by every request handler below.
config = get_config()
quota_tracker = get_quota_tracker()

# HuggingFace Hub client; used only to validate bearer tokens via whoami().
hf_api = HfApi()
61
+
62
+
63
+ # --- Authentication ---
64
+
65
+
66
def validate_hf_token(token: str) -> bool:
    """Check a HuggingFace token against the Hub API.

    Accepts only tokens with the ``hf_`` prefix that the Hub's ``whoami``
    endpoint recognizes; any API failure counts as invalid.
    """
    if not token or not token.startswith("hf_"):
        return False
    try:
        hf_api.whoami(token=token)
    except Exception:
        return False
    return True
76
+
77
+
78
def extract_token(authorization: Optional[str]) -> Optional[str]:
    """Pull the bare token out of an ``Authorization`` header value.

    Strips a leading "Bearer " scheme when present; returns None for a
    missing or empty header. A schemeless value is returned unchanged.
    """
    if not authorization:
        return None
    scheme = "Bearer "
    if authorization.startswith(scheme):
        return authorization[len(scheme):]
    return authorization
87
+
88
+
89
+ # --- ZeroGPU Inference ---
90
+
91
+
92
@spaces.GPU(duration=120)
def zerogpu_generate(
    model_id: str,
    prompt: str,
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    stop_sequences: Optional[list[str]],
) -> str:
    """Generate text on the ZeroGPU-allocated GPU.

    Arguments mirror models.generate_text. GPU wall-clock time is charged
    to the quota tracker in a ``finally`` block so that seconds spent
    before an exception still count against the daily budget (the
    original only recorded usage on success).
    """
    start_time = time.time()
    try:
        return generate_text(
            model_id=model_id,
            prompt=prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            stop_sequences=stop_sequences,
        )
    finally:
        # The GPU seconds were spent even if generate_text raised.
        quota_tracker.add_usage(time.time() - start_time)
118
+
119
+
120
@spaces.GPU(duration=120)
def zerogpu_generate_stream(
    model_id: str,
    prompt: str,
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    stop_sequences: Optional[list[str]],
):
    """Stream generated tokens from the ZeroGPU-allocated GPU.

    Yields tokens as produced by models.generate_text_stream. Quota is
    recorded in a ``finally`` block so GPU time is charged even when the
    consumer stops iterating early (client disconnect -> GeneratorExit)
    or generation raises; the original post-loop accounting only ran when
    the generator was fully exhausted.
    """
    start_time = time.time()
    try:
        yield from generate_text_stream(
            model_id=model_id,
            prompt=prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            stop_sequences=stop_sequences,
        )
    finally:
        quota_tracker.add_usage(time.time() - start_time)
145
+
146
+
147
+ # --- HF Serverless Fallback ---
148
+
149
+
150
async def serverless_generate(
    model_id: str,
    prompt: str,
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    token: str,
) -> str:
    """Run one completion through the HF Serverless Inference API.

    Raises HTTPException mirroring the upstream status code on any non-200
    reply, or a 500 when the payload is not the expected
    ``[{"generated_text": ...}]`` list.
    """
    url = f"https://api-inference.huggingface.co/models/{model_id}"
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "return_full_text": False,
        },
    }
    headers = {"Authorization": f"Bearer {token}"}

    async with httpx.AsyncClient() as client:
        response = await client.post(url, json=payload, headers=headers, timeout=120.0)

    if response.status_code != 200:
        raise HTTPException(
            status_code=response.status_code,
            detail=f"HF Serverless error: {response.text}",
        )

    result = response.json()
    # Expected shape is a one-element list of {"generated_text": ...}.
    if isinstance(result, list) and len(result) > 0 and "generated_text" in result[0]:
        return result[0]["generated_text"]

    raise HTTPException(
        status_code=500,
        detail=f"Unexpected response format from HF Serverless: {result}",
    )
196
+
197
+
198
+ # --- FastAPI App ---
199
+
200
+
201
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Emit start/stop log lines around the FastAPI application's lifetime."""
    logger.info("Starting ZeroGPU OpenCode Provider")
    # Lazy %-style args: formatting happens only if the record is emitted.
    logger.info("ZeroGPU available: %s", ZEROGPU_AVAILABLE)
    logger.info("Fallback enabled: %s", config.fallback_enabled)
    yield
    logger.info("Shutting down ZeroGPU OpenCode Provider")
209
+
210
+
211
# OpenAI-compatible REST surface; mounted into the Gradio app at startup.
api = FastAPI(
    title="ZeroGPU OpenCode Provider",
    description="OpenAI-compatible API for HuggingFace models on ZeroGPU",
    version="1.0.0",
    lifespan=lifespan,
)
217
+
218
+
219
@api.post("/v1/chat/completions")
async def chat_completions(
    request: ChatCompletionRequest,
    authorization: Optional[str] = Header(None),
):
    """
    OpenAI-compatible chat completions endpoint (streaming and non-streaming).

    Requires a valid HuggingFace token in the Authorization header.
    Prefers ZeroGPU while quota remains; otherwise uses the HF Serverless
    fallback when enabled, or answers 503 when it is not.
    """
    # --- Authentication ---
    bearer = extract_token(authorization)
    if not bearer or not validate_hf_token(bearer):
        return JSONResponse(
            status_code=401,
            content=create_error_response(
                message="Invalid or missing HuggingFace token",
                error_type="authentication_error",
                code="invalid_api_key",
            ).model_dump(),
        )

    params = InferenceParams.from_request(request)

    # Render the chat messages into a single model prompt; failures here are
    # treated as client errors (bad or unloadable model id).
    try:
        prompt = apply_chat_template(params.model_id, params.messages)
    except Exception as e:
        logger.error(f"Failed to apply chat template: {e}")
        return JSONResponse(
            status_code=400,
            content=create_error_response(
                message=f"Failed to load model or apply chat template: {str(e)}",
                error_type="invalid_request_error",
                param="model",
            ).model_dump(),
        )

    prompt_tokens = estimate_tokens(prompt)

    # --- Backend routing ---
    use_zerogpu = ZEROGPU_AVAILABLE and not quota_tracker.quota_exhausted
    if not use_zerogpu and not config.fallback_enabled:
        return JSONResponse(
            status_code=503,
            content=create_error_response(
                message="ZeroGPU quota exhausted and fallback is disabled",
                error_type="server_error",
                code="quota_exhausted",
            ).model_dump(),
        )

    try:
        if params.stream:
            if use_zerogpu:
                token_gen = zerogpu_generate_stream(
                    model_id=params.model_id,
                    prompt=prompt,
                    max_new_tokens=params.max_new_tokens,
                    temperature=params.temperature,
                    top_p=params.top_p,
                    stop_sequences=params.stop_sequences,
                )
            else:
                # Serverless has no token stream: fetch the whole completion,
                # then emit it as a single SSE chunk so clients still get a
                # well-formed stream.
                logger.info("Using HF Serverless fallback (no streaming)")
                full_response = await serverless_generate(
                    model_id=params.model_id,
                    prompt=prompt,
                    max_new_tokens=params.max_new_tokens,
                    temperature=params.temperature,
                    top_p=params.top_p,
                    token=bearer,
                )
                token_gen = iter([full_response])

            return StreamingResponse(
                stream_response_generator(params.model_id, token_gen),
                media_type="text/event-stream",
                headers={
                    "Cache-Control": "no-cache",
                    "Connection": "keep-alive",
                    "X-Accel-Buffering": "no",
                },
            )

        # --- Non-streaming path ---
        if use_zerogpu:
            response_text = zerogpu_generate(
                model_id=params.model_id,
                prompt=prompt,
                max_new_tokens=params.max_new_tokens,
                temperature=params.temperature,
                top_p=params.top_p,
                stop_sequences=params.stop_sequences,
            )
        else:
            logger.info("Using HF Serverless fallback")
            response_text = await serverless_generate(
                model_id=params.model_id,
                prompt=prompt,
                max_new_tokens=params.max_new_tokens,
                temperature=params.temperature,
                top_p=params.top_p,
                token=bearer,
            )

        return create_chat_response(
            model=params.model_id,
            content=response_text,
            prompt_tokens=prompt_tokens,
            completion_tokens=estimate_tokens(response_text),
        )

    except Exception as e:
        logger.exception(f"Inference error: {e}")
        return JSONResponse(
            status_code=500,
            content=create_error_response(
                message=f"Inference failed: {str(e)}",
                error_type="server_error",
            ).model_dump(),
        )
353
+
354
+
355
@api.get("/v1/models")
async def list_models(authorization: Optional[str] = Header(None)):
    """List models in OpenAI format (only the currently loaded one, if any)."""
    bearer = extract_token(authorization)
    if not bearer or not validate_hf_token(bearer):
        return JSONResponse(
            status_code=401,
            content=create_error_response(
                message="Invalid or missing HuggingFace token",
                error_type="authentication_error",
                code="invalid_api_key",
            ).model_dump(),
        )

    loaded = get_current_model()
    data = []
    if loaded:
        data.append(
            {
                "id": loaded.model_id,
                "object": "model",
                "created": int(time.time()),
                "owned_by": "huggingface",
            }
        )

    return {"object": "list", "data": data}
383
+
384
+
385
@api.get("/health")
async def health_check():
    """Unauthenticated liveness probe reporting quota and fallback status."""
    return {
        "status": "healthy",
        "zerogpu_available": ZEROGPU_AVAILABLE,
        "quota_remaining_minutes": quota_tracker.remaining_minutes(),
        "fallback_enabled": config.fallback_enabled,
    }
394
+
395
+
396
+ # --- Gradio Interface ---
397
+
398
+
399
def gradio_chat(
    message: str,
    history: list[list[str]],
    model_id: str,
    temperature: float,
    max_tokens: int,
):
    """Gradio ChatInterface handler: stream an assistant reply.

    Rebuilds the OpenAI-style message list from the (user, assistant)
    history pairs, then yields the accumulated response text as each
    token arrives from the GPU stream.
    """
    messages = []
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        if assistant_turn:
            messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": message})

    prompt = apply_chat_template(model_id, messages)

    # Stream: yield the growing partial response for live UI updates.
    partial = ""
    for chunk in zerogpu_generate_stream(
        model_id=model_id,
        prompt=prompt,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=0.95,
        stop_sequences=None,
    ):
        partial += chunk
        yield partial
430
+
431
+
432
# Gradio Blocks interface: a human-facing playground for the same backend
# that the /v1 API serves. Layout: settings column (left) + chat (right).
with gr.Blocks(title="ZeroGPU OpenCode Provider") as demo:
    # Static header with a copy-paste opencode provider configuration.
    gr.Markdown(
        """
        # ZeroGPU OpenCode Provider

        OpenAI-compatible inference endpoint for [opencode](https://github.com/sst/opencode).

        **API Endpoint:** `/v1/chat/completions`

        ## Usage with opencode

        Configure in `~/.config/opencode/opencode.json`:

        ```json
        {
          "providers": {
            "zerogpu": {
              "npm": "@ai-sdk/openai-compatible",
              "options": {
                "baseURL": "https://serenichron-opencode-zerogpu.hf.space/v1",
                "headers": {
                  "Authorization": "Bearer hf_YOUR_TOKEN"
                }
              },
              "models": {
                "llama-8b": {
                  "name": "meta-llama/Llama-3.1-8B-Instruct"
                }
              }
            }
          }
        }
        ```

        ---
        """
    )

    with gr.Row():
        with gr.Column(scale=1):
            # Pass-through model selection: custom values allow any HF model id.
            model_dropdown = gr.Dropdown(
                label="Model",
                choices=[
                    "meta-llama/Llama-3.1-8B-Instruct",
                    "mistralai/Mistral-7B-Instruct-v0.3",
                    "Qwen/Qwen2.5-7B-Instruct",
                    "Qwen/Qwen2.5-14B-Instruct",
                ],
                value="meta-llama/Llama-3.1-8B-Instruct",
                allow_custom_value=True,
            )
            temperature_slider = gr.Slider(
                label="Temperature",
                minimum=0.0,
                maximum=2.0,
                value=0.7,
                step=0.1,
            )
            max_tokens_slider = gr.Slider(
                label="Max Tokens",
                minimum=64,
                maximum=4096,
                value=512,
                step=64,
            )

            # NOTE(review): this status text is rendered once at import time;
            # it will not refresh as quota is consumed.
            gr.Markdown(
                f"""
                ### Status
                - **ZeroGPU:** {'Available' if ZEROGPU_AVAILABLE else 'Not Available'}
                - **Fallback:** {'Enabled' if config.fallback_enabled else 'Disabled'}
                """
            )

        with gr.Column(scale=3):
            # Chat UI wired to the streaming handler; the three controls are
            # forwarded as additional positional inputs after (message, history).
            chatbot = gr.ChatInterface(
                fn=gradio_chat,
                additional_inputs=[model_dropdown, temperature_slider, max_tokens_slider],
                title="",
                examples=[
                    ["Hello! How are you?"],
                    ["Explain quantum computing in simple terms."],
                    ["Write a Python function to calculate fibonacci numbers."],
                ],
            )

# Mount FastAPI to Gradio
# NOTE(review): gr.mount_gradio_app is documented as (app: FastAPI, blocks,
# path) — the argument order here (demo, api) looks swapped; verify the /v1
# routes are actually reachable on the deployed Space.
demo = gr.mount_gradio_app(demo, api, path="/")


if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
config.py ADDED
@@ -0,0 +1,159 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Configuration and environment handling for ZeroGPU Space."""
2
+
3
+ import os
4
+ import logging
5
+ from dataclasses import dataclass, field
6
+ from typing import Optional
7
+ from dotenv import load_dotenv
8
+
9
+ load_dotenv()
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+
14
def _env_flag(name: str, default: str = "true") -> bool:
    """Read a boolean-ish environment variable ("true"/"false", any case)."""
    return os.getenv(name, default).lower() == "true"


@dataclass
class Config:
    """Application configuration resolved from the process environment."""

    # HuggingFace token used for gated-model access (None when unset).
    hf_token: Optional[str] = field(default_factory=lambda: os.getenv("HF_TOKEN"))

    # Whether to fall back to HF Serverless once the ZeroGPU quota is gone.
    fallback_enabled: bool = field(default_factory=lambda: _env_flag("FALLBACK_ENABLED"))

    # Root logging level name (e.g. "INFO", "DEBUG").
    log_level: str = field(default_factory=lambda: os.getenv("LOG_LEVEL", "INFO"))

    # Quantization mode forced for every model ("none", "int8", "int4").
    default_quantization: str = field(
        default_factory=lambda: os.getenv("DEFAULT_QUANTIZATION", "none")
    )

    # Size (billions of parameters) above which models are auto-quantized.
    auto_quantize_threshold_b: int = field(
        default_factory=lambda: int(os.getenv("AUTO_QUANTIZE_THRESHOLD_B", "34"))
    )

    def __post_init__(self) -> None:
        """Configure root logging once the level is known."""
        logging.basicConfig(
            level=getattr(logging, self.log_level.upper(), logging.INFO),
            format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        )
43
+
44
+
45
+ @dataclass
46
+ class QuotaTracker:
47
+ """Track ZeroGPU quota usage for the current session."""
48
+
49
+ # Total seconds used in current day
50
+ seconds_used: float = 0.0
51
+
52
+ # Daily quota in seconds (PRO plan: 25 min = 1500 sec)
53
+ daily_quota_seconds: float = 1500.0
54
+
55
+ # Whether quota is exhausted
56
+ quota_exhausted: bool = False
57
+
58
+ def add_usage(self, seconds: float) -> None:
59
+ """Record GPU usage time."""
60
+ self.seconds_used += seconds
61
+ if self.seconds_used >= self.daily_quota_seconds:
62
+ self.quota_exhausted = True
63
+ logger.warning(
64
+ f"ZeroGPU quota exhausted: {self.seconds_used:.1f}s / {self.daily_quota_seconds:.1f}s"
65
+ )
66
+
67
+ def remaining_seconds(self) -> float:
68
+ """Get remaining quota in seconds."""
69
+ return max(0, self.daily_quota_seconds - self.seconds_used)
70
+
71
+ def remaining_minutes(self) -> float:
72
+ """Get remaining quota in minutes."""
73
+ return self.remaining_seconds() / 60.0
74
+
75
+ def reset(self) -> None:
76
+ """Reset quota (called at day boundary)."""
77
+ self.seconds_used = 0.0
78
+ self.quota_exhausted = False
79
+ logger.info("ZeroGPU quota reset")
80
+
81
+
82
# Module-level singletons: one Config (whose construction also configures
# logging) and one QuotaTracker shared by the whole process.
config = Config()
quota_tracker = QuotaTracker()


def get_config() -> Config:
    """Return the process-wide Config singleton."""
    return config


def get_quota_tracker() -> QuotaTracker:
    """Return the process-wide QuotaTracker singleton."""
    return quota_tracker
97
+
98
+
99
# Known parameter counts (billions) for popular model ids; consulted before
# falling back to name-pattern parsing in estimate_model_size().
MODEL_SIZE_ESTIMATES = {
    # Llama family
    "meta-llama/Llama-3.1-8B-Instruct": 8,
    "meta-llama/Llama-3.1-70B-Instruct": 70,
    "meta-llama/Llama-3.2-1B-Instruct": 1,
    "meta-llama/Llama-3.2-3B-Instruct": 3,
    # Mistral family
    "mistralai/Mistral-7B-Instruct-v0.3": 7,
    "mistralai/Mixtral-8x7B-Instruct-v0.1": 47,  # MoE effective
    # Qwen family
    "Qwen/Qwen2.5-7B-Instruct": 7,
    "Qwen/Qwen2.5-14B-Instruct": 14,
    "Qwen/Qwen2.5-32B-Instruct": 32,
    "Qwen/Qwen2.5-72B-Instruct": 72,
}


def estimate_model_size(model_id: str) -> Optional[int]:
    """Estimate a model's parameter count in billions from its id.

    Known ids come from MODEL_SIZE_ESTIMATES; otherwise the first
    ``<digits>B`` token in the name is used (e.g. "foo-13B" -> 13).
    Returns None when neither source yields a size.
    """
    known = MODEL_SIZE_ESTIMATES.get(model_id)
    if known is not None:
        return known

    # Fallback: parse a size marker out of the model name.
    import re
    m = re.search(r"(\d+)B", model_id, re.IGNORECASE)
    return int(m.group(1)) if m else None
136
+
137
+
138
def should_quantize(model_id: str) -> str:
    """Pick a quantization mode for *model_id*: "none", "int8", or "int4".

    An explicit DEFAULT_QUANTIZATION override wins. Otherwise size-based:
    models whose size cannot be determined stay unquantized, anything over
    65B gets INT4 (to fit ~70GB VRAM), and anything above the configured
    threshold gets INT8.
    """
    override = config.default_quantization
    if override != "none":
        return override

    size = estimate_model_size(model_id)
    if size is None:
        # Can't size the model from its name: don't guess at quantization.
        return "none"
    if size > 65:
        return "int4"
    if size > config.auto_quantize_threshold_b:
        return "int8"
    return "none"
models.py ADDED
@@ -0,0 +1,335 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Model loading, caching, and memory management for ZeroGPU inference."""
2
+
3
+ import gc
4
+ import logging
5
+ from dataclasses import dataclass, field
6
+ from typing import Optional, Generator, Any
7
+
8
+ import torch
9
+ from transformers import (
10
+ AutoModelForCausalLM,
11
+ AutoTokenizer,
12
+ BitsAndBytesConfig,
13
+ TextIteratorStreamer,
14
+ )
15
+ from threading import Thread
16
+
17
+ from config import get_config, should_quantize
18
+
19
+ logger = logging.getLogger(__name__)
20
+
21
+
22
@dataclass
class LoadedModel:
    """Container for a loaded model and its tokenizer.

    Bundles the pieces inference code needs so a model is never paired with
    the wrong tokenizer.
    """

    model_id: str  # HuggingFace Hub ID the weights were loaded from
    model: Any  # the AutoModelForCausalLM instance
    tokenizer: Any  # the matching AutoTokenizer instance
    quantization: str = "none"  # quantization actually applied: "none"/"int8"/"int4"
30
+
31
+
32
# Global model cache. Only a single model is kept resident at a time due to
# memory constraints; load_model() swaps it out when a different ID is asked for.
_current_model: Optional[LoadedModel] = None
34
+
35
+
36
def get_quantization_config(quantization: str) -> Optional[BitsAndBytesConfig]:
    """Map a quantization level to a BitsAndBytes configuration.

    Returns None for any value other than "int8"/"int4", meaning the model
    should be loaded without bitsandbytes quantization.
    """
    if quantization == "int4":
        # NF4 + double quantization with bf16 compute: the standard
        # quality/memory trade-off for 4-bit loading.
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    if quantization == "int8":
        return BitsAndBytesConfig(load_in_8bit=True)
    return None
48
+
49
+
50
def clear_gpu_memory() -> None:
    """Release as much GPU memory as possible (Python GC plus CUDA cache flush)."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    logger.debug("GPU memory cleared")
57
+
58
+
59
def unload_model() -> None:
    """Drop the cached model (if any) and release its memory."""
    global _current_model

    cached = _current_model
    if cached is None:
        return

    logger.info(f"Unloading model: {cached.model_id}")
    del cached.model
    del cached.tokenizer
    _current_model = None
    clear_gpu_memory()
69
+
70
+
71
def load_model(
    model_id: str,
    quantization: Optional[str] = None,
    force_reload: bool = False,
) -> LoadedModel:
    """
    Load a model from HuggingFace Hub, replacing any previously cached one.

    Only one model is kept resident at a time: requesting a different ID
    first unloads the current model and clears GPU memory.

    Args:
        model_id: HuggingFace model ID (e.g., "meta-llama/Llama-3.1-8B-Instruct")
        quantization: Force specific quantization ("none", "int8", "int4").
            If None, auto-determine based on estimated model size.
        force_reload: If True, reload even if already loaded.

    Returns:
        LoadedModel with model and tokenizer ready for inference.
    """
    global _current_model

    # Cache hit: same model already resident and no reload requested.
    if not force_reload and _current_model is not None:
        if _current_model.model_id == model_id:
            logger.debug(f"Model already loaded: {model_id}")
            return _current_model

    # Determine quantization automatically when the caller did not choose.
    if quantization is None:
        quantization = should_quantize(model_id)

    logger.info(f"Loading model: {model_id} (quantization: {quantization})")

    # Unload current model first so two models never coexist in memory.
    unload_model()

    config = get_config()

    # Load tokenizer.
    # NOTE(review): trust_remote_code=True executes code shipped with the
    # model repo. Acceptable only if model IDs come from trusted callers —
    # confirm this matches the Space's threat model.
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        token=config.hf_token,
        trust_remote_code=True,
    )

    # Ensure tokenizer has a pad token (many causal LMs ship without one).
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model with appropriate configuration.
    quant_config = get_quantization_config(quantization)

    model_kwargs = {
        "token": config.hf_token,
        "trust_remote_code": True,
        "device_map": "auto",
    }

    if quant_config is not None:
        model_kwargs["quantization_config"] = quant_config
    else:
        # Full-precision path: bf16 halves the footprint vs fp32.
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)

    _current_model = LoadedModel(
        model_id=model_id,
        model=model,
        tokenizer=tokenizer,
        quantization=quantization,
    )

    logger.info(f"Model loaded successfully: {model_id}")
    return _current_model
143
+
144
+
145
def get_current_model() -> Optional[LoadedModel]:
    """Return the currently cached model, or None if nothing is loaded."""
    return _current_model
148
+
149
+
150
def generate_text(
    model_id: str,
    prompt: str,
    max_new_tokens: int = 512,
    temperature: float = 0.7,
    top_p: float = 0.95,
    top_k: int = 50,
    repetition_penalty: float = 1.1,
    stop_sequences: Optional[list[str]] = None,
) -> str:
    """
    Generate text using the specified model (non-streaming).

    Args:
        model_id: HuggingFace model ID
        prompt: Input prompt (already formatted with chat template)
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature; <= 0 selects greedy decoding
        top_p: Nucleus sampling probability (sampling mode only)
        top_k: Top-k sampling parameter (sampling mode only)
        repetition_penalty: Penalty for repeating tokens
        stop_sequences: Generation is truncated at the first occurrence of
            any of these strings

    Returns:
        Generated text (without the input prompt).
    """
    loaded = load_model(model_id)

    inputs = loaded.tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        # Leave room for generation; max(1, ...) guards against
        # max_new_tokens >= context window producing a negative max_length.
        max_length=max(1, loaded.tokenizer.model_max_length - max_new_tokens),
    )

    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    # Build generation config. Sampling knobs (temperature/top_p/top_k) are
    # only passed when sampling is enabled: with do_sample=False transformers
    # warns about them, and temperature=0.0 is not a valid sampling value.
    do_sample = temperature > 0
    gen_kwargs = {
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
        "do_sample": do_sample,
        "pad_token_id": loaded.tokenizer.pad_token_id,
        "eos_token_id": loaded.tokenizer.eos_token_id,
    }
    if do_sample:
        gen_kwargs.update(temperature=temperature, top_p=top_p, top_k=top_k)

    with torch.no_grad():
        outputs = loaded.model.generate(**inputs, **gen_kwargs)

    # Decode only the newly generated tokens.
    input_length = inputs["input_ids"].shape[1]
    generated_tokens = outputs[0][input_length:]
    response = loaded.tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Truncate at the first stop sequence, if any appears.
    if stop_sequences:
        for stop_seq in stop_sequences:
            if stop_seq in response:
                response = response.split(stop_seq)[0]

    return response
215
+
216
+
217
def generate_text_stream(
    model_id: str,
    prompt: str,
    max_new_tokens: int = 512,
    temperature: float = 0.7,
    top_p: float = 0.95,
    top_k: int = 50,
    repetition_penalty: float = 1.1,
    stop_sequences: Optional[list[str]] = None,
) -> Generator[str, None, None]:
    """
    Generate text with streaming output, yielding text fragments.

    Generation runs in a background thread feeding a TextIteratorStreamer.
    When stop_sequences are given, up to max(len(seq)) - 1 trailing
    characters are held back until it is certain they do not start a stop
    sequence, so no part of a stop sequence is ever emitted. (The previous
    implementation could leak the beginning of a stop sequence that spanned
    a token boundary, because earlier tokens were already yielded.)
    """
    loaded = load_model(model_id)

    inputs = loaded.tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        # Guard against max_new_tokens >= context window (negative max_length).
        max_length=max(1, loaded.tokenizer.model_max_length - max_new_tokens),
    )

    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    # Create streamer
    streamer = TextIteratorStreamer(
        loaded.tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )

    # Build generation config; sampling knobs only when sampling is enabled
    # (temperature=0.0 is invalid for sampling and noisy for greedy mode).
    do_sample = temperature > 0
    gen_kwargs = {
        **inputs,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
        "do_sample": do_sample,
        "pad_token_id": loaded.tokenizer.pad_token_id,
        "eos_token_id": loaded.tokenizer.eos_token_id,
        "streamer": streamer,
    }
    if do_sample:
        gen_kwargs.update(temperature=temperature, top_p=top_p, top_k=top_k)

    # Run generation in a separate thread; this thread consumes the streamer.
    thread = Thread(target=loaded.model.generate, kwargs=gen_kwargs)
    thread.start()

    # Hold back enough characters to cover any partially-streamed stop sequence.
    holdback = max((len(s) for s in stop_sequences), default=1) - 1 if stop_sequences else 0

    accumulated = ""
    emitted = 0  # number of characters of `accumulated` already yielded
    stopped = False
    for token in streamer:
        accumulated += token

        if stop_sequences:
            # Find the earliest occurrence of any stop sequence.
            stop_at = None
            for seq in stop_sequences:
                idx = accumulated.find(seq)
                if idx != -1 and (stop_at is None or idx < stop_at):
                    stop_at = idx
            if stop_at is not None:
                # Emit anything before the stop sequence that is still unsent.
                if stop_at > emitted:
                    yield accumulated[emitted:stop_at]
                stopped = True
                break

        safe_end = len(accumulated) - holdback
        if safe_end > emitted:
            yield accumulated[emitted:safe_end]
            emitted = safe_end

    if not stopped and emitted < len(accumulated):
        # Flush the held-back tail once the stream ends without a stop hit.
        yield accumulated[emitted:]

    thread.join()
292
+
293
+
294
def apply_chat_template(
    model_id: str,
    messages: list[dict[str, str]],
    add_generation_prompt: bool = True,
) -> str:
    """
    Format a message list using the model's own chat template.

    Args:
        model_id: HuggingFace model ID
        messages: List of message dicts with "role" and "content"
        add_generation_prompt: Whether to append the assistant turn opener

    Returns:
        Formatted prompt string.
    """
    loaded = load_model(model_id)

    # Prefer the tokenizer's built-in template when one is available.
    if hasattr(loaded.tokenizer, "apply_chat_template"):
        return loaded.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=add_generation_prompt,
        )

    # Fallback: minimal role-labelled transcript (unknown roles are skipped).
    role_labels = {"system": "System", "user": "User", "assistant": "Assistant"}
    pieces = [
        f"{role_labels[msg['role']]}: {msg['content']}\n"
        for msg in messages
        if msg["role"] in role_labels
    ]
    if add_generation_prompt:
        pieces.append("Assistant:")
    return "".join(pieces)
openai_compat.py ADDED
@@ -0,0 +1,269 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """OpenAI-compatible API request/response format handling."""
2
+
3
import json
import logging
import time
import uuid
from dataclasses import dataclass, field
from typing import Generator, Literal, Optional

from pydantic import BaseModel, Field, field_validator
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+
15
+ # --- Request Models ---
16
+
17
+
18
class ChatMessage(BaseModel):
    """A single message in the conversation (OpenAI chat format)."""

    role: Literal["system", "user", "assistant"]  # tool/function roles not supported
    content: str
23
+
24
+
25
class ChatCompletionRequest(BaseModel):
    """OpenAI-compatible chat completion request.

    `model` is passed through verbatim as a HuggingFace model ID; the
    remaining fields mirror the OpenAI /v1/chat/completions schema.
    """

    model: str = Field(..., description="HuggingFace model ID")
    messages: list[ChatMessage]
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    top_p: float = Field(default=0.95, ge=0.0, le=1.0)
    max_tokens: Optional[int] = Field(default=512, ge=1, le=8192)
    stream: bool = False
    # OpenAI allows `stop` to be either a single string or a list of strings;
    # a bare string is normalized to a one-element list below.
    stop: Optional[list[str]] = None
    presence_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
    frequency_penalty: float = Field(default=0.0, ge=-2.0, le=2.0)
    n: int = Field(default=1, ge=1, le=1)  # Only support n=1 for now
    user: Optional[str] = None

    @field_validator("stop", mode="before")
    @classmethod
    def _normalize_stop(cls, value):
        """Coerce a bare string into a one-element list (OpenAI compatibility)."""
        if isinstance(value, str):
            return [value]
        return value
39
+
40
+
41
+ # --- Response Models ---
42
+
43
+
44
class ChatCompletionChoice(BaseModel):
    """A single completion choice (this API always returns exactly one)."""

    index: int
    message: ChatMessage
    finish_reason: Literal["stop", "length", "content_filter"] = "stop"
50
+
51
+
52
class ChatCompletionUsage(BaseModel):
    """Token usage statistics reported back to the client."""

    prompt_tokens: int
    completion_tokens: int
    total_tokens: int  # prompt_tokens + completion_tokens
58
+
59
+
60
class ChatCompletionResponse(BaseModel):
    """OpenAI-compatible non-streaming chat completion response."""

    id: str  # "chatcmpl-..." identifier
    object: str = "chat.completion"
    created: int  # Unix timestamp, seconds
    model: str
    choices: list[ChatCompletionChoice]
    usage: ChatCompletionUsage
69
+
70
+
71
+ # --- Streaming Response Models ---
72
+
73
+
74
class DeltaMessage(BaseModel):
    """Incremental payload for streaming responses; unset fields are None."""

    role: Optional[str] = None  # sent once, in the first chunk of a stream
    content: Optional[str] = None
79
+
80
+
81
class StreamChoice(BaseModel):
    """A single streaming choice carrying a delta instead of a full message."""

    index: int
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length", "content_filter"]] = None
87
+
88
+
89
class ChatCompletionChunk(BaseModel):
    """OpenAI-compatible streaming chunk (one SSE `data:` payload)."""

    id: str  # same completion id across all chunks of one stream
    object: str = "chat.completion.chunk"
    created: int
    model: str
    choices: list[StreamChoice]
97
+
98
+
99
+ # --- Helper Functions ---
100
+
101
+
102
def generate_completion_id() -> str:
    """Return a fresh OpenAI-style completion ID ("chatcmpl-" + 24 hex chars)."""
    suffix = uuid.uuid4().hex[:24]
    return "chatcmpl-" + suffix
105
+
106
+
107
def create_chat_response(
    model: str,
    content: str,
    prompt_tokens: int = 0,
    completion_tokens: int = 0,
    finish_reason: str = "stop",
) -> ChatCompletionResponse:
    """Assemble a complete (non-streaming) chat completion response."""
    reply = ChatMessage(role="assistant", content=content)
    choice = ChatCompletionChoice(index=0, message=reply, finish_reason=finish_reason)
    usage = ChatCompletionUsage(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
    return ChatCompletionResponse(
        id=generate_completion_id(),
        created=int(time.time()),
        model=model,
        choices=[choice],
        usage=usage,
    )
132
+
133
+
134
def create_stream_chunk(
    completion_id: str,
    model: str,
    content: Optional[str] = None,
    role: Optional[str] = None,
    finish_reason: Optional[str] = None,
) -> ChatCompletionChunk:
    """Build one streaming chunk carrying a role, a content delta, or a finish."""
    delta = DeltaMessage(role=role, content=content)
    choice = StreamChoice(index=0, delta=delta, finish_reason=finish_reason)
    return ChatCompletionChunk(
        id=completion_id,
        created=int(time.time()),
        model=model,
        choices=[choice],
    )
154
+
155
+
156
def stream_response_generator(
    model: str,
    token_generator: Generator[str, None, None],
) -> Generator[str, None, None]:
    """
    Wrap a raw token stream as SSE-formatted OpenAI streaming chunks.

    Emits a role chunk first, then one chunk per token, then a finish chunk,
    and finally the "[DONE]" end marker.
    """
    completion_id = generate_completion_id()

    def emit(chunk):
        # Each SSE event is "data: <json>" followed by a blank line.
        return f"data: {chunk.model_dump_json()}\n\n"

    # First chunk: announce the assistant role.
    yield emit(create_stream_chunk(completion_id=completion_id, model=model, role="assistant"))

    # One chunk per generated token.
    for token in token_generator:
        yield emit(create_stream_chunk(completion_id=completion_id, model=model, content=token))

    # Final chunk: finish reason, then the end marker.
    yield emit(create_stream_chunk(completion_id=completion_id, model=model, finish_reason="stop"))
    yield "data: [DONE]\n\n"
194
+
195
+
196
def messages_to_dicts(messages: list[ChatMessage]) -> list[dict[str, str]]:
    """Flatten ChatMessage objects into plain {"role", "content"} dicts."""
    return [dict(role=message.role, content=message.content) for message in messages]
199
+
200
+
201
def estimate_tokens(text: str) -> int:
    """
    Rough token count estimation.

    A simple ~4-characters-per-token heuristic for English text; the real
    count depends on the model's tokenizer. Always returns at least 1.
    """
    approx = len(text) // 4
    return approx if approx >= 1 else 1
209
+
210
+
211
@dataclass
class InferenceParams:
    """Inference parameters extracted from an OpenAI-compatible request."""

    model_id: str  # HF model ID, passed through from request.model
    messages: list[dict[str, str]]  # plain role/content dicts
    max_new_tokens: int
    temperature: float
    top_p: float
    stop_sequences: Optional[list[str]]
    stream: bool  # whether the caller asked for SSE streaming

    @classmethod
    def from_request(cls, request: ChatCompletionRequest) -> "InferenceParams":
        """Extract inference parameters from an OpenAI-compatible request."""
        return cls(
            model_id=request.model,
            messages=messages_to_dicts(request.messages),
            max_new_tokens=request.max_tokens or 512,  # None falls back to 512
            temperature=request.temperature,
            top_p=request.top_p,
            stop_sequences=request.stop,
            stream=request.stream,
        )
235
+
236
+
237
+ # --- Error Responses ---
238
+
239
+
240
class ErrorDetail(BaseModel):
    """Error detail payload for OpenAI-style API error responses."""

    message: str
    type: str  # e.g. "invalid_request_error"
    param: Optional[str] = None  # offending request parameter, when known
    code: Optional[str] = None
247
+
248
+
249
class ErrorResponse(BaseModel):
    """OpenAI-compatible error envelope: {"error": {...}}."""

    error: ErrorDetail
253
+
254
+
255
def create_error_response(
    message: str,
    error_type: str = "invalid_request_error",
    param: Optional[str] = None,
    code: Optional[str] = None,
) -> ErrorResponse:
    """Wrap error details in an OpenAI-style error envelope."""
    detail = ErrorDetail(message=message, type=error_type, param=param, code=code)
    return ErrorResponse(error=detail)
requirements.txt ADDED
@@ -0,0 +1,32 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # HuggingFace ZeroGPU Space - OpenCode Provider
2
+ # For ZeroGPU H200 inference with OpenAI-compatible API
3
+
4
+ # Core Framework
5
+ gradio>=4.44.0
6
+ spaces>=0.30.0
7
+
8
+ # ML/Inference
9
+ torch>=2.0.0
10
+ transformers>=4.45.0
11
+ accelerate>=0.34.0
12
+ bitsandbytes>=0.44.0
13
+ safetensors>=0.4.0
14
+
15
+ # HuggingFace Integration
16
+ huggingface-hub>=0.25.0
17
+
18
+ # API
19
+ fastapi>=0.115.0
20
+ uvicorn>=0.30.0
21
+ pydantic>=2.0.0
22
+ httpx>=0.27.0
23
+ sse-starlette>=2.1.0
24
+
25
+ # Utilities
26
+ python-dotenv>=1.0.0
27
+ typing-extensions>=4.12.0
28
+
29
+ # Testing (dev)
30
+ pytest>=8.0.0
31
+ pytest-asyncio>=0.24.0
32
+ pytest-cov>=5.0.0
tests/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # Tests package for ZeroGPU OpenCode Provider
tests/conftest.py ADDED
@@ -0,0 +1,116 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Test fixtures for ZeroGPU OpenCode Provider tests."""
2
+
3
+ import pytest
4
+ from unittest.mock import MagicMock, patch
5
+
6
+
7
@pytest.fixture
def mock_tokenizer():
    """Create a mock tokenizer with stubbed call/template/decode behavior."""
    tokenizer = MagicMock()
    tokenizer.pad_token = None
    tokenizer.eos_token = "</s>"
    tokenizer.pad_token_id = 0
    tokenizer.eos_token_id = 2
    tokenizer.model_max_length = 4096

    def mock_apply_chat_template(messages, tokenize=False, add_generation_prompt=True):
        parts = []
        for msg in messages:
            role = msg.get("role", msg.role if hasattr(msg, "role") else "user")
            content = msg.get("content", msg.content if hasattr(msg, "content") else "")
            if role == "system":
                parts.append(f"<|system|>{content}</s>")
            elif role == "user":
                parts.append(f"<|user|>{content}</s>")
            elif role == "assistant":
                parts.append(f"<|assistant|>{content}</s>")
        if add_generation_prompt:
            parts.append("<|assistant|>")
        return "".join(parts)

    tokenizer.apply_chat_template = mock_apply_chat_template

    def mock_call(text, return_tensors=None, truncation=True, max_length=None):
        import torch

        # Simple mock: input_ids sized from the text length (~4 chars/token).
        token_count = max(1, len(text) // 4)
        return {
            "input_ids": torch.ones((1, token_count), dtype=torch.long),
            "attention_mask": torch.ones((1, token_count), dtype=torch.long),
        }

    # BUG FIX: assigning `tokenizer.__call__ = mock_call` has no effect —
    # special methods are looked up on the type, and MagicMock routes calls
    # through side_effect/return_value. side_effect makes `tokenizer(text)`
    # actually invoke mock_call with the real arguments.
    tokenizer.side_effect = mock_call

    def mock_decode(tokens, skip_special_tokens=True):
        return "This is a test response."

    tokenizer.decode = mock_decode

    return tokenizer
52
+
53
+
54
@pytest.fixture
def mock_model():
    """Create a mock model whose generate() returns input length + 20 tokens."""
    import torch

    model = MagicMock()

    def fake_generate(**kwargs):
        prompt_ids = kwargs.get("input_ids", torch.ones((1, 10), dtype=torch.long))
        # Pretend 20 new tokens were generated after the prompt.
        total_len = prompt_ids.shape[1] + 20
        return torch.ones((1, total_len), dtype=torch.long)

    model.generate = fake_generate
    model.device = "cpu"

    return model
72
+
73
+
74
@pytest.fixture
def sample_messages():
    """Minimal two-message conversation (system + user) for chat tests."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
81
+
82
+
83
@pytest.fixture
def sample_request_data():
    """Valid non-streaming request body for the OpenAI-compatible endpoint."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "temperature": 0.7,
        "max_tokens": 512,
        "stream": False,
    }
96
+
97
+
98
@pytest.fixture
def sample_streaming_request_data():
    """Valid streaming (SSE) request body for the OpenAI-compatible endpoint."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Tell me a joke."},
        ],
        "temperature": 0.7,
        "max_tokens": 256,
        "stream": True,
    }
110
+
111
+
112
@pytest.fixture(autouse=True)
def mock_torch_cuda():
    """Force torch.cuda.is_available() to False so tests never touch a GPU."""
    with patch("torch.cuda.is_available", return_value=False):
        yield
tests/test_models.py ADDED
@@ -0,0 +1,150 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Tests for model loading and inference."""
2
+
3
+ import pytest
4
+ from unittest.mock import patch, MagicMock
5
+ import sys
6
+ import os
7
+
8
+ # Add parent directory to path for imports
9
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
10
+
11
+ from config import estimate_model_size, should_quantize
12
+
13
+
14
class TestModelSizeEstimation:
    """Test model size estimation logic in config.estimate_model_size."""

    def test_known_model_size(self):
        """Known models resolve through the MODEL_SIZE_ESTIMATES table."""
        assert estimate_model_size("meta-llama/Llama-3.1-8B-Instruct") == 8
        assert estimate_model_size("meta-llama/Llama-3.1-70B-Instruct") == 70
        assert estimate_model_size("mistralai/Mistral-7B-Instruct-v0.3") == 7

    def test_extract_size_from_name(self):
        """Unknown models fall back to parsing the "<n>B" name marker."""
        assert estimate_model_size("some-org/CustomModel-13B") == 13
        assert estimate_model_size("another/model-2B-test") == 2
        assert estimate_model_size("org/Model-32B-Instruct") == 32

    def test_unknown_model_size(self):
        """Names with no size marker yield None rather than a guess."""
        assert estimate_model_size("unknown/model-without-size") is None
        assert estimate_model_size("org/mystery-model") is None
+
34
+
35
class TestQuantizationDecision:
    """Test automatic quantization decisions in config.should_quantize."""

    def test_small_model_no_quantization(self):
        """Models below the auto-quantize threshold load at full precision."""
        assert should_quantize("meta-llama/Llama-3.1-8B-Instruct") == "none"
        assert should_quantize("mistralai/Mistral-7B-Instruct-v0.3") == "none"

    def test_large_model_int4_quantization(self):
        """70B-class models should use INT4."""
        assert should_quantize("meta-llama/Llama-3.1-70B-Instruct") == "int4"
        assert should_quantize("Qwen/Qwen2.5-72B-Instruct") == "int4"

    def test_unknown_model_no_quantization(self):
        """Models of unknown size are not auto-quantized."""
        assert should_quantize("unknown/mystery-model") == "none"
+
52
+
53
class TestModelLoading:
    """Test model loading functionality (transformers classes are mocked)."""

    @patch("models.AutoModelForCausalLM")
    @patch("models.AutoTokenizer")
    def test_load_model_creates_loaded_model(
        self, mock_tokenizer_class, mock_model_class, mock_tokenizer, mock_model
    ):
        """Test that load_model returns a LoadedModel instance."""
        mock_tokenizer_class.from_pretrained.return_value = mock_tokenizer
        mock_model_class.from_pretrained.return_value = mock_model

        from models import load_model, unload_model

        # Ensure clean state — the module-level cache survives across tests.
        unload_model()

        loaded = load_model("test-model/test-7B")

        assert loaded.model_id == "test-model/test-7B"
        assert loaded.model is not None
        assert loaded.tokenizer is not None

    @patch("models.AutoModelForCausalLM")
    @patch("models.AutoTokenizer")
    def test_load_model_caches_result(
        self, mock_tokenizer_class, mock_model_class, mock_tokenizer, mock_model
    ):
        """Test that loading the same model twice uses the in-module cache."""
        mock_tokenizer_class.from_pretrained.return_value = mock_tokenizer
        mock_model_class.from_pretrained.return_value = mock_model

        from models import load_model, unload_model

        # Ensure clean state
        unload_model()

        # First load
        load_model("test-model/test-7B")
        first_call_count = mock_model_class.from_pretrained.call_count

        # Second load (should use cache)
        load_model("test-model/test-7B")
        second_call_count = mock_model_class.from_pretrained.call_count

        # Should not have called from_pretrained again
        assert first_call_count == second_call_count
+
101
+
102
class TestChatTemplate:
    """Test chat template application."""

    @patch("models.load_model")
    def test_apply_chat_template_with_tokenizer_method(self, mock_load_model, mock_tokenizer):
        """Test chat template when tokenizer has apply_chat_template."""
        from models import apply_chat_template, LoadedModel

        mock_load_model.return_value = LoadedModel(
            model_id="test-model",
            model=MagicMock(),
            tokenizer=mock_tokenizer,
        )

        messages = [
            {"role": "user", "content": "Hello!"},
        ]

        result = apply_chat_template("test-model", messages)

        assert "<|user|>" in result
        assert "Hello!" in result
        assert "<|assistant|>" in result  # Generation prompt

    @patch("models.load_model")
    def test_apply_chat_template_fallback(self, mock_load_model):
        """Test fallback formatting when tokenizer lacks apply_chat_template."""
        from models import apply_chat_template, LoadedModel

        # Tokenizer without apply_chat_template: deleting the attribute on a
        # MagicMock makes hasattr(...) return False.
        simple_tokenizer = MagicMock()
        del simple_tokenizer.apply_chat_template

        mock_load_model.return_value = LoadedModel(
            model_id="test-model",
            model=MagicMock(),
            tokenizer=simple_tokenizer,
        )

        messages = [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Hi!"},
        ]

        result = apply_chat_template("test-model", messages)

        assert "System:" in result
        assert "User:" in result
        assert "Assistant:" in result
tests/test_openai_compat.py ADDED
@@ -0,0 +1,263 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Tests for OpenAI-compatible API format handling."""
2
+
3
+ import json
4
+ import pytest
5
+ import sys
6
+ import os
7
+
8
+ # Add parent directory to path for imports
9
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
10
+
11
+ from openai_compat import (
12
+ ChatCompletionRequest,
13
+ ChatMessage,
14
+ InferenceParams,
15
+ create_chat_response,
16
+ create_error_response,
17
+ create_stream_chunk,
18
+ estimate_tokens,
19
+ generate_completion_id,
20
+ messages_to_dicts,
21
+ stream_response_generator,
22
+ )
23
+
24
+
25
class TestChatCompletionRequest:
    """Test request parsing."""

    def test_parse_basic_request(self, sample_request_data):
        """Test parsing a basic chat completion request."""
        parsed = ChatCompletionRequest(**sample_request_data)

        assert parsed.model == "meta-llama/Llama-3.1-8B-Instruct"
        # Two messages, system first then user.
        assert [msg.role for msg in parsed.messages] == ["system", "user"]
        assert parsed.temperature == 0.7
        assert parsed.max_tokens == 512
        assert parsed.stream is False

    def test_parse_streaming_request(self, sample_streaming_request_data):
        """Test parsing a streaming request."""
        parsed = ChatCompletionRequest(**sample_streaming_request_data)

        assert parsed.stream is True
        assert parsed.max_tokens == 256

    def test_default_values(self):
        """Test that defaults are applied correctly."""
        parsed = ChatCompletionRequest(
            model="test-model",
            messages=[{"role": "user", "content": "Hi"}],
        )

        assert parsed.temperature == 0.7
        assert parsed.top_p == 0.95
        assert parsed.max_tokens == 512
        assert parsed.stream is False
        assert parsed.stop is None

    def test_validation_temperature_bounds(self):
        """Test temperature validation."""
        # Both below-zero and above-two temperatures must be rejected.
        for bad_temperature in (-0.5, 2.5):
            with pytest.raises(ValueError):
                ChatCompletionRequest(
                    model="test",
                    messages=[{"role": "user", "content": "Hi"}],
                    temperature=bad_temperature,
                )
76
+
77
+
78
class TestChatCompletionResponse:
    """Test response generation."""

    def test_create_basic_response(self):
        """Test creating a basic chat response."""
        resp = create_chat_response(
            model="test-model",
            content="Hello! How can I help you?",
            prompt_tokens=10,
            completion_tokens=8,
        )

        assert resp.model == "test-model"
        assert resp.object == "chat.completion"
        assert len(resp.choices) == 1

        only_choice = resp.choices[0]
        assert only_choice.message.role == "assistant"
        assert only_choice.message.content == "Hello! How can I help you?"
        assert only_choice.finish_reason == "stop"

        # Usage accounting: total is the sum of prompt and completion.
        usage = resp.usage
        assert (usage.prompt_tokens, usage.completion_tokens, usage.total_tokens) == (10, 8, 18)

    def test_response_has_unique_id(self):
        """Test that each response has a unique ID."""
        first = create_chat_response(model="test", content="Hi")
        second = create_chat_response(model="test", content="Hi")

        assert first.id != second.id
        assert first.id.startswith("chatcmpl-")

    def test_response_serialization(self):
        """Test that response can be serialized to JSON."""
        resp = create_chat_response(
            model="test-model",
            content="Test",
        )

        # Round-trip through JSON to prove the model serializes cleanly.
        payload = json.loads(resp.model_dump_json())

        assert payload["model"] == "test-model"
        assert payload["choices"][0]["message"]["content"] == "Test"
120
+
121
+
122
class TestStreamingResponse:
    """Test streaming response format."""

    def test_create_stream_chunk_with_content(self):
        """Test creating a streaming chunk with content."""
        chunk = create_stream_chunk(
            completion_id="test-id",
            model="test-model",
            content="Hello",
        )

        assert chunk.id == "test-id"
        assert chunk.object == "chat.completion.chunk"

        delta_choice = chunk.choices[0]
        assert delta_choice.delta.content == "Hello"
        assert delta_choice.finish_reason is None

    def test_create_stream_chunk_with_role(self):
        """Test creating a streaming chunk with role (first chunk)."""
        chunk = create_stream_chunk(
            completion_id="test-id",
            model="test-model",
            role="assistant",
        )

        # The opening chunk announces the role and carries no content.
        delta = chunk.choices[0].delta
        assert delta.role == "assistant"
        assert delta.content is None

    def test_create_stream_chunk_with_finish_reason(self):
        """Test creating a final streaming chunk."""
        chunk = create_stream_chunk(
            completion_id="test-id",
            model="test-model",
            finish_reason="stop",
        )

        assert chunk.choices[0].finish_reason == "stop"

    def test_stream_response_generator(self):
        """Test the full streaming response generator."""
        tokens = iter(["Hello", " World", "!"])

        chunks = list(stream_response_generator("test-model", tokens))

        # Expected sequence: role chunk, 3 content chunks, finish chunk, [DONE].
        assert len(chunks) == 6

        def sse_payload(line):
            # Strip the SSE "data: " framing and parse the JSON body.
            return json.loads(line.replace("data: ", "").strip())

        # Opening chunk announces the assistant role.
        assert sse_payload(chunks[0])["choices"][0]["delta"]["role"] == "assistant"

        # First content chunk carries the first token.
        assert sse_payload(chunks[1])["choices"][0]["delta"]["content"] == "Hello"

        # Final data chunk signals the finish reason.
        assert sse_payload(chunks[4])["choices"][0]["finish_reason"] == "stop"

        # Stream ends with the SSE termination sentinel.
        assert chunks[5] == "data: [DONE]\n\n"
185
+
186
+
187
class TestInferenceParams:
    """Test parameter extraction."""

    def test_extract_params_from_request(self, sample_request_data):
        """Test extracting inference parameters from request."""
        parsed_request = ChatCompletionRequest(**sample_request_data)
        extracted = InferenceParams.from_request(parsed_request)

        assert extracted.model_id == "meta-llama/Llama-3.1-8B-Instruct"
        assert len(extracted.messages) == 2
        assert extracted.max_new_tokens == 512
        assert extracted.temperature == 0.7
        assert extracted.stream is False

    def test_messages_to_dicts(self):
        """Test converting ChatMessage objects to dicts."""
        converted = messages_to_dicts(
            [
                ChatMessage(role="user", content="Hello"),
                ChatMessage(role="assistant", content="Hi there!"),
            ]
        )

        # Each ChatMessage collapses to a plain role/content dict.
        assert converted == [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"},
        ]
214
+
215
+
216
class TestErrorResponse:
    """Test error response format."""

    def test_create_error_response(self):
        """Test creating an error response."""
        err = create_error_response(
            message="Model not found",
            error_type="invalid_request_error",
            param="model",
        )

        assert err.error.message == "Model not found"
        assert err.error.type == "invalid_request_error"
        assert err.error.param == "model"

    def test_error_response_serialization(self):
        """Test error response JSON serialization."""
        err = create_error_response(
            message="Test error",
            error_type="server_error",
            code="internal_error",
        )

        # Serialize and inspect the nested "error" object directly.
        error_body = json.loads(err.model_dump_json())["error"]

        assert error_body["message"] == "Test error"
        assert error_body["type"] == "server_error"
        assert error_body["code"] == "internal_error"
244
+
245
+
246
class TestUtilityFunctions:
    """Test utility functions."""

    def test_generate_completion_id_format(self):
        """Test completion ID format."""
        first_id = generate_completion_id()
        second_id = generate_completion_id()

        prefix = "chatcmpl-"
        assert first_id.startswith(prefix)
        # Fixed-width suffix: 24 random characters after the prefix.
        assert len(first_id) == len(prefix) + 24
        assert first_id != second_id  # IDs must be unique

    def test_estimate_tokens(self):
        """Test rough token estimation."""
        # Heuristic: roughly 4 characters per token.
        assert estimate_tokens("Hello World!") == 3  # 12 chars / 4 = 3
        assert estimate_tokens("A") == 1  # minimum of 1
        # NOTE(review): this string is 31 chars (not 32 as the original
        # comment claimed); the expected value 8 implies the estimator
        # rounds up — confirm against estimate_tokens' implementation.
        assert estimate_tokens("This is a longer piece of text.") == 8
+ assert estimate_tokens("This is a longer piece of text.") == 8 # 32 / 4 = 8