Split-Brain Speculative Co-Pilot

Build Small Hackathon — Complete Build Instructions

Concept: A 1.5B model runs entirely in the user's browser via WebGPU and transformers.js, streaming code as it is generated. A 14B model on Modal verifies the finished draft in the background. When the verifier catches a bug, the UI rolls back the local generation and replaces it with the corrected cloud block, live and visibly.

Models: Qwen2.5-Coder-1.5B-Instruct (browser, WebGPU) + Qwen2.5-Coder-14B-Instruct (Modal, GGUF via llama.cpp). Combined nominal size is 15.5B parameters, well under the 32B cap.

Bonus badges targeted: Off the Grid · Llama Champion · Off-Brand · Field Notes


0. Prerequisites

  • Python 3.11+
  • Node.js 18+ (for local frontend testing)
  • Modal account with modal CLI installed and authenticated (modal token new)
  • Hugging Face account, joined the build-small-hackathon org, HF token with write access
  • huggingface-cli installed and logged in (huggingface-cli login)
  • Chrome 113+ (WebGPU required — Firefox and Safari are out, document this clearly)
  • Git

1. Repository Structure

Set up the project layout before writing any code.

split-brain-copilot/
├── app.py                  # Gradio app entry point (HF Space root)
├── modal_backend/
│   ├── __init__.py
│   ├── verifier.py         # Modal app: 14B inference endpoint
│   └── sandbox.py          # Modal app: code execution sandbox
├── static/
│   ├── engine.js           # transformers.js WebGPU inference engine
│   ├── ui.js               # Stream rendering, rollback animation, diff logic
│   └── style.css           # Custom UI (required for Off-Brand badge)
├── requirements.txt
└── README.md               # HF Space card + demo video embed

Initialize git and create the HF Space repo. First check that huggingface-cli is installed and that a token is set (huggingface-cli whoami):

git init
huggingface-cli repo create split-brain-copilot --type space --space_sdk gradio
git remote add origin https://huggingface.co/spaces/YOUR_HF_USERNAME/split-brain-copilot

2. Modal Backend — 14B Verifier Endpoint

2.1 Download the GGUF model to a Modal Volume

The 14B model is too large to bake into the image. Use a Modal Volume for persistent storage.

# modal_backend/verifier.py
import modal

app = modal.App("split-brain-verifier")

# Persistent volume — survives cold starts
model_volume = modal.Volume.from_name("qwen-14b-volume", create_if_missing=True)

MODEL_DIR = "/models"
# File names on the Hub are case-sensitive; confirm this one against the repo's file list.
MODEL_FILENAME = "Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf"
# Source: bartowski/Qwen2.5-Coder-14B-Instruct-GGUF on Hugging Face
MODEL_REPO = "bartowski/Qwen2.5-Coder-14B-Instruct-GGUF"

Create a one-time download function:

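@app.function(
    # The default Modal image does not ship huggingface_hub, so install it here.
    image=modal.Image.debian_slim(python_version="3.11").pip_install("huggingface_hub"),
    volumes={MODEL_DIR: model_volume},
    timeout=3600,
    secrets=[modal.Secret.from_name("huggingface-secret")],
)
def download_model():
    """One-time download of the GGUF file into the Modal Volume."""
    from huggingface_hub import hf_hub_download

    # The token is read from the HF_TOKEN environment variable set by the Modal secret.
    hf_hub_download(
        repo_id=MODEL_REPO,
        filename=MODEL_FILENAME,
        local_dir=MODEL_DIR,
    )
    model_volume.commit()  # persist the file so other containers can see it
    print(f"Downloaded to {MODEL_DIR}/{MODEL_FILENAME}")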

Create the huggingface-secret Modal secret first (see 2.5), then run this once: modal run modal_backend/verifier.py::download_model

2.2 Build the llama.cpp image

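# Note: llama-cpp-python compiles its own bundled copy of llama.cpp at install time,
# so the Verifier class does not use the standalone build cloned below.
# Assumption to verify: a default pip build on debian_slim is CPU-only, in which case
# n_gpu_layers=-1 has no effect. To use the A10G, build llama-cpp-python with CUDA
# (for example CMAKE_ARGS="-DGGML_CUDA=on" on a CUDA base image).
llama_image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("build-essential", "cmake", "git", "libgomp1")
    .run_commands(
        "git clone https://github.com/ggerganov/llama.cpp /llama.cpp",
        "cd /llama.cpp && cmake -B build -DLLAMA_CURL=OFF && cmake --build build --config Release -j$(nproc)",
        "cd /llama.cpp && pip install -e .",
    )
    .pip_install("llama-cpp-python==0.3.4", "fastapi", "uvicorn")
)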

2.3 Verifier inference function

The verifier receives the speculated code draft and the original prompt, and returns a verdict: PASS, FIX, or REWRITE with corrected code.

@app.cls(
    image=llama_image,
    gpu=modal.gpu.A10G(),
    volumes={MODEL_DIR: model_volume},
    container_idle_timeout=300,
    allow_concurrent_inputs=10,
)
class Verifier:
    @modal.enter()
    def load_model(self):
        from llama_cpp import Llama
        self.llm = Llama(
            model_path=f"{MODEL_DIR}/{MODEL_FILENAME}",
            n_gpu_layers=-1,      # all layers on GPU
            n_ctx=8192,
            n_batch=512,
            verbose=False,
        )

    @modal.method()
    def verify(self, prompt: str, draft_code: str, language: str = "python") -> dict:
        system = f"""You are a code verifier. A smaller model drafted the following {language} code.
Your job:
1. Check for bugs, logic errors, type errors, off-by-one errors, and security issues.
2. If the code is correct, respond with exactly: {{"verdict": "PASS"}}
3. If fixable, respond with: {{"verdict": "FIX", "corrected_code": "<fixed code here>", "reason": "<one line>"}}
4. If fundamentally wrong, respond with: {{"verdict": "REWRITE", "corrected_code": "<rewritten code>", "reason": "<one line>"}}
Respond ONLY with valid JSON. No markdown, no explanation outside the JSON."""

        user = f"Original prompt:\n{prompt}\n\nDrafted code:\n```{language}\n{draft_code}\n```"

        response = self.llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            max_tokens=2048,
            temperature=0.1,
        )
        import json
        raw = response["choices"][0]["message"]["content"].strip()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Fallback: treat as PASS if we can't parse
            return {"verdict": "PASS"}
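Chat models often wrap JSON in a Markdown code fence or return an unexpected verdict string despite the system prompt. The sketch below is one possible hardening of the parsing step, not part of the original verifier: parse_verdict, VALID_VERDICTS, and the parse_error flag are hypothetical names. It keeps the original fallback of treating unparseable output as PASS, but marks it so the UI could tell "verified" from "could not parse".

```python
import json
import re

VALID_VERDICTS = {"PASS", "FIX", "REWRITE"}

# Matches a reply wrapped in a Markdown code fence, e.g. ```json ... ```
_FENCE_RE = re.compile(r"^```[a-zA-Z0-9_-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def parse_verdict(raw: str) -> dict:
    """Parse the verifier's raw reply into a verdict dict (hypothetical helper)."""
    text = raw.strip()

    # Unwrap a surrounding code fence if the model added one.
    match = _FENCE_RE.match(text)
    if match:
        text = match.group(1).strip()

    fallback = {"verdict": "PASS", "parse_error": True}
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return fallback

    # Reject non-objects and unknown verdict strings.
    if not isinstance(data, dict) or data.get("verdict") not in VALID_VERDICTS:
        return fallback

    # FIX and REWRITE are only usable if they carry replacement code.
    if data["verdict"] != "PASS" and not isinstance(data.get("corrected_code"), str):
        return fallback

    return data
```

Inside verify, the try/except around json.loads could then be replaced by a single call such as return parse_verdict(raw).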

2.4 Modal Sandbox — code execution (optional but impressive)

Sandboxed execution confirms the corrected code actually runs. This is what earns you extra credibility in the demo.

# modal_backend/sandbox.py
import modal

app = modal.App("split-brain-sandbox")

@app.function(timeout=30)
def execute_python(code: str) -> dict:
    """Run untrusted code in a Modal sandbox and return stdout/stderr."""
    sandbox = modal.Sandbox.create(
        "python3", "-c", code,
        image=modal.Image.debian_slim().pip_install("numpy"),
        timeout=10,
        cpu=0.5,
    )
    sandbox.wait()
    return {
        "stdout": sandbox.stdout.read(),
        "stderr": sandbox.stderr.read(),
        "returncode": sandbox.returncode,
    }
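The dict returned by execute_python still has to be turned into something readable for the sandbox output panel. A minimal sketch of that step follows, assuming the stdout/stderr/returncode keys shown above; format_sandbox_result and its max_chars limit are hypothetical and not part of the original design.

```python
def format_sandbox_result(result: dict, max_chars: int = 2000) -> str:
    """Render a sandbox result dict as display text (hypothetical helper)."""
    stdout = (result.get("stdout") or "").rstrip()
    stderr = (result.get("stderr") or "").rstrip()
    returncode = result.get("returncode")

    # A zero exit code is the only case treated as success.
    status = "ok" if returncode == 0 else f"failed (exit code {returncode})"

    parts = [f"# status: {status}"]
    if stdout:
        parts.append(stdout)
    if stderr:
        parts.append("# stderr:\n" + stderr)

    text = "\n".join(parts)
    # Keep runaway output from flooding the UI.
    if len(text) > max_chars:
        text = text[:max_chars] + "\n# ... output truncated"
    return text
```

The status line is written as a comment so the text stays valid when shown in a Python-highlighted code panel.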

2.5 Deploy the Modal backend

modal deploy modal_backend/verifier.py
modal deploy modal_backend/sandbox.py

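Note the endpoint URL printed after deploying verifier.py. You'll need it in app.py. As written, sandbox.py defines no web endpoint, so it prints no URL; it needs its own HTTP wrapper before MODAL_SANDBOX_URL can be set.

Store your HF token as a Modal secret. The download function in 2.1 reads it, so create the secret before running download_model: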

modal secret create huggingface-secret HF_TOKEN=hf_xxx

3. Browser Engine — transformers.js + WebGPU

3.1 Model choice for the browser

Use Qwen2.5-Coder-1.5B-Instruct in ONNX/WebGPU format. Xenova and onnx-community maintain these on HF Hub. Target: onnx-community/Qwen2.5-Coder-1.5B-Instruct with dtype: "q4" for fast WebGPU loading (~800MB, fits comfortably in browser VRAM on a modern GPU).
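As a rough sanity check on the ~800MB figure, weight size can be estimated from the parameter count and bits per weight. This is back-of-the-envelope arithmetic only: it uses the nominal 1.5B count and ignores quantization metadata and any layers kept at higher precision, so the actual download will be somewhat larger. estimate_weights_mb is a hypothetical helper.

```python
def estimate_weights_mb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bits_per_weight / 8 / 1e6


# Nominal 1.5B parameters at 4 bits per weight.
browser_mb = estimate_weights_mb(1.5e9, 4)
print(f"{browser_mb:.0f} MB")  # prints "750 MB"
```

The 750 MB floor is consistent with the ~800MB estimate once overhead is included; check the file sizes listed on the Hub for the exact number.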

3.2 engine.js — WebGPU inference

// static/engine.js
import { pipeline, TextStreamer } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.5.0/dist/transformers.min.js";

const MODEL_ID = "onnx-community/Qwen2.5-Coder-1.5B-Instruct";
let generator = null;
let isLoaded = false;

export async function loadModel(onProgress) {
    if (isLoaded) return;
    generator = await pipeline("text-generation", MODEL_ID, {
        dtype: "q4",
        device: "webgpu",
        progress_callback: onProgress,
    });
    isLoaded = true;
}

export async function generateCode(prompt, language, onToken, onComplete) {
    if (!generator) throw new Error("Model not loaded");

    const messages = [
        {
            role: "system",
            content: `You are an expert ${language} programmer. Write clean, correct, production-ready code. Output ONLY the code block, no explanation.`
        },
        { role: "user", content: prompt }
    ];

    const streamer = new TextStreamer(generator.tokenizer, {
        skip_prompt: true,
        callback_function: (token) => {
            onToken(token);
        },
    });

    const result = await generator(messages, {
        max_new_tokens: 1024,
        temperature: 0.2,
        do_sample: true,
        streamer,
    });

    const fullCode = result[0].generated_text.at(-1).content;
    onComplete(fullCode);
    return fullCode;
}

export function isWebGPUSupported() {
    return !!navigator.gpu;
}

3.3 ui.js — stream rendering + rollback animation

// static/ui.js

let currentTokens = [];
let streamBuffer = "";

export function initEditor(containerId) {
    // Attach to the Gradio custom HTML component
    const container = document.getElementById(containerId);
    container.innerHTML = `
        <div id="stream-display" class="code-stream"></div>
        <div id="status-bar" class="status-bar">
            <span id="status-text">Ready</span>
            <span id="token-count">0 tok/s</span>
            <span id="verifier-status"></span>
        </div>
    `;
}

export function appendToken(token) {
    streamBuffer += token;
    currentTokens.push(token);
    const display = document.getElementById("stream-display");
    if (display) display.textContent = streamBuffer;
}

export function setStatus(text, type = "neutral") {
    const el = document.getElementById("status-text");
    if (el) {
        el.textContent = text;
        el.className = `status-${type}`;
    }
}

export function setVerifierStatus(verdict) {
    const el = document.getElementById("verifier-status");
    if (!el) return;
    const icons = { PASS: "✅ Verified", FIX: "🔧 Fixed", REWRITE: "🔄 Rewritten", CHECKING: "🔍 Verifying..." };
    el.textContent = icons[verdict] || "";
}

export async function rollbackAndReplace(correctedCode, reason, verdict = "FIX") {
    const display = document.getElementById("stream-display");
    if (!display) return;

    // Flash red to signal rollback
    display.classList.add("rollback-flash");
    setVerifierStatus(verdict); // "FIX" or "REWRITE"; defaults to "FIX"
    setStatus(`Verifier corrected: ${reason}`, "warning");

    await sleep(400);
    display.classList.remove("rollback-flash");

    // Type in corrected code character by character
    display.textContent = "";
    streamBuffer = correctedCode;
    currentTokens = [];

    for (let i = 0; i < correctedCode.length; i++) {
        display.textContent += correctedCode[i];
        if (i % 5 === 0) await sleep(8); // smooth typewriter
    }

    setVerifierStatus("PASS");
    setStatus("Ready", "neutral");
}

export function getCurrentCode() {
    return streamBuffer;
}

export function reset() {
    streamBuffer = "";
    currentTokens = [];
    const display = document.getElementById("stream-display");
    if (display) display.textContent = "";
}

function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

3.4 style.css — custom UI (Off-Brand badge)

/* static/style.css */
:root {
    --bg: #0d1117;
    --surface: #161b22;
    --border: #30363d;
    --accent: #58a6ff;
    --accent-warn: #f0883e;
    --text: #e6edf3;
    --text-muted: #8b949e;
    --green: #3fb950;
    --red: #f85149;
}

body { background: var(--bg); color: var(--text); font-family: 'JetBrains Mono', monospace; }

.code-stream {
    background: var(--surface);
    border: 1px solid var(--border);
    border-radius: 8px;
    padding: 16px;
    min-height: 300px;
    font-family: 'JetBrains Mono', monospace;
    font-size: 13px;
    line-height: 1.6;
    white-space: pre-wrap;
    overflow-y: auto;
    transition: border-color 0.2s;
}

.rollback-flash {
    border-color: var(--red) !important;
    background: rgba(248, 81, 73, 0.08) !important;
    animation: flash 0.4s ease;
}

@keyframes flash {
    0%   { background: rgba(248, 81, 73, 0.25); }
    100% { background: rgba(248, 81, 73, 0.08); }
}

.status-bar {
    display: flex;
    justify-content: space-between;
    padding: 8px 12px;
    background: var(--surface);
    border: 1px solid var(--border);
    border-top: none;
    border-radius: 0 0 8px 8px;
    font-size: 12px;
    color: var(--text-muted);
}

.status-warning { color: var(--accent-warn); }
.status-success { color: var(--green); }
.status-neutral { color: var(--text-muted); }

/* Gradio overrides */
.gradio-container { background: var(--bg) !important; }
footer { display: none !important; }

/* WebGPU loading bar */
.loading-bar {
    height: 3px;
    background: var(--border);
    border-radius: 2px;
    overflow: hidden;
    margin: 8px 0;
}
.loading-bar-fill {
    height: 100%;
    background: var(--accent);
    transition: width 0.3s ease;
}

4. Gradio App — app.py

This is the HF Space entry point. Gradio acts as the shell; the real UI lives in the custom HTML component injected via gr.HTML.

# app.py
import gradio as gr
import httpx
import json
import os
import asyncio
from pathlib import Path

MODAL_VERIFIER_URL = os.environ.get("MODAL_VERIFIER_URL")  # set as HF Space secret
MODAL_SANDBOX_URL = os.environ.get("MODAL_SANDBOX_URL")    # set as HF Space secret

LANGUAGES = ["Python", "JavaScript", "TypeScript", "Rust", "Go", "C++"]

def load_static(filename):
    return Path(f"static/{filename}").read_text()

custom_html = f"""
<!DOCTYPE html>
<html>
<head>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
<style>{load_static('style.css')}</style>
</head>
<body>
<div id="split-brain-root">
    <div class="webgpu-notice" id="webgpu-warning" style="display:none; color:#f85149; padding:8px; border:1px solid #f85149; border-radius:6px; margin-bottom:12px;">
        ⚠️ WebGPU not detected. Please use Chrome 113+ on desktop for local inference.
    </div>
    <div id="load-section">
        <button id="load-btn" onclick="initEngine()">⚡ Load 1.5B Model (WebGPU)</button>
        <div class="loading-bar"><div class="loading-bar-fill" id="load-progress" style="width:0%"></div></div>
        <span id="load-status" style="font-size:12px; color:#8b949e;"></span>
    </div>
    <div id="stream-display" class="code-stream" style="margin-top:12px;">Waiting for model load...</div>
    <div class="status-bar">
        <span id="status-text">Idle</span>
        <span id="token-count"></span>
        <span id="verifier-status"></span>
    </div>
</div>
<script type="module">
{load_static('engine.js')}
{load_static('ui.js')}

// Check WebGPU on load
if (!isWebGPUSupported()) {{
    document.getElementById('webgpu-warning').style.display = 'block';
    document.getElementById('load-btn').disabled = true;
}}

window.initEngine = async function() {{
    document.getElementById('load-btn').disabled = true;
    document.getElementById('load-status').textContent = 'Loading model weights...';
    await loadModel((progress) => {{
        if (progress.progress) {{
            document.getElementById('load-progress').style.width = progress.progress + '%';
            document.getElementById('load-status').textContent = `${{progress.file || 'Loading'}} — ${{Math.round(progress.progress)}}%`;
        }}
    }});
    document.getElementById('load-status').textContent = '✅ Model ready — WebGPU active';
    document.getElementById('load-section').style.opacity = '0.5';
}};

// Gradio will call this via the hidden trigger
window.runLocalGeneration = async function(prompt, language) {{
    reset();
    setStatus('Generating locally (WebGPU)...', 'neutral');

    let tokenCount = 0;
    const startTime = Date.now();

    const fullCode = await generateCode(prompt, language,
        (token) => {{
            appendToken(token);
            tokenCount++;
            const elapsed = (Date.now() - startTime) / 1000;
            const tps = Math.round(tokenCount / elapsed);
            document.getElementById('token-count').textContent = `${{tps}} tok/s`;
        }},
        (code) => {{
            setStatus('Local generation complete. Verifying...', 'neutral');
        }}
    );

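    // Send to Gradio backend for verification.
    // elem_id lands on the component wrapper, so target the inner textarea and
    // dispatch an input event so Gradio registers the new value before the click.
    const draftBox = document.querySelector('#draft-output-hidden textarea');
    draftBox.value = fullCode;
    draftBox.dispatchEvent(new Event('input', {{ bubbles: true }}));
    document.getElementById('trigger-verify-btn').click();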
}};

window.applyVerification = function(verdictJson) {{
    const verdict = JSON.parse(verdictJson);
    if (verdict.verdict === 'PASS') {{
        setVerifierStatus('PASS');
        setStatus('✅ Verified clean', 'success');
    }} else {{
        rollbackAndReplace(verdict.corrected_code, verdict.reason, verdict.verdict);
    }}
}};
</script>
</body>
</html>
"""

async def verify_with_modal(prompt: str, draft_code: str, language: str) -> str:
    """Call Modal verifier endpoint and return JSON string."""
    if not MODAL_VERIFIER_URL:
        return json.dumps({"verdict": "PASS"})
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            MODAL_VERIFIER_URL,
            json={"prompt": prompt, "draft_code": draft_code, "language": language},
        )
        resp.raise_for_status()
        return resp.text

async def execute_in_sandbox(code: str) -> dict:
    """Call Modal sandbox and return execution result."""
    if not MODAL_SANDBOX_URL:
        return {"stdout": "", "stderr": "Sandbox not configured", "returncode": -1}
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(MODAL_SANDBOX_URL, json={"code": code})
        return resp.json()

with gr.Blocks(
    title="Split-Brain Co-Pilot",
    css="footer {display:none}",
    theme=gr.themes.Base(
        primary_hue="blue",
        neutral_hue="slate",
    ),
) as demo:
    gr.HTML("<h1 style='text-align:center; color:#58a6ff;'>⚡ Split-Brain Co-Pilot</h1>")
    gr.HTML("<p style='text-align:center; color:#8b949e;'>1.5B model runs in your browser (WebGPU). 14B model on Modal verifies and corrects.</p>")

    with gr.Row():
        with gr.Column(scale=2):
            prompt_input = gr.Textbox(
                label="What do you want to build?",
                placeholder="e.g. A function that parses a CSV and returns the top 5 rows by a given column",
                lines=3,
            )
            language_select = gr.Dropdown(
                choices=LANGUAGES,
                value="Python",
                label="Language",
            )
            generate_btn = gr.Button("⚡ Generate (WebGPU → Verify)", variant="primary")

        with gr.Column(scale=3):
            # Custom HTML component for streaming display
            gr.HTML(custom_html)

            # Hidden elements for JS ↔ Gradio bridge
            draft_hidden = gr.Textbox(visible=False, elem_id="draft-output-hidden")
            verify_trigger = gr.Button("verify", visible=False, elem_id="trigger-verify-btn")
            verdict_output = gr.Textbox(visible=False, label="verdict")

    with gr.Row():
        sandbox_output = gr.Code(label="Sandbox Execution Output", language="python", visible=False)

    # Gradio event: user clicks Generate → JS takes over for local inference
    generate_btn.click(
        fn=None,
        inputs=[prompt_input, language_select],
        outputs=[],
        js="(prompt, lang) => { window.runLocalGeneration(prompt, lang); return []; }",
    )

    # Gradio event: JS triggers verify after local generation completes
    async def run_verification(prompt, draft_code, language):
        verdict_json = await verify_with_modal(prompt, draft_code, language)
        return verdict_json

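    # Chain the JS callback with .then() so it runs after every verification.
    # A .change listener on verdict_output would not fire when two consecutive
    # verdicts are identical (for example two PASS results in a row).
    verify_trigger.click(
        fn=run_verification,
        inputs=[prompt_input, draft_hidden, language_select],
        outputs=[verdict_output],
    ).then(
        fn=None,
        inputs=[verdict_output],
        outputs=[],
        js="(verdict) => { window.applyVerification(verdict); return []; }",
    )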

if __name__ == "__main__":
    demo.launch()

5. Modal Web Endpoint Wrapper

The Modal functions need to be exposed as HTTP endpoints that app.py can call via httpx. Add this to verifier.py:

from fastapi import FastAPI
from pydantic import BaseModel

web_app = FastAPI()

class VerifyRequest(BaseModel):
    prompt: str
    draft_code: str
    language: str = "python"

# The web wrapper only forwards requests, so it needs neither a GPU nor the model
# volume; the Verifier class holds the model on its own GPU container.
@app.function(image=llama_image)
@modal.asgi_app()
def verifier_endpoint():
    verifier = Verifier()

    @web_app.post("/verify")
    async def verify(req: VerifyRequest):
        # .remote.aio() awaits the remote call without blocking the event loop
        return await verifier.verify.remote.aio(req.prompt, req.draft_code, req.language)

    return web_app

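After deploying, Modal gives you a URL like https://your-username--split-brain-verifier-verifier-endpoint.modal.run. The route defined above is /verify, and app.py posts to MODAL_VERIFIER_URL as-is, so set the HF Space secret MODAL_VERIFIER_URL to that URL with /verify appended.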
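Because the FastAPI app registers its handler at /verify while Modal prints only the base URL, a small normalizer in app.py can make the setting tolerant of either form. This is a sketch under that assumption; verify_url is a hypothetical helper, and the example hostname is a placeholder.

```python
def verify_url(base_url: str, route: str = "/verify") -> str:
    """Return the full verifier URL, appending the route only if it is missing."""
    base = base_url.rstrip("/")
    if base.endswith(route):
        return base
    return base + route


# Both forms of the secret resolve to the same endpoint.
print(verify_url("https://example.modal.run"))
print(verify_url("https://example.modal.run/verify/"))
```

In verify_with_modal, the post call would then target verify_url(MODAL_VERIFIER_URL) instead of the raw environment value.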


6. HF Space Configuration

6.1 README.md (Space card)

---
title: Split-Brain Co-Pilot
emoji: 
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.30.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
  - code-generation
  - webgpu
  - speculative-decoding
  - llama.cpp
  - local-first
---

6.2 requirements.txt

gradio==5.30.0
httpx==0.27.0
modal==0.73.0
huggingface-hub==0.23.0

6.3 HF Space secrets

Set these in the Space Settings → Repository secrets:

| Secret name | Value |
| --- | --- |
| MODAL_VERIFIER_URL | Your Modal verifier endpoint URL |
| MODAL_SANDBOX_URL | Your Modal sandbox endpoint URL |
| MODAL_TOKEN_ID | From modal token show |
| MODAL_TOKEN_SECRET | From modal token show |

7. Cold Start Mitigation

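Modal A10G containers can take 10–40 seconds to cold start. Handle this gracefully.

Add a scheduled keep-warm function to verifier.py (not to verify_with_modal in app.py). It runs a real verification on the GPU every 5 minutes, so enable it only around the demo window: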

@app.function(schedule=modal.Cron("*/5 * * * *"))
def keep_warm():
    """Ping the verifier every 5 minutes to avoid cold starts during the demo window."""
    Verifier().verify.remote("test", "print('hello')", "python")

The function belongs to the same app, so redeploy to activate the schedule: modal deploy modal_backend/verifier.py

In the UI, show "Verifier warming up..." in the status bar while the first request is in flight and display a spinner. Do not let the UI appear broken during cold start.
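On the app.py side, a cold start usually shows up as a timeout or a transient error on the first request. One way to absorb it is to retry with exponential backoff and report each retry so the UI can show the warming-up state. The helper below is a generic sketch, not part of the original app: call_with_retry and its parameters are hypothetical, and it catches every Exception for brevity where real code would likely catch only httpx.TimeoutException and httpx.HTTPStatusError.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retry(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 2.0,
    on_retry: Optional[Callable[[int, Exception], None]] = None,
) -> T:
    """Await make_call(), retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception as exc:  # assumption: narrow this in real code
            if attempt == attempts:
                raise  # out of retries, surface the last error
            if on_retry is not None:
                on_retry(attempt, exc)  # e.g. update a "warming up" status
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

    raise RuntimeError("unreachable")
```

A call site in run_verification could look like await call_with_retry(lambda: verify_with_modal(prompt, draft_code, language)), with on_retry used to surface the "Verifier warming up..." message.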


8. Demo Video Script

The demo video is a submission requirement. Plan it around these beats:

  • Open Chrome, show the app. Explain the split-brain concept in one sentence.
  • Click "Load 1.5B Model" — show the WebGPU loading progress bar.
  • Type a non-trivial prompt: "Write a Python function that finds all prime numbers up to n using a segmented sieve, handling edge cases."
  • Hit Generate and show tokens streaming with the live tok/s counter. The rate depends on the recording machine's GPU, so quote the measured number rather than a target.
  • Show the "Verifying..." status kick in immediately after local generation completes.
  • If the verifier returns FIX or REWRITE: show the red flash rollback animation and the corrected code typing in.
  • Show the sandbox execution output (stdout) confirming the corrected code runs.
  • End on the split status bar: "Local: WebGPU · Cloud: Modal A10G · Verdict: ✅ Verified"

Keep the video under 3 minutes. Record with OBS or Loom. No cuts during the generation — the live stream is the point.


9. Bonus Badge Checklist

| Badge | How you earn it | Status |
| --- | --- | --- |
| Off the Grid | 1.5B runs 100% in browser, no cloud API for inference | ✅ Automatic |
| Llama Champion | 14B served via llama.cpp on Modal | ✅ Automatic |
| Off-Brand | Custom dark theme, rollback animation, token counter, status bar | ✅ Build it |
| Field Notes | Write a blog post on HF or Dev.to explaining the speculative split-brain architecture | ✅ Write it post-build |

10. Submission Checklist

Before June 15 deadline:

  • Modal verifier deployed and endpoint URL confirmed working
  • HF Space live and publicly accessible under build-small-hackathon org
  • WebGPU model loads in Chrome without errors
  • Token streaming visible in UI
  • Rollback animation triggers on at least one FIX/REWRITE verdict
  • Sandbox execution output shown in demo
  • Demo video recorded and uploaded (YouTube unlisted or HF)
  • Social media post published (Twitter/X or LinkedIn) with Space link and demo video
  • README.md Space card complete with description, tags, and video embed
  • Field Notes blog post published and linked in README

11. Known Gotchas

WebGPU VRAM: The 1.5B Q4 ONNX model needs ~1GB VRAM. On machines with integrated graphics sharing system RAM, this works but may be slow. Document the Chrome + dedicated GPU requirement.

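CORS: In this design the browser never calls Modal directly; app.py calls it server-side through httpx, which is not subject to CORS. If you later call the Modal endpoint from browser JS and hit CORS errors, add fastapi.middleware.cors.CORSMiddleware to the web_app with allow_origins=["*"].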

transformers.js version: Pin to 3.5.x. Breaking changes in 3.x are frequent. The CDN import in engine.js uses the pinned version — don't use @latest.

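Gradio JS bridge: Clicking a hidden gr.Button from browser JS is a simple way to fire a Python function in Gradio 5.x without websocket hacks. Do not use gr.Request for this; it won't work from inside a custom HTML block. Check two things in DevTools. First, confirm that components created with visible=False are actually present in the DOM; if they are not, keep them visible and hide them with CSS. Second, confirm that the script inside gr.HTML has run: browsers do not execute script tags inserted through innerHTML, so if window.initEngine is undefined, load the script through the head argument of gr.Blocks instead.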

Modal Volume first deploy: The volume download must complete before the verifier function can load the model. Run download_model manually once and confirm with modal volume ls qwen-14b-volume before deploying the endpoint. The file sits at the volume root; /models is only the mount path inside the container.

HF Space cold start: HF Spaces themselves also cold start. If the Space hasn't been visited recently, Gradio takes 20–30 seconds to boot. Add a loading spinner at the Gradio level using gr.HTML with a brief "Space initializing..." message that auto-hides once the page is interactive.