# Split-Brain Speculative Co-Pilot
### Build Small Hackathon — Complete Build Instructions
> **Concept:** A 1.5B model runs entirely in the user's browser via WebGPU + transformers.js, streaming code instantly. A 14B model on Modal verifies the draft in the background. When the verifier catches a bug, the UI rolls back the local generation and replaces it with the corrected cloud block, live and visibly.
>
> **Models:** `Qwen2.5-Coder-1.5B` (browser, WebGPU) + `Qwen2.5-Coder-14B-Instruct` (Modal, GGUF via llama.cpp) — combined 15.5B, well under the 32B cap.
>
> **Bonus badges targeted:** Off the Grid · Llama Champion · Off-Brand · Field Notes
---
## 0. Prerequisites
- Python 3.11+
- Node.js 18+ (for local frontend testing)
- Modal account with `modal` CLI installed and authenticated (`modal token new`)
- Hugging Face account that has joined the `build-small-hackathon` org, plus an HF token with write access
- `huggingface-cli` installed and logged in (`huggingface-cli login`)
- Chrome 113+ (WebGPU required — Firefox and Safari are out, document this clearly)
- Git
---
## 1. Repository Structure
Set up the project layout before writing any code.
```
split-brain-copilot/
├── app.py                  # Gradio app entry point (HF Space root)
├── modal_backend/
│   ├── __init__.py
│   ├── verifier.py         # Modal app: 14B inference endpoint
│   └── sandbox.py          # Modal app: code execution sandbox
├── static/
│   ├── engine.js           # transformers.js WebGPU inference engine
│   ├── ui.js               # Stream rendering, rollback animation, diff logic
│   └── style.css           # Custom UI (required for Off-Brand badge)
├── requirements.txt
└── README.md               # HF Space card + demo video embed
```
Initialize git and create an HF Space repo. First confirm that `huggingface-cli` is installed and that you are logged in (`huggingface-cli whoami`):
```bash
git init
huggingface-cli repo create split-brain-copilot --type space --space-sdk gradio
git remote add origin https://huggingface.co/spaces/YOUR_HF_USERNAME/split-brain-copilot
```
---
## 2. Modal Backend — 14B Verifier Endpoint
### 2.1 Download the GGUF model to a Modal Volume
The 14B model is too large to bake into the image. Use a Modal Volume for persistent storage.
```python
# modal_backend/verifier.py
import modal
app = modal.App("split-brain-verifier")
# Persistent volume — survives cold starts
model_volume = modal.Volume.from_name("qwen-14b-volume", create_if_missing=True)
MODEL_DIR = "/models"
MODEL_FILENAME = "qwen2.5-coder-14b-instruct-q4_k_m.gguf"
# Source: bartowski/Qwen2.5-Coder-14B-Instruct-GGUF on HuggingFace
MODEL_REPO = "bartowski/Qwen2.5-Coder-14B-Instruct-GGUF"
```
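To see why the weights belong on a Volume rather than in the image, a rough size estimate helps. The sketch below is back-of-the-envelope arithmetic only: the figure of about 4.85 bits per weight for Q4_K_M and the nominal 14e9 parameter count are assumptions, and the real file will differ somewhat.

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB (10^9 bytes) of a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed inputs: a nominal 14e9 parameters and ~4.85 bits/weight for Q4_K_M.
estimate = gguf_size_gb(14e9, 4.85)
print(f"Estimated GGUF size: {estimate:.1f} GB")
```

A file of this order would make every image rebuild slow, whereas a Volume is downloaded once and mounted at container start.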
Create a one-time download function:
```python
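@app.function(
    # The default image does not ship huggingface_hub, so install it here.
    image=modal.Image.debian_slim(python_version="3.11").pip_install("huggingface_hub"),
    volumes={MODEL_DIR: model_volume},
    timeout=3600,
    secrets=[modal.Secret.from_name("huggingface-secret")],
)
def download_model():
    from huggingface_hub import hf_hub_download

    # HF_TOKEN from the secret is read from the environment automatically.
    # Hub filenames are case-sensitive: confirm MODEL_FILENAME against the
    # repo's file list before running.
    hf_hub_download(
        repo_id=MODEL_REPO,
        filename=MODEL_FILENAME,
        local_dir=MODEL_DIR,
    )
    model_volume.commit()
    print(f"Downloaded to {MODEL_DIR}/{MODEL_FILENAME}")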
```
Run this once: `modal run modal_backend/verifier.py::download_model`
### 2.2 Build the llama.cpp image
```python
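llama_image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("build-essential", "cmake", "git", "libgomp1")
    .run_commands(
        "git clone https://github.com/ggerganov/llama.cpp /llama.cpp",
        "cd /llama.cpp && cmake -B build -DLLAMA_CURL=OFF && cmake --build build --config Release -j$(nproc)",
        "cd /llama.cpp && pip install -e .",
    )
    # Caveat: llama-cpp-python compiles its own bundled llama.cpp, and on this
    # CUDA-less base image that build is CPU-only. For n_gpu_layers to offload
    # to the GPU, build on a CUDA base image with CMAKE_ARGS="-DGGML_CUDA=on".
    .pip_install("llama-cpp-python==0.3.4", "fastapi", "uvicorn")
)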
```
### 2.3 Verifier inference function
The verifier receives the speculated code draft and the original prompt, and returns a verdict: `PASS`, `FIX`, or `REWRITE` with corrected code.
```python
@app.cls(
    image=llama_image,
    gpu=modal.gpu.A10G(),
    volumes={MODEL_DIR: model_volume},
    container_idle_timeout=300,
    allow_concurrent_inputs=10,
)
class Verifier:
    @modal.enter()
    def load_model(self):
        from llama_cpp import Llama

        self.llm = Llama(
            model_path=f"{MODEL_DIR}/{MODEL_FILENAME}",
            n_gpu_layers=-1,  # all layers on GPU
            n_ctx=8192,
            n_batch=512,
            verbose=False,
        )

    @modal.method()
    def verify(self, prompt: str, draft_code: str, language: str = "python") -> dict:
        import json

        system = f"""You are a code verifier. A smaller model drafted the following {language} code.
Your job:
1. Check for bugs, logic errors, type errors, off-by-one errors, and security issues.
2. If the code is correct, respond with exactly: {{"verdict": "PASS"}}
3. If fixable, respond with: {{"verdict": "FIX", "corrected_code": "<fixed code here>", "reason": "<one line>"}}
4. If fundamentally wrong, respond with: {{"verdict": "REWRITE", "corrected_code": "<rewritten code>", "reason": "<one line>"}}
Respond ONLY with valid JSON. No markdown, no explanation outside the JSON."""
        user = f"Original prompt:\n{prompt}\n\nDrafted code:\n```{language}\n{draft_code}\n```"
        response = self.llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            max_tokens=2048,
            temperature=0.1,
        )
        raw = response["choices"][0]["message"]["content"].strip()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Fallback: treat as PASS if we can't parse
            return {"verdict": "PASS"}
```
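The fallback above reports `PASS` whenever the verifier's output fails to parse, which can mark unchecked code as verified. Chat models also tend to wrap JSON in a Markdown fence despite instructions. The following is a minimal sketch of a stricter parser, not the project's implementation: `parse_verdict` and the extra `UNVERIFIED` verdict are hypothetical names, and the UI would need a branch to handle that verdict.

```python
import json
import re

VALID_VERDICTS = {"PASS", "FIX", "REWRITE"}
FENCE = "`" * 3  # a Markdown code fence, built here to keep this listing fence-free
FENCED = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)


def parse_verdict(raw: str) -> dict:
    """Parse verifier output, tolerating a surrounding Markdown fence."""
    text = raw.strip()
    match = FENCED.match(text)
    if match:
        text = match.group(1).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Hypothetical verdict: do not claim the draft passed verification.
        return {"verdict": "UNVERIFIED", "reason": "verifier output was not valid JSON"}
    if not isinstance(data, dict) or data.get("verdict") not in VALID_VERDICTS:
        return {"verdict": "UNVERIFIED", "reason": "unexpected verdict shape"}
    if data["verdict"] != "PASS" and not isinstance(data.get("corrected_code"), str):
        return {"verdict": "UNVERIFIED", "reason": "missing corrected_code"}
    return data
```

With this in place, `verify` could return `parse_verdict(raw)` instead of calling `json.loads` directly.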
### 2.4 Modal Sandbox — code execution (optional but impressive)
Sandboxed execution confirms the corrected code actually runs. This is what earns you extra credibility in the demo.
```python
# modal_backend/sandbox.py
import modal
app = modal.App("split-brain-sandbox")
@app.function(timeout=30)
def execute_python(code: str) -> dict:
    """Run untrusted code in a Modal sandbox and return stdout/stderr."""
    sandbox = modal.Sandbox.create(
        "python3", "-c", code,
        image=modal.Image.debian_slim().pip_install("numpy"),
        timeout=10,
        cpu=0.5,
    )
    sandbox.wait()
    return {
        "stdout": sandbox.stdout.read(),
        "stderr": sandbox.stderr.read(),
        "returncode": sandbox.returncode,
    }
```
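While wiring up the UI it can be handy to have a local stand-in that returns the same `stdout`/`stderr`/`returncode` shape without a Modal round trip. The sketch below is such a stand-in under stated assumptions: `execute_python_locally` is a hypothetical helper, and a child process is not a security boundary, so it should only ever run code you trust.

```python
import subprocess
import sys


def execute_python_locally(code: str, timeout_s: float = 10.0) -> dict:
    """Run code in a child interpreter and mirror the sandbox result shape.

    Development aid only: this provides no isolation from the host machine.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": f"Timed out after {timeout_s}s", "returncode": -1}
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }
```

Swapping this in for the Modal call during development keeps the rest of the pipeline unchanged, since the returned dictionary has the same keys.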
### 2.5 Deploy the Modal backend
```bash
modal deploy modal_backend/verifier.py
modal deploy modal_backend/sandbox.py
```
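Plain `@app.function` and `@app.cls` deployments do not print an HTTP URL; the verifier URL appears only once the web endpoint from Section 5 is added and deployed. Note it at that point, because `app.py` needs it.
Store your HF token as a Modal secret. Do this before running `download_model` in 2.1, which reads `huggingface-secret`: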
```bash
modal secret create huggingface-secret HF_TOKEN=hf_xxx
```
---
## 3. Browser Engine — transformers.js + WebGPU
### 3.1 Model choice for the browser
Use `Qwen2.5-Coder-1.5B-Instruct` in ONNX/WebGPU format. Xenova and onnx-community maintain these on HF Hub. Target:
`onnx-community/Qwen2.5-Coder-1.5B-Instruct` with `dtype: "q4"` for fast WebGPU loading (~800MB, fits comfortably in browser VRAM on a modern GPU).
### 3.2 engine.js — WebGPU inference
```javascript
// static/engine.js
import { pipeline, TextStreamer } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.5.0/dist/transformers.min.js";

const MODEL_ID = "onnx-community/Qwen2.5-Coder-1.5B-Instruct";
let generator = null;
let isLoaded = false;

export async function loadModel(onProgress) {
  if (isLoaded) return;
  generator = await pipeline("text-generation", MODEL_ID, {
    dtype: "q4",
    device: "webgpu",
    progress_callback: onProgress,
  });
  isLoaded = true;
}

export async function generateCode(prompt, language, onToken, onComplete) {
  if (!generator) throw new Error("Model not loaded");
  const messages = [
    {
      role: "system",
      content: `You are an expert ${language} programmer. Write clean, correct, production-ready code. Output ONLY the code block, no explanation.`
    },
    { role: "user", content: prompt }
  ];
  const streamer = new TextStreamer(generator.tokenizer, {
    skip_prompt: true,
    callback_function: (token) => {
      onToken(token);
    },
  });
  const result = await generator(messages, {
    max_new_tokens: 1024,
    temperature: 0.2,
    do_sample: true,
    streamer,
  });
  const fullCode = result[0].generated_text.at(-1).content;
  onComplete(fullCode);
  return fullCode;
}

export function isWebGPUSupported() {
  return !!navigator.gpu;
}
```
### 3.3 ui.js — stream rendering + rollback animation
```javascript
// static/ui.js
let currentTokens = [];
let streamBuffer = "";

export function initEditor(containerId) {
  // Attach to the Gradio custom HTML component
  const container = document.getElementById(containerId);
  container.innerHTML = `
    <div id="stream-display" class="code-stream"></div>
    <div id="status-bar" class="status-bar">
      <span id="status-text">Ready</span>
      <span id="token-count">0 tok/s</span>
      <span id="verifier-status"></span>
    </div>
  `;
}

export function appendToken(token) {
  streamBuffer += token;
  currentTokens.push(token);
  const display = document.getElementById("stream-display");
  if (display) display.textContent = streamBuffer;
}

export function setStatus(text, type = "neutral") {
  const el = document.getElementById("status-text");
  if (el) {
    el.textContent = text;
    el.className = `status-${type}`;
  }
}

export function setVerifierStatus(verdict) {
  const el = document.getElementById("verifier-status");
  if (!el) return;
  const icons = { PASS: "✅ Verified", FIX: "🔧 Fixed", REWRITE: "🔄 Rewritten", CHECKING: "🔍 Verifying..." };
  el.textContent = icons[verdict] || "";
}

export async function rollbackAndReplace(correctedCode, reason) {
  const display = document.getElementById("stream-display");
  if (!display) return;
  // Flash red to signal rollback
  display.classList.add("rollback-flash");
  setVerifierStatus("FIX");
  setStatus(`Verifier corrected: ${reason}`, "warning");
  await sleep(400);
  display.classList.remove("rollback-flash");
  // Type in corrected code character by character
  display.textContent = "";
  streamBuffer = correctedCode;
  currentTokens = [];
  for (let i = 0; i < correctedCode.length; i++) {
    display.textContent += correctedCode[i];
    if (i % 5 === 0) await sleep(8); // smooth typewriter
  }
  setVerifierStatus("PASS");
  setStatus("Ready", "neutral");
}

export function getCurrentCode() {
  return streamBuffer;
}

export function reset() {
  streamBuffer = "";
  currentTokens = [];
  const display = document.getElementById("stream-display");
  if (display) display.textContent = "";
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```
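The repository layout describes `ui.js` as holding diff logic, but `rollbackAndReplace` above swaps the whole buffer. If you want the rollback to highlight only what the verifier changed, one option is a line-level diff. The sketch below shows the idea in Python for brevity; `changed_line_numbers` is a hypothetical helper, and a browser version would need a JavaScript diff routine in its place.

```python
import difflib


def changed_line_numbers(draft: str, corrected: str) -> list[int]:
    """Return 1-based line numbers in `corrected` that differ from `draft`."""
    draft_lines = draft.splitlines()
    corrected_lines = corrected.splitlines()
    matcher = difflib.SequenceMatcher(a=draft_lines, b=corrected_lines, autojunk=False)
    changed: list[int] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" both introduce lines on the corrected side;
        # "delete" removes draft lines and leaves nothing to highlight.
        if tag in ("replace", "insert"):
            changed.extend(range(j1 + 1, j2 + 1))
    return changed


draft = "def f(n):\n    return range(n)\n"
fixed = "def f(n):\n    return list(range(n + 1))\n"
print(changed_line_numbers(draft, fixed))
```

The returned line numbers could be sent alongside `corrected_code` so the UI flashes only the affected lines.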
### 3.4 style.css — custom UI (Off-Brand badge)
```css
/* static/style.css */
:root {
  --bg: #0d1117;
  --surface: #161b22;
  --border: #30363d;
  --accent: #58a6ff;
  --accent-warn: #f0883e;
  --text: #e6edf3;
  --text-muted: #8b949e;
  --green: #3fb950;
  --red: #f85149;
}

body { background: var(--bg); color: var(--text); font-family: 'JetBrains Mono', monospace; }

.code-stream {
  background: var(--surface);
  border: 1px solid var(--border);
  border-radius: 8px;
  padding: 16px;
  min-height: 300px;
  font-family: 'JetBrains Mono', monospace;
  font-size: 13px;
  line-height: 1.6;
  white-space: pre-wrap;
  overflow-y: auto;
  transition: border-color 0.2s;
}

.rollback-flash {
  border-color: var(--red) !important;
  background: rgba(248, 81, 73, 0.08) !important;
  animation: flash 0.4s ease;
}

@keyframes flash {
  0% { background: rgba(248, 81, 73, 0.25); }
  100% { background: rgba(248, 81, 73, 0.08); }
}

.status-bar {
  display: flex;
  justify-content: space-between;
  padding: 8px 12px;
  background: var(--surface);
  border: 1px solid var(--border);
  border-top: none;
  border-radius: 0 0 8px 8px;
  font-size: 12px;
  color: var(--text-muted);
}

.status-warning { color: var(--accent-warn); }
.status-success { color: var(--green); }
.status-neutral { color: var(--text-muted); }

/* Gradio overrides */
.gradio-container { background: var(--bg) !important; }
footer { display: none !important; }

/* WebGPU loading bar */
.loading-bar {
  height: 3px;
  background: var(--border);
  border-radius: 2px;
  overflow: hidden;
  margin: 8px 0;
}

.loading-bar-fill {
  height: 100%;
  background: var(--accent);
  transition: width 0.3s ease;
}
```
---
## 4. Gradio App — app.py
This is the HF Space entry point. Gradio acts as the shell; the real UI lives in the custom HTML component injected via `gr.HTML`.
```python
# app.py
import gradio as gr
import httpx
import json
import os
from pathlib import Path

MODAL_VERIFIER_URL = os.environ.get("MODAL_VERIFIER_URL")  # set as HF Space secret
MODAL_SANDBOX_URL = os.environ.get("MODAL_SANDBOX_URL")  # set as HF Space secret
LANGUAGES = ["Python", "JavaScript", "TypeScript", "Rust", "Go", "C++"]


def load_static(filename):
    return Path(f"static/{filename}").read_text()


custom_html = f"""
<!DOCTYPE html>
<html>
<head>
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
  <style>{load_static('style.css')}</style>
</head>
<body>
  <div id="split-brain-root">
    <div class="webgpu-notice" id="webgpu-warning" style="display:none; color:#f85149; padding:8px; border:1px solid #f85149; border-radius:6px; margin-bottom:12px;">
      ⚠️ WebGPU not detected. Please use Chrome 113+ on desktop for local inference.
    </div>
    <div id="load-section">
      <button id="load-btn" onclick="initEngine()">⚡ Load 1.5B Model (WebGPU)</button>
      <div class="loading-bar"><div class="loading-bar-fill" id="load-progress" style="width:0%"></div></div>
      <span id="load-status" style="font-size:12px; color:#8b949e;"></span>
    </div>
    <div id="stream-display" class="code-stream" style="margin-top:12px;">Waiting for model load...</div>
    <div class="status-bar">
      <span id="status-text">Idle</span>
      <span id="token-count"></span>
      <span id="verifier-status"></span>
    </div>
  </div>
  <script type="module">
    {load_static('engine.js')}
    {load_static('ui.js')}

    // Check WebGPU on load
    if (!isWebGPUSupported()) {{
      document.getElementById('webgpu-warning').style.display = 'block';
      document.getElementById('load-btn').disabled = true;
    }}

    window.initEngine = async function() {{
      document.getElementById('load-btn').disabled = true;
      document.getElementById('load-status').textContent = 'Loading model weights...';
      await loadModel((progress) => {{
        if (progress.progress) {{
          document.getElementById('load-progress').style.width = progress.progress + '%';
          document.getElementById('load-status').textContent = `${{progress.file || 'Loading'}} - ${{Math.round(progress.progress)}}%`;
        }}
      }});
      document.getElementById('load-status').textContent = '✅ Model ready - WebGPU active';
      document.getElementById('load-section').style.opacity = '0.5';
    }};

    // Gradio will call this via the hidden trigger
    window.runLocalGeneration = async function(prompt, language) {{
      reset();
      setStatus('Generating locally (WebGPU)...', 'neutral');
      let tokenCount = 0;
      const startTime = Date.now();
      const fullCode = await generateCode(prompt, language,
        (token) => {{
          appendToken(token);
          tokenCount++;
          const elapsed = (Date.now() - startTime) / 1000;
          const tps = Math.round(tokenCount / elapsed);
          document.getElementById('token-count').textContent = `${{tps}} tok/s`;
        }},
        (code) => {{
          setStatus('Local generation complete. Verifying...', 'neutral');
        }}
      );
      // Send to Gradio backend for verification.
      // elem_id sits on the component wrapper, so write to the inner field and
      // fire an input event so Gradio registers the new value.
      const draftField = document.querySelector('#draft-output-hidden textarea, #draft-output-hidden input');
      draftField.value = fullCode;
      draftField.dispatchEvent(new Event('input', {{ bubbles: true }}));
      document.getElementById('trigger-verify-btn').click();
    }};

    window.applyVerification = function(verdictJson) {{
      const verdict = JSON.parse(verdictJson);
      if (verdict.verdict === 'PASS') {{
        setVerifierStatus('PASS');
        setStatus('✅ Verified clean', 'success');
      }} else {{
        rollbackAndReplace(verdict.corrected_code, verdict.reason);
      }}
    }};
  </script>
</body>
</html>
"""


async def verify_with_modal(prompt: str, draft_code: str, language: str) -> str:
    """Call Modal verifier endpoint and return JSON string."""
    if not MODAL_VERIFIER_URL:
        return json.dumps({"verdict": "PASS"})
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            MODAL_VERIFIER_URL,
            json={"prompt": prompt, "draft_code": draft_code, "language": language},
        )
        resp.raise_for_status()
        return resp.text


async def execute_in_sandbox(code: str) -> dict:
    """Call Modal sandbox and return execution result."""
    if not MODAL_SANDBOX_URL:
        return {"stdout": "", "stderr": "Sandbox not configured", "returncode": -1}
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(MODAL_SANDBOX_URL, json={"code": code})
        return resp.json()


with gr.Blocks(
    title="Split-Brain Co-Pilot",
    css="footer {display:none}",
    theme=gr.themes.Base(
        primary_hue="blue",
        neutral_hue="slate",
    ),
) as demo:
    gr.HTML("<h1 style='text-align:center; color:#58a6ff;'>⚡ Split-Brain Co-Pilot</h1>")
    gr.HTML("<p style='text-align:center; color:#8b949e;'>1.5B model runs in your browser (WebGPU). 14B model on Modal verifies and corrects.</p>")
    with gr.Row():
        with gr.Column(scale=2):
            prompt_input = gr.Textbox(
                label="What do you want to build?",
                placeholder="e.g. A function that parses a CSV and returns the top 5 rows by a given column",
                lines=3,
            )
            language_select = gr.Dropdown(
                choices=LANGUAGES,
                value="Python",
                label="Language",
            )
            generate_btn = gr.Button("⚡ Generate (WebGPU → Verify)", variant="primary")
        with gr.Column(scale=3):
            # Custom HTML component for streaming display.
            # Caveat: gr.HTML is documented as rendering static HTML only. If the
            # inline <script> above does not execute, move it to the `head`
            # parameter of gr.Blocks.
            gr.HTML(custom_html)

    # Hidden elements for JS ↔ Gradio bridge
    draft_hidden = gr.Textbox(visible=False, elem_id="draft-output-hidden")
    verify_trigger = gr.Button("verify", visible=False, elem_id="trigger-verify-btn")
    verdict_output = gr.Textbox(visible=False, label="verdict")
    with gr.Row():
        sandbox_output = gr.Code(label="Sandbox Execution Output", language="python", visible=False)

    # Gradio event: user clicks Generate → JS takes over for local inference
    generate_btn.click(
        fn=None,
        inputs=[prompt_input, language_select],
        outputs=[],
        js="(prompt, lang) => { window.runLocalGeneration(prompt, lang); return []; }",
    )

    # Gradio event: JS triggers verify after local generation completes
    async def run_verification(prompt, draft_code, language):
        verdict_json = await verify_with_modal(prompt, draft_code, language)
        return verdict_json

    verify_trigger.click(
        fn=run_verification,
        inputs=[prompt_input, draft_hidden, language_select],
        outputs=[verdict_output],
    )

    # Apply verdict back to JS
    verdict_output.change(
        fn=None,
        inputs=[verdict_output],
        outputs=[],
        js="(verdict) => { window.applyVerification(verdict); return []; }",
    )

if __name__ == "__main__":
    demo.launch()
```
---
## 5. Modal Web Endpoint Wrapper
The Modal functions need to be exposed as HTTP endpoints that `app.py` can call via httpx. Add this to `verifier.py`:
```python
@app.function(
    # The wrapper only forwards requests to the Verifier class, so it needs
    # neither a GPU nor the model volume.
    image=llama_image,
    container_idle_timeout=300,
)
@modal.asgi_app()
def verifier_endpoint():
    # Imported here so containers built without FastAPI can still import this module.
    from fastapi import FastAPI
    from pydantic import BaseModel

    web_app = FastAPI()

    class VerifyRequest(BaseModel):
        prompt: str
        draft_code: str
        language: str = "python"

    verifier = Verifier()

    @web_app.post("/verify")
    async def verify(req: VerifyRequest):
        # .remote.aio awaits the GPU call without blocking the event loop.
        return await verifier.verify.remote.aio(req.prompt, req.draft_code, req.language)

    return web_app
```
After deploying, Modal prints a URL like `https://your-username--split-brain-verifier-verifier-endpoint.modal.run`. The route is registered at `/verify`, so set the HF Space secret `MODAL_VERIFIER_URL` to that URL with `/verify` appended.
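Before wiring the URL into the Space, it is worth sending one request by hand. The sketch below builds that request with the standard library only; `build_verify_request` is a hypothetical helper, and the base URL in the example is a placeholder. It assumes the route lives at `/verify`, as registered by `@web_app.post("/verify")`.

```python
import json
import urllib.request


def build_verify_request(
    base_url: str, prompt: str, draft_code: str, language: str = "python"
) -> urllib.request.Request:
    """Build a POST request matching the VerifyRequest fields."""
    url = base_url.rstrip("/") + "/verify"
    body = json.dumps(
        {"prompt": prompt, "draft_code": draft_code, "language": language}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host: substitute the URL printed by `modal deploy`.
req = build_verify_request("https://example.invalid/", "add two ints", "def add(a, b): return a - b")
# To send it: urllib.request.urlopen(req, timeout=60).read()
```

A generous timeout on the first call leaves room for the container to cold start.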
---
## 6. HF Space Configuration
### 6.1 README.md (Space card)
```yaml
---
title: Split-Brain Co-Pilot
emoji: ⚡
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.30.0
app_file: app.py
pinned: true
license: apache-2.0
tags:
- code-generation
- webgpu
- speculative-decoding
- llama.cpp
- local-first
---
```
### 6.2 requirements.txt
```
gradio==5.30.0
httpx==0.27.0
modal==0.73.0
huggingface-hub>=0.28.1
```
### 6.3 HF Space secrets
Set these in the Space Settings → Repository secrets:
| Secret name | Value |
|---|---|
| `MODAL_VERIFIER_URL` | Your Modal verifier endpoint URL |
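| `MODAL_SANDBOX_URL` | Your Modal sandbox endpoint URL (requires exposing `execute_python` over HTTP, as Section 5 does for the verifier) |
| `MODAL_TOKEN_ID` | `token_id` value in `~/.modal.toml` (created by `modal token new`) |
| `MODAL_TOKEN_SECRET` | `token_secret` value in `~/.modal.toml` |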
---
## 7. Cold Start Mitigation
Modal A10G containers take 10–40 seconds to cold start. Handle this gracefully:
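Add a scheduled keep-warm ping to `verifier.py`:
```python
@app.function(schedule=modal.Cron("*/5 * * * *"))
def keep_warm():
    """Ping the verifier every 5 minutes to avoid cold starts during the demo window."""
    Verifier().verify.remote("test", "print('hello')", "python")
```
Redeploy so the schedule is registered: `modal deploy modal_backend/verifier.py`. Because the 5-minute interval equals `container_idle_timeout=300`, a container can still scale down just before a ping arrives; shorten the interval or raise the timeout if that matters for the demo.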
In the UI, show "Verifier warming up..." in the status bar while the first request is in flight and display a spinner. Do not let the UI appear broken during cold start.
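On the server side, one way to tolerate a cold start is to retry the verifier call with a growing delay while reporting a status message. The sketch below illustrates that pattern with the standard library only; `call_with_warmup`, its parameters, and the broad `except` are assumptions for illustration, and real code would catch the specific `httpx` timeout and status errors instead.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_warmup(
    call: Callable[[], Awaitable[str]],
    on_status: Callable[[str], None],
    attempts: int = 3,
    base_delay_s: float = 2.0,
) -> str:
    """Await `call`, retrying with exponential backoff if it raises."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except Exception:  # sketch only: narrow this to transport errors
            if attempt == attempts:
                raise
            on_status("Verifier warming up...")
            # Delays grow as base, 2x base, 4x base, ...
            await asyncio.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In `app.py`, `call` could be a small lambda wrapping `verify_with_modal(prompt, draft_code, language)`, with `on_status` feeding whatever mechanism updates the status bar.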
---
## 8. Demo Video Script
The demo video is a submission requirement. Plan it around these beats:
- Open Chrome, show the app. Explain the split-brain concept in one sentence.
- Click "Load 1.5B Model" — show the WebGPU loading progress bar.
- Type a non-trivial prompt: "Write a Python function that finds all prime numbers up to n using a segmented sieve, handling edge cases."
- Hit Generate — show tokens streaming at 80–120 tok/s with the token counter live.
- Show the "Verifying..." status kick in immediately after local generation completes.
- If the verifier returns FIX or REWRITE: show the red flash rollback animation and the corrected code typing in.
- Show the sandbox execution output (stdout) confirming the corrected code runs.
- End on the split status bar: "Local: WebGPU · Cloud: Modal A10G · Verdict: ✅ Verified"
Keep the video under 3 minutes. Record with OBS or Loom. No cuts during the generation — the live stream is the point.
---
## 9. Bonus Badge Checklist
| Badge | How you earn it | Status |
|---|---|---|
| **Off the Grid** | 1.5B runs 100% in browser, no cloud API for inference | ✅ Automatic |
| **Llama Champion** | 14B served via llama.cpp on Modal | ✅ Automatic |
| **Off-Brand** | Custom dark theme, rollback animation, token counter, status bar | ✅ Build it |
| **Field Notes** | Write a blog post on HF or Dev.to explaining the speculative split-brain architecture | ✅ Write it post-build |
---
## 10. Submission Checklist
Before the June 15 deadline:
- [ ] Modal verifier deployed and endpoint URL confirmed working
- [ ] HF Space live and publicly accessible under `build-small-hackathon` org
- [ ] WebGPU model loads in Chrome without errors
- [ ] Token streaming visible in UI
- [ ] Rollback animation triggers on at least one FIX/REWRITE verdict
- [ ] Sandbox execution output shown in demo
- [ ] Demo video recorded and uploaded (YouTube unlisted or HF)
- [ ] Social media post published (Twitter/X or LinkedIn) with Space link and demo video
- [ ] README.md Space card complete with description, tags, and video embed
- [ ] Field Notes blog post published and linked in README
---
## 11. Known Gotchas
**WebGPU VRAM:** The 1.5B Q4 ONNX model needs ~1GB VRAM. On machines with integrated graphics sharing system RAM, this works but may be slow. Document the Chrome + dedicated GPU requirement.
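**CORS:** In this design `app.py` calls the Modal endpoint server-side, so browser CORS rules do not apply to verification requests. If you later call the endpoint directly from browser JS and hit CORS errors, add `fastapi.middleware.cors.CORSMiddleware` to the web_app with `allow_origins=["*"]`.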
**transformers.js version:** Pin to `3.5.x`. Breaking changes in 3.x are frequent. The CDN import in `engine.js` uses the pinned version — don't use `@latest`.
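**Gradio JS bridge:** The `gr.Button(visible=False)` trigger pattern is the cleanest way to fire a Python function from browser JS in Gradio 5.x without websocket hacks. Do not use `gr.Request` for this, since it won't work from inside a custom HTML block. When writing to a `gr.Textbox` from JS, note that `elem_id` lands on the component wrapper: target the inner `<textarea>` or `<input>` and dispatch an `input` event so Gradio picks up the value.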
**Modal Volume first deploy:** The volume download must complete before the verifier function can load the model. Run `download_model` manually once and confirm with `modal volume ls qwen-14b-volume` before deploying the endpoint. The volume is mounted at `/models` inside containers, but the file sits at the root of the volume itself.
**HF Space cold start:** HF Spaces themselves also cold start. If the Space hasn't been visited recently, Gradio takes 20–30 seconds to boot. Add a loading spinner at the Gradio level using `gr.HTML` with a brief "Space initializing..." message that auto-hides once the page is interactive.