# Split-Brain Speculative Co-Pilot ### Build Small Hackathon — Complete Build Instructions > **Concept:** A 1B model runs entirely in the user's browser via WebGPU + transformers.js, streaming code instantly. A 14B model on Modal verifies the draft in the background. When the verifier catches a bug, the UI rolls back the local generation and replaces it with the corrected cloud block — live, visually. > > **Models:** `Qwen2.5-Coder-1.5B` (browser, WebGPU) + `Qwen2.5-Coder-14B-Instruct` (Modal, GGUF via llama.cpp) — combined 15.5B, well under the 32B cap. > > **Bonus badges targeted:** Off the Grid · Llama Champion · Off-Brand · Field Notes --- ## 0. Prerequisites - Python 3.11+ - Node.js 18+ (for local frontend testing) - Modal account with `modal` CLI installed and authenticated (`modal token new`) - Hugging Face account, joined the `build-small-hackathon` org, HF token with write access - `huggingface-cli` installed and logged in (`huggingface-cli login`) - Chrome 113+ (WebGPU required — Firefox and Safari are out, document this clearly) - Git --- ## 1. Repository Structure Set up the project layout before writing any code. ``` split-brain-copilot/ ├── app.py # Gradio app entry point (HF Space root) ├── modal_backend/ │ ├── __init__.py │ ├── verifier.py # Modal app: 14B inference endpoint │ └── sandbox.py # Modal app: code execution sandbox ├── static/ │ ├── engine.js # transformers.js WebGPU inference engine │ ├── ui.js # Stream rendering, rollback animation, diff logic │ └── style.css # Custom UI (required for Off-Brand badge) ├── requirements.txt └── README.md # HF Space card + demo video embed ``` Initialize git and create a HF Space repo (check whether we have huggingface cli installed and token set or logged in): ```bash git init huggingface-cli repo create split-brain-copilot --type space --space-sdk gradio git remote add origin https://huggingface.co/spaces/YOUR_HF_USERNAME/split-brain-copilot ``` --- ## 2. Modal Backend — 14B Verifier Endpoint ### 2.1 Download the GGUF model to a Modal Volume The 14B model is too large to bake into the image. Use a Modal Volume for persistent storage. ```python # modal_backend/verifier.py import modal app = modal.App("split-brain-verifier") # Persistent volume — survives cold starts model_volume = modal.Volume.from_name("qwen-14b-volume", create_if_missing=True) MODEL_DIR = "/models" MODEL_FILENAME = "qwen2.5-coder-14b-instruct-q4_k_m.gguf" # Source: bartowski/Qwen2.5-Coder-14B-Instruct-GGUF on HuggingFace MODEL_REPO = "bartowski/Qwen2.5-Coder-14B-Instruct-GGUF" ``` Create a one-time download function: ```python @app.function( volumes={MODEL_DIR: model_volume}, timeout=3600, secrets=[modal.Secret.from_name("huggingface-secret")], ) def download_model(): from huggingface_hub import hf_hub_download import os hf_hub_download( repo_id=MODEL_REPO, filename=MODEL_FILENAME, local_dir=MODEL_DIR, ) model_volume.commit() print(f"Downloaded to {MODEL_DIR}/{MODEL_FILENAME}") ``` Run this once: `modal run modal_backend/verifier.py::download_model` ### 2.2 Build the llama.cpp image ```python llama_image = ( modal.Image.debian_slim(python_version="3.11") .apt_install("build-essential", "cmake", "git", "libgomp1") .run_commands( "git clone https://github.com/ggerganov/llama.cpp /llama.cpp", "cd /llama.cpp && cmake -B build -DLLAMA_CURL=OFF && cmake --build build --config Release -j$(nproc)", "cd /llama.cpp && pip install -e .", ) .pip_install("llama-cpp-python==0.3.4", "fastapi", "uvicorn") ) ``` ### 2.3 Verifier inference function The verifier receives the speculated code draft and the original prompt, and returns a verdict: `PASS`, `FIX`, or `REWRITE` with corrected code. ```python @app.cls( image=llama_image, gpu=modal.gpu.A10G(), volumes={MODEL_DIR: model_volume}, container_idle_timeout=300, allow_concurrent_inputs=10, ) class Verifier: @modal.enter() def load_model(self): from llama_cpp import Llama self.llm = Llama( model_path=f"{MODEL_DIR}/{MODEL_FILENAME}", n_gpu_layers=-1, # all layers on GPU n_ctx=8192, n_batch=512, verbose=False, ) @modal.method() def verify(self, prompt: str, draft_code: str, language: str = "python") -> dict: system = f"""You are a code verifier. A smaller model drafted the following {language} code. Your job: 1. Check for bugs, logic errors, type errors, off-by-one errors, and security issues. 2. If the code is correct, respond with exactly: {{"verdict": "PASS"}} 3. If fixable, respond with: {{"verdict": "FIX", "corrected_code": "", "reason": ""}} 4. If fundamentally wrong, respond with: {{"verdict": "REWRITE", "corrected_code": "", "reason": ""}} Respond ONLY with valid JSON. No markdown, no explanation outside the JSON.""" user = f"Original prompt:\n{prompt}\n\nDrafted code:\n```{language}\n{draft_code}\n```" response = self.llm.create_chat_completion( messages=[ {"role": "system", "content": system}, {"role": "user", "content": user}, ], max_tokens=2048, temperature=0.1, ) import json raw = response["choices"][0]["message"]["content"].strip() try: return json.loads(raw) except json.JSONDecodeError: # Fallback: treat as PASS if we can't parse return {"verdict": "PASS"} ``` ### 2.4 Modal Sandbox — code execution (optional but impressive) Sandboxed execution confirms the corrected code actually runs. This is what earns you extra credibility in the demo. ```python # modal_backend/sandbox.py import modal app = modal.App("split-brain-sandbox") @app.function(timeout=30) def execute_python(code: str) -> dict: """Run untrusted code in a Modal sandbox and return stdout/stderr.""" sandbox = modal.Sandbox.create( "python3", "-c", code, image=modal.Image.debian_slim().pip_install("numpy"), timeout=10, cpu=0.5, ) sandbox.wait() return { "stdout": sandbox.stdout.read(), "stderr": sandbox.stderr.read(), "returncode": sandbox.returncode, } ``` ### 2.5 Deploy the Modal backend ```bash modal deploy modal_backend/verifier.py modal deploy modal_backend/sandbox.py ``` Note the endpoint URLs printed after deploy. You'll need them in `app.py`. Store your Modal token and HF token as Modal secrets: ```bash modal secret create huggingface-secret HF_TOKEN=hf_xxx ``` --- ## 3. Browser Engine — transformers.js + WebGPU ### 3.1 Model choice for the browser Use `Qwen2.5-Coder-1.5B-Instruct` in ONNX/WebGPU format. Xenova and onnx-community maintain these on HF Hub. Target: `onnx-community/Qwen2.5-Coder-1.5B-Instruct` with `dtype: "q4"` for fast WebGPU loading (~800MB, fits comfortably in browser VRAM on a modern GPU). ### 3.2 engine.js — WebGPU inference ```javascript // static/engine.js import { pipeline, TextStreamer } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.5.0/dist/transformers.min.js"; const MODEL_ID = "onnx-community/Qwen2.5-Coder-1.5B-Instruct"; let generator = null; let isLoaded = false; export async function loadModel(onProgress) { if (isLoaded) return; generator = await pipeline("text-generation", MODEL_ID, { dtype: "q4", device: "webgpu", progress_callback: onProgress, }); isLoaded = true; } export async function generateCode(prompt, language, onToken, onComplete) { if (!generator) throw new Error("Model not loaded"); const messages = [ { role: "system", content: `You are an expert ${language} programmer. Write clean, correct, production-ready code. Output ONLY the code block, no explanation.` }, { role: "user", content: prompt } ]; const streamer = new TextStreamer(generator.tokenizer, { skip_prompt: true, callback_function: (token) => { onToken(token); }, }); const result = await generator(messages, { max_new_tokens: 1024, temperature: 0.2, do_sample: true, streamer, }); const fullCode = result[0].generated_text.at(-1).content; onComplete(fullCode); return fullCode; } export function isWebGPUSupported() { return !!navigator.gpu; } ``` ### 3.3 ui.js — stream rendering + rollback animation ```javascript // static/ui.js let currentTokens = []; let streamBuffer = ""; export function initEditor(containerId) { // Attach to the Gradio custom HTML component const container = document.getElementById(containerId); container.innerHTML = `
Ready 0 tok/s
`; } export function appendToken(token) { streamBuffer += token; currentTokens.push(token); const display = document.getElementById("stream-display"); if (display) display.textContent = streamBuffer; } export function setStatus(text, type = "neutral") { const el = document.getElementById("status-text"); if (el) { el.textContent = text; el.className = `status-${type}`; } } export function setVerifierStatus(verdict) { const el = document.getElementById("verifier-status"); if (!el) return; const icons = { PASS: "✅ Verified", FIX: "🔧 Fixed", REWRITE: "🔄 Rewritten", CHECKING: "🔍 Verifying..." }; el.textContent = icons[verdict] || ""; } export async function rollbackAndReplace(correctedCode, reason) { const display = document.getElementById("stream-display"); if (!display) return; // Flash red to signal rollback display.classList.add("rollback-flash"); setVerifierStatus("FIX"); setStatus(`Verifier corrected: ${reason}`, "warning"); await sleep(400); display.classList.remove("rollback-flash"); // Type in corrected code character by character display.textContent = ""; streamBuffer = correctedCode; currentTokens = []; for (let i = 0; i < correctedCode.length; i++) { display.textContent += correctedCode[i]; if (i % 5 === 0) await sleep(8); // smooth typewriter } setVerifierStatus("PASS"); setStatus("Ready", "neutral"); } export function getCurrentCode() { return streamBuffer; } export function reset() { streamBuffer = ""; currentTokens = []; const display = document.getElementById("stream-display"); if (display) display.textContent = ""; } function sleep(ms) { return new Promise(resolve => setTimeout(resolve, ms)); } ``` ### 3.4 style.css — custom UI (Off-Brand badge) ```css /* static/style.css */ :root { --bg: #0d1117; --surface: #161b22; --border: #30363d; --accent: #58a6ff; --accent-warn: #f0883e; --text: #e6edf3; --text-muted: #8b949e; --green: #3fb950; --red: #f85149; } body { background: var(--bg); color: var(--text); font-family: 'JetBrains Mono', monospace; } .code-stream { background: var(--surface); border: 1px solid var(--border); border-radius: 8px; padding: 16px; min-height: 300px; font-family: 'JetBrains Mono', monospace; font-size: 13px; line-height: 1.6; white-space: pre-wrap; overflow-y: auto; transition: border-color 0.2s; } .rollback-flash { border-color: var(--red) !important; background: rgba(248, 81, 73, 0.08) !important; animation: flash 0.4s ease; } @keyframes flash { 0% { background: rgba(248, 81, 73, 0.25); } 100% { background: rgba(248, 81, 73, 0.08); } } .status-bar { display: flex; justify-content: space-between; padding: 8px 12px; background: var(--surface); border: 1px solid var(--border); border-top: none; border-radius: 0 0 8px 8px; font-size: 12px; color: var(--text-muted); } .status-warning { color: var(--accent-warn); } .status-success { color: var(--green); } .status-neutral { color: var(--text-muted); } /* Gradio overrides */ .gradio-container { background: var(--bg) !important; } footer { display: none !important; } /* WebGPU loading bar */ .loading-bar { height: 3px; background: var(--border); border-radius: 2px; overflow: hidden; margin: 8px 0; } .loading-bar-fill { height: 100%; background: var(--accent); transition: width 0.3s ease; } ``` --- ## 4. Gradio App — app.py This is the HF Space entry point. Gradio acts as the shell; the real UI lives in the custom HTML component injected via `gr.HTML`. ```python # app.py import gradio as gr import httpx import json import os import asyncio from pathlib import Path MODAL_VERIFIER_URL = os.environ.get("MODAL_VERIFIER_URL") # set as HF Space secret MODAL_SANDBOX_URL = os.environ.get("MODAL_SANDBOX_URL") # set as HF Space secret LANGUAGES = ["Python", "JavaScript", "TypeScript", "Rust", "Go", "C++"] def load_static(filename): return Path(f"static/{filename}").read_text() custom_html = f"""
Waiting for model load...
Idle
""" async def verify_with_modal(prompt: str, draft_code: str, language: str) -> str: """Call Modal verifier endpoint and return JSON string.""" if not MODAL_VERIFIER_URL: return json.dumps({"verdict": "PASS"}) async with httpx.AsyncClient(timeout=60.0) as client: resp = await client.post( MODAL_VERIFIER_URL, json={"prompt": prompt, "draft_code": draft_code, "language": language}, ) resp.raise_for_status() return resp.text async def execute_in_sandbox(code: str) -> dict: """Call Modal sandbox and return execution result.""" if not MODAL_SANDBOX_URL: return {"stdout": "", "stderr": "Sandbox not configured", "returncode": -1} async with httpx.AsyncClient(timeout=30.0) as client: resp = await client.post(MODAL_SANDBOX_URL, json={"code": code}) return resp.json() with gr.Blocks( title="Split-Brain Co-Pilot", css="footer {display:none}", theme=gr.themes.Base( primary_hue="blue", neutral_hue="slate", ), ) as demo: gr.HTML("

⚡ Split-Brain Co-Pilot

") gr.HTML("

1.5B model runs in your browser (WebGPU). 14B model on Modal verifies and corrects.

") with gr.Row(): with gr.Column(scale=2): prompt_input = gr.Textbox( label="What do you want to build?", placeholder="e.g. A function that parses a CSV and returns the top 5 rows by a given column", lines=3, ) language_select = gr.Dropdown( choices=LANGUAGES, value="Python", label="Language", ) generate_btn = gr.Button("⚡ Generate (WebGPU → Verify)", variant="primary") with gr.Column(scale=3): # Custom HTML component for streaming display gr.HTML(custom_html) # Hidden elements for JS ↔ Gradio bridge draft_hidden = gr.Textbox(visible=False, elem_id="draft-output-hidden") verify_trigger = gr.Button("verify", visible=False, elem_id="trigger-verify-btn") verdict_output = gr.Textbox(visible=False, label="verdict") with gr.Row(): sandbox_output = gr.Code(label="Sandbox Execution Output", language="python", visible=False) # Gradio event: user clicks Generate → JS takes over for local inference generate_btn.click( fn=None, inputs=[prompt_input, language_select], outputs=[], js="(prompt, lang) => { window.runLocalGeneration(prompt, lang); return []; }", ) # Gradio event: JS triggers verify after local generation completes async def run_verification(prompt, draft_code, language): verdict_json = await verify_with_modal(prompt, draft_code, language) return verdict_json verify_trigger.click( fn=run_verification, inputs=[prompt_input, draft_hidden, language_select], outputs=[verdict_output], ) # Apply verdict back to JS verdict_output.change( fn=None, inputs=[verdict_output], outputs=[], js="(verdict) => { window.applyVerification(verdict); return []; }", ) if __name__ == "__main__": demo.launch() ``` --- ## 5. Modal Web Endpoint Wrapper The Modal functions need to be exposed as HTTP endpoints that `app.py` can call via httpx. Add this to `verifier.py`: ```python from fastapi import FastAPI from pydantic import BaseModel web_app = FastAPI() class VerifyRequest(BaseModel): prompt: str draft_code: str language: str = "python" @app.function( image=llama_image, gpu=modal.gpu.A10G(), volumes={MODEL_DIR: model_volume}, container_idle_timeout=300, ) @modal.asgi_app() def verifier_endpoint(): verifier = Verifier() @web_app.post("/verify") async def verify(req: VerifyRequest): result = verifier.verify.remote(req.prompt, req.draft_code, req.language) return result return web_app ``` After deploying, Modal gives you a URL like `https://your-username--split-brain-verifier-verifier-endpoint.modal.run`. Set this as the HF Space secret `MODAL_VERIFIER_URL`. --- ## 6. HF Space Configuration ### 6.1 README.md (Space card) ```yaml --- title: Split-Brain Co-Pilot emoji: ⚡ colorFrom: blue colorTo: indigo sdk: gradio sdk_version: 5.30.0 app_file: app.py pinned: true license: apache-2.0 tags: - code-generation - webgpu - speculative-decoding - llama.cpp - local-first --- ``` ### 6.2 requirements.txt ``` gradio==5.30.0 httpx==0.27.0 modal==0.73.0 huggingface-hub==0.23.0 ``` ### 6.3 HF Space secrets Set these in the Space Settings → Repository secrets: | Secret name | Value | |---|---| | `MODAL_VERIFIER_URL` | Your Modal verifier endpoint URL | | `MODAL_SANDBOX_URL` | Your Modal sandbox endpoint URL | | `MODAL_TOKEN_ID` | From `modal token show` | | `MODAL_TOKEN_SECRET` | From `modal token show` | --- ## 7. Cold Start Mitigation Modal A10G containers take 10–40 seconds to cold start. Handle this gracefully: In `verify_with_modal`, add a keep-warm ping. Add this to `verifier.py`: ```python @app.function(schedule=modal.Cron("*/5 * * * *")) def keep_warm(): """Ping the verifier every 5 minutes to avoid cold starts during the demo window.""" Verifier().verify.remote("test", "print('hello')", "python") ``` Deploy this separately: `modal deploy modal_backend/verifier.py` In the UI, show "Verifier warming up..." in the status bar while the first request is in flight and display a spinner. Do not let the UI appear broken during cold start. --- ## 8. Demo Video Script The demo video is a submission requirement. Plan it around these beats: - Open Chrome, show the app. Explain the split-brain concept in one sentence. - Click "Load 1.5B Model" — show the WebGPU loading progress bar. - Type a non-trivial prompt: "Write a Python function that finds all prime numbers up to n using a segmented sieve, handling edge cases." - Hit Generate — show tokens streaming at 80–120 tok/s with the token counter live. - Show the "Verifying..." status kick in immediately after local generation completes. - If the verifier returns FIX or REWRITE: show the red flash rollback animation and the corrected code typing in. - Show the sandbox execution output (stdout) confirming the corrected code runs. - End on the split status bar: "Local: WebGPU · Cloud: Modal A10G · Verdict: ✅ Verified" Keep the video under 3 minutes. Record with OBS or Loom. No cuts during the generation — the live stream is the point. --- ## 9. Bonus Badge Checklist | Badge | How you earn it | Status | |---|---|---| | **Off the Grid** | 1.5B runs 100% in browser, no cloud API for inference | ✅ Automatic | | **Llama Champion** | 14B served via llama.cpp on Modal | ✅ Automatic | | **Off-Brand** | Custom dark theme, rollback animation, token counter, status bar | ✅ Build it | | **Field Notes** | Write a blog post on HF or Dev.to explaining the speculative split-brain architecture | ✅ Write it post-build | --- ## 10. Submission Checklist Before June 15 deadline: - [ ] Modal verifier deployed and endpoint URL confirmed working - [ ] HF Space live and publicly accessible under `build-small-hackathon` org - [ ] WebGPU model loads in Chrome without errors - [ ] Token streaming visible in UI - [ ] Rollback animation triggers on at least one FIX/REWRITE verdict - [ ] Sandbox execution output shown in demo - [ ] Demo video recorded and uploaded (YouTube unlisted or HF) - [ ] Social media post published (Twitter/X or LinkedIn) with Space link and demo video - [ ] README.md Space card complete with description, tags, and video embed - [ ] Field Notes blog post published and linked in README --- ## 11. Known Gotchas **WebGPU VRAM:** The 1.5B Q4 ONNX model needs ~1GB VRAM. On machines with integrated graphics sharing system RAM, this works but may be slow. Document the Chrome + dedicated GPU requirement. **CORS:** Modal's ASGI endpoints allow cross-origin by default, but if you hit CORS errors in the browser JS, add `fastapi.middleware.cors.CORSMiddleware` to the web_app with `allow_origins=["*"]`. **transformers.js version:** Pin to `3.5.x`. Breaking changes in 3.x are frequent. The CDN import in `engine.js` uses the pinned version — don't use `@latest`. **Gradio JS bridge:** The `gr.Button(visible=False)` trigger pattern is the cleanest way to fire a Python function from browser JS in Gradio 5.x without websocket hacks. Do not use `gr.Request` for this — it won't work from inside a custom HTML block. **Modal Volume first deploy:** The volume download must complete before the verifier function can load the model. Run `download_model` manually once and confirm with `modal volume ls qwen-14b-volume /models` before deploying the endpoint. **HF Space cold start:** HF Spaces themselves also cold start. If the Space hasn't been visited recently, Gradio takes 20–30 seconds to boot. Add a loading spinner at the Gradio level using `gr.HTML` with a brief "Space initializing..." message that auto-hides once the page is interactive.