Split-Brain Co-Pilot
Generate and verify code from a natural‑language prompt
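The app has two brains: a 1.5B model that drafts code locally in the browser, and a 14B verifier that reviews each draft in the cloud.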
The total model budget is 15.5B parameters, comfortably under the 32B hackathon cap. The design goal was not to imitate a giant coding assistant. It was to ask a smaller question: what if a tiny local model could move fast, while a second small model watched its back?
Most code assistants feel like a single mind. You prompt them, they think somewhere else, and then they return an answer.
I wanted this one to feel split:
That visible disagreement is the point. The app is not only generating code; it is showing a small local model being reviewed by a larger, still-small verifier.
Small models are often good enough at the first draft, especially for compact coding tasks. They are also much more fun when they run close to the user. Loading a 1.5B model in Chrome gives the app a local, physical feeling: the tokens are not arriving from a remote API; they are being produced on the machine in front of you.
But first drafts are fragile. A small model can produce code that looks plausible but has missing edge cases, markdown wrappers, or functions that are defined but never called. That made the verifier useful as a second pass rather than a replacement for the local model.
The split became:
The app is a Gradio Space with a custom HTML/CSS/JS interface.
flowchart LR
Prompt["Prompt"] --> Local["1.5B browser model<br/>WebGPU + transformers.js"]
Local --> Draft["Streaming draft"]
Draft --> Verifier["14B verifier<br/>Modal A10G + llama.cpp"]
Verifier -->|PASS| Final["Verified code"]
Verifier -->|FIX / REWRITE| Rollback["Rollback animation<br/>corrected block"]
Rollback --> Final
Final --> Sandbox["Modal Python sandbox"]
The browser model is onnx-community/Qwen2.5-Coder-1.5B-Instruct. It runs through transformers.js with WebGPU. The verifier uses Qwen2.5-Coder-14B-Instruct in GGUF format from Bartowski's Hugging Face repo, loaded into a Modal Volume and served with llama.cpp on an A10G.
The verifier returns a small JSON contract:
{ "verdict": "PASS" }
or:
{
  "verdict": "FIX",
  "corrected_code": "...",
  "reason": "..."
}
That JSON contract matters. It lets the UI treat verification as an event, not a chat message. A PASS marks the draft as clean. A FIX or REWRITE triggers the rollback animation and replaces the local block.
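A minimal sketch of how a client might parse that contract before turning it into a UI event. The `Verdict` class, `parse_verdict`, and `needs_rollback` are hypothetical names, and the sketch assumes a REWRITE reply carries the same fields as the FIX example.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_VERDICTS = {"PASS", "FIX", "REWRITE"}


@dataclass
class Verdict:
    verdict: str
    corrected_code: Optional[str] = None
    reason: Optional[str] = None

    @property
    def needs_rollback(self) -> bool:
        # FIX and REWRITE both replace the local draft; PASS leaves it alone.
        return self.verdict in {"FIX", "REWRITE"}


def parse_verdict(raw: str) -> Verdict:
    """Validate the verifier's JSON reply and return a typed result."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("verifier reply must be a JSON object")
    verdict = data.get("verdict")
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    corrected = data.get("corrected_code")
    # Assumption: any non-PASS verdict must ship replacement code.
    if verdict != "PASS" and not corrected:
        raise ValueError(f"{verdict} verdict is missing corrected_code")
    return Verdict(verdict, corrected, data.get("reason"))
```

Rejecting malformed replies early keeps the UI logic to a single branch on `needs_rollback`, rather than scattering checks for missing keys.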
The most interesting bugs were not the dramatic ones. They were the small integration edges.
The browser model sometimes produced:
```python
def example():
    ...
```
That looked fine in the UI, but the Python sandbox received the backticks and failed with a syntax error. The fix was to clean generated code before sending it to the verifier or sandbox. That sounds obvious after the fact, but it is exactly the kind of boundary bug that appears when generated text becomes executable input.
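One way that cleanup step could look, as a sketch rather than the app's actual code. `strip_markdown_fences` is a hypothetical helper; it assumes the draft contains at most one fenced block of interest and keeps only the first.

```python
import re

# Matches an opening fence with an optional language tag, then captures
# everything up to the next closing fence.
_FENCE_RE = re.compile(r"```[A-Za-z0-9_+-]*[ \t]*\n(.*?)```", re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the code inside the first fenced block, or the text itself."""
    match = _FENCE_RE.search(text)
    body = match.group(1) if match else text
    # Trim blank lines at the edges only, so indentation is preserved.
    return body.strip("\n") + "\n"
```

Running the same function on both the verifier input and the sandbox input means the two stages always see identical code.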
One test prompt asked for a function that loops from 1 to 10. The model generated:
def loop_to_ten():
    for i in range(1, 11):
        print(i)
The sandbox returned returncode: 0, with empty stdout and stderr. That was not a failure. Python had successfully defined the function and exited. Nothing called it.
The app now adds a conservative harness for Python sandbox runs: if the code contains only definitions and the first function has no required arguments, the sandbox calls it under:
if __name__ == "__main__":
    loop_to_ten()
It does not guess arguments for functions that require inputs. That would make the demo less honest.
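The harness rule can be sketched with the standard `ast` module. This is an illustration under assumptions, not the app's implementation: `add_main_harness` is a hypothetical name, "only definitions" is taken to mean top-level functions, classes, and imports, and async functions are never auto-called.

```python
import ast


def _has_required_args(fn: ast.FunctionDef) -> bool:
    """True if calling fn() with no arguments would fail."""
    positional = fn.args.posonlyargs + fn.args.args
    required_positional = len(positional) - len(fn.args.defaults)
    required_kwonly = sum(1 for d in fn.args.kw_defaults if d is None)
    return required_positional > 0 or required_kwonly > 0


def add_main_harness(source: str) -> str:
    """Append a __main__ call when the code only defines things."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return source  # let the sandbox report the real error

    allowed = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
               ast.Import, ast.ImportFrom)
    if not all(isinstance(node, allowed) for node in tree.body):
        return source  # the code already does something at top level

    functions = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    if not functions or _has_required_args(functions[0]):
        return source  # never guess arguments

    call = f'if __name__ == "__main__":\n    {functions[0].name}()\n'
    return source.rstrip("\n") + "\n\n\n" + call
```

Any code the rule does not clearly cover is returned unchanged, which keeps the harness from masking what the model actually wrote.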
The verifier initially appeared to hang at "Verifier warming up..." for longer than expected. The cause was the ASGI endpoint calling a Modal method synchronously from an async route, which blocks the event loop while the call is in flight. Switching to the async remote call path made the endpoint behave cleanly and stopped the UI from feeling stuck during verifier startup.
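The effect is easy to reproduce with plain `asyncio`, without any Modal code. In this sketch `verify_blocking` and `verify_async` are hypothetical stand-ins for the remote call, and the "heartbeat" task stands in for any other work the server should be doing while the verifier warms up.

```python
import asyncio
import time


def verify_blocking(draft: str) -> str:
    time.sleep(0.05)  # stand-in for a synchronous remote call
    return "PASS"


async def verify_async(draft: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for an awaitable remote call
    return "PASS"


async def blocking_route(draft: str, log: list) -> str:
    verdict = verify_blocking(draft)  # the event loop is frozen here
    log.append("verdict")
    return verdict


async def async_route(draft: str, log: list) -> str:
    verdict = await verify_async(draft)  # other tasks can run meanwhile
    log.append("verdict")
    return verdict


async def run_with_heartbeat(route) -> list:
    log = []

    async def heartbeat() -> None:
        log.append("heartbeat")

    task = asyncio.create_task(heartbeat())
    await route("def f(): pass", log)
    await task
    return log


def demo():
    blocking_log = asyncio.run(run_with_heartbeat(blocking_route))
    async_log = asyncio.run(run_with_heartbeat(async_route))
    return blocking_log, async_log
```

In the blocking route the heartbeat only runs after the verdict arrives; in the async route it runs while the call is still pending, which is the behavior a status message in the UI depends on.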
The browser model did not load with the first quantization choice on every machine. The loader now tries a small sequence of dtypes and surfaces the actual active dtype in the UI. For the working demo path, q4f16 loaded successfully.
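The fallback logic itself is small. The real loader runs in the browser, so the following is a Python sketch of the same control flow only: `load_with_fallback` and the `loader` callable are hypothetical, and the order of candidate dtypes is an assumption (only q4f16 is confirmed as the working path).

```python
from typing import Any, Callable, Sequence, Tuple


def load_with_fallback(
    loader: Callable[[str], Any],
    dtypes: Sequence[str],
) -> Tuple[Any, str]:
    """Try each dtype in order; return the model and the dtype that worked."""
    errors = []
    for dtype in dtypes:
        try:
            return loader(dtype), dtype
        except Exception as exc:  # a failed load should not end the attempt
            errors.append(f"{dtype}: {exc}")
    raise RuntimeError("no dtype loaded: " + "; ".join(errors))
```

Returning the active dtype alongside the model is what lets the UI report which quantization is actually running, instead of the one that was requested first.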
Small models become much more interesting when the interface admits their uncertainty.
Instead of hiding the verifier behind a final answer, the app makes the verification step visible. The UI says: here is the local draft, here is the cloud check, here is the verdict, and here is the corrected code if the two disagree.
That changed how the project felt. The verifier was no longer just a backend. It became part of the performance.
I also learned that "runs locally" is not a single binary property. This app is local-first, not fully offline:
That tradeoff felt honest for this build. The local model gives speed and presence. The cloud verifier gives a stronger second opinion. The app is not trying to be pure; it is trying to show what each small model is good at.
The next version would add a few things:
PASS, FIX, and REWRITE.
onnx-community/Qwen2.5-Coder-1.5B-Instruct and bartowski/Qwen2.5-Coder-14B-Instruct-GGUF
Built by Blessing Mwiti in collaboration with OpenAI Codex.