| --- |
| title: OpenCortex |
| emoji: 🧠 |
| colorFrom: red |
| colorTo: blue |
| sdk: gradio |
| sdk_version: 6.18.0 |
| app_file: app.py |
| pinned: false |
| license: mit |
| tags: |
| - track:backyard-ai |
| - sponsor:openbmb |
| - sponsor:nvidia |
| - sponsor:modal |
| - achievement:offgrid |
| - achievement:offbrand |
| - achievement:llama |
| - backyard-ai |
| - off-brand |
| - tiny-titan |
| - best-demo |
| - codex |
| - llama-cpp |
| - local-ai |
| - ai-infra |
| - observability |
| short_description: Feel a local LLM think, work, and forget. |
| --- |
| |
|
|
| # OpenCortex |
|
|
| **Making LLM inference visible.** |
|
|
| OpenCortex is a real-time observatory for local LLM inference. It pairs a chat |
| interface with a living view of the runtime behind the answer: working memory, |
| context pressure, token flow, and engine health. |
|
|
| Most chat interfaces show only two things: |
|
|
| ```text |
| user input -> assistant output |
| ``` |
|
|
| OpenCortex shows the machine in the middle. |
|
|
| ```text |
| prompt prefill -> KV/cache pressure -> decode rhythm -> context boundary -> answer behavior |
| ``` |
|
|
| It is built for learners, local AI users, and AI infrastructure engineers who |
| want to understand why an LLM slows down, loops, forgets, or runs out of |
| context. |
|
|
|  |
|
|
| ## Why this exists |
|
|
| Small local models make AI personal again. But local inference is still a black |
| box for most users. When a response becomes slow or repetitive, the user rarely |
| knows whether the model is thinking, overloaded, stuck in a loop, or simply |
| running out of context. |
|
|
| OpenCortex turns runtime signals into a product experience: |
|
|
| | Runtime signal | Human-readable concept | Visual feedback | |
| | --- | --- | --- | |
| | KV/cache usage proxy | Working Memory | block pressure and fragmentation | |
| | Context tokens | Context Window | filling chamber and active boundary | |
| | Decode throughput | Token Stream | flowing, broken, or stalled token river | |
| | TTFT and queue evidence | Engine State | pulse, charging, recovery, hazard state | |
| | Repeated generated text | Thought Loop | red loop warning and irregular core pulse | |
|
|
| The goal is not to build another metrics dashboard. The goal is to let people |
| feel the hidden mechanics of inference while they chat. |
|
|
| ## Demo links |
|
|
| - **Hugging Face Space:** https://huggingface.co/spaces/build-small-hackathon/open-cortex |
| - **Demo video:** https://youtu.be/edxZdttCf-s |
| - **Social post:** https://x.com/ZhaoJ90682/status/2066576258042155518 |
|
|
| The Build Small Hackathon requires a deployed Gradio Space, a demo video, a |
| social post, and README tags for tracks and badges. This README is structured |
| for that submission flow. |
|
|
| The hosted Space runs in **simulated runtime mode** so judges can open the demo |
| without a private llama.cpp server. The local path below connects to live |
| llama.cpp metrics. |
|
|
| ## What you can try |
|
|
| OpenCortex has one live mode and four built-in runtime experiments. |
|
|
| ### 1. Live local chat |
|
|
| Ask the local model a question and watch the observatory react while the answer |
| streams. The UI listens to llama.cpp OpenAI-compatible streaming events, |
| timings, `/metrics`, and `/slots`. |
|
|
| ### 2. Long context stress |
|
|
| Shows how prompt growth increases prefill work before generation begins. |
|
|
| ### 3. Memory pressure |
|
|
| Shows working memory blocks fragmenting and reallocating as the runtime becomes |
| strained. |
|
|
| ### 4. Slow decode |
|
|
| Shows token flow breaking into a stop-burst-stop rhythm when generation slows. |
|
|
| ### 5. Context collapse |
|
|
| Shows the moment earlier turns leave the active context window. The chat history |
| stays visible, but OpenCortex marks the active context boundary so the user can |
| see what the model can no longer reliably use. |
|
|
|  |
|
|
| ### 6. Thought loop detection |
|
|
| If generation begins repeating the same pattern, OpenCortex marks it as a |
| thought loop. The Token Stream and Cortex Core switch into a red hazard state. |
|
|
|  |
|
|
| ## How it works |
|
|
| OpenCortex is intentionally lightweight: |
|
|
| ```text |
| Browser UI |
| | |
| | POST /api/chat |
| v |
| FastAPI / Space app |
| | |
| | OpenAI-compatible streaming request |
| v |
| llama.cpp llama-server |
| | |
| | /v1/chat/completions |
| | /metrics |
| | /slots |
| v |
| Runtime events -> semantic state -> visual organs |
| ``` |
|
|
| The backend converts low-level runtime evidence into normalized events: |
|
|
| - `request_started` |
| - `first_token` |
| - `token` |
| - `context_collapse` |
| - `request_completed` |
| - `error` |
|
|
| The frontend consumes those events and updates the chat and observatory in the |
| same stream, so visual state and language output stay aligned. |
|
|
| ## Runtime evidence |
|
|
| The MVP uses real llama.cpp evidence where available: |
|
|
| | Evidence | Source | Used for | |
| | --- | --- | --- | |
| | Time to first token | measured in Python client | Engine State | |
| | Prompt tokens | llama.cpp usage/timings | Context Window | |
| | Completion tokens | llama.cpp usage/timings | Token Stream | |
| | Prompt throughput | llama.cpp timings | Prefill evidence | |
| | Decode throughput | llama.cpp timings and metrics | Token Stream | |
| | Active slot context | llama.cpp `/slots` | Context and memory proxy | |
| | Processing/deferred requests | llama.cpp `/metrics` | Engine State | |
| | Repetition pattern | generated text heuristic | Thought loop hazard | |
|
|
| Working Memory is labeled as a **context-derived proxy** in this MVP. llama.cpp |
| does not expose a simple per-request KV cache percentage through the HTTP API we |
| use, so OpenCortex derives a truthful pressure estimate from active slot context |
| tokens and context size. |
|
|
| ## Model |
|
|
| The current local demo uses: |
|
|
| ```text |
| Qwen/Qwen2.5-1.5B-Instruct-GGUF |
| quantization: Q4_K_M |
| context: 2048 tokens in the local demo |
| runtime: llama.cpp llama-server |
| ``` |
|
|
| This keeps the submission inside the Build Small model limit and qualifies the |
| project for tiny-model storytelling. The interface is backend-neutral enough to |
| evolve toward Ollama, vLLM, or Modal-hosted llama.cpp later. |
|
|
| On Hugging Face Spaces, OpenCortex defaults to `OPEN_CORTEX_BACKEND=simulated` |
| when `SPACE_ID` is present. Set `OPEN_CORTEX_BACKEND=llama_cpp` only when the |
| Space can reach a running llama.cpp backend. |
|
|
| ## Run locally |
|
|
| ### 1. Install project dependencies |
|
|
| ```bash |
| uv sync |
| ``` |
|
|
| ### 2. Start llama.cpp |
|
|
| From your llama.cpp checkout: |
|
|
| ```bash |
| llama-server \ |
| -hf Qwen/Qwen2.5-1.5B-Instruct-GGUF:Q4_K_M \ |
| -c 2048 -t 8 -tb 8 \ |
| --metrics \ |
| --host 127.0.0.1 --port 8080 |
| ``` |
|
|
| Check that the server is alive: |
|
|
| ```bash |
| curl --noproxy '*' http://127.0.0.1:8080/health |
| ``` |
|
|
| Expected: |
|
|
| ```json |
| {"status":"ok"} |
| ``` |
|
|
| ### 3. Start OpenCortex |
|
|
| ```bash |
| uv run open-cortex |
| ``` |
|
|
| Then open: |
|
|
| ```text |
| http://127.0.0.1:7860 |
| ``` |
|
|
| ## Suggested demo script |
|
|
| Use this for the video: |
|
|
| 1. Open the app and send a normal question. |
| 2. Point out prefill, first-token latency, context usage, and token flow. |
| 3. Run **Slow decode** to show token flow breaking. |
| 4. Ask for a long story and show **Thought Loop Detected** when repetition |
| appears. |
| 5. Fill the context window and show **Context Window Full**. |
| 6. Send one more message and show the earliest message leaving the active |
| context boundary. |
| 7. Close with the core claim: OpenCortex turns runtime internals into something |
| users can see and reason about. |
|
|
| ## Hackathon fit |
|
|
| OpenCortex targets the **Backyard AI** track: it is a practical tool for |
| learning and debugging local AI behavior on hardware you own. |
|
|
| It also targets these badges: |
|
|
| - **Off Brand**: custom runtime cockpit UI, not default Gradio components. |
| - **Tiny Titan**: built around a 1.5B local model. |
| - **Best Demo**: the experience is designed around a clear visual story. |
| - **Best Use of Codex**: the project was developed with Codex assistance and |
| should be submitted with Codex-attributed commits in the connected repository. |
|
|
| Modal credits were used/planned for development exploration, but the current MVP |
| does not require Modal at runtime. |
|
|
| ## Truthfulness and limitations |
|
|
| OpenCortex is a visualization layer over real runtime evidence, but it does not |
| claim to expose every internal tensor or true biological cognition. |
|
|
| Current limitations: |
|
|
| - KV cache usage is approximated from active context pressure. |
| - Thought loop detection is heuristic pattern detection over generated text. |
| - Context collapse is handled at the application layer by trimming oldest turns. |
| - The current demo is optimized for a single local runtime, not multi-user |
| production serving. |
| - vLLM integration is a planned evolution, not part of v0.1. |
|
|
| These constraints are visible by design. The product shows both semantic states |
| and metric evidence so users can tell which parts are measured and which parts |
| are interpreted. |
|
|
| ## Project structure |
|
|
| ```text |
| app.py Space/local entry point |
| open_cortex/ui/app.py HTTP app, event streaming, HTML shell |
| open_cortex/ui/assets/ Product UI CSS and browser controller |
| open_cortex/runtime/client.py llama.cpp streaming client |
| open_cortex/runtime/metrics.py llama.cpp metrics and slots parser |
| open_cortex/runtime/events.py Runtime event model |
| tests/ Runtime and UI regression tests |
| ``` |
|
|
| ## Development checks |
|
|
| ```bash |
| uv run pytest -q |
| mise exec -- node --check open_cortex/ui/assets/open_cortex.js |
| ``` |
|
|
| ## Submission checklist |
|
|
| - [ ] Deploy Space inside `build-small-hackathon/`. |
| - [ ] Confirm the Space launches as a Gradio-compatible submission. |
| - [ ] Add final Space URL to this README. |
| - [ ] Add demo video URL to this README. |
| - [ ] Add social post URL to this README. |
| - [ ] Run the Build Small README validator. |
| - [ ] Confirm the model is under 32B parameters. |
|
|
| ## License |
|
|
| MIT |
|
|