---
title: OpenCortex
emoji: 🧠
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
pinned: false
license: mit
tags:
- track:backyard-ai
- sponsor:openbmb
- sponsor:nvidia
- sponsor:modal
- achievement:offgrid
- achievement:offbrand
- achievement:llama
- backyard-ai
- off-brand
- tiny-titan
- best-demo
- codex
- llama-cpp
- local-ai
- ai-infra
- observability
short_description: Feel a local LLM think, work, and forget.
---
# OpenCortex
**Making LLM inference visible.**
OpenCortex is a real-time observatory for local LLM inference. It pairs a chat
interface with a living view of the runtime behind the answer: working memory,
context pressure, token flow, and engine health.
Most chat interfaces show only two things:
```text
user input -> assistant output
```
OpenCortex shows the machine in the middle.
```text
prompt prefill -> KV/cache pressure -> decode rhythm -> context boundary -> answer behavior
```
It is built for learners, local AI users, and AI infrastructure engineers who
want to understand why an LLM slows down, loops, forgets, or runs out of
context.

## Why this exists
Small local models make AI personal again. But local inference is still a black
box for most users. When a response becomes slow or repetitive, the user rarely
knows whether the model is thinking, overloaded, stuck in a loop, or simply
running out of context.
OpenCortex turns runtime signals into a product experience:
| Runtime signal | Human-readable concept | Visual feedback |
| --- | --- | --- |
| KV/cache usage proxy | Working Memory | block pressure and fragmentation |
| Context tokens | Context Window | filling chamber and active boundary |
| Decode throughput | Token Stream | flowing, broken, or stalled token river |
| TTFT and queue evidence | Engine State | pulse, charging, recovery, hazard state |
| Repeated generated text | Thought Loop | red loop warning and irregular core pulse |
The goal is not to build another metrics dashboard. The goal is to let people
feel the hidden mechanics of inference while they chat.
## Demo links
- **Hugging Face Space:** https://huggingface.co/spaces/build-small-hackathon/open-cortex
- **Demo video:** https://youtu.be/edxZdttCf-s
- **Social post:** https://x.com/ZhaoJ90682/status/2066576258042155518
The Build Small Hackathon requires a deployed Gradio Space, a demo video, a
social post, and README tags for tracks and badges. This README is structured
for that submission flow.
The hosted Space runs in **simulated runtime mode** so judges can open the demo
without a private llama.cpp server. The local path below connects to live
llama.cpp metrics.
## What you can try
OpenCortex has one live mode, four built-in runtime experiments, and automatic
thought loop detection.
### 1. Live local chat
Ask the local model a question and watch the observatory react while the answer
streams. The UI listens to llama.cpp OpenAI-compatible streaming events,
timings, `/metrics`, and `/slots`.
### 2. Long context stress
Shows how prompt growth increases prefill work before generation begins.
### 3. Memory pressure
Shows working memory blocks fragmenting and reallocating as the runtime becomes
strained.
### 4. Slow decode
Shows token flow breaking into a stop-burst-stop rhythm when generation slows.
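A stop-burst-stop rhythm can be recognized from the gaps between token arrival times. The following is a minimal sketch, not the project's actual logic: the function name `classify_token_flow`, the thresholds `stall_gap` and `jitter_ratio`, and the use of the labels flowing, broken, and stalled as return values are all assumptions made for illustration.

```python
def classify_token_flow(
    arrivals: list[float],
    stall_gap: float = 1.0,      # assumed: seconds of silence that count as a stall
    jitter_ratio: float = 4.0,   # assumed: max gap / median gap that counts as "broken"
) -> str:
    """Classify decode rhythm from token arrival timestamps (in seconds)."""
    if len(arrivals) < 3:
        return "flowing"  # too little evidence to call anything else
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    if gaps[-1] >= stall_gap:
        return "stalled"  # the most recent gap is long: the stream has stopped
    median_gap = sorted(gaps)[len(gaps) // 2]
    if median_gap > 0 and max(gaps) / median_gap >= jitter_ratio:
        return "broken"   # short bursts separated by long pauses
    return "flowing"


steady = classify_token_flow([0.00, 0.05, 0.10, 0.15, 0.20])
bursty = classify_token_flow([0.00, 0.02, 0.04, 0.50, 0.52, 0.54])
stalled = classify_token_flow([0.00, 0.05, 0.10, 2.00])
```

Comparing the largest gap against the median keeps the check independent of absolute model speed, so a slow but even stream still counts as flowing.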
### 5. Context collapse
Shows the moment earlier turns leave the active context window. The chat history
stays visible, but OpenCortex marks the active context boundary so the user can
see what the model can no longer reliably use.
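One way to compute such a boundary is to walk the history from the newest turn backwards and stop when the token budget is exhausted. This is a hedged sketch under stated assumptions: the `Turn` type, the `active_boundary` function, per-turn token counts, and the 256-token reply reserve are hypothetical and not taken from the OpenCortex code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    tokens: int  # assumed: token count is already known for each turn


def active_boundary(turns: list[Turn], budget: int) -> int:
    """Return the index of the oldest turn that still fits in the budget.

    Turns before the returned index stay visible in the chat but fall
    outside the active context. A result equal to len(turns) means that
    not even the newest turn fits.
    """
    total = 0
    boundary = len(turns)
    for index in range(len(turns) - 1, -1, -1):  # newest to oldest
        if total + turns[index].tokens > budget:
            break
        total += turns[index].tokens
        boundary = index
    return boundary


history = [
    Turn("user", "first question", 900),
    Turn("assistant", "first answer", 700),
    Turn("user", "second question", 300),
    Turn("assistant", "second answer", 600),
]
# Assumed: 2048-token context with 256 tokens reserved for the next reply.
boundary = active_boundary(history, budget=2048 - 256)
```

Here the three newest turns total 1600 tokens and fit, while adding the first turn would exceed the budget, so `boundary` is 1 and only turn 0 is marked as outside the active context.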

### 6. Thought loop detection
If generation begins repeating the same pattern, OpenCortex marks it as a
thought loop. The Token Stream and Cortex Core switch into a red hazard state.
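A repetition heuristic of this kind can be approximated by checking how often the most recent run of words has already appeared in recent output. The sketch below is an illustration only: `detect_thought_loop`, the 6-word n-gram size, the repeat threshold, and the look-back window are assumed values, not the heuristic OpenCortex ships.

```python
def detect_thought_loop(text: str, ngram: int = 6, min_repeats: int = 3) -> bool:
    """Return True when the last `ngram` words occur at least `min_repeats`
    times within the recent output."""
    words = text.lower().split()
    if len(words) < ngram * min_repeats:
        return False  # not enough text for the pattern to repeat that often
    tail = words[-ngram:]
    window = words[-ngram * 20:]  # assumed look-back: 20 n-grams of recent text
    count = 0
    for start in range(len(window) - ngram + 1):
        if window[start:start + ngram] == tail:
            count += 1
    return count >= min_repeats


looping = detect_thought_loop("the cat sat on the mat and " * 4)
normal = detect_thought_loop(
    "OpenCortex shows prefill, decode rhythm, and context pressure while "
    "a local model streams its answer to the user in real time"
)
```

Word-level matching is cheap enough to run on every streamed token, at the cost of missing loops that vary by a word or two.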

## How it works
OpenCortex is intentionally lightweight:
```text
Browser UI
    |
    | POST /api/chat
    v
FastAPI / Space app
    |
    | OpenAI-compatible streaming request
    v
llama.cpp llama-server
    |
    | /v1/chat/completions
    | /metrics
    | /slots
    v
Runtime events -> semantic state -> visual organs
```
```
The backend converts low-level runtime evidence into normalized events:
- `request_started`
- `first_token`
- `token`
- `context_collapse`
- `request_completed`
- `error`
The frontend consumes those events and updates the chat and observatory in the
same stream, so visual state and language output stay aligned.
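A normalized event stream like this could be modeled as a small typed record that is serialized one event per line. The sketch below is an assumption-heavy illustration: the `RuntimeEvent` class, its `data` payload shape, the timestamp field, and the newline-delimited JSON wire format are hypothetical; only the six event names come from the list above.

```python
import json
import time
from dataclasses import asdict, dataclass, field

# Event names taken from the list above.
EVENT_TYPES = {
    "request_started", "first_token", "token",
    "context_collapse", "request_completed", "error",
}


@dataclass
class RuntimeEvent:
    type: str
    data: dict = field(default_factory=dict)           # hypothetical payload
    ts: float = field(default_factory=time.time)       # wall-clock seconds

    def __post_init__(self) -> None:
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")


def to_ndjson(event: RuntimeEvent) -> str:
    """Serialize one event as a single newline-terminated JSON line."""
    return json.dumps(asdict(event), separators=(",", ":")) + "\n"


line = to_ndjson(RuntimeEvent("token", {"text": "Hello", "context_tokens": 42}))
```

Carrying the text fragment and the runtime numbers in the same record is what lets a frontend update the chat and the observatory from a single stream.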
## Runtime evidence
The MVP uses real llama.cpp evidence where available:
| Evidence | Source | Used for |
| --- | --- | --- |
| Time to first token | measured in Python client | Engine State |
| Prompt tokens | llama.cpp usage/timings | Context Window |
| Completion tokens | llama.cpp usage/timings | Token Stream |
| Prompt throughput | llama.cpp timings | Prefill evidence |
| Decode throughput | llama.cpp timings and metrics | Token Stream |
| Active slot context | llama.cpp `/slots` | Context and memory proxy |
| Processing/deferred requests | llama.cpp `/metrics` | Engine State |
| Repetition pattern | generated text heuristic | Thought loop hazard |
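The `/metrics` endpoint returns Prometheus text exposition format, which can be read with a few lines of standard-library Python. This is a minimal sketch rather than the project's parser: the function names, the `idle`/`busy`/`queued` labels, and the metric names in the sample are assumptions, so check them against the output of your own llama.cpp build.

```python
def parse_prometheus_text(body: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples: dict[str, float] = {}
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            continue  # labelled samples are out of scope for this sketch
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # ignore values that are not numeric
    return samples


def engine_state(samples: dict[str, float]) -> str:
    """Map queue evidence to a coarse, hypothetical engine state label."""
    if samples.get("llamacpp:requests_deferred", 0) > 0:
        return "queued"
    if samples.get("llamacpp:requests_processing", 0) > 0:
        return "busy"
    return "idle"


# Assumed sample; metric names may differ between llama.cpp versions.
SAMPLE = """\
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 1
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
llamacpp:predicted_tokens_seconds 31.5
"""
samples = parse_prometheus_text(SAMPLE)
state = engine_state(samples)
```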
Working Memory is labeled as a **context-derived proxy** in this MVP. llama.cpp
does not expose a simple per-request KV cache percentage through the HTTP API we
use, so OpenCortex derives a truthful pressure estimate from active slot context
tokens and context size.
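One plausible form of this proxy is the ratio of active slot tokens to context size, clamped to the range 0 to 1. The sketch below assumes that form; the function names and the `calm`/`strained`/`critical` thresholds are hypothetical and chosen only to illustrate the idea.

```python
def working_memory_pressure(active_tokens: int, context_size: int) -> float:
    """Context-derived proxy: fraction of the context window in use, in [0, 1]."""
    if context_size <= 0:
        raise ValueError("context_size must be positive")
    return max(0.0, min(1.0, active_tokens / context_size))


def pressure_level(pressure: float) -> str:
    """Map the proxy to a coarse label (thresholds are assumed)."""
    if pressure >= 0.9:
        return "critical"
    if pressure >= 0.7:
        return "strained"
    return "calm"


# 1536 active tokens in the 2048-token local demo context.
pressure = working_memory_pressure(1536, 2048)
level = pressure_level(pressure)
```

With 1536 of 2048 tokens in use the proxy is 0.75, which this sketch labels as strained.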
## Model
The current local demo uses:
```text
Qwen/Qwen2.5-1.5B-Instruct-GGUF
quantization: Q4_K_M
context: 2048 tokens in the local demo
runtime: llama.cpp llama-server
```
This keeps the submission inside the Build Small model limit (under 32B
parameters) and supports the Tiny Titan badge. The interface is backend-neutral
enough to evolve toward Ollama, vLLM, or Modal-hosted llama.cpp later.
On Hugging Face Spaces, OpenCortex defaults to `OPEN_CORTEX_BACKEND=simulated`
when `SPACE_ID` is present. Set `OPEN_CORTEX_BACKEND=llama_cpp` only when the
Space can reach a running llama.cpp backend.
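The selection rule above can be expressed as a small function over the environment. This is a sketch of the described behavior, not the project's code: `resolve_backend` is a hypothetical name, and falling back to `llama_cpp` when neither variable is set is an assumption about the local default.

```python
import os
from collections.abc import Mapping


def resolve_backend(env: Mapping[str, str] = os.environ) -> str:
    """Pick the runtime backend from environment variables."""
    explicit = env.get("OPEN_CORTEX_BACKEND", "").strip().lower()
    if explicit:
        return explicit  # an explicit setting always wins
    if env.get("SPACE_ID"):
        return "simulated"  # running on Hugging Face Spaces
    return "llama_cpp"      # assumed default for local runs


on_space = resolve_backend({"SPACE_ID": "build-small-hackathon/open-cortex"})
overridden = resolve_backend(
    {"SPACE_ID": "build-small-hackathon/open-cortex", "OPEN_CORTEX_BACKEND": "llama_cpp"}
)
local = resolve_backend({})
```

Passing the environment as a mapping keeps the rule easy to test without touching real process variables.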
## Run locally
### 1. Install project dependencies
```bash
uv sync
```
### 2. Start llama.cpp
From your llama.cpp checkout:
```bash
llama-server \
  -hf Qwen/Qwen2.5-1.5B-Instruct-GGUF:Q4_K_M \
  -c 2048 -t 8 -tb 8 \
  --metrics \
  --host 127.0.0.1 --port 8080
```
Check that the server is alive:
```bash
curl --noproxy '*' http://127.0.0.1:8080/health
```
Expected:
```json
{"status":"ok"}
```
### 3. Start OpenCortex
```bash
uv run open-cortex
```
Then open:
```text
http://127.0.0.1:7860
```
## Suggested demo script
Use this for the video:
1. Open the app and send a normal question.
2. Point out prefill, first-token latency, context usage, and token flow.
3. Run **Slow decode** to show token flow breaking.
4. Ask for a long story and show **Thought Loop Detected** when repetition
appears.
5. Fill the context window and show **Context Window Full**.
6. Send one more message and show the earliest message leaving the active
context boundary.
7. Close with the core claim: OpenCortex turns runtime internals into something
users can see and reason about.
## Hackathon fit
OpenCortex targets the **Backyard AI** track: it is a practical tool for
learning and debugging local AI behavior on hardware you own.
It also targets these badges:
- **Off Brand**: custom runtime cockpit UI, not default Gradio components.
- **Tiny Titan**: built around a 1.5B local model.
- **Best Demo**: the experience is designed around a clear visual story.
- **Best Use of Codex**: the project was developed with Codex assistance and
should be submitted with Codex-attributed commits in the connected repository.
Modal credits were used/planned for development exploration, but the current MVP
does not require Modal at runtime.
## Truthfulness and limitations
OpenCortex is a visualization layer over real runtime evidence. It does not
claim to expose every internal tensor, and its cognitive metaphors are not a
claim about biological cognition.
Current limitations:
- KV cache usage is approximated from active context pressure.
- Thought loop detection is heuristic pattern detection over generated text.
- Context collapse is handled at the application layer by trimming oldest turns.
- The current demo is optimized for a single local runtime, not multi-user
production serving.
- vLLM integration is a planned evolution, not part of v0.1.
These constraints are visible by design. The product shows both semantic states
and metric evidence so users can tell which parts are measured and which parts
are interpreted.
## Project structure
```text
app.py                           Space/local entry point
open_cortex/ui/app.py            HTTP app, event streaming, HTML shell
open_cortex/ui/assets/           Product UI CSS and browser controller
open_cortex/runtime/client.py    llama.cpp streaming client
open_cortex/runtime/metrics.py   llama.cpp metrics and slots parser
open_cortex/runtime/events.py    Runtime event model
tests/                           Runtime and UI regression tests
```
## Development checks
```bash
uv run pytest -q
mise exec -- node --check open_cortex/ui/assets/open_cortex.js
```
## Submission checklist
- [ ] Deploy Space inside `build-small-hackathon/`.
- [ ] Confirm the Space launches as a Gradio-compatible submission.
- [ ] Add final Space URL to this README.
- [ ] Add demo video URL to this README.
- [ ] Add social post URL to this README.
- [ ] Run the Build Small README validator.
- [ ] Confirm the model is under 32B parameters.
## License
MIT