# Architecture: workflows and the LLMs behind them

An AI-solution-architect view of the agentic system: every workflow through the
platform, and exactly which model (if any) each one calls. The architectural
signature: the extraction core is **one grammar-constrained LLM call**; the
**MiniCPM planner** adds a visible multi-step loop over the platform's own
public MCP tool contract; everything verifiable (conflict math, dedup, time
proposals, eval gates) stays deterministic; and there are **zero cloud-AI API
calls anywhere**, training included.

## System workflow

```mermaid
flowchart TB
    subgraph ENTRY["1 · Entry points: four front-ends, one contract"]
        direction LR
        UIIN["🖥️ Gradio UI<br/>Schedule flow + Agent tab<br/>(paste thread, screenshots, .ics)"]
        SHORT["📱 iOS Shortcut /<br/>Android Tasker"]
        MAC["🍎 Mac collector<br/>polls iMessage chat.db<br/>(collector/collector.py)"]
        MCPC["🤖 MCP clients<br/>Claude Desktop, Cursor"]
    end

    subgraph API["2 · API & orchestration: app.py (FastAPI + Gradio, one port)"]
        AGENTEP["POST /agent<br/>bearer-token, stateless"]
        INGEST["POST /ingest → feed store<br/>AUTONOMOUS=1 triggers on<br/>your outgoing message (is_from_me)"]
        ROLL["threads.rolling_thread<br/>per-chat window (20 msgs / 12 h)"]
        MCPT["MCP tools: server/mcp_tools.py<br/>extract_events · make_ics · check_conflicts"]
    end

    subgraph ORCH["2a · Agentic orchestration: server/orchestrator.py"]
        SMOL["smolagents ToolCallingAgent<br/>planned by MiniCPM, ≤6 steps<br/>playbook: extract → check → render<br/>final ActionPlan re-derived deterministically"]
        SCRIPT["ScriptedPlanner (no LLM)<br/>identical tool sequence + step events<br/>(stub mode, CI, planner failure)"]
    end

    subgraph CORE["3 · Agent core: server/pipeline.py → server/agent.py"]
        PROMPT["Prompt assembly:<br/>SYSTEM + memory recall block<br/>+ existing calendar + thread + images"]
        GEN["Grammar-constrained generation<br/>→ ActionPlan JSON (always parses)"]
        PROMPT --> GEN
    end

    subgraph LLMT["4 · LLM tier: ALL inference is local llama.cpp, zero cloud AI APIs"]
        GEMMA["gemma-cal E4B: fine-tuned Gemma 4<br/>ParetoOptimal/gemma-4-cal-gguf<br/>gemma-cal-e4b-Q4_K_M.gguf (~5 GB)<br/>+ mmproj-F16.gguf vision projector"]
        MODES["served either:<br/>· in-process llama-cpp-python (ZeroGPU lease)<br/>· remote llama-server via INFERENCE_BASE_URL<br/>(Space sidecar / Mac launchd / phone)"]
        MINICPM["🧠 MiniCPM planner (OpenBMB, sponsor)<br/>openbmb/MiniCPM4.1-8B-GGUF Q4 (~5 GB)<br/>≤4B option: openbmb/MiniCPM5-1B-GGUF (config switch)<br/>2nd llama-server :8081, enabled via<br/>PLANNER_HF_REPO / PLANNER_FILE"]
        HERMES["(optional) Hermes-3-Llama-3.1-8B Q4_K_M<br/>HERMES_TOOLS=1 → tool-calling loop:<br/>calls remember() to write memory mid-run"]
        STUB["(no LLM) regex stub extractor<br/>USE_STUB_EXTRACTOR=1 → CI & free tier"]
        GEMMA --- MODES
    end

    subgraph DET["5 · Deterministic post-processing (no LLM)"]
        CONF["freebusy.annotate_conflicts<br/>overlap / adjacent / tight<br/>+ propose_times free slots"]
        DEDUP["dedup.filter_new<br/>idempotency for autonomous runs"]
        MEMW["memory.observe_plan<br/>learns recurring contacts"]
    end

    subgraph OUT["6 · Outputs"]
        CARDS["Event cards + reply draft<br/>+ clarification question"]
        ICS["📥 .ics download<br/>(off-grid default)"]
        GCAL["📅 Google Calendar push<br/>(per-user OAuth web flow, opt-in)"]
        TRACE["Redacted trace export<br/>→ public HF dataset"]
    end

    UIIN -->|"run_orchestrator (step trace streams into the UI)"| SMOL
    SHORT --> AGENTEP
    MAC -->|"store-only"| INGEST
    MAC -->|"AGENT_MODE=1"| AGENTEP
    MCPC --> MCPT
    AGENTEP --> CORE
    INGEST --> ROLL --> CORE
    SMOL ==>|"planning loop, ≤6 steps"| MINICPM
    SMOL -->|"tool calls → the Space's OWN MCP<br/>endpoint (localhost SSE)"| MCPT
    SMOL -.->|"planner down / stub mode"| SCRIPT
    SCRIPT -->|"same tool sequence,<br/>deterministic"| MCPT
    MCPT -->|"extract_events: 1 LLM call"| CORE
    MCPT -.->|"make_ics / check_conflicts: 0 LLM calls"| DET

    GEN ==>|"default"| GEMMA
    GEN -.->|"opt-in autonomous brain"| HERMES
    GEN -.->|"tests / free demo"| STUB
    HERMES -->|"remember()"| MEMW

    LLMT --> DET --> OUT
```
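
The deterministic `freebusy` step (layer 5) is plain interval arithmetic, which is why it can stay out of the LLM call. The code below is a minimal sketch and not the code in `freebusy.py`. The following are assumptions made for illustration:
- the `Interval` type;
- the function names `classify`, `annotate` and `propose_free_slots`;
- the 15-minute "tight" buffer.

Only the three labels (overlap / adjacent / tight) and the idea of proposing free slots come from the diagram.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed buffer: a positive gap shorter than this counts as "tight".
# The real threshold used by freebusy.annotate_conflicts may differ.
TIGHT_GAP = timedelta(minutes=15)


@dataclass(frozen=True)
class Interval:
    start: datetime
    end: datetime


def classify(new: Interval, busy: Interval) -> Optional[str]:
    """Label how a proposed event relates to one existing busy block (hypothetical helper)."""
    if new.start < busy.end and busy.start < new.end:
        return "overlap"
    # Once overlap is ruled out, the larger of these two differences is the gap.
    gap = max(busy.start - new.end, new.start - busy.end)
    if gap == timedelta(0):
        return "adjacent"
    if gap < TIGHT_GAP:
        return "tight"
    return None


def annotate(new: Interval, calendar: list[Interval]) -> list[tuple[str, Interval]]:
    """Collect (label, busy block) pairs for every block that interacts with `new`."""
    labelled = [(classify(new, busy), busy) for busy in calendar]
    return [(label, busy) for label, busy in labelled if label is not None]


def propose_free_slots(
    day_start: datetime,
    day_end: datetime,
    calendar: list[Interval],
    duration: timedelta,
    limit: int = 3,
) -> list[Interval]:
    """Walk busy blocks in start order and offer the start of each gap that fits `duration`."""
    slots: list[Interval] = []
    cursor = day_start
    for busy in sorted(calendar, key=lambda b: b.start):
        if busy.start - cursor >= duration:
            slots.append(Interval(cursor, cursor + duration))
        cursor = max(cursor, busy.end)
    if day_end - cursor >= duration:
        slots.append(Interval(cursor, cursor + duration))
    return slots[:limit]
```

Because these functions only read the model's output and the existing calendar, their results can be unit-tested exactly, which is the practical payoff of keeping conflict math deterministic.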
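
The `SMOL -.-> SCRIPT` edge amounts to a try-then-degrade wrapper: use the MiniCPM planner when it is configured and healthy, otherwise run the same playbook without an LLM. This is a minimal sketch, and several parts of it are hypothetical:
- the names `run_with_fallback`, `scripted_plan` and `PLAYBOOK`;
- the modelling of each MCP tool as a plain function that reads and updates a shared state dict.

The real `server/orchestrator.py` and its smolagents wiring will look different.

```python
from typing import Callable, Optional

# Hypothetical shapes: a tool maps state -> state updates; on_step receives UI step events.
ToolFn = Callable[[dict], dict]
StepFn = Callable[[str], None]
PlannerFn = Callable[[str, dict, StepFn], dict]

# Playbook order taken from the diagram: extract -> check -> render.
PLAYBOOK = ("extract_events", "check_conflicts", "make_ics")


def scripted_plan(thread: str, tools: dict, on_step: StepFn) -> dict:
    """ScriptedPlanner stand-in: same tool sequence and step events, no LLM."""
    state: dict = {"thread": thread}
    for name in PLAYBOOK:
        on_step(name)
        state.update(tools[name](state))
    return state


def run_with_fallback(
    thread: str,
    tools: dict,
    llm_planner: Optional[PlannerFn] = None,
    on_step: StepFn = print,
) -> dict:
    """Prefer the LLM planner; if it is unconfigured or raises, run the script instead."""
    if llm_planner is not None:
        try:
            return llm_planner(thread, tools, on_step)
        except Exception as exc:  # e.g. planner server unreachable or a malformed tool call
            on_step(f"planner failed: {exc!r}")
    return scripted_plan(thread, tools, on_step)
```

Emitting the same step events on both paths keeps the UI trace shape identical whichever path ran. The sketch restarts from scratch on planner failure rather than resuming a half-finished plan; that is a simplification, not necessarily what the orchestrator does.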

## Offline loop: eval-gated fine-tuning (produces the serving LLM)

```mermaid
flowchart LR
    SEEDS["Seed data (NO LLM)<br/>139 hand-authored template examples<br/>(gen_new_seeds.py / make_dataset.py)"]
    SMC["SMCalFlow import (NO LLM)<br/>deterministic LISP-program parse, ~2000 rows"]
    TRAIN["QLoRA fine-tune: Unsloth on Modal A100-80GB<br/>base: google/gemma-4-31B-it or gemma-4-E4B-it<br/>r=16, lr 5e-5, 2 epochs, responses-only loss"]
    GGUF["convert_hf_to_gguf + llama-quantize<br/>→ staging Q4_K_M GGUF"]
    EVAL["Eval: NO LLM judge, deterministic metrics<br/>60-example held-out set:<br/>schema validity · event F1 · start-exact recall"]
    GATE{"Gate<br/>validity ≥ 0.95<br/>F1 ≥ 0.81<br/>recall ≥ 0.773"}
    PROD["Promote → ParetoOptimal/gemma-4-cal-gguf<br/>(the model the Space serves)"]
    TRASH["Discard staging;<br/>production untouched"]

    SEEDS --> TRAIN
    SMC --> TRAIN
    TRAIN --> GGUF --> EVAL --> GATE
    GATE -->|pass| PROD
    GATE -->|fail| TRASH
```
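
The gate is a conjunction of three threshold checks over deterministic metrics, with no LLM judge involved. This is a sketch under stated assumptions, so treat `score` and `passes_gate` as illustrative rather than as the logic in `training/gated_retrain.py`. The assumptions are:
- each held-out example is a pair of predicted events (or `None` when the output failed to parse) and gold events;
- an event is a `(title, start)` tuple matched exactly;
- F1 is micro-averaged over the whole set.

The thresholds are the ones in the Gate node.

```python
from typing import Optional

Event = tuple[str, str]  # assumed shape: (normalised title, ISO start time)

# Thresholds copied from the Gate node in the diagram.
GATE = {"validity": 0.95, "f1": 0.81, "recall": 0.773}


def score(examples: list[tuple[Optional[list[Event]], list[Event]]]) -> dict[str, float]:
    """Deterministic metrics over a held-out set; a prediction of None means invalid JSON."""
    valid = tp = n_pred = n_gold = start_hits = 0
    for pred, gold in examples:
        g = set(gold)
        n_gold += len(g)
        if pred is None:
            continue  # invalid output counts against validity and recall
        valid += 1
        p = set(pred)
        tp += len(p & g)
        n_pred += len(p)
        pred_starts = {start for _, start in p}
        start_hits += sum(1 for _, start in g if start in pred_starts)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "validity": valid / len(examples) if examples else 0.0,
        "f1": f1,
        "recall": start_hits / n_gold if n_gold else 0.0,  # start-exact recall
    }


def passes_gate(metrics: dict[str, float]) -> bool:
    """Promote only if every metric clears its threshold; otherwise discard staging."""
    return all(metrics[name] >= threshold for name, threshold in GATE.items())
```

Because every metric is computed, not judged, a failed gate is reproducible: rerunning the same staging GGUF on the same 60 examples gives the same numbers.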

See [eval-roadmap.md](./eval-roadmap.md) and the
[eval-gated fine-tuning post-mortem](./blog-eval-gated-finetuning.md) for the
gate's history and rationale, [hermes.md](./hermes.md) for the optional
tool-calling backend, and [build-small-submission.md](./build-small-submission.md)
for how the MiniCPM planner maps to the `sponsor:openbmb` track.

## Which LLM each workflow calls

| | # | Workflow | Trigger | LLM call(s) | Where it runs | |
| |---|----------|---------|-------------|----------------| |
| 1 | Agentic orchestration (Schedule flow + Agent tab) | User pastes a thread or uploads screenshots, then clicks "Find the events" / "Run the agents" | **1× MiniCPM planning loop** (`MiniCPM4.1-8B`, or the ≤4B `MiniCPM5-1B` variant; ≤6 steps) driving the Space's own MCP tools, **+ 1× gemma-cal E4B** per `extract_events` tool call (vision via mmproj); `check_conflicts`/`make_ics` are zero-LLM. Planner unconfigured or down → ScriptedPlanner runs the identical sequence with **gemma-cal only** | Two local llama-servers: gemma-cal on :8080, MiniCPM on :8081 |
| 2 | API extraction (`POST /agent`) | iOS Shortcut, Android Tasker, or Mac collector in `AGENT_MODE=1` | **1× gemma-cal E4B** (same pipeline, same prompt) | Same as #1 |
| 3 | Autonomous ingest | Mac collector → `/ingest`; your outgoing message triggers a run over the chat's rolling thread | **1× gemma-cal E4B per affected chat**, then deterministic dedup + calendar delivery | Same as #1 |
| 4 | Memory-writing agent (optional) | `HERMES_TOOLS=1` on the remote path | **Hermes-3-Llama-3.1-8B** in a tool loop (≤3 rounds): may call `remember()`, then returns the ActionPlan | Remote llama-server (e.g. Mac launchd) |
| 5 | MCP tools for external agents | MCP client calls the Space | `extract_events` → **1× gemma-cal E4B**; `make_ics` and `check_conflicts` → **zero LLM calls** | Same as #1 |
| | 6 | CI / free-tier demo | `USE_STUB_EXTRACTOR=1` | **No LLM** β regex heuristic | CPU anywhere | |
| | 7 | Training & eval (offline) | `training/gated_retrain.py` | **No LLM at the inference-API level**: data gen is template-based, eval is metric-based (no judge). The LLM here is the *training target*: QLoRA on `google/gemma-4-31B-it` / `gemma-4-E4B-it` | Modal A100/H100 | |
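
Rows 1, 4 and 6 together imply a small decision tree over environment switches. This is a sketch of one plausible reading, and the precedence, the `Backends` type, `select_backends` and the label strings are all assumptions, not the app's actual configuration code:
- stub mode overrides everything and forces the ScriptedPlanner;
- Hermes applies only when a remote `INFERENCE_BASE_URL` is set;
- the MiniCPM planner is enabled only when both `PLANNER_HF_REPO` and `PLANNER_FILE` are present.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Backends:
    extractor: str  # "stub", "hermes-remote", "gemma-remote" or "gemma-in-process"
    planner: str  # "minicpm" or "scripted"
    base_url: Optional[str]  # remote llama-server, if configured


def select_backends(env: Mapping[str, str] = os.environ) -> Backends:
    """Hypothetical precedence over the env switches named in the table."""
    base_url = env.get("INFERENCE_BASE_URL") or None
    if env.get("USE_STUB_EXTRACTOR") == "1":
        extractor = "stub"  # workflow 6: no LLM at all
    elif base_url and env.get("HERMES_TOOLS") == "1":
        extractor = "hermes-remote"  # workflow 4: Hermes only on the remote path
    elif base_url:
        extractor = "gemma-remote"
    else:
        extractor = "gemma-in-process"  # llama-cpp-python inside the Space
    planner_set = bool(env.get("PLANNER_HF_REPO") and env.get("PLANNER_FILE"))
    # Stub mode always uses the ScriptedPlanner, per the orchestration diagram.
    planner = "minicpm" if planner_set and extractor != "stub" else "scripted"
    return Backends(extractor, planner, base_url)
```

Passing a plain dict instead of `os.environ` makes each row of the table a one-line test case.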
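
Workflow 3 wraps its single gemma-cal call in two deterministic helpers:
- the per-chat rolling window (20 messages / 12 h in the diagram);
- dedup, which makes repeated runs over overlapping windows idempotent.

This is a sketch under assumptions, and none of it need match `threads.rolling_thread` or `dedup.filter_new`. Hypothetical parts:
- the `Message` shape;
- the reading of the window as "the last 20 messages within 12 h";
- the event fingerprint (lowercased title plus ISO start time);
- all function names.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta

# Window sizes from the diagram (20 msgs / 12 h); "last 20 within 12 h" is an assumption.
MAX_MSGS = 20
MAX_AGE = timedelta(hours=12)


@dataclass(frozen=True)
class Message:
    chat_id: str
    sent_at: datetime
    text: str
    is_from_me: bool


def rolling_window(messages: list[Message], chat_id: str, now: datetime) -> list[Message]:
    """Hypothetical stand-in for threads.rolling_thread: recent messages of one chat, oldest first."""
    recent = [m for m in messages if m.chat_id == chat_id and now - m.sent_at <= MAX_AGE]
    recent.sort(key=lambda m: m.sent_at)
    return recent[-MAX_MSGS:]


def should_trigger(msg: Message) -> bool:
    """Autonomous runs fire on your own outgoing messages (is_from_me)."""
    return msg.is_from_me


def event_key(title: str, start: datetime) -> str:
    """Stable fingerprint; normalising on lowercased title + start time is an assumption."""
    raw = f"{title.strip().lower()}|{start.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def filter_new_events(
    events: list[tuple[str, datetime]], seen: set[str]
) -> list[tuple[str, datetime]]:
    """Drop already-delivered events so re-runs over overlapping windows are idempotent."""
    fresh = []
    for title, start in events:
        key = event_key(title, start)
        if key not in seen:
            seen.add(key)  # in practice `seen` would be persisted between runs
            fresh.append((title, start))
    return fresh
```

Dedup matters here because consecutive outgoing messages in the same chat produce overlapping windows. Without it, the same event would be re-extracted and pushed to the calendar on every trigger.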