# Architecture — workflows and the LLMs behind them

An AI-solution-architect view of the agentic system: every workflow through the platform, and exactly which model (if any) each one calls. The architectural signature: the extraction core is **one grammar-constrained LLM call**, the **MiniCPM planner** adds a visible multi-step loop over the platform's own public MCP tool contract, everything verifiable — conflict math, dedup, time proposals, eval gates — stays deterministic, and there are **zero cloud-AI API calls anywhere**, training included.

## System workflow

```mermaid
flowchart TB
    subgraph ENTRY["1 · Entry points — four front-ends, one contract"]
        direction LR
        UIIN["🖥️ Gradio UI<br/>Schedule flow + Agent tab<br/>(paste thread, screenshots, .ics)"]
        SHORT["📱 iOS Shortcut /<br/>Android Tasker"]
        MAC["🍎 Mac collector<br/>polls iMessage chat.db<br/>(collector/collector.py)"]
        MCPC["🤖 MCP clients<br/>Claude Desktop, Cursor"]
    end

    subgraph API["2 · API & orchestration — app.py (FastAPI + Gradio, one port)"]
        AGENTEP["POST /agent<br/>bearer-token, stateless"]
        INGEST["POST /ingest → feed store<br/>AUTONOMOUS=1 triggers on<br/>your outgoing message (is_from_me)"]
        ROLL["threads.rolling_thread<br/>per-chat window (20 msgs / 12 h)"]
        MCPT["MCP tools — server/mcp_tools.py<br/>extract_events · make_ics · check_conflicts"]
    end

    subgraph ORCH["2a · Agentic orchestration — server/orchestrator.py"]
        SMOL["smolagents ToolCallingAgent<br/>planned by MiniCPM, ≤6 steps<br/>playbook: extract → check → render<br/>final ActionPlan re-derived deterministically"]
        SCRIPT["ScriptedPlanner — no LLM<br/>identical tool sequence + step events<br/>(stub mode, CI, planner failure)"]
    end

    subgraph CORE["3 · Agent core — server/pipeline.py → server/agent.py"]
        PROMPT["Prompt assembly:<br/>SYSTEM + memory recall block<br/>+ existing calendar + thread + images"]
        GEN["Grammar-constrained generation<br/>→ ActionPlan JSON (always parses)"]
        PROMPT --> GEN
    end

    subgraph LLMT["4 · LLM tier — ALL inference is local llama.cpp, zero cloud AI APIs"]
        GEMMA["⭐ gemma-cal E4B — fine-tuned Gemma 4<br/>ParetoOptimal/gemma-4-cal-gguf<br/>gemma-cal-e4b-Q4_K_M.gguf (~5 GB)<br/>+ mmproj-F16.gguf vision projector"]
        MODES["served either:<br/>· in-process llama-cpp-python (ZeroGPU lease)<br/>· remote llama-server via INFERENCE_BASE_URL<br/>(Space sidecar / Mac launchd / phone)"]
        MINICPM["🧭 MiniCPM planner — OpenBMB (sponsor)<br/>openbmb/MiniCPM4.1-8B-GGUF Q4 (~5 GB)<br/>≤4B option: openbmb/MiniCPM5-1B-GGUF (config switch)<br/>2nd llama-server :8081 — enabled via<br/>PLANNER_HF_REPO / PLANNER_FILE"]
        HERMES["(optional) Hermes-3-Llama-3.1-8B Q4_K_M<br/>HERMES_TOOLS=1 — tool-calling loop:<br/>calls remember() to write memory mid-run"]
        STUB["(no LLM) regex stub extractor<br/>USE_STUB_EXTRACTOR=1 — CI & free tier"]
        GEMMA --- MODES
    end

    subgraph DET["5 · Deterministic post-processing — no LLM"]
        CONF["freebusy.annotate_conflicts<br/>overlap / adjacent / tight<br/>+ propose_times free slots"]
        DEDUP["dedup.filter_new<br/>idempotency for autonomous runs"]
        MEMW["memory.observe_plan<br/>learns recurring contacts"]
    end

    subgraph OUT["6 · Outputs"]
        CARDS["Event cards + reply draft<br/>+ clarification question"]
        ICS["📥 .ics download<br/>(off-grid default)"]
        GCAL["📆 Google Calendar push<br/>(per-user OAuth web flow, opt-in)"]
        TRACE["Redacted trace export<br/>→ public HF dataset"]
    end

    UIIN -->|"run_orchestrator (step trace streams into the UI)"| SMOL
    SHORT --> AGENTEP
    MAC -->|"store-only"| INGEST
    MAC -->|"AGENT_MODE=1"| AGENTEP
    MCPC --> MCPT
    AGENTEP --> CORE
    INGEST --> ROLL --> CORE
    SMOL ==>|"planning loop, ≤6 steps"| MINICPM
    SMOL -->|"tool calls — the Space's OWN MCP<br/>endpoint (localhost SSE)"| MCPT
    SMOL -.->|"planner down / stub mode"| SCRIPT
    SCRIPT -->|"same tool sequence,<br/>deterministic"| MCPT
    MCPT -->|"extract_events → 1 LLM call"| CORE
    MCPT -.->|"make_ics / check_conflicts → 0 LLM calls"| DET
    GEN ==>|"default"| GEMMA
    GEN -.->|"opt-in autonomous brain"| HERMES
    GEN -.->|"tests / free demo"| STUB
    HERMES -->|"remember()"| MEMW
    LLMT --> DET --> OUT
```

## Offline loop — eval-gated fine-tuning (produces the serving LLM)

```mermaid
flowchart LR
    SEEDS["Seed data — NO LLM<br/>139 hand-authored template examples<br/>(gen_new_seeds.py / make_dataset.py)"]
    SMC["SMCalFlow import — NO LLM<br/>deterministic LISP-program parse, ~2000 rows"]
    TRAIN["QLoRA fine-tune — Unsloth on Modal A100-80GB<br/>base: google/gemma-4-31B-it or gemma-4-E4B-it<br/>r=16, lr 5e-5, 2 epochs, responses-only loss"]
    GGUF["convert_hf_to_gguf + llama-quantize<br/>→ staging Q4_K_M GGUF"]
    EVAL["Eval — NO LLM judge, deterministic metrics<br/>60-example held-out set:<br/>schema validity · event F1 · start-exact recall"]
    GATE{"Gate<br/>validity ≥ 0.95<br/>F1 ≥ 0.81<br/>recall ≥ 0.773"}
    PROD["Promote → ParetoOptimal/gemma-4-cal-gguf<br/>(the model the Space serves)"]
    TRASH["Discard staging —<br/>production untouched"]

    SEEDS --> TRAIN
    SMC --> TRAIN
    TRAIN --> GGUF --> EVAL --> GATE
    GATE -->|pass| PROD
    GATE -->|fail| TRASH
```

See [eval-roadmap.md](./eval-roadmap.md) and the [eval-gated fine-tuning post-mortem](./blog-eval-gated-finetuning.md) for the gate's history and rationale; [hermes.md](./hermes.md) for the optional tool-calling backend; [build-small-submission.md](./build-small-submission.md) for how the MiniCPM planner maps to the `sponsor:openbmb` track.

## Which LLM each workflow calls

| # | Workflow | Trigger | LLM call(s) | Where it runs |
|---|----------|---------|-------------|----------------|
| 1 | Agentic orchestration (Schedule flow + Agent tab) | User pastes thread / uploads screenshots, clicks Find the events / Run the agents | **1× MiniCPM planning loop** (`MiniCPM4.1-8B`, or `MiniCPM5-1B` ≤4B variant; ≤6 steps) driving the Space's own MCP tools, **+ 1× gemma-cal E4B** per `extract_events` tool call (vision via mmproj); `check_conflicts`/`make_ics` are zero-LLM. Planner unconfigured or down → ScriptedPlanner runs the identical sequence, **gemma-cal only** | Two local llama-servers — gemma-cal on :8080, MiniCPM on :8081 |
| 2 | API extraction (`POST /agent`) | iOS Shortcut, Android Tasker, or Mac collector in `AGENT_MODE=1` | **1× gemma-cal E4B** (same pipeline, same prompt) | Same |
| 3 | Autonomous ingest | Mac collector → `/ingest`; your outgoing message triggers a run over the chat's rolling thread | **1× gemma-cal E4B per affected chat**, then deterministic dedup + calendar delivery | Same |
| 4 | Memory-writing agent (optional) | `HERMES_TOOLS=1` on the remote path | **Hermes-3-Llama-3.1-8B** in a tool loop (≤3 rounds): may call `remember()` then returns the ActionPlan | Remote llama-server (e.g. Mac launchd) |
| 5 | MCP tools for external agents | MCP client calls the Space | `extract_events` → **1× gemma-cal E4B**; `make_ics` and `check_conflicts` → **zero LLM calls** | Same as #1 |
| 6 | CI / free-tier demo | `USE_STUB_EXTRACTOR=1` | **No LLM** — regex heuristic | CPU anywhere |
| 7 | Training & eval (offline) | `training/gated_retrain.py` | **No LLM at the inference-API level**: data gen is template-based, eval is metric-based (no judge). The LLM here is the *training target*: QLoRA on `google/gemma-4-31B-it` / `gemma-4-E4B-it` | Modal A100/H100 |
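
The per-chat rolling window that feeds autonomous ingest (workflow 3, "20 msgs / 12 h" in the system diagram) can be illustrated with a minimal sketch. This is not the code in `threads.rolling_thread`: the `Message` type, the `rolling_window` name, and the choice to apply both limits together (age cutoff first, then the message cap) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Message:
    """Hypothetical message record; field names are illustrative only."""
    sent_at: datetime
    text: str
    is_from_me: bool = False


def rolling_window(messages, now, max_msgs=20, max_age=timedelta(hours=12)):
    """Return the newest messages that satisfy both limits, oldest first.

    Assumption: a message must be no older than max_age AND among the
    max_msgs most recent ones to be kept.
    """
    if max_msgs <= 0:
        return []
    cutoff = now - max_age
    # Drop anything older than the age limit, then order chronologically.
    recent = sorted(
        (m for m in messages if m.sent_at >= cutoff),
        key=lambda m: m.sent_at,
    )
    # Keep only the tail, i.e. the newest max_msgs messages.
    return recent[-max_msgs:]


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0)
    # Forty messages, one every 30 minutes going back in time.
    chat = [Message(now - timedelta(minutes=30 * i), f"msg {i}") for i in range(40)]
    window = rolling_window(chat, now)
    print(len(window), window[0].text, window[-1].text)
```

Because the window is a pure function of the stored messages and the current time, it stays on the deterministic side of the architecture: only the extraction call that consumes the window involves a model.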
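
The promotion gate in the offline loop (workflow 7) is a plain threshold check, which is why it needs no LLM judge. The sketch below uses the three floors shown in the gate node of the diagram; the `EvalMetrics` type and the `gate_failures` and `decide` names are hypothetical and are not taken from `training/gated_retrain.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    """Hypothetical container for the three deterministic eval metrics."""
    schema_validity: float
    event_f1: float
    start_exact_recall: float


# Floors copied from the gate node in the offline-loop diagram.
GATE = EvalMetrics(schema_validity=0.95, event_f1=0.81, start_exact_recall=0.773)


def gate_failures(metrics, gate=GATE):
    """Return the names of metrics below their floor; empty list means pass."""
    failures = []
    for name in ("schema_validity", "event_f1", "start_exact_recall"):
        # The gate uses >=, so a metric exactly at its floor passes.
        if getattr(metrics, name) < getattr(gate, name):
            failures.append(name)
    return failures


def decide(metrics):
    """Map a metric set to the two outcomes in the diagram."""
    return "promote" if not gate_failures(metrics) else "discard"


if __name__ == "__main__":
    candidate = EvalMetrics(schema_validity=0.97, event_f1=0.80, start_exact_recall=0.79)
    print(decide(candidate), gate_failures(candidate))
```

Returning the list of failing metrics, not just a boolean, makes a rejected staging model easier to diagnose while leaving the production model untouched.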