# Scheduler Architecture

This page describes how PinchTab's optional scheduler works internally.

## Scope

The scheduler is an opt-in dispatch layer for queued browser tasks. It sits in front of the existing tab action execution path and adds:

- queue admission limits
- per-agent fairness
- bounded concurrency
- cooperative cancellation
- result retention with TTL

It does not replace the direct action endpoints.
## Runtime Placement

In dashboard mode, the scheduler is created only when `scheduler.enabled` is true. It is registered directly on the main mux and exposes:

- `POST /tasks`
- `GET /tasks`
- `GET /tasks/{id}`
- `POST /tasks/{id}/cancel`
- `GET /scheduler/stats`
- `POST /tasks/batch`
## High-Level Flow

```text
client
  -> POST /tasks
  -> scheduler admission
  -> in-memory task queue
  -> worker dispatch
  -> resolve tab -> instance port
  -> POST /tabs/{tabId}/action on the owning instance
  -> result store
```
## Core Components

### Scheduler

The scheduler owns:

- config
- task queue
- result store
- task lookup for live tasks
- cancellation map
- worker lifecycle
### TaskQueue

The queue is:

- in-memory
- global-limit aware
- per-agent-limit aware
- priority ordered within an agent
- fairness aware across agents

Within an agent queue:

- lower `priority` value wins
- equal priority falls back to FIFO by `CreatedAt`

Across agents:

- the agent with the fewest in-flight tasks is chosen first
### ResultStore

The result store keeps snapshots of tasks and evicts terminal tasks after the configured TTL.
### ManagerResolver

The resolver maps a `tabId` to the owning instance port through `instance.Manager.FindInstanceByTabID`. This is how the scheduler knows where to forward execution.
## Dispatch Lifecycle

A task moves through these internal states:

```text
queued -> assigned -> running -> done
                              -> failed
queued -> cancelled
queued -> rejected
```

Implementation details:

- `assigned` is set just before worker execution starts
- `running` is set immediately before the proxied action request
- `done`, `failed`, and `cancelled` are terminal
- rejected tasks are stored as terminal results even though they never enter active execution
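The lifecycle can be written down as a transition table. This is an illustrative sketch derived from the diagram and the surrounding sections, not code from the scheduler: the `queued -> failed` edge reflects queued-deadline expiry, and `running -> cancelled` reflects the cancel endpoint stopping a running task.

```go
package main

import "fmt"

// validNext maps each state to its allowed successors. Terminal states
// (done, failed, cancelled, rejected) have no outgoing edges.
var validNext = map[string]map[string]bool{
	"queued":   {"assigned": true, "cancelled": true, "rejected": true, "failed": true},
	"assigned": {"running": true},
	"running":  {"done": true, "failed": true, "cancelled": true},
}

// canTransition reports whether from -> to is a legal lifecycle step.
// Unknown or terminal states have a nil inner map, which reads as false.
func canTransition(from, to string) bool {
	return validNext[from][to]
}

func main() {
	fmt.Println(canTransition("queued", "running")) // false: must pass through assigned
	fmt.Println(canTransition("running", "done"))   // true
}
```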
## Admission And Limits

Admission checks include:

- global queue size
- per-agent queue size

Execution checks include:

- max global in-flight count
- max per-agent in-flight count

These limits are enforced in the queue and worker dispatch path rather than by external infrastructure.
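The two admission checks can be illustrated as a pure function over the current counts. This is a sketch with assumed names (`admit`, the error values); the real queue performs the equivalent checks under its own lock.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errQueueFull      = errors.New("global queue full")
	errAgentQueueFull = errors.New("per-agent queue full")
)

// admit rejects a task when either the global or the per-agent
// queue limit has been reached.
func admit(globalQueued, agentQueued, maxQueue, maxPerAgent int) error {
	if globalQueued >= maxQueue {
		return errQueueFull
	}
	if agentQueued >= maxPerAgent {
		return errAgentQueueFull
	}
	return nil
}

func main() {
	fmt.Println(admit(100, 3, 100, 10)) // global queue full
	fmt.Println(admit(50, 10, 100, 10)) // per-agent queue full
	fmt.Println(admit(50, 3, 100, 10))  // <nil>
}
```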
## Cancellation Model

Each running task gets its own `context.WithDeadline(...)`. The scheduler stores the corresponding cancel function so that:

- `POST /tasks/{id}/cancel` can stop a running task
- shutdown can cancel in-flight work
- deadlines propagate naturally to the proxied request
## Deadline Handling

Two deadline paths exist:

- queued task expiry: a background reaper scans queued tasks every second and marks expired ones as failed
- running task deadline: the per-task context deadline is enforced by the HTTP request to the executor

Queued expiry currently records the error:

```text
deadline exceeded while queued
```
## Execution Contract

The scheduler does not invent a separate execution protocol. It converts each task into the normal action request body:

```json
{
  "kind": "<action>",
  "ref": "<ref>",
  "...params": "..."
}
```

and forwards it to:

```text
POST /tabs/{tabId}/action
```

This keeps the immediate path and the scheduled path aligned.
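The conversion step could look roughly like the following. This is a hedged sketch: `buildActionRequest`, the field names, and the `127.0.0.1:<port>` address are assumptions, and error handling is reduced to the minimum.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildActionRequest flattens a task into the normal action request body
// and targets the owning instance's /tabs/{tabId}/action endpoint.
func buildActionRequest(port int, tabID, kind, ref string, params map[string]any) (*http.Request, error) {
	body := map[string]any{"kind": kind, "ref": ref}
	for k, v := range params {
		body[k] = v // extra task params ride along in the same object
	}
	buf, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("http://127.0.0.1:%d/tabs/%s/action", port, tabID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(buf))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, _ := buildActionRequest(9223, "tab-1", "click", "e12", nil)
	fmt.Println(req.Method, req.URL.Path) // POST /tabs/tab-1/action
}
```

In practice the request would be built with `http.NewRequestWithContext` using the task's deadline-bound context, which is how deadlines propagate to the executor.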
## Error Model

Common failure sources:

- admission rejection because the queue is full
- tab-to-instance resolution failure
- executor HTTP failure
- browser-side action failure
- deadline expiry

Scheduler task snapshots keep the final `error` string for each of these cases.
## Result Retention

Terminal task snapshots are stored in memory and evicted after the configured TTL. This makes `GET /tasks/{id}` and `GET /tasks` useful for short-lived inspection, not for long-term audit storage.
## Design Tradeoffs

The current scheduler favors:

- simple in-memory operation
- a low dependency count
- reuse of the existing action executor

That means it does not currently provide:

- persistent queue storage
- durable recovery after process restart
- a separate task execution DSL

---
## Phase 2 -- Observability Layer

### Metrics

The scheduler maintains global and per-agent counters via the `Metrics` struct. Global counters use `atomic.Uint64` for lock-free increments. Per-agent counters use a sharded mutex pattern (`agentMetricEntry`) to avoid global lock contention.

Tracked counters:

- `TasksSubmitted`, `TasksCompleted`, `TasksFailed`
- `TasksCancelled`, `TasksRejected`, `TasksExpired`
- `DispatchTotal` and cumulative `DispatchLatency` (used to compute average dispatch latency)

`Metrics.Snapshot()` returns a `MetricsSnapshot` -- a point-in-time serializable copy that includes per-agent breakdowns.
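The global-counter pattern can be sketched as follows. This is a reduced illustration: it keeps only a few of the counters listed above, omits the per-agent shards, and the field `DispatchLatencyNS` is an assumed name for the cumulative latency value.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metrics holds lock-free global counters, mirroring the atomic.Uint64
// pattern described above (subset of the real fields).
type metrics struct {
	TasksSubmitted    atomic.Uint64
	TasksCompleted    atomic.Uint64
	DispatchTotal     atomic.Uint64
	DispatchLatencyNS atomic.Uint64 // cumulative, for average dispatch latency
}

// snapshot is a point-in-time serializable copy of the counters.
type snapshot struct {
	Submitted, Completed uint64
	AvgDispatchNS        uint64
}

func (m *metrics) Snapshot() snapshot {
	s := snapshot{
		Submitted: m.TasksSubmitted.Load(),
		Completed: m.TasksCompleted.Load(),
	}
	if n := m.DispatchTotal.Load(); n > 0 {
		s.AvgDispatchNS = m.DispatchLatencyNS.Load() / n
	}
	return s
}

func main() {
	var m metrics
	m.TasksSubmitted.Add(3)
	m.TasksCompleted.Add(2)
	m.DispatchTotal.Add(2)
	m.DispatchLatencyNS.Add(400)
	s := m.Snapshot()
	fmt.Println(s.Submitted, s.Completed, s.AvgDispatchNS) // 3 2 200
}
```

Storing the cumulative latency plus a dispatch count, rather than a running average, keeps every increment a single atomic add.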
### Stats Endpoint

`GET /scheduler/stats` exposes three sections:

- **queue** -- current queued/inflight counts from `QueueStats()`
- **metrics** -- snapshot from `Metrics.Snapshot()`
- **config** -- current scheduler configuration values
### Lifecycle Logging

Key scheduler events are logged via `slog`:

- task submission, dispatch, completion, failure, cancellation
- queue admission rejections
- config reload events
- webhook delivery outcomes
### Webhook Delivery

When a task has a `callbackUrl` and reaches a terminal state, the scheduler sends a POST with the task snapshot as JSON. This runs asynchronously in a goroutine spawned from `finishTask()`.

Security constraints:

- only `http` and `https` schemes are accepted (SSRF mitigation)
- a dedicated `http.Client` with a 10-second timeout prevents hanging connections
- delivery failures are logged but do not affect task state

Two custom headers are sent: `X-PinchTab-Event: task.completed` and `X-PinchTab-Task-ID`.
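The delivery-side constraints can be sketched like this. `allowedCallback` and `webhookClient` are illustrative names; the scheme check and the 10-second client timeout come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// webhookClient is a dedicated client so a slow callback endpoint
// cannot hold a delivery goroutine forever.
var webhookClient = &http.Client{Timeout: 10 * time.Second}

// allowedCallback accepts only http/https callback URLs as a basic
// SSRF mitigation (rejects file://, gopher://, etc.).
func allowedCallback(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return u.Scheme == "http" || u.Scheme == "https"
}

func main() {
	fmt.Println(allowedCallback("https://example.com/hook")) // true
	fmt.Println(allowedCallback("file:///etc/passwd"))       // false
}
```

A scheme check alone does not block callbacks to internal addresses; deployments that accept untrusted `callbackUrl` values would typically also restrict destination hosts.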
---

## Phase 3 -- Hardening Layer

### Batch Submission

`POST /tasks/batch` accepts an array of task definitions (up to 50) sharing a single `agentId` and optional `callbackUrl`. Each task is submitted individually through `Submit()`, so queue admission limits apply per task.

The batch endpoint supports partial failure: if some tasks are rejected (queue full), the accepted tasks are still submitted and the response includes per-task status.
### Config Hot-Reload

`ReloadConfig(cfg Config)` updates tuning knobs at runtime:

- queue limits via `queue.SetLimits(maxQueue, maxPerAgent)`
- inflight limits via `cfgMu`-protected config fields
- result TTL via `results.SetTTL(ttl)`

Zero values in the reload config are ignored, preserving the existing setting.

`ConfigWatcher` is a background goroutine that periodically calls a user-supplied `loadFn() (Config, error)` and applies changes via `ReloadConfig`. It can be started and stopped cleanly.
### SetLimits and SetTTL

The queue and result store expose safe runtime mutators:

- `TaskQueue.SetLimits(maxQueue, maxPerAgent)` -- atomically updates admission thresholds
- `ResultStore.SetTTL(ttl)` -- updates the eviction window for terminal task snapshots
|