# Scheduler Architecture

This page describes how PinchTab's optional scheduler works internally.

## Scope

The scheduler is an opt-in dispatch layer for queued browser tasks. It sits in front of the existing tab action execution path and adds:

- queue admission limits
- per-agent fairness
- bounded concurrency
- cooperative cancellation
- result retention with TTL

It does not replace the direct action endpoints.

## Runtime Placement

In dashboard mode, the scheduler is created only when `scheduler.enabled` is true. It is registered directly on the main mux and exposes:

- `POST /tasks`
- `GET /tasks`
- `GET /tasks/{id}`
- `POST /tasks/{id}/cancel`
- `GET /scheduler/stats`
- `POST /tasks/batch`

## High-Level Flow

```text
client
  -> POST /tasks
  -> scheduler admission
  -> in-memory task queue
  -> worker dispatch
  -> resolve tab -> instance port
  -> POST /tabs/{tabId}/action on the owning instance
  -> result store
```

## Core Components

### Scheduler

The scheduler owns:

- config
- task queue
- result store
- task lookup for live tasks
- cancellation map
- worker lifecycle

### TaskQueue

The queue is:

- in-memory
- global-limit aware
- per-agent-limit aware
- priority ordered within an agent
- fairness aware across agents

Within an agent queue:

- lower `priority` value wins
- equal priority falls back to FIFO by `CreatedAt`

Across agents:

- the agent with the fewest in-flight tasks is chosen first

### ResultStore

The result store keeps snapshots of tasks and evicts terminal tasks after the configured TTL.

### ManagerResolver

The resolver maps a `tabId` to the owning instance port through `instance.Manager.FindInstanceByTabID`. This is how the scheduler knows where to forward execution.
## Dispatch Lifecycle

A task moves through these internal states:

```text
queued -> assigned -> running -> done | failed
queued -> cancelled
queued -> rejected
```

Implementation details:

- `assigned` is set just before worker execution starts
- `running` is set immediately before the proxied action request
- `done`, `failed`, and `cancelled` are terminal
- rejected tasks are stored as terminal results even though they never enter active execution

## Admission And Limits

Admission checks include:

- global queue size
- per-agent queue size

Execution checks include:

- max global in-flight count
- max per-agent in-flight count

These are enforced in the queue and worker dispatch path rather than by external infrastructure.

## Cancellation Model

Each running task gets its own `context.WithDeadline(...)`. The scheduler stores the corresponding cancel function so that:

- `POST /tasks/{id}/cancel` can stop a running task
- shutdown can cancel in-flight work
- deadlines propagate naturally to the proxied request

## Deadline Handling

Two deadline paths exist:

- queued task expiry: a background reaper scans queued tasks every second and marks expired ones as failed
- running task deadline: the per-task context deadline is enforced by the HTTP request to the executor

Queued expiry currently records:

```text
deadline exceeded while queued
```

## Execution Contract

The scheduler does not invent a separate execution protocol. It converts each task into the normal action request body:

```json
{
  "kind": "",
  "ref": "",
  "...params": "..."
}
```

and forwards it to:

```text
POST /tabs/{tabId}/action
```

That keeps the immediate path and the scheduled path aligned.

## Error Model

Common failure sources:

- admission rejection because the queue is full
- tab-to-instance resolution failure
- executor HTTP failure
- browser-side action failure
- deadline expiry

Scheduler task snapshots keep the final `error` string for these cases.
## Result Retention

Terminal task snapshots are stored in memory and evicted after the configured TTL. This makes `GET /tasks/{id}` and `GET /tasks` useful for short-lived inspection, not for long-term audit storage.

## Design Tradeoffs

The current scheduler favors:

- simple in-memory operation
- low dependency count
- reuse of the existing action executor

That means it does not currently provide:

- persistent queue storage
- durable recovery after process restart
- a separate task execution DSL

---

## Phase 2 -- Observability Layer

### Metrics

The scheduler maintains global and per-agent counters via the `Metrics` struct. Global counters use `atomic.Uint64` for lock-free increments. Per-agent counters use a sharded mutex pattern (`agentMetricEntry`) to avoid global lock contention.

Tracked counters:

- `TasksSubmitted`, `TasksCompleted`, `TasksFailed`
- `TasksCancelled`, `TasksRejected`, `TasksExpired`
- `DispatchTotal` and cumulative `DispatchLatency` (for average dispatch latency calculation)

`Metrics.Snapshot()` returns a `MetricsSnapshot` -- a point-in-time serializable copy that includes per-agent breakdowns.

### Stats Endpoint

`GET /scheduler/stats` exposes three sections:

- **queue** -- current queued/inflight counts from `QueueStats()`
- **metrics** -- snapshot from `Metrics.Snapshot()`
- **config** -- current scheduler configuration values

### Lifecycle Logging

Key scheduler events are logged via `slog`:

- task submission, dispatch, completion, failure, cancellation
- queue admission rejections
- config reload events
- webhook delivery outcomes

### Webhook Delivery

When a task has a `callbackUrl` and reaches a terminal state, the scheduler sends a POST with the task snapshot as JSON. This runs asynchronously in a goroutine from `finishTask()`.
Security constraints:

- only `http` and `https` schemes are accepted (SSRF mitigation)
- a dedicated `http.Client` with a 10-second timeout prevents hanging connections
- delivery failures are logged but do not affect task state

Custom headers are sent: `X-PinchTab-Event: task.completed` and `X-PinchTab-Task-ID`.

---

## Phase 3 -- Hardening Layer

### Batch Submission

`POST /tasks/batch` accepts an array of task definitions (up to 50) sharing a single `agentId` and optional `callbackUrl`. Each task is submitted individually through `Submit()`, so queue admission limits apply per-task.

The batch endpoint supports partial failure: if some tasks are rejected (queue full), the accepted tasks are still submitted and the response includes per-task status.

### Config Hot-Reload

`ReloadConfig(cfg Config)` updates tuning knobs at runtime:

- queue limits via `queue.SetLimits(maxQueue, maxPerAgent)`
- inflight limits via `cfgMu`-protected config fields
- result TTL via `results.SetTTL(ttl)`

Zero values in the reload config are ignored, preserving the existing setting.

`ConfigWatcher` is a background goroutine that periodically calls a user-supplied `loadFn() (Config, error)` and applies changes via `ReloadConfig`. It can be started and stopped cleanly.

### SetLimits and SetTTL

The queue and result store expose safe runtime mutators:

- `TaskQueue.SetLimits(maxQueue, maxPerAgent)` -- atomically updates admission thresholds
- `ResultStore.SetTTL(ttl)` -- updates the eviction window for terminal task snapshots