WitNote / docs /architecture /scheduler.md
AUXteam's picture
Upload folder using huggingface_hub
6a7089a verified
# Scheduler Architecture
This page describes how PinchTab's optional scheduler works internally.
## Scope
The scheduler is an opt-in dispatch layer for queued browser tasks. It sits in front of the existing tab action execution path and adds:
- queue admission limits
- per-agent fairness
- bounded concurrency
- cooperative cancellation
- result retention with TTL
It does not replace the direct action endpoints.
## Runtime Placement
In dashboard mode, the scheduler is created only when `scheduler.enabled` is true. It is registered directly on the main mux and exposes:
- `POST /tasks`
- `GET /tasks`
- `GET /tasks/{id}`
- `POST /tasks/{id}/cancel`
- `GET /scheduler/stats`
- `POST /tasks/batch`
## High-Level Flow
```text
client
-> POST /tasks
-> scheduler admission
-> in-memory task queue
-> worker dispatch
-> resolve tab -> instance port
-> POST /tabs/{tabId}/action on the owning instance
-> result store
```
## Core Components
### Scheduler
The scheduler owns:
- config
- task queue
- result store
- task lookup for live tasks
- cancellation map
- worker lifecycle
### TaskQueue
The queue is:
- in-memory
- global-limit aware
- per-agent-limit aware
- priority ordered within an agent
- fairness aware across agents
Within an agent queue:
- lower `priority` value wins
- equal priority falls back to FIFO by `CreatedAt`
Across agents:
- the agent with the fewest in-flight tasks is chosen first
### ResultStore
The result store keeps snapshots of tasks and evicts terminal tasks after the configured TTL.
### ManagerResolver
The resolver maps a `tabId` to the owning instance port through `instance.Manager.FindInstanceByTabID`.
This is how the scheduler knows where to forward execution.
## Dispatch Lifecycle
A task moves through these internal states:
```text
queued -> assigned -> running -> done
-> failed
queued -> cancelled
queued -> rejected
```
Implementation details:
- `assigned` is set just before worker execution starts
- `running` is set immediately before the proxied action request
- `done`, `failed`, and `cancelled` are terminal
- rejected tasks are stored as terminal results even though they never enter active execution
## Admission And Limits
Admission checks include:
- global queue size
- per-agent queue size
Execution checks include:
- max global in-flight count
- max per-agent in-flight count
These are enforced in the queue and worker dispatch path rather than by external infrastructure.
## Cancellation Model
Each running task gets its own `context.WithDeadline(...)`.
The scheduler stores the corresponding cancel function so that:
- `POST /tasks/{id}/cancel` can stop a running task
- shutdown can cancel in-flight work
- deadlines propagate naturally to the proxied request
## Deadline Handling
Two deadline paths exist:
- queued task expiry: a background reaper scans queued tasks every second and marks expired ones as failed
- running task deadline: the per-task context deadline is enforced by the HTTP request to the executor
Queued expiry currently records:
```text
deadline exceeded while queued
```
## Execution Contract
The scheduler does not invent a separate execution protocol. It converts each task into the normal action request body:
```json
{
"kind": "<action>",
"ref": "<ref>",
"...params": "..."
}
```
and forwards it to:
```text
POST /tabs/{tabId}/action
```
That keeps the immediate path and scheduled path aligned.
## Error Model
Common failure sources:
- admission rejection because the queue is full
- tab-to-instance resolution failure
- executor HTTP failure
- browser-side action failure
- deadline expiry
Scheduler task snapshots keep the final `error` string for these cases.
## Result Retention
Terminal task snapshots are stored in memory and evicted after the configured TTL. This makes `GET /tasks/{id}` and `GET /tasks` useful for short-lived inspection, not for long-term audit storage.
## Design Tradeoffs
The current scheduler favors:
- simple in-memory operation
- low dependency count
- reuse of the existing action executor
That means it does not currently provide:
- persistent queue storage
- durable recovery after process restart
- a separate task execution DSL
---
## Phase 2 -- Observability Layer
### Metrics
The scheduler maintains global and per-agent counters via the `Metrics` struct. Global counters use `atomic.Uint64` for lock-free increments. Per-agent counters use a sharded mutex pattern (`agentMetricEntry`) to avoid global lock contention.
Tracked counters:
- `TasksSubmitted`, `TasksCompleted`, `TasksFailed`
- `TasksCancelled`, `TasksRejected`, `TasksExpired`
- `DispatchTotal` and cumulative `DispatchLatency` (for average dispatch latency calculation)
`Metrics.Snapshot()` returns a `MetricsSnapshot` -- a point-in-time serializable copy that includes per-agent breakdowns.
### Stats Endpoint
`GET /scheduler/stats` exposes three sections:
- **queue** -- current queued/inflight counts from `QueueStats()`
- **metrics** -- snapshot from `Metrics.Snapshot()`
- **config** -- current scheduler configuration values
### Lifecycle Logging
Key scheduler events are logged via `slog`:
- task submission, dispatch, completion, failure, cancellation
- queue admission rejections
- config reload events
- webhook delivery outcomes
### Webhook Delivery
When a task has a `callbackUrl` and reaches a terminal state, the scheduler sends a POST with the task snapshot as JSON. This runs asynchronously in a goroutine from `finishTask()`.
Security constraints:
- only `http` and `https` schemes are accepted (SSRF mitigation)
- a dedicated `http.Client` with a 10-second timeout prevents hanging connections
- delivery failures are logged but do not affect task state
Custom headers are sent: `X-PinchTab-Event: task.completed` and `X-PinchTab-Task-ID`.
---
## Phase 3 -- Hardening Layer
### Batch Submission
`POST /tasks/batch` accepts an array of task definitions (up to 50) sharing a single `agentId` and optional `callbackUrl`. Each task is submitted individually through `Submit()`, so queue admission limits apply per-task.
The batch endpoint supports partial failure: if some tasks are rejected (queue full), the accepted tasks are still submitted and the response includes per-task status.
### Config Hot-Reload
`ReloadConfig(cfg Config)` updates tuning knobs at runtime:
- queue limits via `queue.SetLimits(maxQueue, maxPerAgent)`
- inflight limits via `cfgMu`-protected config fields
- result TTL via `results.SetTTL(ttl)`
Zero values in the reload config are ignored, preserving the existing setting.
`ConfigWatcher` is a background goroutine that periodically calls a user-supplied `loadFn() (Config, error)` and applies changes via `ReloadConfig`. It can be started and stopped cleanly.
### SetLimits and SetTTL
The queue and result store expose safe runtime mutators:
- `TaskQueue.SetLimits(maxQueue, maxPerAgent)` -- atomically updates admission thresholds
- `ResultStore.SetTTL(ttl)` -- updates the eviction window for terminal task snapshots