	# Scheduler Architecture

	This page describes how PinchTab's optional scheduler works internally.

	## Scope

	The scheduler is an opt-in dispatch layer for queued browser tasks. It sits in front of the existing tab action execution path and adds:

	- queue admission limits
	- per-agent fairness
	- bounded concurrency
	- cooperative cancellation
	- result retention with TTL

	It does not replace the direct action endpoints.

	## Runtime Placement

	In dashboard mode, the scheduler is created only when `scheduler.enabled` is true. It is registered directly on the main mux and exposes:

	- `POST /tasks`
	- `GET /tasks`
	- `GET /tasks/{id}`
	- `POST /tasks/{id}/cancel`
	- `GET /scheduler/stats`
	- `POST /tasks/batch`

	## High-Level Flow

	```text
	client
	  -> POST /tasks
	  -> scheduler admission
	  -> in-memory task queue
	  -> worker dispatch
	  -> resolve tab -> instance port
	  -> POST /tabs/{tabId}/action on the owning instance
	  -> result store
	```

	## Core Components

	### Scheduler

	The scheduler owns:

	- config
	- task queue
	- result store
	- task lookup for live tasks
	- cancellation map
	- worker lifecycle

	### TaskQueue

	The queue is:

	- in-memory
	- global-limit aware
	- per-agent-limit aware
	- priority ordered within an agent
	- fairness aware across agents

	Within an agent queue:

	- lower `priority` value wins
	- equal priority falls back to FIFO by `CreatedAt`

	Across agents:

	- the agent with the fewest in-flight tasks is chosen first
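
The ordering rules above can be sketched in Go as follows. This is a minimal illustration, not the project's actual `TaskQueue`: the `Task` struct, `sortAgentQueue`, and `pickAgent` are hypothetical names, and the tie-break by agent ID exists only to keep the sketch deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Task is a hypothetical, trimmed-down task record used only in this sketch.
type Task struct {
	ID        string
	AgentID   string
	Priority  int
	CreatedAt time.Time
}

// sortAgentQueue orders one agent's tasks: lower Priority first,
// then FIFO by CreatedAt when priorities are equal.
func sortAgentQueue(tasks []Task) {
	sort.SliceStable(tasks, func(i, j int) bool {
		if tasks[i].Priority != tasks[j].Priority {
			return tasks[i].Priority < tasks[j].Priority
		}
		return tasks[i].CreatedAt.Before(tasks[j].CreatedAt)
	})
}

// pickAgent returns the agent that has queued work and the fewest in-flight tasks.
// Ties are broken by agent ID (an assumption made for determinism).
func pickAgent(queued map[string][]Task, inflight map[string]int) (string, bool) {
	best, found := "", false
	for agent, tasks := range queued {
		if len(tasks) == 0 {
			continue
		}
		if !found || inflight[agent] < inflight[best] ||
			(inflight[agent] == inflight[best] && agent < best) {
			best, found = agent, true
		}
	}
	return best, found
}

func main() {
	t0 := time.Unix(0, 0)
	queued := map[string][]Task{
		"a": {
			{ID: "a1", AgentID: "a", Priority: 5, CreatedAt: t0},
			{ID: "a2", AgentID: "a", Priority: 1, CreatedAt: t0.Add(time.Second)},
		},
		"b": {{ID: "b1", AgentID: "b", Priority: 3, CreatedAt: t0}},
	}
	inflight := map[string]int{"a": 2, "b": 0}

	agent, _ := pickAgent(queued, inflight)
	sortAgentQueue(queued[agent])
	fmt.Println("next:", agent, queued[agent][0].ID)
}
```

Keeping selection in two steps (pick the agent, then take the head of that agent's queue) means a single busy agent cannot starve others regardless of how it sets `priority`.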

	### ResultStore

	The result store keeps snapshots of tasks and evicts terminal tasks after the configured TTL.
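
A TTL-bounded store of this kind might look roughly like the sketch below. The type layout, the `Put`/`Get`/`EvictExpired` methods, and the `FinishedAt` field are assumptions for illustration; only the behaviour (terminal snapshots are dropped once older than the TTL) comes from the description above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// snapshot is a hypothetical stored copy of a task.
type snapshot struct {
	State      string
	FinishedAt time.Time // zero until the task reaches a terminal state
}

// ResultStore is a sketch of a TTL-bounded in-memory store.
type ResultStore struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]snapshot
}

func NewResultStore(ttl time.Duration) *ResultStore {
	return &ResultStore{ttl: ttl, items: make(map[string]snapshot)}
}

// Put stores a snapshot; terminal snapshots are stamped with the current time.
func (s *ResultStore) Put(id, state string, terminal bool) {
	snap := snapshot{State: state}
	if terminal {
		snap.FinishedAt = time.Now()
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[id] = snap
}

// Get returns the snapshot for id if it is still retained.
func (s *ResultStore) Get(id string) (snapshot, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap, ok := s.items[id]
	return snap, ok
}

// EvictExpired drops terminal snapshots older than the TTL
// and reports how many were removed. Live tasks are never evicted.
func (s *ResultStore) EvictExpired() int {
	now := time.Now()
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for id, snap := range s.items {
		if !snap.FinishedAt.IsZero() && now.Sub(snap.FinishedAt) > s.ttl {
			delete(s.items, id)
			removed++
		}
	}
	return removed
}

func main() {
	store := NewResultStore(50 * time.Millisecond)
	store.Put("t1", "done", true)
	store.Put("t2", "running", false)
	time.Sleep(100 * time.Millisecond)
	fmt.Println("evicted:", store.EvictExpired())
	_, ok := store.Get("t2")
	fmt.Println("t2 retained:", ok)
}
```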

	### ManagerResolver

	The resolver maps a `tabId` to the owning instance port through `instance.Manager.FindInstanceByTabID`.

	This is how the scheduler knows where to forward execution.

	## Dispatch Lifecycle

	A task moves through these internal states:

	```text
	queued -> assigned -> running -> done
	                              -> failed
	queued -> cancelled
	queued -> rejected
	```

	Implementation details:

	- `assigned` is set just before worker execution starts
	- `running` is set immediately before the proxied action request
	- `done`, `failed`, and `cancelled` are terminal
	- rejected tasks are stored as terminal results even though they never enter active execution
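
One way to encode this lifecycle is a small transition table. The sketch below is illustrative only: the `State` type and helper names are hypothetical, and two edges are assumptions inferred from other sections rather than from the diagram (`queued -> failed` for queued-deadline expiry, and `running -> cancelled` for the cancel endpoint).

```go
package main

import "fmt"

// State is a hypothetical string-backed task state.
type State string

const (
	Queued    State = "queued"
	Assigned  State = "assigned"
	Running   State = "running"
	Done      State = "done"
	Failed    State = "failed"
	Cancelled State = "cancelled"
	Rejected  State = "rejected"
)

// transitions lists the allowed moves in this sketch.
// queued -> failed and running -> cancelled are assumptions (see lead-in).
var transitions = map[State][]State{
	Queued:   {Assigned, Cancelled, Rejected, Failed},
	Assigned: {Running},
	Running:  {Done, Failed, Cancelled},
}

// IsTerminal reports whether no further transition is expected.
// Rejected is included because rejected tasks are stored as terminal results.
func (s State) IsTerminal() bool {
	switch s {
	case Done, Failed, Cancelled, Rejected:
		return true
	}
	return false
}

// CanTransition reports whether moving from one state to another is allowed.
func CanTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Queued, Assigned)) // true
	fmt.Println(CanTransition(Done, Running))    // false
	fmt.Println(Rejected.IsTerminal())           // true
}
```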

	## Admission And Limits

	Admission checks include:

	- global queue size
	- per-agent queue size

	Execution checks include:

	- max global in-flight count
	- max per-agent in-flight count

	These are enforced in the queue and worker dispatch path rather than by external infrastructure.
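
The two kinds of checks can be sketched as a pair of functions: one run at submission time, one run before dispatch. All type, field, and error names here are hypothetical; the real scheduler keeps this state inside the queue and worker path.

```go
package main

import (
	"errors"
	"fmt"
)

// Limits groups the four limits named above (field names are assumptions).
type Limits struct {
	MaxQueue            int // admission: global queue size
	MaxPerAgent         int // admission: per-agent queue size
	MaxInflight         int // execution: global in-flight count
	MaxPerAgentInflight int // execution: per-agent in-flight count
}

// Counts tracks current queue and in-flight totals.
type Counts struct {
	Queued          int
	Inflight        int
	QueuedByAgent   map[string]int
	InflightByAgent map[string]int
}

func NewCounts() *Counts {
	return &Counts{QueuedByAgent: map[string]int{}, InflightByAgent: map[string]int{}}
}

var (
	ErrQueueFull      = errors.New("queue full")
	ErrAgentQueueFull = errors.New("agent queue full")
)

// Admit applies the admission checks and, on success, records the queued task.
func Admit(l Limits, c *Counts, agent string) error {
	if c.Queued >= l.MaxQueue {
		return ErrQueueFull
	}
	if c.QueuedByAgent[agent] >= l.MaxPerAgent {
		return ErrAgentQueueFull
	}
	c.Queued++
	c.QueuedByAgent[agent]++
	return nil
}

// CanDispatch applies the execution checks before a worker picks up a task.
func CanDispatch(l Limits, c *Counts, agent string) bool {
	return c.Inflight < l.MaxInflight &&
		c.InflightByAgent[agent] < l.MaxPerAgentInflight
}

func main() {
	limits := Limits{MaxQueue: 2, MaxPerAgent: 1, MaxInflight: 1, MaxPerAgentInflight: 1}
	c := NewCounts()
	fmt.Println(Admit(limits, c, "a"))
	fmt.Println(Admit(limits, c, "a"))
	fmt.Println(CanDispatch(limits, c, "a"))
}
```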

	## Cancellation Model

	Each running task gets its own context, created with `context.WithDeadline(...)`.

	The scheduler stores the corresponding cancel function so that:

	- `POST /tasks/{id}/cancel` can stop a running task
	- shutdown can cancel in-flight work
	- deadlines propagate naturally to the proxied request
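
A minimal sketch of this pattern, assuming a registry keyed by task ID and a shared root context for shutdown. The `cancelRegistry` type and its method names are hypothetical; only `context.WithDeadline` and the stored cancel function come from the description above.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// cancelRegistry is a hypothetical stand-in for the scheduler's cancellation map.
type cancelRegistry struct {
	root     context.Context
	shutdown context.CancelFunc // cancels every task derived from root
	mu       sync.Mutex
	cancels  map[string]context.CancelFunc
}

func newCancelRegistry() *cancelRegistry {
	root, shutdown := context.WithCancel(context.Background())
	return &cancelRegistry{
		root:     root,
		shutdown: shutdown,
		cancels:  make(map[string]context.CancelFunc),
	}
}

// start derives a per-task context with a deadline and remembers its cancel function.
func (r *cancelRegistry) start(id string, timeout time.Duration) context.Context {
	ctx, cancel := context.WithDeadline(r.root, time.Now().Add(timeout))
	r.mu.Lock()
	r.cancels[id] = cancel
	r.mu.Unlock()
	return ctx
}

// cancel stops a running task and forgets it. It reports false when the
// task is unknown or was already cancelled.
func (r *cancelRegistry) cancel(id string) bool {
	r.mu.Lock()
	cancel, ok := r.cancels[id]
	delete(r.cancels, id)
	r.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

func main() {
	reg := newCancelRegistry()

	ctx := reg.start("t1", time.Minute)
	fmt.Println(reg.cancel("t1"), ctx.Err())

	ctx2 := reg.start("t2", time.Minute)
	reg.shutdown() // shutdown cancels all in-flight work at once
	fmt.Println(ctx2.Err())
}
```

Passing the returned context to the outgoing HTTP request is what lets both an explicit cancel and the deadline abort the proxied call.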

	## Deadline Handling

	Two deadline paths exist:

	- queued task expiry: a background reaper scans queued tasks every second and marks expired ones as failed
	- running task deadline: the per-task context is attached to the HTTP request sent to the executor, so the request is aborted once the deadline passes

	Queued expiry currently records:

	```text
	deadline exceeded while queued
	```
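
The queued-expiry path could be sketched as below. The `reaper` and `queuedTask` types are hypothetical; the one-second scan interval, the `failed` state, and the error string are taken from the text above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// queuedTask is a hypothetical queued-task record for this sketch.
type queuedTask struct {
	ID       string
	Deadline time.Time
	State    string
	Error    string
}

func newQueuedTask(id string, timeout time.Duration) *queuedTask {
	return &queuedTask{ID: id, Deadline: time.Now().Add(timeout), State: "queued"}
}

type reaper struct {
	mu     sync.Mutex
	queued []*queuedTask
}

func (r *reaper) add(t *queuedTask) {
	r.mu.Lock()
	r.queued = append(r.queued, t)
	r.mu.Unlock()
}

// reapOnce fails every queued task whose deadline has passed,
// removes it from the queue, and returns the expired tasks.
func (r *reaper) reapOnce() []*queuedTask {
	now := time.Now()
	r.mu.Lock()
	defer r.mu.Unlock()
	var expired []*queuedTask
	kept := r.queued[:0]
	for _, t := range r.queued {
		if now.After(t.Deadline) {
			t.State, t.Error = "failed", "deadline exceeded while queued"
			expired = append(expired, t)
			continue
		}
		kept = append(kept, t)
	}
	r.queued = kept
	return expired
}

// run scans once per second until stop is closed.
func (r *reaper) run(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			r.reapOnce()
		}
	}
}

func main() {
	r := &reaper{}
	r.add(newQueuedTask("stale", -time.Second))
	r.add(newQueuedTask("fresh", time.Minute))
	for _, t := range r.reapOnce() {
		fmt.Println(t.ID, t.State, t.Error)
	}
}
```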

	## Execution Contract

	The scheduler does not invent a separate execution protocol. It converts each task into the normal action request body:

	```json
	{
	  "kind": "<action>",
	  "ref": "<ref>",
	  "...params": "..."
	}
	```

	and forwards it to:

	```text
	POST /tabs/{tabId}/action
	```

	That keeps the immediate path and scheduled path aligned.
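
As a rough illustration, the conversion might look like this. The function names are hypothetical, the loopback host `127.0.0.1` is an assumption about where instances listen, and omitting `ref` when it is empty is a choice made for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// buildActionBody flattens a task into the normal action request body:
// extra params are merged at the top level next to "kind" and "ref".
func buildActionBody(kind, ref string, params map[string]any) ([]byte, error) {
	body := make(map[string]any, len(params)+2)
	for k, v := range params {
		body[k] = v
	}
	// Set these last so params cannot overwrite them.
	body["kind"] = kind
	if ref != "" {
		body["ref"] = ref
	}
	return json.Marshal(body)
}

// actionURL builds the executor URL for a tab on the owning instance.
// The host is an assumption; only the path comes from the document.
func actionURL(port int, tabID string) string {
	return fmt.Sprintf("http://127.0.0.1:%d/tabs/%s/action", port, url.PathEscape(tabID))
}

func main() {
	body, err := buildActionBody("click", "e5", map[string]any{"button": "left"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	fmt.Println(actionURL(9001, "tab1"))
}
```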

	## Error Model

	Common failure sources:

	- admission rejection because the queue is full
	- tab-to-instance resolution failure
	- executor HTTP failure
	- browser-side action failure
	- deadline expiry

	Scheduler task snapshots keep the final `error` string for these cases.

	## Result Retention

	Terminal task snapshots are stored in memory and evicted after the configured TTL. This makes `GET /tasks/{id}` and `GET /tasks` useful for short-lived inspection, not for long-term audit storage.

	## Design Tradeoffs

	The current scheduler favors:

	- simple in-memory operation
	- low dependency count
	- reuse of the existing action executor

	That means it does not currently provide:

	- persistent queue storage
	- durable recovery after process restart
	- a separate task execution DSL

	---

	## Phase 2 -- Observability Layer

	### Metrics

	The scheduler maintains global and per-agent counters via the `Metrics` struct. Global counters use `atomic.Uint64` for lock-free increments. Per-agent counters use a sharded mutex pattern (`agentMetricEntry`) to avoid global lock contention.

	Tracked counters:

	- `TasksSubmitted`, `TasksCompleted`, `TasksFailed`
	- `TasksCancelled`, `TasksRejected`, `TasksExpired`
	- `DispatchTotal` and cumulative `DispatchLatency` (for average dispatch latency calculation)

	`Metrics.Snapshot()` returns a `MetricsSnapshot` -- a point-in-time serializable copy that includes per-agent breakdowns.
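
The counter layout can be approximated as follows. This is a reduced sketch with assumptions: it tracks only a few counters, uses a `sync.Map` to locate per-agent entries (each with its own mutex) rather than whatever sharding the real `agentMetricEntry` uses, and the snapshot field names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// agentEntry holds one agent's counters behind its own mutex,
// so agents do not contend on a single global lock.
type agentEntry struct {
	mu        sync.Mutex
	submitted uint64
}

// Metrics is a reduced sketch of the scheduler's counters.
type Metrics struct {
	TasksSubmitted  atomic.Uint64
	DispatchTotal   atomic.Uint64
	DispatchLatency atomic.Uint64 // cumulative nanoseconds

	agents sync.Map // agent ID -> *agentEntry
}

func (m *Metrics) RecordSubmitted(agent string) {
	m.TasksSubmitted.Add(1)
	v, _ := m.agents.LoadOrStore(agent, &agentEntry{})
	e := v.(*agentEntry)
	e.mu.Lock()
	e.submitted++
	e.mu.Unlock()
}

func (m *Metrics) RecordDispatch(latency time.Duration) {
	m.DispatchTotal.Add(1)
	m.DispatchLatency.Add(uint64(latency))
}

// MetricsSnapshot is a plain, serializable point-in-time copy.
type MetricsSnapshot struct {
	TasksSubmitted     uint64            `json:"tasksSubmitted"`
	AvgDispatchLatency time.Duration     `json:"avgDispatchLatency"`
	Agents             map[string]uint64 `json:"agents"`
}

func (m *Metrics) Snapshot() MetricsSnapshot {
	snap := MetricsSnapshot{
		TasksSubmitted: m.TasksSubmitted.Load(),
		Agents:         map[string]uint64{},
	}
	// Average latency = cumulative latency / number of dispatches.
	if n := m.DispatchTotal.Load(); n > 0 {
		snap.AvgDispatchLatency = time.Duration(m.DispatchLatency.Load() / n)
	}
	m.agents.Range(func(k, v any) bool {
		e := v.(*agentEntry)
		e.mu.Lock()
		snap.Agents[k.(string)] = e.submitted
		e.mu.Unlock()
		return true
	})
	return snap
}

func main() {
	m := &Metrics{}
	m.RecordSubmitted("agent-a")
	m.RecordSubmitted("agent-a")
	m.RecordSubmitted("agent-b")
	m.RecordDispatch(2 * time.Millisecond)
	m.RecordDispatch(4 * time.Millisecond)
	snap := m.Snapshot()
	fmt.Println(snap.TasksSubmitted, snap.AvgDispatchLatency, snap.Agents["agent-a"])
}
```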

	### Stats Endpoint

	`GET /scheduler/stats` exposes three sections:

	- queue -- current queued and in-flight counts from `QueueStats()`
	- metrics -- snapshot from `Metrics.Snapshot()`
	- config -- current scheduler configuration values

	### Lifecycle Logging

	Key scheduler events are logged via `slog`:

	- task submission, dispatch, completion, failure, cancellation
	- queue admission rejections
	- config reload events
	- webhook delivery outcomes

	### Webhook Delivery

	When a task has a `callbackUrl` and reaches a terminal state, the scheduler sends a POST with the task snapshot as JSON. This runs asynchronously in a goroutine from `finishTask()`.

	Security constraints:

	- only `http` and `https` schemes are accepted (a partial SSRF mitigation; scheme filtering alone does not restrict which hosts can be reached)
	- a dedicated `http.Client` with a 10-second timeout prevents hanging connections
	- delivery failures are logged but do not affect task state

	Custom headers are sent: `X-PinchTab-Event: task.completed` and `X-PinchTab-Task-ID`.
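
The delivery constraints can be sketched like this. The function names are hypothetical and the `Content-Type` header and empty-host check are assumptions; the scheme check, the 10-second client timeout, and the two `X-PinchTab-*` headers come from the description above.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

var errBadScheme = errors.New("callback URL must use http or https")

// validateCallbackURL accepts only http and https URLs with a host.
func validateCallbackURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errBadScheme
	}
	if u.Host == "" {
		return errors.New("callback URL has no host")
	}
	return nil
}

// newWebhookClient returns a dedicated client so slow receivers cannot hang delivery.
func newWebhookClient() *http.Client {
	return &http.Client{Timeout: 10 * time.Second}
}

// buildWebhookRequest prepares the POST with the task snapshot as JSON body.
func buildWebhookRequest(callbackURL, taskID string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, callbackURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-PinchTab-Event", "task.completed")
	req.Header.Set("X-PinchTab-Task-ID", taskID)
	return req, nil
}

// deliver sends the request; callers would log the error and leave task state untouched.
func deliver(client *http.Client, req *http.Request) error {
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("webhook returned status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	fmt.Println(validateCallbackURL("https://example.com/hook"))
	fmt.Println(validateCallbackURL("file:///etc/passwd"))
}
```

In the scheduler this would run in its own goroutine, for example `go func() { _ = deliver(client, req) }()`, so a slow receiver never blocks task completion.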

	---

	## Phase 3 -- Hardening Layer

	### Batch Submission

	`POST /tasks/batch` accepts an array of task definitions (up to 50) sharing a single `agentId` and optional `callbackUrl`. Each task is submitted individually through `Submit()`, so queue admission limits apply per-task.

	The batch endpoint supports partial failure: if some tasks are rejected (queue full), the accepted tasks are still submitted and the response includes per-task status.
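
Partial failure handling might be structured as in the sketch below. `TaskDef`, `BatchItem`, the status strings, and the injected `submit` function are hypothetical; the 50-task cap and the per-task submission come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

const maxBatchSize = 50

// TaskDef is a hypothetical, trimmed-down task definition.
type TaskDef struct {
	TabID  string
	Action string
}

// BatchItem is the per-task entry in the batch response (an assumed shape).
type BatchItem struct {
	Index  int
	TaskID string
	Status string
	Error  string
}

var (
	errQueueFull     = errors.New("queue full")
	errBatchTooLarge = errors.New("batch exceeds 50 tasks")
)

// submitBatch submits each definition individually, so admission limits
// apply per task and one rejection does not undo earlier acceptances.
func submitBatch(defs []TaskDef, submit func(TaskDef) (string, error)) ([]BatchItem, error) {
	if len(defs) > maxBatchSize {
		return nil, errBatchTooLarge
	}
	items := make([]BatchItem, 0, len(defs))
	for i, def := range defs {
		id, err := submit(def)
		if err != nil {
			items = append(items, BatchItem{Index: i, Status: "rejected", Error: err.Error()})
			continue
		}
		items = append(items, BatchItem{Index: i, TaskID: id, Status: "queued"})
	}
	return items, nil
}

func main() {
	accepted := 0
	// Stand-in for Submit(): accepts two tasks, then reports a full queue.
	submit := func(TaskDef) (string, error) {
		if accepted >= 2 {
			return "", errQueueFull
		}
		accepted++
		return fmt.Sprintf("task-%d", accepted), nil
	}
	items, _ := submitBatch(make([]TaskDef, 3), submit)
	for _, it := range items {
		fmt.Println(it.Index, it.Status, it.TaskID, it.Error)
	}
}
```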

	### Config Hot-Reload

	`ReloadConfig(cfg Config)` updates tuning knobs at runtime:

	- queue limits via `queue.SetLimits(maxQueue, maxPerAgent)`
	- in-flight limits via config fields guarded by `cfgMu`
	- result TTL via `results.SetTTL(ttl)`

	Zero values in the reload config are ignored, preserving the existing setting.
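
The "zero means keep" rule amounts to a field-by-field merge. A minimal sketch, assuming a `Config` with the fields named here (the real struct may differ) and a hypothetical `mergeConfig` helper:

```go
package main

import (
	"fmt"
	"time"
)

// Config holds the tunable knobs; field names are assumptions for this sketch.
type Config struct {
	MaxQueue            int
	MaxPerAgent         int
	MaxInflight         int
	MaxPerAgentInflight int
	ResultTTL           time.Duration
}

// mergeConfig returns cur with every non-zero field of next applied.
// Zero values in next are ignored, preserving the existing setting.
func mergeConfig(cur, next Config) Config {
	if next.MaxQueue != 0 {
		cur.MaxQueue = next.MaxQueue
	}
	if next.MaxPerAgent != 0 {
		cur.MaxPerAgent = next.MaxPerAgent
	}
	if next.MaxInflight != 0 {
		cur.MaxInflight = next.MaxInflight
	}
	if next.MaxPerAgentInflight != 0 {
		cur.MaxPerAgentInflight = next.MaxPerAgentInflight
	}
	if next.ResultTTL != 0 {
		cur.ResultTTL = next.ResultTTL
	}
	return cur
}

func main() {
	cur := Config{MaxQueue: 100, MaxPerAgent: 10, ResultTTL: 5 * time.Minute}
	merged := mergeConfig(cur, Config{MaxPerAgent: 25})
	fmt.Printf("%+v\n", merged)
}
```

A consequence of this rule is that a reload cannot set a limit to zero; disabling a limit would need a separate convention.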

	`ConfigWatcher` is a background goroutine that periodically calls a user-supplied `loadFn() (Config, error)` and applies changes via `ReloadConfig`. It can be started and stopped cleanly.
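
A watcher of this shape could be sketched as a ticker loop with explicit stop and done channels. The constructor, field names, and the `apply` callback (standing in for `ReloadConfig`) are assumptions; skipping a cycle when `loadFn` fails is also a choice made for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Config is reduced to one field for this sketch.
type Config struct {
	MaxQueue int
}

// ConfigWatcher periodically loads a config and hands it to apply.
type ConfigWatcher struct {
	interval time.Duration
	load     func() (Config, error)
	apply    func(Config)
	stop     chan struct{}
	done     chan struct{}
}

func NewConfigWatcher(interval time.Duration, load func() (Config, error), apply func(Config)) *ConfigWatcher {
	return &ConfigWatcher{
		interval: interval,
		load:     load,
		apply:    apply,
		stop:     make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// Start launches the background loop.
func (w *ConfigWatcher) Start() {
	go func() {
		defer close(w.done)
		ticker := time.NewTicker(w.interval)
		defer ticker.Stop()
		for {
			select {
			case <-w.stop:
				return
			case <-ticker.C:
				cfg, err := w.load()
				if err != nil {
					continue // keep the previous config on load errors
				}
				w.apply(cfg)
			}
		}
	}()
}

// Stop signals the loop and waits for it to exit. Call it once.
func (w *ConfigWatcher) Stop() {
	close(w.stop)
	<-w.done
}

func main() {
	applied := make(chan Config, 1)
	w := NewConfigWatcher(10*time.Millisecond,
		func() (Config, error) { return Config{MaxQueue: 200}, nil },
		func(c Config) {
			select {
			case applied <- c:
			default: // drop the update if nobody is reading
			}
		})
	w.Start()
	fmt.Println("reloaded MaxQueue:", (<-applied).MaxQueue)
	w.Stop()
}
```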

	### SetLimits and SetTTL

	The queue and result store expose safe runtime mutators:

	- `TaskQueue.SetLimits(maxQueue, maxPerAgent)` -- atomically updates admission thresholds
	- `ResultStore.SetTTL(ttl)` -- updates the eviction window for terminal task snapshots