Spaces:

build-small-hackathon
/

multi-agent-lab

Sleeping

App Files Files Community

multi-agent-lab / docs /architecture /persistence.md

agharsallah

feat: phase 2 architecture — manifest system, structured output, salience memory, SQLite ledger, observer

0d7db8e 26 days ago

preview code

Raw

History Blame Contribute Delete

4.61 kB

	# Persistence — SQLite Ledger, Snapshot, and Replay

	## Why persistence matters for long-running agents

	An interactive demo that runs for 5 minutes can live entirely in memory.
	A village that simulates a day over 8 wall-clock hours cannot:
	- The process will be restarted
	- The machine will sleep
	- The user will want to resume yesterday's story

	The fix: make the checkpoint the ledger. The ledger is already the source of truth.
	Persisting it is the only persistence you need.

	---

	## SQLiteLedger

	Drop-in replacement for the in-memory `Ledger`. Same API, same idempotency,
	plus durable storage:

	```python
	# In-memory (default, good for tests and short demos)
	ledger = Ledger()

	# Persistent (for long-running scenarios)
	ledger = SQLiteLedger("runs/my-village.db")

	# Pass to conductor
	conductor = Conductor(scenario, ledger=ledger)
	```

	### Schema

	```sql
	CREATE TABLE events (
	offset INTEGER PRIMARY KEY AUTOINCREMENT, -- ordering
	id TEXT UNIQUE NOT NULL, -- idempotency key
	run_id TEXT NOT NULL,
	turn INTEGER NOT NULL,
	kind TEXT NOT NULL,
	actor TEXT NOT NULL,
	payload TEXT NOT NULL, -- JSON
	created_at TEXT NOT NULL, -- ISO-8601
	schema_version INTEGER NOT NULL DEFAULT 1
	);
	CREATE INDEX idx_run_id ON events(run_id);
	CREATE INDEX idx_kind ON events(kind);
	CREATE INDEX idx_actor ON events(actor);
	```

	Key decisions:
	- `offset` is the ordering guarantee — not `created_at` (clock skew) or `turn` (duplicate turns on retry)
	- `UNIQUE(id)` is the idempotency guarantee — retried workers can't double-write
	- `payload` stored as JSON text — readable without schema migration for most changes
	- WAL mode + NORMAL synchronous — safe on crash, fast on write

	### Idempotency

	Every event has a UUID `id`. The ledger's `append()` is idempotent:
	- In-memory: deduplication via `_seen_ids` set
	- SQLite: `INSERT OR IGNORE` (actually `try/except IntegrityError` on `UNIQUE`)

	A retried conductor step emits the same events with the same IDs → safe to replay.
	This is why `uuid4()` in the event constructor is the right default (not a sequence number).

	---

	## Snapshot and replay

	### Taking a snapshot

	```python
	ledger = SQLiteLedger("village.db")
	# ... run for N turns ...
	ledger.snapshot_to("village-snapshot-turn-100.db")
	```

	SQLite's `.backup()` API is atomic and zero-copy. The snapshot is a valid,
	fully-readable database file — no special restore tooling needed.

	### Restoring from a snapshot

	```python
	ledger = SQLiteLedger.from_file("village-snapshot-turn-100.db")
	conductor = Conductor(scenario, ledger=ledger)
	# conductor.turn must be set from the max turn in the ledger
	conductor.turn = max(e.turn for e in ledger.events)
	# now step() continues from where the snapshot left off
	```

	### Crash recovery

	After a crash, the ledger file contains all events that were committed.
	Events that were computed but not yet written are lost — but idempotency means
	the next retry will recompute them with the same IDs and they will be deduplicated.

	The recovery pattern:
	1. Open the existing ledger file (not reset).
	2. Rebuild projections from the full event log.
	3. Set `conductor.turn` to the max turn in the log.
	4. Resume `step()` from there.

	---

	## Tail replay for long-running monitoring

	```python
	# Poll for new events since the last seen offset:
	offset = 0
	while True:
	new_events = ledger.tail(from_offset=offset)
	for e in new_events:
	observer.consume(e)
	offset = ledger.latest_offset()
	time.sleep(1)
	```

	This is the pattern for an external monitoring process that reads the ledger
	without being in-process with the conductor. Useful for:
	- A separate UI process
	- A dashboard that watches a running scenario
	- Post-hoc analysis of a completed run

	---

	## Phase 3 milestone: Postgres upgrade

	The `Ledger` interface is the abstraction. Swapping SQLite for Postgres is:
	1. Implement `PostgresLedger(Ledger)` using `psycopg2` + `LISTEN/NOTIFY`
	2. Add `pgvector` column for embedding-based episodic retrieval
	3. Use Postgres `SERIAL` for ordering, `UNIQUE(id)` for idempotency — same logic
	4. Factory reads `DATABASE_URL` env var, returns the right backend

	Zero scenario or agent changes.

	---

	## Snapshot schedule (recommended)

	For a scenario running 100 turns/hour:
	- Take a snapshot every 100 turns
	- Keep the last 3 snapshots
	- On crash, restore from the latest snapshot and replay the tail

	The `scripts/` directory will grow a `snapshot_and_prune.py` in Phase 3.