agharsallah
feat: phase 2 architecture β manifest system, structured output, salience memory, SQLite ledger, observer
0d7db8e | # Persistence β SQLite Ledger, Snapshot, and Replay | |
| ## Why persistence matters for long-running agents | |
| An interactive demo that runs for 5 minutes can live entirely in memory. | |
| A village that simulates a day over 8 wall-clock hours cannot: | |
| - The process will be restarted | |
| - The machine will sleep | |
| - The user will want to resume yesterday's story | |
| The fix: make the checkpoint the ledger. The ledger is already the source of truth. | |
| Persisting it is the only persistence you need. | |
| --- | |
| ## SQLiteLedger | |
| Drop-in replacement for the in-memory `Ledger`. Same API, same idempotency, | |
| plus durable storage: | |
| ```python | |
| # In-memory (default, good for tests and short demos) | |
| ledger = Ledger() | |
| # Persistent (for long-running scenarios) | |
| ledger = SQLiteLedger("runs/my-village.db") | |
| # Pass to conductor | |
| conductor = Conductor(scenario, ledger=ledger) | |
| ``` | |
| ### Schema | |
| ```sql | |
| CREATE TABLE events ( | |
| offset INTEGER PRIMARY KEY AUTOINCREMENT, -- ordering | |
| id TEXT UNIQUE NOT NULL, -- idempotency key | |
| run_id TEXT NOT NULL, | |
| turn INTEGER NOT NULL, | |
| kind TEXT NOT NULL, | |
| actor TEXT NOT NULL, | |
| payload TEXT NOT NULL, -- JSON | |
| created_at TEXT NOT NULL, -- ISO-8601 | |
| schema_version INTEGER NOT NULL DEFAULT 1 | |
| ); | |
| CREATE INDEX idx_run_id ON events(run_id); | |
| CREATE INDEX idx_kind ON events(kind); | |
| CREATE INDEX idx_actor ON events(actor); | |
| ``` | |
| **Key decisions**: | |
| - `offset` is the ordering guarantee β not `created_at` (clock skew) or `turn` (duplicate turns on retry) | |
| - `UNIQUE(id)` is the idempotency guarantee β retried workers can't double-write | |
| - `payload` stored as JSON text β readable without schema migration for most changes | |
| - WAL mode + NORMAL synchronous β safe on crash, fast on write | |
| ### Idempotency | |
| Every event has a UUID `id`. The ledger's `append()` is idempotent: | |
| - In-memory: deduplication via `_seen_ids` set | |
| - SQLite: `INSERT OR IGNORE` (actually `try/except IntegrityError` on `UNIQUE`) | |
| A retried conductor step emits the same events with the same IDs β safe to replay. | |
| This is why `uuid4()` in the event constructor is the right default (not a sequence number). | |
| --- | |
| ## Snapshot and replay | |
| ### Taking a snapshot | |
| ```python | |
| ledger = SQLiteLedger("village.db") | |
| # ... run for N turns ... | |
| ledger.snapshot_to("village-snapshot-turn-100.db") | |
| ``` | |
| SQLite's `.backup()` API is atomic and zero-copy. The snapshot is a valid, | |
| fully-readable database file β no special restore tooling needed. | |
| ### Restoring from a snapshot | |
| ```python | |
| ledger = SQLiteLedger.from_file("village-snapshot-turn-100.db") | |
| conductor = Conductor(scenario, ledger=ledger) | |
| # conductor.turn must be set from the max turn in the ledger | |
| conductor.turn = max(e.turn for e in ledger.events) | |
| # now step() continues from where the snapshot left off | |
| ``` | |
| ### Crash recovery | |
| After a crash, the ledger file contains all events that were committed. | |
| Events that were computed but not yet written are lost β but idempotency means | |
| the next retry will recompute them with the same IDs and they will be deduplicated. | |
| The recovery pattern: | |
| 1. Open the existing ledger file (not reset). | |
| 2. Rebuild projections from the full event log. | |
| 3. Set `conductor.turn` to the max turn in the log. | |
| 4. Resume `step()` from there. | |
| --- | |
| ## Tail replay for long-running monitoring | |
| ```python | |
| # Poll for new events since the last seen offset: | |
| offset = 0 | |
| while True: | |
| new_events = ledger.tail(from_offset=offset) | |
| for e in new_events: | |
| observer.consume(e) | |
| offset = ledger.latest_offset() | |
| time.sleep(1) | |
| ``` | |
| This is the pattern for an external monitoring process that reads the ledger | |
| without being in-process with the conductor. Useful for: | |
| - A separate UI process | |
| - A dashboard that watches a running scenario | |
| - Post-hoc analysis of a completed run | |
| --- | |
| ## Phase 3 milestone: Postgres upgrade | |
| The `Ledger` interface is the abstraction. Swapping SQLite for Postgres is: | |
| 1. Implement `PostgresLedger(Ledger)` using `psycopg2` + `LISTEN/NOTIFY` | |
| 2. Add `pgvector` column for embedding-based episodic retrieval | |
| 3. Use Postgres `SERIAL` for ordering, `UNIQUE(id)` for idempotency β same logic | |
| 4. Factory reads `DATABASE_URL` env var, returns the right backend | |
| Zero scenario or agent changes. | |
| --- | |
| ## Snapshot schedule (recommended) | |
| For a scenario running 100 turns/hour: | |
| - Take a snapshot every 100 turns | |
| - Keep the last 3 snapshots | |
| - On crash, restore from the latest snapshot and replay the tail | |
| The `scripts/` directory will grow a `snapshot_and_prune.py` in Phase 3. | |