multi-agent-lab / docs /architecture /persistence.md
agharsallah
feat: phase 2 architecture β€” manifest system, structured output, salience memory, SQLite ledger, observer
0d7db8e
|
Raw
History Blame Contribute Delete
4.61 kB
# Persistence β€” SQLite Ledger, Snapshot, and Replay
## Why persistence matters for long-running agents
An interactive demo that runs for 5 minutes can live entirely in memory.
A village that simulates a day over 8 wall-clock hours cannot:
- The process will be restarted
- The machine will sleep
- The user will want to resume yesterday's story
The fix: make the checkpoint the ledger. The ledger is already the source of truth.
Persisting it is the only persistence you need.
---
## SQLiteLedger
Drop-in replacement for the in-memory `Ledger`. Same API, same idempotency,
plus durable storage:
```python
# In-memory (default, good for tests and short demos)
ledger = Ledger()
# Persistent (for long-running scenarios)
ledger = SQLiteLedger("runs/my-village.db")
# Pass to conductor
conductor = Conductor(scenario, ledger=ledger)
```
### Schema
```sql
CREATE TABLE events (
offset INTEGER PRIMARY KEY AUTOINCREMENT, -- ordering
id TEXT UNIQUE NOT NULL, -- idempotency key
run_id TEXT NOT NULL,
turn INTEGER NOT NULL,
kind TEXT NOT NULL,
actor TEXT NOT NULL,
payload TEXT NOT NULL, -- JSON
created_at TEXT NOT NULL, -- ISO-8601
schema_version INTEGER NOT NULL DEFAULT 1
);
CREATE INDEX idx_run_id ON events(run_id);
CREATE INDEX idx_kind ON events(kind);
CREATE INDEX idx_actor ON events(actor);
```
**Key decisions**:
- `offset` is the ordering guarantee β€” not `created_at` (clock skew) or `turn` (duplicate turns on retry)
- `UNIQUE(id)` is the idempotency guarantee β€” retried workers can't double-write
- `payload` stored as JSON text β€” readable without schema migration for most changes
- WAL mode + NORMAL synchronous β€” safe on crash, fast on write
### Idempotency
Every event has a UUID `id`. The ledger's `append()` is idempotent:
- In-memory: deduplication via `_seen_ids` set
- SQLite: `INSERT OR IGNORE` (actually `try/except IntegrityError` on `UNIQUE`)
A retried conductor step emits the same events with the same IDs β†’ safe to replay.
This is why `uuid4()` in the event constructor is the right default (not a sequence number).
---
## Snapshot and replay
### Taking a snapshot
```python
ledger = SQLiteLedger("village.db")
# ... run for N turns ...
ledger.snapshot_to("village-snapshot-turn-100.db")
```
SQLite's `.backup()` API is atomic and zero-copy. The snapshot is a valid,
fully-readable database file β€” no special restore tooling needed.
### Restoring from a snapshot
```python
ledger = SQLiteLedger.from_file("village-snapshot-turn-100.db")
conductor = Conductor(scenario, ledger=ledger)
# conductor.turn must be set from the max turn in the ledger
conductor.turn = max(e.turn for e in ledger.events)
# now step() continues from where the snapshot left off
```
### Crash recovery
After a crash, the ledger file contains all events that were committed.
Events that were computed but not yet written are lost β€” but idempotency means
the next retry will recompute them with the same IDs and they will be deduplicated.
The recovery pattern:
1. Open the existing ledger file (not reset).
2. Rebuild projections from the full event log.
3. Set `conductor.turn` to the max turn in the log.
4. Resume `step()` from there.
---
## Tail replay for long-running monitoring
```python
# Poll for new events since the last seen offset:
offset = 0
while True:
new_events = ledger.tail(from_offset=offset)
for e in new_events:
observer.consume(e)
offset = ledger.latest_offset()
time.sleep(1)
```
This is the pattern for an external monitoring process that reads the ledger
without being in-process with the conductor. Useful for:
- A separate UI process
- A dashboard that watches a running scenario
- Post-hoc analysis of a completed run
---
## Phase 3 milestone: Postgres upgrade
The `Ledger` interface is the abstraction. Swapping SQLite for Postgres is:
1. Implement `PostgresLedger(Ledger)` using `psycopg2` + `LISTEN/NOTIFY`
2. Add `pgvector` column for embedding-based episodic retrieval
3. Use Postgres `SERIAL` for ordering, `UNIQUE(id)` for idempotency β€” same logic
4. Factory reads `DATABASE_URL` env var, returns the right backend
Zero scenario or agent changes.
---
## Snapshot schedule (recommended)
For a scenario running 100 turns/hour:
- Take a snapshot every 100 turns
- Keep the last 3 snapshots
- On crash, restore from the latest snapshot and replay the tail
The `scripts/` directory will grow a `snapshot_and_prune.py` in Phase 3.