# Persistence — SQLite Ledger, Snapshot, and Replay

## Why persistence matters for long-running agents

An interactive demo that runs for 5 minutes can live entirely in memory.
A village that simulates a day over 8 wall-clock hours cannot:
- The process will be restarted
- The machine will sleep
- The user will want to resume yesterday's story

The fix: make the checkpoint the ledger.  The ledger is already the source of truth.
Persisting it is the only persistence you need.

---

## SQLiteLedger

Drop-in replacement for the in-memory `Ledger`.  Same API, same idempotency,
plus durable storage:

```python
# In-memory (default, good for tests and short demos)
ledger = Ledger()

# Persistent (for long-running scenarios)
ledger = SQLiteLedger("runs/my-village.db")

# Pass to conductor
conductor = Conductor(scenario, ledger=ledger)
```

### Schema

```sql
CREATE TABLE events (
    offset         INTEGER PRIMARY KEY AUTOINCREMENT,  -- ordering
    id             TEXT UNIQUE NOT NULL,               -- idempotency key
    run_id         TEXT NOT NULL,
    turn           INTEGER NOT NULL,
    kind           TEXT NOT NULL,
    actor          TEXT NOT NULL,
    payload        TEXT NOT NULL,                      -- JSON
    created_at     TEXT NOT NULL,                      -- ISO-8601
    schema_version INTEGER NOT NULL DEFAULT 1
);
CREATE INDEX idx_run_id ON events(run_id);
CREATE INDEX idx_kind   ON events(kind);
CREATE INDEX idx_actor  ON events(actor);
```

**Key decisions**:
- `offset` is the ordering guarantee — not `created_at` (clock skew) or `turn` (duplicate turns on retry)
- `UNIQUE(id)` is the idempotency guarantee — retried workers can't double-write
- `payload` stored as JSON text — readable without schema migration for most changes
- WAL mode + NORMAL synchronous — safe on crash, fast on write

### Idempotency

Every event has a UUID `id`.  The ledger's `append()` is idempotent:
- In-memory: deduplication via `_seen_ids` set
- SQLite: `INSERT OR IGNORE` (actually `try/except IntegrityError` on `UNIQUE`)

A retried conductor step emits the same events with the same IDs → safe to replay.
This is why `uuid4()` in the event constructor is the right default (not a sequence number).

---

## Snapshot and replay

### Taking a snapshot

```python
ledger = SQLiteLedger("village.db")
# ... run for N turns ...
ledger.snapshot_to("village-snapshot-turn-100.db")
```

SQLite's `.backup()` API is atomic and zero-copy.  The snapshot is a valid,
fully-readable database file — no special restore tooling needed.

### Restoring from a snapshot

```python
ledger = SQLiteLedger.from_file("village-snapshot-turn-100.db")
conductor = Conductor(scenario, ledger=ledger)
# conductor.turn must be set from the max turn in the ledger
conductor.turn = max(e.turn for e in ledger.events)
# now step() continues from where the snapshot left off
```

### Crash recovery

After a crash, the ledger file contains all events that were committed.
Events that were computed but not yet written are lost — but idempotency means
the next retry will recompute them with the same IDs and they will be deduplicated.

The recovery pattern:
1. Open the existing ledger file (not reset).
2. Rebuild projections from the full event log.
3. Set `conductor.turn` to the max turn in the log.
4. Resume `step()` from there.

---

## Tail replay for long-running monitoring

```python
# Poll for new events since the last seen offset:
offset = 0
while True:
    new_events = ledger.tail(from_offset=offset)
    for e in new_events:
        observer.consume(e)
    offset = ledger.latest_offset()
    time.sleep(1)
```

This is the pattern for an external monitoring process that reads the ledger
without being in-process with the conductor.  Useful for:
- A separate UI process
- A dashboard that watches a running scenario
- Post-hoc analysis of a completed run

---

## Phase 3 milestone: Postgres upgrade

The `Ledger` interface is the abstraction.  Swapping SQLite for Postgres is:
1. Implement `PostgresLedger(Ledger)` using `psycopg2` + `LISTEN/NOTIFY`
2. Add `pgvector` column for embedding-based episodic retrieval
3. Use Postgres `SERIAL` for ordering, `UNIQUE(id)` for idempotency — same logic
4. Factory reads `DATABASE_URL` env var, returns the right backend

Zero scenario or agent changes.

---

## Snapshot schedule (recommended)

For a scenario running 100 turns/hour:
- Take a snapshot every 100 turns
- Keep the last 3 snapshots
- On crash, restore from the latest snapshot and replay the tail

The `scripts/` directory will grow a `snapshot_and_prune.py` in Phase 3.