A newer version of the Gradio SDK is available: 6.19.0
Persistence β SQLite Ledger, Snapshot, and Replay
Why persistence matters for long-running agents
An interactive demo that runs for 5 minutes can live entirely in memory. A village that simulates a day over 8 wall-clock hours cannot:
- The process will be restarted
- The machine will sleep
- The user will want to resume yesterday's story
The fix: make the checkpoint the ledger. The ledger is already the source of truth. Persisting it is the only persistence you need.
SQLiteLedger
Drop-in replacement for the in-memory Ledger. Same API, same idempotency,
plus durable storage:
# In-memory (default, good for tests and short demos)
ledger = Ledger()
# Persistent (for long-running scenarios)
ledger = SQLiteLedger("runs/my-village.db")
# Pass to conductor
conductor = Conductor(scenario, ledger=ledger)
Schema
CREATE TABLE events (
offset INTEGER PRIMARY KEY AUTOINCREMENT, -- ordering
id TEXT UNIQUE NOT NULL, -- idempotency key
run_id TEXT NOT NULL,
turn INTEGER NOT NULL,
kind TEXT NOT NULL,
actor TEXT NOT NULL,
payload TEXT NOT NULL, -- JSON
created_at TEXT NOT NULL, -- ISO-8601
schema_version INTEGER NOT NULL DEFAULT 1
);
CREATE INDEX idx_run_id ON events(run_id);
CREATE INDEX idx_kind ON events(kind);
CREATE INDEX idx_actor ON events(actor);
Key decisions:
offsetis the ordering guarantee β notcreated_at(clock skew) orturn(duplicate turns on retry)UNIQUE(id)is the idempotency guarantee β retried workers can't double-writepayloadstored as JSON text β readable without schema migration for most changes- WAL mode + NORMAL synchronous β safe on crash, fast on write
Idempotency
Every event has a UUID id. The ledger's append() is idempotent:
- In-memory: deduplication via
_seen_idsset - SQLite:
INSERT OR IGNORE(actuallytry/except IntegrityErroronUNIQUE)
A retried conductor step emits the same events with the same IDs β safe to replay.
This is why uuid4() in the event constructor is the right default (not a sequence number).
Snapshot and replay
Taking a snapshot
ledger = SQLiteLedger("village.db")
# ... run for N turns ...
ledger.snapshot_to("village-snapshot-turn-100.db")
SQLite's .backup() API is atomic and zero-copy. The snapshot is a valid,
fully-readable database file β no special restore tooling needed.
Restoring from a snapshot
ledger = SQLiteLedger.from_file("village-snapshot-turn-100.db")
conductor = Conductor(scenario, ledger=ledger)
# conductor.turn must be set from the max turn in the ledger
conductor.turn = max(e.turn for e in ledger.events)
# now step() continues from where the snapshot left off
Crash recovery
After a crash, the ledger file contains all events that were committed. Events that were computed but not yet written are lost β but idempotency means the next retry will recompute them with the same IDs and they will be deduplicated.
The recovery pattern:
- Open the existing ledger file (not reset).
- Rebuild projections from the full event log.
- Set
conductor.turnto the max turn in the log. - Resume
step()from there.
Tail replay for long-running monitoring
# Poll for new events since the last seen offset:
offset = 0
while True:
new_events = ledger.tail(from_offset=offset)
for e in new_events:
observer.consume(e)
offset = ledger.latest_offset()
time.sleep(1)
This is the pattern for an external monitoring process that reads the ledger without being in-process with the conductor. Useful for:
- A separate UI process
- A dashboard that watches a running scenario
- Post-hoc analysis of a completed run
Phase 3 milestone: Postgres upgrade
The Ledger interface is the abstraction. Swapping SQLite for Postgres is:
- Implement
PostgresLedger(Ledger)usingpsycopg2+LISTEN/NOTIFY - Add
pgvectorcolumn for embedding-based episodic retrieval - Use Postgres
SERIALfor ordering,UNIQUE(id)for idempotency β same logic - Factory reads
DATABASE_URLenv var, returns the right backend
Zero scenario or agent changes.
Snapshot schedule (recommended)
For a scenario running 100 turns/hour:
- Take a snapshot every 100 turns
- Keep the last 3 snapshots
- On crash, restore from the latest snapshot and replay the tail
The scripts/ directory will grow a snapshot_and_prune.py in Phase 3.