# Persistence Design (HF `/data` now, Railway later)
This document describes a storage design that enables:
- **Run history** for all three modes (AI↔AI, Human↔AI, Upload Text): chronological list → click → **replay read-only**.
- **Persona CRUD** (create/update/delete) with **versioning**, while preserving historical run replay.
- **Portability** across deployment targets:
- Hugging Face Spaces: persistent filesystem under `/data`
- Future: Railway (likely Postgres), without rewriting app/business logic
This is intentionally **implementation-agnostic** at the interface level and **implementation-specific** only at the backend adapter level (SQLite first).
---
## Goals
1. **Durable run replay**
- After a run ends, the transcript + analysis outputs are persisted and can be reloaded after restart/redeploy.
- Reloaded runs are **read-only**: no WebSocket streaming, no “resume”, no edits.
2. **Stable historical fidelity**
- Reloading a run shows the same transcript and analysis output that was originally produced.
- Runs do **not drift** if personas/system prompts are edited later.
3. **One shared history**
- This app is currently used by a team evaluating the tool, so history is **global/shared**, not per-user.
4. **Provider portability**
- HF → Railway should be a storage-backend swap (SQLite-on-volume → Postgres), not a rewrite.
---
## Non-goals (for the first iteration)
- Live persistence during streaming (we can add later). For now: **persist at end-of-run only**.
- “Resume” or “continue” a prior run.
- Fine-grained multi-user auth and per-user histories.
- Full-text search, tagging, sharing links, etc.
---
## What “persistent storage” means in this project
The current runtime holds critical data in memory during a session:
- Transcript messages: `ConversationAI/backend/api/conversation_service.py` maintains `self.transcripts[conversation_id]`.
- Analysis results: after completion, `_run_resource_agent()` broadcasts `resource_agent_result` back to the UI.
Persistence means:
1. When a session ends as a **sealed run** (conversation finished and analysis succeeded), we write a **Run record** to durable storage:
- transcript (messages)
- analysis outputs (resource agent JSON, evidence catalog, schema versions)
- configuration snapshot (LLM backend/model/params, selected personas, and the effective shared settings used)
- persona snapshots for historical fidelity (see below)
2. We expose APIs to:
- list prior runs (chronological)
- fetch a specific run by `run_id`
3. The UI can then “rehydrate” a panel from those persisted artifacts and render it read-only.
---
## Key design principles
### 1) Runs are immutable once sealed
We treat a “Run” as a record of what happened.
- A run can transition from “active” → “sealed”
- Once sealed, it is **read-only** for all consumers
- The UI is allowed to render it, export it, and inspect it
This makes the system easy to reason about and prevents accidental data drift.
### 2) Personas are mutable, but runs never drift
Personas and system prompts are editable over time (CRUD).
However, old runs must still open exactly as they were.
To guarantee this:
- We store a **persona snapshot** inside each run (the persona content used at runtime).
- Optionally, we also store a reference to the persona ID/version that the snapshot came from.
### 3) Store blobs as JSON, but keep query fields as columns
For history lists and basic filtering, we want queryable columns (mode, status, timestamps).
For richer data (config, analysis output, persona content), JSON is fine and reduces schema churn.
### 4) Storage adapter boundary (portability)
All application code should talk to a small interface (conceptually):
- `RunStore`
- `save_sealed_run(run_record)`
- `list_runs(mode?, limit?, offset?)`
- `get_run(run_id)`
- `PersonaStore`
- `list_personas(kind?, include_deleted?)`
- `get_persona(persona_id, version?)`
- `create_persona(payload)`
- `update_persona(persona_id, payload)` (creates a new version)
- `delete_persona(persona_id)` (soft delete)
HF and Railway differ only in the **implementation** of these interfaces.
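In Python, these interfaces could be expressed as `typing.Protocol`s. A sketch, assuming payloads and records are plain dicts (method names follow the list above; exact signatures are illustrative):

```python
from typing import Any, Optional, Protocol


class RunStore(Protocol):
    """Backend-agnostic run persistence; implemented by SQLite now, Postgres later."""

    def save_sealed_run(self, run_record: dict[str, Any]) -> None: ...
    def list_runs(self, mode: Optional[str] = None,
                  limit: int = 50, offset: int = 0) -> list[dict[str, Any]]: ...
    def get_run(self, run_id: str) -> Optional[dict[str, Any]]: ...


class PersonaStore(Protocol):
    """Backend-agnostic persona CRUD with append-only versioning."""

    def list_personas(self, kind: Optional[str] = None,
                      include_deleted: bool = False) -> list[dict[str, Any]]: ...
    def get_persona(self, persona_id: str,
                    version: Optional[str] = None) -> Optional[dict[str, Any]]: ...
    def create_persona(self, payload: dict[str, Any]) -> str: ...
    def update_persona(self, persona_id: str, payload: dict[str, Any]) -> str: ...
    def delete_persona(self, persona_id: str) -> None: ...
```

Because these are structural protocols, the SQLite and Postgres adapters need no shared base class; any object with matching methods satisfies the interface.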
---
## Storage backend choice: SQLite-first
SQLite is selected as the first implementation because it provides:
- Atomic writes and durability
- Indexes for fast run listing
- Natural support for persona versioning
- A smooth conceptual migration path to Postgres later
### Database location
- Hugging Face Spaces: `DB_PATH=/data/converta/converta.db`
- Local dev fallback: `DB_PATH=ConversationAI/.localdata/converta.db` (or similar)
The storage module should create parent directories if missing.
Important: the `DB_PATH` environment variable is the canonical source of truth for the storage location. Any legacy config values (e.g. the SQLite path in `config/default_config.yaml`) should be treated as non-authoritative for persistence.
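A minimal sketch of that resolution logic, assuming the env-var convention above (the helper name `resolve_db_path` is hypothetical):

```python
import os
from pathlib import Path

# Local-dev fallback; on HF Spaces DB_PATH=/data/converta/converta.db overrides it.
DEFAULT_DB_PATH = "ConversationAI/.localdata/converta.db"


def resolve_db_path() -> Path:
    """Return the SQLite path from DB_PATH (env), creating parent dirs if missing."""
    db_path = Path(os.environ.get("DB_PATH", DEFAULT_DB_PATH))
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return db_path
```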
---
## Proposed SQLite schema (v1)
### `runs` — top-level history entries
- `run_id TEXT PRIMARY KEY` (UUID)
- `mode TEXT NOT NULL` (`ai_to_ai|human_to_ai|text_analysis`)
- `status TEXT NOT NULL` (`completed` in v1; reserve `aborted|error` for later)
- `created_at TEXT NOT NULL` (ISO timestamp)
- `ended_at TEXT NOT NULL` (ISO timestamp)
- `title TEXT` (optional)
- `input_summary TEXT` (optional: filename/source label for text analysis)
- `config_json TEXT NOT NULL` (JSON blob)
- `sealed_at TEXT NOT NULL` (ISO timestamp; equals `ended_at` in v1)
Indexes:
- `INDEX runs_mode_created_at ON runs(mode, created_at DESC)`
- `INDEX runs_created_at ON runs(created_at DESC)`
### `run_messages` — transcripts
- `run_id TEXT NOT NULL` (FK → `runs.run_id`)
- `message_index INTEGER NOT NULL`
- `role TEXT NOT NULL`
- `persona_label TEXT` (optional)
- `content TEXT NOT NULL`
- `timestamp TEXT` (ISO timestamp)
Primary key:
- `(run_id, message_index)`
### `run_analyses` — analysis outputs per run
- `run_id TEXT NOT NULL` (FK)
- `analysis_key TEXT NOT NULL` (e.g. `resource_agent_v2`)
- `schema_version TEXT`
- `prompt_version TEXT`
- `result_json TEXT NOT NULL` (full JSON blob, including `evidence_catalog`)
Primary key:
- `(run_id, analysis_key)`
### `personas` — stable identity and lifecycle
- `persona_id TEXT PRIMARY KEY` (UUID)
- `kind TEXT NOT NULL` (`surveyor|patient`)
- `name TEXT NOT NULL`
- `is_deleted INTEGER NOT NULL DEFAULT 0`
- `created_at TEXT NOT NULL`
- `updated_at TEXT NOT NULL`
### `persona_versions` — append-only versions
- `persona_id TEXT NOT NULL` (FK)
- `version_id TEXT NOT NULL` (UUID)
- `created_at TEXT NOT NULL`
- `content_json TEXT NOT NULL` (persona definition + system prompt)
Primary key:
- `(persona_id, version_id)`
### `run_persona_snapshots` — prevent drift
- `run_id TEXT NOT NULL` (FK)
- `role TEXT NOT NULL` (`surveyor|patient`)
- `persona_id TEXT` (nullable)
- `persona_version_id TEXT` (nullable)
- `snapshot_json TEXT NOT NULL`
Primary key:
- `(run_id, role)`
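The schema above could be initialized with DDL along these lines, embedded in Python and applied via `sqlite3.executescript` (`IF NOT EXISTS` keeps it idempotent on startup; this is a sketch, not a migration system):

```python
import sqlite3

SCHEMA_V1 = """
CREATE TABLE IF NOT EXISTS runs (
    run_id        TEXT PRIMARY KEY,
    mode          TEXT NOT NULL,
    status        TEXT NOT NULL,
    created_at    TEXT NOT NULL,
    ended_at      TEXT NOT NULL,
    title         TEXT,
    input_summary TEXT,
    config_json   TEXT NOT NULL,
    sealed_at     TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS runs_mode_created_at ON runs(mode, created_at DESC);
CREATE INDEX IF NOT EXISTS runs_created_at ON runs(created_at DESC);

CREATE TABLE IF NOT EXISTS run_messages (
    run_id        TEXT NOT NULL REFERENCES runs(run_id),
    message_index INTEGER NOT NULL,
    role          TEXT NOT NULL,
    persona_label TEXT,
    content       TEXT NOT NULL,
    timestamp     TEXT,
    PRIMARY KEY (run_id, message_index)
);

CREATE TABLE IF NOT EXISTS run_analyses (
    run_id         TEXT NOT NULL REFERENCES runs(run_id),
    analysis_key   TEXT NOT NULL,
    schema_version TEXT,
    prompt_version TEXT,
    result_json    TEXT NOT NULL,
    PRIMARY KEY (run_id, analysis_key)
);

CREATE TABLE IF NOT EXISTS personas (
    persona_id TEXT PRIMARY KEY,
    kind       TEXT NOT NULL,
    name       TEXT NOT NULL,
    is_deleted INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS persona_versions (
    persona_id   TEXT NOT NULL REFERENCES personas(persona_id),
    version_id   TEXT NOT NULL,
    created_at   TEXT NOT NULL,
    content_json TEXT NOT NULL,
    PRIMARY KEY (persona_id, version_id)
);

CREATE TABLE IF NOT EXISTS run_persona_snapshots (
    run_id             TEXT NOT NULL REFERENCES runs(run_id),
    role               TEXT NOT NULL,
    persona_id         TEXT,
    persona_version_id TEXT,
    snapshot_json      TEXT NOT NULL,
    PRIMARY KEY (run_id, role)
);
"""


def init_schema(conn: sqlite3.Connection) -> None:
    """Create all v1 tables and indexes if they do not already exist."""
    conn.executescript(SCHEMA_V1)
```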
---
## What gets stored in `config_json` (recommended)
`config_json` should allow exact replay and debugging:
- LLM settings:
- `llm_backend`, `host`, `model`
- `timeout`, `max_retries`, `retry_delay`
- any generation params used (temperature, max_tokens, top_p, etc.)
- Mode-specific:
- AI↔AI: surveyor/patient persona IDs (and persona version ids / snapshots)
- Human↔AI: same + human mode flags
- Text analysis: `source_name`, optional file metadata (original filename, sha256)
- App versions:
- `analysis_prompt_version` and `schema_version` (duplicated in `run_analyses` is fine)
- optional git commit SHA (if available at runtime)
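A hypothetical `config_json` blob for an AI↔AI run might look like this (all values are illustrative placeholders, not the app's actual defaults):

```json
{
  "llm_backend": "ollama",
  "host": "http://localhost:11434",
  "model": "example-model",
  "timeout": 120,
  "max_retries": 3,
  "retry_delay": 2,
  "generation": { "temperature": 0.7, "max_tokens": 1024, "top_p": 0.9 },
  "mode": "ai_to_ai",
  "personas": {
    "surveyor": { "persona_id": "…", "persona_version_id": "…" },
    "patient": { "persona_id": "…", "persona_version_id": "…" }
  },
  "analysis_prompt_version": "v2",
  "schema_version": "2.0",
  "git_commit": null
}
```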
---
## Integration points (where persistence is hooked in)
This section maps “what to save” to the current code.
### 1) AI↔AI and Human↔AI runs
Source of truth today:
- Transcript: `ConversationAI/backend/api/conversation_service.py` (`self.transcripts[conversation_id]`)
- Analysis: `_run_resource_agent(conversation_id)` broadcasts `resource_agent_result`
End-of-run save flow (conceptual):
1. Run completes (or human chat ends)
2. Resource agent analysis completes successfully
3. Build a `RunRecord`:
- `run_id` (recommended: generate a fresh UUID distinct from the WebSocket `conversation_id`; reusing the `conversation_id` also works)
- `mode`, `status`, timestamps
- `messages[]` from `self.transcripts`
- `analyses["resource_agent_v2"]` from the parsed JSON
- `persona snapshots` for surveyor/patient content actually used
4. `RunStore.save_sealed_run(run_record)` writes the run to SQLite in a transaction.
5. Memory cleanup proceeds as today.
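Steps 3–4 above could be sketched as follows (assuming the `RunStore` interface; field names follow the v1 schema, and the `started_at` key inside `config` is a hypothetical convention for carrying the run's start time):

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_record(mode: str, transcript: list[dict], analysis: dict,
                     config: dict, persona_snapshots: dict) -> dict:
    """Assemble a sealed RunRecord from in-memory session state."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "run_id": str(uuid.uuid4()),  # fresh id, distinct from the WS conversation_id
        "mode": mode,
        "status": "completed",
        "created_at": config.get("started_at", now),
        "ended_at": now,
        "sealed_at": now,  # equals ended_at in v1
        "config_json": json.dumps(config),
        "messages": transcript,
        "analyses": {"resource_agent_v2": analysis},
        "persona_snapshots": persona_snapshots,
    }
```

The record would then be handed to `RunStore.save_sealed_run(...)`, which writes all tables in one transaction.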
Notes:
- Because we’re “end-only”, if the process dies mid-run, that run is lost. This is accepted for v1.
- If a run is stopped/aborted (e.g. user presses Stop in AI↔AI), it is not a sealed run and is not persisted in v1.
### 2) Text analysis (“Upload Text”)
Source of truth today:
- Transcript is derived by parsing uploaded/pasted text into message-like units.
- Resource agent analysis is run and returned.
End-of-analysis save flow:
- Store as `mode=text_analysis` with `messages[]` using role `transcript` (or the derived roles if present).
- Store analysis output the same way as live conversations.
- Because Upload Text is analysis-driven, “sealed” effectively means “analysis succeeded”; if analysis fails, do not persist.
---
## API design for history and persona CRUD (v1)
### Runs
- `GET /api/runs?mode=ai_to_ai&limit=50&offset=0`
- Returns run summaries: `run_id`, `mode`, `status`, `created_at`, `ended_at`, `title`, `input_summary`
- `GET /api/runs/{run_id}`
- Returns the full run record: transcript + analysis JSON + config snapshot
### Personas
- `GET /api/personas`
- `POST /api/personas`
- `PUT /api/personas/{persona_id}` (creates a new version)
- `DELETE /api/personas/{persona_id}` (soft delete)
- (optional) `GET /api/personas/{persona_id}/versions`
Defaults:
- Default personas are seeded by the backend on startup.
- The DB is the runtime source of truth for both defaults and user-created personas.
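Since `PUT` must create a new version rather than overwrite, the SQLite side of `update_persona` might look like this (a sketch against the v1 schema; validation and error handling omitted):

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def update_persona(conn: sqlite3.Connection, persona_id: str, payload: dict) -> str:
    """Append a new persona version and touch the parent row; old versions are kept."""
    now = datetime.now(timezone.utc).isoformat()
    version_id = str(uuid.uuid4())
    with conn:  # single transaction: version insert + parent update
        conn.execute(
            "INSERT INTO persona_versions (persona_id, version_id, created_at, content_json) "
            "VALUES (?, ?, ?, ?)",
            (persona_id, version_id, now, json.dumps(payload)),
        )
        conn.execute(
            "UPDATE personas SET name = ?, updated_at = ? WHERE persona_id = ?",
            (payload.get("name", ""), now, persona_id),
        )
    return version_id
```

Old runs remain stable because they carry their own snapshots in `run_persona_snapshots`; the new version only affects future runs.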
---
## UI behavior for read-only history
The UI should treat historical sessions as “render only”:
- No WebSocket connection
- Start/stop buttons disabled
- Transcript and analysis panels populated from `GET /api/runs/{run_id}`
- Export actions should be backed by server-canonical run data (avoid treating client-hydrated payloads as the source of truth)
Recommended UI structure:
- Add a “History” view per mode (or a unified history with filters)
- Clicking a run loads it into that panel, sets a `readOnly=true` flag, and renders accordingly
---
## Migration plan: SQLite → Postgres (Railway)
### What stays the same
- The `RunStore` / `PersonaStore` interface
- The external API endpoints and payload shapes
- The UI behavior (list + get + render)
- The logical schema (tables/entities)
### What changes
- Replace `SQLiteRunStore/SQLitePersonaStore` with `PostgresRunStore/PostgresPersonaStore`
- DB connection config:
- `DATABASE_URL` (Railway Postgres)
- migrations managed via Alembic (recommended for Postgres)
### Suggested migration path
1. Introduce the store interface now and keep it backend-agnostic.
2. Implement SQLite store now (HF + local).
3. When ready for Railway:
- Implement Postgres store behind the same interface
- Add migrations
- Add a one-time “export/import” command:
- export all SQLite runs/personas to JSON
- import into Postgres
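The export half of that one-time command could be sketched as follows (table names from the v1 schema; the import side would walk the same JSON and `INSERT` row-by-row into Postgres):

```python
import json
import sqlite3

TABLES = ["runs", "run_messages", "run_analyses",
          "personas", "persona_versions", "run_persona_snapshots"]


def export_sqlite_to_json(db_path: str, out_path: str) -> None:
    """Dump every v1 table to one JSON file, keyed by table name."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become dict-convertible
    dump = {t: [dict(row) for row in conn.execute(f"SELECT * FROM {t}")]
            for t in TABLES}
    conn.close()
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(dump, f, indent=2)
```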
---
## Verification checklist (for v1)
1. Run an AI↔AI session to completion and confirm a row appears in `runs` with `status=completed`.
2. Reload history via API and confirm:
- transcript message count and ordering match the original UI
- analysis boxes match the original output
3. Stop an AI↔AI session mid-run and confirm it does **not** appear in history (not sealed → not persisted).
4. Export a completed run via API and confirm the file downloads from `/api/runs/{run_id}/export/*` (server-canonical).
5. Restart the container/app and confirm history remains available.