# Implement Version 2 Risk-Reduced Migration Plan With Cloud Coding Agent ## Objective Implement the risk-reduced Version 2 migration plan in `version2_plan.md` using a cloud coding agent. ## Source of Truth - Plan file: `version2_plan.md` - Repository: `kmkarakaya/openLLMbenchmark` ## What to Implement 1. Phase 0: Baseline freeze + observability - Create baseline fixtures for `results.json` and `results.md` behavior on fixed test inputs. - Add shared correlation/log fields (`trace_id`, `run_id`, `session_id`, `dataset_key`, `question_id`, `model`, `event`, `elapsed_ms`). - Lock API schema/version compatibility (`v1`). 2. Phase 1: Thin API (shadow mode first, writes later) - Extract orchestration logic from UI to service layer without behavior changes. - Implement API contracts defined in plan (`/health`, `/models`, `/datasets`, `/questions`, `/results`, `/runs`, `/runs/{id}/events`, `/runs/{id}/stop`, `/results/manual`). - Keep read endpoints enabled first. - Enable run endpoints in shadow mode. - Enable write endpoints only after parity gates pass. 3. Phase 2: Next.js MVP UI + gradual rollout - Build MVP against API: model/dataset selection, run/stop, streaming responses, manual decisions, metrics/matrix. - Canary internal rollout, then expansion after stability criteria. 4. Phase 2b: Full parity + Streamlit deprecation gate - Add dataset upload/delete, JSON/Excel export, metadata stats/charts, response render parity. - Decommission Streamlit only after all parity and stability checks pass. ## Required Endpoint Scope (Full Parity Target) Implement these endpoints as explicit deliverables (not implied work): - `GET /health` - `GET /models` - `GET /datasets` - `GET /datasets/template` (download benchmark JSON template) - `POST /datasets/upload` (multipart file upload, validation, save) - `DELETE /datasets/{dataset_key}` (uploaded datasets only) - `GET /questions?dataset_key=...` - `GET /results?dataset_key=...` - `GET /results/export?dataset_key=...&format=json|xlsx` - `PATCH /results/manual` - `POST /runs` - `GET /runs/{run_id}/events` (SSE) - `GET /runs/{run_id}/status` (reconnect/snapshot state) - `POST /runs/{run_id}/stop` Notes: - Endpoint behavior must preserve existing storage/scoring semantics. - Export endpoint must match current JSON/Excel output compatibility. - `GET /runs/{run_id}/status` is required for robust UI refresh/reconnect behavior. ## Mandatory Pre-Implementation Locks (Do Not Skip) 1. Single-writer policy - While `FEATURE_API_WRITES=false`: Streamlit is sole writer. - When `FEATURE_API_WRITES=true`: API is sole writer; Streamlit writes disabled/routed. - Use inter-process file locks for persistence paths. 2. SSE rollback thresholds (15-minute rolling window) - Disable `FEATURE_API_RUNS` if any: - SSE disconnect/error rate > 1% - Run completion success rate < 99% - P95 chunk gap > 2s 3. Delivery prerequisites - Add backend/frontend dependencies, CI gates, and environment contract exactly as defined in the plan. ## Feature Flags - `FEATURE_API_READS` - `FEATURE_API_RUNS` - `FEATURE_API_WRITES` - `FEATURE_NEW_UI` ## Acceptance Criteria - Golden-master parity tests pass between existing Streamlit behavior and service/API output. - API contract/integration/SSE tests pass. - Next.js MVP E2E flow passes. - 0 critical and 0 high defects during one-week internal canary. - At least 3 benchmark cycles with no data compatibility regression. ## Constraints - Preserve existing storage/scoring semantics. - No breaking changes to result artifacts unless explicitly approved. - Keep migration backward-compatible until cutover gates are met. ## Deliverables - PR(s) implementing phases with clear checkpoints. - Test evidence for each gate. - Rollback runbook tied to feature flags. --- Please execute phase-by-phase and stop at each gate with evidence before enabling the next flag.