Implement Version 2 Risk-Reduced Migration Plan With Cloud Coding Agent
Objective
Implement the risk-reduced Version 2 migration plan in version2_plan.md using a cloud coding agent.
Source of Truth
- Plan file: version2_plan.md
- Repository: kmkarakaya/openLLMbenchmark
What to Implement
- Phase 0: Baseline freeze + observability
- Create baseline fixtures for results.json and results.md behavior on fixed test inputs.
- Add shared correlation/log fields (trace_id, run_id, session_id, dataset_key, question_id, model, event, elapsed_ms).
- Lock API schema/version compatibility (v1).
- Phase 1: Thin API (shadow mode first, writes later)
- Extract orchestration logic from UI to service layer without behavior changes.
- Implement API contracts defined in plan (/health, /models, /datasets, /questions, /results, /runs, /runs/{id}/events, /runs/{id}/stop, /results/manual).
- Keep read endpoints enabled first.
- Enable run endpoints in shadow mode.
- Enable write endpoints only after parity gates pass.
- Phase 2: Next.js MVP UI + gradual rollout
- Build MVP against API: model/dataset selection, run/stop, streaming responses, manual decisions, metrics/matrix.
- Canary internal rollout, then expansion after stability criteria.
- Phase 2b: Full parity + Streamlit deprecation gate
- Add dataset upload/delete, JSON/Excel export, metadata stats/charts, response render parity.
- Decommission Streamlit only after all parity and stability checks pass.
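As an illustration of the Phase 0 correlation/log fields, the following is a minimal sketch of a structured log line that carries all eight fields. The field names come from the plan; the `LogContext` class, the `log_event` helper, the JSON-lines format, and the sample values are assumptions made for this example, not part of the existing codebase.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LogContext:
    """Correlation fields shared by every log line (names from Phase 0)."""
    trace_id: str
    run_id: str
    session_id: str
    dataset_key: str
    question_id: str
    model: str


def log_event(ctx: LogContext, event: str, started_at: float) -> str:
    """Return one JSON log line with the shared fields plus event and elapsed_ms."""
    record = asdict(ctx)
    record["event"] = event
    record["elapsed_ms"] = round((time.monotonic() - started_at) * 1000)
    return json.dumps(record, sort_keys=True)


# Usage with hypothetical values.
started = time.monotonic()
ctx = LogContext(
    trace_id=uuid.uuid4().hex,
    run_id="run-1",
    session_id="session-1",
    dataset_key="demo",
    question_id="q1",
    model="example-model",
)
line = log_event(ctx, "question_started", started)
```

Emitting the same field set from Streamlit, the service layer, and the API should make it possible to join log lines across processes by trace_id and run_id during shadow mode.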
Required Endpoint Scope (Full Parity Target)
Implement these endpoints as explicit deliverables (not implied work):
- GET /health
- GET /models
- GET /datasets
- GET /datasets/template (download benchmark JSON template)
- POST /datasets/upload (multipart file upload, validation, save)
- DELETE /datasets/{dataset_key} (uploaded datasets only)
- GET /questions?dataset_key=...
- GET /results?dataset_key=...
- GET /results/export?dataset_key=...&format=json|xlsx
- PATCH /results/manual
- POST /runs
- GET /runs/{run_id}/events (SSE)
- GET /runs/{run_id}/status (reconnect/snapshot state)
- POST /runs/{run_id}/stop
Notes:
- Endpoint behavior must preserve existing storage/scoring semantics.
- Export endpoint must match current JSON/Excel output compatibility.
- GET /runs/{run_id}/status is required for robust UI refresh/reconnect behavior.
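To make the relationship between the events stream and the status snapshot concrete, here is a minimal, framework-free sketch. It serializes one event in the standard Server-Sent Events wire format and selects the events a reconnecting client has not yet seen. The function names, the `chunk` event name, and the event dictionary shape are hypothetical; the plan file defines the real contract.

```python
import json
from typing import Dict, List


def format_sse(event: str, data: Dict, event_id: int) -> str:
    """Serialize one event as an SSE frame (id, event, data, then a blank line)."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"id: {event_id}\nevent: {event}\ndata: {payload}\n\n"


def events_after(events: List[Dict], last_event_id: int) -> List[Dict]:
    """Return the events a reconnecting client still needs, oldest first."""
    return [e for e in events if e["id"] > last_event_id]


# Usage with a hypothetical in-memory event log for one run.
run_events = [
    {"id": 1, "event": "chunk", "data": {"text": "Hel"}},
    {"id": 2, "event": "chunk", "data": {"text": "lo"}},
    {"id": 3, "event": "done", "data": {}},
]
frame = format_sse("chunk", {"text": "hi"}, 3)
missed = events_after(run_events, last_event_id=1)
```

A UI that refreshes mid-run could first call the status endpoint for a snapshot and then resume the stream from the last event id it holds, which is one way to satisfy the reconnect requirement.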
Mandatory Pre-Implementation Locks (Do Not Skip)
- Single-writer policy
- While FEATURE_API_WRITES=false: Streamlit is sole writer.
- When FEATURE_API_WRITES=true: API is sole writer; Streamlit writes disabled/routed.
- Use inter-process file locks for persistence paths.
- SSE rollback thresholds (15-minute rolling window)
- Disable FEATURE_API_RUNS if any:
- SSE disconnect/error rate > 1%
- Run completion success rate < 99%
- P95 chunk gap > 2s
- Delivery prerequisites
- Add backend/frontend dependencies, CI gates, and environment contract exactly as defined in the plan.
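The SSE rollback thresholds above can be expressed as a small pure function, sketched below. The three limits (1%, 99%, 2s) come from this brief. The `WindowStats` fields and the `rollback_reasons` name are assumptions, and collecting the 15-minute rolling aggregates is out of scope for the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WindowStats:
    """Aggregates over the 15-minute rolling window (collection not shown)."""
    sse_connections: int
    sse_disconnects_or_errors: int
    runs_finished: int
    runs_succeeded: int
    p95_chunk_gap_s: float


def rollback_reasons(stats: WindowStats) -> List[str]:
    """Return the breached thresholds; any entry means FEATURE_API_RUNS goes off."""
    reasons = []
    if stats.sse_connections and (
        stats.sse_disconnects_or_errors / stats.sse_connections > 0.01
    ):
        reasons.append("sse_error_rate")
    if stats.runs_finished and stats.runs_succeeded / stats.runs_finished < 0.99:
        reasons.append("run_success_rate")
    if stats.p95_chunk_gap_s > 2.0:
        reasons.append("p95_chunk_gap")
    return reasons


# Usage with made-up numbers.
healthy = rollback_reasons(WindowStats(1000, 5, 200, 199, 1.2))
degraded = rollback_reasons(WindowStats(1000, 20, 200, 190, 2.5))
```

Returning the list of breached thresholds rather than a bare boolean gives the rollback runbook something specific to record when the flag is disabled. The sketch treats an empty window (no connections or no finished runs) as not breaching, which is a choice the plan may want to state explicitly.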
Feature Flags
- FEATURE_API_READS
- FEATURE_API_RUNS
- FEATURE_API_WRITES
- FEATURE_NEW_UI
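The sketch below shows one way the flags could drive the single-writer policy, together with a simple inter-process lock around a persistence path. It assumes the flags are environment variables holding the strings true or false and that an unset flag means false. The `flag_enabled`, `active_writer`, and `exclusive_lock` helpers and the lock-file naming are hypothetical. The lock relies on atomic exclusive file creation and does not recover stale locks left by a crashed process, so a production version would need that handling or a dedicated locking library.

```python
import contextlib
import os
import tempfile
import time
from typing import Mapping


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Assumption: only the string 'true' (any case) enables a flag."""
    return env.get(name, "false").strip().lower() == "true"


def active_writer(env: Mapping[str, str] = os.environ) -> str:
    """Single-writer policy: exactly one of 'api' or 'streamlit' may write."""
    return "api" if flag_enabled("FEATURE_API_WRITES", env) else "streamlit"


@contextlib.contextmanager
def exclusive_lock(lock_path: str, timeout_s: float = 5.0, poll_s: float = 0.05):
    """Hold an inter-process lock by atomically creating a lock file."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_EXCL makes creation fail if another process holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_s)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


# Usage: guard a write to a hypothetical results file.
with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "results.json.lock")
    with exclusive_lock(lock_path):
        held = os.path.exists(lock_path)
    released = not os.path.exists(lock_path)
```

Deriving the writer from a single function keeps Streamlit and the API from both concluding that they own the persistence paths.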
Acceptance Criteria
- Golden-master parity tests pass between existing Streamlit behavior and service/API output.
- API contract/integration/SSE tests pass.
- Next.js MVP E2E flow passes.
- 0 critical and 0 high defects during one-week internal canary.
- At least 3 benchmark cycles with no data compatibility regression.
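A golden-master parity check of the kind named in the first criterion might look like the sketch below. It compares a frozen baseline against the service/API output and reports which top-level keys differ. The record shape is invented for illustration, and the sketch assumes that key order inside results.json is not significant; if the existing artifacts depend on ordering, the normalization step would need to change.

```python
import json
from typing import Any, Dict, List

_MISSING = object()


def normalize(value: Any) -> str:
    """Canonical JSON text so that key order does not register as a difference."""
    return json.dumps(value, sort_keys=True, ensure_ascii=False)


def parity_diff(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> List[str]:
    """Return the top-level keys that are missing on one side or differ in value."""
    diffs = []
    for key in sorted(set(baseline) | set(candidate)):
        left = baseline.get(key, _MISSING)
        right = candidate.get(key, _MISSING)
        if left is _MISSING or right is _MISSING or normalize(left) != normalize(right):
            diffs.append(key)
    return diffs


# Usage with hypothetical records keyed by question id.
golden = {"q1": {"model": "example-model", "status": "success"}, "q2": {"status": "fail"}}
same = {"q1": {"status": "success", "model": "example-model"}, "q2": {"status": "fail"}}
drifted = {"q1": {"model": "example-model", "status": "fail"}, "q3": {"status": "fail"}}

no_diff = parity_diff(golden, same)
found = parity_diff(golden, drifted)
```

An empty diff list is the evidence a gate would attach; a non-empty list names the records to investigate before the next flag is enabled.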
Constraints
- Preserve existing storage/scoring semantics.
- No breaking changes to result artifacts unless explicitly approved.
- Keep migration backward-compatible until cutover gates are met.
Deliverables
- PR(s) implementing phases with clear checkpoints.
- Test evidence for each gate.
- Rollback runbook tied to feature flags.
Please execute phase-by-phase and stop at each gate with evidence before enabling the next flag.