openLLMbenchmark / issue_cloud_agent_request.md
hf-space-deployer
HF Space deploy from main - 0b1e82967585f1407bf51086f2e5a962f178218a
371efe0

Implement Version 2 Risk-Reduced Migration Plan With Cloud Coding Agent

Objective

Implement the risk-reduced Version 2 migration plan in version2_plan.md using a cloud coding agent.

Source of Truth

  • Plan file: version2_plan.md
  • Repository: kmkarakaya/openLLMbenchmark

What to Implement

  1. Phase 0: Baseline freeze + observability
  • Create baseline fixtures for results.json and results.md behavior on fixed test inputs.
  • Add shared correlation/log fields (trace_id, run_id, session_id, dataset_key, question_id, model, event, elapsed_ms).
  • Lock API schema/version compatibility (v1).
  1. Phase 1: Thin API (shadow mode first, writes later)
  • Extract orchestration logic from UI to service layer without behavior changes.
  • Implement API contracts defined in plan (/health, /models, /datasets, /questions, /results, /runs, /runs/{id}/events, /runs/{id}/stop, /results/manual).
  • Keep read endpoints enabled first.
  • Enable run endpoints in shadow mode.
  • Enable write endpoints only after parity gates pass.
  1. Phase 2: Next.js MVP UI + gradual rollout
  • Build MVP against API: model/dataset selection, run/stop, streaming responses, manual decisions, metrics/matrix.
  • Canary internal rollout, then expansion after stability criteria.
  1. Phase 2b: Full parity + Streamlit deprecation gate
  • Add dataset upload/delete, JSON/Excel export, metadata stats/charts, response render parity.
  • Decommission Streamlit only after all parity and stability checks pass.

Required Endpoint Scope (Full Parity Target)

Implement these endpoints as explicit deliverables (not implied work):

  • GET /health
  • GET /models
  • GET /datasets
  • GET /datasets/template (download benchmark JSON template)
  • POST /datasets/upload (multipart file upload, validation, save)
  • DELETE /datasets/{dataset_key} (uploaded datasets only)
  • GET /questions?dataset_key=...
  • GET /results?dataset_key=...
  • GET /results/export?dataset_key=...&format=json|xlsx
  • PATCH /results/manual
  • POST /runs
  • GET /runs/{run_id}/events (SSE)
  • GET /runs/{run_id}/status (reconnect/snapshot state)
  • POST /runs/{run_id}/stop

Notes:

  • Endpoint behavior must preserve existing storage/scoring semantics.
  • Export endpoint must match current JSON/Excel output compatibility.
  • GET /runs/{run_id}/status is required for robust UI refresh/reconnect behavior.

Mandatory Pre-Implementation Locks (Do Not Skip)

  1. Single-writer policy
  • While FEATURE_API_WRITES=false: Streamlit is sole writer.
  • When FEATURE_API_WRITES=true: API is sole writer; Streamlit writes disabled/routed.
  • Use inter-process file locks for persistence paths.
  1. SSE rollback thresholds (15-minute rolling window)
  • Disable FEATURE_API_RUNS if any:
  • SSE disconnect/error rate > 1%
  • Run completion success rate < 99%
  • P95 chunk gap > 2s
  1. Delivery prerequisites
  • Add backend/frontend dependencies, CI gates, and environment contract exactly as defined in the plan.

Feature Flags

  • FEATURE_API_READS
  • FEATURE_API_RUNS
  • FEATURE_API_WRITES
  • FEATURE_NEW_UI

Acceptance Criteria

  • Golden-master parity tests pass between existing Streamlit behavior and service/API output.
  • API contract/integration/SSE tests pass.
  • Next.js MVP E2E flow passes.
  • 0 critical and 0 high defects during one-week internal canary.
  • At least 3 benchmark cycles with no data compatibility regression.

Constraints

  • Preserve existing storage/scoring semantics.
  • No breaking changes to result artifacts unless explicitly approved.
  • Keep migration backward-compatible until cutover gates are met.

Deliverables

  • PR(s) implementing phases with clear checkpoints.
  • Test evidence for each gate.
  • Rollback runbook tied to feature flags.

Please execute phase-by-phase and stop at each gate with evidence before enabling the next flag.