Implement Version 2 Risk-Reduced Migration Plan With Cloud Coding Agent

Implement the risk-reduced Version 2 migration plan in version2_plan.md using a cloud coding agent.

Create baseline fixtures for results.json and results.md behavior on fixed test inputs.
Add shared correlation/log fields (trace_id, run_id, session_id, dataset_key, question_id, model, event, elapsed_ms).
Lock API schema/version compatibility (v1).
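As a rough illustration of the shared correlation fields, the sketch below builds one JSON log record that always carries all eight field names listed above. Only the field names come from this issue; the helper `make_log_record`, the event name `question_started`, and the sample values are hypothetical.

```python
import json
import time
import uuid

# Field names are taken from the issue text; everything else is illustrative.
CORRELATION_FIELDS = (
    "trace_id", "run_id", "session_id", "dataset_key",
    "question_id", "model", "event", "elapsed_ms",
)


def make_log_record(event, started_at, **context):
    """Build one JSON-serializable record that carries every shared field."""
    # Fields the caller did not supply are kept as None so the shape is stable.
    record = {name: context.get(name) for name in CORRELATION_FIELDS}
    record["event"] = event
    record["elapsed_ms"] = round((time.monotonic() - started_at) * 1000)
    if record["trace_id"] is None:
        # Assumption: a trace id is generated when the caller has none yet.
        record["trace_id"] = uuid.uuid4().hex
    return record


started = time.monotonic()
line = json.dumps(
    make_log_record("question_started", started, run_id="run-1", model="example-model"),
    sort_keys=True,
)
print(line)
```

Keeping absent fields as explicit nulls means Streamlit-side and API-side logs can be joined on the same keys without per-source special cases.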

Extract orchestration logic from UI to service layer without behavior changes.
Implement API contracts defined in plan (/health, /models, /datasets, /questions, /results, /runs, /runs/{id}/events, /runs/{id}/stop, /results/manual).
Enable read endpoints first.
Enable run endpoints in shadow mode.
Enable write endpoints only after parity gates pass.
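The staged enablement above could be expressed as a small lookup from route to mode. This is a sketch under several assumptions: the grouping of routes into read, run, and write sets is a guess from the endpoint names; `FEATURE_API_RUNS_SHADOW` is a hypothetical flag name (only `FEATURE_API_WRITES` appears in this issue); and treating run endpoints as fully live once writes are enabled is also an assumption.

```python
# Route paths come from the issue; the grouping into sets is an assumption.
READ_ENDPOINTS = {"/health", "/models", "/datasets", "/questions", "/results"}
RUN_ENDPOINTS = {"/runs", "/runs/{id}/events", "/runs/{id}/stop"}
WRITE_ENDPOINTS = {"/results/manual"}


def is_true(env, name):
    """Read a boolean flag from an environment-like mapping."""
    return env.get(name, "false").strip().lower() == "true"


def endpoint_mode(route, env):
    """Return 'live', 'shadow', or 'off' for a route under the current flags."""
    if route in READ_ENDPOINTS:
        return "live"
    if route in RUN_ENDPOINTS:
        if is_true(env, "FEATURE_API_WRITES"):
            return "live"
        # Hypothetical flag: execute runs without persisting their results.
        return "shadow" if is_true(env, "FEATURE_API_RUNS_SHADOW") else "off"
    if route in WRITE_ENDPOINTS:
        return "live" if is_true(env, "FEATURE_API_WRITES") else "off"
    raise KeyError(f"unknown route: {route}")


print(endpoint_mode("/runs", {"FEATURE_API_RUNS_SHADOW": "true"}))
```

In a real service the mapping would be passed `os.environ`; taking it as a parameter keeps the gate easy to test at each phase.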

Build MVP against API: model/dataset selection, run/stop, streaming responses, manual decisions, metrics/matrix.
Roll out as an internal canary first, then expand once the stability criteria are met.

Add dataset upload/delete, JSON/Excel export, metadata stats/charts, and response rendering parity.
Decommission Streamlit only after all parity and stability checks pass.

Implement these endpoints as explicit deliverables (not implied work):

Notes:

Endpoint behavior must preserve existing storage/scoring semantics.
Export endpoint must match current JSON/Excel output compatibility.
GET /runs/{run_id}/status is required for robust UI refresh/reconnect behavior.
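To illustrate why a status endpoint helps with refresh and reconnect, the sketch below pairs Server-Sent Events frames that carry an `id` with a status snapshot that reports the last event id, so a reloaded UI can tell where to resume. The `RunState` class, its field names, and the `question_done` event are hypothetical; only the SSE wire format (`id:`, `event:`, `data:`, blank line) is standard.

```python
import json


def format_sse(event_id, event, payload):
    """Serialize one Server-Sent Events frame, terminated by a blank line."""
    return f"id: {event_id}\nevent: {event}\ndata: {json.dumps(payload)}\n\n"


class RunState:
    """In-memory record of one run (hypothetical shape, not the plan's schema)."""

    def __init__(self, run_id, total):
        self.run_id = run_id
        self.total = total
        self.state = "running"
        self.events = []

    def emit(self, event, payload):
        """Store the event and return its SSE frame; ids count up from 1."""
        self.events.append((event, payload))
        return format_sse(len(self.events), event, payload)

    def status(self):
        """Snapshot that a GET /runs/{run_id}/status handler might return."""
        completed = sum(1 for event, _ in self.events if event == "question_done")
        return {
            "run_id": self.run_id,
            "state": self.state,
            "completed": completed,
            "total": self.total,
            "last_event_id": len(self.events),
        }

    def replay(self, last_event_id):
        """Frames a reconnecting client has not yet seen."""
        return [
            format_sse(index, event, payload)
            for index, (event, payload) in enumerate(self.events, start=1)
            if index > last_event_id
        ]


run = RunState("run-1", total=2)
frame = run.emit("question_done", {"question_id": "q1"})
print(run.status())
```

A client that reconnects can send the last id it saw, and the server can answer with `replay(...)` before resuming the live stream.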

While FEATURE_API_WRITES=false: Streamlit is sole writer.
When FEATURE_API_WRITES=true: API is sole writer; Streamlit writes are disabled or routed through the API.
Use inter-process file locks for persistence paths.
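One way the single-writer rule and the file locks might fit together is sketched below. It uses a plain lock file created with `O_CREAT | O_EXCL` as a portable stand-in; the plan may well prefer `fcntl.flock` or a locking library instead. The function names, the `writer` labels `"api"` and `"streamlit"`, and the `.lock` suffix are assumptions.

```python
import contextlib
import json
import os
import tempfile
import time


@contextlib.contextmanager
def file_lock(target_path, timeout_s=5.0, poll_s=0.05):
    """Hold an exclusive lock file next to target_path for the block's duration."""
    lock_path = target_path + ".lock"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_EXCL makes creation fail if another process already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {target_path}")
            time.sleep(poll_s)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def write_results(path, data, writer, env):
    """Persist data only if `writer` currently owns writes, under the lock."""
    api_writes = env.get("FEATURE_API_WRITES", "false").strip().lower() == "true"
    owner = "api" if api_writes else "streamlit"
    if writer != owner:
        raise PermissionError(f"{writer} may not write while {owner} is sole writer")
    with file_lock(path):
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        # Atomic rename so readers never observe a half-written file.
        os.replace(tmp_path, path)
```

A crashed process would leave a stale lock file behind with this approach, which is one reason an OS-level lock could be a better fit in production.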

Add backend/frontend dependencies, CI gates, and environment contract exactly as defined in the plan.

Golden-master parity tests pass between existing Streamlit behavior and service/API output.
API contract/integration/SSE tests pass.
Next.js MVP E2E flow passes.
Zero critical and zero high-severity defects during the one-week internal canary.
At least 3 benchmark cycles with no data compatibility regression.
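A golden-master parity gate of the kind listed above could be sketched as a recursive comparison between a stored baseline and the new service output. The helper `parity_diff`, the `VOLATILE_KEYS` set, and the sample records are hypothetical; which fields are safe to ignore is an assumption that the real results.json schema would have to confirm.

```python
# Hypothetical keys expected to differ between runs; adjust to the real schema.
VOLATILE_KEYS = {"timestamp", "elapsed_ms"}


def parity_diff(baseline, candidate, path="$"):
    """Return human-readable differences; an empty list means parity."""
    if isinstance(baseline, dict) and isinstance(candidate, dict):
        diffs = []
        for key in sorted(set(baseline) | set(candidate)):
            if key in VOLATILE_KEYS:
                continue
            if key not in candidate:
                diffs.append(f"{path}.{key}: missing in candidate")
            elif key not in baseline:
                diffs.append(f"{path}.{key}: unexpected in candidate")
            else:
                diffs.extend(parity_diff(baseline[key], candidate[key], f"{path}.{key}"))
        return diffs
    if isinstance(baseline, list) and isinstance(candidate, list):
        if len(baseline) != len(candidate):
            return [f"{path}: length {len(baseline)} != {len(candidate)}"]
        diffs = []
        for index, (left, right) in enumerate(zip(baseline, candidate)):
            diffs.extend(parity_diff(left, right, f"{path}[{index}]"))
        return diffs
    return [] if baseline == candidate else [f"{path}: {baseline!r} != {candidate!r}"]


golden = {"results": [{"question_id": "q1", "status": "success", "elapsed_ms": 120}]}
service = {"results": [{"question_id": "q1", "status": "fail", "elapsed_ms": 95}]}
print(parity_diff(golden, service))
```

Reporting the path of each mismatch gives the gate reviewer concrete evidence to attach before the next flag is enabled.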

Please execute phase-by-phase and stop at each gate with evidence before enabling the next flag.