Spaces:
Running
Migration Analysis: Which Approach Is Best for Your ML Agent System?
After a thorough review of your ML-ETLAgent.py (1966 lines), your chatview.tsx (862 lines), and verification against the official Agno background execution docs, here is my honest assessment.
The Three Plans At a Glance
| # | Plan | Core Idea |
|---|---|---|
| 1 | CRUD + Code Edit/Rerun | Add endpoints to read/edit/rerun saved scripts from MinIO |
| 2 | Studio Hydration | Build /ml-studio/{session_id} that hydrates all state from DB on page load |
| 3 | Background Execution Migration | Replace threading/SSE with agent.arun(background=True) + PostgresDb + polling |
β οΈ Critical Finding: They Are NOT Three Alternatives
These three plans are not competing options β they are layers of the same solution. Plan 3 is the foundation, Plan 2 is the frontend architecture on top of it, and Plan 1 provides supplementary features. The correct answer is: do Plan 3 first, then Plan 2, then Plan 1 as needed.
Here's why:
Plan 3: Background Execution Migration β THE FOUNDATION β
Why This Must Come First
Your current architecture has a fundamental fragility: the ~120-line _run_agent_thread + asyncio.Queue + _active_runs dict system (lines 1602β1757). This is the root cause of every problem you've described:
| Problem | Root Cause |
|---|---|
| Agent stops when UI closes | _active_runs is in-memory; thread dies if process recycles |
| Can't reconnect to running jobs | Queue is per-process; no cross-instance sharing |
| analysis.md | No persistent run status |
| Can't share state across servers | SQLite + in-memory dict are single-node only |
What Agno Actually Provides (Verified)
From the official Agno example, the pattern is:
# Requires PostgresDb (NOT SqliteDb)
db = PostgresDb(
db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
session_table="background_exec_sessions",
)
agent = Agent(model=..., db=db)
# Returns immediately with RunStatus.pending
run_output = await agent.arun("...", background=True)
# Poll from DB (works from any process/server)
result = await agent.aget_run_output(
run_id=run_output.run_id,
session_id=run_output.session_id,
)
# result.status β pending | completed | error
# result.content β the agent's final answer
# Cancel
cancelled = await agent.acancel_run(run_id=run_output.run_id)
background=Truerequires PostgresDb. It will NOT work with your currentSqliteDb. This is a hard requirement from Agno's architecture β the DB acts as the message queue between the background task and the polling endpoint.
What You Delete (Massive Simplification)
| Lines | What | Action |
|---|---|---|
| 1β2 | import asyncio, import threading |
DELETE |
| 1602β1757 | _active_runs, _put_threadsafe, _run_agent_thread (~155 lines) |
DELETE entirely |
| 1799β1844 | stream_generator(), SSE headers, keep-alive, queue reconnect logic | DELETE |
That's ~160 lines of fragile custom infrastructure replaced by one line: background=True.
Risk Assessment
| Risk | Severity | Mitigation |
|---|---|---|
| PostgreSQL dependency | Medium | You likely already have it (Supabase uses PG). Docker one-liner for local dev |
| No streaming/SSE | High | Polling replaces real-time token streaming. Your frontend shows tool calls, narrations, and content chunks β all of this disappears with pure polling |
psycopg binary install |
Low | Standard pip install |
The streaming trade-off is the biggest decision here. Your chatview.tsx renders ThinkingSection (tool calls in real-time),
StreamingResponseDisplay(token-by-token content), andThoughtEventtimelines. With polling, you lose all of this β the user sees "Processing..." and then the full result appears at once. For ML pipelines that run 2β10 minutes, this may actually be fine (users don't watch a spinner for 5 minutes). But for quick chat messages (sub-10s), the experience degrades significantly.
Plan 2: Studio Hydration β THE FRONTEND LAYER
Why This Comes Second
Once you have Plan 3 (DB as source of truth), Plan 2 is trivially easy:
GET /ml-studio/{session_id}
β agent.get_session_messages(session_id) # from PostgresDb
β agent.get_session_state(session_id) # from PostgresDb
β return { messages, state, script }
Without Plan 3, this endpoint still works (your current SQLite stores messages), but it can't tell you if a run is still active, what the current status is, or allow reconnection from another machine.
Your Next.js Route
app/ml-studio/[sessionId]/page.tsx
On load:
GET /ml-studio/{sessionId}β hydrate chat history + state- If
is_running: trueβ start pollingGET /ml/runs/{sessionId}/{runId} - If user sends a new message β
POST /ml/ml-etl-agent/runβ getrun_idβ start polling
This is the standard pattern. No architectural risk.
Plan 1: Code Edit/Rerun β SUPPLEMENTARY FEATURE
Why This Comes Last
Adding script edit/rerun endpoints is a feature, not an infrastructure change. It only makes sense once:
- β Scripts are reliably persisted (your MinIO setup already handles this)
- β You can re-execute code independently (Plan 3 gives you clean execution boundaries)
- β The DB tracks which session owned which script (Plan 3's PostgresDb)
The dependency-checking/auto-install logic (_extract_and_check_imports) is genuinely useful but is a nice-to-have, not a blocker.
My Recommendation: Hybrid Approach
Don't do a full "rip and replace" of SSE with polling. Instead, use a hybrid:
Phase 1: Add background=True alongside existing SSE (1β2 days)
- Switch
SqliteDbβPostgresDb(agent constructor change, ~5 lines) - Add the poll endpoint
GET /ml/runs/{session_id}/{run_id} - Add the studio hydration endpoint
GET /ml-studio/{session_id} - Keep the existing SSE path (
stream=True) working as-is - Add a new non-streaming path that uses
background=True
This gives you both options: existing SSE for real-time chat UX, and background+poll for the "close browser and come back" use case.
Phase 2: Build the /ml-studio/[sessionId] page (1β2 days)
- Next.js dynamic route with DB hydration
- Poll-based status updates for active runs
- Code viewer/editor panel (reads saved scripts from MinIO)
Phase 3: Deprecate SSE (later, if desired)
Once you've validated the polling UX works well, you can optionally remove the threading/SSE code. But there's no urgency β the hybrid approach is the safest.
Summary Verdict
| Approach | Verdict |
|---|---|
| Plan 1 alone | β Wrong order β doesn't fix the infrastructure problems |
| Plan 2 alone | β Missing foundation β can't reliably track running jobs with SQLite |
| Plan 3 alone | β οΈ Works but lossy β kills real-time streaming UX that your chatview.tsx is built around |
| Hybrid (3 β 2 β 1) | β Best path β PostgresDb as foundation, keep SSE for real-time, add polling for resilience |
The single most impactful change is switching from SqliteDb to PostgresDb. That unlocks everything else with minimal risk to your existing SSE streaming pipeline.