sirus / backend /ml_module /working /analysis.md
ranilmukesh's picture
Deploy SiRUS SQL Agent backend
783a952

Migration Analysis: Which Approach Is Best for Your ML Agent System?

After a thorough review of your ML-ETLAgent.py (1966 lines), your chatview.tsx (862 lines), and verification against the official Agno background execution docs, here is my honest assessment.


The Three Plans At a Glance

# Plan Core Idea
1 CRUD + Code Edit/Rerun Add endpoints to read/edit/rerun saved scripts from MinIO
2 Studio Hydration Build /ml-studio/{session_id} that hydrates all state from DB on page load
3 Background Execution Migration Replace threading/SSE with agent.arun(background=True) + PostgresDb + polling

⚠️ Critical Finding: They Are NOT Three Alternatives

These three plans are not competing options β€” they are layers of the same solution. Plan 3 is the foundation, Plan 2 is the frontend architecture on top of it, and Plan 1 provides supplementary features. The correct answer is: do Plan 3 first, then Plan 2, then Plan 1 as needed.

Here's why:


Plan 3: Background Execution Migration β€” THE FOUNDATION βœ…

Why This Must Come First

Your current architecture has a fundamental fragility: the ~120-line _run_agent_thread + asyncio.Queue + _active_runs dict system (lines 1602–1757). This is the root cause of every problem you've described:

Problem Root Cause
Agent stops when UI closes _active_runs is in-memory; thread dies if process recycles
Can't reconnect to running jobs Queue is per-process; no cross-instance sharing
analysis.md No persistent run status
Can't share state across servers SQLite + in-memory dict are single-node only

What Agno Actually Provides (Verified)

From the official Agno example, the pattern is:

# Requires PostgresDb (NOT SqliteDb)
db = PostgresDb(
    db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
    session_table="background_exec_sessions",
)

agent = Agent(model=..., db=db)

# Returns immediately with RunStatus.pending
run_output = await agent.arun("...", background=True)

# Poll from DB (works from any process/server)
result = await agent.aget_run_output(
    run_id=run_output.run_id,
    session_id=run_output.session_id,
)
# result.status β†’ pending | completed | error
# result.content β†’ the agent's final answer

# Cancel
cancelled = await agent.acancel_run(run_id=run_output.run_id)

background=True requires PostgresDb. It will NOT work with your current SqliteDb. This is a hard requirement from Agno's architecture β€” the DB acts as the message queue between the background task and the polling endpoint.

What You Delete (Massive Simplification)

Lines What Action
1–2 import asyncio, import threading DELETE
1602–1757 _active_runs, _put_threadsafe, _run_agent_thread (~155 lines) DELETE entirely
1799–1844 stream_generator(), SSE headers, keep-alive, queue reconnect logic DELETE

That's ~160 lines of fragile custom infrastructure replaced by one line: background=True.

Risk Assessment

Risk Severity Mitigation
PostgreSQL dependency Medium You likely already have it (Supabase uses PG). Docker one-liner for local dev
No streaming/SSE High Polling replaces real-time token streaming. Your frontend shows tool calls, narrations, and content chunks β€” all of this disappears with pure polling
psycopg binary install Low Standard pip install

The streaming trade-off is the biggest decision here. Your chatview.tsx renders ThinkingSection (tool calls in real-time), StreamingResponseDisplay (token-by-token content), and ThoughtEvent timelines. With polling, you lose all of this β€” the user sees "Processing..." and then the full result appears at once. For ML pipelines that run 2–10 minutes, this may actually be fine (users don't watch a spinner for 5 minutes). But for quick chat messages (sub-10s), the experience degrades significantly.


Plan 2: Studio Hydration β€” THE FRONTEND LAYER

Why This Comes Second

Once you have Plan 3 (DB as source of truth), Plan 2 is trivially easy:

GET /ml-studio/{session_id}
  β†’ agent.get_session_messages(session_id)  # from PostgresDb
  β†’ agent.get_session_state(session_id)     # from PostgresDb  
  β†’ return { messages, state, script }

Without Plan 3, this endpoint still works (your current SQLite stores messages), but it can't tell you if a run is still active, what the current status is, or allow reconnection from another machine.

Your Next.js Route

app/ml-studio/[sessionId]/page.tsx

On load:

  1. GET /ml-studio/{sessionId} β†’ hydrate chat history + state
  2. If is_running: true β†’ start polling GET /ml/runs/{sessionId}/{runId}
  3. If user sends a new message β†’ POST /ml/ml-etl-agent/run β†’ get run_id β†’ start polling

This is the standard pattern. No architectural risk.


Plan 1: Code Edit/Rerun β€” SUPPLEMENTARY FEATURE

Why This Comes Last

Adding script edit/rerun endpoints is a feature, not an infrastructure change. It only makes sense once:

  • βœ… Scripts are reliably persisted (your MinIO setup already handles this)
  • βœ… You can re-execute code independently (Plan 3 gives you clean execution boundaries)
  • βœ… The DB tracks which session owned which script (Plan 3's PostgresDb)

The dependency-checking/auto-install logic (_extract_and_check_imports) is genuinely useful but is a nice-to-have, not a blocker.


My Recommendation: Hybrid Approach

Don't do a full "rip and replace" of SSE with polling. Instead, use a hybrid:

Phase 1: Add background=True alongside existing SSE (1–2 days)

  1. Switch SqliteDb β†’ PostgresDb (agent constructor change, ~5 lines)
  2. Add the poll endpoint GET /ml/runs/{session_id}/{run_id}
  3. Add the studio hydration endpoint GET /ml-studio/{session_id}
  4. Keep the existing SSE path (stream=True) working as-is
  5. Add a new non-streaming path that uses background=True

This gives you both options: existing SSE for real-time chat UX, and background+poll for the "close browser and come back" use case.

Phase 2: Build the /ml-studio/[sessionId] page (1–2 days)

  1. Next.js dynamic route with DB hydration
  2. Poll-based status updates for active runs
  3. Code viewer/editor panel (reads saved scripts from MinIO)

Phase 3: Deprecate SSE (later, if desired)

Once you've validated the polling UX works well, you can optionally remove the threading/SSE code. But there's no urgency β€” the hybrid approach is the safest.


Summary Verdict

Approach Verdict
Plan 1 alone ❌ Wrong order β€” doesn't fix the infrastructure problems
Plan 2 alone ❌ Missing foundation β€” can't reliably track running jobs with SQLite
Plan 3 alone ⚠️ Works but lossy β€” kills real-time streaming UX that your chatview.tsx is built around
Hybrid (3 β†’ 2 β†’ 1) βœ… Best path β€” PostgresDb as foundation, keep SSE for real-time, add polling for resilience

The single most impactful change is switching from SqliteDb to PostgresDb. That unlocks everything else with minimal risk to your existing SSE streaming pipeline.