Spaces:

ranilmukesh
/

sirus

Running

App Files Files Community

sirus / backend /ml_module /working /analysis.md

ranilmukesh

Deploy SiRUS SQL Agent backend

783a952 3 days ago

preview code

raw

history blame contribute delete

8.69 kB

Migration Analysis: Which Approach Is Best for Your ML Agent System?

After a thorough review of your ML-ETLAgent.py (1966 lines), your chatview.tsx (862 lines), and verification against the official Agno background execution docs, here is my honest assessment.

The Three Plans At a Glance

#	Plan	Core Idea
1	CRUD + Code Edit/Rerun	Add endpoints to read/edit/rerun saved scripts from MinIO
2	Studio Hydration	Build `/ml-studio/{session_id}` that hydrates all state from DB on page load
3	Background Execution Migration	Replace threading/SSE with `agent.arun(background=True)` + `PostgresDb` + polling

⚠️ Critical Finding: They Are NOT Three Alternatives

These three plans are not competing options — they are layers of the same solution. Plan 3 is the foundation, Plan 2 is the frontend architecture on top of it, and Plan 1 provides supplementary features. The correct answer is: do Plan 3 first, then Plan 2, then Plan 1 as needed.

Here's why:

Plan 3: Background Execution Migration — THE FOUNDATION ✅

Why This Must Come First

Your current architecture has a fundamental fragility: the ~120-line _run_agent_thread + asyncio.Queue + _active_runs dict system (lines 1602–1757). This is the root cause of every problem you've described:

Problem	Root Cause
Agent stops when UI closes	`_active_runs` is in-memory; thread dies if process recycles
Can't reconnect to running jobs	Queue is per-process; no cross-instance sharing
analysis.md	No persistent run status
Can't share state across servers	SQLite + in-memory dict are single-node only

What Agno Actually Provides (Verified)

From the official Agno example, the pattern is:

# Requires PostgresDb (NOT SqliteDb)
db = PostgresDb(
    db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
    session_table="background_exec_sessions",
)

agent = Agent(model=..., db=db)

# Returns immediately with RunStatus.pending
run_output = await agent.arun("...", background=True)

# Poll from DB (works from any process/server)
result = await agent.aget_run_output(
    run_id=run_output.run_id,
    session_id=run_output.session_id,
)
# result.status → pending | completed | error
# result.content → the agent's final answer

# Cancel
cancelled = await agent.acancel_run(run_id=run_output.run_id)

background=True requires PostgresDb. It will NOT work with your current SqliteDb. This is a hard requirement from Agno's architecture — the DB acts as the message queue between the background task and the polling endpoint.

What You Delete (Massive Simplification)

Lines	What	Action
1–2	`import asyncio`, `import threading`	DELETE
1602–1757	`_active_runs`, _put_threadsafe, _run_agent_thread (~155 lines)	DELETE entirely
1799–1844	stream_generator(), SSE headers, keep-alive, queue reconnect logic	DELETE

That's ~160 lines of fragile custom infrastructure replaced by one line: background=True.

Risk Assessment

Risk	Severity	Mitigation
PostgreSQL dependency	Medium	You likely already have it (Supabase uses PG). Docker one-liner for local dev
No streaming/SSE	High	Polling replaces real-time token streaming. Your frontend shows tool calls, narrations, and content chunks — all of this disappears with pure polling
`psycopg` binary install	Low	Standard pip install

The streaming trade-off is the biggest decision here. Your chatview.tsx renders ThinkingSection (tool calls in real-time), StreamingResponseDisplay (token-by-token content), and ThoughtEvent timelines. With polling, you lose all of this — the user sees "Processing..." and then the full result appears at once. For ML pipelines that run 2–10 minutes, this may actually be fine (users don't watch a spinner for 5 minutes). But for quick chat messages (sub-10s), the experience degrades significantly.

Plan 2: Studio Hydration — THE FRONTEND LAYER

Why This Comes Second

Once you have Plan 3 (DB as source of truth), Plan 2 is trivially easy:

GET /ml-studio/{session_id}
  → agent.get_session_messages(session_id)  # from PostgresDb
  → agent.get_session_state(session_id)     # from PostgresDb  
  → return { messages, state, script }

Without Plan 3, this endpoint still works (your current SQLite stores messages), but it can't tell you if a run is still active, what the current status is, or allow reconnection from another machine.

Your Next.js Route

app/ml-studio/[sessionId]/page.tsx

On load:

GET /ml-studio/{sessionId} → hydrate chat history + state
If is_running: true → start polling GET /ml/runs/{sessionId}/{runId}
If user sends a new message → POST /ml/ml-etl-agent/run → get run_id → start polling

This is the standard pattern. No architectural risk.

Plan 1: Code Edit/Rerun — SUPPLEMENTARY FEATURE

Why This Comes Last

Adding script edit/rerun endpoints is a feature, not an infrastructure change. It only makes sense once:

✅ Scripts are reliably persisted (your MinIO setup already handles this)
✅ You can re-execute code independently (Plan 3 gives you clean execution boundaries)
✅ The DB tracks which session owned which script (Plan 3's PostgresDb)

The dependency-checking/auto-install logic (_extract_and_check_imports) is genuinely useful but is a nice-to-have, not a blocker.

My Recommendation: Hybrid Approach

Don't do a full "rip and replace" of SSE with polling. Instead, use a hybrid:

Phase 1: Add `background=True` alongside existing SSE (1–2 days)

Switch SqliteDb → PostgresDb (agent constructor change, ~5 lines)
Add the poll endpoint GET /ml/runs/{session_id}/{run_id}
Add the studio hydration endpoint GET /ml-studio/{session_id}
Keep the existing SSE path (stream=True) working as-is
Add a new non-streaming path that uses background=True

This gives you both options: existing SSE for real-time chat UX, and background+poll for the "close browser and come back" use case.

Phase 2: Build the `/ml-studio/[sessionId]` page (1–2 days)

Next.js dynamic route with DB hydration
Poll-based status updates for active runs
Code viewer/editor panel (reads saved scripts from MinIO)

Phase 3: Deprecate SSE (later, if desired)

Once you've validated the polling UX works well, you can optionally remove the threading/SSE code. But there's no urgency — the hybrid approach is the safest.

Summary Verdict

Approach	Verdict
Plan 1 alone	❌ Wrong order — doesn't fix the infrastructure problems
Plan 2 alone	❌ Missing foundation — can't reliably track running jobs with SQLite
Plan 3 alone	⚠️ Works but lossy — kills real-time streaming UX that your chatview.tsx is built around
Hybrid (3 → 2 → 1)	✅ Best path — PostgresDb as foundation, keep SSE for real-time, add polling for resilience

The single most impactful change is switching from SqliteDb to PostgresDb. That unlocks everything else with minimal risk to your existing SSE streaming pipeline.

Migration Analysis: Which Approach Is Best for Your ML Agent System?

The Three Plans At a Glance

⚠️ Critical Finding: They Are NOT Three Alternatives

Plan 3: Background Execution Migration — THE FOUNDATION ✅

Why This Must Come First

What Agno Actually Provides (Verified)

What You Delete (Massive Simplification)

Risk Assessment

Plan 2: Studio Hydration — THE FRONTEND LAYER

Why This Comes Second

Your Next.js Route

Plan 1: Code Edit/Rerun — SUPPLEMENTARY FEATURE

Why This Comes Last

My Recommendation: Hybrid Approach

Phase 1: Add background=True alongside existing SSE (1–2 days)

Phase 2: Build the /ml-studio/[sessionId] page (1–2 days)

Phase 3: Deprecate SSE (later, if desired)

Summary Verdict

Phase 1: Add `background=True` alongside existing SSE (1–2 days)

Phase 2: Build the `/ml-studio/[sessionId]` page (1–2 days)