sirus / backend /ml_module /working /analysis.md
ranilmukesh's picture
Deploy SiRUS SQL Agent backend
783a952
# Migration Analysis: Which Approach Is Best for Your ML Agent System?
After a thorough review of your [ML-ETLAgent.py](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/backend/ml_module/working/ML-ETLAgent.py) (1966 lines), your [chatview.tsx](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/prod-frontend/frontend/components/ml-studio/sections/chatview.tsx) (862 lines), and verification against the [official Agno background execution docs](https://docs.agno.com/examples/agents/advanced/background-execution), here is my honest assessment.
---
## The Three Plans At a Glance
| # | Plan | Core Idea |
|---|------|-----------|
| 1 | **CRUD + Code Edit/Rerun** | Add endpoints to read/edit/rerun saved scripts from MinIO |
| 2 | **Studio Hydration** | Build `/ml-studio/{session_id}` that hydrates all state from DB on page load |
| 3 | **Background Execution Migration** | Replace threading/SSE with `agent.arun(background=True)` + `PostgresDb` + polling |
---
## ⚠️ Critical Finding: They Are NOT Three Alternatives
> [!IMPORTANT]
> These three plans are **not competing options** β€” they are **layers of the same solution**. Plan 3 is the foundation, Plan 2 is the frontend architecture on top of it, and Plan 1 provides supplementary features. The correct answer is: **do Plan 3 first, then Plan 2, then Plan 1 as needed**.
Here's why:
---
## Plan 3: Background Execution Migration β€” THE FOUNDATION βœ…
### Why This Must Come First
Your current architecture has a **fundamental fragility**: the ~120-line [_run_agent_thread](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/backend/ml_module/working/ML-ETLAgent.py#1619-1757) + `asyncio.Queue` + `_active_runs` dict system (lines 1602–1757). This is the root cause of every problem you've described:
| Problem | Root Cause |
|---------|------------|
| Agent stops when UI closes | `_active_runs` is in-memory; thread dies if process recycles |
| Can't reconnect to running jobs | Queue is per-process; no cross-instance sharing |
analysis.md| No persistent run status | Run state lives only in the thread's local scope |
| Can't share state across servers | SQLite + in-memory dict are single-node only |
### What Agno Actually Provides (Verified)
From the [official Agno example](https://docs.agno.com/examples/agents/advanced/background-execution), the pattern is:
```python
# Requires PostgresDb (NOT SqliteDb)
db = PostgresDb(
db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",
session_table="background_exec_sessions",
)
agent = Agent(model=..., db=db)
# Returns immediately with RunStatus.pending
run_output = await agent.arun("...", background=True)
# Poll from DB (works from any process/server)
result = await agent.aget_run_output(
run_id=run_output.run_id,
session_id=run_output.session_id,
)
# result.status β†’ pending | completed | error
# result.content β†’ the agent's final answer
# Cancel
cancelled = await agent.acancel_run(run_id=run_output.run_id)
```
> [!CAUTION]
> `background=True` **requires PostgresDb**. It will NOT work with your current `SqliteDb`. This is a hard requirement from Agno's architecture β€” the DB acts as the message queue between the background task and the polling endpoint.
### What You Delete (Massive Simplification)
| Lines | What | Action |
|-------|------|--------|
| 1–2 | `import asyncio`, `import threading` | **DELETE** |
| 1602–1757 | `_active_runs`, [_put_threadsafe](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/backend/ml_module/working/ML-ETLAgent.py#1614-1617), [_run_agent_thread](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/backend/ml_module/working/ML-ETLAgent.py#1619-1757) (~155 lines) | **DELETE entirely** |
| 1799–1844 | [stream_generator()](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/backend/ml_module/working/ML-ETLAgent.py#1800-1835), SSE headers, keep-alive, queue reconnect logic | **DELETE** |
That's **~160 lines of fragile custom infrastructure** replaced by one line: `background=True`.
### Risk Assessment
| Risk | Severity | Mitigation |
|------|----------|------------|
| PostgreSQL dependency | Medium | You likely already have it (Supabase uses PG). Docker one-liner for local dev |
| **No streaming/SSE** | **High** | Polling replaces real-time token streaming. Your frontend shows tool calls, narrations, and content chunks β€” **all of this disappears with pure polling** |
| `psycopg` binary install | Low | Standard pip install |
> [!WARNING]
> **The streaming trade-off is the biggest decision here.** Your [chatview.tsx](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/prod-frontend/frontend/components/ml-studio/sections/chatview.tsx) renders [ThinkingSection](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/prod-frontend/frontend/components/ml-studio/sections/chatview.tsx#278-469) (tool calls in real-time), `StreamingResponseDisplay` (token-by-token content), and `ThoughtEvent` timelines. With polling, you lose all of this β€” the user sees "Processing..." and then the full result appears at once. For ML pipelines that run 2–10 minutes, this may actually be **fine** (users don't watch a spinner for 5 minutes). But for quick chat messages (sub-10s), the experience degrades significantly.
---
## Plan 2: Studio Hydration β€” THE FRONTEND LAYER
### Why This Comes Second
Once you have Plan 3 (DB as source of truth), Plan 2 is trivially easy:
```
GET /ml-studio/{session_id}
β†’ agent.get_session_messages(session_id) # from PostgresDb
β†’ agent.get_session_state(session_id) # from PostgresDb
β†’ return { messages, state, script }
```
Without Plan 3, this endpoint still works (your current SQLite stores messages), but it can't tell you if a run is still active, what the current status is, or allow reconnection from another machine.
### Your Next.js Route
```
app/ml-studio/[sessionId]/page.tsx
```
On load:
1. `GET /ml-studio/{sessionId}` β†’ hydrate chat history + state
2. If `is_running: true` β†’ start polling `GET /ml/runs/{sessionId}/{runId}`
3. If user sends a new message β†’ `POST /ml/ml-etl-agent/run` β†’ get `run_id` β†’ start polling
This is the standard pattern. No architectural risk.
---
## Plan 1: Code Edit/Rerun β€” SUPPLEMENTARY FEATURE
### Why This Comes Last
Adding script edit/rerun endpoints is a **feature**, not an infrastructure change. It only makes sense once:
- βœ… Scripts are reliably persisted (your MinIO setup already handles this)
- βœ… You can re-execute code independently (Plan 3 gives you clean execution boundaries)
- βœ… The DB tracks which session owned which script (Plan 3's PostgresDb)
The dependency-checking/auto-install logic (`_extract_and_check_imports`) is genuinely useful but is a nice-to-have, not a blocker.
---
## My Recommendation: Hybrid Approach
> [!TIP]
> **Don't do a full "rip and replace" of SSE with polling.** Instead, use a hybrid:
### Phase 1: Add `background=True` alongside existing SSE (1–2 days)
1. Switch `SqliteDb` β†’ `PostgresDb` (agent constructor change, ~5 lines)
2. Add the poll endpoint `GET /ml/runs/{session_id}/{run_id}`
3. Add the studio hydration endpoint `GET /ml-studio/{session_id}`
4. Keep the existing SSE path (`stream=True`) working as-is
5. Add a new non-streaming path that uses `background=True`
This gives you **both options**: existing SSE for real-time chat UX, and background+poll for the "close browser and come back" use case.
### Phase 2: Build the `/ml-studio/[sessionId]` page (1–2 days)
1. Next.js dynamic route with DB hydration
2. Poll-based status updates for active runs
3. Code viewer/editor panel (reads saved scripts from MinIO)
### Phase 3: Deprecate SSE (later, if desired)
Once you've validated the polling UX works well, you can optionally remove the threading/SSE code. But there's no urgency β€” the hybrid approach is the safest.
---
## Summary Verdict
| Approach | Verdict |
|----------|---------|
| Plan 1 alone | ❌ **Wrong order** β€” doesn't fix the infrastructure problems |
| Plan 2 alone | ❌ **Missing foundation** β€” can't reliably track running jobs with SQLite |
| Plan 3 alone | ⚠️ **Works but lossy** β€” kills real-time streaming UX that your [chatview.tsx](file:///d:/PhobosQ%20-%20docs/sirus%20nextjs%20new%20ui/prod-frontend/frontend/components/ml-studio/sections/chatview.tsx) is built around |
| **Hybrid (3 β†’ 2 β†’ 1)** | βœ… **Best path** β€” PostgresDb as foundation, keep SSE for real-time, add polling for resilience |
The single most impactful change is **switching from `SqliteDb` to `PostgresDb`**. That unlocks everything else with minimal risk to your existing SSE streaming pipeline.