VibecoderMcSwaggins committed
Commit b80c43b · 1 Parent(s): e1232d2

docs: Add P2 cold start dead zones bug (#108)


Documents three UX "dead zones" in Advanced Mode where users see
no visual feedback:

1. Dead Zone #1 (5-15s): Initialization - loading embeddings, ChromaDB
2. Dead Zone #2 (10-30s): First LLM call - manager planning
3. Dead Zone #3 (30-90s): Agent execution - SearchAgent queries

Root cause analysis, proposed solutions, and testing instructions included.

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -11,6 +11,23 @@ _No active P0 bugs._
 
 ---
 
+## P2 - UX Friction
+
+### P2 - Advanced Mode Cold Start Has No User Feedback
+**File:** `docs/bugs/P2_ADVANCED_MODE_COLD_START_NO_FEEDBACK.md`
+**Issue:** [#108](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/108)
+**Found:** 2025-12-01 (Gradio Testing)
+
+**Problem:** Three "dead zones" with no visual feedback during Advanced Mode startup:
+1. **Dead Zone #1** (5-15s): Between STARTED → THINKING (initialization)
+2. **Dead Zone #2** (10-30s): Between THINKING → PROGRESS (first LLM call)
+3. **Dead Zone #3** (30-90s): After PROGRESS (SearchAgent executing)
+
+**Impact:** Users think the app is frozen and can't tell whether it is working.
+**Solution:** Add granular progress events, potentially parallelize initialization, add a Gradio progress bar.
+
+---
+
 ## P1 - Important
 
 ### P1 - Memory Layer Not Integrated (Post-Hackathon)
docs/bugs/P2_ADVANCED_MODE_COLD_START_NO_FEEDBACK.md ADDED
@@ -0,0 +1,255 @@
# P2: Advanced Mode Cold Start Has No User Feedback

**Priority**: P2 (UX Friction)
**Component**: `src/orchestrators/advanced.py`
**Status**: Open
**Issue**: [#108](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/108)
**Created**: 2025-12-01

## Summary

When Advanced Mode starts, users experience three significant "dead zones" with no visual feedback:

1. **Initialization delay** (5-15 seconds): Between "STARTED" and "THINKING" events
2. **First LLM call delay** (10-30+ seconds): Between "THINKING" and first "PROGRESS" event
3. **Agent execution delay** (30-90+ seconds): After "PROGRESS" while SearchAgent executes

Users see the UI freeze with no indication of what's happening, leading to confusion about whether the system is working.

## Visual Timeline

```
🚀 STARTED: Starting research (Advanced mode)...
│
│ ← DEAD ZONE #1: 5-15 seconds of nothing
│   - Loading LlamaIndex/ChromaDB
│   - Initializing embedding service
│   - Building 4 agents + manager
│
⏳ THINKING: Multi-agent reasoning in progress...
│
│ ← DEAD ZONE #2: 10-30+ seconds of nothing
│   - Manager agent's first OpenAI API call
│   - Cold connection to OpenAI
│
⏱️ PROGRESS: Manager assigning research task...
│
│ ← DEAD ZONE #3: 30-90+ seconds of nothing
│   - SearchAgent executing PubMed/ClinicalTrials/EuropePMC queries
│   - Embedding and storing results in ChromaDB
│   - No streaming events during search execution
│
📊 SEARCH_COMPLETE / PROGRESS: Round 1/5...
```
## Root Cause Analysis

### Dead Zone #1: Initialization (Lines 162-165)

```python
yield AgentEvent(type="started", ...)  # User sees this

# === BLOCKING OPERATIONS (no events yielded) ===
embedding_service = self._init_embedding_service()  # ChromaDB, embeddings
init_magentic_state(query, embedding_service)       # Shared state
workflow = self._build_workflow()                   # 4 agents + manager

yield AgentEvent(type="thinking", ...)  # User finally sees this
```

**What's happening:**
1. `_init_embedding_service()` → Loads LlamaIndex, connects to ChromaDB, initializes OpenAI embeddings
2. `init_magentic_state()` → Creates ResearchMemory, sets up context
3. `_build_workflow()` → Instantiates SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent, Manager
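Before adding events, it's worth confirming which of these three calls actually dominates the delay. A throwaway timing harness can wrap each step (a sketch; `timed` is a hypothetical helper, not part of the codebase):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print how long the wrapped block took; for one-off cold-start profiling."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Temporary instrumentation inside run(), e.g.:
#   with timed("init embedding service"):
#       embedding_service = self._init_embedding_service()
```

Running once with each step wrapped tells us whether Option B (parallelize) or Option C (pre-warm) would pay off more.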
### Dead Zone #2: First LLM Call (Line 206)

```python
yield AgentEvent(type="thinking", ...)  # User sees this

async for event in workflow.run_stream(task):  # Blocks until the first event
    # Manager makes its first OpenAI call here;
    # no events arrive until it responds and starts delegating.
    ...
```

**What's happening:**
- Microsoft Agent Framework's manager agent receives the task
- Makes a synchronous(ish) call to OpenAI for orchestration planning
- Only after the response does it emit `MagenticOrchestratorMessageEvent`
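Until the framework emits earlier events, the orchestrator itself could paper over this gap by interleaving synthetic heartbeat events whenever the stream goes quiet. A minimal sketch (the `with_heartbeat` wrapper is hypothetical, not existing code; it works on any async iterator):

```python
import asyncio

async def with_heartbeat(stream, interval: float = 5.0):
    """Yield ("heartbeat", n) whenever `stream` is silent for `interval` seconds."""
    it = stream.__aiter__()
    while True:
        nxt = asyncio.ensure_future(it.__anext__())
        beats = 0
        while not nxt.done():
            done, _ = await asyncio.wait({nxt}, timeout=interval)
            if not done:
                beats += 1
                yield ("heartbeat", beats)
        try:
            yield ("event", nxt.result())
        except StopAsyncIteration:
            break
```

Each `("heartbeat", n)` could be translated into a "Still thinking..." progress `AgentEvent`, so Gradio always has something to render even while `run_stream()` is silent.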
### Dead Zone #3: Agent Execution (After PROGRESS event)

After "Manager assigning research task...", the SearchAgent executes but emits no events until it finishes.

**What's happening:**
- SearchAgent receives the task from the manager
- Executes parallel queries against PubMed, ClinicalTrials.gov, and Europe PMC
- Each result is embedded and stored in ChromaDB
- Only after ALL searches complete does it emit `MagenticAgentMessageEvent`

**Why no streaming:**
- The agent's internal tool calls (search APIs, embeddings) don't emit framework events
- Microsoft Agent Framework only emits events at agent message boundaries
- 3 databases × multiple queries × embedding each result = a long silent period

**Potential fix:** Add progress callbacks to `SearchAgent` tools:

```python
# In search_agent.py - hypothetical
from collections.abc import Callable

async def search_pubmed(query: str, on_progress: Callable[[str], None] | None = None):
    results = await pubmed_client.search(query)
    if on_progress:
        on_progress(f"Found {len(results)} PubMed results")
    # ... embed and store
```
## Impact

1. **User Confusion**: "Is it frozen? Should I refresh?"
2. **Perceived Slowness**: Dead time feels longer than active progress
3. **No Cancel Option**: Users can't abort during these zones
4. **Support Burden**: Users report "it's not working" when it's actually initializing
## Proposed Solutions

### Option A: Granular Initialization Events (Quick Win)

Add progress events during initialization:

```python
yield AgentEvent(type="started", ...)

yield AgentEvent(
    type="progress",
    message="Loading embedding service...",
    iteration=0,
)
embedding_service = self._init_embedding_service()

yield AgentEvent(
    type="progress",
    message="Initializing research memory...",
    iteration=0,
)
init_magentic_state(query, embedding_service)

yield AgentEvent(
    type="progress",
    message="Building agent team (Search, Judge, Hypothesis, Report)...",
    iteration=0,
)
workflow = self._build_workflow()

yield AgentEvent(type="thinking", ...)
```

**Pros**: Simple, immediate feedback
**Cons**: Still sequential; doesn't reduce the actual time
### Option B: Parallel Initialization (Performance + UX)

Use `asyncio.gather()` for independent operations:

```python
yield AgentEvent(type="progress", message="Initializing agents...", iteration=0)

# These could potentially run in parallel
embedding_task = asyncio.create_task(self._init_embedding_service_async())
workflow_task = asyncio.create_task(self._build_workflow_async())

embedding_service, workflow = await asyncio.gather(embedding_task, workflow_task)
init_magentic_state(query, embedding_service)
```

**Pros**: Faster initialization, better UX
**Cons**: Need to verify thread safety; more complex
### Option C: Pre-warming / Singleton Services

Initialize expensive services once at app startup, not per-request:

```python
# In app.py startup
global_embedding_service = init_embedding_service()
global_workflow_template = build_workflow_template()

# In orchestrator
workflow = global_workflow_template.clone()  # Fast
```

**Pros**: Near-instant start after the first request
**Cons**: Memory overhead; the first request still pays the cold start
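A low-ceremony variant of this option is a cached factory instead of explicit globals. A sketch, with a stand-in for the real loader in `src/utils/service_loader.py`:

```python
from functools import lru_cache

def init_embedding_service():
    # Stand-in for the real (expensive) loader in src/utils/service_loader.py,
    # which connects to ChromaDB and initializes embeddings.
    return object()

@lru_cache(maxsize=1)
def get_embedding_service():
    """Build the service on first call; every later call returns the cached instance."""
    return init_embedding_service()
```

The orchestrator would call `get_embedding_service()` instead of `self._init_embedding_service()`; a startup hook in `app.py` could call it once to absorb the cold start before the first user request.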
### Option D: Animated Progress Indicator (UI-Only)

Add a Gradio progress bar that animates during the dead zones. Gradio injects the tracker when it appears as a default-valued parameter of the event handler:

```python
# In app.py
async def research(query, progress=gr.Progress()):
    progress(0.1, desc="Initializing...")
    # ...
    progress(0.2, desc="Building agents...")
```

**Pros**: User sees activity even when there's nothing to report
**Cons**: Doesn't remove the actual blocking; Gradio-specific
## Recommended Approach

**Phase 1 (Quick Win)**: Option A - Add granular events
**Phase 2 (Performance)**: Option C - Pre-warm services at startup
**Phase 3 (Polish)**: Option D - Gradio progress bar

## Related Considerations

### Parallel Agent Orchestration

The current Microsoft Agent Framework runs agents sequentially through the manager. True parallel execution would require:

1. Breaking out of the framework's `run_stream()` pattern
2. Implementing our own parallel task dispatch
3. Managing agent coordination manually

This is a larger architectural change (P1 scope) and should be tracked separately if desired.
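For reference, the dispatch half of step 2 is plain asyncio fan-out. A sketch with hypothetical agent objects exposing `run()`, deliberately ignoring the coordination problem in step 3:

```python
import asyncio

async def dispatch_parallel(agents, task):
    """Run every agent on the same task concurrently; results keep agent order."""
    return await asyncio.gather(*(agent.run(task) for agent in agents))
```

The hard part is not this call but what step 3 hides: merging partial results, handling one agent failing mid-flight, and deciding when the manager re-plans.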
## Files to Modify

1. `src/orchestrators/advanced.py:155-210` - Add initialization events in the `run()` method
2. `src/utils/service_loader.py` - Pre-warming logic
3. `src/app.py` - Gradio progress integration
## Testing the Issue

```python
import asyncio
import time

from src.orchestrators.advanced import AdvancedOrchestrator

async def test():
    orch = AdvancedOrchestrator(max_rounds=3)
    start = time.time()
    async for event in orch.run("test query"):
        elapsed = time.time() - start
        print(f"[{elapsed:.1f}s] {event.type}: {event.message[:50]}...")
        if event.type == "complete":
            break

asyncio.run(test())
```

Expected output showing the gaps:

```
[0.0s] started: Starting research (Advanced mode)...
[8.2s] thinking: Multi-agent reasoning in progress...   ← 8 second gap!
[22.5s] progress: Manager assigning research task...    ← 14 second gap!
```
## References

- Advanced orchestrator: `src/orchestrators/advanced.py`
- Embedding service loader: `src/utils/service_loader.py`
- LlamaIndex RAG: `src/services/llamaindex_rag.py`
- Microsoft Agent Framework: `agent-framework-core`