VibecoderMcSwaggins commited on
Commit
949847c
·
unverified ·
2 Parent(s): c7f5590 f9576ce

Merge pull request #136 from The-Obstacle-Is-The-Way/fix/spec-20-pubmed-json

Browse files
docs/architecture/adr-001-middleware-refactor.md ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ADR-001: Middleware Architecture Refactor
2
+
3
+ **Status:** ACCEPTED
4
+ **Date:** 2025-12-06
5
+ **Decision Makers:** Development Team
6
+
7
+ ---
8
+
9
+ ## Context
10
+
11
+ The current `src/middleware/` folder is misleadingly named. It contains `SubIterationMiddleware`, which implements a workflow pattern (team→judge loop), not the interceptor middleware pattern used by Microsoft Agent Framework.
12
+
13
+ Additionally, we're missing proper middleware implementations for:
14
+ - Retry logic on transient errors (429, 500, 502, 503, 504)
15
+ - Token usage tracking for cost monitoring
16
+
17
+ ---
18
+
19
+ ## Decision
20
+
21
+ 1. **Rename `src/middleware/` to `src/workflows/`** to accurately reflect what it contains
22
+ 2. **Create new `src/middleware/` with proper MS-pattern middleware:**
23
+ - `RetryMiddleware(ChatMiddleware)` - exponential backoff retry
24
+ - `TokenTrackingMiddleware(ChatMiddleware)` - token usage logging
25
+
26
+ ---
27
+
28
+ ## Consequences
29
+
30
+ ### Positive
31
+ - Clearer codebase organization
32
+ - Proper use of MS Agent Framework patterns
33
+ - HuggingFace 429 crashes will be handled gracefully
34
+ - Token usage will be visible for cost monitoring
35
+
36
+ ### Negative
37
+ - Requires updating imports in `src/orchestrators/hierarchical.py`
38
+ - One-time migration effort
39
+
40
+ ### Neutral
41
+ - Aligns with Microsoft Agent Framework conventions
42
+
43
+ ---
44
+
45
+ ## Implementation
46
+
47
+ See [SPEC-21: Middleware Architecture Refactor](../specs/SPEC-21-MIDDLEWARE-ARCHITECTURE.md) for detailed implementation steps.
48
+
49
+ ---
50
+
51
+ ## References
52
+
53
+ - Microsoft Agent Framework `_middleware.py`
54
+ - [P2 Hardening Issues](../bugs/p2-hardening-issues.md) Issue 2 & 3
docs/architecture/ms-framework-usage-report.md ADDED
@@ -0,0 +1,68 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Microsoft Agent Framework Usage Report
2
+
3
+ **Date:** 2025-12-06
4
+ **Framework Version:** agent-framework-core==1.0.0b251204
5
+
6
+ ---
7
+
8
+ ## What We Use From MS Framework (pip installed)
9
+
10
+ ### Core Classes
11
+ - `BaseChatClient` - Base class for chat clients
12
+ - `ChatMessage`, `ChatRole` - Message types
13
+ - `ChatOptions` - Request configuration
14
+ - `ChatAgent` - Agent base class
15
+
16
+ ### Decorators (Applied to HuggingFaceChatClient)
17
+ - `@use_function_invocation` - Enables tool/function calling
18
+ - `@use_observability` - Adds OTEL tracing hooks
19
+ - `@use_chat_middleware` - Enables middleware pipeline
20
+
21
+ ### Middleware Base Classes (Available but NOT yet used)
22
+ - `ChatMiddleware` - Intercepts chat client requests
23
+ - `AgentMiddleware` - Intercepts agent invocations
24
+ - `FunctionMiddleware` - Intercepts tool calls
25
+
26
+ ---
27
+
28
+ ## What We Hand-Roll (Custom Implementation)
29
+
30
+ ### Orchestration
31
+ - `AdvancedOrchestrator` - Main research workflow
32
+ - `HierarchicalOrchestrator` - Team-based orchestration
33
+ - `SubIterationMiddleware` - Team→judge loop (workflow, not middleware)
34
+
35
+ ### Clients
36
+ - `HuggingFaceChatClient` - Adapter for HuggingFace Inference API
37
+ - Client factory with auto-detection
38
+
39
+ ### Tools
40
+ - `PubMedTool`, `ClinicalTrialsTool`, `EuropePMCTool`
41
+ - `SearchHandler` - Scatter-gather orchestration
42
+
43
+ ### Services
44
+ - `EmbeddingService` - Local sentence-transformers
45
+ - `LlamaIndexRAG` - Premium OpenAI embeddings
46
+ - `ResearchMemory` - Research state management
47
+
48
+ ---
49
+
50
+ ## Gap Analysis
51
+
52
+ | Component | MS Framework Has | DeepBoner Has | Status |
53
+ |-----------|------------------|---------------|--------|
54
+ | Chat Middleware | `ChatMiddleware` base | Uses decorator only | SPEC-21 |
55
+ | Retry Logic | N/A (left to user) | None | SPEC-21 |
56
+ | Token Tracking | OTEL histograms | None | SPEC-21 |
57
+ | Thread State | `AgentThread` serialization | `ResearchMemory` (no serialization) | P3 |
58
+ | Observability | Full OTEL | structlog only | P3 |
59
+
60
+ ---
61
+
62
+ ## Recommendations
63
+
64
+ 1. **Implement `RetryMiddleware`** using MS `ChatMiddleware` base class
65
+ 2. **Implement `TokenTrackingMiddleware`** for cost visibility
66
+ 3. **Rename `src/middleware/`** to avoid confusion with MS patterns
67
+
68
+ See [SPEC-21](../specs/SPEC-21-MIDDLEWARE-ARCHITECTURE.md) for implementation details.
docs/bugs/p2-hardening-issues.md ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # P2: Hardening Issues
2
+
3
+ **Date:** 2025-12-06
4
+ **Priority:** P2 (Should fix soon)
5
+
6
+ ---
7
+
8
+ ## Issue 1: PubMed JSON Parsing Crash
9
+
10
+ **Status:** SPEC-20 CREATED
11
+ **File:** `src/tools/pubmed.py:88`
12
+
13
+ The PubMed search tool crashes when the API returns non-JSON responses (maintenance pages, error pages). The JSON parsing happens outside the try/except block.
14
+
15
+ **Resolution:** See [SPEC-20: PubMed JSON Parsing Fix](../specs/SPEC-20-PUBMED-JSON-FIX.md)
16
+
17
+ ---
18
+
19
+ ## Issue 2: HuggingFace Client Missing Retry Logic
20
+
21
+ **Status:** SPEC-21 CREATED
22
+ **File:** `src/clients/huggingface.py`
23
+
24
+ The HuggingFaceChatClient has no retry logic for transient errors (429 rate limits, 500 server errors). When the API returns a 429, the entire research workflow crashes.
25
+
26
+ **Resolution:** See [SPEC-21: Middleware Architecture Refactor](../specs/SPEC-21-MIDDLEWARE-ARCHITECTURE.md)
27
+
28
+ ---
29
+
30
+ ## Issue 3: Misleading Middleware Folder Name
31
+
32
+ **Status:** SPEC-21 CREATED
33
+ **File:** `src/middleware/`
34
+
35
+ The `src/middleware/` folder contains `SubIterationMiddleware`, which is actually a workflow pattern (team→judge loop), not interceptor middleware. This is confusing and misleading.
36
+
37
+ **Resolution:** See [SPEC-21: Middleware Architecture Refactor](../specs/SPEC-21-MIDDLEWARE-ARCHITECTURE.md)
38
+
39
+ ---
40
+
41
+ ## Related Documentation
42
+
43
+ - [SPEC-20: PubMed JSON Parsing Fix](../specs/SPEC-20-PUBMED-JSON-FIX.md)
44
+ - [SPEC-21: Middleware Architecture Refactor](../specs/SPEC-21-MIDDLEWARE-ARCHITECTURE.md)
45
+ - [P3: MS Framework Gaps](./p3-ms-framework-gaps.md)
docs/bugs/p3-ms-framework-gaps.md ADDED
@@ -0,0 +1,289 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # P3: Microsoft Agent Framework Gaps Analysis
2
+
3
+ **Date:** 2025-12-06
4
+ **Priority:** P3 (Nice-to-Have)
5
+ **Source:** Comparison with Microsoft Agent Framework v1.0.0b251204 (commit 8c6b12e)
6
+
7
+ ## Executive Summary
8
+
9
+ Comparison of DeepBoner's implementation against Microsoft Agent Framework reveals several architectural patterns we're missing. These are not bugs but opportunities for hardening and production-readiness.
10
+
11
+ ---
12
+
13
+ ## Gap 1: OpenTelemetry Observability (HIGH VALUE)
14
+
15
+ **What MS Framework Has:**
16
+ ```python
17
+ # observability.py - 1706 lines of comprehensive OTEL integration
18
+ from opentelemetry.trace import get_tracer, Span
19
+ from opentelemetry.metrics import get_meter, Histogram
20
+
21
+ @use_observability # Decorator for ChatClient
22
+ @use_agent_observability # Decorator for Agent
23
+
24
+ # Token usage histograms with bucket boundaries
25
+ TOKEN_USAGE_BUCKET_BOUNDARIES = (1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576)
26
+
27
+ # Operation duration histograms
28
+ OPERATION_DURATION_BUCKET_BOUNDARIES = (0.01, 0.02, 0.04, 0.08, 0.16, ...)
29
+
30
+ # 80+ semantic span attributes (OtelAttr enum)
31
+ OtelAttr.GEN_AI_OPERATION_NAME
32
+ OtelAttr.GEN_AI_REQUEST_MODEL
33
+ OtelAttr.GEN_AI_USAGE_INPUT_TOKENS
34
+ OtelAttr.GEN_AI_USAGE_OUTPUT_TOKENS
35
+ ```
36
+
37
+ **What DeepBoner Has:**
38
+ - `structlog` for logging only
39
+ - No distributed tracing
40
+ - No metrics collection
41
+ - No token usage tracking
42
+
43
+ **Gap Impact:**
44
+ - Cannot trace requests across agents
45
+ - No token cost monitoring
46
+ - No performance profiling in production
47
+
48
+ **Recommended Fix:**
49
+ ```python
50
+ # Add optional OTEL support to orchestrator
51
+ # src/observability/__init__.py
52
+ from opentelemetry import trace
53
+ from opentelemetry.metrics import get_meter
54
+
55
+ def setup_observability():
56
+ """One-time setup for OpenTelemetry."""
57
+ ...
58
+
59
+ @contextmanager
60
+ def trace_agent_operation(name: str, attributes: dict):
61
+ """Context manager for tracing agent operations."""
62
+ ...
63
+ ```
64
+
65
+ ---
66
+
67
+ ## Gap 2: Middleware Pipelines (MEDIUM VALUE) - ADDRESSED IN ADR-001
68
+
69
+ > **NOTE:** This gap is being addressed in [ADR-001: Middleware Architecture Refactor](../architecture/adr-001-middleware-refactor.md)
70
+
71
+ **What MS Framework Has:**
72
+ ```python
73
+ # _middleware.py - Three types of middleware
74
+
75
+ class AgentMiddleware(ABC):
76
+ """Intercepts agent invocations."""
77
+ async def process(self, context: AgentRunContext, next): ...
78
+
79
+ class FunctionMiddleware(ABC):
80
+ """Intercepts tool/function calls."""
81
+ async def process(self, context: FunctionInvocationContext, next): ...
82
+
83
+ class ChatMiddleware(ABC):
84
+ """Intercepts chat client requests."""
85
+ async def process(self, context: ChatContext, next): ...
86
+
87
+ # Decorators for easy middleware creation
88
+ @agent_middleware
89
+ async def logging_middleware(context: AgentRunContext, next):
90
+ print(f"Before: {context.agent.name}")
91
+ await next(context)
92
+ print(f"After: {context.result}")
93
+
94
+ # Pipeline execution with terminate support
95
+ context.terminate = True # Stops pipeline early
96
+ ```
97
+
98
+ **What DeepBoner Has:**
99
+ - Uses MS decorators (`@use_chat_middleware`, `@use_observability`) ✓
100
+ - BUT: No custom `ChatMiddleware` class implementations ✗
101
+ - `src/middleware/` folder contains a workflow, not actual middleware ✗
102
+
103
+ **ADR-001 Solution:**
104
+ 1. Rename `src/middleware/` → `src/workflows/` (fix misleading name)
105
+ 2. Create proper `src/middleware/` with MS-pattern implementations:
106
+ - `RetryMiddleware(ChatMiddleware)` - fixes HuggingFace retry bug
107
+ - `TokenTrackingMiddleware(ChatMiddleware)` - enables cost monitoring
108
+ - `LoggingMiddleware(ChatMiddleware)` - structured request/response logs
109
+
110
+ ---
111
+
112
+ ## Gap 3: Thread/Conversation State Management (MEDIUM VALUE)
113
+
114
+ **What MS Framework Has:**
115
+ ```python
116
+ # _threads.py
117
+ class AgentThread:
118
+ """Maintains conversation state with serialization support."""
119
+
120
+ def __init__(self, service_thread_id=None, message_store=None):
121
+ ...
122
+
123
+ async def serialize(self) -> dict[str, Any]:
124
+ """Persist thread state."""
125
+ ...
126
+
127
+ @classmethod
128
+ async def deserialize(cls, state: dict) -> "AgentThread":
129
+ """Restore thread from persisted state."""
130
+ ...
131
+
132
+ class ChatMessageStoreProtocol(Protocol):
133
+ """Protocol for message storage backends."""
134
+ async def list_messages(self) -> list[ChatMessage]: ...
135
+ async def add_messages(self, messages: Sequence[ChatMessage]): ...
136
+ ```
137
+
138
+ **What DeepBoner Has:**
139
+ - `ResearchMemory` for research state only
140
+ - No conversation persistence
141
+ - No serialization/deserialization
142
+
143
+ **Gap Impact:**
144
+ - Cannot resume interrupted research sessions
145
+ - Cannot persist conversation history
146
+ - Cannot implement checkpointing
147
+
148
+ ---
149
+
150
+ ## Gap 4: Function/Tool Configuration (MEDIUM VALUE)
151
+
152
+ **What MS Framework Has:**
153
+ ```python
154
+ # _tools.py
155
+ class FunctionInvocationConfiguration:
156
+ """Configuration for function invocation in chat clients."""
157
+
158
+ enabled: bool = True
159
+ max_iterations: int = 40 # Maximum tool loop iterations
160
+ max_consecutive_errors_per_request: int = 3
161
+ terminate_on_unknown_calls: bool = False
162
+ include_detailed_errors: bool = False
163
+
164
+ class AIFunction:
165
+ """Wraps Python function for AI model calling."""
166
+ approval_mode: Literal["always_require", "never_require"]
167
+ max_invocations: int # Per-function invocation limit
168
+ max_invocation_exceptions: int # Per-function error limit
169
+ invocation_count: int # Tracks usage
170
+ ```
171
+
172
+ **What DeepBoner Has:**
173
+ - `max_iterations` in Settings
174
+ - Basic tool execution
175
+ - No per-tool configuration
176
+ - No approval mode
177
+
178
+ **Gap Impact:**
179
+ - Cannot limit individual tool usage
180
+ - No human-in-the-loop for dangerous tools
181
+ - No per-tool error budgets
182
+
183
+ ---
184
+
185
+ ## Gap 5: Context Provider Lifecycle (LOW VALUE)
186
+
187
+ **What MS Framework Has:**
188
+ ```python
189
+ # _memory.py
190
+ class ContextProvider(ABC):
191
+ """Abstract pattern for injecting context into agent invocations."""
192
+
193
+ async def invoking(self, agent, thread) -> str | None:
194
+ """Called before agent invocation. Returns context to inject."""
195
+ ...
196
+
197
+ async def invoked(self, agent, thread, result):
198
+ """Called after agent invocation."""
199
+ ...
200
+
201
+ async def thread_created(self, thread):
202
+ """Called when new thread is created."""
203
+ ...
204
+
205
+ class AggregateContextProvider(ContextProvider):
206
+ """Combines multiple context providers."""
207
+ ...
208
+ ```
209
+
210
+ **What DeepBoner Has:**
211
+ - `ResearchMemory` as simple state container
212
+ - No lifecycle hooks
213
+ - No provider aggregation
214
+
215
+ ---
216
+
217
+ ## Gap 6: Exception Granularity (LOW VALUE)
218
+
219
+ **What MS Framework Has:**
220
+ ```text
221
+ AgentFrameworkException (base)
222
+ ├── AgentException
223
+ │ ├── AgentExecutionException
224
+ │ ├── AgentInitializationError
225
+ │ └── AgentThreadException
226
+ ├── ChatClientException
227
+ │ └── ChatClientInitializationError
228
+ ├── ServiceException
229
+ │ ├── ServiceInitializationError
230
+ │ ├── ServiceResponseException
231
+ │ │ ├── ServiceContentFilterException
232
+ │ │ ├── ServiceInvalidExecutionSettingsError
233
+ │ │ └── ServiceInvalidResponseError
234
+ │ └── ServiceInvalidAuthError
235
+ ├── ToolException
236
+ │ └── ToolExecutionException
237
+ ├── MiddlewareException
238
+ └── ContentError
239
+ ```
240
+
241
+ **What DeepBoner Has:**
242
+ ```text
243
+ DeepBonerError (base)
244
+ ├── SearchError
245
+ │ └── RateLimitError
246
+ ├── JudgeError
247
+ ├── ConfigurationError
248
+ └── EmbeddingError
249
+ ```
250
+
251
+ **Gap Impact:**
252
+ - Less precise error handling
253
+ - Harder to distinguish error sources
254
+ - Less informative error messages for users
255
+
256
+ ---
257
+
258
+ ## Prioritized Implementation Roadmap
259
+
260
+ ### Phase 1: Quick Wins (1-2 days)
261
+ 1. Add token tracking to orchestrator (no OTEL yet, just counters)
262
+ 2. Add `max_consecutive_errors` to tool execution
263
+
264
+ ### Phase 2: Medium Effort (3-5 days)
265
+ 1. Add basic middleware pattern to orchestrator
266
+ 2. Implement thread serialization for `ResearchMemory`
267
+
268
+ ### Phase 3: Full Production (1-2 weeks)
269
+ 1. Full OpenTelemetry integration
270
+ 2. Complete middleware pipeline
271
+ 3. Context provider lifecycle hooks
272
+
273
+ ---
274
+
275
+ ## Related Issues
276
+
277
+ - **P2 Hardening Issues:** `docs/bugs/p2-hardening-issues.md`
278
+ - **MS Framework Reference:** `reference_repos/microsoft-agent-framework/`
279
+
280
+ ---
281
+
282
+ ## Notes
283
+
284
+ These gaps are P3 because:
285
+ 1. DeepBoner is functional without them
286
+ 2. They're architectural improvements, not bug fixes
287
+ 3. User-facing functionality is not impacted
288
+
289
+ However, for production deployment serving multiple users, Gaps 1 (Observability) and 3 (Thread State) become P1/P2.
docs/specs/README.md ADDED
@@ -0,0 +1,146 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Tech Debt & Bug Fix Specs
2
+
3
+ **Status:** AWAITING SENIOR REVIEW
4
+ **Created:** 2025-12-06
5
+
6
+ ---
7
+
8
+ ## Overview
9
+
10
+ These specs consolidate all identified bugs, tech debt, and architectural issues into phased, implementable work packages. Each spec is designed to be a single PR with TDD, SOLID, DRY, Gang of Four principles.
11
+
12
+ **Implementation Order:** SPEC-20 → SPEC-21 → SPEC-22
13
+
14
+ ---
15
+
16
+ ## Spec Index
17
+
18
+ | Spec | Title | Priority | Effort | Status |
19
+ |------|-------|----------|--------|--------|
20
+ | [SPEC-20](./SPEC-20-PUBMED-JSON-FIX.md) | PubMed JSON Parsing Fix | P2 | 15 min | READY |
21
+ | [SPEC-21](./SPEC-21-MIDDLEWARE-ARCHITECTURE.md) | Middleware Architecture Refactor | P2 | 2 hours | READY |
22
+ | [SPEC-22](./SPEC-22-PROGRESS-BAR-REMOVAL.md) | Progress Bar Removal | P3 | 15 min | READY |
23
+
24
+ **Total Effort:** ~2.5 hours
25
+
26
+ ---
27
+
28
+ ## Why This Order?
29
+
30
+ ### SPEC-20 First (15 min)
31
+ - Quickest win
32
+ - Fixes a real crash bug
33
+ - Builds confidence before larger refactor
34
+ - Single file, single PR
35
+
36
+ ### SPEC-21 Second (2 hours)
37
+ - The big architectural fix
38
+ - Renames confusing folder
39
+ - Implements proper MS framework patterns
40
+ - Fixes HuggingFace retry bug THE RIGHT WAY
41
+ - Adds token tracking
42
+
43
+ ### SPEC-22 Last (15 min)
44
+ - Cosmetic only
45
+ - Can be deferred if needed
46
+ - Easy cleanup
47
+
48
+ ---
49
+
50
+ ## What These Specs Consolidate
51
+
52
+ These specs replace the scattered documentation in:
53
+
54
+ | Old Location | Now Covered By |
55
+ |--------------|----------------|
56
+ | `docs/bugs/p2-hardening-issues.md` Issue 1 | SPEC-20 |
57
+ | `docs/bugs/p2-hardening-issues.md` Issue 2 | SPEC-21 |
58
+ | `docs/architecture/adr-001-middleware-refactor.md` | SPEC-21 |
59
+ | `docs/bugs/p3-progress-bar-positioning.md` | SPEC-22 |
60
+
61
+ ---
62
+
63
+ ## What's NOT In These Specs (Deferred P3)
64
+
65
+ The following are documented but deferred for later:
66
+
67
+ 1. **OpenTelemetry observability** - Nice to have, not blocking
68
+ 2. **Thread state serialization** - Nice to have, not blocking
69
+ 3. **ResearchMemory locks** - Not a bug today (sequential execution)
70
+ 4. **Error path cleanup** - Minor resource leakage, GC handles it
71
+ 5. **Per-tool configuration** - Nice to have
72
+ 6. **Context provider lifecycle** - Nice to have
73
+
74
+ These remain documented in `docs/bugs/p3-ms-framework-gaps.md` for future work.
75
+
76
+ ---
77
+
78
+ ## Implementation Protocol
79
+
80
+ For each spec:
81
+
82
+ 1. **Read the spec** - Understand the problem and solution
83
+ 2. **TDD** - Write failing test first
84
+ 3. **Implement** - Minimal code to pass tests
85
+ 4. **Run `make check`** - Lint + typecheck + test
86
+ 5. **Commit** - Single commit per spec
87
+ 6. **PR** - One PR per spec with spec number in title
88
+
89
+ ---
90
+
91
+ ## Commit Message Format
92
+
93
+ ```
94
+ fix: PubMed JSON parsing (SPEC-20)
95
+
96
+ Moves JSON parsing inside try/except block to handle API
97
+ maintenance pages gracefully. Adds JSONDecodeError handling.
98
+
99
+ Fixes: production crash on PubMed maintenance pages
100
+ ```
101
+
102
+ ```
103
+ refactor: middleware architecture (SPEC-21)
104
+
105
+ - Renames src/middleware → src/workflows (accurate naming)
106
+ - Creates proper src/middleware with ChatMiddleware implementations
107
+ - Implements RetryMiddleware (fixes HuggingFace 429 crashes)
108
+ - Implements TokenTrackingMiddleware (enables cost monitoring)
109
+ ```
110
+
111
+ ```
112
+ fix: remove progress bar overlap (SPEC-22)
113
+
114
+ Removes gr.Progress() from research_agent function.
115
+ Gradio's Progress is incompatible with ChatInterface.
116
+ Emoji status messages in chat output are retained.
117
+ ```
118
+
119
+ ---
120
+
121
+ ## Senior Review Checklist
122
+
123
+ Before implementation, please verify:
124
+
125
+ - [ ] SPEC-20: Fix approach is correct (move into try/except)
126
+ - [ ] SPEC-21: MS middleware pattern is used correctly
127
+ - [ ] SPEC-21: RetryMiddleware implementation follows framework conventions
128
+ - [ ] SPEC-21: Folder rename won't break anything else
129
+ - [ ] SPEC-22: Removing gr.Progress() is the right fix (vs CSS hack)
130
+ - [ ] Order of implementation makes sense
131
+ - [ ] Nothing critical is missing from these specs
132
+
133
+ ---
134
+
135
+ ## After Implementation
136
+
137
+ Once all specs are implemented:
138
+
139
+ 1. Archive old docs:
140
+ - `docs/bugs/p2-hardening-issues.md` → Mark as RESOLVED
141
+ - `docs/architecture/adr-001-middleware-refactor.md` → Delete or archive
142
+ - `docs/bugs/p3-progress-bar-positioning.md` → Mark as RESOLVED
143
+
144
+ 2. Update `docs/bugs/active-bugs.md` to reflect completed fixes
145
+
146
+ 3. Consider v0.2.0 release with these fixes
docs/specs/SPEC-20-PUBMED-JSON-FIX.md ADDED
@@ -0,0 +1,129 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SPEC-20: PubMed JSON Parsing Fix
2
+
3
+ **Status:** COMPLETED
4
+ **Priority:** P2 (Critical - causes crashes)
5
+ **Effort:** 15 minutes
6
+ **PR Scope:** Single file fix
7
+
8
+ ---
9
+
10
+ ## Problem Statement
11
+
12
+ The PubMed search tool crashes when the API returns non-JSON responses (maintenance pages, error pages). The JSON parsing happens **outside** the try/except block.
13
+
14
+ **File:** `src/tools/pubmed.py:88`
15
+ **Crash Type:** `json.JSONDecodeError`
16
+ **User Impact:** Entire research workflow crashes
17
+
18
+ ---
19
+
20
+ ## Current Code (BROKEN)
21
+
22
+ ```python
23
+ # src/tools/pubmed.py - Lines ~80-95
24
+ try:
25
+ search_resp = await client.get(
26
+ f"{NCBI_BASE_URL}/esearch.fcgi",
27
+ params=search_params,
28
+ )
29
+ search_resp.raise_for_status()
30
+ except httpx.HTTPStatusError as e:
31
+ logger.warning("PubMed search failed", status=e.response.status_code)
32
+ return []
33
+
34
+ # ↓↓↓ THIS IS OUTSIDE THE TRY BLOCK ↓↓↓
35
+ search_data = search_resp.json() # CRASHES HERE on maintenance pages
36
+ pmids = search_data.get("esearchresult", {}).get("idlist", [])
37
+ ```
38
+
39
+ ---
40
+
41
+ ## Required Fix
42
+
43
+ Move JSON parsing inside try/except and add `JSONDecodeError` handling:
44
+
45
+ ```python
46
+ # src/tools/pubmed.py - Fixed version
47
+ try:
48
+ search_resp = await client.get(
49
+ f"{NCBI_BASE_URL}/esearch.fcgi",
50
+ params=search_params,
51
+ )
52
+ search_resp.raise_for_status()
53
+ search_data = search_resp.json() # ← MOVED INSIDE TRY
54
+ except httpx.HTTPStatusError as e:
55
+ logger.warning("PubMed search failed", status=e.response.status_code)
56
+ return []
57
+ except json.JSONDecodeError as e:
58
+ logger.warning(
59
+ "PubMed returned invalid JSON (possible maintenance page)",
60
+ error=str(e),
61
+ response_preview=search_resp.text[:200] if search_resp else "N/A",
62
+ )
63
+ return []
64
+
65
+ pmids = search_data.get("esearchresult", {}).get("idlist", [])
66
+ ```
67
+
68
+ ---
69
+
70
+ ## Implementation Checklist
71
+
72
+ - [x] Add `import json` at top of file (if not present)
73
+ - [x] Move `search_resp.json()` inside try block (line ~88)
74
+ - [x] Add `except json.JSONDecodeError` handler
75
+ - [x] Log warning with response preview for debugging
76
+ - [x] Return empty list (graceful degradation)
77
+ - [x] Write unit test: mock response with HTML content
78
+ - [x] Run `make check` (lint + typecheck + test)
79
+
80
+ ---
81
+
82
+ ## Test Case
83
+
84
+ ```python
85
+ # tests/unit/tools/test_pubmed.py
86
+ import pytest
87
+ from unittest.mock import AsyncMock, MagicMock
88
+ from src.tools.pubmed import search_pubmed
89
+
90
+ @pytest.mark.asyncio
91
+ async def test_pubmed_handles_maintenance_page():
92
+ """PubMed should gracefully handle non-JSON responses."""
93
+ # Mock httpx client returning HTML maintenance page
94
+ mock_response = MagicMock()
95
+ mock_response.status_code = 200
96
+ mock_response.text = "<html><body>Service Temporarily Unavailable</body></html>"
97
+ mock_response.json.side_effect = json.JSONDecodeError("Expecting value", "", 0)
98
+ mock_response.raise_for_status = MagicMock()
99
+
100
+ mock_client = AsyncMock()
101
+ mock_client.get.return_value = mock_response
102
+
103
+ # Should return empty list, not crash
104
+ result = await search_pubmed("test query", client=mock_client)
105
+ assert result == []
106
+ ```
107
+
108
+ ---
109
+
110
+ ## Acceptance Criteria
111
+
112
+ 1. `search_pubmed()` returns `[]` when API returns HTML
113
+ 2. Warning logged with response preview
114
+ 3. No `JSONDecodeError` propagates to caller
115
+ 4. All existing tests pass
116
+ 5. `make check` passes
117
+
118
+ ---
119
+
120
+ ## Dependencies
121
+
122
+ None. This is a standalone fix.
123
+
124
+ ---
125
+
126
+ ## Notes
127
+
128
+ - This same pattern may exist in `clinicaltrials.py` and `europepmc.py` - check after this fix
129
+ - Do NOT over-engineer. Single fix, single PR.
docs/specs/SPEC-21-MIDDLEWARE-ARCHITECTURE.md ADDED
@@ -0,0 +1,449 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SPEC-21: Middleware Architecture Refactor
2
+
3
+ **Status:** READY FOR IMPLEMENTATION
4
+ **Priority:** P2 (Architectural hygiene + fixes HuggingFace retry bug)
5
+ **Effort:** 2 hours
6
+ **PR Scope:** Folder rename + new middleware implementations
7
+
8
+ ---
9
+
10
+ ## Problem Statement
11
+
12
+ 1. **Misleading folder name:** `src/middleware/` contains a workflow (`SubIterationMiddleware`), not interceptor middleware
13
+ 2. **Missing retry logic:** `HuggingFaceChatClient` has no retry on 429/transient errors (P2 bug)
14
+ 3. **No token tracking:** Cannot monitor API costs
15
+ 4. **Not using MS framework patterns:** We use decorators but not `ChatMiddleware` base classes
16
+
17
+ ---
18
+
19
+ ## Current State (WRONG)
20
+
21
+ ```text
22
+ src/
23
+ ├── middleware/ ← MISLEADING: contains workflow
24
+ │ ├── __init__.py
25
+ │ ├── sub_iteration.py ← This is a WORKFLOW, not middleware
26
+ │ └── .gitkeep
27
+ ```
28
+
29
+ **HuggingFace client has no retry:**
30
+ ```python
31
+ # src/clients/huggingface.py:263-265
32
+ except Exception as e:
33
+ logger.error("HuggingFace API error", error=str(e))
34
+ raise # ← No retry, crashes on 429
35
+ ```
36
+
37
+ ---
38
+
39
+ ## Target State (CORRECT)
40
+
41
+ ```text
42
+ src/
43
+ ├── workflows/ ← RENAMED: now accurate
44
+ │ ├── __init__.py
45
+ │ └── sub_iteration.py
46
+
47
+ ├── middleware/ ← NEW: actual MS-pattern middleware
48
+ │ ├── __init__.py
49
+ │ ├── retry.py ← RetryMiddleware(ChatMiddleware)
50
+ │ └── token_tracking.py ← TokenTrackingMiddleware(ChatMiddleware)
51
+ ```
52
+
53
+ ---
54
+
55
+ ## Implementation Steps
56
+
57
+ ### Step 1: Rename Folder (5 min)
58
+
59
+ ```bash
60
+ # Rename middleware → workflows
61
+ git mv src/middleware src/workflows
62
+ ```
63
+
64
+ ### Step 2: Update Import (5 min)
65
+
66
+ ```python
67
+ # src/orchestrators/hierarchical.py - Line ~15
68
+ # BEFORE:
69
+ from src.middleware.sub_iteration import SubIterationMiddleware, SubIterationTeam
70
+
71
+ # AFTER:
72
+ from src.workflows.sub_iteration import SubIterationMiddleware, SubIterationTeam
73
+ ```
74
+
75
+ ### Step 3: Create New Middleware Package (10 min)
76
+
77
+ ```python
78
+ # src/middleware/__init__.py
79
+ """Microsoft Agent Framework middleware implementations.
80
+
81
+ These are interceptor-pattern middleware that wrap chat client calls.
82
+ They are NOT workflows - see src/workflows/ for orchestration patterns.
83
+ """
84
+
85
+ from src.middleware.retry import RetryMiddleware
86
+ from src.middleware.token_tracking import TokenTrackingMiddleware
87
+
88
+ __all__ = ["RetryMiddleware", "TokenTrackingMiddleware"]
89
+ ```
90
+
91
+ ### Step 4: Implement RetryMiddleware (30 min)
92
+
93
+ ```python
94
+ # src/middleware/retry.py
95
+ """Retry middleware for chat clients with exponential backoff."""
96
+
97
+ import asyncio
98
+ from typing import Awaitable, Callable
99
+
100
+ import structlog
101
+ from agent_framework._middleware import ChatContext, ChatMiddleware
102
+
103
+ logger = structlog.get_logger()
104
+
105
+
106
+ class RetryMiddleware(ChatMiddleware):
107
+ """Retries failed chat requests with exponential backoff.
108
+
109
+ This middleware intercepts chat client calls and retries on transient
110
+ errors (rate limits, timeouts, server errors).
111
+
112
+ Attributes:
113
+ max_attempts: Maximum number of attempts (default: 3)
114
+ min_wait: Minimum wait between retries in seconds (default: 1.0)
115
+ max_wait: Maximum wait between retries in seconds (default: 10.0)
116
+ retryable_status_codes: HTTP status codes to retry (default: 429, 500, 502, 503, 504)
117
+ """
118
+
119
+ def __init__(
120
+ self,
121
+ max_attempts: int = 3,
122
+ min_wait: float = 1.0,
123
+ max_wait: float = 10.0,
124
+ retryable_status_codes: tuple[int, ...] = (429, 500, 502, 503, 504),
125
+ ) -> None:
126
+ self.max_attempts = max_attempts
127
+ self.min_wait = min_wait
128
+ self.max_wait = max_wait
129
+ self.retryable_status_codes = retryable_status_codes
130
+
131
+ def _is_retryable(self, error: Exception) -> bool:
132
+ """Check if error is retryable."""
133
+ # Check for httpx status errors
134
+ if hasattr(error, "response") and hasattr(error.response, "status_code"):
135
+ return error.response.status_code in self.retryable_status_codes
136
+
137
+ # Check for timeout errors
138
+ error_name = type(error).__name__.lower()
139
+ if "timeout" in error_name:
140
+ return True
141
+
142
+ # Check for connection errors
143
+ if "connection" in error_name:
144
+ return True
145
+
146
+ return False
147
+
148
+ def _calculate_wait(self, attempt: int) -> float:
149
+ """Calculate wait time with exponential backoff."""
150
+ wait = self.min_wait * (2 ** attempt)
151
+ return min(wait, self.max_wait)
152
+
153
+ async def process(
154
+ self, context: ChatContext, next: Callable[[ChatContext], Awaitable[None]]
155
+ ) -> None:
156
+ """Process the chat request with retry logic."""
157
+ last_error: Exception | None = None
158
+
159
+ for attempt in range(self.max_attempts):
160
+ try:
161
+ await next(context)
162
+ return # Success - exit retry loop
163
+
164
+ except Exception as e:
165
+ last_error = e
166
+
167
+ if not self._is_retryable(e):
168
+ logger.warning(
169
+ "Non-retryable error",
170
+ error=str(e),
171
+ error_type=type(e).__name__,
172
+ )
173
+ raise # Don't retry non-retryable errors
174
+
175
+ if attempt < self.max_attempts - 1:
176
+ wait_time = self._calculate_wait(attempt)
177
+ logger.info(
178
+ "Retrying after error",
179
+ attempt=attempt + 1,
180
+ max_attempts=self.max_attempts,
181
+ wait_seconds=wait_time,
182
+ error=str(e),
183
+ )
184
+ await asyncio.sleep(wait_time)
185
+
186
+ # All retries exhausted
187
+ logger.error(
188
+ "All retry attempts failed",
189
+ max_attempts=self.max_attempts,
190
+ last_error=str(last_error),
191
+ )
192
+ if last_error:
193
+ raise last_error
194
+ ```
195
+
196
+ ### Step 5: Implement TokenTrackingMiddleware (20 min)
197
+
198
+ ```python
199
+ # src/middleware/token_tracking.py
200
+ """Token tracking middleware for monitoring API usage."""
201
+
202
+ from contextvars import ContextVar
203
+ from typing import Awaitable, Callable
204
+
205
+ import structlog
206
+ from agent_framework._middleware import ChatContext, ChatMiddleware
207
+
208
+ logger = structlog.get_logger()
209
+
210
+ # ContextVar for per-request token tracking
211
+ _request_tokens: ContextVar[dict[str, int]] = ContextVar(
212
+ "request_tokens",
213
+ default={"input": 0, "output": 0},
214
+ )
215
+
216
+
217
+ class TokenTrackingMiddleware(ChatMiddleware):
218
+ """Tracks token usage across chat requests.
219
+
220
+ This middleware logs token usage after each chat completion
221
+ and maintains running totals for the session.
222
+
223
+ Usage metrics are logged via structlog for observability.
224
+ """
225
+
226
+ def __init__(self) -> None:
227
+ self.total_input_tokens = 0
228
+ self.total_output_tokens = 0
229
+ self.request_count = 0
230
+
231
+ async def process(
232
+ self, context: ChatContext, next: Callable[[ChatContext], Awaitable[None]]
233
+ ) -> None:
234
+ """Process request and track token usage."""
235
+ await next(context)
236
+
237
+ # Extract usage from response if available
238
+ if context.result is None:
239
+ return
240
+
241
+ usage = None
242
+
243
+ # Try to get usage from response
244
+ if hasattr(context.result, "usage"):
245
+ usage = context.result.usage
246
+ elif hasattr(context.result, "messages") and context.result.messages:
247
+ # Check first message for usage metadata
248
+ msg = context.result.messages[0]
249
+ if hasattr(msg, "metadata") and msg.metadata:
250
+ usage = msg.metadata.get("usage")
251
+
252
+ if usage:
253
+ input_tokens = usage.get("input_tokens", 0) or usage.get("prompt_tokens", 0)
254
+ output_tokens = usage.get("output_tokens", 0) or usage.get("completion_tokens", 0)
255
+
256
+ self.total_input_tokens += input_tokens
257
+ self.total_output_tokens += output_tokens
258
+ self.request_count += 1
259
+
260
+ logger.info(
261
+ "Token usage",
262
+ request_input=input_tokens,
263
+ request_output=output_tokens,
264
+ total_input=self.total_input_tokens,
265
+ total_output=self.total_output_tokens,
266
+ total_requests=self.request_count,
267
+ )
268
+
269
+
270
+ def get_token_stats() -> dict[str, int]:
271
+ """Get current request's token usage."""
272
+ return _request_tokens.get().copy()
273
+ ```
274
+
275
+ ### Step 6: Apply Middleware to HuggingFaceChatClient (15 min)
276
+
277
+ ```python
278
+ # src/clients/huggingface.py - Update __init__
279
+
280
+ from src.middleware.retry import RetryMiddleware
281
+ from src.middleware.token_tracking import TokenTrackingMiddleware
282
+
283
+ @use_function_invocation
284
+ @use_observability
285
+ @use_chat_middleware
286
+ class HuggingFaceChatClient(BaseChatClient):
287
+ def __init__(
288
+ self,
289
+ model_id: str | None = None,
290
+ api_key: str | None = None,
291
+ **kwargs: Any,
292
+ ) -> None:
293
+ # Create middleware instances
294
+ middleware = [
295
+ RetryMiddleware(max_attempts=3, min_wait=1.0, max_wait=10.0),
296
+ TokenTrackingMiddleware(),
297
+ ]
298
+
299
+ super().__init__(middleware=middleware, **kwargs)
300
+ # ... rest of __init__
301
+ ```
302
+
303
+ ### Step 7: Update Tests (20 min)
304
+
305
+ ```python
306
+ # tests/unit/middleware/test_retry.py
307
+ import pytest
308
+ from unittest.mock import AsyncMock, MagicMock
309
+ from src.middleware.retry import RetryMiddleware
310
+
311
+
312
+ @pytest.mark.asyncio
313
+ async def test_retry_middleware_succeeds_first_try():
314
+ """RetryMiddleware should pass through on success."""
315
+ middleware = RetryMiddleware(max_attempts=3)
316
+ context = MagicMock()
317
+ next_fn = AsyncMock()
318
+
319
+ await middleware.process(context, next_fn)
320
+
321
+ next_fn.assert_called_once_with(context)
322
+
323
+
324
+ @pytest.mark.asyncio
325
+ async def test_retry_middleware_retries_on_429():
326
+ """RetryMiddleware should retry on 429 rate limit."""
327
+ middleware = RetryMiddleware(max_attempts=3, min_wait=0.01)
328
+ context = MagicMock()
329
+
330
+ # First two calls fail with 429, third succeeds
331
+ call_count = 0
332
+
333
+ async def mock_next(ctx):
334
+ nonlocal call_count
335
+ call_count += 1
336
+ if call_count < 3:
337
+ error = Exception("Rate limited")
338
+ error.response = MagicMock(status_code=429)
339
+ raise error
340
+
341
+ await middleware.process(context, mock_next)
342
+ assert call_count == 3
343
+
344
+
345
+ @pytest.mark.asyncio
346
+ async def test_retry_middleware_raises_after_max_attempts():
347
+ """RetryMiddleware should raise after max attempts exhausted."""
348
+ middleware = RetryMiddleware(max_attempts=2, min_wait=0.01)
349
+ context = MagicMock()
350
+
351
+ async def always_fails(ctx):
352
+ error = Exception("Always fails")
353
+ error.response = MagicMock(status_code=500)
354
+ raise error
355
+
356
+ with pytest.raises(Exception, match="Always fails"):
357
+ await middleware.process(context, always_fails)
358
+ ```
359
+
360
+ ---
361
+
362
+ ## Implementation Checklist
363
+
364
+ ### Phase 1: Folder Rename
365
+ - [ ] `git mv src/middleware src/workflows`
366
+ - [ ] Update import in `src/orchestrators/hierarchical.py`
367
+ - [ ] Update `src/workflows/__init__.py` docstring
368
+ - [ ] Run `make check` - verify no import errors
369
+
370
+ ### Phase 2: Create Middleware Package
371
+ - [ ] Create `src/middleware/__init__.py`
372
+ - [ ] Create `src/middleware/retry.py` with `RetryMiddleware`
373
+ - [ ] Create `src/middleware/token_tracking.py` with `TokenTrackingMiddleware`
374
+ - [ ] Run `make check` - verify no syntax errors
375
+
376
+ ### Phase 3: Apply to Client
377
+ - [ ] Update `src/clients/huggingface.py` to use middleware
378
+ - [ ] Test manually: `uv run python -c "from src.clients.huggingface import HuggingFaceChatClient; print('OK')"`
379
+ - [ ] Run `make check`
380
+
381
+ ### Phase 4: Tests
382
+ - [ ] Create `tests/unit/middleware/__init__.py`
383
+ - [ ] Create `tests/unit/middleware/test_retry.py`
384
+ - [ ] Create `tests/unit/middleware/test_token_tracking.py`
385
+ - [ ] Run `make test` - all tests pass
386
+
387
+ ### Phase 5: Cleanup
388
+ - [ ] Remove `.gitkeep` from `src/workflows/` if present
389
+ - [ ] Run full `make check`
390
+ - [ ] Commit with message: "refactor: implement proper middleware architecture (SPEC-21)"
391
+
392
+ ---
393
+
394
+ ## Acceptance Criteria
395
+
396
+ 1. `src/middleware/` folder contains actual `ChatMiddleware` implementations
397
+ 2. `src/workflows/` folder contains `SubIterationMiddleware` (renamed from middleware)
398
+ 3. `HuggingFaceChatClient` uses `RetryMiddleware` - no more crashes on 429
399
+ 4. Token usage is logged via `TokenTrackingMiddleware`
400
+ 5. All existing tests pass
401
+ 6. `make check` passes
402
+ 7. No import errors anywhere in codebase
403
+
404
+ ---
405
+
406
+ ## Dependencies
407
+
408
+ - **SPEC-20** should be done first (simpler, builds confidence)
409
+ - Requires `agent-framework-core` package (already installed)
410
+
411
+ ---
412
+
413
+ ## Gotchas & Nuances
414
+
415
+ 1. **MS middleware signature:** The `process` method takes `(context, next)` where `next` is a callable
416
+ 2. **Middleware order matters:** Retry should be FIRST so it wraps everything
417
+ 3. **ContextVar for token tracking:** Use ContextVar for per-request isolation
418
+ 4. **Don't break HierarchicalOrchestrator:** It uses `SubIterationMiddleware` - update import path
419
+ 5. **BaseChatClient constructor:** Check if it accepts `middleware=` parameter - may need to register differently
420
+
421
+ ---
422
+
423
+ ## Testing the Fix
424
+
425
+ After implementation, verify 429 handling:
426
+
427
+ ```python
428
+ # Manual test
429
+ import asyncio
430
+ from src.clients.huggingface import HuggingFaceChatClient
431
+ from agent_framework import ChatMessage, ChatOptions
432
+
433
+ async def test():
434
+ client = HuggingFaceChatClient()
435
+ # Make rapid requests to trigger rate limit
436
+ for i in range(10):
437
+ try:
438
+ resp = await client.get_response(
439
+ messages=[ChatMessage(role="user", text="Hello")],
440
+ chat_options=ChatOptions(),
441
+ )
442
+ print(f"Request {i}: OK")
443
+ except Exception as e:
444
+ print(f"Request {i}: {e}")
445
+
446
+ asyncio.run(test())
447
+ ```
448
+
449
+ Should see retry logs instead of immediate crashes.
docs/specs/SPEC-22-PROGRESS-BAR-REMOVAL.md ADDED
@@ -0,0 +1,127 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SPEC-22: Progress Bar Removal
2
+
3
+ **Status:** READY FOR IMPLEMENTATION
4
+ **Priority:** P3 (Cosmetic UX fix)
5
+ **Effort:** 15 minutes
6
+ **PR Scope:** Single file fix
7
+
8
+ ---
9
+
10
+ ## Problem Statement
11
+
12
+ The `gr.Progress()` bar conflicts with Gradio's `ChatInterface`, causing visual glitches:
13
+ - Progress bar "floats" in the middle of chat output
14
+ - Text overlaps with progress bar
15
+ - Looks unprofessional
16
+
17
+ **Root Cause:** `gr.Progress()` is designed for `gr.Interface`, not `ChatInterface`. It's a known Gradio limitation.
18
+
19
+ ---
20
+
21
+ ## Current Code (BROKEN)
22
+
23
+ ```python
24
+ # src/app.py - research_agent function
25
+ async def research_agent(
26
+ message: str,
27
+ history: list[dict[str, Any]],
28
+ domain: str = "sexual_health",
29
+ api_key: str = "",
30
+ api_key_state: str = "",
31
+ progress: gr.Progress = gr.Progress(), # ← PROBLEM: Causes visual glitches
32
+ ) -> AsyncGenerator[str, None]:
33
+ ...
34
+ if event.type == "started":
35
+ progress(0, desc="Starting research...") # ← These cause overlap
36
+ elif event.type == "progress":
37
+ progress(p, desc=event.message)
38
+ ```
39
+
40
+ ---
41
+
42
+ ## Required Fix
43
+
44
+ Remove `gr.Progress()` entirely. We already have emoji status messages in chat output.
45
+
46
+ ```python
47
+ # src/app.py - Fixed version
48
+ async def research_agent(
49
+ message: str,
50
+ history: list[dict[str, Any]],
51
+ domain: str = "sexual_health",
52
+ api_key: str = "",
53
+ api_key_state: str = "",
54
+ # REMOVED: progress: gr.Progress = gr.Progress(),
55
+ ) -> AsyncGenerator[str, None]:
56
+ ...
57
+ # REMOVED: All progress(...) calls
58
+
59
+ # KEEP: Emoji status messages are already being yielded
60
+ # These work great with ChatInterface:
61
+ # yield "⏱️ **PROGRESS**: Round 1/5 (~3m 0s remaining)"
62
+ ```
63
+
64
+ ---
65
+
66
+ ## Implementation Checklist
67
+
68
+ - [ ] Open `src/app.py`
69
+ - [ ] Remove `progress: gr.Progress = gr.Progress()` from `research_agent` signature
70
+ - [ ] Remove all `progress(...)` calls inside `research_agent`
71
+ - [ ] Verify emoji status yields are still present (they should be)
72
+ - [ ] Run `uv run python -c "from src.app import create_demo; print('OK')"`
73
+ - [ ] Run `make check`
74
+ - [ ] Test locally: `uv run python src/app.py` and verify no floating progress bar
75
+
76
+ ---
77
+
78
+ ## What We Keep
79
+
80
+ The emoji status messages in chat output:
81
+
82
+ ```
83
+ ⏱️ **PROGRESS**: Round 1/5 (~3m 0s remaining)
84
+ 🔬 **Step 2: SearchAgent** - Searching for evidence...
85
+ ✅ **COMPLETE**: Research finished in 45 seconds
86
+ ```
87
+
88
+ These are yielded directly to chat and work perfectly with `ChatInterface`.
89
+
90
+ ---
91
+
92
+ ## Acceptance Criteria
93
+
94
+ 1. No `gr.Progress()` in `research_agent` function signature
95
+ 2. No `progress(...)` calls in `research_agent` function body
96
+ 3. Emoji status messages still appear in chat output
97
+ 4. No floating/overlapping progress bar in UI
98
+ 5. `make check` passes
99
+
100
+ ---
101
+
102
+ ## Dependencies
103
+
104
+ None. This is a standalone cosmetic fix.
105
+
106
+ ---
107
+
108
+ ## Testing
109
+
110
+ ```bash
111
+ # Start local server
112
+ uv run python src/app.py
113
+
114
+ # In browser:
115
+ # 1. Submit a research query
116
+ # 2. Verify NO floating progress bar appears
117
+ # 3. Verify emoji status messages DO appear in chat
118
+ # 4. Verify chat messages don't have visual glitches
119
+ ```
120
+
121
+ ---
122
+
123
+ ## Notes
124
+
125
+ - This is the recommended fix from Gradio's own documentation
126
+ - `ChatInterface.show_progress="minimal"` (default) still shows a spinner, which is fine
127
+ - If we need a real progress bar later, we'd need to refactor to `gr.Blocks` wrapper
src/tools/pubmed.py CHANGED
@@ -1,5 +1,6 @@
1
  """PubMed search tool using NCBI E-utilities."""
2
 
 
3
  from typing import Any
4
 
5
  import httpx
@@ -80,12 +81,19 @@ class PubMedTool:
80
  params=search_params,
81
  )
82
  search_resp.raise_for_status()
 
83
  except httpx.HTTPStatusError as e:
84
  if e.response.status_code == self.HTTP_TOO_MANY_REQUESTS:
85
  raise RateLimitError("PubMed rate limit exceeded") from e
86
  raise SearchError(f"PubMed search failed: {e}") from e
 
 
 
 
 
 
 
87
 
88
- search_data = search_resp.json()
89
  pmids = search_data.get("esearchresult", {}).get("idlist", [])
90
 
91
  if not pmids:
 
1
  """PubMed search tool using NCBI E-utilities."""
2
 
3
+ import json
4
  from typing import Any
5
 
6
  import httpx
 
81
  params=search_params,
82
  )
83
  search_resp.raise_for_status()
84
+ search_data = search_resp.json()
85
  except httpx.HTTPStatusError as e:
86
  if e.response.status_code == self.HTTP_TOO_MANY_REQUESTS:
87
  raise RateLimitError("PubMed rate limit exceeded") from e
88
  raise SearchError(f"PubMed search failed: {e}") from e
89
+ except json.JSONDecodeError as e:
90
+ logger.warning(
91
+ "PubMed returned invalid JSON (possible maintenance page)",
92
+ error=str(e),
93
+ response_preview=search_resp.text[:200] if search_resp else "N/A",
94
+ )
95
+ return []
96
 
 
97
  pmids = search_data.get("esearchresult", {}).get("idlist", [])
98
 
99
  if not pmids:
tests/unit/tools/test_pubmed.py CHANGED
@@ -1,5 +1,6 @@
1
  """Unit tests for PubMed tool."""
2
 
 
3
  from unittest.mock import AsyncMock, MagicMock
4
 
5
  import pytest
@@ -150,3 +151,25 @@ class TestPubMedTool:
150
  assert "help" not in term.lower()
151
  # "low libido" should be expanded
152
  assert "HSDD" in term or "hypoactive" in term
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  """Unit tests for PubMed tool."""
2
 
3
+ import json
4
  from unittest.mock import AsyncMock, MagicMock
5
 
6
  import pytest
 
151
  assert "help" not in term.lower()
152
  # "low libido" should be expanded
153
  assert "HSDD" in term or "hypoactive" in term
154
+
155
+ @pytest.mark.asyncio
156
+ async def test_search_handles_maintenance_page(self, mocker):
157
+ """PubMedTool should gracefully handle non-JSON responses (maintenance pages)."""
158
+ # Mock response that returns HTML instead of JSON (maintenance page)
159
+ mock_response = MagicMock()
160
+ mock_response.status_code = 200
161
+ mock_response.text = "<html><body>Service Temporarily Unavailable</body></html>"
162
+ mock_response.json.side_effect = json.JSONDecodeError("Expecting value", "", 0)
163
+ mock_response.raise_for_status = MagicMock()
164
+
165
+ mock_client = AsyncMock()
166
+ mock_client.get = AsyncMock(return_value=mock_response)
167
+ mock_client.__aenter__ = AsyncMock(return_value=mock_client)
168
+ mock_client.__aexit__ = AsyncMock(return_value=None)
169
+
170
+ mocker.patch("httpx.AsyncClient", return_value=mock_client)
171
+
172
+ tool = PubMedTool()
173
+ # Should return empty list, not crash
174
+ results = await tool.search("test query")
175
+ assert results == []