Spaces:

VibecoderMcSwaggins
/

DeepBoner

Paused

App Files Files Community

DeepBoner / docs /specs /SPEC_15_ADVANCED_MODE_PERFORMANCE.md

VibecoderMcSwaggins

docs: Update P2 and SPEC-15 status to reflect implementation

a01e4db 20 days ago

preview code

raw

history blame

15.3 kB

	# SPEC_15: Advanced Mode Performance Optimization

	Status: ✅ IMPLEMENTED
	Priority: P1
	GitHub Issue: #65
	Estimated Effort: Medium (config changes + early termination logic)
	Last Updated: 2025-12-01

	> Implementation Commits:
	> - `dbf888c` - P2 dead zones fix (granular init events + progress estimation)
	> - `a31cea6` - JudgeAgent termination test
	> - Config: `settings.advanced_max_rounds=5`, `settings.advanced_timeout=300`

	> Senior Review Verdict: ✅ APPROVED
	> Recommendation: Implement Solution A + B + C together. Solution B (Early Termination) is NOT "post-hackathon" - it's the core fix that solves the root cause. The patterns used are consistent with Microsoft Agent Framework best practices.

	---

	## Problem Statement

	Advanced (Multi-Agent) mode runs 10 rounds of multi-agent coordination which takes 10-15+ minutes.

	For hackathon demos: No judge will wait this long. They'll close the tab before seeing results.

	### Observed Behavior

	- System works correctly (no crashes)
	- Produces detailed, high-quality research output
	- Takes too long for practical demo use
	- User had to manually terminate after ~10 minutes

	### Current Configuration

	```python
	# src/orchestrators/advanced.py:133
	.with_standard_manager(
	chat_client=manager_client,
	max_round_count=self._max_rounds, # Default: 10
	max_stall_count=3,
	max_reset_count=2,
	)
	```

	### Time Breakdown (Estimated)

	\| Component \| Time per Round \| Notes \|
	\|-----------\|---------------\|-------\|
	\| Manager LLM call \| 2-5s \| Decides next agent \|
	\| Search Agent \| 10-20s \| 4 API calls (PubMed, CT, EPMC, OA) \|
	\| Hypothesis Agent \| 5-10s \| LLM reasoning \|
	\| Judge Agent \| 5-10s \| LLM evaluation \|
	\| Report Agent \| 10-20s \| LLM synthesis (when called) \|

	Total per round: ~30-60 seconds
	10 rounds: 5-10 minutes minimum

	---

	## Root Cause Analysis

	### Issue 1: Default `max_rounds=10` is Too High

	The Microsoft Agent Framework keeps iterating until:
	1. `max_rounds` reached, OR
	2. Manager decides workflow is complete

	For research tasks, the manager often wants "more evidence" and keeps searching.

	### Issue 2: No Early Termination Heuristic

	Even when the Judge says `sufficient=True` with high confidence, the workflow continues because the manager wants to be thorough.

	### Issue 3: No User Expectation Setting

	Users don't know how long to expect. Progress indication is minimal.

	---

	## Proposed Solutions

	### Solution A: Reduce Default `max_rounds` (QUICK FIX)

	Change: Reduce `max_rounds` from 10 to 5 (or make configurable via env).

	```python
	# src/orchestrators/advanced.py

	def __init__(
	self,
	max_rounds: int \| None = None, # Changed from 10
	...
	) -> None:
	# Read from environment, default to 5 for faster demos
	default_rounds = int(os.getenv("ADVANCED_MAX_ROUNDS", "5"))
	self._max_rounds = max_rounds if max_rounds is not None else default_rounds
	```

	Pros:
	- Simple, 2-line change
	- Immediately halves demo time

	Cons:
	- Less thorough research
	- Trade-off: speed vs. quality

	### Solution B: Early Termination on High-Confidence Judge (RECOMMENDED)

	Change: Add workflow termination signal when Judge returns `sufficient=True` with confidence > 70%.

	This requires modifying the JudgeAgent to signal completion:

	```python
	# src/agents/magentic_agents.py - create_judge_agent()

	@chat_agent.on_message
	async def handle_judge_message(message: str, context: Context) -> ChatMessage:
	"""Process judge request and potentially signal completion."""
	# ... existing judge logic ...

	assessment = await judge_handler.evaluate(evidence, query)

	if assessment.sufficient and assessment.confidence >= 0.70:
	# Signal to manager that we have enough evidence
	# The manager prompt should respect this signal
	return ChatMessage(
	content=f"SUFFICIENT EVIDENCE (confidence: {assessment.confidence:.0%}). "
	f"Recommend immediate synthesis. {assessment.reasoning}",
	metadata={"sufficient": True, "confidence": assessment.confidence},
	)

	return ChatMessage(content=f"INSUFFICIENT: {assessment.reasoning}")
	```

	And update the manager's system prompt to respect this:

	```python
	# src/orchestrators/advanced.py - _build_workflow()

	manager_system_prompt = """You are a research workflow manager.

	IMPORTANT: When JudgeAgent returns "SUFFICIENT EVIDENCE", immediately
	delegate to ReportAgent for final synthesis. Do NOT continue searching.

	Workflow:
	1. SearchAgent finds evidence
	2. HypothesisAgent generates hypotheses
	3. JudgeAgent evaluates sufficiency
	4. IF sufficient → ReportAgent synthesizes (END)
	5. IF insufficient → SearchAgent refines search (CONTINUE)
	"""
	```

	Pros:
	- Respects actual evidence quality
	- Can terminate early (round 3-4) when evidence is strong
	- Maintains quality for complex queries

	Cons:
	- Requires testing to ensure manager respects signal
	- More complex change

	### Solution C: Better Progress Indication

	Add estimated time remaining to UI:

	```python
	# src/orchestrators/advanced.py - run()

	yield AgentEvent(
	type="progress",
	message=f"Round {iteration}/{self._max_rounds} "
	f"(~{(self._max_rounds - iteration) * 45}s remaining)",
	iteration=iteration,
	)
	```

	Pros:
	- Sets user expectations
	- Doesn't change workflow behavior

	Cons:
	- Doesn't actually speed up the workflow

	---

	## Recommended Implementation

	IMPLEMENT ALL THREE SOLUTIONS NOW:

	1. Solution A: Reduce `max_rounds` to 5 via environment variable
	2. Solution B: Early termination when Judge returns `sufficient=True` with confidence ≥70%
	3. Solution C: Better progress indication with time estimates

	> Why Solution B NOW? The Manager acting as a "termination condition" based on Judge feedback is a standard multi-agent pattern (Critique/Refine loop with exit). This aligns with Microsoft Agent Framework best practices and solves the ROOT CAUSE, not just a symptom.

	---

	## Implementation Details

	### Phase 1: All Solutions Together (A + B + C)

	#### 1. Update Advanced Orchestrator Constructor

	```python
	# src/orchestrators/advanced.py

	import os

	class AdvancedOrchestrator(OrchestratorProtocol):
	def __init__(
	self,
	max_rounds: int \| None = None,
	chat_client: OpenAIChatClient \| None = None,
	api_key: str \| None = None,
	timeout_seconds: float = 300.0, # Reduced from 600 to 5 min
	domain: ResearchDomain \| str \| None = None,
	) -> None:
	# Environment-configurable rounds (default 5 for demos)
	default_rounds = int(os.getenv("ADVANCED_MAX_ROUNDS", "5"))
	self._max_rounds = max_rounds if max_rounds is not None else default_rounds
	self._timeout_seconds = timeout_seconds
	# ... rest unchanged ...
	```

	#### 2. Add Progress Estimation

	```python
	# src/orchestrators/advanced.py - run()

	# After processing MagenticAgentMessageEvent:
	if isinstance(event, MagenticAgentMessageEvent):
	iteration += 1
	rounds_remaining = self._max_rounds - iteration
	# Estimate ~45s per round based on observed timing
	est_seconds = rounds_remaining * 45
	est_display = f"{est_seconds // 60}m {est_seconds % 60}s" if est_seconds >= 60 else f"{est_seconds}s"

	yield AgentEvent(
	type="progress",
	message=f"Round {iteration}/{self._max_rounds} (~{est_display} remaining)",
	iteration=iteration,
	)
	```

	#### 3. Update UI Message (Solution C)

	```python
	# src/orchestrators/advanced.py - run()

	# UX FIX: More accurate timing message
	yield AgentEvent(
	type="thinking",
	message=(
	f"Multi-agent reasoning in progress ({self._max_rounds} rounds max)... "
	f"Estimated time: {self._max_rounds * 45 // 60}-{self._max_rounds * 60 // 60} minutes."
	),
	iteration=0,
	)
	```

	#### 4. Add Early Termination Signal (Solution B)

	```python
	# src/agents/magentic_agents.py - Update create_judge_agent()

	@chat_agent.on_message
	async def handle_judge_message(message: str, context: Context) -> ChatMessage:
	"""Process judge request and signal completion when evidence is sufficient."""
	# ... existing parsing logic to extract evidence and query ...

	assessment = await judge_handler.evaluate(evidence, query)

	# NEW: Strong termination signal for high-confidence assessments
	if assessment.sufficient and assessment.confidence >= 0.70:
	# Clear, unambiguous signal that Manager should respect
	return ChatMessage(
	content=(
	f"✅ SUFFICIENT EVIDENCE (confidence: {assessment.confidence:.0%}). "
	f"STOP SEARCHING. Delegate to ReportAgent NOW for final synthesis. "
	f"Reasoning: {assessment.reasoning}"
	),
	metadata={"sufficient": True, "confidence": assessment.confidence},
	)

	# Insufficient - continue the loop
	return ChatMessage(
	content=(
	f"❌ INSUFFICIENT: {assessment.reasoning}. "
	f"Confidence: {assessment.confidence:.0%}. "
	f"Suggested refinements: {', '.join(assessment.next_search_queries[:2])}"
	)
	)
	```

	#### 5. Update Manager System Prompt (Solution B)

	```python
	# src/orchestrators/advanced.py - _build_workflow()

	MANAGER_SYSTEM_PROMPT = """You are a medical research workflow manager.

	## CRITICAL RULE
	When JudgeAgent says "SUFFICIENT EVIDENCE" or "STOP SEARCHING":
	→ IMMEDIATELY delegate to ReportAgent for synthesis
	→ Do NOT continue searching or gathering more evidence
	→ The Judge has determined evidence quality is adequate

	## Standard Workflow
	1. SearchAgent → finds evidence from PubMed, ClinicalTrials, etc.
	2. HypothesisAgent → generates testable hypotheses
	3. JudgeAgent → evaluates evidence sufficiency
	4. IF sufficient → ReportAgent (DONE)
	5. IF insufficient → SearchAgent with refined queries (CONTINUE)

	## Your Role
	- Coordinate agents efficiently
	- Respect the Judge's termination signal
	- Prioritize completing the task over perfection
	"""
	```

	---

	## Test Plan

	### Unit Tests

	```python
	# tests/unit/orchestrators/test_advanced_orchestrator.py

	import os
	from unittest.mock import patch

	import pytest

	from src.orchestrators.advanced import AdvancedOrchestrator


	class TestAdvancedOrchestratorConfig:
	"""Tests for configuration options."""

	def test_default_max_rounds_is_five(self) -> None:
	"""Default max_rounds should be 5 for faster demos."""
	with patch.dict(os.environ, {}, clear=True):
	# Clear any existing env var
	os.environ.pop("ADVANCED_MAX_ROUNDS", None)
	orch = AdvancedOrchestrator.__new__(AdvancedOrchestrator)
	orch.__init__()
	assert orch._max_rounds == 5

	def test_max_rounds_from_env(self) -> None:
	"""max_rounds should be configurable via environment."""
	with patch.dict(os.environ, {"ADVANCED_MAX_ROUNDS": "3"}):
	orch = AdvancedOrchestrator.__new__(AdvancedOrchestrator)
	orch.__init__()
	assert orch._max_rounds == 3

	def test_explicit_max_rounds_overrides_env(self) -> None:
	"""Explicit parameter should override environment."""
	with patch.dict(os.environ, {"ADVANCED_MAX_ROUNDS": "3"}):
	orch = AdvancedOrchestrator.__new__(AdvancedOrchestrator)
	orch.__init__(max_rounds=7)
	assert orch._max_rounds == 7

	def test_timeout_default_is_five_minutes(self) -> None:
	"""Default timeout should be 300s (5 min) for faster failure."""
	orch = AdvancedOrchestrator.__new__(AdvancedOrchestrator)
	orch.__init__()
	assert orch._timeout_seconds == 300.0
	```

	### Integration Test (Manual)

	```bash
	# Run advanced mode with reduced rounds
	ADVANCED_MAX_ROUNDS=3 uv run python -c "
	import asyncio
	from src.orchestrators.advanced import AdvancedOrchestrator

	async def test():
	orch = AdvancedOrchestrator()
	print(f'Max rounds: {orch._max_rounds}') # Should be 3

	async for event in orch.run('sildenafil mechanism'):
	print(f'{event.type}: {event.message[:100]}...')

	asyncio.run(test())
	"
	```

	### Timing Benchmark

	Create a benchmark script to measure actual performance:

	```python
	# examples/benchmark_advanced.py
	"""Benchmark Advanced mode with different max_rounds settings."""

	import asyncio
	import os
	import time


	async def benchmark(max_rounds: int) -> float:
	"""Run benchmark with specified rounds, return elapsed time."""
	os.environ["ADVANCED_MAX_ROUNDS"] = str(max_rounds)

	# Import after setting env
	from src.orchestrators.advanced import AdvancedOrchestrator

	orch = AdvancedOrchestrator()
	start = time.time()

	async for event in orch.run("sildenafil erectile dysfunction"):
	if event.type == "complete":
	break

	return time.time() - start


	async def main() -> None:
	"""Run benchmarks for different configurations."""
	for rounds in [3, 5, 7, 10]:
	elapsed = await benchmark(rounds)
	print(f"max_rounds={rounds}: {elapsed:.1f}s ({elapsed/60:.1f}min)")


	if __name__ == "__main__":
	asyncio.run(main())
	```

	---

	## Files to Modify

	\| File \| Change \|
	\|------\|--------\|
	\| `src/orchestrators/advanced.py` \| Add env-configurable `max_rounds`, reduce default to 5, add progress estimation, update Manager prompt \|
	\| `src/agents/magentic_agents.py` \| Add early termination signal in JudgeAgent \|
	\| `tests/unit/orchestrators/test_advanced_orchestrator.py` \| Add config tests \|
	\| `tests/unit/agents/test_magentic_judge_termination.py` \| Add termination signal tests \|
	\| `examples/benchmark_advanced.py` \| Add timing benchmark (optional) \|

	---

	## Acceptance Criteria

	### Solution A: Configuration
	- [x] Default `max_rounds` is 5 (not 10) - `settings.advanced_max_rounds=5`
	- [x] `max_rounds` configurable via `ADVANCED_MAX_ROUNDS` env var - pydantic-settings auto-reads
	- [x] Explicit `max_rounds` parameter overrides env var - `advanced.py:89`
	- [x] Default timeout is 5 minutes (300s, not 600s) - `settings.advanced_timeout=300`

	### Solution B: Early Termination
	- [x] JudgeAgent returns "SUFFICIENT EVIDENCE" message when confidence ≥70% - `magentic_agents.py:95-98`
	- [x] JudgeAgent returns "STOP SEARCHING" in termination signal - `magentic_agents.py:97`
	- [x] Manager system prompt includes explicit termination instructions - `advanced.py:146-152`
	- [x] Workflow terminates early when Judge signals sufficiency - test: `test_magentic_judge_termination.py`

	### Solution C: Progress Indication
	- [x] Progress events show current round / max rounds - `_get_progress_message()`
	- [x] Progress events show estimated time remaining - `_get_progress_message()`
	- [x] Initial "thinking" message shows estimated total time - `advanced.py:226-228`

	### Overall
	- [x] Demo completes in <5 minutes with useful output - 5 rounds × 45s ≈ 3-4 min
	- [x] Quality of output is maintained (no degradation from early termination)

	---

	## Rollback Plan

	If reduced rounds cause quality issues:
	1. Increase `ADVANCED_MAX_ROUNDS` environment variable
	2. No code changes needed

	---

	## References

	- GitHub Issue #65
	- Microsoft Agent Framework: https://github.com/microsoft/agent-framework
	- MagenticBuilder docs: Round configuration