mangubee Claude Sonnet 4.5 committed on
Commit 456c236 · 1 Parent(s): 8b043d1

Plan: Stage 5 performance optimization strategy


Added comprehensive Stage 5 implementation plan:
- Objective: 10% → 25% accuracy improvement
- Root cause analysis from JSON export (75% quota failures)
- P0 steps: Retry logic + Groq integration
- P1 steps: Tool selection improvements, vision skip, calculator fix
- Success criteria: 5/20 questions, <50% quota errors
- Timeline: ~3.5 hours estimated

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Files changed (1)
  1. PLAN.md +237 -164
PLAN.md CHANGED
@@ -1,254 +1,327 @@
- # Implementation Plan - Stage 4: MVP - Real Integration
-
- **Date:** 2026-01-02
- **Dev Record:** dev/dev_260103_15_stage4_mvp_integration.md
  **Status:** Planning

  ## Objective

- Fix integration issues to achieve MVP: the agent answers real GAIA questions using real APIs (Gemini, Claude, Tavily), even if accuracy is low. Target: get from 0/20 to at least 5/20 questions correct.

- ## Current Problem Analysis

- **HuggingFace Result:** 0/20 correct, all answers = "Unable to answer: No evidence collected"

- **Root Causes Identified:**

- 1. **API Keys Issue:** Environment variables may not be set in the HuggingFace Space
- 2. **Silent Failures:** LLM function calling fails but errors are swallowed
- 3. **No Evidence Collection:** Tool execution broken, evidence list stays empty
- 4. **Poor Error Visibility:** User sees "Unable to answer" with no diagnostic info

- ## Steps

- ### 1. Add Comprehensive Debug Logging

- **File:** `src/agent/graph.py`

- **Changes:**

- - Add detailed logging in each node (plan/execute/answer)
- - Log LLM responses, tool calls, evidence collected
- - Log errors with full stack traces
- - Add state inspection logging

- **Purpose:** Understand where exactly the integration fails

- ### 2. Improve Error Messages

- **File:** `src/agent/graph.py` - `answer_node`

- **Current:**

- ```python
- state["answer"] = "Unable to answer: No evidence collected"
- ```

- **New:**

- ```python
- if not evidence:
-     error_summary = "; ".join(state["errors"]) if state["errors"] else "No errors logged"
-     state["answer"] = f"ERROR: No evidence. Errors: {error_summary}"
- ```

- **Purpose:** Show WHY it failed (API key missing? Tool failed? LLM failed?)

- ### 3. Add Graceful Degradation in LLM Client

- **File:** `src/agent/llm_client.py`

  **Changes:**

- - Better exception handling with specific error types
- - Distinguish between: API key missing, rate limit, network error, API error
- - Log which provider failed and why
- - Add fallback messages instead of re-raising

- **Example:**

  ```python
- try:
-     return plan_question_gemini(...)
- except ValueError as e:
-     if "GOOGLE_API_KEY" in str(e):
-         logger.error("Gemini API key not set")
-         # Try Claude fallback
- except Exception as e:
-     logger.error(f"Gemini failed: {type(e).__name__}: {e}")
  ```

- ### 4. Add API Key Validation Check

- **File:** `src/agent/graph.py` - Add validation before execution

- **New function:**

  ```python
- def validate_environment() -> List[str]:
-     """Check which API keys are available."""
-     missing = []
-     if not os.getenv("GOOGLE_API_KEY"):
-         missing.append("GOOGLE_API_KEY (Gemini)")
-     if not os.getenv("ANTHROPIC_API_KEY"):
-         missing.append("ANTHROPIC_API_KEY (Claude)")
-     if not os.getenv("TAVILY_API_KEY"):
-         missing.append("TAVILY_API_KEY (Search)")
-     return missing
  ```

- Call at agent initialization to warn early.

- ### 5. Fix Tool Execution Error Handling

  **File:** `src/agent/graph.py` - `execute_node`

- **Issue:** If LLM function calling returns empty tool_calls, execution continues silently
-
- **Fix:**

  ```python
- tool_calls = select_tools_with_function_calling(...)
-
- if not tool_calls:
-     logger.error("LLM returned no tool calls - check LLM integration")
-     state["errors"].append("Tool selection failed: LLM returned no tools")
-     return state  # Early return instead of continuing
  ```

- ### 6. Add Fallback to Direct Tool Execution (MVP Hack)

- **File:** `src/agent/graph.py` - `execute_node`

- **If LLM function calling fails completely, use a rule-based fallback:**

  ```python
- # If LLM function calling fails, try simple heuristics
- if not tool_calls and "search" in question.lower():
-     logger.warning("LLM tool selection failed, using fallback: search")
-     tool_calls = [{"tool": "search", "params": {"query": question}}]
  ```

- **Purpose:** Get SOMETHING working even if the LLM fails (this is MVP - quality doesn't matter)

- ### 7. Test with Mock-Free Integration Tests

- **File:** `test/test_integration_real_apis.py` (NEW)

- **Tests:**

- - Test with real GOOGLE_API_KEY (if available)
- - Test with real ANTHROPIC_API_KEY (if available)
- - Test with real TAVILY_API_KEY (if available)
- - Skip tests if API keys are not available (don't fail)

- **Purpose:** Validate that real API integration works locally before deploying

- ### 8. Add Gradio UI Error Display

- **File:** `app.py`

- **Current:** Shows only the answer

- **New:** Show diagnostic info in the UI

- ```python
- def answer_question(question):
-     agent = GAIAAgent()
-     answer = agent(question)
-
-     # Show errors if present
-     if hasattr(agent, 'last_state'):
-         errors = agent.last_state.get('errors', [])
-         if errors:
-             return f"{answer}\n\nDIAGNOSTICS:\n" + "\n".join(errors)
-
-     return answer
- ```

- ### 9. Update HuggingFace Space Configuration

- **Action Items:**

- 1. Add environment variables in Space Settings:
-    - `GOOGLE_API_KEY` (for Gemini - primary)
-    - `ANTHROPIC_API_KEY` (for Claude - fallback)
-    - `TAVILY_API_KEY` (for web search)
- 2. Set to "Public" visibility if needed
- 3. Verify the build succeeds after adding keys

- ### 10. Deploy and Test Real Questions

- **Actions:**

- - Commit all changes
- - Push to HuggingFace Spaces
- - Wait for the build
- - Test with 5 simple GAIA questions manually
- - Verify at least 1-2 work (doesn't need to be correct, just collect evidence)

- ## Files to Modify

- 1. `src/agent/graph.py` - Add logging, improve error handling, add validation
- 2. `src/agent/llm_client.py` - Better exception handling, specific error types
- 3. `app.py` - Show diagnostics in UI
- 4. `test/test_integration_real_apis.py` - NEW - Real API integration tests
- 5. `README.md` - Document required API keys

- ## Success Criteria

- **MVP Definition:** Agent runs real APIs and collects evidence (even if answers are wrong)

- - [ ] Agent attempts real LLM calls (Gemini or Claude)
- - [ ] Agent attempts real tool calls (Tavily search)
- - [ ] Evidence is collected (not an empty list)
- - [ ] Errors are visible and actionable
- - [ ] At least 1/20 GAIA questions collects evidence (even if the answer is wrong)
- - [ ] Target: 5/20 questions answered (quality doesn't matter, just not "Unable to answer")

- **Non-Goals for MVP:**

- - High accuracy (not needed for MVP)
- - Optimal tool selection (can be random/fallback)
- - Perfect error recovery (basic is enough)
- - ❌ Performance optimization (Stage 5)

- ## Debug Strategy

- **If still failing after fixes:**

- 1. **Check logs** in the HuggingFace Space container logs
- 2. **Add print statements** (not just logger) to see output
- 3. **Test locally first** with real API keys
- 4. **Simplify to a single tool** (just search, no LLM function calling)
- 5. **Hardcode a simple question** to verify the basic flow works

  ## Risk Analysis

- **High Risk Issues:**

- 1. **Gemini function calling API is complex** - may fail even with a correct implementation
-    - **Mitigation:** Claude fallback + hardcoded tool selection fallback
- 2. **API keys not propagating** to the container
-    - **Mitigation:** Add validation at startup, fail fast with a clear message
- 3. **Tool execution fails silently**
-    - **Mitigation:** Explicit error logging, return partial results

- **Medium Risk Issues:**

- 1. **Rate limits** on free-tier APIs
-    - **Mitigation:** Retry with exponential backoff (already in tools)
- 2. **Network timeouts** in the HuggingFace environment
-    - **Mitigation:** Increase timeout settings, add timeout logging

- ## Next Stage Preview

- **Stage 5: Production Quality (After MVP Works)**

- - Performance optimization (reduce latency)
- - Accuracy improvements (15/20 target)
- - GAIA benchmark validation
- - Cost optimization
- - Caching strategies

- **But first:** Get to MVP (5/20 working, real APIs connected)
+ # Implementation Plan - Stage 5: Performance Optimization
+
+ **Date:** 2026-01-04
+ **Previous Stage:** Stage 4 Complete (10% score achieved)
  **Status:** Planning

+ ---
+
  ## Objective

+ Improve GAIA agent performance from 10% (2/20) to 25% (5/20) accuracy through systematic optimization of LLM quota management, tool selection, and error handling.

+ ---

+ ## Current State Analysis

+ **JSON Export:** `output/gaia_results_20260104_011001.json`

+ ### Success Cases (2/20 correct)
+ 1. **Question 3:** Reverse-text reasoning → "right"
+ 2. **Question 5:** Wikipedia search → "FunkMonk"

+ ### Failure Breakdown (18/20 failed)

+ **P0 - Critical: LLM Quota Exhaustion (15/20 failed - 75%)**
+ ```
+ Gemini: 429 quota exceeded (daily + per-minute + input tokens)
+ HuggingFace: 402 Payment Required (novita free limit reached)
+ Claude: 400 credit balance too low
+ ```

+ **P1 - High: Vision Tool Failures (3/20 failed)**
+ ```
+ Questions 4, 6, 9: "Vision analysis failed - Gemini and Claude both failed"
+ ```

+ **P1 - High: Tool Selection Errors (2/20 failed)**
+ ```
+ Question 6: "Tool selection returned no tools - using fallback keyword matching"
+ Question 7: "Tool calculator failed: ValueError: Expression must be a non-empty string"
+ ```

+ ---

+ ## Root Cause Analysis

+ ### Issue 1: LLM Quota Exhaustion (CRITICAL)
+ - **Impact:** 75% of questions fail due not to logic but to infrastructure
+ - **Cause:** All 3 LLM tiers exhausted simultaneously
+ - **Fix Priority:** P0 - Without LLMs, nothing works

+ ### Issue 2: Vision Tool Architecture
+ - **Impact:** All image/video questions auto-fail
+ - **Cause:** Vision depends on Gemini/Claude, both quota-exhausted
+ - **Fix Priority:** P1 - Can improve the score with a graceful skip

+ ### Issue 3: Tool Selection Logic
+ - **Impact:** Reduces the success rate on solvable questions
+ - **Cause:** Keyword fallback too simplistic, parameter validation too strict
+ - **Fix Priority:** P1 - Direct impact on accuracy

+ ---

+ ## Implementation Steps

+ ### Step 1: Add Retry Logic with Exponential Backoff (P0)

+ **File:** `src/agent/llm_client.py`

+ **Problem:** 429 errors fail immediately; no retry is attempted

+ **Solution:**
+ ```python
+ import logging
+ import time
+ from typing import Callable, Any
+
+ logger = logging.getLogger(__name__)
+
+ def retry_with_backoff(func: Callable, max_retries: int = 3) -> Any:
+     """Retry function with exponential backoff on quota errors."""
+     for attempt in range(max_retries):
+         try:
+             return func()
+         except Exception as e:
+             if "429" in str(e) or "quota" in str(e).lower():
+                 if attempt < max_retries - 1:
+                     wait_time = 2 ** attempt  # 1s, then 2s
+                     logger.warning(f"Quota error, retrying in {wait_time}s...")
+                     time.sleep(wait_time)
+                     continue
+             raise
+ ```

  **Changes:**
+ - Wrap all LLM calls in `plan_question()`, `select_tools()`, `synthesize_answer()`
+ - Respect the `retry_after` header if present
+ - Max 3 retries per tier

+ **Expected Impact:** Reduce quota failures from 75% to <50%
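As a quick check of the retry helper above, the sketch below exercises it against a fake provider that fails twice with a quota-style error before succeeding. The fake call and its error strings are illustrative only, not part of this commit:

```python
import time
from typing import Callable, Any

def retry_with_backoff(func: Callable[[], Any], max_retries: int = 3) -> Any:
    """Retry on quota-style errors with exponential backoff (1s, then 2s)."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if ("429" in str(e) or "quota" in str(e).lower()) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise

calls = {"n": 0}

def flaky_llm_call() -> str:
    """Stand-in for an LLM call: two 429s, then a real answer."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 quota exceeded")
    return "answer"

result = retry_with_backoff(flaky_llm_call)
print(result)  # answer (succeeds on the third attempt)
```

Note that non-quota errors (e.g. a 401) are re-raised immediately, so the backoff only spends time on errors that plausibly clear up.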
 
 
 
+ ### Step 2: Add Alternative Free LLM Providers (P0)

+ **File:** `src/agent/llm_client.py`
+
+ **Add Groq (Fast + Free Tier):**
  ```python
+ import os
+
+ from groq import Groq
+
+ def plan_question_groq(question, available_tools, file_paths=None):
+     """Use Groq's free tier (llama-3.1-70b)."""
+     client = Groq(api_key=os.getenv("GROQ_API_KEY"))
+     # prompt is assembled from question, available_tools, and file_paths
+     response = client.chat.completions.create(
+         model="llama-3.1-70b-versatile",
+         messages=[{"role": "user", "content": prompt}],
+         max_tokens=MAX_TOKENS,
+         temperature=TEMPERATURE
+     )
+     return response.choices[0].message.content
  ```

+ **New Fallback Chain:**
+ 1. Gemini (free, 1,500/day)
+ 2. HuggingFace (free, rate-limited)
+ 3. **Groq** (NEW - free, 30 req/min)
+ 4. Claude (paid, credits)
+ 5. Keyword matching
+
+ **Expected Impact:** Ensure at least one LLM tier is always available
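The chain above could be driven by a small dispatcher that walks the tiers in order. A minimal sketch, with stub providers and a hypothetical `plan_with_fallback` name (the real functions live in `llm_client.py`):

```python
from typing import Callable, List, Tuple

def plan_with_fallback(question: str,
                       providers: List[Tuple[str, Callable[[str], str]]]) -> str:
    """Try each provider in order; fall through to the next on any failure."""
    for name, provider in providers:
        try:
            return provider(question)
        except Exception as e:
            print(f"{name} failed: {e}")  # logger.warning in the real code
    return "keyword-fallback"  # final tier: no LLM available

# Demo with stubs: the first two tiers are exhausted, the third succeeds.
def gemini(q): raise RuntimeError("429 quota exceeded")
def hf(q): raise RuntimeError("402 Payment Required")
def groq(q): return f"plan for: {q}"

answer = plan_with_fallback("What is 2+2?",
                            [("gemini", gemini), ("hf", hf), ("groq", groq)])
print(answer)  # plan for: What is 2+2?
```

Keeping the chain data-driven like this makes adding a fourth tier a one-line change rather than another nested try/except.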
+ ### Step 3: Improve Tool Selection Prompt (P1)

+ **File:** `src/agent/llm_client.py` - `select_tools_with_function_calling()`

+ **Current Prompt:** Generic description
+
+ **New Prompt with Few-Shot Examples:**
  ```python
+ system_prompt = """You are a tool selection expert. Select appropriate tools based on the question.
+
+ Examples:
+ - "How many albums did X release?" → web_search
+ - "What is 25 * 37?" → calculator
+ - "Analyze this image URL" → vision
+ - "What is in this Excel file?" → parse_file
+
+ Available tools: {tools}
+ Question: {question}
+ Select the best tool(s)."""
  ```

+ **Expected Impact:** Reduce keyword fallback usage from 20% to <10%

+ ### Step 4: Graceful Vision Question Skip (P1)

  **File:** `src/agent/graph.py` - `execute_node`

+ **Solution:** Detect vision questions early, skip if quota is exhausted

  ```python
+ def is_vision_question(question: str) -> bool:
+     """Detect if a question requires the vision tool."""
+     vision_keywords = ["image", "video", "youtube", "photo", "picture", "watch"]
+     return any(kw in question.lower() for kw in vision_keywords)
+
+ # In execute_node:
+ if is_vision_question(question) and all_llms_exhausted():
+     logger.warning("Vision question detected but LLM quota exhausted, skipping")
+     state["answer"] = "Unable to answer (vision analysis unavailable)"
+     return state
  ```

+ **Expected Impact:** Avoid crashes, set expectations correctly
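A quick sanity check of the `is_vision_question()` heuristic from the step above. Note that a bare keyword list will also match words like "watch" used as a noun, so some false positives are expected:

```python
def is_vision_question(question: str) -> bool:
    """Keyword heuristic: does the question mention visual media?"""
    vision_keywords = ["image", "video", "youtube", "photo", "picture", "watch"]
    return any(kw in question.lower() for kw in vision_keywords)

print(is_vision_question("What bird appears in this YouTube video?"))  # True
print(is_vision_question("How many albums did X release?"))            # False
```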
+ ### Step 5: Relax Calculator Parameter Validation (P1)

+ **File:** `src/tools/calculator.py`

+ **Current:**
  ```python
+ if not expression or not expression.strip():
+     raise ValueError("Expression must be a non-empty string")
  ```

+ **New:**
+ ```python
+ if not expression or not expression.strip():
+     logger.warning("Empty calculator expression, extracting from context")
+     # Try to extract numbers from the question context
+     expression = extract_expression_from_context(question)
+ ```

+ **Expected Impact:** +1 question answered correctly
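`extract_expression_from_context()` is referenced above but not yet defined; one possible sketch is below. The regex heuristic is an assumption for illustration, not the planned implementation:

```python
import re

def extract_expression_from_context(question: str) -> str:
    """Pull an arithmetic expression out of free text (regex heuristic)."""
    # Match a run starting at a digit, continuing over digits, spaces,
    # and basic arithmetic operators/parentheses.
    match = re.search(r"\d[\d\s.+\-*/()]*", question)
    if not match:
        raise ValueError("No arithmetic expression found in question")
    return match.group().strip()

print(extract_expression_from_context("What is 25 * 37?"))  # 25 * 37
```

Raising when no digits are found keeps the original error path intact, so the relaxed validation only changes behavior when something recoverable is present.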
+ ### Step 6: Improve TOOLS Schema Descriptions (P1)

+ **File:** `src/tools/__init__.py`

+ **Current:**
+ ```python
+ "web_search": {
+     "description": "Search the web for information"
+ }
+ ```

+ **New:**
+ ```python
+ "web_search": {
+     "description": "Search the web for factual information, current events, Wikipedia articles, statistics, and research. Use when the question requires external knowledge."
+ }
+ ```

+ **Make descriptions more specific and action-oriented.**

+ **Expected Impact:** Better LLM tool selection accuracy

+ ---

+ ## Files to Modify

+ ### Priority 1 (Critical)
+ 1. **src/agent/llm_client.py**
+    - Add `retry_with_backoff()` helper
+    - Integrate the Groq provider
+    - Wrap all LLM calls with retry logic

+ 2. **requirements.txt**
+    - Add the `groq` package

+ ### Priority 2 (High Impact)
+ 3. **src/agent/graph.py**
+    - Add `is_vision_question()` helper
+    - Add vision question skip logic

+ 4. **src/tools/__init__.py**
+    - Improve TOOLS descriptions

+ 5. **src/tools/calculator.py**
+    - Relax parameter validation

+ ### Priority 3 (Nice to Have)
+ 6. **test/test_llm_integration.py**
+    - Add retry logic tests
+    - Add Groq integration tests

+ ---

+ ## Success Criteria

+ **Minimum (Stage 5 Pass):**
+ - 5/20 questions correct (25% accuracy)
+ - LLM quota errors <50% of failures (down from 75%)
+ - Tool selection keyword fallback <20% usage
+ - All tests passing (99/99)

+ **Stretch Goals:**
+ - ⭐ 6-7/20 questions correct (30-35% accuracy)
+ - ⭐ Zero vision tool crashes (graceful skips)
+ - ⭐ Tool selection accuracy >80%

+ ---

+ ## Testing Strategy

+ ### Local Testing
+ 1. Mock 429 errors, verify the retry logic works
+ 2. Test Groq integration with a real API key
+ 3. Run unit tests: `uv run pytest test/ -q`

+ ### HF Spaces Testing
+ 1. Add `GROQ_API_KEY` to the Space environment variables
+ 2. Deploy the updated code
+ 3. Run GAIA validation (20 questions)
+ 4. Download the JSON export: `output/gaia_results_TIMESTAMP.json`

+ ### Analysis
+ ```python
+ import json
+
+ # Compare before/after
+ before = json.load(open('output/gaia_results_20260104_011001.json'))
+ after = json.load(open('output/gaia_results_TIMESTAMP.json'))
+
+ # Count improvements
+ before_quota_errors = sum(1 for r in before['results'] if '429' in r['submitted_answer'])
+ after_quota_errors = sum(1 for r in after['results'] if '429' in r['submitted_answer'])
+
+ print(f"Quota errors: {before_quota_errors} -> {after_quota_errors}")
+ ```

+ ---

  ## Risk Analysis

+ **Risk 1:** Groq also has free-tier limits
+ - **Mitigation:** Groq allows 30 req/min (generous); add more providers if needed (Together.ai, OpenRouter)
+
+ **Risk 2:** Retry logic adds latency (up to 7 seconds per question)
+ - **Mitigation:** Acceptable for the accuracy improvement; only triggers on quota errors
+
+ **Risk 3:** Tool selection improvements may not move accuracy much
+ - **Mitigation:** Focus remains on P0 (LLM quota); P1 is a bonus
+
+ ---

+ ## Next Actions

+ 1. Review this plan
+ 2. Start Step 1: Add retry logic to `llm_client.py`
+ 3. Start Step 2: Integrate Groq as the 4th LLM tier
+ 4. Deploy and run GAIA validation
+ 5. Analyze the JSON export, compare with the baseline
+ 6. Create a new dev log: `dev/dev_260104_17_stage5_performance_optimization.md`

+ ---

+ ## Timeline Estimate

+ - **Step 1 (Retry logic):** 30 minutes
+ - **Step 2 (Groq integration):** 60 minutes
+ - **Step 3 (Tool selection):** 30 minutes
+ - **Step 4 (Vision skip):** 20 minutes
+ - **Step 5 (Calculator):** 15 minutes
+ - **Step 6 (Descriptions):** 15 minutes
+ - **Testing & Deployment:** 30 minutes
+ - **Documentation:** 20 minutes

+ **Total:** ~3.5 hours

+ **Ready to begin Stage 5 implementation!**