mangubee Claude Sonnet 4.5 committed on
Commit 456c236 · 1 Parent(s): 8b043d1

Plan: Stage 5 performance optimization strategy


Added comprehensive Stage 5 implementation plan:
- Objective: 10% → 25% accuracy improvement
- Root cause analysis from JSON export (75% quota failures)
- P0 steps: Retry logic + Groq integration
- P1 steps: Tool selection improvements, vision skip, calculator fix
- Success criteria: 5/20 questions, <50% quota errors
- Timeline: ~3.5 hours estimated

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Files changed (1)
  1. PLAN.md +237 -164
PLAN.md CHANGED
@@ -1,254 +1,327 @@
- # Implementation Plan - Stage 4: MVP - Real Integration
-
- **Date:** 2026-01-02
- **Dev Record:** dev/dev_260103_15_stage4_mvp_integration.md
  **Status:** Planning

  ## Objective

- Fix integration issues to achieve MVP: the agent answers real GAIA questions using real APIs (Gemini, Claude, Tavily), even if accuracy is low. Target: get from 0/20 to at least 5/20 questions correct.

- ## Current Problem Analysis

- **HuggingFace Result:** 0/20 correct, all answers = "Unable to answer: No evidence collected"

- **Root Causes Identified:**

- 1. **API Keys Issue:** Environment variables may not be set in the HuggingFace Space
- 2. **Silent Failures:** LLM function calling fails but errors are swallowed
- 3. **No Evidence Collection:** Tool execution broken, evidence list stays empty
- 4. **Poor Error Visibility:** User sees "Unable to answer" with no diagnostic info

- ## Steps

- ### 1. Add Comprehensive Debug Logging

- **File:** `src/agent/graph.py`

- **Changes:**

- - Add detailed logging in each node (plan/execute/answer)
- - Log LLM responses, tool calls, evidence collected
- - Log errors with full stack traces
- - Add state inspection logging

- **Purpose:** Understand where exactly the integration fails

- ### 2. Improve Error Messages

- **File:** `src/agent/graph.py` - `answer_node`

- **Current:**

- ```python
- state["answer"] = "Unable to answer: No evidence collected"
- ```

- **New:**

- ```python
- if not evidence:
-     error_summary = "; ".join(state["errors"]) if state["errors"] else "No errors logged"
-     state["answer"] = f"ERROR: No evidence. Errors: {error_summary}"
- ```

- **Purpose:** Show WHY it failed (API key missing? Tool failed? LLM failed?)

- ### 3. Add Graceful Degradation in LLM Client

- **File:** `src/agent/llm_client.py`

  **Changes:**

- - Better exception handling with specific error types
- - Distinguish between: API key missing, rate limit, network error, API error
- - Log which provider failed and why
- - Add fallback messages instead of re-raising

- **Example:**

  ```python
- try:
-     return plan_question_gemini(...)
- except ValueError as e:
-     if "GOOGLE_API_KEY" in str(e):
-         logger.error("Gemini API key not set")
-         # Try Claude fallback
- except Exception as e:
-     logger.error(f"Gemini failed: {type(e).__name__}: {e}")
  ```

- ### 4. Add API Key Validation Check

- **File:** `src/agent/graph.py` - Add validation before execution

- **New function:**

  ```python
- def validate_environment() -> List[str]:
-     """Check which API keys are available."""
-     missing = []
-     if not os.getenv("GOOGLE_API_KEY"):
-         missing.append("GOOGLE_API_KEY (Gemini)")
-     if not os.getenv("ANTHROPIC_API_KEY"):
-         missing.append("ANTHROPIC_API_KEY (Claude)")
-     if not os.getenv("TAVILY_API_KEY"):
-         missing.append("TAVILY_API_KEY (Search)")
-     return missing
  ```

- Call at agent initialization to warn early.

- ### 5. Fix Tool Execution Error Handling

  **File:** `src/agent/graph.py` - `execute_node`

- **Issue:** If LLM function calling returns empty tool_calls, execution continues silently
-
- **Fix:**

  ```python
- tool_calls = select_tools_with_function_calling(...)
-
- if not tool_calls:
-     logger.error("LLM returned no tool calls - check LLM integration")
-     state["errors"].append("Tool selection failed: LLM returned no tools")
-     return state  # Early return instead of continuing
  ```

- ### 6. Add Fallback to Direct Tool Execution (MVP Hack)

- **File:** `src/agent/graph.py` - `execute_node`

- **If LLM function calling fails completely, use a rule-based fallback:**

  ```python
- # If LLM function calling fails, try simple heuristics
- if not tool_calls and "search" in question.lower():
-     logger.warning("LLM tool selection failed, using fallback: search")
-     tool_calls = [{"tool": "search", "params": {"query": question}}]
  ```

- **Purpose:** Get SOMETHING working even if the LLM fails (this is MVP - quality doesn't matter)

- ### 7. Test with Mock-Free Integration Tests

- **File:** `test/test_integration_real_apis.py` (NEW)

- **Tests:**

- - Test with real GOOGLE_API_KEY (if available)
- - Test with real ANTHROPIC_API_KEY (if available)
- - Test with real TAVILY_API_KEY (if available)
- - Skip tests if API keys are not available (don't fail)

- **Purpose:** Validate that real API integration works locally before deploying

- ### 8. Add Gradio UI Error Display

- **File:** `app.py`

- **Current:** Shows only the answer

- **New:** Show diagnostic info in the UI

- ```python
- def answer_question(question):
-     agent = GAIAAgent()
-     answer = agent(question)
-
-     # Show errors if present
-     if hasattr(agent, 'last_state'):
-         errors = agent.last_state.get('errors', [])
-         if errors:
-             return f"{answer}\n\nDIAGNOSTICS:\n" + "\n".join(errors)
-
-     return answer
- ```

- ### 9. Update HuggingFace Space Configuration

- **Action Items:**

- 1. Add environment variables in Space Settings:
-    - `GOOGLE_API_KEY` (for Gemini - primary)
-    - `ANTHROPIC_API_KEY` (for Claude - fallback)
-    - `TAVILY_API_KEY` (for web search)
- 2. Set to "Public" visibility if needed
- 3. Verify the build succeeds after adding keys

- ### 10. Deploy and Test Real Questions

- **Actions:**

- - Commit all changes
- - Push to HuggingFace Spaces
- - Wait for the build
- - Test with 5 simple GAIA questions manually
- - Verify at least 1-2 work (doesn't need to be correct, just collect evidence)

- ## Files to Modify

- 1. `src/agent/graph.py` - Add logging, improve error handling, add validation
- 2. `src/agent/llm_client.py` - Better exception handling, specific error types
- 3. `app.py` - Show diagnostics in UI
- 4. `test/test_integration_real_apis.py` - NEW - Real API integration tests
- 5. `README.md` - Document required API keys

- ## Success Criteria

- **MVP Definition:** Agent runs real APIs and collects evidence (even if answers are wrong)

- - [ ] Agent attempts real LLM calls (Gemini or Claude)
- - [ ] Agent attempts real tool calls (Tavily search)
- - [ ] Evidence is collected (not an empty list)
- - [ ] Errors are visible and actionable
- - [ ] At least 1/20 GAIA questions collects evidence (even if the answer is wrong)
- - [ ] Target: 5/20 questions answered (quality doesn't matter, just not "Unable to answer")

- **Non-Goals for MVP:**

- - High accuracy (not needed for MVP)
- - Optimal tool selection (can be random/fallback)
- - Perfect error recovery (basic is enough)
- - ❌ Performance optimization (Stage 5)

- ## Debug Strategy

- **If still failing after fixes:**

- 1. **Check logs** in the HuggingFace Space container logs
- 2. **Add print statements** (not just logger) to see output
- 3. **Test locally first** with real API keys
- 4. **Simplify to a single tool** (just search, no LLM function calling)
- 5. **Hardcode a simple question** to verify the basic flow works

  ## Risk Analysis

- **High Risk Issues:**

- 1. **Gemini function calling API is complex** - may fail even with a correct implementation
-    - **Mitigation:** Claude fallback + hardcoded tool selection fallback
- 2. **API keys not propagating** to the container
-    - **Mitigation:** Add validation at startup, fail fast with a clear message
- 3. **Tool execution fails silently**
-    - **Mitigation:** Explicit error logging, return partial results

- **Medium Risk Issues:**

- 1. **Rate limits** on free-tier APIs
-    - **Mitigation:** Retry with exponential backoff (already in tools)
- 2. **Network timeouts** in the HuggingFace environment
-    - **Mitigation:** Increase timeout settings, add timeout logging

- ## Next Stage Preview

- **Stage 5: Production Quality (After MVP Works)**

- - Performance optimization (reduce latency)
- - Accuracy improvements (15/20 target)
- - GAIA benchmark validation
- - Cost optimization
- - Caching strategies

- **But first:** Get to MVP (5/20 working, real APIs connected)
+ # Implementation Plan - Stage 5: Performance Optimization
+
+ **Date:** 2026-01-04
+ **Previous Stage:** Stage 4 Complete (10% score achieved)
  **Status:** Planning

+ ---
+
  ## Objective

+ Improve GAIA agent performance from 10% (2/20) to 25% (5/20) accuracy through systematic optimization of LLM quota management, tool selection, and error handling.

+ ---

+ ## Current State Analysis

+ **JSON Export:** `output/gaia_results_20260104_011001.json`

+ ### Success Cases (2/20 correct)
+ 1. **Question 3:** Reverse-text reasoning → "right"
+ 2. **Question 5:** Wikipedia search → "FunkMonk"

+ ### Failure Breakdown (18/20 failed)

+ **P0 - Critical: LLM Quota Exhaustion (15/20 failed - 75%)**
+ ```
+ Gemini: 429 quota exceeded (daily + per-minute + input tokens)
+ HuggingFace: 402 Payment Required (novita free limit reached)
+ Claude: 400 credit balance too low
+ ```

+ **P1 - High: Vision Tool Failures (3/20 failed)**
+ ```
+ Questions 4, 6, 9: "Vision analysis failed - Gemini and Claude both failed"
+ ```

+ **P1 - High: Tool Selection Errors (2/20 failed)**
+ ```
+ Question 6: "Tool selection returned no tools - using fallback keyword matching"
+ Question 7: "Tool calculator failed: ValueError: Expression must be a non-empty string"
+ ```

+ ---

+ ## Root Cause Analysis

+ ### Issue 1: LLM Quota Exhaustion (CRITICAL)
+ - **Impact:** 75% of questions fail due not to logic but to infrastructure
+ - **Cause:** All 3 LLM tiers exhausted simultaneously
+ - **Fix Priority:** P0 - Without LLMs, nothing works

+ ### Issue 2: Vision Tool Architecture
+ - **Impact:** All image/video questions auto-fail
+ - **Cause:** Vision depends on Gemini/Claude, both quota-exhausted
+ - **Fix Priority:** P1 - Can improve the score with a graceful skip

+ ### Issue 3: Tool Selection Logic
+ - **Impact:** Reduces the success rate on solvable questions
+ - **Cause:** Keyword fallback too simplistic, parameter validation too strict
+ - **Fix Priority:** P1 - Direct impact on accuracy

+ ---

+ ## Implementation Steps

+ ### Step 1: Add Retry Logic with Exponential Backoff (P0)

+ **File:** `src/agent/llm_client.py`

+ **Problem:** 429 errors fail immediately; no retry is attempted

+ **Solution:**
+ ```python
+ import logging
+ import time
+ from typing import Callable, Any
+
+ logger = logging.getLogger(__name__)
+
+ def retry_with_backoff(func: Callable, max_retries: int = 3) -> Any:
+     """Retry function with exponential backoff on quota errors."""
+     for attempt in range(max_retries):
+         try:
+             return func()
+         except Exception as e:
+             if "429" in str(e) or "quota" in str(e).lower():
+                 if attempt < max_retries - 1:
+                     wait_time = 2 ** attempt  # 1s, then 2s
+                     logger.warning(f"Quota error, retrying in {wait_time}s...")
+                     time.sleep(wait_time)
+                     continue
+             raise
+ ```

  **Changes:**
+ - Wrap all LLM calls in `plan_question()`, `select_tools()`, `synthesize_answer()`
+ - Respect the `retry_after` header if present
+ - Max 3 retries per tier

+ **Expected Impact:** Reduce quota failures from 75% to <50%
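As a quick check of the retry helper above, the sketch below exercises it against a fake provider that fails twice with a quota-style error before succeeding. The fake call and its error strings are illustrative only, not part of this commit:

```python
import time
from typing import Callable, Any

def retry_with_backoff(func: Callable[[], Any], max_retries: int = 3) -> Any:
    """Retry on quota-style errors with exponential backoff (1s, then 2s)."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if ("429" in str(e) or "quota" in str(e).lower()) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise

calls = {"n": 0}

def flaky_llm_call() -> str:
    """Stand-in for an LLM call: two 429s, then a real answer."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 quota exceeded")
    return "answer"

result = retry_with_backoff(flaky_llm_call)
print(result)  # answer (succeeds on the third attempt)
```

Note that non-quota errors (e.g. a 401) are re-raised immediately, so the backoff only spends time on errors that plausibly clear up.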
 
 
 
+ ### Step 2: Add Alternative Free LLM Providers (P0)

+ **File:** `src/agent/llm_client.py`
+
+ **Add Groq (Fast + Free Tier):**
  ```python
+ import os
+
+ from groq import Groq
+
+ def plan_question_groq(question, available_tools, file_paths=None):
+     """Use Groq's free tier (llama-3.1-70b)."""
+     client = Groq(api_key=os.getenv("GROQ_API_KEY"))
+     # prompt is assembled from question, available_tools, and file_paths
+     response = client.chat.completions.create(
+         model="llama-3.1-70b-versatile",
+         messages=[{"role": "user", "content": prompt}],
+         max_tokens=MAX_TOKENS,
+         temperature=TEMPERATURE
+     )
+     return response.choices[0].message.content
  ```

+ **New Fallback Chain:**
+ 1. Gemini (free, 1,500/day)
+ 2. HuggingFace (free, rate-limited)
+ 3. **Groq** (NEW - free, 30 req/min)
+ 4. Claude (paid, credits)
+ 5. Keyword matching
+
+ **Expected Impact:** Ensure at least one LLM tier is always available
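The chain above could be driven by a small dispatcher that walks the tiers in order. A minimal sketch, with stub providers and a hypothetical `plan_with_fallback` name (the real functions live in `llm_client.py`):

```python
from typing import Callable, List, Tuple

def plan_with_fallback(question: str,
                       providers: List[Tuple[str, Callable[[str], str]]]) -> str:
    """Try each provider in order; fall through to the next on any failure."""
    for name, provider in providers:
        try:
            return provider(question)
        except Exception as e:
            print(f"{name} failed: {e}")  # logger.warning in the real code
    return "keyword-fallback"  # final tier: no LLM available

# Demo with stubs: the first two tiers are exhausted, the third succeeds.
def gemini(q): raise RuntimeError("429 quota exceeded")
def hf(q): raise RuntimeError("402 Payment Required")
def groq(q): return f"plan for: {q}"

answer = plan_with_fallback("What is 2+2?",
                            [("gemini", gemini), ("hf", hf), ("groq", groq)])
print(answer)  # plan for: What is 2+2?
```

Keeping the chain data-driven like this makes adding a fourth tier a one-line change rather than another nested try/except.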
+ ### Step 3: Improve Tool Selection Prompt (P1)

+ **File:** `src/agent/llm_client.py` - `select_tools_with_function_calling()`

+ **Current Prompt:** Generic description
+
+ **New Prompt with Few-Shot Examples:**
  ```python
+ system_prompt = """You are a tool selection expert. Select appropriate tools based on the question.
+
+ Examples:
+ - "How many albums did X release?" → web_search
+ - "What is 25 * 37?" → calculator
+ - "Analyze this image URL" → vision
+ - "What is in this Excel file?" → parse_file
+
+ Available tools: {tools}
+ Question: {question}
+ Select the best tool(s)."""
  ```

+ **Expected Impact:** Reduce keyword fallback usage from 20% to <10%

+ ### Step 4: Graceful Vision Question Skip (P1)

  **File:** `src/agent/graph.py` - `execute_node`

+ **Solution:** Detect vision questions early, skip if quota is exhausted

  ```python
+ def is_vision_question(question: str) -> bool:
+     """Detect if a question requires the vision tool."""
+     vision_keywords = ["image", "video", "youtube", "photo", "picture", "watch"]
+     return any(kw in question.lower() for kw in vision_keywords)
+
+ # In execute_node:
+ if is_vision_question(question) and all_llms_exhausted():
+     logger.warning("Vision question detected but LLM quota exhausted, skipping")
+     state["answer"] = "Unable to answer (vision analysis unavailable)"
+     return state
  ```

+ **Expected Impact:** Avoid crashes, set expectations correctly
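A quick sanity check of the `is_vision_question()` heuristic from the step above. Note that a bare keyword list will also match words like "watch" used as a noun, so some false positives are expected:

```python
def is_vision_question(question: str) -> bool:
    """Keyword heuristic: does the question mention visual media?"""
    vision_keywords = ["image", "video", "youtube", "photo", "picture", "watch"]
    return any(kw in question.lower() for kw in vision_keywords)

print(is_vision_question("What bird appears in this YouTube video?"))  # True
print(is_vision_question("How many albums did X release?"))            # False
```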
+ ### Step 5: Relax Calculator Parameter Validation (P1)

+ **File:** `src/tools/calculator.py`

+ **Current:**
  ```python
+ if not expression or not expression.strip():
+     raise ValueError("Expression must be a non-empty string")
  ```

+ **New:**
+ ```python
+ if not expression or not expression.strip():
+     logger.warning("Empty calculator expression, extracting from context")
+     # Try to extract numbers from the question context
+     expression = extract_expression_from_context(question)
+ ```

+ **Expected Impact:** +1 question answered correctly
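`extract_expression_from_context()` is referenced above but not yet defined; one possible sketch is below. The regex heuristic is an assumption for illustration, not the planned implementation:

```python
import re

def extract_expression_from_context(question: str) -> str:
    """Pull an arithmetic expression out of free text (regex heuristic)."""
    # Match a run starting at a digit, continuing over digits, spaces,
    # and basic arithmetic operators/parentheses.
    match = re.search(r"\d[\d\s.+\-*/()]*", question)
    if not match:
        raise ValueError("No arithmetic expression found in question")
    return match.group().strip()

print(extract_expression_from_context("What is 25 * 37?"))  # 25 * 37
```

Raising when no digits are found keeps the original error path intact, so the relaxed validation only changes behavior when something recoverable is present.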
+ ### Step 6: Improve TOOLS Schema Descriptions (P1)

+ **File:** `src/tools/__init__.py`

+ **Current:**
+ ```python
+ "web_search": {
+     "description": "Search the web for information"
+ }
+ ```

+ **New:**
+ ```python
+ "web_search": {
+     "description": "Search the web for factual information, current events, Wikipedia articles, statistics, and research. Use when the question requires external knowledge."
+ }
+ ```

+ **Make descriptions more specific and action-oriented.**

+ **Expected Impact:** Better LLM tool selection accuracy

+ ---

+ ## Files to Modify

+ ### Priority 1 (Critical)
+ 1. **src/agent/llm_client.py**
+    - Add `retry_with_backoff()` helper
+    - Integrate the Groq provider
+    - Wrap all LLM calls with retry logic

+ 2. **requirements.txt**
+    - Add the `groq` package

+ ### Priority 2 (High Impact)
+ 3. **src/agent/graph.py**
+    - Add `is_vision_question()` helper
+    - Add vision question skip logic

+ 4. **src/tools/__init__.py**
+    - Improve TOOLS descriptions

+ 5. **src/tools/calculator.py**
+    - Relax parameter validation

+ ### Priority 3 (Nice to Have)
+ 6. **test/test_llm_integration.py**
+    - Add retry logic tests
+    - Add Groq integration tests

+ ---

+ ## Success Criteria

+ **Minimum (Stage 5 Pass):**
+ - 5/20 questions correct (25% accuracy)
+ - LLM quota errors <50% of failures (down from 75%)
+ - Tool selection keyword fallback <20% usage
+ - All tests passing (99/99)

+ **Stretch Goals:**
+ - ⭐ 6-7/20 questions correct (30-35% accuracy)
+ - ⭐ Zero vision tool crashes (graceful skips)
+ - ⭐ Tool selection accuracy >80%

+ ---

+ ## Testing Strategy

+ ### Local Testing
+ 1. Mock 429 errors, verify the retry logic works
+ 2. Test Groq integration with a real API key
+ 3. Run unit tests: `uv run pytest test/ -q`

+ ### HF Spaces Testing
+ 1. Add `GROQ_API_KEY` to the Space environment variables
+ 2. Deploy the updated code
+ 3. Run GAIA validation (20 questions)
+ 4. Download the JSON export: `output/gaia_results_TIMESTAMP.json`

+ ### Analysis
+ ```python
+ import json
+
+ # Compare before/after
+ before = json.load(open('output/gaia_results_20260104_011001.json'))
+ after = json.load(open('output/gaia_results_TIMESTAMP.json'))
+
+ # Count improvements
+ before_quota_errors = sum(1 for r in before['results'] if '429' in r['submitted_answer'])
+ after_quota_errors = sum(1 for r in after['results'] if '429' in r['submitted_answer'])
+
+ print(f"Quota errors: {before_quota_errors} -> {after_quota_errors}")
+ ```

+ ---

  ## Risk Analysis

+ **Risk 1:** Groq also has free-tier limits
+ - **Mitigation:** Groq allows 30 req/min (generous); add more providers if needed (Together.ai, OpenRouter)
+
+ **Risk 2:** Retry logic adds latency (up to 7 seconds per question)
+ - **Mitigation:** Acceptable for the accuracy improvement; only triggers on quota errors
+
+ **Risk 3:** Tool selection improvements may not move accuracy much
+ - **Mitigation:** Focus remains on P0 (LLM quota); P1 is a bonus
+
+ ---

+ ## Next Actions

+ 1. Review this plan
+ 2. Start Step 1: Add retry logic to `llm_client.py`
+ 3. Start Step 2: Integrate Groq as the 4th LLM tier
+ 4. Deploy and run GAIA validation
+ 5. Analyze the JSON export, compare with the baseline
+ 6. Create a new dev log: `dev/dev_260104_17_stage5_performance_optimization.md`

+ ---

+ ## Timeline Estimate

+ - **Step 1 (Retry logic):** 30 minutes
+ - **Step 2 (Groq integration):** 60 minutes
+ - **Step 3 (Tool selection):** 30 minutes
+ - **Step 4 (Vision skip):** 20 minutes
+ - **Step 5 (Calculator):** 15 minutes
+ - **Step 6 (Descriptions):** 15 minutes
+ - **Testing & Deployment:** 30 minutes
+ - **Documentation:** 20 minutes

+ **Total:** ~3.5 hours

+ **Ready to begin Stage 5 implementation!**