mangubee committed on
Commit
5f56dbc
·
1 Parent(s): 2a449c8
PLAN.md CHANGED
@@ -1,21 +1,254 @@
- # Implementation Plan
 
- **Date:** [YYYY-MM-DD]
- **Dev Record:** [link to dev/dev_YYMMDD_##_concise_title.md]
- **Status:** [Planning | In Progress | Completed]
 
  ## Objective
 
- [Clear goal statement]
 
  ## Steps
 
- [Implementation steps]
 
  ## Files to Modify
 
- [List of files]
 
  ## Success Criteria
 
- [Completion criteria]
+ # Implementation Plan - Stage 4: MVP - Real Integration
 
+ **Date:** 2026-01-02
+ **Dev Record:** dev/dev_260103_15_stage4_mvp_integration.md
+ **Status:** Planning
 
  ## Objective
 
+ Fix integration issues to achieve MVP: the agent answers real GAIA questions using real APIs (Gemini, Claude, Tavily), even if accuracy is low. Target: go from 0/20 to at least 5/20 questions correct.
+
+ ## Current Problem Analysis
+
+ **HuggingFace Result:** 0/20 correct; every answer was "Unable to answer: No evidence collected"
+
+ **Root Causes Identified:**
+
+ 1. **API keys issue:** Environment variables may not be set in the HuggingFace Space
+ 2. **Silent failures:** LLM function calling fails, but the errors are swallowed
+ 3. **No evidence collection:** Tool execution is broken, so the evidence list stays empty
+ 4. **Poor error visibility:** The user sees "Unable to answer" with no diagnostic info
 
  ## Steps
 
+ ### 1. Add Comprehensive Debug Logging
+
+ **File:** `src/agent/graph.py`
+
+ **Changes:**
+
+ - Add detailed logging in each node (plan/execute/answer)
+ - Log LLM responses, tool calls, and evidence collected
+ - Log errors with full stack traces
+ - Add state inspection logging
+
+ **Purpose:** Understand exactly where the integration fails
+
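One lightweight way to implement this is a decorator wrapped around every node. This is a sketch, not the project's code: the node signature (a dict-like state) and the `evidence`/`errors` keys are assumptions based on the state fields mentioned elsewhere in this plan, and `plan_node` is a hypothetical example.

```python
import functools
import logging
import traceback

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("agent.graph")

def logged_node(fn):
    """Log a node's input state, its result, and any exception with a stack trace."""
    @functools.wraps(fn)
    def wrapper(state):
        logger.debug("ENTER %s: evidence=%d errors=%d", fn.__name__,
                     len(state.get("evidence", [])), len(state.get("errors", [])))
        try:
            result = fn(state)
            logger.debug("EXIT %s: evidence=%d", fn.__name__,
                         len(result.get("evidence", [])))
            return result
        except Exception:
            logger.error("Node %s crashed:\n%s", fn.__name__, traceback.format_exc())
            raise
    return wrapper

@logged_node
def plan_node(state):  # hypothetical node, for illustration only
    state.setdefault("evidence", []).append("plan created")
    return state
```

Because the decorator re-raises, it adds visibility without changing node behavior.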
+ ### 2. Improve Error Messages
+
+ **File:** `src/agent/graph.py` - `answer_node`
+
+ **Current:**
+
+ ```python
+ state["answer"] = "Unable to answer: No evidence collected"
+ ```
+
+ **New:**
+
+ ```python
+ if not evidence:
+     error_summary = "; ".join(state["errors"]) if state["errors"] else "No errors logged"
+     state["answer"] = f"ERROR: No evidence. Errors: {error_summary}"
+ ```
+
+ **Purpose:** Show WHY it failed (API key missing? Tool failed? LLM failed?)
+
+ ### 3. Add Graceful Degradation in LLM Client
+
+ **File:** `src/agent/llm_client.py`
+
+ **Changes:**
+
+ - Better exception handling with specific error types
+ - Distinguish between: API key missing, rate limit, network error, API error
+ - Log which provider failed and why
+ - Add fallback messages instead of re-raising
+
+ **Example:**
+
+ ```python
+ try:
+     return plan_question_gemini(...)
+ except ValueError as e:
+     if "GOOGLE_API_KEY" in str(e):
+         logger.error("Gemini API key not set")
+     # Try Claude fallback
+ except Exception as e:
+     logger.error(f"Gemini failed: {type(e).__name__}: {e}")
+ ```
+
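Putting the pieces of the example together, the whole fallback chain could look like the sketch below. `plan_gemini` and `plan_claude` are stand-ins for the provider-specific functions in `llm_client.py`; their real names and signatures may differ.

```python
import logging

logger = logging.getLogger("agent.llm_client")

def plan_question(question, plan_gemini, plan_claude):
    """Try Gemini first, then Claude; return a diagnostic message if both fail."""
    for name, provider in (("Gemini", plan_gemini), ("Claude", plan_claude)):
        try:
            return provider(question)
        except Exception as e:
            # Log which provider failed and why, then move to the next one
            logger.error("%s failed: %s: %s", name, type(e).__name__, e)
    return "LLM planning failed: all providers errored (check API keys and logs)"
```

Returning a message instead of re-raising keeps the graph running, so the answer node can surface the diagnostics.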
+ ### 4. Add API Key Validation Check
+
+ **File:** `src/agent/graph.py` - Add validation before execution
+
+ **New function:**
+
+ ```python
+ import os
+ from typing import List
+
+ def validate_environment() -> List[str]:
+     """Check which API keys are available."""
+     missing = []
+     if not os.getenv("GOOGLE_API_KEY"):
+         missing.append("GOOGLE_API_KEY (Gemini)")
+     if not os.getenv("ANTHROPIC_API_KEY"):
+         missing.append("ANTHROPIC_API_KEY (Claude)")
+     if not os.getenv("TAVILY_API_KEY"):
+         missing.append("TAVILY_API_KEY (Search)")
+     return missing
+ ```
+
+ Call it at agent initialization to warn early.
+
+ ### 5. Fix Tool Execution Error Handling
+
+ **File:** `src/agent/graph.py` - `execute_node`
+
+ **Issue:** If LLM function calling returns empty tool_calls, execution continues silently
+
+ **Fix:**
+
+ ```python
+ tool_calls = select_tools_with_function_calling(...)
+
+ if not tool_calls:
+     logger.error("LLM returned no tool calls - check LLM integration")
+     state["errors"].append("Tool selection failed: LLM returned no tools")
+     return state  # Early return instead of continuing
+ ```
+
+ ### 6. Add Fallback to Direct Tool Execution (MVP Hack)
+
+ **File:** `src/agent/graph.py` - `execute_node`
+
+ **If LLM function calling fails completely, use a rule-based fallback:**
+
+ ```python
+ # If LLM function calling fails, try simple heuristics
+ if not tool_calls and "search" in question.lower():
+     logger.warning("LLM tool selection failed, using fallback: search")
+     tool_calls = [{"tool": "search", "params": {"query": question}}]
+ ```
+
+ **Purpose:** Get SOMETHING working even if the LLM fails (this is MVP; quality doesn't matter)
+
+ ### 7. Test with Mock-Free Integration Tests
+
+ **File:** `test/test_integration_real_apis.py` (NEW)
+
+ **Tests:**
+
+ - Test with a real GOOGLE_API_KEY (if available)
+ - Test with a real ANTHROPIC_API_KEY (if available)
+ - Test with a real TAVILY_API_KEY (if available)
+ - Skip tests if API keys are not available (rather than failing)
+
+ **Purpose:** Validate that the real API integration works locally before deploying
+
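A minimal shape for such a test, using stdlib `unittest` (the project may prefer pytest's `skipif` instead); the `src.agent.tools.search` import path and the query are illustrative guesses, not confirmed names:

```python
import os
import unittest

class TestRealAPIs(unittest.TestCase):
    # Skipped, not failed, when the key is absent
    @unittest.skipUnless(os.getenv("TAVILY_API_KEY"), "TAVILY_API_KEY not set")
    def test_search_returns_results(self):
        from src.agent.tools import search  # hypothetical import path
        results = search("Mercedes Sosa studio albums 2000-2009")
        self.assertTrue(results, "expected at least one search result")
```

The same `skipUnless` guard can be repeated per provider key so a laptop without Claude access still runs the Gemini and Tavily tests.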
+ ### 8. Add Gradio UI Error Display
+
+ **File:** `app.py`
+
+ **Current:** Shows only the answer
+
+ **New:** Show diagnostic info in the UI
+
+ ```python
+ def answer_question(question):
+     agent = GAIAAgent()
+     answer = agent(question)
+
+     # Show errors if present
+     if hasattr(agent, 'last_state'):
+         errors = agent.last_state.get('errors', [])
+         if errors:
+             return f"{answer}\n\nDIAGNOSTICS:\n" + "\n".join(errors)
+
+     return answer
+ ```
+
+ ### 9. Update HuggingFace Space Configuration
+
+ **Action Items:**
+
+ 1. Add environment variables in the Space Settings:
+    - `GOOGLE_API_KEY` (for Gemini - primary)
+    - `ANTHROPIC_API_KEY` (for Claude - fallback)
+    - `TAVILY_API_KEY` (for web search)
+ 2. Set visibility to "Public" if needed
+ 3. Verify the build succeeds after adding the keys
+
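To confirm the secrets actually reach the container, a quick check can be run from a terminal in the Space (or locally); this one-liner only prints whether each variable is set, never its value:

```shell
python3 -c 'import os; [print(k, "set" if os.getenv(k) else "MISSING") for k in ("GOOGLE_API_KEY", "ANTHROPIC_API_KEY", "TAVILY_API_KEY")]'
```

Any `MISSING` line means the corresponding Space secret was not configured or did not propagate to the build.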
+ ### 10. Deploy and Test Real Questions
+
+ **Actions:**
+
+ - Commit all changes
+ - Push to HuggingFace Spaces
+ - Wait for the build
+ - Test with 5 simple GAIA questions manually
+ - Verify at least 1-2 collect evidence (the answers don't need to be correct)
 
  ## Files to Modify
 
+ 1. `src/agent/graph.py` - Add logging, improve error handling, add validation
+ 2. `src/agent/llm_client.py` - Better exception handling, specific error types
+ 3. `app.py` - Show diagnostics in the UI
+ 4. `test/test_integration_real_apis.py` - NEW - Real API integration tests
+ 5. `README.md` - Document the required API keys
 
  ## Success Criteria
 
+ **MVP Definition:** The agent runs real APIs and collects evidence (even if the answers are wrong)
+
+ - [ ] Agent attempts real LLM calls (Gemini or Claude)
+ - [ ] Agent attempts real tool calls (Tavily search)
+ - [ ] Evidence is collected (not an empty list)
+ - [ ] Errors are visible and actionable
+ - [ ] At least 1/20 GAIA questions collects evidence (even if the answer is wrong)
+ - [ ] Target: 5/20 questions answered (quality doesn't matter, just not "Unable to answer")
+
+ **Non-Goals for MVP:**
+
+ - ❌ High accuracy (not needed for MVP)
+ - ❌ Optimal tool selection (can be random/fallback)
+ - ❌ Perfect error recovery (basic is enough)
+ - ❌ Performance optimization (Stage 5)
+
+ ## Debug Strategy
+
+ **If it is still failing after these fixes:**
+
+ 1. **Check the HuggingFace Space container logs**
+ 2. **Add print statements** (not just the logger) to see output
+ 3. **Test locally first** with real API keys
+ 4. **Simplify to a single tool** (just search, no LLM function calling)
+ 5. **Hardcode a simple question** to verify the basic flow works
+
+ ## Risk Analysis
+
+ **High-Risk Issues:**
+
+ 1. **Gemini function calling API is complex** - it may fail even with a correct implementation
+    - **Mitigation:** Claude fallback plus a hardcoded tool-selection fallback
+ 2. **API keys not propagating** to the container
+    - **Mitigation:** Validate at startup and fail fast with a clear message
+ 3. **Tool execution fails silently**
+    - **Mitigation:** Explicit error logging; return partial results
+
+ **Medium-Risk Issues:**
+
+ 1. **Rate limits** on free-tier APIs
+    - **Mitigation:** Retry with exponential backoff (already in the tools)
+ 2. **Network timeouts** in the HuggingFace environment
+    - **Mitigation:** Increase timeout settings and add timeout logging
+
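For reference, the retry-with-exponential-backoff pattern the rate-limit mitigation refers to looks roughly like this; the helper name and delay values are illustrative, not the tools' actual implementation:

```python
import time

def retry_with_backoff(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with delays of 1s, 2s, 4s, ..."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` as a parameter keeps the helper testable without real delays.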
+ ## Next Stage Preview
+
+ **Stage 5: Production Quality (After MVP Works)**
+
+ - Performance optimization (reduce latency)
+ - Accuracy improvements (15/20 target)
+ - GAIA benchmark validation
+ - Cost optimization
+ - Caching strategies
+
+ **But first:** Get to MVP (5/20 working, real APIs connected)
dev/dev_260103_16_huggingface_integration.md CHANGED
@@ -154,12 +154,14 @@ All tests passing with new 3-tier fallback architecture.
  - **Status:** Agent is functional and operational
 
  **What worked:**
+
  - ✅ Question 1: "How many studio albums were published by Mercedes Sosa between 2000 and 2009?" → Answer: "3" (CORRECT)
  - ✅ HuggingFace LLM (Qwen 2.5 72B) successfully used for planning and tool selection
  - ✅ Web search tool executed successfully
  - ✅ Evidence collection and answer synthesis working
 
  **What failed:**
+
  - ❌ Question 2: YouTube video analysis (vision tool) - "Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed"
  - Issue: Vision tool requires multimodal LLM access (quota-limited or needs configuration)
167