github-actions[bot] committed on
Commit
c8fd33f
·
0 Parent(s):

Deploy to HuggingFace Space

.clinerules ADDED
@@ -0,0 +1,98 @@
+ # Python Coding Conventions for GAIA Benchmark Agent
+
+ ## Naming Conventions
+ - Private methods and functions MUST start with underscore (_)
+ - Internal helper methods that are only called within the same class/module are private
+ - Public API methods should NOT start with underscore
+ - Class names use PascalCase
+ - Functions and methods use snake_case
+ - Constants use UPPER_SNAKE_CASE
+
+ ## Function Privacy Rules
+ These functions should be private (prefixed with _):
+ - Helper functions used only within the same class
+ - Internal implementation details not part of public API
+ - Functions only called by other functions in the same module
+
+ These functions should be public (no underscore):
+ - Entry points (like main())
+ - Functions called from other modules
+ - API endpoints
+ - Functions passed as callbacks to external libraries
+
+ ## Function Organization
+ - Helper functions used only internally should be marked private
+ - Public functions should have comprehensive docstrings with Args, Returns, Raises
+ - Private functions should have brief docstrings explaining their purpose
+ - Group related functions together
+
+ ## Import Organization
+ - Standard library imports first
+ - Third-party imports second
+ - Local application imports last
+ - Separate groups with blank lines
+ - Use absolute imports for clarity
+
+ ## Documentation
+ - All public functions must have Google-style docstrings
+ - Include Args, Returns, and Raises sections where applicable
+ - Private functions should have brief one-line docstrings
+ - Avoid redundant comments that just repeat what the code does
+
+ ## Code Structure
+ - Prefer composition over inheritance
+ - Use wrapper classes for extensibility (like MyGAIAAgents)
+ - Keep functions focused on single responsibility
+ - Extract complex logic into private helper methods
+ - Classes should have clear, single responsibilities
+
+ ## Error Handling
+ - Use specific exception types, not bare except clauses
+ - Validate inputs at API boundaries using validators module
+ - Log errors with context information
+ - Use custom exception classes where appropriate (like ValidationError)
+
+ ## Testing Philosophy
+ - Public API should be easily testable
+ - Private methods don't need direct tests (tested via public API)
+ - Test behavior, not implementation
+
+ ## Project-Specific Architecture Rules
+ - Agent implementations should be in separate files (e.g., langgraphagent.py)
+ - All agent classes must implement __call__(question, file_name) method
+ - Configuration should be centralized in config.py
+ - Use ResultFormatter for all output formatting
+ - Use QuestionLoader for all question fetching
+ - Use AgentRunner for agent execution orchestration
+
+ ## Type Hints
+ - Use type hints for all function signatures
+ - Import types from typing module
+ - Use Optional[] for nullable parameters
+ - Use List[], Dict[], Tuple[] for collections
+
+ ## Async/Concurrency
+ - Not currently used in this project
+ - If added, use async/await consistently
+ - Document async functions clearly
+
+ ## File Organization
+ - One class per file for major components
+ - Related utility functions can share a module
+ - Keep files under 500 lines when possible
+
+ ## Comments
+ - Use # for single-line comments
+ - Use """docstrings""" for function/class documentation
+ - Avoid obvious comments like "# increment counter"
+ - Explain WHY, not WHAT (code shows what, comments explain why)
+
+ ## Git Commit Messages
+ - Use conventional commit format
+ - Include "Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>" for AI-assisted code
+ - Explain both what changed and why
+
+ ## External Libraries
+ - scorer.py is copied from official GAIA - do NOT modify function names
+ - Prefer using existing utilities over creating new ones
+ - Document external dependencies clearly
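As a rough illustration of how several of the rules in `.clinerules` combine, here is a minimal sketch of a small module that follows them. Every name in it (`QuestionPreviewer`, `MAX_PREVIEW_LENGTH`, the local `ValidationError`) is hypothetical and is not taken from the repository; it only shows the naming, privacy, docstring, type-hint, and error-handling conventions side by side.

```python
"""Illustrative module following the conventions above (all names hypothetical)."""

from typing import Dict, List, Optional

MAX_PREVIEW_LENGTH = 20  # Constants use UPPER_SNAKE_CASE


class ValidationError(ValueError):
    """Raised when input fails validation at an API boundary."""


class QuestionPreviewer:
    """Builds short previews of question records."""

    def preview_all(
        self, questions: List[Dict[str, str]], limit: Optional[int] = None
    ) -> List[str]:
        """Return a truncated preview for each question.

        Args:
            questions: Records that each contain a "question" key.
            limit: Maximum preview length. If None, uses MAX_PREVIEW_LENGTH.

        Returns:
            One preview string per input record.

        Raises:
            ValidationError: If a record has no "question" key.
        """
        max_length = MAX_PREVIEW_LENGTH if limit is None else limit
        return [self._truncate(self._extract_text(q), max_length) for q in questions]

    def _extract_text(self, record: Dict[str, str]) -> str:
        """Pull the question text out of a record."""
        try:
            return record["question"]
        except KeyError as exc:  # Specific exception type, not a bare except
            raise ValidationError(f"Record is missing 'question': {record}") from exc

    def _truncate(self, text: str, max_length: int) -> str:
        """Shorten text to max_length characters, adding an ellipsis if cut."""
        return text if len(text) <= max_length else text[:max_length] + "..."
```

Only `preview_all` is public, so under the testing rules above it is the only method that would need direct tests; the two helpers are exercised through it.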
.github/workflows/sync-to-hf.yml ADDED
@@ -0,0 +1,36 @@
+ name: Sync to Hugging Face
+
+ on:
+   push:
+     branches:
+       - main
+     paths-ignore:
+       - 'README.md'
+       - 'docs/**'
+       - '**.md'
+       - 'LICENSE'
+
+ jobs:
+   sync:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 0
+           lfs: true
+
+       - name: Push to Hugging Face Space
+         env:
+           HF_SYNC_TOKEN: ${{ secrets.HF_SYNC_TOKEN }}
+         run: |
+           git config --global user.email "github-actions[bot]@users.noreply.github.com"
+           git config --global user.name "github-actions[bot]"
+           git remote add hf https://hemantvirmani:$HF_SYNC_TOKEN@huggingface.co/spaces/hemantvirmani/Final_Assignment_Template
+           # Push only the current file state as a single orphan commit.
+           # This prevents HF from seeing old commits that contained binary files.
+           git checkout --orphan hf-deploy
+           git add -A
+           git commit -m "Deploy to HuggingFace Space"
+           git push hf hf-deploy:main --force
+
.gitignore ADDED
@@ -0,0 +1,8 @@
+ __pycache__/*.pyc
+ .venv/
+
+ # GAIA question attachment files - downloaded at runtime from HuggingFace dataset
+ # Keep only questions.json and metadata.jsonl; ignore everything else in files/
+ files/*
+ !files/questions.json
+ !files/metadata.jsonl
README.md ADDED
@@ -0,0 +1,310 @@
+ ---
+ title: GAIA Benchmark Agent
+ emoji: 🕵🏻‍♂️
+ colorFrom: indigo
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 6.2.0
+ app_file: app.py
+ pinned: false
+ hf_oauth: true
+ hf_oauth_expiration_minutes: 480
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+
+ # GAIA Benchmark Agent
+
+ A LangGraph-based AI agent designed to solve questions from the GAIA (General AI Assistants) benchmark. This agent uses Google's Gemini model with custom tools for web search, file processing, and multimodal analysis to answer complex questions requiring reasoning and information gathering.
+
+ ## Features
+
+ - **LangGraph Architecture**: Implements a state-graph agent workflow with tool calling capabilities
+ - **Multimodal Capabilities**:
+   - Image analysis (PNG, JPG, JPEG, GIF, WebP, BMP)
+   - YouTube video analysis and transcript extraction
+   - Audio transcription (MP3)
+   - PDF and Excel file processing
+ - **Web Research Tools**:
+   - DuckDuckGo web search
+   - Wikipedia integration
+   - ArXiv academic paper search
+   - Web page content extraction
+ - **Mathematical Operations**: Basic arithmetic and modulus operations
+ - **Gradio Interface**: User-friendly web UI for testing and evaluation
+ - **Automated Evaluation**: Fetches questions from API, processes them, and submits answers
+ - **Observability**: Built-in integration with Langfuse for tracking traces and metrics
+
+ ## Project Structure
+
+ ```
+ GAIA_Benchmark_Agent/
+ ├── app.py              # Main application entry point
+ ├── agents.py           # LangGraph agent implementation
+ ├── custom_tools.py     # Tool definitions for web search, files, etc.
+ ├── system_prompt.py    # Agent system prompt and instructions
+ ├── gradioapp.py        # Gradio UI components
+ ├── requirements.txt    # Python dependencies
+ └── files/
+     └── metadata.jsonl  # Ground truth data for local testing
+ ```
+
+ ## Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/yourusername/GAIA_Benchmark_Agent.git
+ cd GAIA_Benchmark_Agent
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Set up environment variables:
+ ```bash
+ export GOOGLE_API_KEY="your_google_api_key"
+ export HUGGINGFACEHUB_API_TOKEN="your_hf_token" # Optional; not yet used
+
+ # Langfuse Observability (Optional)
+ export LANGFUSE_PUBLIC_KEY="pk-lf-..."
+ export LANGFUSE_SECRET_KEY="sk-lf-..."
+ export LANGFUSE_HOST="https://cloud.langfuse.com" # Optional
+ ```
+
+ ## Requirements
+
+ - Python 3.8+
+ - Google API Key (for Gemini model)
+ - ffmpeg (optional, for audio processing)
+
+ ### Key Dependencies
+
+ - `langchain-core`, `langgraph` - Agent framework
+ - `langchain-google-genai` - Google Gemini integration
+ - `gradio` - Web UI
+ - `requests`, `beautifulsoup4` - Web scraping
+ - `pypdf`, `pandas` - File processing
+ - `youtube-transcript-api` - YouTube integration
+ - `ddgs` - DuckDuckGo search
+
+ ## Usage
+
+ ### Running the Gradio Interface
+
+ Launch the web interface for interactive testing:
+
+ ```bash
+ python app.py
+ ```
+
+ This will start a Gradio app where you can:
+ - Log in with your Hugging Face account
+ - Run evaluation on all questions
+ - Test individual questions
+ - View results and scores
+
+ ### Running Local Tests
+
+ Test the agent on specific questions without the web interface:
+
+ ```bash
+ python app.py --test
+ ```
+
+ Edit the question indices in [app.py:196](app.py#L196) to customize which questions to test.
+
+ ### Using the Agent Programmatically
+
+ ```python
+ from agents import MyGAIAAgents
+
+ # Initialize agent (automatically uses ACTIVE_AGENT from config)
+ agent = MyGAIAAgents()
+
+ # Ask a question
+ answer = agent("What is the capital of France?")
+ print(answer)
+
+ # Ask a question with a file reference
+ answer = agent(
+     "What data is in this spreadsheet?",
+     file_name="data.xlsx"
+ )
+ print(answer)
+ ```
+
+ ## How It Works
+
+ ### Agent Architecture
+
+ The agent is built using LangGraph with the following workflow:
+
+ 1. **Initialize**: Loads the question and system prompt
+ 2. **Assistant Node**: Calls the LLM (Gemini) to decide on tool usage
+ 3. **Tool Node**: Executes requested tools (search, file reading, etc.)
+ 4. **Iteration**: Loops between assistant and tools until answer is found
+ 5. **Termination**: Returns final answer or hits step limit (25 steps max)
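The five workflow steps can be sketched in plain Python, without LangGraph, to make the control flow concrete. This is a simplified stand-in under stated assumptions and not the repository's implementation: `run_agent_loop`, `scripted_assistant`, and the tuple-based `Decision` type are hypothetical, and the scripted assistant replaces the Gemini call so the example runs offline. Only the 25-step limit comes from the README.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 25  # Step limit stated in the README

# A decision is ("tool", tool_name, argument) or ("answer", text, "").
Decision = Tuple[str, str, str]


def run_agent_loop(
    question: str,
    assistant: Callable[[List[str]], Decision],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = MAX_STEPS,
) -> str:
    """Alternate between the assistant and tools until an answer or the limit."""
    messages = [f"question: {question}"]  # 1. Initialize
    for _ in range(max_steps):
        kind, name_or_text, argument = assistant(messages)  # 2. Assistant node
        if kind == "answer":
            return name_or_text  # 5. Termination with a final answer
        tool_result = tools[name_or_text](argument)  # 3. Tool node
        messages.append(f"{name_or_text}({argument}) -> {tool_result}")  # 4. Iterate
    return "Step limit reached without a final answer."  # 5. Termination by limit


def scripted_assistant(messages: List[str]) -> Decision:
    """Stand-in for the LLM: call `add` once, then answer with its result."""
    if len(messages) == 1:
        return ("tool", "add", "2,3")
    return ("answer", messages[-1].split(" -> ")[1], "")


def add(argument: str) -> str:
    """Toy tool that adds two comma-separated integers."""
    a, b = argument.split(",")
    return str(int(a) + int(b))


print(run_agent_loop("What is 2 + 3?", scripted_assistant, {"add": add}))  # prints 5
```

The step limit is what keeps an assistant that never produces an answer from looping forever; in that case the loop falls through and returns the fallback message.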
+
+ ### Available Tools
+
+ **Search & Research:**
+ - `websearch` - DuckDuckGo web search
+ - `wiki_search` - Wikipedia articles
+ - `arvix_search` - Academic papers
+ - `get_webpage_content` - Extract webpage text
+ - `get_youtube_transcript` - YouTube video transcripts
+ - `analyze_youtube_video` - AI analysis of YouTube videos
+
+ **File Processing:**
+ - `read_excel_file` - Read Excel spreadsheets
+ - `read_python_script` - Read Python source code
+ - `parse_audio_file` - Transcribe MP3 files
+ - `analyze_image` - AI vision analysis of images
+
+ **Utilities:**
+ - Math operations: `add`, `subtract`, `multiply`, `divide`, `power`, `modulus`
+ - `string_reverse` - Reverse encoded/gibberish text
+ - `get_current_time_in_timezone` - Get time in any timezone
+
+ ### System Prompt
+
+ The agent follows strict output formatting rules defined in [system_prompt.py](system_prompt.py):
+ - Returns only the final answer (no conversational filler)
+ - No markdown formatting or JSON structures
+ - Uses tools instead of guessing
+ - Handles encoded/reversed text
+ - Verifies answers before output
+
+ ## Configuration
+
+ ### Change Agent Type
+
+ Edit the `ACTIVE_AGENT` variable in [config.py:32](config.py#L32):
+
+ ```python
+ # Valid values: "LangGraph", "ReActLangGraph", "LLamaIndex", "SMOL"
+ ACTIVE_AGENT = "LangGraph" # Currently only LangGraph is implemented
+ ```
+
+ The `MyGAIAAgents` wrapper class will automatically instantiate the correct agent based on this configuration.
+
+ ### Adjust Step Limits
+
+ Modify the maximum iteration count in [agents.py:169](agents.py#L169):
+
+ ```python
+ if step_count >= 25: # Change this value
+     # ...
+ ```
+
+ ### Customize Tools
+
+ Add or modify tools in [custom_tools.py](custom_tools.py) using the `@tool` decorator:
+
+ ```python
+ from langchain_core.tools import tool
+
+ @tool
+ def my_custom_tool(param: str) -> str:
+     """Tool description for the LLM."""
+     result = param.strip()  # Placeholder: replace with the real implementation
+     return result
+ ```
+
+ ## API Integration
+
+ The agent integrates with the GAIA benchmark API:
+
+ - **Questions Endpoint**: `https://agents-course-unit4-scoring.hf.space/questions`
+ - **Submit Endpoint**: `https://agents-course-unit4-scoring.hf.space/submit`
+
+ Questions may include file references which are automatically fetched from:
+ - Local `files/` directory (if available)
+ - Remote API endpoint (fallback)
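The local-first, remote-fallback lookup described above could look roughly like the following. This is a sketch under assumptions, not the repository's code: `resolve_file` is a hypothetical helper, and the `/files/{task_id}` route is an assumed download path that is not stated in this README. The downloader is passed in as a callable so the lookup logic can be exercised without network access.

```python
from pathlib import Path
from typing import Callable

API_URL = "https://agents-course-unit4-scoring.hf.space"  # Base URL from the README
LOCAL_DIR = Path("files")  # Local cache directory named in the README


def resolve_file(
    task_id: str,
    file_name: str,
    download: Callable[[str], bytes],
    local_dir: Path = LOCAL_DIR,
) -> Path:
    """Return a local path for file_name, downloading it only if it is missing."""
    local_path = local_dir / file_name
    if local_path.exists():
        return local_path  # Local copy available: no network call

    # Fallback. The "/files/{task_id}" route is an assumption for illustration.
    content = download(f"{API_URL}/files/{task_id}")
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path.write_bytes(content)
    return local_path
```

In real use, `download` might be something like `lambda url: requests.get(url, timeout=30).content`; injecting it keeps the function easy to test with a fake.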
+
+ ## Testing
+
+ ### Local Ground Truth Verification
+
+ The app includes local verification against ground truth data in `files/metadata.jsonl`. This allows you to test your agent's performance before submitting to the evaluation server.
+
+ ### Test Mode
+
+ Run specific questions in test mode by modifying [app.py:196](app.py#L196):
+
+ ```python
+ my_questions = [
+     {
+         "question": my_questions_data[i]["question"],
+         "file_name": my_questions_data[i].get("file_name")
+     }
+     for i in (0, 5, 17) if i < len(my_questions_data) # Customize indices
+ ]
+ ```
+
+ ## Performance Considerations
+
+ - **Timeout**: Agent has 180-second timeout per question
+ - **Step Limit**: Maximum 25 reasoning steps to prevent infinite loops
+ - **Tool Timeouts**: Individual tools have their own timeout settings
+ - **Cost**: Uses Google Gemini API (gemini-2.5-flash model)
+
+ ## Deployment
+
+ ### Hugging Face Spaces
+
+ This project is designed to run on Hugging Face Spaces:
+
+ 1. Create a new Space on Hugging Face
+ 2. Set SDK to Gradio (version 6.2.0+)
+ 3. Add environment variables: `GOOGLE_API_KEY`, `SPACE_ID`, `SPACE_HOST`
+ 4. Enable OAuth authentication
+
+ The app will automatically detect the Hugging Face environment and configure URLs accordingly.
+
+ ### Local Deployment
+
+ Simply run `python app.py` locally. The app will detect it's not in a Hugging Face Space and adjust behavior accordingly.
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ **"GOOGLE_API_KEY not found"**
+ - Set the environment variable: `export GOOGLE_API_KEY="your_key"`
+
+ **Audio parsing fails**
+ - Install ffmpeg: `apt-get install ffmpeg` (Linux) or `brew install ffmpeg` (macOS)
+
+ **Tool timeouts**
+ - Adjust timeout values in respective tool functions in [custom_tools.py](custom_tools.py)
+
+ **Agent exceeds step limit**
+ - Increase limit in [agents.py:169](agents.py#L169) or optimize tool usage in system prompt
+
+ ## Contributing
+
+ Contributions are welcome! Areas for improvement:
+ - Add more tools (database access, code execution, etc.)
+ - Raise the benchmark score from 50% to 100%
+ - Improve error handling and retry logic
+ - Try with smaller LLMs
+ - Make it work with Ollama
+
+ ## License
+
+ This project is open-source and available under the MIT License.
+
+ ## Acknowledgments
+
+ - Built for the GAIA (General AI Assistants) benchmark
+ - Uses Google's Gemini model via LangChain
+ - LangGraph framework by LangChain
+ - Gradio for web interface
+
+ ## Contact
+
+ For questions, issues, or suggestions, please open an issue on GitHub.
agent_runner.py ADDED
@@ -0,0 +1,64 @@
+ """Agent execution functionality for running questions through the GAIA agent."""
+
+ from typing import Optional, Tuple, List, Dict
+ from colorama import Fore, Style
+ from agents import MyGAIAAgents
+ import config
+
+ class AgentRunner:
+     """Handles agent execution and question processing.
+     """
+
+     def __init__(self, active_agent: str = None):
+         """Initialize the AgentRunner.
+
+         Args:
+             active_agent: The agent type to use. If None, uses config.ACTIVE_AGENT.
+         """
+         self.agent = None
+         self.active_agent = active_agent
+
+     def _initialize_agent(self) -> bool:
+         """Initialize the agent. Returns True if successful."""
+         try:
+             self.agent = MyGAIAAgents(active_agent=self.active_agent)
+             return True
+         except Exception as e:
+             print(f"{Fore.RED}Error instantiating agent: {e}{Style.RESET_ALL}")
+             return False
+
+     def run_on_questions(self, questions_data: List[Dict]) -> Optional[List[Tuple]]:
+         """Run agent on a list of questions and return results."""
+         if not self._initialize_agent():
+             return None
+
+         results = []
+         total = len(questions_data)
+         print(f"{Fore.CYAN}Running agent on {total} questions...{Style.RESET_ALL}")
+
+         for idx, item in enumerate(questions_data, 1):
+             task_id = item.get("task_id")
+             question_text = item.get("question")
+             file_name = item.get("file_name")
+
+             if not task_id or question_text is None:
+                 print(f"\n{Fore.YELLOW}Skipping item with missing task_id or question: {item}{Style.RESET_ALL}\n")
+                 continue
+
+             print(f"\n{'#' * config.SEPARATOR_WIDTH}")
+             print(f"{Fore.CYAN}Processing Question {idx}/{total} - Task ID: {task_id}{Style.RESET_ALL}")
+             print(f"{'#' * config.SEPARATOR_WIDTH}")
+
+             try:
+                 answer = self.agent(question_text, file_name=file_name)
+
+                 print(f"\n{Fore.GREEN}[RESULT] Task ID: {task_id}{Style.RESET_ALL}")
+                 print(f"Question: {question_text[:config.QUESTION_PREVIEW_LENGTH]}{'...' if len(question_text) > config.QUESTION_PREVIEW_LENGTH else ''}")
+                 print(f"Answer: {answer}")
+                 results.append((task_id, question_text, answer))
+             except Exception as e:
+                 print(f"{Fore.RED}[ERROR] Exception running agent on task {task_id}: {e}{Style.RESET_ALL}")
+                 error_msg = f"AGENT ERROR: {str(e)[:config.ERROR_MESSAGE_LENGTH]}"
+                 results.append((task_id, question_text, error_msg))
+
+         return results
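`run_on_questions` returns a list of `(task_id, question_text, answer)` tuples. The sketch below shows one way such tuples might be turned into a submission payload. It is hypothetical: the real `ResultFormatter` is not part of this diff, `format_for_api` and `build_submission` here are stand-ins, and the `submitted_answer` key is an assumption about what the scoring server expects. The top-level keys (`username`, `agent_code`, `answers`) and the `agent_code` URL pattern follow `submit_and_score` in `app.py`.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, str, str]  # (task_id, question_text, answer)


def format_for_api(results: List[Result]) -> List[Dict[str, str]]:
    """Drop the question text and keep one answer entry per task.

    The "submitted_answer" key is an assumption, not confirmed by this diff.
    """
    return [
        {"task_id": task_id, "submitted_answer": str(answer)}
        for task_id, _question_text, answer in results
    ]


def build_submission(username: str, space_id: str, results: List[Result]) -> dict:
    """Assemble a payload with the same top-level keys used in submit_and_score."""
    return {
        "username": username,
        "agent_code": f"https://huggingface.co/spaces/{space_id}/tree/main",
        "answers": format_for_api(results),
    }
```

Because failed questions are recorded as `"AGENT ERROR: ..."` strings instead of being dropped, every task that was attempted still produces an entry in `answers`.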
agents.py ADDED
@@ -0,0 +1,51 @@
+ """Agent wrapper module for GAIA Benchmark."""
+
+ import config
+
+ # All agents are imported lazily to avoid loading unnecessary dependencies
+ # and suppress warnings from unused agent implementations
+
+
+ class MyGAIAAgents:
+     """Wrapper class to manage multiple agent implementations.
+
+     This class provides a unified interface for different agent types.
+     The active agent is determined by the ACTIVE_AGENT configuration or constructor parameter.
+     """
+
+     def __init__(self, active_agent: str = None):
+         """Initialize the wrapper with the active agent.
+
+         Args:
+             active_agent: The agent type to use. If None, uses config.ACTIVE_AGENT.
+                 Valid values: config.AGENT_LANGGRAPH, config.AGENT_REACT_LANGGRAPH
+         """
+         if active_agent is None:
+             active_agent = config.ACTIVE_AGENT
+
+         if active_agent == config.AGENT_LANGGRAPH:
+             from langgraphagent import LangGraphAgent
+             self.agent = LangGraphAgent()
+         elif active_agent == config.AGENT_REACT_LANGGRAPH:
+             from reactlanggraphagent import ReActLangGraphAgent
+             self.agent = ReActLangGraphAgent()
+         elif active_agent == config.AGENT_LLAMAINDEX:
+             from llamaindexagent import LlamaIndexAgent
+             self.agent = LlamaIndexAgent()
+         else:
+             # Default to LangGraph if unknown agent type
+             print(f"[WARNING] Unknown agent type '{active_agent}', defaulting to {config.AGENT_LANGGRAPH}")
+             from langgraphagent import LangGraphAgent
+             self.agent = LangGraphAgent()
+
+     def __call__(self, question: str, file_name: str = None) -> str:
+         """Invoke the active agent with the given question.
+
+         Args:
+             question: The question to answer
+             file_name: Optional file name if the question references a file
+
+         Returns:
+             The agent's answer as a string
+         """
+         return self.agent(question, file_name)
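The wrapper above selects an implementation with an `if`/`elif` chain and imports each agent module only when it is chosen. One possible alternative, shown here only as a sketch and not as the author's method, is a dictionary-based registry combined with `importlib`, which keeps the lazy-import behavior while making it a one-line change to add an agent. The registry keys are assumed to match the string values listed in the README (`"LangGraph"`, `"ReActLangGraph"`, `"LLamaIndex"`); `AGENT_REGISTRY`, `DEFAULT_AGENT`, and `load_agent_class` are hypothetical names.

```python
import importlib
from typing import Dict, Tuple

# Agent type -> (module name, class name). Module and class names follow the
# imports in MyGAIAAgents; the registry itself is hypothetical.
AGENT_REGISTRY: Dict[str, Tuple[str, str]] = {
    "LangGraph": ("langgraphagent", "LangGraphAgent"),
    "ReActLangGraph": ("reactlanggraphagent", "ReActLangGraphAgent"),
    "LLamaIndex": ("llamaindexagent", "LlamaIndexAgent"),
}
DEFAULT_AGENT = "LangGraph"


def load_agent_class(
    active_agent: str,
    registry: Dict[str, Tuple[str, str]] = AGENT_REGISTRY,
    default: str = DEFAULT_AGENT,
) -> type:
    """Look up an agent class by name, importing its module only on demand."""
    if active_agent not in registry:
        print(f"[WARNING] Unknown agent type '{active_agent}', defaulting to {default}")
        active_agent = default
    module_name, class_name = registry[active_agent]
    module = importlib.import_module(module_name)  # Lazy: nothing imported until here
    return getattr(module, class_name)
```

With this helper, the constructor body would reduce to something like `self.agent = load_agent_class(active_agent)()`, at the cost of making the imports invisible to static analysis tools.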
app.py ADDED
@@ -0,0 +1,383 @@
+ import os
+ import argparse
+ import requests
+ import pandas as pd
+ import json
+ import time
+ import warnings
+ import logging
+ from enum import Enum
+ from colorama import init
+
+ # Initialize colorama for Windows compatibility
+ init(autoreset=True)
+
+ # Suppress asyncio event loop cleanup warnings (common on HF Spaces)
+ warnings.filterwarnings('ignore', message='.*Invalid file descriptor.*')
+ logging.getLogger('asyncio').setLevel(logging.ERROR)
+
+ # Import configuration
+ import config
+
+ # Agent-related code is imported via agent_runner module
+ # Import Gradio UI creation function
+ from gradioapp import create_ui
+ # Import scoring function for answer verification
+ from scorer import question_scorer
+
+ # Import new utilities
+ from question_loader import QuestionLoader
+ from result_formatter import ResultFormatter
+ from agent_runner import AgentRunner
+ from validators import InputValidator, ValidationError
+ from utils import retry_with_backoff
+
+ # --- Run Modes ---
+ class RunMode(Enum):
+     UI = "ui"    # Gradio UI mode
+     CLI = "cli"  # Command-line test mode
+
+
+ @retry_with_backoff(max_retries=3, initial_delay=2.0)
+ def _submit_to_server(submit_url: str, submission_data: dict) -> dict:
+     """Internal function to submit to server (with retries)."""
+     response = requests.post(submit_url, json=submission_data, timeout=config.SUBMIT_TIMEOUT)
+     response.raise_for_status()
+     return response.json()
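`retry_with_backoff` is imported from `utils`, which is not part of the visible diff. A decorator with the same call signature (`max_retries`, `initial_delay`) could be sketched as follows. This is an illustration of the general technique, exponential backoff, and not the repository's implementation: the `backoff_factor`, `exceptions`, and `sleep` parameters are additions made here, and it assumes that `max_retries=3` means one initial call plus up to three retries.

```python
import functools
import time
from typing import Callable, Tuple, Type


def retry_with_backoff(
    max_retries: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0,  # Assumption: delay doubles after each failure
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,  # Injectable for tests
):
    """Retry the wrapped function, waiting longer after each failed attempt."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        raise  # Out of retries: surface the last error
                    sleep(delay)
                    delay *= backoff_factor

        return wrapper

    return decorator
```

With the defaults, a call that keeps failing waits 2, 4, and 8 seconds between attempts and then re-raises, so callers such as `submit_and_score` can still handle `requests` exceptions in their own `except` clauses.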
+
+ def submit_and_score(username: str, results: list) -> str:
+     """
+     Submit answers to the GAIA scoring server and return status message.
+
+     Args:
+         username: Hugging Face username for submission
+         results: List of tuples (task_id, question_text, answer)
+
+     Returns:
+         str: Status message (success or error details)
+     """
+     # Validate username
+     try:
+         username = InputValidator.validate_username(username)
+     except ValidationError as e:
+         error_msg = f"Invalid username: {e}"
+         print(error_msg)
+         return error_msg
+
+     # Format results for API submission
+     answers_payload = ResultFormatter.format_for_api(results)
+
+     if not answers_payload:
+         error_msg = "No answers to submit."
+         print(error_msg)
+         return error_msg
+
+     space_id = config.SPACE_ID
+     submit_url = f"{config.DEFAULT_API_URL}/submit"
+     agent_code = f"https://huggingface.co/spaces/{space_id}/tree/main"
+
+     # Prepare submission data
+     submission_data = {
+         "username": username,
+         "agent_code": agent_code,
+         "answers": answers_payload
+     }
+
+     print(f"\n{'=' * config.SEPARATOR_WIDTH}")
+     print(f"Submitting {len(answers_payload)} answers for user '{username}'...")
+     print(f"{'=' * config.SEPARATOR_WIDTH}\n")
+
+     # Submit to server
+     print(f"Submitting to: {submit_url}")
+     try:
+         result_data = _submit_to_server(submit_url, submission_data)
+
+         final_status = (
+             f"Submission Successful!\n"
+             f"User: {result_data.get('username')}\n"
+             f"Overall Score: {result_data.get('score', 'N/A')}% "
+             f"({result_data.get('correct_count', '?')}/{result_data.get('total_attempted', '?')} correct)\n"
+             f"Message: {result_data.get('message', 'No message received.')}"
+         )
+         print("Submission successful.")
+         return final_status
+
+     except requests.exceptions.HTTPError as e:
+         error_detail = f"Server responded with status {e.response.status_code}."
+         try:
+             error_json = e.response.json()
+             error_detail += f" Detail: {error_json.get('detail', e.response.text)}"
+         except requests.exceptions.JSONDecodeError:
+             error_detail += f" Response: {e.response.text[:500]}"
+         status_message = f"Submission Failed: {error_detail}"
+         print(status_message)
+         return status_message
+
+     except requests.exceptions.Timeout:
+         status_message = "Submission Failed: The request timed out."
+         print(status_message)
+         return status_message
+
+     except requests.exceptions.RequestException as e:
+         status_message = f"Submission Failed: Network error - {e}"
+         print(status_message)
+         return status_message
+
+     except Exception as e:
+         status_message = f"An unexpected error occurred during submission: {e}"
+         print(status_message)
+         return status_message
+
+
+ def run_and_submit_all(username: str, active_agent: str = None) -> tuple:
+     """
+     Fetches all questions, runs the GAIA agent on them, submits all answers,
+     and displays the results.
+
+     Args:
+         username: Hugging Face username for submission
+         active_agent: The agent type to use (default: config.AGENT_LANGGRAPH)
+
+     Returns:
+         tuple: (status_message: str, results_df: pd.DataFrame)
+     """
+     # Fetch questions from API (always online for submission)
+     try:
+         questions_data = QuestionLoader().get_questions(test_mode=False)
+     except Exception as e:
+         return f"Error loading questions: {e}", None
+
+     # Validate questions data
+     try:
+         questions_data = InputValidator.validate_questions_data(questions_data)
+     except ValidationError as e:
+         return f"Invalid questions data: {e}", None
+
+     results = AgentRunner(active_agent=active_agent).run_on_questions(questions_data)
+
+     if results is None:
+         return "Error initializing agent.", None
+
+     # Submit answers and get score (formatting happens inside submit_and_score)
+     status_message = submit_and_score(username, results)
+
+     # Format results for UI display
+     results_for_display = ResultFormatter.format_for_display(results)
+     results_df = pd.DataFrame(results_for_display)
+     return status_message, results_df
+
+ def _load_ground_truth(file_path: str = config.METADATA_FILE) -> dict:
+     """Load ground truth data indexed by task_id.
+
+     Args:
+         file_path: Path to the metadata file
+
+     Returns:
+         dict: Mapping of task_id -> {"question": str, "answer": str}
+     """
+     truth_mapping = {}
+     try:
+         with open(file_path, 'r', encoding='utf-8') as f:
+             for line in f:
+                 data = json.loads(line)
+                 task_id = data.get("task_id")
+                 question = data.get("Question")
+                 answer = data.get("Final answer")
+                 if task_id and answer:
+                     truth_mapping[task_id] = {
+                         "question": question,
+                         "answer": answer
+                     }
+     except Exception as e:
+         print(f"Error loading ground truth: {e}")
+     return truth_mapping
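To make the expected shape of `files/metadata.jsonl` concrete, the sketch below runs the same parsing logic over two in-memory sample lines. The sample records are invented for illustration; only the key names (`task_id`, `Question`, `Final answer`) come from `_load_ground_truth`. The helper name `load_truth` and the blank-line guard are additions that the original function does not have.

```python
import json
from typing import Dict, Iterable

# Invented sample lines in JSON Lines form: one JSON object per line.
SAMPLE_LINES = [
    '{"task_id": "t1", "Question": "What is 2 + 2?", "Final answer": "4"}',
    '{"task_id": "t2", "Question": "No answer recorded"}',
]


def load_truth(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Index records by task_id, skipping any that lack an id or an answer."""
    truth = {}
    for line in lines:
        if not line.strip():
            continue  # Tolerate blank lines (not in the original function)
        data = json.loads(line)
        task_id = data.get("task_id")
        answer = data.get("Final answer")
        if task_id and answer:
            truth[task_id] = {"question": data.get("Question"), "answer": answer}
    return truth


print(load_truth(SAMPLE_LINES))
# {'t1': {'question': 'What is 2 + 2?', 'answer': '4'}}
```

The second record is dropped because it has no `Final answer`, which mirrors the `if task_id and answer` check in the original. One side effect of that check is that an empty-string answer would also be skipped.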
194
+
195
+ def _verify_answers(results: list, log_output: list, runtime: tuple = None) -> None:
196
+ """Verify answers against ground truth using the official GAIA scorer.
197
+
198
+ Args:
199
+ results: List of tuples (task_id, question_text, answer)
200
+ log_output: List to append verification results to
201
+ runtime: Optional tuple of (minutes, seconds) for total runtime
202
+ """
203
+ ground_truth = _load_ground_truth()
204
+ log_output.append("\n=== Verification Results ===")
205
+
206
+ correct_count = 0
207
+ total_count = 0
208
+
209
+ for task_id, question_text, answer in results:
210
+ if task_id in ground_truth:
211
+ truth_data = ground_truth[task_id]
212
+ correct_answer = truth_data["answer"]
213
+
214
+ # Use the official GAIA question_scorer for comparison
215
+ # This handles numbers, lists, and strings with proper normalization
216
+ is_correct = question_scorer(str(answer), str(correct_answer))
217
+
218
+ if is_correct:
219
+ correct_count += 1
220
+ total_count += 1
221
+
222
+ log_output.append(f"Task ID: {task_id}")
223
+ log_output.append(f"Question: {question_text[:config.ERROR_MESSAGE_LENGTH]}...")
224
+ log_output.append(f"Expected: {correct_answer}")
225
+ log_output.append(f"Got: {answer}")
226
+ log_output.append(f"Match: {'✓ Correct' if is_correct else '✗ Incorrect'}\n")
227
+ else:
228
+ log_output.append(f"Task ID: {task_id}")
229
+ log_output.append(f"Question: {question_text[:config.ERROR_MESSAGE_LENGTH]}...")
230
+ log_output.append(f"No ground truth found.\n")
231
+
232
+ # Add summary statistics
233
+ if total_count > 0:
234
+ accuracy = (correct_count / total_count) * 100
235
+ log_output.append("=" * config.SEPARATOR_WIDTH)
236
+ log_output.append(f"SUMMARY: {correct_count}/{total_count} correct ({accuracy:.1f}%)")
237
+ if runtime:
238
+ minutes, seconds = runtime
239
+ log_output.append(f"Runtime: {minutes}m {seconds}s")
240
+ log_output.append("=" * config.SEPARATOR_WIDTH)
241
+
242
+ def run_test_code(filter=None, active_agent=None) -> pd.DataFrame:
243
+ """Run test code on selected questions.
244
+
245
+ Args:
246
+ filter: Optional tuple/list of question indices to test (e.g., (4, 7, 15)).
247
+ If None, processes all questions.
248
+ active_agent: Optional agent type to use (e.g., "LangGraph", "ReActLangGraph", "LLamaIndex").
249
+ If None, uses config.ACTIVE_AGENT.
250
+
251
+ Returns:
252
+ pd.DataFrame: Results and verification output
253
+ """
254
+ start_time = time.time()
255
+ logs_for_display = []
256
+ logs_for_display.append("=== Processing Example Questions One by One ===")
257
+
258
+ # Fetch questions (OFFLINE for testing)
259
+ try:
260
+ questions_data = QuestionLoader().get_questions(test_mode=True)
261
+ except Exception as e:
262
+ return pd.DataFrame([f"Error loading questions: {e}"])
263
+
264
+ # Validate questions data
265
+ try:
266
+ questions_data = InputValidator.validate_questions_data(questions_data)
267
+ except ValidationError as e:
268
+ return pd.DataFrame([f"Invalid questions data: {e}"])
269
+
270
+ # Validate and apply filter
271
+ try:
272
+ filter = InputValidator.validate_filter_indices(filter, len(questions_data))
273
+ except ValidationError as e:
274
+ return pd.DataFrame([f"Invalid filter: {e}"])
275
+
276
+ # Apply filter or use all questions
277
+ if filter is not None:
278
+ questions_to_process = [questions_data[i] for i in filter]
279
+ logs_for_display.append(f"Testing {len(questions_to_process)} selected questions (indices: {filter})")
280
+ else:
281
+ questions_to_process = questions_data
282
+ logs_for_display.append(f"Testing all {len(questions_to_process)} questions")
283
+
284
+ results = AgentRunner(active_agent=active_agent).run_on_questions(questions_to_process)
285
+
286
+ if results is None:
287
+ return pd.DataFrame(["Error initializing agent."])
288
+
289
+ logs_for_display.append("\n=== Completed Example Questions ===")
290
+
291
+ # Calculate runtime
292
+ elapsed_time = time.time() - start_time
293
+ minutes = int(elapsed_time // 60)
294
+ seconds = int(elapsed_time % 60)
295
+
296
+ _verify_answers(results, logs_for_display, runtime=(minutes, seconds))
297
+ return pd.DataFrame(logs_for_display)
298
+
299
+
300
+ def main() -> None:
301
+ """Main entry point for the application."""
302
+ parser = argparse.ArgumentParser(description="Run the agent application.")
303
+ parser.add_argument("--test", type=str, nargs='?', const='default', help="Run local tests on selected questions and exit. Optionally provide comma-separated question indices (e.g., --test 2,4,6). If no indices provided, uses default test questions.")
304
+ parser.add_argument("--testall", action="store_true", help="Run local tests on all questions and exit.")
305
+ parser.add_argument("--agent", type=str.lower, choices=['langgraph', 'reactlangg', 'llamaindex'], help="Agent to use in CLI mode (case-insensitive). Options: langgraph, reactlangg, llamaindex. Default: uses config.ACTIVE_AGENT")
306
+ args = parser.parse_args()
307
+
308
+ # Map agent name to config constant (case-insensitive)
309
+ agent_mapping = {
310
+ 'langgraph': config.AGENT_LANGGRAPH,
311
+ 'reactlangg': config.AGENT_REACT_LANGGRAPH,
312
+ 'llamaindex': config.AGENT_LLAMAINDEX,
313
+ }
314
+
315
+ active_agent = None
316
+ if args.agent:
317
+ agent_key = args.agent.lower()
318
+ active_agent = agent_mapping.get(agent_key)
319
+ if not active_agent:
320
+ print(f"Error: Unknown agent '{args.agent}'. Valid options: langgraph, reactlangg, llamaindex")
321
+ return
322
+ print(f"[CLI] Using agent: {active_agent}")
323
+
324
+ print(f"\n{'-' * 30} App Starting {'-' * 30}")
325
+
326
+ # Determine run mode
327
+ run_mode = RunMode.CLI if (args.test or args.testall) else RunMode.UI
328
+
329
+ # Print environment info only in UI mode
330
+ if run_mode == RunMode.UI:
331
+ space_host = config.SPACE_HOST
332
+ space_id = config.SPACE_ID
333
+
334
+ if space_host:
335
+ print(f"[OK] SPACE_HOST found: {space_host}")
336
+ print(f" Runtime URL should be: https://{space_host}")
337
+ else:
338
+ print("[INFO] SPACE_HOST environment variable not found (running locally?).")
339
+
340
+ if space_id:
341
+ print(f"[OK] SPACE_ID found: {space_id}")
342
+ print(f" Repo URL: https://huggingface.co/spaces/{space_id}")
343
+ print(f" Repo Tree URL: https://huggingface.co/spaces/{space_id}/tree/main")
344
+ else:
345
+ print("[INFO] SPACE_ID environment variable not found (running locally?). Repo URL cannot be determined.")
346
+
347
+ print(f"{'-' * (60 + len(' App Starting '))}\n")
348
+
349
+ # Execute based on run mode
350
+ if run_mode == RunMode.UI:
351
+ print("Launching Gradio Interface for Basic Agent Evaluation...")
352
+ grTestApp = create_ui(run_and_submit_all, run_test_code)
353
+ grTestApp.launch()
354
+
355
+ else: # RunMode.CLI
356
+ # Determine test filter based on which CLI flag was used
357
+ if args.test:
358
+ # Check if custom indices were provided
359
+ if args.test == 'default':
360
+ # No indices provided, use default
361
+ test_filter = config.DEFAULT_TEST_FILTER
362
+ else:
363
+ # Parse comma-separated indices
364
+ try:
365
+ test_filter = tuple(int(idx.strip()) for idx in args.test.split(','))
366
+ except ValueError:
367
+ print(f"Error: Invalid test indices '{args.test}'. Must be comma-separated integers (e.g., 2,4,6)")
368
+ return
369
+ else: # args.testall
370
+ test_filter = None # Test all questions
371
+
372
+ print(f"Running test code on {len(test_filter) if test_filter else 'ALL'} questions (CLI mode)...")
373
+ result = run_test_code(filter=test_filter, active_agent=active_agent)
374
+
375
+ # Print results
376
+ if isinstance(result, pd.DataFrame):
377
+ ResultFormatter.print_dataframe(result)
378
+ else:
379
+ print(result)
380
+
381
+
382
+ if __name__ == "__main__":
383
+ main()
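The `--test` flag in `main()` relies on argparse's `nargs='?'` together with `const='default'` to tell three cases apart: flag absent, flag given bare, and flag given with indices. A minimal standalone sketch of that behavior follows. `parse_test_filter` and `DEFAULT_FILTER` are hypothetical names used only for illustration; they are not part of the committed code.

```python
import argparse

# Hypothetical stand-in for config.DEFAULT_TEST_FILTER.
DEFAULT_FILTER = (4, 7, 15)


def parse_test_filter(argv):
    """Return a tuple of indices, or None when --test was not given."""
    parser = argparse.ArgumentParser()
    # nargs='?' makes the value optional; const is used when the flag is bare.
    parser.add_argument("--test", type=str, nargs="?", const="default")
    args = parser.parse_args(argv)

    if args.test is None:
        return None
    if args.test == "default":
        return DEFAULT_FILTER
    # int() raises ValueError on non-numeric pieces such as "2,x".
    return tuple(int(part.strip()) for part in args.test.split(","))


if __name__ == "__main__":
    print(parse_test_filter([]))
    print(parse_test_filter(["--test"]))
    print(parse_test_filter(["--test", "2, 4,6"]))
```

Because `int()` tolerates surrounding whitespace, the explicit `strip()` is only a readability aid; malformed input still surfaces as a `ValueError` that the caller can report.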
config.py ADDED
@@ -0,0 +1,63 @@
1
+ """Configuration settings for GAIA Benchmark Agent."""
2
+
3
+ import os
4
+
5
+ # API Configuration
6
+ DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
7
+ AGENT_TIMEOUT_SECONDS = 180 # 3 minutes max per question
8
+
9
+ # File Paths
10
+ QUESTIONS_FILE = "files/questions.json"
11
+ METADATA_FILE = "files/metadata.jsonl"
12
+ FILES_DIR = "files"
13
+
14
+ # API Timeouts (in seconds)
15
+ FETCH_TIMEOUT = 15
16
+ SUBMIT_TIMEOUT = 60
17
+ WEBPAGE_TIMEOUT = 30
18
+
19
+ # Test Configuration
20
+ DEFAULT_TEST_FILTER = (4, 7, 15) # 0-based indices, i.e. Q5, Q8, Q16
21
+
22
+ # Display Configuration
23
+ QUESTION_PREVIEW_LENGTH = 200 # Characters to show in question preview
24
+ ERROR_MESSAGE_LENGTH = 100 # Characters to show in error messages
25
+ SEPARATOR_WIDTH = 60 # Width of separator lines
26
+
27
+ # Environment Variables
28
+ SPACE_HOST = os.getenv("SPACE_HOST")
29
+ SPACE_ID = os.getenv("SPACE_ID")
30
+ GOOGLE_API_KEY = os.getenv("GOOGLE_DESKGENIE_KEY")
31
+
32
+ # Agent Type Constants
33
+ AGENT_LANGGRAPH = "LangGraph"
34
+ AGENT_REACT_LANGGRAPH = "ReActLangGraph"
35
+ AGENT_LLAMAINDEX = "LLamaIndex"
36
+ AGENT_SMOL = "SMOL"
37
+
38
+ ACTIVE_AGENT = AGENT_LANGGRAPH # Active agent to use by default
39
+
40
+ # Model Configuration
41
+ GEMINI_MODEL = "gemini-3.5-flash"
42
+ GEMINI_TEMPERATURE = 0
43
+ GEMINI_MAX_TOKENS = 1024
44
+
45
+ ACTIVE_AGENT_LLM_MODEL = GEMINI_MODEL
46
+
47
+ # Agent Step Limits
48
+ # AGENT_STEP_LIMIT is the single source of truth — the max number of assistant
49
+ # iterations (LLM + tool call) per question before the graph is force-terminated.
50
+ # The agent forces a final bare-answer call one step BEFORE this limit.
51
+ # AGENT_RECURSION_LIMIT is DERIVED so the invariant always holds: LangGraph's
52
+ # recursion_limit must exceed 2x the step limit (each step ~= 2 graph nodes:
53
+ # assistant + tools), plus a safety buffer.
54
+ AGENT_STEP_LIMIT = 60
55
+ AGENT_RECURSION_LIMIT = AGENT_STEP_LIMIT * 2 + 20
56
+
57
+ # ArXiv timeout
58
+ ARXIV_TIMEOUT_SECONDS = 30
59
+
60
+ # Retry Configuration for 504 DEADLINE_EXCEEDED errors
61
+ MAX_RETRIES = 3
62
+ INITIAL_RETRY_DELAY = 2.0 # seconds
63
+ RETRY_BACKOFF_FACTOR = 2.0
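As a worked example of the arithmetic these constants imply: the derived recursion limit comes out to 60 * 2 + 20 = 140. The retry loop itself is not shown in this file, so the sketch below assumes the delay is multiplied by `RETRY_BACKOFF_FACTOR` after each failed attempt; under that assumption the waits would be 2 s, 4 s, and 8 s. `backoff_delays` is a hypothetical helper, not a function from the repository.

```python
# Values copied from config.py.
AGENT_STEP_LIMIT = 60
AGENT_RECURSION_LIMIT = AGENT_STEP_LIMIT * 2 + 20
MAX_RETRIES = 3
INITIAL_RETRY_DELAY = 2.0
RETRY_BACKOFF_FACTOR = 2.0


def backoff_delays(max_retries, initial_delay, factor):
    """Hypothetical helper: delay before each retry under exponential backoff."""
    return [initial_delay * factor ** attempt for attempt in range(max_retries)]


print(AGENT_RECURSION_LIMIT)  # 140
print(backoff_delays(MAX_RETRIES, INITIAL_RETRY_DELAY, RETRY_BACKOFF_FACTOR))  # [2.0, 4.0, 8.0]
```

With these numbers, a question that exhausts all retries would spend about 14 seconds waiting, well inside the 180-second `AGENT_TIMEOUT_SECONDS` budget.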
custom_tools.py ADDED
@@ -0,0 +1,827 @@
1
+ import concurrent.futures
2
+ from ddgs import DDGS
3
+ from bs4 import BeautifulSoup
4
+ import requests
5
+ import re
6
+ import io
7
+ import os
8
+ import subprocess
9
+ import sys
10
+ from google import genai
11
+ from google.genai import types
12
+ import config
13
+
14
+ from langchain_community.document_loaders import WikipediaLoader
15
+ from langchain_community.document_loaders import ArxivLoader
16
+ from youtube_transcript_api import YouTubeTranscriptApi
17
+ from pytube import extract
18
+ from langchain_core.tools import tool
19
+
20
+ import pandas as pd
21
+ import speech_recognition as sr
22
+ from pydub import AudioSegment
23
+ from pypdf import PdfReader
24
+ from io import BytesIO
25
+ from markdownify import markdownify as md
26
+
27
+ # ============================================================================
28
+ # Shared HTTP headers
29
+ # ============================================================================
30
+ # Many sites (notably Wikipedia) return 403 to the default python-requests
31
+ # User-Agent. Send a browser-like UA for all outbound page/file fetches.
32
+ _HTTP_HEADERS = {
33
+ "User-Agent": (
34
+ "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
35
+ "AppleWebKit/537.36 (KHTML, like Gecko) "
36
+ "Chrome/120.0.0.0 Safari/537.36 GAIA-Agent/1.0"
37
+ )
38
+ }
39
+
40
+ # ============================================================================
41
+ # Per-question tool call counters (reset at start of each question)
42
+ # ============================================================================
43
+ _analyze_image_call_count = 0
44
+ MAX_ANALYZE_IMAGE_CALLS = 2
45
+
46
+ # Maps normalized websearch query -> its result string, for the current question.
47
+ # Lets websearch detect and short-circuit verbatim repeated queries (loop guard).
48
+ _websearch_seen_queries = {}
49
+
50
+
51
+ def reset_tool_counters():
52
+ """Reset per-question tool counters. Call at the start of each new question."""
53
+ global _analyze_image_call_count
54
+ _analyze_image_call_count = 0
55
+ _websearch_seen_queries.clear()
56
+
57
+
58
+ # ============================================================================
59
+ # Helper Functions (must be defined before tools that use them)
60
+ # ============================================================================
61
+
62
+ def _sanitize_file_path(file_name: str) -> tuple:
63
+ """
64
+ Sanitize file name to prevent path traversal attacks.
65
+
66
+ Args:
67
+ file_name: The file name to sanitize
68
+
69
+ Returns:
70
+ tuple: (is_valid: bool, sanitized_name_or_error: str)
71
+ """
72
+ # Check for path traversal attempts
73
+ if '..' in file_name or file_name.startswith('/') or file_name.startswith('\\'):
74
+ return False, "Invalid file name: path traversal not allowed"
75
+
76
+ # Check for absolute paths (Windows and Unix)
77
+ if os.path.isabs(file_name):
78
+ return False, "Invalid file name: absolute paths not allowed"
79
+
80
+ # Normalize the path and ensure it doesn't escape the files directory
81
+ normalized = os.path.normpath(file_name)
82
+ if normalized.startswith('..') or os.path.isabs(normalized):
83
+ return False, "Invalid file name: path traversal detected"
84
+
85
+ return True, normalized
86
+
87
+ def _get_file_content(file_name: str, mode: str = 'binary'):
88
+ """
89
+ Helper function to get file content from local filesystem or remote URL.
90
+
91
+ Args:
92
+ file_name: The file name (without 'files/' prefix)
93
+ mode: 'binary' for bytes, 'text' for string
94
+
95
+ Returns:
96
+ tuple: (success: bool, data: bytes/str or error_message: str)
97
+
98
+ NOTE — File source for GAIA benchmark question attachments:
99
+ The question files (.png, .mp3, .py, .xlsx, etc.) are NOT served by the
100
+ scoring API at agents-course-unit4-scoring.hf.space. A previous version of
101
+ this code defaulted to that URL, which caused silent 404 failures for any
102
+ question that referenced a file attachment.
103
+
104
+ The correct source is the HuggingFace dataset:
105
+ repo: gaia-benchmark/GAIA (type: dataset)
106
+ path: 2023/validation/<file_name>
107
+
108
+ This function now tries sources in order:
109
+ 1. Local files/ directory (cache)
110
+ 2. HuggingFace dataset download (saves to files/ for future runs)
111
+ 3. SPACE_HOST env var (only when deployed on HF Spaces)
112
+
113
+ To pre-download all question files manually, run:
114
+ python -c "
115
+ import json, os, shutil
116
+ from huggingface_hub import hf_hub_download
117
+ questions = json.load(open('files/questions.json', encoding='utf-8'))
118
+ for q in questions:
119
+ fn = q.get('file_name', '')
120
+ if fn and not os.path.exists(f'files/{fn}'):
121
+ src = hf_hub_download('gaia-benchmark/GAIA', f'2023/validation/{fn}', repo_type='dataset')
122
+ shutil.copy(src, f'files/{fn}')
123
+ print('Downloaded', fn)
124
+ "
125
+ """
126
+ # Sanitize file name first
127
+ is_valid, result = _sanitize_file_path(file_name)
128
+ if not is_valid:
129
+ return False, result
130
+
131
+ file_name = result # Use sanitized name
132
+ file_path = f"files/{file_name}"
133
+
134
+ def _read(path: str):
135
+ if mode == 'binary':
136
+ with open(path, 'rb') as f:
137
+ return True, f.read()
138
+ else:
139
+ with open(path, 'r', encoding='utf-8') as f:
140
+ return True, f.read()
141
+
142
+ # 1. Local cache
143
+ if os.path.exists(file_path):
144
+ try:
145
+ return _read(file_path)
146
+ except Exception as e:
147
+ return False, f"Error reading local file: {e}"
148
+
149
+ # 2. HuggingFace GAIA dataset — downloads and caches locally
150
+ try:
151
+ import shutil
152
+ from huggingface_hub import hf_hub_download
153
+ print(f"[INFO] Downloading {file_name} from HuggingFace GAIA dataset...")
154
+ hf_local = hf_hub_download(
155
+ repo_id='gaia-benchmark/GAIA',
156
+ filename=f'2023/validation/{file_name}',
157
+ repo_type='dataset',
158
+ )
159
+ os.makedirs('files', exist_ok=True)
160
+ shutil.copy(hf_local, file_path)
161
+ print(f"[INFO] Cached to {file_path}")
162
+ return _read(file_path)
163
+ except Exception as e:
164
+ print(f"[WARNING] HuggingFace download failed for {file_name}: {e}")
165
+
166
+ # 3. SPACE_HOST fallback (only when explicitly deployed on a HF Space that serves files)
167
+ space_host = os.getenv("SPACE_HOST")
168
+ if space_host:
169
+ try:
170
+ if not space_host.startswith("http"):
171
+ file_url = f"https://{space_host}/files/{file_name}"
172
+ else:
173
+ file_url = f"{space_host}/files/{file_name}"
174
+ print(f"[INFO] Fetching {file_name} from {file_url}")
175
+ response = requests.get(file_url, timeout=30)
176
+ response.raise_for_status()
177
+ if mode == 'binary':
178
+ return True, response.content
179
+ else:
180
+ return True, response.text
181
+ except Exception as e:
182
+ print(f"[WARNING] SPACE_HOST fetch failed for {file_name}: {e}")
183
+
184
+ return False, f"Could not retrieve file '{file_name}' from any source."
185
+
186
+ def _get_mime_type(file_name: str) -> str:
187
+ """Helper function to determine MIME type from file extension."""
188
+ ext = file_name.lower().split('.')[-1]
189
+ mime_types = {
190
+ 'png': 'image/png',
191
+ 'jpg': 'image/jpeg',
192
+ 'jpeg': 'image/jpeg',
193
+ 'gif': 'image/gif',
194
+ 'webp': 'image/webp',
195
+ 'bmp': 'image/bmp'
196
+ }
197
+ return mime_types.get(ext, 'image/png')
198
+
199
+ # ============================================================================
200
+ # Tools
201
+ # ============================================================================
202
+
203
+ @tool
204
+ def calculate(operation: str, a: float, b: float) -> str:
205
+ """Perform a basic arithmetic operation on two numbers.
206
+
207
+ Args:
208
+ operation (str): One of 'add', 'subtract', 'multiply', 'divide', 'power', 'modulus'.
209
+ a (float): First number.
210
+ b (float): Second number.
211
+ """
212
+ op = (operation or "").strip().lower()
213
+ if op == "add":
214
+ return str(a + b)
215
+ elif op == "subtract":
216
+ return str(a - b)
217
+ elif op == "multiply":
218
+ return str(a * b)
219
+ elif op == "divide":
220
+ if b == 0:
221
+ return "Cannot divide by zero"
222
+ return str(a / b)
223
+ elif op == "power":
224
+ return str(a ** b)
225
+ elif op == "modulus":
226
+ return str(int(a) % int(b)) if int(b) != 0 else "Cannot take modulus by zero"
227
+ else:
228
+ return f"Unsupported operation '{operation}'. Use: add, subtract, multiply, divide, power, modulus."
229
+
230
+ @tool
231
+ def string_reverse(input_string: str) -> str:
232
+ """
233
+ Reverses the input string. Useful whenever a string seems to be non-sensical or
234
+ contains a lot of gibberish. This function can be used to reverse the string
235
+ and check if it makes more sense when reversed.
236
+
237
+ Args:
238
+ input_string (str): The string to reverse.
239
+
240
+ Returns:
241
+ str: The reversed string.
242
+ """
243
+ return input_string[::-1]
244
+
245
+
246
+ @tool
247
+ def websearch(query: str) -> str:
248
+ """This tool will search the web using DuckDuckGo.
249
+
250
+ Args:
251
+ query: The search query.
252
+ """
253
+
254
+ try:
255
+ print(f"websearch called: {query}")
256
+
257
+ # Loop guard: if this exact query (normalized) was already run for the
258
+ # current question, don't re-run it — repeating it returns nothing new.
259
+ # Return the prior results plus a nudge to change strategy or answer.
260
+ norm_query = " ".join(query.lower().split())
261
+ if norm_query in _websearch_seen_queries:
262
+ print("[WEBSEARCH] Duplicate query detected — returning cached result with nudge")
263
+ return (
264
+ "DUPLICATE SEARCH: You already ran this exact query earlier for this question, "
265
+ "so it returns no new information. Do NOT repeat it. Instead: try a substantially "
266
+ "different query, call get_webpage_content on a promising URL from earlier results, "
267
+ "or give your best answer now based on what you already have.\n\n"
268
+ f"Previous results for this query:\n{_websearch_seen_queries[norm_query]}"
269
+ )
270
+
271
+ with DDGS() as ddgs:
272
+ results = ddgs.text(query, max_results=5, timelimit='y') # Limit to past year for faster results
273
+ if results:
274
+ print(f"websearch results: {len(results)}")
275
+ output = "\n\n".join([f"Title: {r['title']}\nURL: {r['href']}\nSnippet: {r['body']}" for r in results])
276
+ else:
277
+ output = "No results found. Try searching with a different query."
278
+ _websearch_seen_queries[norm_query] = output
279
+ return output
280
+ except Exception as e:
281
+ return f"Search error (try again): {str(e)}"
282
+
283
+ @tool
284
+ def wiki_search(query: str) -> str:
285
+ """Search Wikipedia for a query and return at most 3 results.
286
+
287
+ Args:
288
+ query: The search query."""
289
+ try:
290
+ print(f"wiki_search called: {query}")
291
+
292
+ search_docs = WikipediaLoader(query=query, load_max_docs=3).load()
293
+ formatted_search_docs = "\n\n---\n\n".join(
294
+ [
295
+ f'<Document source="{doc.metadata["source"]}" page="{doc.metadata.get("page", "")}">\n{doc.page_content}\n</Document>'
296
+ for doc in search_docs
297
+ ])
298
+ print(f"wiki_results: {len(formatted_search_docs)} characters")
299
+ return formatted_search_docs
300
+ except Exception as e:
301
+ return f"Error performing Wikipedia search: {e}. Try again."
302
+
303
+ @tool
304
+ def arvix_search(query: str) -> str:
305
+ """Search arXiv for a query and return at most 3 results.
306
+
307
+ Args:
308
+ query: The search query."""
309
+ try:
310
+ print(f"arvix_search called: {query}")
311
+
312
+ with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
313
+ future = executor.submit(lambda: ArxivLoader(query=query, load_max_docs=3).load())
314
+ search_docs = future.result(timeout=config.ARXIV_TIMEOUT_SECONDS)
315
+
316
+ formatted_search_docs = "\n\n---\n\n".join(
317
+ [
318
+ f'<Document source="{doc.metadata["source"]}" page="{doc.metadata.get("page", "")}">\n{doc.page_content[:1000]}\n</Document>'
319
+ for doc in search_docs
320
+ ])
321
+
322
+ print(f"arvix_results: {len(formatted_search_docs)} characters")
323
+ return formatted_search_docs
324
+ except concurrent.futures.TimeoutError:
325
+ return f"ArXiv timed out after {config.ARXIV_TIMEOUT_SECONDS}s — try websearch instead"
326
+ except Exception as e:
327
+ return f"Error performing arXiv search: {e}. Try again."
328
+
329
+ @tool
330
+ def youtube_tool(youtube_url: str, question: str = "") -> str:
331
+ """Get the transcript of a YouTube video, or analyze it with AI to answer a question.
332
+
333
+ If question is provided, uses a multimodal AI model to analyze the video (handles visual
334
+ or audio content beyond just transcript). If question is empty, returns the raw transcript.
335
+
336
+ Args:
337
+ youtube_url (str): Full HTTPS URL of the YouTube video.
338
+ question (str): Optional question to answer about the video. If empty, returns raw transcript.
339
+ """
340
+ print(f"youtube_tool called: {youtube_url} question={question!r}")
341
+
342
+ if not question:
343
+ # Transcript-only path — no API key needed
344
+ try:
345
+ video_id = extract.video_id(youtube_url)
346
+ ytt_api = YouTubeTranscriptApi()
347
+ transcript = ytt_api.fetch(video_id)
348
+ txt = '\n'.join([s.text for s in transcript.snippets])
349
+ print(f"youtube_transcript: {len(txt)} characters")
350
+ return txt
351
+ except Exception as e:
352
+ msg = f"youtube_tool (transcript) failed: {e}"
353
+ print(msg)
354
+ return msg
355
+
356
+ # AI analysis path
357
+ try:
358
+ api_key = config.GOOGLE_API_KEY
359
+ if not api_key:
360
+ return "Error: Google API key not set (expected in the GOOGLE_DESKGENIE_KEY environment variable)"
361
+
362
+ client = genai.Client(api_key=api_key)
363
+ response = client.models.generate_content(
364
+ model=config.GEMINI_MODEL,
365
+ contents=[types.Content(
366
+ parts=[
367
+ types.Part(file_data=types.FileData(file_uri=youtube_url)),
368
+ types.Part(text=question)
369
+ ]
370
+ )],
371
+ config=types.GenerateContentConfig(
372
+ temperature=config.GEMINI_TEMPERATURE,
373
+ max_output_tokens=config.GEMINI_MAX_TOKENS,
374
+ )
375
+ )
376
+ return response.text or "(no response from model)"
377
+ except Exception as e:
378
+ error_msg = f"youtube_tool (AI analysis) failed: {str(e)[:config.QUESTION_PREVIEW_LENGTH]}"
379
+ print(error_msg)
380
+ return error_msg
381
+
382
+ @tool
383
+ def get_webpage_content(page_url: str) -> str:
384
+ """Load a web page and return it as markdown if possible
385
+
386
+ Args:
387
+ page_url (str): the URL of web page to get
388
+
389
+ Returns:
390
+ str: The content of the page(s).
391
+ """
392
+
393
+ try:
394
+ print(f"get_webpage_content called with url {page_url}")
395
+ r = requests.get(page_url, timeout=30, headers=_HTTP_HEADERS) # Add 30s timeout
396
+ r.raise_for_status()
397
+ text = ""
398
+ # special case if page is a PDF file
399
+ if 'application/pdf' in r.headers.get('Content-Type', '').lower():
400
+ pdf_file = BytesIO(r.content)
401
+ reader = PdfReader(pdf_file)
402
+ for page in reader.pages:
403
+ text += page.extract_text()
404
+ else:
405
+ soup = BeautifulSoup((r.text), 'html.parser')
406
+ if soup.body:
407
+ # convert to markdown
408
+ text = md(str(soup.body))
409
+ else:
410
+ # return the raw content
411
+ text = r.text
412
+ print(f"webpage_content: {len(text)} characters")
413
+ return text
414
+ except Exception as e:
415
+ return f"get_webpage_content failed: {e}"
416
+
417
+ @tool
418
+ def read_file(file_name: str, sheet_name: str = "") -> str:
419
+ """Read a file from the files directory and return its content.
420
+
421
+ Supported formats:
422
+ - .xlsx / .xls / .csv → returned as a Markdown table
423
+ - .py / .txt / .md / .json / .jsonl → returned as raw text
424
+
425
+ Args:
426
+ file_name (str): Name of the file (e.g., 'data.xlsx'). Do not include 'files/' prefix.
427
+ sheet_name (str): For Excel files, the sheet name to read. Leave empty to read the first sheet.
428
+ """
429
+ print(f"read_file called: {file_name}")
430
+ ext = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
431
+
432
+ if ext in ("xlsx", "xls"):
433
+ success, data = _get_file_content(file_name, mode='binary')
434
+ if not success:
435
+ return f"Error: {data}"
436
+ assert isinstance(data, bytes)
437
+ try:
438
+ df = pd.read_excel(BytesIO(data), sheet_name=sheet_name or 0)
439
+ return df.to_markdown(index=False)
440
+ except Exception as e:
441
+ return f"Error reading Excel file: {e}"
442
+
443
+ if ext == "csv":
444
+ success, data = _get_file_content(file_name, mode='binary')
445
+ if not success:
446
+ return f"Error: {data}"
447
+ assert isinstance(data, bytes)
448
+ try:
449
+ df = pd.read_csv(BytesIO(data))
450
+ return df.to_markdown(index=False)
451
+ except Exception as e:
452
+ return f"Error reading CSV file: {e}"
453
+
454
+ # Text-based formats
455
+ if ext in ("py", "txt", "md", "json", "jsonl", ""):
456
+ success, data = _get_file_content(file_name, mode='text')
457
+ if not success:
458
+ return f"Error: {data}"
459
+ return data
460
+
461
+ return f"Unsupported file type '.{ext}'. Supported: xlsx, xls, csv, py, txt, md, json, jsonl."
462
+
463
+ @tool
464
+ def parse_audio_file(file_name: str) -> str:
465
+ """
466
+ Transcribes audio from an MP3 file into text.
467
+ Use this tool to extract speech/text from audio files.
468
+
469
+ Args:
470
+ file_name (str): The name of the MP3 file (e.g., 'audio.mp3'). Do not include the 'files/' prefix.
471
+
472
+ Returns:
473
+ str: The transcribed text.
474
+ """
475
+
476
+ try:
477
+ print(f"parse_audio_file called: with file {file_name}")
478
+
479
+ # Get file content using helper function
480
+ success, data = _get_file_content(file_name, mode='binary')
481
+ if not success:
482
+ return f"Error: Failed to read audio file. {data}"
483
+
484
+ # Load audio from bytes
485
+ audio = AudioSegment.from_file(io.BytesIO(data), format="mp3")
486
+ # SpeechRecognition works best with WAV data, so we convert to WAV format in memory
487
+ wav_data = io.BytesIO()
488
+ audio.export(wav_data, format="wav")
489
+ wav_data.seek(0) # Rewind the buffer to the beginning
490
+
491
+ # Now we directly process the WAV data
492
+ recognizer = sr.Recognizer()
493
+ with sr.AudioFile(wav_data) as source:
494
+ audio_data = recognizer.record(source)
495
+ text = recognizer.recognize_google(audio_data)
496
+ return text
497
+
498
+ except sr.RequestError as e:
499
+ return f"Error: Could not request results from Google Web Speech API; {e}"
500
+ except Exception as e:
501
+ if "ffmpeg" in str(e).lower() or "avlib" in str(e).lower():
502
+ return f"Error: Failed to process audio. Reason: {e}. Ensure ffmpeg is installed and in your system's PATH."
503
+ return f"Error: Failed to parse the audio file. Reason: {e}"
504
+
505
+ @tool
506
+ def analyze_image(question: str, file_name: str) -> str:
507
+ """
508
+ Analyzes an image file and answers a specific question about it using AI vision.
509
+ Use this tool when you need to understand image content (e.g., chess positions, diagrams, photos).
510
+
511
+ Args:
512
+ question (str): The question you want answered about the image.
513
+ file_name (str): The name of the image file (e.g., 'image.png'). Do not include the 'files/' prefix.
514
+
515
+ Returns:
516
+ str: The answer to the question based on the image analysis.
517
+ """
518
+
519
+ global _analyze_image_call_count
520
+ _analyze_image_call_count += 1
521
+ print(f"analyze_image called: {file_name} with question: {question}")
522
+ if _analyze_image_call_count > MAX_ANALYZE_IMAGE_CALLS:
523
+ return (
524
+ f"ERROR: analyze_image has already been called {_analyze_image_call_count - 1} times. "
525
+ f"MAXIMUM is {MAX_ANALYZE_IMAGE_CALLS}. "
526
+ "Do NOT call analyze_image again. Commit to the chess position already described and use "
527
+ "execute_python with the chess library to find the winning move."
528
+ )
529
+
530
+ try:
531
+ api_key = config.GOOGLE_API_KEY
532
+ if not api_key:
533
+ return "Error: Google API key not set (expected in the GOOGLE_DESKGENIE_KEY environment variable)"
534
+
535
+ # Get file content using helper function
536
+ success, image_data = _get_file_content(file_name, mode='binary')
537
+ if not success:
538
+ return f"Error: Failed to read image file. {image_data}"
539
+
540
+ client = genai.Client(api_key=api_key)
541
+
542
+ # Use Gemini vision model with image data
543
+ response = client.models.generate_content(
544
+ model=config.GEMINI_MODEL,
545
+ contents=[types.Content(
546
+ parts=[
547
+ types.Part(inline_data=types.Blob(
548
+ mime_type=_get_mime_type(file_name),
549
+ data=image_data
550
+ )),
551
+ types.Part(text=question)
552
+ ]
553
+ )],
554
+ config=types.GenerateContentConfig(
555
+ temperature=config.GEMINI_TEMPERATURE,
556
+ max_output_tokens=config.GEMINI_MAX_TOKENS,
557
+ )
558
+ )
559
+ return response.text or "(no response from model)"
560
+
561
+ except Exception as e:
562
+ error_msg = f"Error analyzing image: {str(e)[:config.QUESTION_PREVIEW_LENGTH]}"
563
+ print(error_msg)
564
+ return error_msg
565
+
566
+
567
+ @tool
568
+ def classical_cipher(cipher_type: str, mode: str, text: str, keyword: str = "", period: int = 5) -> str:
569
+ """Encrypt or decrypt common classical ciphers.
570
+
571
+ Supported ciphers: playfair, bifid.
572
+
573
+ Args:
574
+ cipher_type (str): Cipher family: 'playfair' or 'bifid'.
575
+ mode (str): 'encrypt' or 'decrypt'.
576
+ text (str): Input text (letters only; j is mapped to i).
577
+ keyword (str): Key phrase used to build the 5x5 square.
578
+ period (int): Bifid period (ignored for Playfair). Default is 5.
579
+ """
580
+ ctype = (cipher_type or "").strip().lower()
581
+ op = (mode or "").strip().lower()
582
+ if ctype not in {"playfair", "bifid"}:
583
+ return "Unsupported cipher_type. Use 'playfair' or 'bifid'."
584
+ if op not in {"encrypt", "decrypt"}:
585
+ return "Unsupported mode. Use 'encrypt' or 'decrypt'."
586
+ if period <= 0:
587
+ return "Invalid period. Use a positive integer."
588
+
589
+ alphabet = "abcdefghiklmnopqrstuvwxyz"
590
+
591
+ def _normalize(s: str) -> str:
592
+ return re.sub(r"[^a-z]", "", (s or "").lower().replace("j", "i"))
593
+
594
+ def _build_square(key: str):
595
+ seen = []
596
+ for c in _normalize(key) + alphabet:
597
+ if c not in seen:
598
+ seen.append(c)
599
+ sq = [seen[i * 5:(i + 1) * 5] for i in range(5)]
600
+ pos = {c: (r, cidx) for r, row in enumerate(sq) for cidx, c in enumerate(row)}
601
+ inv = {(r, cidx): ch for r, row in enumerate(sq) for cidx, ch in enumerate(row)}
602
+ return sq, pos, inv
603
+
604
+ sq, pos, inv = _build_square(keyword)
605
+ normalized = _normalize(text)
606
+ if not normalized:
607
+ return ""
608
+
609
+ if ctype == "playfair":
610
+ if len(normalized) % 2 != 0:
611
+ normalized = normalized + "x"
612
+ d = -1 if op == "decrypt" else 1
613
+ out = []
614
+ for i in range(0, len(normalized), 2):
615
+ a, b = normalized[i], normalized[i + 1]
616
+ ra, ca = pos[a]
617
+ rb, cb = pos[b]
618
+ if ra == rb:
619
+ out.append(sq[ra][(ca + d) % 5])
620
+ out.append(sq[rb][(cb + d) % 5])
621
+ elif ca == cb:
622
+ out.append(sq[(ra + d) % 5][ca])
623
+ out.append(sq[(rb + d) % 5][cb])
624
+ else:
625
+ out.append(sq[ra][cb])
626
+ out.append(sq[rb][ca])
627
+ return "".join(out)
628
+
629
+ # bifid
630
+ if op == "encrypt":
631
+ out = []
632
+ for i in range(0, len(normalized), period):
633
+ block = normalized[i:i + period]
634
+ rows, cols = [], []
635
+ for ch in block:
636
+ r, c = pos[ch]
637
+ rows.append(r + 1)
638
+ cols.append(c + 1)
639
+ nums = rows + cols
640
+ for j in range(0, len(nums), 2):
641
+ out.append(inv[(nums[j] - 1, nums[j + 1] - 1)])
642
+ return "".join(out)
643
+
644
+ out = []
645
+ for i in range(0, len(normalized), period):
646
+ block = normalized[i:i + period]
647
+ nums = []
648
+ for ch in block:
649
+ r, c = pos[ch]
650
+ nums.extend([r + 1, c + 1])
651
+ half = len(block)
652
+ rows, cols = nums[:half], nums[half:]
653
+ for rr, cc in zip(rows, cols):
654
+ out.append(inv[(rr - 1, cc - 1)])
655
+ return "".join(out)
656
+
657
+
658
+ @tool
659
+ def execute_python(code: str) -> str:
660
+ """Execute a Python code snippet and return its stdout output.
661
+
662
+ Use this for precise computations the LLM cannot do reliably:
663
+ counting characters, implementing algorithms (ciphers, prime sieves),
664
+ math calculations, data transformations, etc.
665
+
666
+ Args:
667
+ code (str): Valid Python 3 code. Use print() to produce output.
668
+ Do not read/write files or make network calls from within the code.
669
+ """
670
+ timeout = 30
671
+ try:
672
+ result = subprocess.run(
673
+ [sys.executable, "-c", code],
674
+ capture_output=True,
675
+ text=True,
676
+ timeout=timeout,
677
+ )
678
+ if result.returncode == 0:
679
+ return result.stdout.strip() or "(no output)"
680
+ return f"Exit {result.returncode}:\n{result.stderr.strip()}"
681
+ except subprocess.TimeoutExpired:
682
+ return f"Execution timed out after {timeout}s"
683
+ except Exception as e:
684
+ return f"execute_python failed: {e}"
685
+
686
+
687
+ @tool
688
+ def http_request(method: str, url: str, headers_json: str = "{}", body_json: str = "{}") -> str:
689
+ """Make an HTTP request with a custom method, headers, and JSON body.
690
+
691
+ Use this for POST, DELETE, or authenticated GET requests that require
692
+ custom headers (e.g. Authorization: Bearer ...) or a request body.
693
+
694
+ Args:
695
+ method (str): HTTP method — 'GET', 'POST', or 'DELETE'.
696
+ url (str): The full URL to call.
697
+ headers_json (str): JSON object of request headers, e.g. '{"Authorization": "Bearer TOKEN"}'.
698
+ body_json (str): JSON object for the request body (POST only). Use '{}' for empty body.
699
+
700
+ Returns:
701
+ str: Response body as text, prefixed with the HTTP status code.
702
+ """
703
+ import json
704
+ method = method.upper()
705
+ try:
706
+ headers = json.loads(headers_json)
707
+ except Exception as e:
708
+ return f"Invalid headers_json: {e}"
709
+ try:
710
+ body = json.loads(body_json)
711
+ except Exception as e:
712
+ return f"Invalid body_json: {e}"
713
+
714
+ try:
715
+ if method == "GET":
716
+ r = requests.get(url, headers=headers, timeout=30)
717
+ elif method == "POST":
718
+ r = requests.post(url, headers=headers, json=body, timeout=30)
719
+ elif method == "DELETE":
720
+ r = requests.delete(url, headers=headers, timeout=30)
721
+ else:
722
+ return f"Unsupported method '{method}'. Use GET, POST, or DELETE."
723
+ try:
724
+ content = json.dumps(r.json(), ensure_ascii=False)
725
+ except ValueError:
726
+ content = r.text
727
+ return f"HTTP {r.status_code}\n{content}"
728
+ except Exception as e:
729
+ return f"http_request failed ({method} {url}): {e}"
730
+
731
+
732
+ @tool
733
+ def download_file(url: str, file_name: str) -> str:
734
+ """Download a binary file from a URL and save it to the files directory.
735
+
736
+ Use this before calling read_file, parse_audio_file,
737
+ or analyze_image on files fetched from an API.
738
+ After downloading, call the appropriate tool with the same file_name.
739
+
740
+ Args:
741
+ url (str): The full URL of the file to download.
742
+ file_name (str): Local file name to save as (e.g. 'data.xlsx', 'audio.mp3').
743
+ Must not contain path separators or '..'.
744
+ """
745
+ if "/" in file_name or "\\" in file_name or ".." in file_name:
746
+ return "Invalid file_name: path separators and '..' are not allowed."
747
+
748
+ try:
749
+ r = requests.get(url, timeout=60, headers=_HTTP_HEADERS)
750
+ r.raise_for_status()
751
+ except Exception as e:
752
+ return f"download_file failed (fetch): {e}"
753
+
754
+ os.makedirs(config.FILES_DIR, exist_ok=True)
755
+ dest = os.path.join(config.FILES_DIR, file_name)
756
+ try:
757
+ with open(dest, "wb") as f:
758
+ f.write(r.content)
759
+ return f"Downloaded {len(r.content)} bytes → {dest}"
760
+ except Exception as e:
761
+ return f"download_file failed (write): {e}"
762
+
763
+
764
+ @tool
765
+ def ask_advisor(question: str) -> str:
766
+ """Consult a more powerful AI model when you are stuck or uncertain after 2+ failed attempts.
767
+
768
+ Describe what you are trying to solve and what you have already tried.
769
+ The advisor returns a concise recommendation (2-3 sentences) to guide your next step.
770
+ Use sparingly — only for genuinely hard reasoning or planning problems, not for tool failures.
771
+
772
+ Args:
773
+ question (str): A clear description of the problem and what approaches you have already tried.
774
+ """
775
+ try:
776
+ api_key = config.GOOGLE_API_KEY
777
+ if not api_key:
778
+ return "Error: GOOGLE_API_KEY not configured"
779
+ client = genai.Client(api_key=api_key)
780
+ response = client.models.generate_content(
781
+ model=config.GEMINI_MODEL,
782
+ contents=question,
783
+ config=types.GenerateContentConfig(
784
+ system_instruction=(
785
+ "You are an expert advisor for an AI agent that is stuck on a search or reasoning problem. "
786
+ "Give a concise, actionable recommendation in 2-3 sentences about what to search for or how to reason. "
787
+ "Do NOT suggest installing Python packages or software. "
788
+ "Do NOT suggest writing code. "
789
+ "Only give search strategy or reasoning guidance."
790
+ ),
791
+ temperature=0,
792
+ )
793
+ )
794
+ return response.text or "Advisor returned no response."
795
+ except Exception as e:
796
+ return f"Advisor unavailable: {e}"
797
+
798
+
799
+ # ============================================================================
800
+ # Tools List
801
+ # ============================================================================
802
+
803
+
804
+ def get_custom_tools_list() -> list:
805
+ """Get list of all custom tools for the agent.
806
+
807
+ Returns:
808
+ list: List of tool functions
809
+ """
810
+ tools = [
811
+ calculate,
812
+ string_reverse,
813
+ websearch,
814
+ wiki_search,
815
+ arvix_search,
816
+ youtube_tool,
817
+ get_webpage_content,
818
+ read_file,
819
+ parse_audio_file,
820
+ analyze_image,
821
+ classical_cipher,
822
+ execute_python,
823
+ ask_advisor,
824
+ http_request,
825
+ download_file,
826
+ ]
827
+ return tools
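The Bifid branch of `classical_cipher` above can be checked in isolation. What follows is a minimal standalone sketch, not the repository's code: the names `normalize`, `build_square`, and `bifid` are hypothetical, and the logic only mirrors the steps visible in the diff (per block, write all row coordinates followed by all column coordinates and re-pair them to encrypt; flatten the ciphertext coordinates and split them in half to decrypt).

```python
import re

ALPHABET = "abcdefghiklmnopqrstuvwxyz"  # 25 letters; j is folded into i


def normalize(s):
    """Lowercase, map j to i, and drop everything that is not a-z."""
    return re.sub(r"[^a-z]", "", s.lower().replace("j", "i"))


def build_square(keyword):
    """Return (letter -> (row, col), (row, col) -> letter) for the keyed 5x5 square."""
    seen = []
    for c in normalize(keyword) + ALPHABET:
        if c not in seen:
            seen.append(c)
    pos = {c: divmod(i, 5) for i, c in enumerate(seen)}
    inv = {rc: c for c, rc in pos.items()}
    return pos, inv


def bifid(text, keyword, period=5, decrypt=False):
    """Hypothetical helper: Bifid over blocks of `period` letters."""
    pos, inv = build_square(keyword)
    text = normalize(text)
    out = []
    for i in range(0, len(text), period):
        block = text[i:i + period]
        n = len(block)
        if decrypt:
            # Flatten ciphertext coordinates, then split into rows and columns.
            flat = [x for ch in block for x in pos[ch]]
            pairs = zip(flat[:n], flat[n:])
        else:
            # All rows, then all columns; consecutive pairs give the ciphertext.
            flat = [pos[ch][0] for ch in block] + [pos[ch][1] for ch in block]
            pairs = zip(flat[0::2], flat[1::2])
        out.extend(inv[p] for p in pairs)
    return "".join(out)


if __name__ == "__main__":
    secret = bifid("defend the east wall", "keyword")
    print(secret, bifid(secret, "keyword", decrypt=True))
```

With an empty keyword the square is the plain 25-letter alphabet, so `bifid("ag", "")` gives `"bb"`, and decrypting `"bb"` gives back `"ag"`. A round trip only recovers the normalized text, so spaces are lost and any `j` comes back as `i`.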
files/metadata.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
files/questions.json ADDED
@@ -0,0 +1 @@
1
+ [{"task_id":"8e867cd7-cff9-4e6c-867a-ff5ddc2550be","question":"How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.","Level":"1","file_name":""},{"task_id":"a1e91b78-d3d8-4675-bb8d-62741b4b68a6","question":"In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?","Level":"1","file_name":""},{"task_id":"2d83110e-a098-4ebb-9987-066c06fa42d0","question":".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI","Level":"1","file_name":""},{"task_id":"cca530fc-4052-43b2-b130-b30968d8aa44","question":"Review the chess position provided in the image. It is black's turn. Provide the correct next move for black which guarantees a win. Please provide your response in algebraic notation.","Level":"1","file_name":"cca530fc-4052-43b2-b130-b30968d8aa44.png"},{"task_id":"4fc2f1ae-8625-45b5-ab34-ad4433bc21f8","question":"Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in November 2016?","Level":"1","file_name":""},{"task_id":"6f37996b-2ac7-44b0-8e68-6d28256631b4","question":"Given this table defining * on the set S = {a, b, c, d, e}\n\n|*|a|b|c|d|e|\n|---|---|---|---|---|---|\n|a|a|b|c|b|d|\n|b|b|c|a|e|c|\n|c|c|a|b|b|a|\n|d|b|e|b|e|d|\n|e|d|b|a|d|c|\n\nprovide the subset of S involved in any possible counter-examples that prove * is not commutative. 
Provide your answer as a comma separated list of the elements in the set in alphabetical order.","Level":"1","file_name":""},{"task_id":"9d191bce-651d-4746-be2d-7ef8ecadb9c2","question":"Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec.\n\nWhat does Teal'c say in response to the question \"Isn't that hot?\"","Level":"1","file_name":""},{"task_id":"cabe07ed-9eca-40ea-8ead-410ef5e83f91","question":"What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?","Level":"1","file_name":""},{"task_id":"3cef3a44-215e-4aed-8e3b-b1e3f08063b7","question":"I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler when it comes to categorizing things. I need to add different foods to different categories on the grocery list, but if I make a mistake, she won't buy anything inserted in the wrong category. Here's the list I have so far:\n\nmilk, eggs, flour, whole bean coffee, Oreos, sweet potatoes, fresh basil, plums, green beans, rice, corn, bell pepper, whole allspice, acorns, broccoli, celery, zucchini, lettuce, peanuts\n\nI need to make headings for the fruits and vegetables. Could you please create a list of just the vegetables from my list? If you could do that, then I can figure out how to categorize the rest of the list into the appropriate categories. But remember that my mom is a real stickler, so make sure that no botanical fruits end up on the vegetable list, or she won't get them when she's at the store. Please alphabetize the list of vegetables, and place each item in a comma separated list.","Level":"1","file_name":""},{"task_id":"99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3","question":"Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need for the crust, but I'm not sure about the filling. 
I got the recipe from my friend Aditi, but she left it as a voice memo and the speaker on my phone is buzzing so I can't quite make out what she's saying. Could you please listen to the recipe and list all of the ingredients that my friend described? I only want the ingredients for the filling, as I have everything I need to make my favorite pie crust. I've attached the recipe as Strawberry pie.mp3.\n\nIn your response, please only list the ingredients, not any measurements. So if the recipe calls for \"a pinch of salt\" or \"two cups of ripe strawberries\" the ingredients on the list would be \"salt\" and \"ripe strawberries\".\n\nPlease format your response as a comma separated list of ingredients. Also, please alphabetize the ingredients.","Level":"1","file_name":"99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3.mp3"},{"task_id":"305ac316-eef6-4446-960a-92d80d542f82","question":"Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play in Magda M.? Give only the first name.","Level":"1","file_name":""},{"task_id":"f918266a-b3e0-4914-865d-4faa564f1aef","question":"What is the final numeric output from the attached Python code?","Level":"1","file_name":"f918266a-b3e0-4914-865d-4faa564f1aef.py"},{"task_id":"3f57289b-8c60-48be-bd80-01f8099ca449","question":"How many at bats did the Yankee with the most walks in the 1977 regular season have that same season?","Level":"1","file_name":""},{"task_id":"1f975693-876d-457b-a649-393859e79bf3","question":"Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study for my Calculus mid-term next week. My friend from class sent me an audio recording of Professor Willowbrook giving out the recommended reading for the test, but my headphones are broken :(\n\nCould you please listen to the recording for me and tell me the page numbers I'm supposed to go over? I've attached a file called Homework.mp3 that has the recording. 
Please provide just the page numbers as a comma-delimited list. And please provide the list in ascending order.","Level":"1","file_name":"1f975693-876d-457b-a649-393859e79bf3.mp3"},{"task_id":"840bfca7-4f7b-481a-8794-c560c340185d","question":"On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?","Level":"1","file_name":""},{"task_id":"bda648d7-d618-4883-88f4-3466eabd860e","question":"Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually deposited? Just give me the city name without abbreviations.","Level":"1","file_name":""},{"task_id":"cf106601-ab4f-4af9-b045-5295fe67b37d","question":"What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a number of athletes, return the first in alphabetical order. Give the IOC country code as your answer.","Level":"1","file_name":""},{"task_id":"a0c07678-e491-4bbc-8f0b-07405144218f","question":"Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give them to me in the form Pitcher Before, Pitcher After, use their last names only, in Roman characters.","Level":"1","file_name":""},{"task_id":"7bd855d8-463d-4ed5-93ca-5fe35145f733","question":"The attached Excel file contains the sales of menu items for a local fast-food chain. What were the total sales that the chain made from food (not including drinks)? 
Express your answer in USD with two decimal places.","Level":"1","file_name":"7bd855d8-463d-4ed5-93ca-5fe35145f733.xlsx"},{"task_id":"5a0c1adf-205e-4841-a666-7c3ef95def9d","question":"What is the first name of the only Malko Competition recipient from the 20th Century (after 1977) whose nationality on record is a country that no longer exists?","Level":"1","file_name":""}]
gradioapp.py ADDED
@@ -0,0 +1,126 @@
1
+ import gradio as gr
2
+ import config
3
+
4
+ # --- Build Gradio Interface without Blocks Context ---
5
+
6
+ run_and_submit_all_callback = None # Placeholder for the actual function
7
+
8
+ def _run_and_submit_all_local(profile: gr.OAuthProfile | None = None, active_agent: str = None):
9
+ """Run and submit with specified agent type."""
10
+ username = None
11
+
12
+ if profile is not None:
13
+ username = f"{profile.username}"
14
+ print(f"User logged in: {username}")
15
+ else:
16
+ print("User not logged in.")
17
+ return "Please Login to Hugging Face with the button.", None
18
+
19
+ return run_and_submit_all_callback(username, active_agent)
20
+
21
+ def _run_and_submit_langgraph(profile: gr.OAuthProfile | None = None):
22
+ """Run and submit with LangGraph agent."""
23
+ return _run_and_submit_all_local(profile, active_agent=config.AGENT_LANGGRAPH)
24
+
25
+ def _run_and_submit_react(profile: gr.OAuthProfile | None = None):
26
+ """Run and submit with ReActLangGraph agent."""
27
+ return _run_and_submit_all_local(profile, active_agent=config.AGENT_REACT_LANGGRAPH)
28
+
29
+ def _run_and_submit_llamaindex(profile: gr.OAuthProfile | None = None):
30
+ """Run and submit with LlamaIndex agent."""
31
+ return _run_and_submit_all_local(profile, active_agent=config.AGENT_LLAMAINDEX)
32
+
33
+
34
+ def _parse_filter_indices(filter_text: str):
35
+ """Parse comma-separated filter indices from text input.
36
+
37
+ Args:
38
+ filter_text: Comma-separated indices (e.g., "4, 7, 15") or empty for all questions
39
+
40
+ Returns:
41
+ tuple of indices or None if empty/invalid
42
+ """
43
+ if not filter_text or not filter_text.strip():
44
+ return None # Run all questions
45
+
46
+ try:
47
+ indices = tuple(int(idx.strip()) for idx in filter_text.split(',') if idx.strip())
48
+ return indices if indices else None
49
+ except ValueError:
50
+ return None # Invalid input, run all questions
51
+
52
+
53
+ def create_ui(run_and_submit_all, run_test_code):
54
+ """Create the Main App with custom layout to include LoginButton"""
55
+
56
+ global run_and_submit_all_callback
57
+ run_and_submit_all_callback = run_and_submit_all
58
+
59
+ def _run_test_with_filter(filter_text: str):
60
+ """Wrapper to run test code with parsed filter indices."""
61
+ filter_indices = _parse_filter_indices(filter_text)
62
+ return run_test_code(filter=filter_indices)
63
+
64
+ # --- Build Gradio Interface using Blocks ---
65
+ with gr.Blocks() as demoApp:
66
+ gr.Markdown("# Basic Agent Evaluation Runner")
67
+ gr.Markdown(
68
+ """
69
+ **Instructions:**
70
+ 1. Please clone this space, then modify the code to define your agent's logic, the tools, the necessary packages, and so on.
71
+ 2. Log in to your Hugging Face account using the button below. This uses your HF username for submission.
72
+ 3. Click 'Run Evaluation & Submit All Answers' to fetch questions, run your agent, submit answers, and see the score.
73
+ ---
74
+ **Disclaimers:**
75
+ After you click the submit button, it can take quite some time (this is the time the agent needs to go through all the questions).
76
+ This space provides a basic setup and is intentionally sub-optimal to encourage you to develop your own, more robust solution. For instance, to shorten the wait after clicking the submit button, you could cache the answers and submit them in a separate action, or answer the questions asynchronously.
77
+ """
78
+ )
79
+
80
+ gr.LoginButton()
81
+
82
+ gr.Markdown("### Run Evaluation with Different Agents")
83
+
84
+ with gr.Row():
85
+ run_button_langgraph = gr.Button("Run with LangGraph Agent", variant="primary")
86
+ run_button_react = gr.Button("Run with ReAct Agent", variant="secondary")
87
+ run_button_llamaindex = gr.Button("Run with LlamaIndex Agent", variant="secondary")
88
+
89
+ status_output = gr.Textbox(label="Run Status / Submission Result", lines=5, interactive=False)
90
+ # Removed max_rows=10 from DataFrame constructor
91
+ results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
92
+
93
+ run_button_langgraph.click(
94
+ fn=_run_and_submit_langgraph,
95
+ outputs=[status_output, results_table]
96
+ )
97
+
98
+ run_button_react.click(
99
+ fn=_run_and_submit_react,
100
+ outputs=[status_output, results_table]
101
+ )
102
+
103
+ run_button_llamaindex.click(
104
+ fn=_run_and_submit_llamaindex,
105
+ outputs=[status_output, results_table]
106
+ )
107
+
108
+ gr.Markdown("---")
109
+ gr.Markdown("### Test Mode")
110
+ gr.Markdown("Run agent on specific questions for testing. Leave empty to run all questions.")
111
+
112
+ test_filter_input = gr.Textbox(
113
+ label="Question Indices (comma-separated)",
114
+ placeholder="e.g., 4, 7, 15 (leave empty for all questions)",
115
+ value="",
116
+ interactive=True
117
+ )
118
+ test_button = gr.Button("Run Test Examples")
119
+ test_results_table = gr.DataFrame(label="Test Answers from Agent", wrap=True)
120
+ test_button.click(
121
+ fn=_run_test_with_filter,
122
+ inputs=[test_filter_input],
123
+ outputs=[test_results_table]
124
+ )
125
+
126
+ return demoApp
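The test-mode filter in `gradioapp.py` falls back to running every question whenever the input is empty or malformed, which is easy to miss when reading the UI code. The sketch below restates that parsing rule as a standalone function so the edge cases can be tried directly. The public name `parse_filter_indices` is hypothetical; it is assumed to behave like `_parse_filter_indices` above.

```python
def parse_filter_indices(filter_text):
    """Hypothetical mirror of _parse_filter_indices: tuple of ints, or None for 'all'."""
    if not filter_text or not filter_text.strip():
        return None  # empty input: run all questions
    try:
        # Skip empty fragments so trailing or doubled commas are tolerated.
        indices = tuple(
            int(part.strip()) for part in filter_text.split(",") if part.strip()
        )
    except ValueError:
        return None  # any non-integer fragment: run all questions
    return indices or None


if __name__ == "__main__":
    for sample in ["4, 7, 15", "4,,7,", "", " , ", "4, x"]:
        print(repr(sample), "->", parse_filter_indices(sample))
```

Because a typo such as `4, x` returns `None` rather than raising, a mistyped filter triggers a full run. Surfacing a validation message in the UI would be one way to avoid that, at the cost of a slightly more complex callback.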
langgraphagent.py ADDED
@@ -0,0 +1,348 @@
1
+ import os
2
+ import logging
3
+ import warnings
4
+ import re
5
+ import time
6
+
7
+ # Suppress TensorFlow/Keras warnings
8
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
9
+ logging.getLogger('tensorflow').setLevel(logging.ERROR)
10
+ warnings.filterwarnings('ignore', module='tensorflow')
11
+ warnings.filterwarnings('ignore', module='tf_keras')
12
+
13
+ from typing import TypedDict, Optional, List, Annotated
14
+ from langchain_core.messages import HumanMessage, SystemMessage
15
+ from langgraph.graph import MessagesState, StateGraph, START, END
16
+ from langgraph.graph.message import add_messages
17
+ from langgraph.prebuilt import tools_condition
18
+ from langgraph.prebuilt import ToolNode
19
+ from langchain_google_genai import ChatGoogleGenerativeAI
20
+ from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
21
+
22
+ from custom_tools import get_custom_tools_list, reset_tool_counters
23
+ from system_prompt import SYSTEM_PROMPT
24
+ from utils import cleanup_answer, extract_text_from_content
25
+ import config
26
+
27
+ # Suppress BeautifulSoup GuessedAtParserWarning
28
+ try:
29
+ from bs4 import GuessedAtParserWarning
30
+ warnings.filterwarnings('ignore', category=GuessedAtParserWarning)
31
+ except ImportError:
32
+ pass
33
+
34
+
35
+ class AgentState(TypedDict):
36
+ question: str
37
+ messages: Annotated[list , add_messages] # for LangGraph
38
+ answer: str
39
+ step_count: int # Track number of iterations to prevent infinite loops
40
+ file_name: str # Optional file name for questions that reference files
41
+
42
+
43
+ class LangGraphAgent:
44
+
45
+ def __init__(self):
46
+ # Validate API keys
47
+ if not config.GOOGLE_API_KEY:
48
+ print("WARNING: GOOGLE_API_KEY not found - analyze_youtube_video will fail")
49
+
50
+ self.tools = get_custom_tools_list()
51
+ self.llm_client_with_tools = self._create_llm_client()
52
+ self.graph = self._build_graph()
53
+
54
+ def _create_llm_client(self, model_provider: str = "google"):
55
+ """Create and return the LLM client with tools bound based on the model provider."""
56
+
57
+ if model_provider == "google":
58
+ apikey = config.GOOGLE_API_KEY
59
+
60
+ return ChatGoogleGenerativeAI(
61
+ model=config.ACTIVE_AGENT_LLM_MODEL,
62
+ temperature=0,
63
+ api_key=apikey,
64
+ thinking_budget=0,
65
+ timeout=120
66
+ ).bind_tools(self.tools)
67
+
68
+ elif model_provider == "huggingface":
69
+ LLM_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
70
+ apikey = os.getenv("HUGGINGFACEHUB_API_TOKEN")
71
+
72
+ llmObject = HuggingFaceEndpoint(
73
+ repo_id=LLM_MODEL,
74
+ task="text-generation",
75
+ max_new_tokens=512,
76
+ temperature=0.7,
77
+ do_sample=False,
78
+ repetition_penalty=1.03,
79
+ huggingfacehub_api_token=apikey
80
+ )
81
+ return ChatHuggingFace(llm=llmObject).bind_tools(self.tools)
82
+
83
+ # Nodes
84
+ def _init_questions(self, state: AgentState):
85
+ """Initialize the messages in the state with system prompt and user question."""
86
+
87
+ # Reset per-question tool counters (e.g., analyze_image call limit)
88
+ reset_tool_counters()
89
+
90
+ # Build the question message, including file name if available
91
+ question_content = state["question"]
92
+ if state.get("file_name"):
93
+ question_content += f'\n\nNote: This question references a file: {state["file_name"]}'
94
+
95
+ return {
96
+ "messages": [
97
+ SystemMessage(content=SYSTEM_PROMPT),
98
+ HumanMessage(content=question_content)
99
+ ],
100
+ "step_count": 0 # Initialize step counter
101
+ }
102
+
103
+ def _assistant(self, state: AgentState):
104
+ """Assistant node which calls the LLM with tools"""
105
+
106
+ # Track and log current step
107
+ current_step = state.get("step_count", 0) + 1
108
+ print(f"[STEP {current_step}] Calling assistant with {len(state['messages'])} messages")
109
+
110
+ # Force termination at step limit — _should_continue cannot persist state changes
111
+ # so we detect the near-limit here and force a final LLM call without tool binding
112
+ if current_step >= config.AGENT_STEP_LIMIT - 1: # force a final bare-answer call one step before the hard limit
113
+ existing = state.get("answer")
114
+ if existing:
115
+ return {"messages": [], "answer": existing, "step_count": current_step}
116
+ print(f"[WARNING] Near step limit at step {current_step} with no answer — forcing bare LLM call")
117
+ from langchain_core.messages import SystemMessage as SM
118
+ forced_suffix = SM(content="STOP ALL TOOL CALLS. Based only on information gathered so far, output ONLY the bare answer value — one word, number, or short phrase. No explanation.")
119
+
120
+ def _extract_content(resp_content):
121
+ if not resp_content:
122
+ return ""
123
+ if isinstance(resp_content, str):
124
+ return resp_content.strip()
125
+ if isinstance(resp_content, list):
126
+ parts = [item['text'] if isinstance(item, dict) and 'text' in item else str(item) for item in resp_content]
127
+ return " ".join(parts).strip()
128
+ return str(resp_content).strip()
129
+
130
+ llm_client = self.llm_client_with_tools
131
+ if llm_client is None:
132
+ return {"messages": [], "answer": "Error: Step limit reached", "step_count": current_step}
133
+
134
+ # Attempt 1: full context
135
+ try:
136
+ forced_messages = list(state["messages"]) + [forced_suffix]
137
+ forced_resp = llm_client.invoke(forced_messages)
138
+ content = _extract_content(forced_resp.content)
139
+ if content:
140
+ print(f"[FORCED FINAL] {content[:100]}")
141
+ return {"messages": [], "answer": content, "step_count": current_step}
142
+ print("[FORCED FINAL] Empty content on attempt 1, retrying with reduced context")
143
+ except Exception as fe:
144
+ print(f"[WARNING] Forced final call attempt 1 failed: {fe}")
145
+
146
+ # Attempt 2: reduced context (first 2 messages + last 10 messages) to avoid token overload
147
+ try:
148
+ msgs = state["messages"]
149
+ reduced = msgs[:2] + (msgs[-10:] if len(msgs) > 12 else msgs[2:])
150
+ reduced_messages = reduced + [forced_suffix]
151
+ forced_resp2 = llm_client.invoke(reduced_messages)
152
+ content2 = _extract_content(forced_resp2.content)
153
+ if content2:
154
+ print(f"[FORCED FINAL REDUCED] {content2[:100]}")
155
+ return {"messages": [], "answer": content2, "step_count": current_step}
156
+ print("[FORCED FINAL] Empty content on attempt 2 as well")
157
+ except Exception as fe2:
158
+ print(f"[WARNING] Forced final call attempt 2 failed: {fe2}")
159
+
160
+ return {"messages": [], "answer": "Error: Step limit reached", "step_count": current_step}
161
+
162
+ # Invoke LLM with tools enabled, with retry logic for 504 errors
163
+ max_retries = config.MAX_RETRIES
164
+ delay = config.INITIAL_RETRY_DELAY
165
+
166
+ for attempt in range(max_retries + 1):
167
+ try:
168
+ response = self.llm_client_with_tools.invoke(state["messages"])
169
+ # Success - break out of retry loop
170
+ break
171
+ except Exception as e:
172
+ error_msg = str(e)
173
+
174
+ # Check if this is a 504 DEADLINE_EXCEEDED error
175
+ if "504" in error_msg and "DEADLINE_EXCEEDED" in error_msg:
176
+ if attempt < max_retries:
177
+ print(f"[RETRY] Attempt {attempt + 1}/{max_retries} failed with 504 DEADLINE_EXCEEDED")
178
+ print(f"[RETRY] Retrying in {delay:.1f} seconds...")
179
+ time.sleep(delay)
180
+ delay *= config.RETRY_BACKOFF_FACTOR
181
+ continue
182
+ else:
183
+ print(f"[RETRY] All {max_retries} retries exhausted for 504 error")
184
+ print(f"[ERROR] LLM invocation failed after retries: {e}")
185
+ return {
186
+ "messages": [],
187
+ "answer": f"Error: LLM failed after {max_retries} retries - {str(e)[:100]}",
188
+ "step_count": current_step
189
+ }
190
+ else:
191
+ # Not a 504 error - fail immediately without retry
192
+ print(f"[ERROR] LLM invocation failed: {e}")
193
+ return {
194
+ "messages": [],
195
+ "answer": f"Error: LLM failed - {str(e)[:100]}",
196
+ "step_count": current_step
197
+ }
198
+
199
+ # If no tool calls, set the final answer
200
+ if not response.tool_calls:
201
+ content = response.content
202
+ print(f"[FINAL ANSWER] Agent produced answer (no tool calls)")
203
+
204
+ # Handle case where content is a list (e.g. mixed content from Gemini)
205
+ if isinstance(content, list):
206
+ # Extract text from list of content parts
207
+ text_parts = []
208
+ for item in content:
209
+ if isinstance(item, dict) and 'text' in item:
210
+ text_parts.append(item['text'])
211
+ elif hasattr(item, 'text'):
212
+ text_parts.append(item.text)
213
+ else:
214
+ text_parts.append(str(item))
215
+ content = " ".join(text_parts)
216
+ elif isinstance(content, dict) and 'text' in content:
217
+ # Handle single dict with 'text' field
218
+ content = content['text']
219
+ elif hasattr(content, 'text'):
220
+ # Handle object with text attribute
221
+ content = content.text
222
+ else:
223
+ # Fallback to string conversion
224
+ content = str(content)
225
+
226
+ # Clean up any remaining noise
227
+ content = content.strip()
228
+ print(f"[EXTRACTED TEXT] {content[:100]}{'...' if len(content) > 100 else ''}")
229
+
230
+ # If content is empty (transient Gemini API issue), retry up to 3 times
231
+ retry_num = 0
232
+ while not content and retry_num < 3:
233
+ retry_num += 1
234
+ print(f"[WARNING] Empty response from LLM at step {current_step} — retry {retry_num}/3")
235
+ try:
236
+ # Reuse the module-level time import instead of a local alias
237
+ time.sleep(retry_num * 2) # back off: 2s, 4s, 6s
238
+ retry_resp = self.llm_client_with_tools.invoke(state["messages"]) # type: ignore[union-attr]
239
+ retry_content = retry_resp.content
240
+ if isinstance(retry_content, str):
241
+ content = retry_content.strip()
242
+ elif isinstance(retry_content, list):
243
+ parts = [item['text'] if isinstance(item, dict) and 'text' in item else str(item) for item in retry_content]
244
+ content = " ".join(parts).strip()
245
+ if content:
246
+ print(f"[RETRY SUCCESS] Got content on retry {retry_num}: {content[:80]}")
247
+ except Exception as re_err:
248
+ print(f"[WARNING] Retry {retry_num} failed: {re_err}")
249
+
250
+ return {
251
+ "messages": [response],
252
+ "answer": content,
253
+ "step_count": current_step
254
+ }
255
+
256
+ # Has tool calls, log them
257
+ print(f"[TOOL CALLS] Agent requesting {len(response.tool_calls)} tool(s):")
258
+ for tc in response.tool_calls:
259
+ print(f" - {tc['name']}")
260
+
261
+ return {
262
+ "messages": [response],
263
+ "step_count": current_step
264
+ }
265
+
266
+
267
+ def _should_continue(self, state: AgentState):
268
+ """Check if we should continue or stop based on step count and other conditions."""
269
+
270
+ step_count = state.get("step_count", 0)
271
+
272
+ # Stop if we've exceeded maximum steps
273
+ if step_count >= config.AGENT_STEP_LIMIT: # Backstop; recursion_limit is derived to exceed 2x this
274
+ print(f"[WARNING] Max steps ({config.AGENT_STEP_LIMIT}) reached, forcing termination")
275
+ # A routing function cannot persist state changes, so no answer is set here:
276
+ # _assistant forces a final answer one step before this limit, and __call__
277
+ # returns "Error: No answer generated" if the answer is still empty.
278
+ return END
279
+
280
+ # Otherwise use the default tools_condition
281
+ return tools_condition(state)
282
+
283
+
284
+ def _build_graph(self):
285
+ """Build and return the Compiled Graph for the agent."""
286
+
287
+ graph = StateGraph(AgentState)
288
+
289
+ # Build graph
290
+ graph.add_node("init", self._init_questions)
291
+ graph.add_node("assistant", self._assistant)
292
+ graph.add_node("tools", ToolNode(self.tools))
293
+ graph.add_edge(START, "init")
294
+ graph.add_edge("init", "assistant")
295
+ graph.add_conditional_edges(
296
+ "assistant",
297
+ # Use custom should_continue instead of tools_condition
298
+ self._should_continue,
299
+ )
300
+ graph.add_edge("tools", "assistant")
301
+ # Compile graph
302
+ return graph.compile()
303
+
304
+ def __call__(self, question: str, file_name: str = None) -> str:
305
+ """Invoke the agent graph with the given question and return the final answer.
306
+
307
+ Args:
308
+ question: The question to answer
309
+ file_name: Optional file name if the question references a file
310
+ """
311
+
312
+ print(f"\n{'='*60}")
313
+ print(f"[LANGGRAPH AGENT START] Question: {question}")
314
+ if file_name:
315
+ print(f"[FILE] {file_name}")
316
+ print(f"{'='*60}")
317
+
318
+ start_time = time.time()
319
+
320
+ try:
321
+ response = self.graph.invoke(
322
+ {"question": question, "messages": [], "answer": None, "step_count": 0, "file_name": file_name or ""},
323
+ config={"recursion_limit": config.AGENT_RECURSION_LIMIT} # Derived in config: > 2x step limit
324
+ )
325
+
326
+ elapsed_time = time.time() - start_time
327
+ print(f"[LANGGRAPH AGENT COMPLETE] Time: {elapsed_time:.2f}s")
328
+ print(f"{'='*60}\n")
329
+
330
+ answer = response.get("answer")
331
+ if not answer or answer is None:
332
+ print("[WARNING] Agent completed but returned None as answer")
333
+ return "Error: No answer generated"
334
+
335
+ # Use utility function to extract text from various content formats
336
+ answer = extract_text_from_content(answer)
337
+
338
+ # Clean up the answer using utility function (includes stripping)
339
+ answer = cleanup_answer(answer)
340
+
341
+ print(f"[FINAL ANSWER] {answer}")
342
+ return answer
343
+
344
+ except Exception as e:
345
+ elapsed_time = time.time() - start_time
346
+ print(f"[LANGGRAPH AGENT ERROR] Failed after {elapsed_time:.2f}s: {e}")
347
+ print(f"{'='*60}\n")
348
+ return f"Error: {str(e)[:100]}"
llamaindexagent.py ADDED
@@ -0,0 +1,216 @@
+ import os
+ import logging
+ import warnings
+ import time
+ import asyncio
+ import nest_asyncio
+
+ # Apply nest_asyncio to allow nested event loops
+ nest_asyncio.apply()
+
+ # Suppress TensorFlow/Keras warnings
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
+ logging.getLogger('tensorflow').setLevel(logging.ERROR)
+ warnings.filterwarnings('ignore', module='tensorflow')
+ warnings.filterwarnings('ignore', module='tf_keras')
+
+ # Suppress google.generativeai deprecation warning from llama_index
+ warnings.filterwarnings('ignore', message='.*google.generativeai.*deprecated.*', category=FutureWarning)
+ warnings.filterwarnings('ignore', module='google.generativeai')
+
+ # Suppress asyncio selector warnings that occur during event loop cleanup on some platforms
+ warnings.filterwarnings('ignore', message='.*Invalid file descriptor.*')
+ logging.getLogger('asyncio').setLevel(logging.ERROR)
+
+ from llama_index.core.agent import ReActAgent
+ from llama_index.llms.gemini import Gemini
+ from llama_index.core.tools import FunctionTool
+
+ from custom_tools import get_custom_tools_list
+ from system_prompt import SYSTEM_PROMPT
+ from utils import cleanup_answer, extract_text_from_content
+ import config
+
+ # Suppress BeautifulSoup GuessedAtParserWarning
+ try:
+     from bs4 import GuessedAtParserWarning
+     warnings.filterwarnings('ignore', category=GuessedAtParserWarning)
+ except ImportError:
+     pass
+
+
+ class LlamaIndexAgent:
+     """
+     LlamaIndex agent implementation using ReActAgent.
+
+     This agent uses LlamaIndex's ReAct agent pattern which integrates
+     with various LLM providers and tools. It provides an alternative
+     implementation to LangGraph-based agents.
+     """
+
+     def __init__(self):
+         # Validate API keys
+         if not config.GOOGLE_API_KEY:
+             print("WARNING: GOOGLE_API_KEY not found - analyze_youtube_video will fail")
+
+         self.langchain_tools = get_custom_tools_list()
+         self.llm = self._create_llm_client()
+         self.tools = self._convert_tools_to_llamaindex()
+         self.agent = self._build_agent()
+
+     def _create_llm_client(self):
+         """Create and return the LLM client for LlamaIndex."""
+         api_key = config.GOOGLE_API_KEY
+
+         # Create Gemini LLM for LlamaIndex
+         llm = Gemini(
+             model=config.ACTIVE_AGENT_LLM_MODEL,
+             api_key=api_key,
+             temperature=config.GEMINI_TEMPERATURE,
+             max_tokens=config.GEMINI_MAX_TOKENS,
+         )
+
+         return llm
+
+     def _convert_tools_to_llamaindex(self) -> list[FunctionTool]:
+         """Convert LangChain tools to LlamaIndex FunctionTool format."""
+         llamaindex_tools = []
+
+         for langchain_tool in self.langchain_tools:
+             # Extract the function from LangChain tool
+             tool_func = langchain_tool.func if hasattr(langchain_tool, 'func') else langchain_tool
+
+             # Create LlamaIndex FunctionTool
+             llamaindex_tool = FunctionTool.from_defaults(
+                 fn=tool_func,
+                 name=langchain_tool.name,
+                 description=langchain_tool.description,
+             )
+
+             llamaindex_tools.append(llamaindex_tool)
+
+         return llamaindex_tools
+
+     def _build_agent(self) -> ReActAgent:
+         """Build and return the LlamaIndex ReAct agent."""
+
+         # Create ReAct agent with tools and LLM
+         agent = ReActAgent(
+             tools=self.tools,
+             llm=self.llm,
+             verbose=True,
+             max_iterations=40,  # Match the step limit from other agents
+             system_prompt=SYSTEM_PROMPT,
+         )
+
+         return agent
+
+     def __call__(self, question: str, file_name: str = None) -> str:
+         """
+         Invoke the LlamaIndex agent with the given question and return the final answer.
+
+         Args:
+             question: The question to answer
+             file_name: Optional file name if the question references a file
+
+         Returns:
+             The agent's answer as a string
+         """
+         print(f"\n{'='*60}")
+         print(f"[LLAMAINDEX AGENT START] Question: {question}")
+         if file_name:
+             print(f"[FILE] {file_name}")
+         print(f"{'='*60}")
+
+         start_time = time.time()
+
+         try:
+             # Build the question with file name if provided
+             question_content = question
+             if file_name:
+                 question_content += f'\n\nNote: This question references a file: {file_name}'
+
+             # Invoke the agent with retry logic for 504 errors
+             max_retries = config.MAX_RETRIES
+             delay = config.INITIAL_RETRY_DELAY
+
+             for attempt in range(max_retries + 1):
+                 try:
+                     # Create a dedicated async function to run the agent
+                     async def run_agent_async():
+                         # Pass max_iterations as a runtime parameter to the workflow
+                         return await self.agent.run(question_content, max_iterations=40)
+
+                     # Try different approaches to run the async function
+                     try:
+                         # Check if a loop is already running
+                         loop = asyncio.get_running_loop()
+                         # If we reach here, a loop is already running
+                         # Use nest_asyncio's patched loop to run coroutine
+                         response = loop.run_until_complete(run_agent_async())
+                     except RuntimeError:
+                         # No running loop, we can use asyncio.run directly
+                         # But wrap in try-except to suppress cleanup errors
+                         try:
+                             response = asyncio.run(run_agent_async())
+                         except ValueError as ve:
+                             # Suppress "Invalid file descriptor" errors during cleanup
+                             if "Invalid file descriptor" not in str(ve):
+                                 raise
+
+                     # Success - break out of retry loop
+                     break
+                 except Exception as e:
+                     error_msg = str(e)
+
+                     # Check if this is a 504 DEADLINE_EXCEEDED error
+                     if "504" in error_msg and "DEADLINE_EXCEEDED" in error_msg:
+                         if attempt < max_retries:
+                             print(f"[RETRY] Attempt {attempt + 1}/{max_retries} failed with 504 DEADLINE_EXCEEDED")
+                             print(f"[RETRY] Retrying in {delay:.1f} seconds...")
+                             time.sleep(delay)
+                             delay *= config.RETRY_BACKOFF_FACTOR
+                             continue
+                         else:
+                             print(f"[RETRY] All {max_retries} retries exhausted for 504 error")
+                             print(f"[ERROR] Agent invocation failed after retries: {e}")
+                             return f"Error: Agent failed after {max_retries} retries - {str(e)[:100]}"
+                     else:
+                         # Not a 504 error - fail immediately without retry
+                         print(f"[ERROR] Agent invocation failed: {e}")
+                         return f"Error: Agent failed - {str(e)[:100]}"
+
+             elapsed_time = time.time() - start_time
+             print(f"[LLAMAINDEX AGENT COMPLETE] Time: {elapsed_time:.2f}s")
+             print(f"{'='*60}\n")
+
+             # Extract the answer from the response using utility function
+             # This handles ChatMessage objects, dicts, lists, and strings
+             answer = extract_text_from_content(response)
+
+             if not answer:
+                 print("[WARNING] Agent completed but returned an empty answer")
+                 return "Error: No answer generated"
+
+             # LlamaIndex ReActAgent may wrap answers in verbose format
+             # Check if the response starts with common verbose patterns and extract the core answer
+             import re
+
+             # Pattern 1: "Answer: X" or "Final Answer: X" from ReAct format
+             react_answer_match = re.search(r'(?:Final\s+)?Answer:\s*(.+)', answer, re.IGNORECASE | re.DOTALL)
+             if react_answer_match:
+                 extracted = react_answer_match.group(1).strip()
+                 print(f"[LLAMAINDEX] Extracted answer from ReAct format: '{extracted[:100]}...'")
+                 answer = extracted
+
+             # Clean up the answer using utility function (includes stripping)
+             answer = cleanup_answer(answer)
+
+             print(f"[FINAL ANSWER] {answer}")
+             return answer
+
+         except Exception as e:
+             elapsed_time = time.time() - start_time
+             print(f"[LLAMAINDEX AGENT ERROR] Failed after {elapsed_time:.2f}s: {e}")
+             print(f"{'='*60}\n")
+             return f"Error: {str(e)[:100]}"
question_loader.py ADDED
@@ -0,0 +1,47 @@
+ """Question loading and fetching functionality."""
+
+ import json
+ import requests
+ from typing import List, Dict
+ import config
+ from utils import retry_with_backoff
+
+
+ class QuestionLoader:
+     """Handles loading questions from various sources."""
+
+     def __init__(self, api_url: str = config.DEFAULT_API_URL):
+         self.api_url = api_url
+
+     @retry_with_backoff(max_retries=3, initial_delay=1.0, backoff_factor=2.0)
+     def _fetch_from_api(self) -> List[Dict]:
+         """Fetch questions from the API with retry logic."""
+         questions_url = f"{self.api_url}/questions"
+         print(f"Fetching questions from: {questions_url}")
+
+         response = requests.get(questions_url, timeout=config.FETCH_TIMEOUT)
+         response.raise_for_status()
+         questions_data = response.json()
+
+         if not questions_data:
+             raise ValueError("Fetched questions list is empty.")
+
+         print(f"Fetched {len(questions_data)} questions.")
+         return questions_data
+
+     def _load_from_file(self, file_path: str = config.QUESTIONS_FILE) -> List[Dict]:
+         """Load questions from local file."""
+         with open(file_path, 'r', encoding='utf-8') as f:
+             questions = json.load(f)
+         print(f"[INFO] Loaded {len(questions)} questions from {file_path}")
+         return questions
+
+     def get_questions(self, test_mode: bool = False) -> List[Dict]:
+         """Get questions from local file (test) or API (production)."""
+         if test_mode:
+             try:
+                 return self._load_from_file()
+             except Exception as e:
+                 print(f"[WARNING] Offline loading failed: {e}, falling back to API")
+
+         return self._fetch_from_api()
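`_fetch_from_api` is wrapped in `retry_with_backoff(max_retries=3, initial_delay=1.0, backoff_factor=2.0)`, which is imported from `utils` and not shown in this part of the diff. The sketch below is an assumption about how a decorator with that signature could behave, not the project's actual implementation. The `sleep` parameter is an addition made here so the delays can be observed without waiting.

```python
import functools
import time


def retry_with_backoff(max_retries=3, initial_delay=1.0, backoff_factor=2.0, sleep=time.sleep):
    """Retry the wrapped callable, multiplying the delay after each failure.

    Illustrative only: the real utils.retry_with_backoff may differ.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted: surface the last error
                    sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator


# Usage: record the delays instead of sleeping.
recorded_delays = []
call_count = {"n": 0}


@retry_with_backoff(max_retries=3, initial_delay=1.0, backoff_factor=2.0, sleep=recorded_delays.append)
def flaky_fetch():
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise ValueError("temporary failure")
    return "ok"


print(flaky_fetch(), recorded_delays)
```

Under these assumptions the function is attempted at most `max_retries + 1` times, and the waits between attempts grow geometrically (1.0 s, 2.0 s, 4.0 s with the parameters used above).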
reactlanggraphagent.py ADDED
@@ -0,0 +1,168 @@
+ import os
+ import logging
+ import warnings
+ import time
+
+ # Suppress TensorFlow/Keras warnings
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
+ logging.getLogger('tensorflow').setLevel(logging.ERROR)
+ warnings.filterwarnings('ignore', module='tensorflow')
+ warnings.filterwarnings('ignore', module='tf_keras')
+
+ from langgraph.prebuilt import create_react_agent
+ from langchain_google_genai import ChatGoogleGenerativeAI
+ from langchain_core.messages import HumanMessage
+
+ from custom_tools import get_custom_tools_list
+ from system_prompt import SYSTEM_PROMPT
+ from utils import cleanup_answer, extract_text_from_content
+ import config
+
+ # Suppress BeautifulSoup GuessedAtParserWarning
+ try:
+     from bs4 import GuessedAtParserWarning
+     warnings.filterwarnings('ignore', category=GuessedAtParserWarning)
+ except ImportError:
+     pass
+
+
+ class ReActLangGraphAgent:
+     """
+     ReAct agent implementation using LangGraph's create_react_agent function.
+
+     This agent uses the ReAct (Reasoning + Acting) pattern where the agent
+     reasons about what to do and then acts by calling tools iteratively.
+     Built on top of LangGraph's prebuilt ReAct agent.
+     """
+
+     def __init__(self):
+         # Validate API keys
+         if not config.GOOGLE_API_KEY:
+             print("WARNING: GOOGLE_API_KEY not found - analyze_youtube_video will fail")
+
+         self.tools = get_custom_tools_list()
+         self.llm = self._create_llm_client()
+         self.agent_graph = self._build_agent()
+
+     def _create_llm_client(self):
+         """Create and return the LLM client."""
+         apikey = config.GOOGLE_API_KEY
+
+         return ChatGoogleGenerativeAI(
+             model=config.ACTIVE_AGENT_LLM_MODEL,
+             temperature=config.GEMINI_TEMPERATURE,
+             api_key=apikey,
+             thinking_budget=0,
+             timeout=120
+         )
+
+     def _build_agent(self):
+         """Build and return the ReAct agent graph using LangGraph's create_react_agent."""
+
+         # LangGraph's create_react_agent returns a compiled graph
+         # It automatically handles the ReAct loop with tools
+         agent_graph = create_react_agent(
+             model=self.llm,
+             tools=self.tools,
+             prompt=SYSTEM_PROMPT  # System prompt is added via the prompt parameter
+         )
+
+         return agent_graph
+
+     def __call__(self, question: str, file_name: str = None) -> str:
+         """
+         Invoke the ReAct agent with the given question and return the final answer.
+
+         Args:
+             question: The question to answer
+             file_name: Optional file name if the question references a file
+
+         Returns:
+             The agent's answer as a string
+         """
+         print(f"\n{'='*60}")
+         print(f"[REACT AGENT START] Question: {question}")
+         if file_name:
+             print(f"[FILE] {file_name}")
+         print(f"{'='*60}")
+
+         start_time = time.time()
+
+         try:
+             # Build the question with file name if provided
+             question_content = question
+             if file_name:
+                 question_content += f'\n\nNote: This question references a file: {file_name}'
+
+             # Invoke the agent graph with retry logic for 504 errors
+             max_retries = config.MAX_RETRIES
+             delay = config.INITIAL_RETRY_DELAY
+
+             for attempt in range(max_retries + 1):
+                 try:
+                     # LangGraph's create_react_agent expects messages as input
+                     response = self.agent_graph.invoke(
+                         {"messages": [HumanMessage(content=question_content)]},
+                         config={"recursion_limit": config.AGENT_RECURSION_LIMIT}  # Shared with LangGraphAgent via config
+                     )
+                     # Success - break out of retry loop
+                     break
+                 except Exception as e:
+                     error_msg = str(e)
+
+                     # Check if this is a 504 DEADLINE_EXCEEDED error
+                     if "504" in error_msg and "DEADLINE_EXCEEDED" in error_msg:
+                         if attempt < max_retries:
+                             print(f"[RETRY] Attempt {attempt + 1}/{max_retries} failed with 504 DEADLINE_EXCEEDED")
+                             print(f"[RETRY] Retrying in {delay:.1f} seconds...")
+                             time.sleep(delay)
+                             delay *= config.RETRY_BACKOFF_FACTOR
+                             continue
+                         else:
+                             print(f"[RETRY] All {max_retries} retries exhausted for 504 error")
+                             print(f"[ERROR] Agent invocation failed after retries: {e}")
+                             return f"Error: Agent failed after {max_retries} retries - {str(e)[:100]}"
+                     else:
+                         # Not a 504 error - fail immediately without retry
+                         print(f"[ERROR] Agent invocation failed: {e}")
+                         return f"Error: Agent failed - {str(e)[:100]}"
+
+             elapsed_time = time.time() - start_time
+             print(f"[REACT AGENT COMPLETE] Time: {elapsed_time:.2f}s")
+             print(f"{'='*60}\n")
+
+             # Extract the answer from the response
+             # LangGraph's create_react_agent returns the last message in the messages list
+             messages = response.get("messages", [])
+
+             if not messages:
+                 print("[WARNING] Agent completed but returned no messages")
+                 return "Error: No answer generated"
+
+             # Get the last message (the agent's final response)
+             last_message = messages[-1]
+
+             # Extract content from the message
+             if hasattr(last_message, 'content'):
+                 content = last_message.content
+             else:
+                 content = str(last_message)
+
+             # Use utility function to extract text from various content formats
+             answer = extract_text_from_content(content)
+
+             if not answer:
+                 print("[WARNING] Agent completed but returned None as answer")
+                 return "Error: No answer generated"
+
+             # Clean up the answer using utility function
+             answer = cleanup_answer(answer)
+
+             print(f"[FINAL ANSWER] {answer}")
+             return answer
+
+         except Exception as e:
+             elapsed_time = time.time() - start_time
+             print(f"[REACT AGENT ERROR] Failed after {elapsed_time:.2f}s: {e}")
+             print(f"{'='*60}\n")
+             return f"Error: {str(e)[:100]}"
requirements.txt ADDED
@@ -0,0 +1,29 @@
+ gradio
+ requests
+ huggingface_hub
+ pillow
+ ddgs
+ pytz
+ wikipedia
+ arxiv
+ langchain
+ langgraph
+ langchain-core
+ langchain-google-genai
+ langchain-huggingface
+ langchain-community
+ llama-index
+ llama-index-llms-gemini
+ llama-index-core
+ pypdf
+ youtube-transcript-api
+ pytube
+ pymupdf
+ nest_asyncio
+ speechrecognition
+ pydub
+ markdownify
+ numpy
+ pandas
+ colorama
+ gradio[oauth]
result_formatter.py ADDED
@@ -0,0 +1,61 @@
+ """Result formatting for different output types."""
+
+ import pandas as pd
+ from typing import List, Tuple, Dict
+ from colorama import Fore, Style
+
+
+ class ResultFormatter:
+     """Formats results for different output targets."""
+
+     @staticmethod
+     def format_for_api(results: List[Tuple[str, str, str]]) -> List[Dict]:
+         """Format results for API submission."""
+         return [
+             {"task_id": task_id, "submitted_answer": answer}
+             for task_id, _, answer in results
+         ]
+
+     @staticmethod
+     def format_for_display(results: List[Tuple[str, str, str]]) -> List[Dict]:
+         """Format results for UI display."""
+         return [
+             {
+                 "Task ID": task_id,
+                 "Question": question_text,
+                 "Submitted Answer": answer
+             }
+             for task_id, question_text, answer in results
+         ]
+
+     @staticmethod
+     def format_for_verification(results: List[Tuple[str, str, str]]) -> List[str]:
+         """Format results for test verification output."""
+         output = []
+         for task_id, question_text, answer in results:
+             output.append(f"\nTask ID: {task_id}")
+             output.append(f"Question: {question_text}")
+             output.append(f"Answer: {answer}")
+         return output
+
+     @staticmethod
+     def print_dataframe(df: pd.DataFrame) -> None:
+         """Print DataFrame with full content (no truncation) with colored output."""
+         pd.set_option('display.max_colwidth', None)
+         pd.set_option('display.max_rows', None)
+         for col in df.columns:
+             for val in df[col]:
+                 val_str = str(val)
+                 # Color based on content
+                 if '✓ Correct' in val_str:
+                     print(f"{Fore.GREEN}{val}{Style.RESET_ALL}", flush=True)
+                 elif '✗ Incorrect' in val_str:
+                     print(f"{Fore.RED}{val}{Style.RESET_ALL}", flush=True)
+                 elif val_str.startswith('===') or val_str.startswith('SUMMARY'):
+                     print(f"{Fore.CYAN}{val}{Style.RESET_ALL}", flush=True)
+                 elif 'ERROR' in val_str:
+                     print(f"{Fore.RED}{val}{Style.RESET_ALL}", flush=True)
+                 elif val_str.startswith('Expected:') or val_str.startswith('Got:'):
+                     print(f"{Fore.YELLOW}{val}{Style.RESET_ALL}", flush=True)
+                 else:
+                     print(val, flush=True)
scorer.py ADDED
@@ -0,0 +1,107 @@
+ #Official GAIA Scorer Module from HF. Copied from https://huggingface.co/spaces/gaia-benchmark/leaderboard/blob/main/scorer.py for offline Use. Hoping there are no licensing issues as it is intended for learning purposes only.
+ #Thanks, Hemant Virmani
+
+ import json
+ import re
+ import string
+ import warnings
+
+ import numpy as np
+
+
+ def normalize_number_str(number_str: str) -> float:
+     # we replace these common units and commas to allow
+     # conversion to float
+     for char in ["$", "%", ","]:
+         number_str = number_str.replace(char, "")
+     try:
+         return float(number_str)
+     except ValueError:
+         print(f"String {number_str} cannot be normalized to number str.")
+         return float("inf")
+
+
+ def split_string(
+     s: str,
+     char_list: list[str] = [",", ";"],
+ ) -> list[str]:
+     pattern = f"[{''.join(char_list)}]"
+     return re.split(pattern, s)
+
+
+ def question_scorer(
+     model_answer: str,
+     ground_truth: str,
+ ) -> bool:
+     def is_float(element: any) -> bool:
+         try:
+             float(element)
+             return True
+         except ValueError:
+             return False
+
+     if model_answer is None:
+         model_answer = "None"
+
+     # if gt is a number
+     if is_float(ground_truth):
+         print(f"Evaluating {model_answer} as a number.")
+         normalized_answer = normalize_number_str(model_answer)
+         return normalized_answer == float(ground_truth)
+
+     # if gt is a list
+     elif any(char in ground_truth for char in [",", ";"]):
+         print(f"Evaluating {model_answer} as a comma separated list.")
+         # question with the fish: normalization removes punct
+
+         gt_elems = split_string(ground_truth)
+         ma_elems = split_string(model_answer)
+
+         # check length is the same
+         if len(gt_elems) != len(ma_elems):
+             warnings.warn(
+                 "Answer lists have different lengths, returning False.", UserWarning
+             )
+             return False
+
+         # compare each element as float or str
+         comparisons = []
+         for ma_elem, gt_elem in zip(ma_elems, gt_elems):
+             if is_float(gt_elem):
+                 normalized_ma_elem = normalize_number_str(ma_elem)
+                 comparisons.append(normalized_ma_elem == float(gt_elem))
+             else:
+                 # we do not remove punct since comparisons can include punct
+                 comparisons.append(
+                     normalize_str(ma_elem, remove_punct=False)
+                     == normalize_str(gt_elem, remove_punct=False)
+                 )
+         return all(comparisons)
+
+     # if gt is a str
+     else:
+         print(f"Evaluating {model_answer} as a string.")
+         return normalize_str(model_answer) == normalize_str(ground_truth)
+
+
+ def normalize_str(input_str, remove_punct=True) -> str:
+     """
+     Normalize a string by:
+         - Removing all white spaces
+         - Optionally removing punctuation (if remove_punct is True)
+         - Converting to lowercase
+     Parameters:
+         - input_str: str, the string to normalize
+         - remove_punct: bool, whether to remove punctuation (default: True)
+     Returns:
+         - str, the normalized string
+     """
+     # Remove all white spaces. Required e.g for seagull vs. sea gull
+     no_spaces = re.sub(r"\s", "", input_str)
+
+     # Remove punctuation, if specified.
+     if remove_punct:
+         translator = str.maketrans("", "", string.punctuation)
+         return no_spaces.lower().translate(translator)
+     else:
+         return no_spaces.lower()
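To make the scorer's normalization rules concrete, the sketch below re-states `normalize_str` and `normalize_number_str` in condensed form (same logic as in `scorer.py`, minus the diagnostic `print`) and runs them on a few made-up inputs. The sample strings are illustrative and are not taken from the GAIA data.

```python
import re
import string


def normalize_str(input_str: str, remove_punct: bool = True) -> str:
    """Condensed copy of the scorer's string normalization, for illustration."""
    no_spaces = re.sub(r"\s", "", input_str)
    if remove_punct:
        return no_spaces.lower().translate(str.maketrans("", "", string.punctuation))
    return no_spaces.lower()


def normalize_number_str(number_str: str) -> float:
    """Condensed copy of the scorer's number normalization, for illustration."""
    for char in ["$", "%", ","]:
        number_str = number_str.replace(char, "")
    try:
        return float(number_str)
    except ValueError:
        return float("inf")  # never equal to a finite ground truth


# Plain string answers: whitespace, case and punctuation are all ignored.
print(normalize_str("Sea Gull.") == normalize_str("seagull"))

# List elements are compared with punctuation kept.
print(normalize_str("St. Louis", remove_punct=False))

# Numeric answers: "$", "%" and "," are stripped before float conversion,
# but any other trailing text makes the answer unparseable.
print(normalize_number_str("$17,000") == 17000.0)
print(normalize_number_str("17 hours"))
```

The last case shows why a numeric answer with a unit attached cannot match: the conversion falls back to infinity, which never equals the expected number.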
system_prompt.py ADDED
@@ -0,0 +1,167 @@
1
+ SYSTEM_PROMPT = """You are an expert, precise and disciplined AI assistant who can solve any task.
2
+ To do so, you have been given access to a list of external tools that you MUST use to find information.
3
+
4
+ CRITICAL: When you need to use a tool, you MUST call it using the tool calling mechanism. DO NOT write pseudo-code or descriptions of tools. ACTUALLY CALL THE TOOL.
5
+
6
+ Your task is to answer the user's question using the available tools and provide the answer in a STRICT format.
7
+
8
+ ### AVAILABLE TOOLS
9
+
10
+ You have access to the following categories of tools:
11
+
12
+ **Mathematical Operations:**
13
+ - calculate (operation, a, b): Perform arithmetic — add, subtract, multiply, divide, power, modulus
14
+
15
+ **String & Encoding:**
16
+ - string_reverse: Reverse a string (useful for gibberish or backwards-encoded text)
17
+ - classical_cipher (cipher_type, mode, text, keyword): Encrypt or decrypt Playfair and Bifid classical ciphers
18
+
19
+ **Computation:**
20
+ - execute_python (code): Execute Python 3 code and return stdout. Use for precise counting, algorithms, or math the LLM cannot do reliably. Use print() for output. IMPORTANT: execute_python runs in a subprocess in the project directory and CAN read files from the files/ directory using pandas (e.g., `import pandas as pd; df = pd.read_excel('files/filename.xlsx'); print(df)`). However, it has NO access to data from previous tool calls as Python variables — to process data returned by a previous tool, embed that data as a string literal in your code. If execute_python fails 3 times, stop and use a different approach.
21
+
22
+ **Time & Date:**
23
+ - get_current_time_in_timezone: Get current time in any timezone
24
+
25
+ **Web & Information Search:**
26
+ - websearch: Search the web using DuckDuckGo (returns 5 results with titles, URLs, snippets)
27
+ - wiki_search: Search Wikipedia (returns up to 3 detailed articles)
28
+ - arvix_search: Search academic papers on Arxiv (returns up to 3 papers)
29
+ - get_webpage_content: Load and parse any webpage as markdown (handles PDFs too)
30
+ - youtube_tool (youtube_url, question=""): Pass question="" to get raw transcript; pass a question string to analyze the video with AI (handles visual/audio content)
31
+
32
+ **File Operations:**
33
+ - read_file (file_name): Read files from the files directory — Excel/CSV → markdown table; .py/.txt/.md/.json → raw text
34
+ - parse_audio_file (file_name): Transcribe MP3 audio files to text
35
+ - analyze_image (question, file_name): Analyze image files (.png, .jpg, .jpeg, etc.) using AI vision
36
+ - download_file (url, file_name): Download a file from a URL and save it to the files directory before reading it
37
+
38
+ **HTTP:**
39
+ - http_request (method, url, headers_json, body_json): Make GET/POST/DELETE requests with custom headers or body
40
+
41
+ **Meta / Planning:**
42
+ - ask_advisor (question): Consult a more capable AI when you are completely stuck on HOW TO SEARCH for something, after 2+ failed search attempts with no useful results. NEVER call ask_advisor if any tool (websearch, wiki_search, get_webpage_content, read_file, execute_python, parse_audio_file, analyze_image) has already returned data — work with the data you have. NEVER call it for calculation help, code execution problems, or when you have partial results. At most 1 call per question.
43
+
44
+ **IMPORTANT:** If the question mentions a file or you see "Note: This question references a file: filename.ext" in the question, use the appropriate file reading tool with that filename:
45
+ - For images (.png, .jpg, .jpeg, .gif, .webp, .bmp): Use analyze_image with your question and the filename
46
+ - For Excel files (.xlsx) or CSV (.csv): Use read_file
47
+ - For Python files (.py) or text files (.txt, .md, .json): Use read_file
48
+ - For audio files (.mp3): Use parse_audio_file
49
+
50
+ ### WORKFLOW
51
+
52
+ 1. **Analyze the Question**: Break down what information you need and what steps are required
53
+ 2. **Use Tools Strategically and Efficiently**:
54
+ - PRIORITY ORDER: Use specific domain tools first, then general search
55
+ 1. For academic/scientific: Try arvix_search first
56
+ 2. For general knowledge: Try wiki_search first
57
+ 3. For current events/specific facts: Use websearch
58
+ 4. For detailed investigation: Use get_webpage_content on promising URLs
59
+ - QUERY OPTIMIZATION: If first search fails, try 2-3 different query phrasings before switching tools
60
+ - AVOID REDUNDANCY: Don't repeat the same search with the same tool
61
+ - Chain calculations using math tools in sequence rather than separate calls
+ 3. **Process Tool Results**: Extract relevant information from tool outputs
+ 4. **Calculate/Reason**: If multiple steps are needed, use tools sequentially
+ 5. **Verify**: Double-check your answer makes sense given the question
+ 6. **Output**: Provide ONLY the final answer in the exact format required
+
+ ### CRITICAL OUTPUT RULES (ZERO TOLERANCE)
+
+ 1. **SINGLE LINE / SINGLE WORD OUTPUT**: Output ONLY the answer value — a single word, short phrase, or number. NO multi-line responses. NO paragraphs. NO explanations.
+ 2. **NO CONVERSATIONAL FILLER**: Do not use phrases like "I found", "The answer is", "Here are the results", "Based on the search", "According to", "After checking", "Looking at", "The X was Y", etc.
+ 3. **NO PREAMBLE OR POSTSCRIPT**: Do NOT include "FINAL ANSWER:", "Result:", "Answer:", or any other prefix/suffix
+ 4. **NO MARKDOWN/TAGS**: Do not wrap the answer in markdown code blocks, JSON, or XML tags
+ 5. **NO STRUCTURED DATA**: Do NOT output dictionaries, JSON objects, or any structured format - ONLY a single value
+ 6. **NO TOOL CODE IN OUTPUT**: Never output raw Python code or tool calls (like `tool_code`, `print()`, `default_api.websearch()`)
+ 7. **EXACT MATCH SCORING**: The grading system checks for an exact string match. Any extra character will cause failure
+ 8. **ALWAYS USE TOOLS**: If you do not know the answer, use the available tools. Do NOT hallucinate or guess
+ 9. **TRY MULTIPLE APPROACHES**: If one search doesn't work, try different queries or different tools
+ 10. **FOR NUMERICAL ANSWERS**:
+ - NO comma separators (use "17000" not "17,000")
+ - NO units unless explicitly requested (use "17" not "17 hours" or "17 thousand")
+ - NO text forms (use "17" not "seventeen")
+ - Follow rounding instructions exactly as specified in the question
+ - If the question asks for "thousands", provide the number of thousands (e.g., "17" for 17,000)
+
+ ### CRITICAL: SINGLE VALUE ONLY
+ Your response must be a single line of plain text — just the answer with NO additional text. Examples of WRONG outputs:
+ - ❌ {'type': 'text', 'text': 'answer'}
+ - ❌ {"answer": "value"}
+ - ❌ `answer`
+ - ❌ **answer**
+ - ❌ The answer is: answer
+ - ❌ The nominator was JohnDoe (WRONG - has preamble)
+ - ❌ The featured article "SomeTopic" was promoted... (WRONG - full sentence)
+
+ Examples of CORRECT outputs:
+ - ✅ 7
+ - ✅ 1995
+ - ✅ blue
+ - ✅ Harrison
+ - ✅ Nf3
+ - ✅ Tanaka, Yamamoto
+ - ✅ Erik
+ - ✅ semicolon
+ - ✅ 23000
+
+ CRITICAL: Even after long multi-step reasoning, your final output is ONLY the bare answer. Do NOT include the reasoning. Examples of WRONG outputs that contain the correct answer but will still fail:
+ - ❌ The only recipient whose country no longer exists is John Smith... His first name is John (WRONG — contains reasoning)
+ - ❌ Player A's number is 12. The pitcher with number 18 is Garcia and number 20 is Martinez (WRONG — contains reasoning)
+ - ❌ The answer is John (WRONG — has preamble)
+ - ❌ Alex Brown led the team in walks with 80. In that same season, he had 412 at-bats (WRONG — answer 412 is buried at end of sentence)
+ - ❌ The specimens described in the 2005 paper were eventually deposited in Berlin (WRONG — answer is buried at end of sentence)
+ - ❌ The work was supported under grant number ABC123456 (WRONG — answer is buried at end of sentence)
+ - ❌ The countries with the fewest athletes are Brazil (BRA) and Chile (CHI), both with 1. Alphabetically, Brazil comes first (WRONG — answer is BRA)
+ - ❌ The competition records show John Smith as a 1983 recipient with Westland as his nationality. Westland no longer exists (WRONG — answer is John)
+ - ❌ Player A's number is 12. The pitcher with number 18 is Garcia and number 20 is Martinez (WRONG — answer must be just: Garcia, Martinez)
+
+ For the last six examples above, the CORRECT output would be just: 412 / Berlin / ABC123456 / BRA / John / Garcia, Martinez
+
+ ### IMPORTANT NOTES
+
+ - **Reversed/Encoded Text**: If text looks like gibberish, use string_reverse tool to decode it
+ - **Multiple Search Results**: If websearch returns multiple results, you may need to use get_webpage_content on relevant URLs to find the exact answer
+ - **Calculations**: Break down complex math problems and use the math tools step by step
+ - **File References**: When questions mention files, use the appropriate read tool based on file extension
+ - **Image Analysis**: For visual questions with image files (.png, .jpg, etc.), use analyze_image with the question and filename
+ - **YouTube Content**: Use youtube_tool with question="" for the raw transcript; pass a non-empty question to have AI analyze the video's visual and audio content
+ - **Audio Transcription**: When listing ingredients, items, or any content from audio, use the EXACT phrasing heard — do NOT simplify or paraphrase. "freshly squeezed lemon juice" ≠ "lemon juice". Every modifier matters. If the question asks to alphabetize the result, sort the items alphabetically AFTER transcribing — the order heard in the audio does not matter, only the words.
+ - **List Ordering**: When a question asks for a list of ingredients, grocery items, or similar unordered items and no explicit ordering is specified, output the items sorted in alphabetical order. When the question EXPLICITLY asks to alphabetize, always sort alphabetically regardless of the order encountered during research. CRITICAL: Alphabetize by the ENTIRE item string exactly as written, starting from the first character of the first word — NOT by the "main" noun or any internal keyword. A multi-word item with a leading modifier sorts by that modifier (e.g., "ground black pepper" sorts under G, not under B or P), not by a later noun in the phrase.
+ - **Verification**: After finding an answer, verify it matches what the question is asking for
+ - **Location Names**: Always expand abbreviated location names to their full form
+ - "St." → "Saint" (e.g., "Saint Petersburg", "Saint Paul", "Saint Louis")
+ - "Mt." → "Mount" (e.g., "Mount Everest", "Mount Rushmore")
+ - "Ft." → "Fort" (e.g., "Fort Worth", "Fort Lauderdale")
+ - Use the canonical/official name when multiple forms exist
+
+ ### PRECISION AND VERIFICATION
+
+ - **Category Distinctions**: Pay careful attention to category qualifiers in questions (e.g., a subset qualifier vs. the whole set, or a part of a name vs. the full name). Filter results precisely to match the exact category requested, and answer the exact entity the question asks for rather than a related one.
+ - **Time-Sensitive Data**: When questions specify a date or time period (e.g., "as of July 2023", "compiled 08/21/2023"), you MUST use data from that exact timeframe. **MANDATORY WAYBACK MACHINE RULE**: For ANY question containing date phrases like "as of [date]", "compiled [date]", "as of [month year]" — you MUST fetch the archived Wayback Machine version of relevant webpages. Use this URL format: https://web.archive.org/web/YYYYMMDD000000/[original_URL] where you replace YYYY, MM, DD with the question's date. Example: question says "compiled 08/21/2023" → fetch https://web.archive.org/web/20230821000000/[URL]. Example: question says "as of July 2023" → fetch https://web.archive.org/web/20230701000000/[URL]. Do NOT use current data when a historical date is specified — current pages may differ significantly. If the Wayback Machine page does not contain the expected information, try these variations: (1) simplify the URL path (e.g., remove parenthetical or trailing path segments), (2) try a snapshot a day or two before/after the target date, (3) try the current page as a fallback.
+ - **Cross-Verification**: For factual questions, try to verify answers from multiple independent sources when possible. If sources conflict, prefer official/primary sources (Wikipedia, official websites) over secondary sources.
+ - **Unique Constraints**: When questions use words like "only", "unique", or "single", verify that exactly one item matches the criteria. If multiple items match, re-examine the constraints.
+ - **Sequential/Ordered Data**: For questions about sequences, rankings, or ordered lists (jersey numbers, chronological order, etc.), carefully verify the exact position or order from authoritative sources.
+
+ ### ERROR HANDLING
+
+ - If a tool fails, try again with a different query or approach
+ - If multiple sources give conflicting information, use the most authoritative source
+ - If websearch returns results but you need more detail, use get_webpage_content on the most relevant URL
+ - If you cannot find the answer after exhausting all tools and approaches, output: Unable to determine [brief reason]
+
+ ### REMEMBER
+
+ Your intermediate reasoning and tool usage are separate from your final output. Think through the problem, use tools as needed, but when you output your final answer, it must be ONLY the answer value with NO additional text.
+
+ ### ABSOLUTE FINAL RULE
+
+ After all reasoning and tool calls, your LAST message must be the BARE ANSWER ONLY — one word, one number, or a short comma-separated list. No sentence. No explanation. No prefix. If you find yourself writing a sentence as your final output, STOP, DELETE it, and output only the answer value.
+
+ WRONG: "The answer based on my research is Jane Smith"
+ RIGHT: Jane
+
+ WRONG: "In that same season, Alex Brown had 412 at-bats"
+ RIGHT: 412
+
+ WRONG: "Brazil comes first alphabetically, so the answer is BRA"
+ RIGHT: BRA
+ """
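Two of the prompt's rules are mechanical enough to sketch in code: the Wayback Machine URL format and the whole-string alphabetization rule. The helper names `wayback_url` and `sort_items` are hypothetical and do not appear in this commit, and the case-insensitive comparison is an assumption, since the prompt does not say how case should be handled.

```python
from datetime import date
from typing import List


def wayback_url(original_url: str, as_of: date) -> str:
    """Build a snapshot URL in the YYYYMMDD000000 format the prompt describes."""
    timestamp = as_of.strftime("%Y%m%d") + "000000"
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def sort_items(items: List[str]) -> List[str]:
    """Sort by the entire item string, first character first (case-insensitive by assumption)."""
    return sorted(items, key=str.casefold)


# "compiled 08/21/2023" maps to the 20230821000000 timestamp.
print(wayback_url("https://example.com/page", date(2023, 8, 21)))
# "ground black pepper" sorts under G because the whole string is the key.
print(sort_items(["salt", "ground black pepper", "freshly squeezed lemon juice"]))
```

Sorting on the full string means a leading modifier decides the position, which is the behavior the List Ordering note asks for.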
utils.py ADDED
@@ -0,0 +1,195 @@
+ """Utility functions for GAIA Benchmark Agent including retry logic and answer cleanup."""
+
+ import re
+ import time
+ import requests
+ from typing import Callable, Any
+ from functools import wraps
+ import config
+
+
+ def retry_with_backoff(
+     max_retries: int = config.MAX_RETRIES,
+     initial_delay: float = config.INITIAL_RETRY_DELAY,
+     backoff_factor: float = config.RETRY_BACKOFF_FACTOR,
+     exceptions: tuple = (requests.RequestException,)
+ ):
+     """
+     Decorator to retry a function with exponential backoff.
+
+     Args:
+         max_retries: Maximum number of retry attempts
+         initial_delay: Initial delay in seconds before first retry
+         backoff_factor: Multiplier for delay after each retry
+         exceptions: Tuple of exception types to catch and retry
+     """
+     def decorator(func: Callable) -> Callable:
+         @wraps(func)
+         def wrapper(*args, **kwargs) -> Any:
+             delay = initial_delay
+             last_exception = None
+
+             for attempt in range(max_retries + 1):
+                 try:
+                     return func(*args, **kwargs)
+                 except exceptions as e:
+                     last_exception = e
+                     if attempt < max_retries:
+                         print(f"[RETRY] Attempt {attempt + 1}/{max_retries + 1} failed: {e}")
+                         print(f"[RETRY] Retrying in {delay:.1f} seconds...")
+                         time.sleep(delay)
+                         delay *= backoff_factor
+                     else:
+                         print(f"[RETRY] All {max_retries} retries exhausted")
+
+             # Re-raise the last exception if all retries failed
+             raise last_exception
+
+         return wrapper
+     return decorator
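As a reading aid for the decorator above, here is a minimal sketch of the delay schedule it produces: the function is called once, then retried up to `max_retries` times, sleeping `initial_delay` before the first retry and multiplying the delay by `backoff_factor` after each one. The helper `backoff_delays` is hypothetical, and the numbers are illustrative only, since the real defaults live in `config.py` and are not shown in this diff.

```python
from typing import List


def backoff_delays(max_retries: int, initial_delay: float, backoff_factor: float) -> List[float]:
    """Return the sleep durations, in order, that precede each retry."""
    delays = []
    delay = initial_delay
    for _ in range(max_retries):
        delays.append(delay)
        delay *= backoff_factor
    return delays


# Illustrative values, not the project's configured defaults.
print(backoff_delays(max_retries=3, initial_delay=1.0, backoff_factor=2.0))
```

With these example values the schedule is `[1.0, 2.0, 4.0]`, so a call that fails every time is attempted four times and sleeps seven seconds in total before the last exception is re-raised.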
+
+
+ def extract_text_from_content(content: Any) -> str:
+     """
+     Extract plain text from various content formats returned by LLM agents.
+
+     This function handles multiple content formats:
+     - AgentOutput objects (LlamaIndex): Extracts the response attribute
+     - Message objects with 'content' attribute: Extracts the content attribute
+       (works for LlamaIndex ChatMessage, LangChain AIMessage, etc.)
+     - String: Returns as-is
+     - Dict with 'text' field: Extracts the text value
+     - List of content blocks: Extracts text from all blocks with type='text'
+     - Other types: Converts to string
+
+     Args:
+         content: The content object from an LLM response (can be str, dict, list, etc.)
+
+     Returns:
+         str: Extracted plain text content
+     """
+     # Handle LlamaIndex AgentOutput objects (has 'response' attribute)
+     if hasattr(content, 'response') and not isinstance(content, (str, dict, list)):
+         # Extract the response attribute from AgentOutput
+         response = content.response
+         # The response might itself be a message object with 'content'
+         if hasattr(response, 'content'):
+             return str(response.content)
+         elif hasattr(response, 'message') and hasattr(response.message, 'content'):
+             return str(response.message.content)
+         else:
+             return str(response)
+
+     # Handle message objects with 'content' attribute (e.g., ChatMessage from various frameworks)
+     # This works for LlamaIndex ChatMessage, LangChain AIMessage, etc.
+     if hasattr(content, 'content') and not isinstance(content, (str, dict, list)):
+         # Extract the content attribute (works for any message object)
+         return str(content.content)
+
+     # Handle dict format (e.g., {'text': 'answer'})
+     if isinstance(content, dict):
+         if 'text' in content:
+             return str(content['text'])
+         else:
+             print("[WARNING] Content was dict without 'text' field, converting to string")
+             return str(content)
+
+     # Handle list format (e.g., [{'type': 'text', 'text': 'answer'}])
+     elif isinstance(content, list):
+         text_parts = []
+         for item in content:
+             if isinstance(item, dict):
+                 # Look for items with type='text' and extract the 'text' field
+                 if item.get('type') == 'text':
+                     text_parts.append(str(item.get('text', '')))
+                 # Fallback: if there's a 'text' field but no type, use it
+                 elif 'text' in item:
+                     text_parts.append(str(item['text']))
+             elif isinstance(item, str):
+                 text_parts.append(item)
+             else:
+                 text_parts.append(str(item))
+
+         result = ' '.join(text_parts)
+         if len(content) > 1 or (len(content) == 1 and isinstance(content[0], dict)):
+             print(f"[INFO] Extracted text from list with {len(content)} item(s)")
+         return result
+
+     # Handle string format (already plain text)
+     elif isinstance(content, str):
+         return content
+
+     # Fallback for other types
+     else:
+         print(f"[WARNING] Content was {type(content)}, converting to string")
+         return str(content)
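The list branch of `extract_text_from_content` is the least obvious one, so here is a small standalone sketch of just that branch. `join_text_blocks` is a hypothetical name used for illustration, and the sample blocks are made up rather than taken from any real agent response.

```python
from typing import Any, List


def join_text_blocks(blocks: List[Any]) -> str:
    """Join the text of content blocks, skipping dicts that carry no 'text' field."""
    parts = []
    for item in blocks:
        if isinstance(item, dict):
            # Blocks typed as text, or untyped dicts that still have a 'text' key.
            if item.get("type") == "text":
                parts.append(str(item.get("text", "")))
            elif "text" in item:
                parts.append(str(item["text"]))
        else:
            # Plain strings pass through; anything else is stringified.
            parts.append(str(item))
    return " ".join(parts)


sample = [{"type": "text", "text": "Saint"}, {"type": "image", "url": "x"}, "Petersburg"]
print(join_text_blocks(sample))
```

Non-text blocks such as the image dict are dropped silently, so the sample yields `Saint Petersburg`.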
+
+
+ def cleanup_answer(answer: Any) -> str:
+     """
+     Clean up the agent answer to ensure it's in plain text format.
+
+     This function:
+     - Converts answer to string
+     - Handles multi-line answers (extracts last meaningful non-debug line)
+     - Normalizes whitespace
+     - Strips trailing periods
+     - Logs warnings for verbose or malformatted answers
+
+     Args:
+         answer: The raw answer from the agent (can be str, dict, list, etc.)
+
+     Returns:
+         str: Cleaned up answer as plain text
+     """
+     answer = str(answer).strip()
+
+     if not answer:
+         return answer
+
+     # Handle multi-line: take the last line that isn't a debug/log prefix
+     lines = [line.strip() for line in answer.split('\n') if line.strip()]
+     if len(lines) > 1:
+         debug_prefixes = ('[info', '[warning', '[error', '[retry', '[step', '[tool', '[final')
+         for line in reversed(lines):
+             if not line.lower().startswith(debug_prefixes):
+                 answer = line
+                 break
+         else:
+             answer = lines[-1]
+         print(f"[CLEANUP] Extracted last meaningful line from {len(lines)}-line answer: '{answer[:80]}'")
+
+     # NOTE: Do NOT strip commas here. The GAIA scorer's normalize_number_str already
+     # strips commas from numeric answers, and split_string uses commas to split list
+     # answers. Stripping here would corrupt comma-separated lists (e.g., "132,133,134"
+     # becomes the invalid number string "132133134").
+
+     # Normalize whitespace and strip trailing periods
+     answer = ' '.join(answer.split()).strip().rstrip('.')
+
+     # Sentence.NUMBER suffix — the model echoed its final answer as a bare number
+     # appended directly after its reasoning, e.g. "...published 3 albums (included).3".
+     # Match a NON-DIGIT char before the period (covers letters, ')', etc.) and require
+     # whitespace earlier in the string so genuine bare decimals like "89706.00" or
+     # "3.14" (no spaces) are never altered.
+     if ' ' in answer and re.search(r'[^\d\s]\s*\.\d+$', answer):
+         extracted = re.search(r'\.(\d+)$', answer).group(1)
+         print(f"[CLEANUP] Extracted appended number from verbose answer: '{extracted}'")
+         answer = extracted
+
+     # NOTE: We deliberately do NOT regex-extract a bare answer out of a verbose
+     # sentence. The model is instructed to emit only the bare answer, and earlier
+     # extraction patterns here were reverse-engineered from specific GAIA questions —
+     # an integrity and maintenance hazard. If the model returns prose, the right fix
+     # is the prompt/model, not question-tuned post-processing.
+
+     # Log if answer looks verbose (agent not following instructions)
+     if len(answer) > 100:
+         print(f"[WARNING] Answer appears verbose ({len(answer)} chars). Agent may not be following SYSTEM_PROMPT instructions.")
+         print(f"[WARNING] First 150 chars: {answer[:150]}...")
+
+     # Log if answer contains suspicious formatting characters
+     if any(char in answer for char in ['{', '}', '[', ']', '`', '*', '#']):
+         print(f"[WARNING] Answer contains suspicious formatting characters: {answer[:100]}")
+
+     return answer
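The "Sentence.NUMBER suffix" rule in `cleanup_answer` is easy to misread, so the sketch below isolates it using the same two regular expressions. `strip_appended_number` is a hypothetical name for illustration, and apart from the string quoted in the code comment the sample inputs are invented.

```python
import re


def strip_appended_number(answer: str) -> str:
    """Return the trailing number when it was glued onto prose; otherwise return the input."""
    # Require a space somewhere (prose) and a non-digit right before the final ".<digits>".
    if " " in answer and re.search(r"[^\d\s]\s*\.\d+$", answer):
        return re.search(r"\.(\d+)$", answer).group(1)
    return answer


for sample in ["published 3 albums (included).3", "89706.00", "3.14", "costs 3.14"]:
    print(repr(sample), "->", repr(strip_appended_number(sample)))
```

Only the first sample is rewritten (to `3`). The two bare decimals contain no space, and in `costs 3.14` the character before the final period is a digit, so all three pass through unchanged.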
validators.py ADDED
@@ -0,0 +1,115 @@
+ """Input validation utilities."""
+
+ import re
+ from typing import List, Optional, Tuple
+
+
+ class ValidationError(Exception):
+     """Custom exception for validation errors."""
+     pass
+
+
+ class InputValidator:
+     """Validates user inputs."""
+
+     @staticmethod
+     def validate_username(username: str) -> str:
+         """
+         Validate username for submission.
+
+         Args:
+             username: The username to validate
+
+         Returns:
+             Cleaned username
+
+         Raises:
+             ValidationError: If username is invalid
+         """
+         if not username or not username.strip():
+             raise ValidationError("Username cannot be empty")
+
+         cleaned = username.strip()
+
+         if len(cleaned) < 3:
+             raise ValidationError("Username must be at least 3 characters")
+
+         if len(cleaned) > 50:
+             raise ValidationError("Username must be at most 50 characters")
+
+         # Allow alphanumeric, underscore, hyphen
+         if not re.match(r'^[a-zA-Z0-9_-]+$', cleaned):
+             raise ValidationError("Username can only contain letters, numbers, underscore, and hyphen")
+
+         return cleaned
+
+     @staticmethod
+     def validate_filter_indices(filter_list: Optional[Tuple], max_index: int) -> Optional[List[int]]:
+         """
+         Validate filter indices for test questions.
+
+         Args:
+             filter_list: Tuple/list of indices or None
+             max_index: Maximum valid index (exclusive)
+
+         Returns:
+             Validated list of indices or None
+
+         Raises:
+             ValidationError: If indices are invalid
+         """
+         if filter_list is None:
+             return None
+
+         if not isinstance(filter_list, (list, tuple)):
+             raise ValidationError("Filter must be a list or tuple")
+
+         if not filter_list:
+             raise ValidationError("Filter cannot be empty (use None for all questions)")
+
+         validated = []
+         for idx in filter_list:
+             if not isinstance(idx, int):
+                 raise ValidationError(f"Filter index must be integer, got {type(idx)}")
+
+             if idx < 0:
+                 raise ValidationError(f"Filter index cannot be negative: {idx}")
+
+             if idx >= max_index:
+                 raise ValidationError(f"Filter index {idx} out of range (max: {max_index - 1})")
+
+             validated.append(idx)
+
+         return validated
+
+     @staticmethod
+     def validate_questions_data(questions_data: object) -> List[dict]:
+         """
+         Validate questions data structure.
+
+         Args:
+             questions_data: Data to validate
+
+         Returns:
+             Validated questions list
+
+         Raises:
+             ValidationError: If data is invalid
+         """
+         if not isinstance(questions_data, list):
+             raise ValidationError(f"Questions data must be a list, got {type(questions_data)}")
+
+         if not questions_data:
+             raise ValidationError("Questions list is empty")
+
+         for idx, item in enumerate(questions_data):
+             if not isinstance(item, dict):
+                 raise ValidationError(f"Question {idx} must be a dict, got {type(item)}")
+
+             if "task_id" not in item:
+                 raise ValidationError(f"Question {idx} missing 'task_id'")
+
+             if "question" not in item:
+                 raise ValidationError(f"Question {idx} missing 'question'")
+
+         return questions_data
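The validators raise `ValidationError` rather than returning a status, so callers are expected to catch it. The sketch below shows one way a caller might do that. To stay self-contained it uses a condensed stand-in for `validate_username` instead of importing `validators.py`; the `describe` function is hypothetical, and `re.fullmatch` is swapped in for the anchored `re.match` as an equivalent check on already-stripped input.

```python
import re


class ValidationError(Exception):
    """Stand-in for the ValidationError class in validators.py."""


def validate_username(username: str) -> str:
    """Condensed stand-in mirroring the checks in InputValidator.validate_username."""
    cleaned = (username or "").strip()
    if not cleaned:
        raise ValidationError("Username cannot be empty")
    if len(cleaned) < 3:
        raise ValidationError("Username must be at least 3 characters")
    if len(cleaned) > 50:
        raise ValidationError("Username must be at most 50 characters")
    if not re.fullmatch(r"[a-zA-Z0-9_-]+", cleaned):
        raise ValidationError("Username can only contain letters, numbers, underscore, and hyphen")
    return cleaned


def describe(username: str) -> str:
    """Hypothetical caller: turn a validation failure into a user-facing message."""
    try:
        return f"ok: {validate_username(username)}"
    except ValidationError as exc:
        return f"rejected: {exc}"


for candidate in ["  gaia_user-01  ", "ab", "bad name!"]:
    print(describe(candidate))
```

Catching the dedicated exception type keeps validation failures separate from unexpected errors, which still propagate to the caller.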