# Agent Architecture Documentation
## Overview
This is a LangGraph-based AI agent designed for the GAIA (General AI Assistants) benchmark evaluation. The agent uses GPT-4o/GPT-4o-mini with tool-calling capabilities to answer complex multi-step questions involving web search, file analysis, multimedia processing, and reasoning.
## System Architecture
```
┌───────────────────────────────────────────────────────────────────┐
│                           User Request                            │
│                        (20 GAIA Questions)                        │
└────────────────────────────────┬──────────────────────────────────┘
                                 │
┌────────────────────────────────▼──────────────────────────────────┐
│                             app.py                                │
│   • Fetches questions from API                                    │
│   • Downloads attached files (Excel, MP3, images, Python)         │
│   • Saves files to downloads/ directory                           │
│   • Calls BasicAgent for each question                            │
│   • Submits answers to evaluation API                             │
└────────────────────────────────┬──────────────────────────────────┘
                                 │
┌────────────────────────────────▼──────────────────────────────────┐
│                           BasicAgent                              │
│                      (agent/basic_agent.py)                       │
│                                                                   │
│   1. Check Cache (agent_cache.json)                               │
│      └─ If cached: Return answer instantly                        │
│                                                                   │
│   2. If not cached:                                               │
│      └─ Invoke LangGraph workflow                                 │
│                                                                   │
│   3. Clean & validate answer                                      │
│      └─ Remove JSON, code blocks, explanations                    │
│                                                                   │
│   4. Cache answer to disk                                         │
│      └─ Save to agent_cache.json for future use                   │
└────────────────────────────────┬──────────────────────────────────┘
                                 │
┌────────────────────────────────▼──────────────────────────────────┐
│                        LangGraph Workflow                         │
│                         (agent/graph.py)                          │
│                                                                   │
│   ┌──────────────┐                                                │
│   │  Agent Node  │ ← Decides next action                          │
│   │   (GPT-4o)   │   • Analyze question                           │
│   └──────┬───────┘   • Choose tool(s)                             │
│          │           • Generate response                          │
│          ▼                                                        │
│   ┌──────────────┐                                                │
│   │  Tools Node  │ ← Executes tools                               │
│   │              │   • Search, calculate, read files              │
│   └──────┬───────┘   • Returns results                            │
│          │                                                        │
│          ▼                                                        │
│   ┌──────────────┐                                                │
│   │  Agent Node  │ ← Processes results                            │
│   │   (GPT-4o)   │   • Analyzes tool output                       │
│   └──────┬───────┘   • Decides: more tools or final answer?       │
│          │                                                        │
│          └────────── Loop (max 50 iterations)                     │
│                                                                   │
│   Final Answer → Return to BasicAgent                             │
└───────────────────────────────────────────────────────────────────┘
```
## Core Components
### 1. **app.py** - Main Application
**Responsibilities:**
- Fetch questions from evaluation API
- Download attached files from `/files/{task_id}` endpoint
- Orchestrate agent execution for all questions
- Submit answers to evaluation API
- Display results
**Key Functions:**
- `run_and_submit_all()` - Main evaluation loop
- File download with error handling
- Results aggregation and submission
### 2. **agent/basic_agent.py** - Agent Wrapper
**Responsibilities:**
- Manage agent lifecycle
- Implement caching system (persistent to disk)
- Clean and validate answers
- Logging to file
**Key Features:**
- **Persistent Caching:** Saves answers to `agent_cache.json`
- **Answer Cleaning:** Removes JSON, code blocks, explanations
- **Validation:** Ensures no empty answers submitted
- **Logging:** All output saved to timestamped log files
**Cache System:**
```python
{
"question_text": "answer",
"How many albums...": "4",
"What is 2+2?": "4"
}
```
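A minimal sketch of the load/save cycle behind this cache; the helper names below are illustrative, not the exact `BasicAgent` methods:
```python
import json
from pathlib import Path

CACHE_FILE = Path("agent_cache.json")

def load_cache() -> dict:
    """Load the question→answer map; start empty if the file is missing or corrupt."""
    try:
        return json.loads(CACHE_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}

def save_answer(cache: dict, question: str, answer: str) -> None:
    """Record an answer and persist immediately, so re-runs skip the LLM."""
    cache[question] = answer
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
```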
### 3. **agent/graph.py** - LangGraph Workflow
**Responsibilities:**
- Define agent workflow (nodes and edges)
- Initialize LLM chains (primary + fallback)
- Initialize and manage tools
- Route between agent and tools nodes
**Workflow Structure:**
```
START → Agent Node → [Tools Node] → Agent Node → END
                ↑_______________|
        (loop until answer found or max iterations)
```
**Key Components:**
- `agent_node()` - LLM decision making
- `tool_node()` - Tool execution
- `should_continue()` - Routing logic
- System prompt with detailed instructions
**LLM Configuration:**
- **Primary:** GPT-4o (with tools)
- **Fallback:** GPT-4o-mini (with tools)
- **Recursion Limit:** 50 iterations
- **Rate Limiting:** Exponential backoff (5 retries)
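A minimal sketch of how these pieces could be wired together, using LangGraph's prebuilt `MessagesState` and `ToolNode`; the node names match the components above, but `llm_with_tools` and `tools` are assumed to come from `agent/tools.py`, and the real `agent/graph.py` may differ in detail:
```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode

def agent_node(state: MessagesState) -> dict:
    """LLM decision making: analyze the question, pick tools, or answer."""
    # llm_with_tools = GPT-4o (with GPT-4o-mini fallback) bound to the tool suite.
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: MessagesState) -> str:
    """Routing logic: go to tools while the last message requests a tool call."""
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else END

tool_node = ToolNode(tools)        # tool execution via the prebuilt executor

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")   # loop back until a final answer
workflow = graph.compile()

# Invoked with the iteration cap described above:
# result = workflow.invoke(state, config={"recursion_limit": 50})
```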
### 4. **agent/tools.py** - Tool Implementations
**Responsibilities:**
- Implement all tools available to the agent
- Handle file path resolution (current dir + downloads/)
- Integrate with external APIs (Gemini, search engines)
**Available Tools:**
#### Search & Research (5 tools)
- `duckduckgo_search` - Web search
- `tavily_search` - Advanced web search
- `wikipedia` - Wikipedia lookup
- `youtube_transcript` - Get YouTube transcripts
- `arxiv_search` - Academic paper search
#### File Operations (5 tools)
- `list_files` - List files in current/downloads directory
- `read_file` - Read text files
- `read_excel` - Read and analyze Excel files
- `download_file` - Download files from URLs
- `execute_python_file` - Run Python scripts
#### Multimedia Analysis (3 tools - Gemini-powered)
- `understand_video` - Analyze YouTube videos
- `understand_audio` - Transcribe and analyze MP3/audio
- `analyze_image` - Analyze images (chess, diagrams, text)
#### Computation (2 tools)
- `calculator` - Safe math evaluation (sketched after this list)
- `python_repl` - Execute Python code
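As one concrete example, here is a hedged sketch of how the `calculator` tool could be declared with LangChain's `@tool` decorator over a small AST-based evaluator; the operator set and helper names are illustrative, not the exact `agent/tools.py` code:
```python
import ast
import operator
from langchain_core.tools import tool

# Operators the evaluator accepts; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def _safe_eval(node: ast.AST) -> float:
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.operand))
    raise ValueError("unsupported expression")

@tool
def calculator(expression: str) -> str:
    """Safely evaluate a basic arithmetic expression such as '2 + 2 * 3'."""
    return str(_safe_eval(ast.parse(expression, mode="eval").body))
```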
**File Path Resolution:**
All file tools use the `find_file()` helper (sketched below), which checks:
1. The current directory
2. The `downloads/` directory
3. If neither contains the file, it falls back to the `downloads/` path
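A minimal sketch of that helper, assuming the real `find_file()` in `agent/tools.py` has a similar shape:
```python
from pathlib import Path

def find_file(name: str) -> Path:
    """Resolve a file name against the current directory, then downloads/."""
    for candidate in (Path(name), Path("downloads") / name):
        if candidate.exists():
            return candidate
    # No match: fall back to the downloads/ path so later error
    # messages point at the expected location.
    return Path("downloads") / name
```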
## Data Flow
### Question Processing Flow
```
1. API Request
   └─ GET /questions
      └─ Returns: [{task_id, question, Level, file_name}, ...]

2. File Download (if file_name exists)
   └─ GET /files/{task_id}
      └─ Save to: downloads/{file_name}

3. Agent Invocation
   ├─ Check cache
   │  └─ If hit: Return cached answer (0 LLM calls)
   │
   └─ If miss:
      ├─ Create initial state with question
      ├─ Invoke LangGraph workflow
      │  ├─ Agent decides action
      │  ├─ Execute tools
      │  ├─ Agent processes results
      │  └─ Loop until answer or max iterations
      │
      ├─ Extract answer from final message
      ├─ Clean answer (remove JSON, explanations)
      ├─ Validate answer (ensure not empty)
      └─ Cache to disk

4. Answer Submission
   └─ POST /submit
      └─ Body: {username, answers: [{task_id, submitted_answer}]}
```
## Tool Execution Flow
```
Agent Node (GPT-4o)
        │
        ▼
Decides: "I need to use list_files tool"
        │
        ▼
Tool Node
├─ Finds tool by name
├─ Validates parameters
├─ Executes tool._run()
│  └─ Example: list_files()
│     ├─ Check current directory
│     ├─ Check downloads/ directory
│     └─ Return: "Files found:\n./app.py\ndownloads/data.xlsx"
└─ Returns ToolMessage with result
        │
        ▼
Agent Node (GPT-4o)
├─ Receives tool output
├─ Analyzes results
└─ Decides: Use another tool OR provide final answer
```
## Key Design Decisions
### 1. **Persistent Caching**
**Why:** Reduce costs and enable fast re-runs
**How:** JSON file on disk, loaded at startup, saved after each answer
**Benefit:** Repeated questions cost nothing: a cache hit makes zero LLM calls
### 2. **File Path Resolution**
**Why:** Files can be in current directory or downloads/
**How:** `find_file()` helper checks both locations
**Benefit:** Agent doesn't need to know exact file location
### 3. **Gemini for Multimedia**
**Why:** GPT-4o doesn't support direct video/audio analysis
**How:** Upload files to Gemini API, get analysis
**Benefit:** Can handle YouTube videos, MP3 files, images
### 4. **Answer Cleaning Pipeline**
**Why:** LLMs often return verbose explanations or JSON
**How:** Multi-stage cleaning (JSON removal, pattern matching, validation)
**Benefit:** Clean, concise answers that match expected format
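An illustrative sketch of such a pipeline; the exact stages and patterns in `agent/basic_agent.py` may differ:
```python
import json
import re

def clean_answer(raw: str) -> str:
    text = raw.strip()
    # Stage 1: drop fenced code blocks.
    text = re.sub(r"`{3}.*?`{3}", "", text, flags=re.DOTALL).strip()
    # Stage 2: unwrap a JSON payload such as {"answer": "4"}.
    try:
        data = json.loads(text)
        if isinstance(data, dict) and "answer" in data:
            text = str(data["answer"])
    except json.JSONDecodeError:
        pass
    # Stage 3: strip lead-ins like "The final answer is: ..." (illustrative pattern).
    text = re.sub(r"^\s*(?:the\s+)?(?:final\s+)?answer(?:\s+is)?\s*[:\-]?\s*",
                  "", text, flags=re.IGNORECASE)
    # Stage 4: validation — never return an empty answer.
    return text.strip() or raw.strip()
```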
### 5. **Dual LLM Strategy**
**Why:** Reliability and cost optimization
**How:** Primary (GPT-4o) with fallback (GPT-4o-mini)
**Benefit:** Continues working if primary fails
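One way to express that pairing is LangChain's `with_fallbacks()`; whether `agent/graph.py` does exactly this is an assumption, and `tools` is the suite from `agent/tools.py`:
```python
from langchain_openai import ChatOpenAI

primary = ChatOpenAI(model="gpt-4o", temperature=0.0, max_retries=5)
fallback = ChatOpenAI(model="gpt-4o-mini", temperature=0.0, max_retries=5)

# Both models get the same tool suite; if the primary call errors out
# (e.g. persistent rate limiting), the fallback chain is invoked instead.
llm_with_tools = primary.bind_tools(tools).with_fallbacks(
    [fallback.bind_tools(tools)]
)
```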
### 6. **Tool-First Architecture**
**Why:** Many questions require external data
**How:** A suite of 15 specialized tools (listed above)
**Benefit:** Can handle diverse question types
## Configuration
### Environment Variables (.env)
```bash
OPENAI_API_KEY=sk-... # Required for GPT-4o
GEMINI_API_KEY=... # Required for video/audio/image analysis
TAVILY_API_KEY=... # Optional for advanced search
HF_TOKEN=... # For HuggingFace API access
```
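These keys are typically loaded at startup with `python-dotenv`; variable names match the `.env` block above:
```python
import os
from dotenv import load_dotenv

load_dotenv()                                # reads .env from the working directory
openai_key = os.environ["OPENAI_API_KEY"]    # required; raises KeyError if absent
gemini_key = os.getenv("GEMINI_API_KEY")     # required for video/audio/image tools
tavily_key = os.getenv("TAVILY_API_KEY")     # optional; enables tavily_search
```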
### Configurable Parameters
**In BasicAgent:**
- `log_to_file` - Enable/disable logging (default: True)
- `use_cache` - Enable/disable caching (default: True)
- `cache_file` - Cache file path (default: "agent_cache.json")
**In LangGraph:**
- `recursion_limit` - Max iterations (default: 50)
- `temperature` - LLM temperature (default: 0.0)
- `max_retries` - Rate limit retries (default: 5)
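A hypothetical wiring of these parameters; the real constructor signature lives in `agent/basic_agent.py`:
```python
from agent.basic_agent import BasicAgent

agent = BasicAgent(
    log_to_file=True,               # timestamped agent_run_*.log
    use_cache=True,                 # consult agent_cache.json before the LLM
    cache_file="agent_cache.json",
)
# The LangGraph parameters are passed per invocation, e.g.:
# workflow.invoke(state, config={"recursion_limit": 50})
```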
## File Structure
```
Final_Assignment_Template/
├── app.py                 # Main application
├── agent/
│   ├── __init__.py
│   ├── basic_agent.py     # Agent wrapper with caching
│   ├── graph.py           # LangGraph workflow
│   └── tools.py           # Tool implementations
├── downloads/             # Downloaded files (gitignored)
│   ├── file1.xlsx
│   ├── audio.mp3
│   └── image.png
├── agent_cache.json       # Persistent cache (gitignored)
├── agent_run_*.log        # Log files (gitignored)
├── requirements.txt       # Python dependencies
├── .env                   # Environment variables (gitignored)
├── .gitignore
├── ARCHITECTURE.md        # This file
└── README.md              # User documentation
```
## Performance Characteristics
### Typical Question Processing Time
- **Simple (cached):** < 0.1 seconds
- **Simple (web search):** 2-5 seconds
- **Medium (file analysis):** 5-15 seconds
- **Complex (multi-step):** 15-60 seconds
- **Multimedia (video/audio):** 30-120 seconds
### LLM Token Usage (per question)
- **Simple:** 500-2,000 tokens
- **Medium:** 2,000-8,000 tokens
- **Complex:** 8,000-20,000 tokens
### Cost Estimates (GPT-4o)
- **Per question (avg):** $0.01-0.05
- **20 questions:** $0.20-1.00
- **With caching (re-runs):** $0.00
## Error Handling
### Graceful Degradation
1. **Cache file corrupted:** Start with empty cache
2. **File download fails:** Continue without file, agent handles gracefully
3. **Tool execution fails:** Return error message, agent tries alternative
4. **LLM rate limit:** Exponential backoff, retry up to 5 times (see the sketch after this list)
5. **Primary LLM fails:** Fallback to GPT-4o-mini
6. **Recursion limit hit:** Return best answer so far
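For item 4, LangChain's `max_retries` handles the backoff internally; a hand-rolled equivalent would look roughly like this sketch:
```python
import time
from openai import RateLimitError

def with_backoff(call, max_retries: int = 5):
    """Retry a rate-limited call with exponentially growing waits."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise                     # out of retries: surface the error
            time.sleep(2 ** attempt)      # waits of 1s, 2s, 4s, 8s
```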
### Validation
- All answers validated (never empty)
- File paths validated before access
- API responses validated before processing
- Tool parameters validated before execution
## Testing & Development
### Local Testing
```bash
# Run full evaluation
python app.py
# Check logs
tail -f agent_run_*.log
# View cache
cat agent_cache.json
# Clear cache for fresh run
rm agent_cache.json
```
### Debugging
- All tool calls logged with arguments and results
- Agent reasoning logged at each step
- Errors logged with full stack traces
- Cache hits/misses logged
## Future Enhancements
### Potential Improvements
1. **Pattern-based answering** - Skip LLM for simple questions
2. **Parallel tool execution** - Run independent tools simultaneously
3. **Smarter caching** - Fuzzy matching for similar questions
4. **Cost tracking** - Log token usage and costs
5. **A/B testing** - Compare different prompts/strategies
6. **Streaming responses** - Show progress in real-time
### Scalability Considerations
- Cache can grow large (consider size limits or TTL)
- Multiple concurrent runs need separate cache files
- Rate limiting may need adjustment for production
- Consider database instead of JSON for large-scale caching
## Dependencies
### Core
- `langchain` - LLM framework
- `langgraph` - Workflow orchestration
- `langchain-openai` - OpenAI integration
- `langchain-community` - Community tools
### Tools
- `google-generativeai` - Gemini API
- `tavily-python` - Advanced search
- `duckduckgo-search` - Web search
- `youtube-transcript-api` - YouTube transcripts
- `pandas` - Data analysis
- `openpyxl` - Excel files
### Utilities
- `requests` - HTTP requests
- `python-dotenv` - Environment variables
- `gradio` - Web UI (optional)
## Security Considerations
### API Keys
- Stored in `.env` file (gitignored)
- Never hardcoded in source
- Loaded via `python-dotenv`
### Code Execution
- `python_repl` uses an AST-based REPL (safer than `eval`)
- `execute_python_file` runs scripts in a subprocess with a timeout (sketched below)
- No shell injection vulnerabilities
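A sketch of that subprocess pattern; the exact timeout value and return handling in `agent/tools.py` are assumptions:
```python
import subprocess
import sys

def execute_python_file(path: str, timeout: int = 30) -> str:
    """Run a script in a child interpreter; no shell means no injection."""
    result = subprocess.run(
        [sys.executable, path],          # argv list, never shell=True
        capture_output=True,
        text=True,
        timeout=timeout,                 # kills runaway scripts
    )
    return result.stdout or result.stderr
```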
### File Access
- All file operations use Path validation
- No arbitrary file system access
- Downloads isolated to `downloads/` directory
## Monitoring & Observability
### Logs
- Timestamped log files for each run
- Structured logging with emoji markers for easy scanning
- Tool calls logged with full context
- Errors logged with stack traces
### Metrics (available in logs)
- Questions processed
- Cache hit rate
- Tool usage frequency
- LLM calls per question
- Execution time per question
- Error rate
---
**Version:** 1.0
**Last Updated:** 2025-09-30
**Author:** Leon Woo
**License:** MIT