# Agent Architecture Documentation ## Overview This is a LangGraph-based AI agent designed for the GAIA (General AI Assistants) benchmark evaluation. The agent uses GPT-4o/GPT-4o-mini with tool-calling capabilities to answer complex multi-step questions involving web search, file analysis, multimedia processing, and reasoning. ## System Architecture ``` ┌─────────────────────────────────────────────────────────────────┐ │ User Request │ │ (20 GAIA Questions) │ └────────────────────────────┬────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────────┐ │ app.py │ │ • Fetches questions from API │ │ • Downloads attached files (Excel, MP3, images, Python) │ │ • Saves files to downloads/ directory │ │ • Calls BasicAgent for each question │ │ • Submits answers to evaluation API │ └────────────────────────────┬────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────────┐ │ BasicAgent │ │ (agent/basic_agent.py) │ │ │ │ 1. Check Cache (agent_cache.json) │ │ └─ If cached: Return answer instantly ✅ │ │ │ │ 2. If not cached: │ │ └─ Invoke LangGraph workflow │ │ │ │ 3. Clean & validate answer │ │ └─ Remove JSON, code blocks, explanations │ │ │ │ 4. Cache answer to disk │ │ └─ Save to agent_cache.json for future use │ └────────────────────────────┬────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────────┐ │ LangGraph Workflow │ │ (agent/graph.py) │ │ │ │ ┌──────────────┐ │ │ │ Agent Node │ ← Decides next action │ │ │ (GPT-4o) │ • Analyze question │ │ └──────┬───────┘ • Choose tool(s) │ │ │ • Generate response │ │ ↓ │ │ ┌──────────────┐ │ │ │ Tools Node │ ← Executes tools │ │ │ │ • Search, calculate, read files │ │ └──────┬───────┘ • Returns results │ │ │ │ │ ↓ │ │ ┌──────────────┐ │ │ │ Agent Node │ ← Processes results │ │ │ (GPT-4o) │ • Analyzes tool output │ │ └──────┬───────┘ • Decides: more tools or final answer? │ │ │ │ │ └─────────→ Loop (max 50 iterations) │ │ │ │ Final Answer → Return to BasicAgent │ └─────────────────────────────────────────────────────────────────┘ ``` ## Core Components ### 1. **app.py** - Main Application **Responsibilities:** - Fetch questions from evaluation API - Download attached files from `/files/{task_id}` endpoint - Orchestrate agent execution for all questions - Submit answers to evaluation API - Display results **Key Functions:** - `run_and_submit_all()` - Main evaluation loop - File download with error handling - Results aggregation and submission ### 2. **agent/basic_agent.py** - Agent Wrapper **Responsibilities:** - Manage agent lifecycle - Implement caching system (persistent to disk) - Clean and validate answers - Logging to file **Key Features:** - **Persistent Caching:** Saves answers to `agent_cache.json` - **Answer Cleaning:** Removes JSON, code blocks, explanations - **Validation:** Ensures no empty answers submitted - **Logging:** All output saved to timestamped log files **Cache System:** ```python { "question_text": "answer", "How many albums...": "4", "What is 2+2?": "4" } ``` ### 3. **agent/graph.py** - LangGraph Workflow **Responsibilities:** - Define agent workflow (nodes and edges) - Initialize LLM chains (primary + fallback) - Initialize and manage tools - Route between agent and tools nodes **Workflow Structure:** ``` START → Agent Node → [Tools Node] → Agent Node → END ↑_______________| (loop until answer found or max iterations) ``` **Key Components:** - `agent_node()` - LLM decision making - `tool_node()` - Tool execution - `should_continue()` - Routing logic - System prompt with detailed instructions **LLM Configuration:** - **Primary:** GPT-4o (with tools) - **Fallback:** GPT-4o-mini (with tools) - **Recursion Limit:** 50 iterations - **Rate Limiting:** Exponential backoff (5 retries) ### 4. **agent/tools.py** - Tool Implementations **Responsibilities:** - Implement all tools available to the agent - Handle file path resolution (current dir + downloads/) - Integrate with external APIs (Gemini, search engines) **Available Tools:** #### Search & Research (5 tools) - `duckduckgo_search` - Web search - `tavily_search` - Advanced web search - `wikipedia` - Wikipedia lookup - `youtube_transcript` - Get YouTube transcripts - `arxiv_search` - Academic paper search #### File Operations (5 tools) - `list_files` - List files in current/downloads directory - `read_file` - Read text files - `read_excel` - Read and analyze Excel files - `download_file` - Download files from URLs - `execute_python_file` - Run Python scripts #### Multimedia Analysis (3 tools - Gemini-powered) - `understand_video` - Analyze YouTube videos - `understand_audio` - Transcribe and analyze MP3/audio - `analyze_image` - Analyze images (chess, diagrams, text) #### Computation (2 tools) - `calculator` - Safe math evaluation - `python_repl` - Execute Python code **File Path Resolution:** All file tools use `find_file()` helper that checks: 1. Current directory 2. `downloads/` directory 3. Returns best match or downloads path ## Data Flow ### Question Processing Flow ``` 1. API Request └─ GET /questions └─ Returns: [{task_id, question, Level, file_name}, ...] 2. File Download (if file_name exists) └─ GET /files/{task_id} └─ Save to: downloads/{file_name} 3. Agent Invocation ├─ Check cache │ └─ If hit: Return cached answer (0 LLM calls) │ └─ If miss: ├─ Create initial state with question ├─ Invoke LangGraph workflow │ ├─ Agent decides action │ ├─ Execute tools │ ├─ Agent processes results │ └─ Loop until answer or max iterations │ ├─ Extract answer from final message ├─ Clean answer (remove JSON, explanations) ├─ Validate answer (ensure not empty) └─ Cache to disk 4. Answer Submission └─ POST /submit └─ Body: {username, answers: [{task_id, submitted_answer}]} ``` ## Tool Execution Flow ``` Agent Node (GPT-4o) ↓ Decides: "I need to use list_files tool" ↓ Tool Node ├─ Finds tool by name ├─ Validates parameters ├─ Executes tool._run() │ └─ Example: list_files() │ ├─ Check current directory │ ├─ Check downloads/ directory │ └─ Return: "Files found:\n./app.py\ndownloads/data.xlsx" └─ Returns ToolMessage with result ↓ Agent Node (GPT-4o) ├─ Receives tool output ├─ Analyzes results └─ Decides: Use another tool OR provide final answer ``` ## Key Design Decisions ### 1. **Persistent Caching** **Why:** Reduce costs and enable fast re-runs **How:** JSON file on disk, loaded at startup, saved after each answer **Benefit:** 100% cost savings on repeated questions ### 2. **File Path Resolution** **Why:** Files can be in current directory or downloads/ **How:** `find_file()` helper checks both locations **Benefit:** Agent doesn't need to know exact file location ### 3. **Gemini for Multimedia** **Why:** GPT-4o doesn't support direct video/audio analysis **How:** Upload files to Gemini API, get analysis **Benefit:** Can handle YouTube videos, MP3 files, images ### 4. **Answer Cleaning Pipeline** **Why:** LLMs often return verbose explanations or JSON **How:** Multi-stage cleaning (JSON removal, pattern matching, validation) **Benefit:** Clean, concise answers that match expected format ### 5. **Dual LLM Strategy** **Why:** Reliability and cost optimization **How:** Primary (GPT-4o) with fallback (GPT-4o-mini) **Benefit:** Continues working if primary fails ### 6. **Tool-First Architecture** **Why:** Many questions require external data **How:** Rich tool suite with 15+ specialized tools **Benefit:** Can handle diverse question types ## Configuration ### Environment Variables (.env) ```bash OPENAI_API_KEY=sk-... # Required for GPT-4o GEMINI_API_KEY=... # Required for video/audio/image analysis TAVILY_API_KEY=... # Optional for advanced search HF_TOKEN=... # For HuggingFace API access ``` ### Configurable Parameters **In BasicAgent:** - `log_to_file` - Enable/disable logging (default: True) - `use_cache` - Enable/disable caching (default: True) - `cache_file` - Cache file path (default: "agent_cache.json") **In LangGraph:** - `recursion_limit` - Max iterations (default: 50) - `temperature` - LLM temperature (default: 0.0) - `max_retries` - Rate limit retries (default: 5) ## File Structure ``` Final_Assignment_Template/ ├── app.py # Main application ├── agent/ │ ├── __init__.py │ ├── basic_agent.py # Agent wrapper with caching │ ├── graph.py # LangGraph workflow │ └── tools.py # Tool implementations ├── downloads/ # Downloaded files (gitignored) │ ├── file1.xlsx │ ├── audio.mp3 │ └── image.png ├── agent_cache.json # Persistent cache (gitignored) ├── agent_run_*.log # Log files (gitignored) ├── requirements.txt # Python dependencies ├── .env # Environment variables (gitignored) ├── .gitignore ├── ARCHITECTURE.md # This file └── README.md # User documentation ``` ## Performance Characteristics ### Typical Question Processing Time - **Simple (cached):** < 0.1 seconds - **Simple (web search):** 2-5 seconds - **Medium (file analysis):** 5-15 seconds - **Complex (multi-step):** 15-60 seconds - **Multimedia (video/audio):** 30-120 seconds ### LLM Token Usage (per question) - **Simple:** 500-2,000 tokens - **Medium:** 2,000-8,000 tokens - **Complex:** 8,000-20,000 tokens ### Cost Estimates (GPT-4o) - **Per question (avg):** $0.01-0.05 - **20 questions:** $0.20-1.00 - **With caching (re-runs):** $0.00 ## Error Handling ### Graceful Degradation 1. **Cache file corrupted:** Start with empty cache 2. **File download fails:** Continue without file, agent handles gracefully 3. **Tool execution fails:** Return error message, agent tries alternative 4. **LLM rate limit:** Exponential backoff, retry up to 5 times 5. **Primary LLM fails:** Fallback to GPT-4o-mini 6. **Recursion limit hit:** Return best answer so far ### Validation - All answers validated (never empty) - File paths validated before access - API responses validated before processing - Tool parameters validated before execution ## Testing & Development ### Local Testing ```bash # Run full evaluation python app.py # Check logs tail -f agent_run_*.log # View cache cat agent_cache.json # Clear cache for fresh run rm agent_cache.json ``` ### Debugging - All tool calls logged with arguments and results - Agent reasoning logged at each step - Errors logged with full stack traces - Cache hits/misses logged ## Future Enhancements ### Potential Improvements 1. **Pattern-based answering** - Skip LLM for simple questions 2. **Parallel tool execution** - Run independent tools simultaneously 3. **Smarter caching** - Fuzzy matching for similar questions 4. **Cost tracking** - Log token usage and costs 5. **A/B testing** - Compare different prompts/strategies 6. **Streaming responses** - Show progress in real-time ### Scalability Considerations - Cache can grow large (consider size limits or TTL) - Multiple concurrent runs need separate cache files - Rate limiting may need adjustment for production - Consider database instead of JSON for large-scale caching ## Dependencies ### Core - `langchain` - LLM framework - `langgraph` - Workflow orchestration - `langchain-openai` - OpenAI integration - `langchain-community` - Community tools ### Tools - `google-generativeai` - Gemini API - `tavily-python` - Advanced search - `duckduckgo-search` - Web search - `youtube-transcript-api` - YouTube transcripts - `pandas` - Data analysis - `openpyxl` - Excel files ### Utilities - `requests` - HTTP requests - `python-dotenv` - Environment variables - `gradio` - Web UI (optional) ## Security Considerations ### API Keys - Stored in `.env` file (gitignored) - Never hardcoded in source - Loaded via `python-dotenv` ### Code Execution - `python_repl` uses AST-based REPL (safer than eval) - `execute_python_file` runs in subprocess with timeout - No shell injection vulnerabilities ### File Access - All file operations use Path validation - No arbitrary file system access - Downloads isolated to `downloads/` directory ## Monitoring & Observability ### Logs - Timestamped log files for each run - Structured logging with emojis for easy parsing - Tool calls logged with full context - Errors logged with stack traces ### Metrics (available in logs) - Questions processed - Cache hit rate - Tool usage frequency - LLM calls per question - Execution time per question - Error rate --- **Version:** 1.0 **Last Updated:** 2025-09-30 **Author:** Leon Woo **License:** MIT