| # Agent Architecture Documentation | |
| ## Overview | |
| This is a LangGraph-based AI agent built for evaluation on the GAIA (General AI Assistants) benchmark. The agent uses GPT-4o, with GPT-4o-mini as a fallback, and tool calling to answer complex multi-step questions that involve web search, file analysis, multimedia processing, and reasoning. | |
| ## System Architecture | |
| ``` | |
| ┌─────────────────────────────────────────────────────────────────┐ | |
| │                          User Request                           │ | |
| │                       (20 GAIA Questions)                       │ | |
| └────────────────────────────┬────────────────────────────────────┘ | |
|                              ↓ | |
| ┌─────────────────────────────────────────────────────────────────┐ | |
| │                             app.py                              │ | |
| │  • Fetches questions from API                                   │ | |
| │  • Downloads attached files (Excel, MP3, images, Python)        │ | |
| │  • Saves files to downloads/ directory                          │ | |
| │  • Calls BasicAgent for each question                           │ | |
| │  • Submits answers to evaluation API                            │ | |
| └────────────────────────────┬────────────────────────────────────┘ | |
|                              ↓ | |
| ┌─────────────────────────────────────────────────────────────────┐ | |
| │                           BasicAgent                            │ | |
| │                     (agent/basic_agent.py)                      │ | |
| │                                                                 │ | |
| │  1. Check Cache (agent_cache.json)                              │ | |
| │     └─ If cached: Return answer instantly                       │ | |
| │                                                                 │ | |
| │  2. If not cached:                                              │ | |
| │     └─ Invoke LangGraph workflow                                │ | |
| │                                                                 │ | |
| │  3. Clean & validate answer                                     │ | |
| │     └─ Remove JSON, code blocks, explanations                   │ | |
| │                                                                 │ | |
| │  4. Cache answer to disk                                        │ | |
| │     └─ Save to agent_cache.json for future use                  │ | |
| └────────────────────────────┬────────────────────────────────────┘ | |
|                              ↓ | |
| ┌─────────────────────────────────────────────────────────────────┐ | |
| │                       LangGraph Workflow                        │ | |
| │                        (agent/graph.py)                         │ | |
| │                                                                 │ | |
| │  ┌──────────────┐                                               │ | |
| │  │  Agent Node  │ ← Decides next action                         │ | |
| │  │   (GPT-4o)   │   • Analyze question                          │ | |
| │  └──────┬───────┘   • Choose tool(s)                            │ | |
| │         │           • Generate response                         │ | |
| │         ↓                                                       │ | |
| │  ┌──────────────┐                                               │ | |
| │  │  Tools Node  │ ← Executes tools                              │ | |
| │  │              │   • Search, calculate, read files             │ | |
| │  └──────┬───────┘   • Returns results                           │ | |
| │         │                                                       │ | |
| │         ↓                                                       │ | |
| │  ┌──────────────┐                                               │ | |
| │  │  Agent Node  │ ← Processes results                           │ | |
| │  │   (GPT-4o)   │   • Analyzes tool output                      │ | |
| │  └──────┬───────┘   • Decides: more tools or final answer?      │ | |
| │         │                                                       │ | |
| │         └────────── Loop (max 50 iterations)                    │ | |
| │                                                                 │ | |
| │  Final Answer → Return to BasicAgent                            │ | |
| └─────────────────────────────────────────────────────────────────┘ | |
| ``` | |
| ## Core Components | |
| ### 1. **app.py** - Main Application | |
| **Responsibilities:** | |
| - Fetch questions from evaluation API | |
| - Download attached files from `/files/{task_id}` endpoint | |
| - Orchestrate agent execution for all questions | |
| - Submit answers to evaluation API | |
| - Display results | |
| **Key Functions:** | |
| - `run_and_submit_all()` - Main evaluation loop | |
| - File download with error handling | |
| - Results aggregation and submission | |
| ### 2. **agent/basic_agent.py** - Agent Wrapper | |
| **Responsibilities:** | |
| - Manage agent lifecycle | |
| - Implement caching system (persistent to disk) | |
| - Clean and validate answers | |
| - Log all output to file | |
| **Key Features:** | |
| - **Persistent Caching:** Saves answers to `agent_cache.json` | |
| - **Answer Cleaning:** Removes JSON, code blocks, explanations | |
| - **Validation:** Ensures no empty answers submitted | |
| - **Logging:** All output saved to timestamped log files | |
| **Cache System:** | |
| ```json | |
| { | |
| "question_text": "answer", | |
| "How many albums...": "4", | |
| "What is 2+2?": "4" | |
| } | |
| ``` | |
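The load, lookup, and save cycle behind this cache could look like the minimal sketch below. The `AnswerCache` class and its method names are hypothetical and are not taken from `agent/basic_agent.py`; only the `agent_cache.json` default and the question-to-answer mapping come from the description above.

```python
import json
from pathlib import Path
from typing import Optional


class AnswerCache:
    """Hypothetical question -> answer cache stored as a single JSON object."""

    def __init__(self, cache_file: str = "agent_cache.json") -> None:
        self.path = Path(cache_file)
        self.data = self._load()

    def _load(self) -> dict:
        # A missing or corrupted file falls back to an empty cache.
        try:
            loaded = json.loads(self.path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            return {}
        return loaded if isinstance(loaded, dict) else {}

    def get(self, question: str) -> Optional[str]:
        # The full question text is the key, so only exact repeats hit.
        return self.data.get(question)

    def put(self, question: str, answer: str) -> None:
        # Write after every answer so an interrupted run keeps its progress.
        self.data[question] = answer
        self.path.write_text(
            json.dumps(self.data, indent=2, ensure_ascii=False), encoding="utf-8"
        )
```

Keying on the exact question text keeps the lookup trivial, at the price of missing questions that differ only in whitespace or punctuation.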
| ### 3. **agent/graph.py** - LangGraph Workflow | |
| **Responsibilities:** | |
| - Define agent workflow (nodes and edges) | |
| - Initialize LLM chains (primary + fallback) | |
| - Initialize and manage tools | |
| - Route between agent and tools nodes | |
| **Workflow Structure:** | |
| ``` | |
| START → Agent Node → [Tools Node] → Agent Node → END | |
|           ↑_______________| | |
| (loop until answer found or max iterations) | |
| ``` | |
| **Key Components:** | |
| - `agent_node()` - LLM decision making | |
| - `tool_node()` - Tool execution | |
| - `should_continue()` - Routing logic | |
| - System prompt with detailed instructions | |
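The routing decision made by `should_continue()` can be illustrated without any framework code. The sketch below is an assumption about its shape, not the contents of `agent/graph.py`: the `Message` stand-in, the `"messages"` state key, and the `"tools"`/`"end"` labels are all hypothetical. In LangGraph, a function like this is typically registered as a conditional edge that maps each returned label to the next node.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Stand-in for an LLM message; tool_calls lists the tools the model wants run."""
    content: str
    tool_calls: list = field(default_factory=list)


def should_continue(state: dict) -> str:
    """Route to the tools node if the last message requests tools, else finish."""
    last_message = state["messages"][-1]
    return "tools" if last_message.tool_calls else "end"
```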
| **LLM Configuration:** | |
| - **Primary:** GPT-4o (with tools) | |
| - **Fallback:** GPT-4o-mini (with tools) | |
| - **Recursion Limit:** 50 iterations | |
| - **Rate Limiting:** Exponential backoff (5 retries) | |
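Exponential backoff with five retries might be implemented along these lines. This is a sketch under stated assumptions: `RateLimitError` is a placeholder for whatever exception the LLM client raises, and `call_with_backoff`, `base_delay`, and the small random jitter are illustrative choices rather than the project's actual settings.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for the rate-limit exception raised by the LLM client."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # Retries exhausted: surface the error to the caller.
            # Delays grow 1s, 2s, 4s, ...; jitter spreads out concurrent retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Passing `sleep` as a parameter makes the function easy to test without real waiting.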
| ### 4. **agent/tools.py** - Tool Implementations | |
| **Responsibilities:** | |
| - Implement all tools available to the agent | |
| - Handle file path resolution (current dir + downloads/) | |
| - Integrate with external APIs (Gemini, search engines) | |
| **Available Tools:** | |
| #### Search & Research (5 tools) | |
| - `duckduckgo_search` - Web search | |
| - `tavily_search` - Advanced web search | |
| - `wikipedia` - Wikipedia lookup | |
| - `youtube_transcript` - Get YouTube transcripts | |
| - `arxiv_search` - Academic paper search | |
| #### File Operations (5 tools) | |
| - `list_files` - List files in current/downloads directory | |
| - `read_file` - Read text files | |
| - `read_excel` - Read and analyze Excel files | |
| - `download_file` - Download files from URLs | |
| - `execute_python_file` - Run Python scripts | |
| #### Multimedia Analysis (3 tools - Gemini-powered) | |
| - `understand_video` - Analyze YouTube videos | |
| - `understand_audio` - Transcribe and analyze MP3/audio | |
| - `analyze_image` - Analyze images (chess, diagrams, text) | |
| #### Computation (2 tools) | |
| - `calculator` - Safe math evaluation | |
| - `python_repl` - Execute Python code | |
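One common way to get "safe math evaluation" is to parse the expression with `ast` and allow only numeric literals and arithmetic operators. The sketch below shows that technique; it is not necessarily how the `calculator` tool in `agent/tools.py` works, and `safe_eval` is a hypothetical name.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod, ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str):
    """Evaluate arithmetic on numeric literals; other syntax raises ValueError."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are rejected outright.
        raise ValueError(f"Unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))
```

Because function calls and names never reach an evaluator, input such as `__import__('os')` is rejected. A production version would also want to cap exponent sizes, since `**` can still produce very large numbers.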
| **File Path Resolution:** | |
| All file tools use the `find_file()` helper, which: | |
| 1. Checks the current directory | |
| 2. Checks the `downloads/` directory | |
| 3. Returns the best match, or the `downloads/` path if nothing matches | |
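The lookup order above can be sketched as follows. The signature is an assumption (the real `find_file()` may take different arguments or match more loosely); the `base_dir` and `downloads_dir` parameters exist here only to make the example testable.

```python
from pathlib import Path


def find_file(name: str, base_dir: str = ".", downloads_dir: str = "downloads") -> Path:
    """Look in base_dir first, then in base_dir/downloads."""
    base = Path(base_dir)
    candidates = [base / name, base / downloads_dir / Path(name).name]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # Nothing matched: return the downloads path so the caller's error names it.
    return candidates[-1]
```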
| ## Data Flow | |
| ### Question Processing Flow | |
| ``` | |
| 1. API Request | |
|    ├─ GET /questions | |
|    └─ Returns: [{task_id, question, Level, file_name}, ...] | |
| 2. File Download (if file_name exists) | |
|    ├─ GET /files/{task_id} | |
|    └─ Save to: downloads/{file_name} | |
| 3. Agent Invocation | |
|    ├─ Check cache | |
|    │  └─ If hit: Return cached answer (0 LLM calls) | |
|    │ | |
|    └─ If miss: | |
|       ├─ Create initial state with question | |
|       ├─ Invoke LangGraph workflow | |
|       │  ├─ Agent decides action | |
|       │  ├─ Execute tools | |
|       │  ├─ Agent processes results | |
|       │  └─ Loop until answer or max iterations | |
|       │ | |
|       ├─ Extract answer from final message | |
|       ├─ Clean answer (remove JSON, explanations) | |
|       ├─ Validate answer (ensure not empty) | |
|       └─ Cache to disk | |
| 4. Answer Submission | |
|    ├─ POST /submit | |
|    └─ Body: {username, answers: [{task_id, submitted_answer}]} | |
| ``` | |
| ## Tool Execution Flow | |
| ``` | |
| Agent Node (GPT-4o) | |
|     ↓ | |
| Decides: "I need to use list_files tool" | |
|     ↓ | |
| Tool Node | |
| ├─ Finds tool by name | |
| ├─ Validates parameters | |
| ├─ Executes tool._run() | |
| │  └─ Example: list_files() | |
| │     ├─ Check current directory | |
| │     ├─ Check downloads/ directory | |
| │     └─ Return: "Files found:\n./app.py\ndownloads/data.xlsx" | |
| └─ Returns ToolMessage with result | |
|     ↓ | |
| Agent Node (GPT-4o) | |
| ├─ Receives tool output | |
| ├─ Analyzes results | |
| └─ Decides: Use another tool OR provide final answer | |
| ``` | |
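The tool node's find, execute, and return steps can be sketched in plain Python. Everything here is illustrative: `run_tool_call`, the shape of the `tool_call` dictionary, and the returned message dictionary are assumptions standing in for the framework's own tool-call and `ToolMessage` types.

```python
def run_tool_call(tool_call: dict, tools: dict) -> dict:
    """Look up a tool by name, run it, and wrap the outcome as a tool message."""
    name = tool_call["name"]
    tool = tools.get(name)
    if tool is None:
        content = f"Error: unknown tool {name!r}"
    else:
        try:
            content = str(tool(**tool_call.get("args", {})))
        except Exception as exc:
            # Report the failure to the agent so it can try an alternative.
            content = f"Error running {name}: {exc}"
    return {"role": "tool", "tool_call_id": tool_call.get("id"), "content": content}
```

Returning errors as message content, rather than raising, is what lets the agent node see the failure and choose a different tool on the next step.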
| ## Key Design Decisions | |
| ### 1. **Persistent Caching** | |
| **Why:** Reduce costs and enable fast re-runs | |
| **How:** JSON file on disk, loaded at startup, saved after each answer | |
| **Benefit:** Repeated questions are answered from the cache with no LLM calls, so re-runs incur no LLM cost | |
| ### 2. **File Path Resolution** | |
| **Why:** Files can be in current directory or downloads/ | |
| **How:** `find_file()` helper checks both locations | |
| **Benefit:** Agent doesn't need to know exact file location | |
| ### 3. **Gemini for Multimedia** | |
| **Why:** The GPT-4o chat models used here do not accept video or audio files as direct input | |
| **How:** Upload files to Gemini API, get analysis | |
| **Benefit:** Can handle YouTube videos, MP3 files, images | |
| ### 4. **Answer Cleaning Pipeline** | |
| **Why:** LLMs often return verbose explanations or JSON | |
| **How:** Multi-stage cleaning (JSON removal, pattern matching, validation) | |
| **Benefit:** Clean, concise answers that match expected format | |
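A multi-stage cleaner of this kind might look like the sketch below. The stages follow the description above, but the details are assumptions: the `"answer"` JSON key, the `Final answer:` marker, and the fallback string are hypothetical and may not match what `BasicAgent` actually checks.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence delimiter, built up to keep this block readable.


def clean_answer(raw: str, fallback: str = "No answer found") -> str:
    """Reduce a verbose LLM reply to a short answer string."""
    text = raw.strip()
    # Stage 1: unwrap a JSON object such as {"answer": "4"}.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict) and "answer" in parsed:
            text = str(parsed["answer"])
    except json.JSONDecodeError:
        pass
    # Stage 2: drop fenced code blocks.
    text = re.sub(FENCE + r".*?" + FENCE, "", text, flags=re.DOTALL)
    # Stage 3: keep only what follows a "Final answer:" marker, if present.
    match = re.search(r"final answer\s*:\s*(.+)", text, flags=re.IGNORECASE | re.DOTALL)
    if match:
        text = match.group(1)
    text = text.strip().strip('"')
    # Stage 4: validation, never return an empty answer.
    return text or fallback
```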
| ### 5. **Dual LLM Strategy** | |
| **Why:** Reliability and cost optimization | |
| **How:** Primary (GPT-4o) with fallback (GPT-4o-mini) | |
| **Benefit:** Continues working if primary fails | |
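The primary-with-fallback pattern reduces to a small wrapper. In this sketch `primary` and `fallback` are any callables that take the messages and return a reply; `invoke_with_fallback` and the logger name are hypothetical, and the real code may rely on the LLM framework's own fallback mechanism instead.

```python
import logging

logger = logging.getLogger("agent")


def invoke_with_fallback(primary, fallback, messages):
    """Try the primary model; if it raises, log the error and use the fallback."""
    try:
        return primary(messages)
    except Exception as exc:
        logger.warning("Primary model failed (%s); switching to fallback", exc)
        return fallback(messages)
```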
| ### 6. **Tool-First Architecture** | |
| **Why:** Many questions require external data | |
| **How:** A suite of 15 specialized tools spanning search, file operations, multimedia analysis, and computation | |
| **Benefit:** Can handle diverse question types | |
| ## Configuration | |
| ### Environment Variables (.env) | |
| ```bash | |
| OPENAI_API_KEY=sk-... # Required for GPT-4o | |
| GEMINI_API_KEY=... # Required for video/audio/image analysis | |
| TAVILY_API_KEY=... # Optional for advanced search | |
| HF_TOKEN=... # For HuggingFace API access | |
| ``` | |
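A startup check can catch missing keys before any question is processed. The sketch below assumes the required/optional split given in the comments above; `check_env` is a hypothetical helper, and `HF_TOKEN` is left out because the document does not say whether it is required.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")
OPTIONAL_KEYS = ("TAVILY_API_KEY",)


def check_env(env=None):
    """Return (missing_required, missing_optional) lists of variable names."""
    env = os.environ if env is None else env
    missing_required = [key for key in REQUIRED_KEYS if not env.get(key)]
    missing_optional = [key for key in OPTIONAL_KEYS if not env.get(key)]
    return missing_required, missing_optional
```

A caller could abort when the first list is non-empty and only log a warning for the second.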
| ### Configurable Parameters | |
| **In BasicAgent:** | |
| - `log_to_file` - Enable/disable logging (default: True) | |
| - `use_cache` - Enable/disable caching (default: True) | |
| - `cache_file` - Cache file path (default: "agent_cache.json") | |
| **In LangGraph:** | |
| - `recursion_limit` - Max iterations (default: 50) | |
| - `temperature` - LLM temperature (default: 0.0) | |
| - `max_retries` - Rate limit retries (default: 5) | |
| ## File Structure | |
| ``` | |
| Final_Assignment_Template/ | |
| ├── app.py                  # Main application | |
| ├── agent/ | |
| │   ├── __init__.py | |
| │   ├── basic_agent.py      # Agent wrapper with caching | |
| │   ├── graph.py            # LangGraph workflow | |
| │   └── tools.py            # Tool implementations | |
| ├── downloads/              # Downloaded files (gitignored) | |
| │   ├── file1.xlsx | |
| │   ├── audio.mp3 | |
| │   └── image.png | |
| ├── agent_cache.json        # Persistent cache (gitignored) | |
| ├── agent_run_*.log         # Log files (gitignored) | |
| ├── requirements.txt        # Python dependencies | |
| ├── .env                    # Environment variables (gitignored) | |
| ├── .gitignore | |
| ├── ARCHITECTURE.md         # This file | |
| └── README.md               # User documentation | |
| ``` | |
| ## Performance Characteristics | |
| ### Typical Question Processing Time | |
| - **Simple (cached):** < 0.1 seconds | |
| - **Simple (web search):** 2-5 seconds | |
| - **Medium (file analysis):** 5-15 seconds | |
| - **Complex (multi-step):** 15-60 seconds | |
| - **Multimedia (video/audio):** 30-120 seconds | |
| ### LLM Token Usage (per question) | |
| - **Simple:** 500-2,000 tokens | |
| - **Medium:** 2,000-8,000 tokens | |
| - **Complex:** 8,000-20,000 tokens | |
| ### Cost Estimates (GPT-4o) | |
| - **Per question (avg):** $0.01-0.05 | |
| - **20 questions:** $0.20-1.00 | |
| - **With caching (re-runs):** $0.00 | |
| ## Error Handling | |
| ### Graceful Degradation | |
| 1. **Cache file corrupted:** Start with empty cache | |
| 2. **File download fails:** Continue without file, agent handles gracefully | |
| 3. **Tool execution fails:** Return error message, agent tries alternative | |
| 4. **LLM rate limit:** Exponential backoff, retry up to 5 times | |
| 5. **Primary LLM fails:** Fallback to GPT-4o-mini | |
| 6. **Recursion limit hit:** Return best answer so far | |
| ### Validation | |
| - All answers validated (never empty) | |
| - File paths validated before access | |
| - API responses validated before processing | |
| - Tool parameters validated before execution | |
| ## Testing & Development | |
| ### Local Testing | |
| ```bash | |
| # Run full evaluation | |
| python app.py | |
| # Check logs | |
| tail -f agent_run_*.log | |
| # View cache | |
| cat agent_cache.json | |
| # Clear cache for fresh run | |
| rm agent_cache.json | |
| ``` | |
| ### Debugging | |
| - All tool calls logged with arguments and results | |
| - Agent reasoning logged at each step | |
| - Errors logged with full stack traces | |
| - Cache hits/misses logged | |
| ## Future Enhancements | |
| ### Potential Improvements | |
| 1. **Pattern-based answering** - Skip LLM for simple questions | |
| 2. **Parallel tool execution** - Run independent tools simultaneously | |
| 3. **Smarter caching** - Fuzzy matching for similar questions | |
| 4. **Cost tracking** - Log token usage and costs | |
| 5. **A/B testing** - Compare different prompts/strategies | |
| 6. **Streaming responses** - Show progress in real-time | |
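As an illustration of item 3, fuzzy cache lookup could start from the standard library's `difflib`. This is only a sketch of one possible approach: `fuzzy_lookup` and the `0.9` cutoff are hypothetical, and a loose cutoff risks returning the answer to a different question.

```python
import difflib
from typing import Optional


def fuzzy_lookup(question: str, cache: dict, cutoff: float = 0.9) -> Optional[str]:
    """Return the cached answer for the closest stored question, if close enough."""
    matches = difflib.get_close_matches(question, list(cache), n=1, cutoff=cutoff)
    return cache[matches[0]] if matches else None
```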
| ### Scalability Considerations | |
| - Cache can grow large (consider size limits or TTL) | |
| - Multiple concurrent runs need separate cache files | |
| - Rate limiting may need adjustment for production | |
| - Consider database instead of JSON for large-scale caching | |
| ## Dependencies | |
| ### Core | |
| - `langchain` - LLM framework | |
| - `langgraph` - Workflow orchestration | |
| - `langchain-openai` - OpenAI integration | |
| - `langchain-community` - Community tools | |
| ### Tools | |
| - `google-generativeai` - Gemini API | |
| - `tavily-python` - Advanced search | |
| - `duckduckgo-search` - Web search | |
| - `youtube-transcript-api` - YouTube transcripts | |
| - `pandas` - Data analysis | |
| - `openpyxl` - Excel files | |
| ### Utilities | |
| - `requests` - HTTP requests | |
| - `python-dotenv` - Environment variables | |
| - `gradio` - Web UI (optional) | |
| ## Security Considerations | |
| ### API Keys | |
| - Stored in `.env` file (gitignored) | |
| - Never hardcoded in source | |
| - Loaded via `python-dotenv` | |
| ### Code Execution | |
| - `python_repl` uses an AST-based REPL, which parses code before running it; it still executes arbitrary Python, so it is not a sandbox | |
| - `execute_python_file` runs in subprocess with timeout | |
| - No shell injection vulnerabilities | |
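Running a script in a subprocess with a timeout, without going through a shell, can be sketched as follows. `run_python_file` and the 30-second default are assumptions for illustration; the actual `execute_python_file` tool may use different limits and output handling.

```python
import subprocess
import sys


def run_python_file(path: str, timeout: float = 30.0) -> str:
    """Run a Python script in a child process and return its output or an error."""
    try:
        # A list of arguments (no shell=True) means the path is never shell-parsed.
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"Error: script exceeded {timeout} seconds"
    if result.returncode != 0:
        return f"Error (exit code {result.returncode}): {result.stderr.strip()}"
    return result.stdout.strip()
```

The timeout bounds runtime only; the script still runs with the permissions of the parent process.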
| ### File Access | |
| - All file operations use Path validation | |
| - No arbitrary file system access | |
| - Downloads isolated to `downloads/` directory | |
| ## Monitoring & Observability | |
| ### Logs | |
| - Timestamped log files for each run | |
| - Structured logging, with emoji markers that make entries easy to scan | |
| - Tool calls logged with full context | |
| - Errors logged with stack traces | |
| ### Metrics (available in logs) | |
| - Questions processed | |
| - Cache hit rate | |
| - Tool usage frequency | |
| - LLM calls per question | |
| - Execution time per question | |
| - Error rate | |
| --- | |
| **Version:** 1.0 | |
| **Last Updated:** 2025-09-30 | |
| **Author:** Leon Woo | |
| **License:** MIT | |