
Agent Architecture Documentation

Overview

This is a LangGraph-based AI agent designed for the GAIA (General AI Assistants) benchmark evaluation. The agent uses GPT-4o/GPT-4o-mini with tool-calling capabilities to answer complex multi-step questions involving web search, file analysis, multimedia processing, and reasoning.

System Architecture

┌────────────────────────────────────────────────────────────────┐
│                          User Request                          │
│                       (20 GAIA Questions)                      │
└───────────────────────────────┬────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│                             app.py                             │
│  • Fetches questions from API                                  │
│  • Downloads attached files (Excel, MP3, images, Python)       │
│  • Saves files to downloads/ directory                         │
│  • Calls BasicAgent for each question                          │
│  • Submits answers to evaluation API                           │
└───────────────────────────────┬────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│                           BasicAgent                           │
│                     (agent/basic_agent.py)                     │
│                                                                │
│  1. Check Cache (agent_cache.json)                             │
│     └─ If cached: Return answer instantly ✅                   │
│                                                                │
│  2. If not cached:                                             │
│     └─ Invoke LangGraph workflow                               │
│                                                                │
│  3. Clean & validate answer                                    │
│     └─ Remove JSON, code blocks, explanations                  │
│                                                                │
│  4. Cache answer to disk                                       │
│     └─ Save to agent_cache.json for future use                 │
└───────────────────────────────┬────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│                       LangGraph Workflow                       │
│                        (agent/graph.py)                        │
│                                                                │
│  ┌──────────────┐                                              │
│  │ Agent Node   │ ← Decides next action                        │
│  │ (GPT-4o)     │   • Analyze question                         │
│  └──────┬───────┘   • Choose tool(s)                           │
│         │           • Generate response                        │
│         ↓                                                      │
│  ┌──────────────┐                                              │
│  │ Tools Node   │ ← Executes tools                             │
│  │              │   • Search, calculate, read files            │
│  └──────┬───────┘   • Returns results                          │
│         │                                                      │
│         ↓                                                      │
│  ┌──────────────┐                                              │
│  │ Agent Node   │ ← Processes results                          │
│  │ (GPT-4o)     │   • Analyzes tool output                     │
│  └──────┬───────┘   • Decides: more tools or final answer?     │
│         │                                                      │
│         └─────────→ Loop (max 50 iterations)                   │
│                                                                │
│  Final Answer → Return to BasicAgent                           │
└────────────────────────────────────────────────────────────────┘

Core Components

1. app.py - Main Application

Responsibilities:

  • Fetch questions from evaluation API
  • Download attached files from /files/{task_id} endpoint
  • Orchestrate agent execution for all questions
  • Submit answers to evaluation API
  • Display results

Key Functions:

  • run_and_submit_all() - Main evaluation loop
  • File download with error handling
  • Results aggregation and submission

2. agent/basic_agent.py - Agent Wrapper

Responsibilities:

  • Manage agent lifecycle
  • Implement caching system (persistent to disk)
  • Clean and validate answers
  • Logging to file

Key Features:

  • Persistent Caching: Saves answers to agent_cache.json
  • Answer Cleaning: Removes JSON, code blocks, explanations
  • Validation: Ensures no empty answers submitted
  • Logging: All output saved to timestamped log files

Cache System:

{
  "question_text": "answer",
  "How many albums...": "4",
  "What is 2+2?": "4"
}
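
The caching logic itself is simple. A minimal sketch of how such a disk-backed cache can work (function names and the exact internals of basic_agent.py are illustrative assumptions):

```python
import json
from pathlib import Path

CACHE_FILE = Path("agent_cache.json")  # assumed default path

def load_cache() -> dict:
    """Load the question->answer cache; fall back to empty on a missing or corrupt file."""
    try:
        return json.loads(CACHE_FILE.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}

def save_answer(cache: dict, question: str, answer: str) -> None:
    """Record an answer and persist the whole cache to disk immediately."""
    cache[question] = answer
    CACHE_FILE.write_text(json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8")

# Usage: check the cache before invoking the LangGraph workflow.
cache = load_cache()
question = "What is 2+2?"
answer = cache.get(question)      # cache hit: zero LLM calls
if answer is None:
    answer = "4"                  # placeholder for the workflow result
    save_answer(cache, question, answer)
```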

3. agent/graph.py - LangGraph Workflow

Responsibilities:

  • Define agent workflow (nodes and edges)
  • Initialize LLM chains (primary + fallback)
  • Initialize and manage tools
  • Route between agent and tools nodes

Workflow Structure:

START → Agent Node → [Tools Node] → Agent Node → END
         ↑_______________|
         (loop until answer found or max iterations)

Key Components:

  • agent_node() - LLM decision making
  • tool_node() - Tool execution
  • should_continue() - Routing logic
  • System prompt with detailed instructions

LLM Configuration:

  • Primary: GPT-4o (with tools)
  • Fallback: GPT-4o-mini (with tools)
  • Recursion Limit: 50 iterations
  • Rate Limiting: Exponential backoff (5 retries)
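
Put together, these pieces form the standard LangGraph tool-calling loop. A minimal sketch of the wiring, with simplified node bodies and an empty tool list standing in for the real tool suite (the actual graph.py may differ in detail):

```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI

tools = []  # the real graph binds the 15+ tools from agent/tools.py
llm = ChatOpenAI(model="gpt-4o", temperature=0.0, max_retries=5).bind_tools(tools)

def agent_node(state: MessagesState) -> dict:
    # The LLM sees the conversation so far and either requests a tool call or answers.
    return {"messages": [llm.invoke(state["messages"])]}

def should_continue(state: MessagesState) -> str:
    # Route to the tools node while the last AI message still requests tool calls.
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else END

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")
app = graph.compile()

# The recursion limit caps the agent/tools loop at invocation time:
# result = app.invoke({"messages": [("user", question)]}, config={"recursion_limit": 50})
```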

4. agent/tools.py - Tool Implementations

Responsibilities:

  • Implement all tools available to the agent
  • Handle file path resolution (current dir + downloads/)
  • Integrate with external APIs (Gemini, search engines)

Available Tools:

Search & Research (5 tools)

  • duckduckgo_search - Web search
  • tavily_search - Advanced web search
  • wikipedia - Wikipedia lookup
  • youtube_transcript - Get YouTube transcripts
  • arxiv_search - Academic paper search

File Operations (5 tools)

  • list_files - List files in current/downloads directory
  • read_file - Read text files
  • read_excel - Read and analyze Excel files
  • download_file - Download files from URLs
  • execute_python_file - Run Python scripts

Multimedia Analysis (3 tools - Gemini-powered)

  • understand_video - Analyze YouTube videos
  • understand_audio - Transcribe and analyze MP3/audio
  • analyze_image - Analyze images (chess, diagrams, text)

Computation (2 tools)

  • calculator - Safe math evaluation
  • python_repl - Execute Python code

File Path Resolution: All file tools use a find_file() helper that checks (see the sketch after this list):

  1. Current directory
  2. downloads/ directory
  3. Returns best match or downloads path
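
A minimal sketch of what such a helper can look like (the actual signature and fallback behavior in agent/tools.py may differ):

```python
from pathlib import Path

def find_file(filename: str) -> Path:
    """Resolve a file name against the current directory, then downloads/."""
    for candidate in (Path(filename), Path("downloads") / filename):
        if candidate.exists():
            return candidate
    # Fall back to the downloads/ path so error messages point somewhere sensible.
    return Path("downloads") / filename

# Usage inside a tool: read_excel("data.xlsx") would resolve via find_file("data.xlsx").
```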

Data Flow

Question Processing Flow

1. API Request
   └─ GET /questions
   └─ Returns: [{task_id, question, Level, file_name}, ...]

2. File Download (if file_name exists)
   └─ GET /files/{task_id}
   └─ Save to: downloads/{file_name}

3. Agent Invocation
   ├─ Check cache
   │  └─ If hit: Return cached answer (0 LLM calls)
   │
   └─ If miss:
      ├─ Create initial state with question
      ├─ Invoke LangGraph workflow
      │  ├─ Agent decides action
      │  ├─ Execute tools
      │  ├─ Agent processes results
      │  └─ Loop until answer or max iterations
      │
      ├─ Extract answer from final message
      ├─ Clean answer (remove JSON, explanations)
      ├─ Validate answer (ensure not empty)
      └─ Cache to disk

4. Answer Submission
   └─ POST /submit
   └─ Body: {username, answers: [{task_id, submitted_answer}]}
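
Taken together, the flow above reduces to a few HTTP calls. A hedged sketch using requests, where API_BASE is a placeholder and the real submit payload may carry additional fields:

```python
from pathlib import Path
import requests

API_BASE = "https://example-evaluation-api"  # placeholder, not the real endpoint

# 1. Fetch the question set.
questions = requests.get(f"{API_BASE}/questions", timeout=30).json()

# 2. Download the attached file for a task, if any.
task = questions[0]
if task.get("file_name"):
    Path("downloads").mkdir(exist_ok=True)
    data = requests.get(f"{API_BASE}/files/{task['task_id']}", timeout=60).content
    (Path("downloads") / task["file_name"]).write_bytes(data)

# 4. Submit all answers in one request (step 3, agent invocation, happens in between).
payload = {
    "username": "your-hf-username",
    "answers": [{"task_id": task["task_id"], "submitted_answer": "example answer"}],
}
requests.post(f"{API_BASE}/submit", json=payload, timeout=60)
```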

Tool Execution Flow

Agent Node (GPT-4o)
  ↓
Decides: "I need to use list_files tool"
  ↓
Tool Node
  ├─ Finds tool by name
  ├─ Validates parameters
  ├─ Executes tool._run()
  │  └─ Example: list_files()
  │     ├─ Check current directory
  │     ├─ Check downloads/ directory
  │     └─ Return: "Files found:\n./app.py\ndownloads/data.xlsx"
  └─ Returns ToolMessage with result
  ↓
Agent Node (GPT-4o)
  ├─ Receives tool output
  ├─ Analyzes results
  └─ Decides: Use another tool OR provide final answer
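
In code, the tools node amounts to dispatching each requested tool call and wrapping the output in a ToolMessage. A simplified sketch (the real graph likely relies on LangGraph's prebuilt ToolNode rather than hand-rolled dispatch):

```python
from langchain_core.messages import AIMessage, ToolMessage

def run_tool_calls(last_ai_message: AIMessage, tools_by_name: dict) -> list[ToolMessage]:
    """Execute every tool call requested by the LLM and wrap each result as a ToolMessage."""
    results = []
    for call in last_ai_message.tool_calls:
        tool = tools_by_name[call["name"]]      # find tool by name
        output = tool.invoke(call["args"])       # validates args, then runs the tool
        results.append(ToolMessage(content=str(output), tool_call_id=call["id"]))
    return results
```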

Key Design Decisions

1. Persistent Caching

Why: Reduce costs and enable fast re-runs
How: JSON file on disk, loaded at startup, saved after each answer
Benefit: 100% cost savings on repeated questions

2. File Path Resolution

Why: Files can be in the current directory or downloads/
How: find_file() helper checks both locations
Benefit: The agent doesn't need to know the exact file location

3. Gemini for Multimedia

Why: GPT-4o doesn't support direct video/audio analysis
How: Upload files to Gemini API, get analysis
Benefit: Can handle YouTube videos, MP3 files, images
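
A hedged sketch of the Gemini path for audio, using the google-generativeai client; the model name and prompt handling are assumptions:

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def understand_audio(path: str, question: str) -> str:
    """Upload a local audio file to Gemini and ask a question about it."""
    uploaded = genai.upload_file(path)                  # e.g. downloads/audio.mp3
    model = genai.GenerativeModel("gemini-1.5-flash")   # model name is an assumption
    response = model.generate_content([uploaded, question])
    return response.text
```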

4. Answer Cleaning Pipeline

Why: LLMs often return verbose explanations or JSON
How: Multi-stage cleaning (JSON removal, pattern matching, validation)
Benefit: Clean, concise answers that match expected format
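
The exact rules live in basic_agent.py; an illustrative multi-stage version might look like this (the specific patterns are assumptions):

```python
import json
import re

def clean_answer(raw: str) -> str:
    """Strip code fences, JSON wrappers, and 'Final answer:'-style prefixes (illustrative rules)."""
    # Drop markdown code-fence lines if the model wrapped its answer in one.
    lines = [ln for ln in raw.strip().splitlines() if not ln.strip().startswith("```")]
    text = "\n".join(lines).strip()
    # If the model returned a JSON object with an obvious answer field, unwrap it.
    if text.startswith("{"):
        try:
            text = str(json.loads(text).get("answer", text))
        except json.JSONDecodeError:
            pass
    # Remove a leading label such as "Final answer:" or "Answer:".
    return re.sub(r"^(final answer|answer)\s*:\s*", "", text, flags=re.IGNORECASE).strip()

assert clean_answer('{"answer": "4"}') == "4"
```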

5. Dual LLM Strategy

Why: Reliability and cost optimization
How: Primary (GPT-4o) with fallback (GPT-4o-mini)
Benefit: Continues working if primary fails
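
With LangChain this can be expressed directly with with_fallbacks(); a minimal sketch:

```python
from langchain_openai import ChatOpenAI

# Primary and fallback models; in graph.py both would be bound to the same tools.
primary = ChatOpenAI(model="gpt-4o", temperature=0.0, max_retries=5)
fallback = ChatOpenAI(model="gpt-4o-mini", temperature=0.0, max_retries=5)

# with_fallbacks() returns a runnable that retries the call on the fallback
# model if the primary raises (e.g. after the retry budget is exhausted).
llm = primary.with_fallbacks([fallback])
```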

6. Tool-First Architecture

Why: Many questions require external data
How: Rich tool suite with 15+ specialized tools
Benefit: Can handle diverse question types

Configuration

Environment Variables (.env)

OPENAI_API_KEY=sk-...           # Required for GPT-4o
GEMINI_API_KEY=...              # Required for video/audio/image analysis
TAVILY_API_KEY=...              # Optional for advanced search
HF_TOKEN=...                    # For HuggingFace API access

Configurable Parameters

In BasicAgent:

  • log_to_file - Enable/disable logging (default: True)
  • use_cache - Enable/disable caching (default: True)
  • cache_file - Cache file path (default: "agent_cache.json")

In LangGraph:

  • recursion_limit - Max iterations (default: 50)
  • temperature - LLM temperature (default: 0.0)
  • max_retries - Rate limit retries (default: 5)
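
Assuming constructor keywords that match the names above (the exact BasicAgent signature is an assumption), configuration might look like:

```python
from agent.basic_agent import BasicAgent  # import path as laid out in this repo

agent = BasicAgent(
    log_to_file=True,               # write a timestamped agent_run_*.log
    use_cache=True,                 # consult and update the persistent cache
    cache_file="agent_cache.json",
)

# LangGraph-side limits are passed per invocation via the config dict.
config = {"recursion_limit": 50}
```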

File Structure

Final_Assignment_Template/
β”œβ”€β”€ app.py                      # Main application
β”œβ”€β”€ agent/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ basic_agent.py          # Agent wrapper with caching
β”‚   β”œβ”€β”€ graph.py                # LangGraph workflow
β”‚   └── tools.py                # Tool implementations
β”œβ”€β”€ downloads/                  # Downloaded files (gitignored)
β”‚   β”œβ”€β”€ file1.xlsx
β”‚   β”œβ”€β”€ audio.mp3
β”‚   └── image.png
β”œβ”€β”€ agent_cache.json            # Persistent cache (gitignored)
β”œβ”€β”€ agent_run_*.log             # Log files (gitignored)
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ .env                        # Environment variables (gitignored)
β”œβ”€β”€ .gitignore
β”œβ”€β”€ ARCHITECTURE.md             # This file
└── README.md                   # User documentation

Performance Characteristics

Typical Question Processing Time

  • Simple (cached): < 0.1 seconds
  • Simple (web search): 2-5 seconds
  • Medium (file analysis): 5-15 seconds
  • Complex (multi-step): 15-60 seconds
  • Multimedia (video/audio): 30-120 seconds

LLM Token Usage (per question)

  • Simple: 500-2,000 tokens
  • Medium: 2,000-8,000 tokens
  • Complex: 8,000-20,000 tokens

Cost Estimates (GPT-4o)

  • Per question (avg): $0.01-0.05
  • 20 questions: $0.20-1.00
  • With caching (re-runs): $0.00

Error Handling

Graceful Degradation

  1. Cache file corrupted: Start with empty cache
  2. File download fails: Continue without file, agent handles gracefully
  3. Tool execution fails: Return error message, agent tries alternative
  4. LLM rate limit: Exponential backoff, retry up to 5 times
  5. Primary LLM fails: Fallback to GPT-4o-mini
  6. Recursion limit hit: Return best answer so far

Validation

  • All answers validated (never empty)
  • File paths validated before access
  • API responses validated before processing
  • Tool parameters validated before execution

Testing & Development

Local Testing

# Run full evaluation
python app.py

# Check logs
tail -f agent_run_*.log

# View cache
cat agent_cache.json

# Clear cache for fresh run
rm agent_cache.json

Debugging

  • All tool calls logged with arguments and results
  • Agent reasoning logged at each step
  • Errors logged with full stack traces
  • Cache hits/misses logged

Future Enhancements

Potential Improvements

  1. Pattern-based answering - Skip LLM for simple questions
  2. Parallel tool execution - Run independent tools simultaneously
  3. Smarter caching - Fuzzy matching for similar questions
  4. Cost tracking - Log token usage and costs
  5. A/B testing - Compare different prompts/strategies
  6. Streaming responses - Show progress in real-time

Scalability Considerations

  • Cache can grow large (consider size limits or TTL)
  • Multiple concurrent runs need separate cache files
  • Rate limiting may need adjustment for production
  • Consider database instead of JSON for large-scale caching

Dependencies

Core

  • langchain - LLM framework
  • langgraph - Workflow orchestration
  • langchain-openai - OpenAI integration
  • langchain-community - Community tools

Tools

  • google-generativeai - Gemini API
  • tavily-python - Advanced search
  • duckduckgo-search - Web search
  • youtube-transcript-api - YouTube transcripts
  • pandas - Data analysis
  • openpyxl - Excel files

Utilities

  • requests - HTTP requests
  • python-dotenv - Environment variables
  • gradio - Web UI (optional)

Security Considerations

API Keys

  • Stored in .env file (gitignored)
  • Never hardcoded in source
  • Loaded via python-dotenv

Code Execution

  • python_repl uses AST-based REPL (safer than eval)
  • execute_python_file runs in subprocess with timeout
  • No shell injection vulnerabilities
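
As an illustration of the AST-based approach (closer to the calculator tool's "safe math evaluation" than to a full REPL), a whitelisting evaluator can look like this; the actual implementation in agent/tools.py may differ:

```python
import ast
import operator

# Whitelist of arithmetic operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate a pure arithmetic expression without calling eval()."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {ast.dump(node)}")
    return _eval(ast.parse(expression, mode="eval").body)

assert safe_eval("2 + 2 * 3") == 8
```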

File Access

  • All file operations use Path validation
  • No arbitrary file system access
  • Downloads isolated to downloads/ directory

Monitoring & Observability

Logs

  • Timestamped log files for each run
  • Structured logging with emojis for easy parsing
  • Tool calls logged with full context
  • Errors logged with stack traces

Metrics (available in logs)

  • Questions processed
  • Cache hit rate
  • Tool usage frequency
  • LLM calls per question
  • Execution time per question
  • Error rate

Version: 1.0
Last Updated: 2025-09-30
Author: Leon Woo
License: MIT