Agent Architecture Documentation
Overview
This is a LangGraph-based AI agent designed for the GAIA (General AI Assistants) benchmark evaluation. The agent uses GPT-4o/GPT-4o-mini with tool-calling capabilities to answer complex multi-step questions involving web search, file analysis, multimedia processing, and reasoning.
System Architecture
```
┌───────────────────────────────────────────────────────────────────┐
│                           User Request                            │
│                        (20 GAIA Questions)                        │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
┌─────────────────────────────────▼─────────────────────────────────┐
│                              app.py                               │
│  • Fetches questions from API                                     │
│  • Downloads attached files (Excel, MP3, images, Python)          │
│  • Saves files to downloads/ directory                            │
│  • Calls BasicAgent for each question                             │
│  • Submits answers to evaluation API                              │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
┌─────────────────────────────────▼─────────────────────────────────┐
│                            BasicAgent                             │
│                       (agent/basic_agent.py)                      │
│                                                                   │
│  1. Check cache (agent_cache.json)                                │
│     └─ If cached: return answer instantly                         │
│                                                                   │
│  2. If not cached:                                                │
│     └─ Invoke LangGraph workflow                                  │
│                                                                   │
│  3. Clean & validate answer                                       │
│     └─ Remove JSON, code blocks, explanations                     │
│                                                                   │
│  4. Cache answer to disk                                          │
│     └─ Save to agent_cache.json for future use                    │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
┌─────────────────────────────────▼─────────────────────────────────┐
│                         LangGraph Workflow                        │
│                          (agent/graph.py)                         │
│                                                                   │
│   ┌──────────────┐                                                │
│   │  Agent Node  │ ← Decides next action                          │
│   │   (GPT-4o)   │   • Analyze question                           │
│   └──────┬───────┘   • Choose tool(s)                             │
│          │           • Generate response                          │
│          ▼                                                        │
│   ┌──────────────┐                                                │
│   │  Tools Node  │ ← Executes tools                               │
│   │              │   • Search, calculate, read files              │
│   └──────┬───────┘   • Returns results                            │
│          │                                                        │
│          ▼                                                        │
│   ┌──────────────┐                                                │
│   │  Agent Node  │ ← Processes results                            │
│   │   (GPT-4o)   │   • Analyzes tool output                       │
│   └──────┬───────┘   • Decides: more tools or final answer?       │
│          │                                                        │
│          └──────► Loop (max 50 iterations)                        │
│                                                                   │
│   Final Answer → Return to BasicAgent                             │
└───────────────────────────────────────────────────────────────────┘
```
Core Components
1. app.py - Main Application
Responsibilities:
- Fetch questions from evaluation API
- Download attached files from the `/files/{task_id}` endpoint
- Orchestrate agent execution for all questions
- Submit answers to evaluation API
- Display results
Key Functions:
- `run_and_submit_all()` - Main evaluation loop
- File download with error handling
- Results aggregation and submission
2. agent/basic_agent.py - Agent Wrapper
Responsibilities:
- Manage agent lifecycle
- Implement caching system (persistent to disk)
- Clean and validate answers
- Logging to file
Key Features:
- Persistent Caching: Saves answers to `agent_cache.json`
- Answer Cleaning: Removes JSON, code blocks, explanations
- Validation: Ensures no empty answers submitted
- Logging: All output saved to timestamped log files
Cache System:
```json
{
  "How many albums...": "4",
  "What is 2+2?": "4"
}
```
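A minimal sketch of how such a question-to-answer cache can be persisted, consistent with the behavior described above (loaded at startup, saved after each answer, degrading to an empty cache if the file is missing or corrupted). Function names here are illustrative, not the exact ones in `basic_agent.py`:

```python
# Sketch of the persistent cache: a question -> answer dict stored as
# JSON on disk. A missing or corrupted file yields an empty cache
# instead of crashing (graceful degradation).
import json
from pathlib import Path

def load_cache(path: str = "agent_cache.json") -> dict:
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}  # start fresh rather than failing the run

def save_cache(cache: dict, path: str = "agent_cache.json") -> None:
    # Saved after each answer so a crash loses at most one entry.
    Path(path).write_text(
        json.dumps(cache, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```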
3. agent/graph.py - LangGraph Workflow
Responsibilities:
- Define agent workflow (nodes and edges)
- Initialize LLM chains (primary + fallback)
- Initialize and manage tools
- Route between agent and tools nodes
Workflow Structure:
```
START → Agent Node → [Tools Node] → Agent Node → END
              ↑______________|
   (loop until answer found or max iterations)
```
Key Components:
- `agent_node()` - LLM decision making
- `tool_node()` - Tool execution
- `should_continue()` - Routing logic
- System prompt with detailed instructions
LLM Configuration:
- Primary: GPT-4o (with tools)
- Fallback: GPT-4o-mini (with tools)
- Recursion Limit: 50 iterations
- Rate Limiting: Exponential backoff (5 retries)
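The routing logic can be sketched roughly as follows. This is a simplified illustration, assuming LangChain-style messages where an AI message carries a `tool_calls` list when the model requested tool execution; the actual `should_continue()` in `graph.py` may differ in detail:

```python
# Sketch of should_continue(): route to the Tools Node if the last
# message requests tools, otherwise end the graph with the final answer.
def should_continue(state: dict) -> str:
    last_message = state["messages"][-1]
    if getattr(last_message, "tool_calls", None):
        return "tools"   # edge to the Tools Node
    return "end"         # edge to END: final answer produced
```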
4. agent/tools.py - Tool Implementations
Responsibilities:
- Implement all tools available to the agent
- Handle file path resolution (current dir + downloads/)
- Integrate with external APIs (Gemini, search engines)
Available Tools:
Search & Research (5 tools)
- `duckduckgo_search` - Web search
- `tavily_search` - Advanced web search
- `wikipedia` - Wikipedia lookup
- `youtube_transcript` - Get YouTube transcripts
- `arxiv_search` - Academic paper search
File Operations (5 tools)
- `list_files` - List files in current/downloads directory
- `read_file` - Read text files
- `read_excel` - Read and analyze Excel files
- `download_file` - Download files from URLs
- `execute_python_file` - Run Python scripts
Multimedia Analysis (3 tools - Gemini-powered)
- `understand_video` - Analyze YouTube videos
- `understand_audio` - Transcribe and analyze MP3/audio
- `analyze_image` - Analyze images (chess, diagrams, text)
Computation (2 tools)
- `calculator` - Safe math evaluation
- `python_repl` - Execute Python code
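A minimal sketch of what "safe math evaluation" can look like: the expression is parsed into an AST and only arithmetic nodes are evaluated, so imports, calls, and attribute access are rejected. This illustrates the approach, not the exact implementation in `tools.py`:

```python
# Safe calculator sketch: walk the AST and allow only numeric literals
# and arithmetic operators. Anything else raises ValueError.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
    ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)
```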
File Path Resolution:
All file tools use a `find_file()` helper that checks:
- Current directory
- `downloads/` directory
- Returns best match or downloads path
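The resolution order above can be sketched in a few lines. This is an illustrative version; the real helper in `tools.py` may add more matching logic:

```python
# find_file() sketch: prefer the current directory, then downloads/.
# If neither exists, return the downloads/ path as the best guess so
# callers still get a sensible location for error messages.
from pathlib import Path

def find_file(filename: str, downloads_dir: str = "downloads") -> Path:
    local = Path(filename)
    if local.exists():
        return local
    return Path(downloads_dir) / filename
```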
Data Flow
Question Processing Flow
```
1. API Request
   └─ GET /questions
      └─ Returns: [{task_id, question, Level, file_name}, ...]

2. File Download (if file_name exists)
   └─ GET /files/{task_id}
      └─ Save to: downloads/{file_name}

3. Agent Invocation
   ├─ Check cache
   │  └─ If hit: Return cached answer (0 LLM calls)
   │
   └─ If miss:
      ├─ Create initial state with question
      ├─ Invoke LangGraph workflow
      │  ├─ Agent decides action
      │  ├─ Execute tools
      │  ├─ Agent processes results
      │  └─ Loop until answer or max iterations
      │
      ├─ Extract answer from final message
      ├─ Clean answer (remove JSON, explanations)
      ├─ Validate answer (ensure not empty)
      └─ Cache to disk

4. Answer Submission
   └─ POST /submit
      └─ Body: {username, answers: [{task_id, submitted_answer}]}
```
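The submission body in step 4 can be assembled as a plain dict. A hedged sketch, assuming answers are held as a `task_id → answer` mapping (the function name here is illustrative):

```python
# Build the POST /submit payload from collected answers.
def build_submission(username: str, answers: dict) -> dict:
    return {
        "username": username,
        "answers": [
            {"task_id": task_id, "submitted_answer": answer}
            for task_id, answer in answers.items()
        ],
    }
```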
Tool Execution Flow
```
Agent Node (GPT-4o)
  ↓
Decides: "I need to use list_files tool"
  ↓
Tool Node
  ├─ Finds tool by name
  ├─ Validates parameters
  ├─ Executes tool._run()
  │   └─ Example: list_files()
  │       ├─ Check current directory
  │       ├─ Check downloads/ directory
  │       └─ Return: "Files found:\n./app.py\ndownloads/data.xlsx"
  └─ Returns ToolMessage with result
  ↓
Agent Node (GPT-4o)
  ├─ Receives tool output
  ├─ Analyzes results
  └─ Decides: Use another tool OR provide final answer
```
Key Design Decisions
1. Persistent Caching
Why: Reduce costs and enable fast re-runs
How: JSON file on disk, loaded at startup, saved after each answer
Benefit: 100% cost savings on repeated questions
2. File Path Resolution
Why: Files can be in current directory or downloads/
How: find_file() helper checks both locations
Benefit: Agent doesn't need to know exact file location
3. Gemini for Multimedia
Why: GPT-4o doesn't support direct video/audio analysis
How: Upload files to the Gemini API and retrieve its analysis
Benefit: Can handle YouTube videos, MP3 files, and images
4. Answer Cleaning Pipeline
Why: LLMs often return verbose explanations or JSON
How: Multi-stage cleaning (JSON removal, pattern matching, validation)
Benefit: Clean, concise answers that match the expected format
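A hedged sketch of such a multi-stage cleaning pipeline: strip code fences, unwrap JSON answers, drop common explanation prefixes, and reject empty results. The exact patterns used by `basic_agent.py` may differ:

```python
# Answer-cleaning sketch: each stage handles one common failure mode
# of verbose LLM output.
import json
import re

def clean_answer(raw: str) -> str:
    text = raw.strip()
    # 1. Remove Markdown code fences.
    text = re.sub(r"^```[a-zA-Z]*\n?|```$", "", text, flags=re.MULTILINE).strip()
    # 2. If the model returned JSON like {"answer": "4"}, unwrap it.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict) and "answer" in parsed:
            text = str(parsed["answer"]).strip()
    except (json.JSONDecodeError, ValueError):
        pass
    # 3. Strip common explanation prefixes.
    text = re.sub(r"^(final answer|answer)\s*[:\-]\s*", "", text, flags=re.IGNORECASE)
    # 4. Validate: never submit an empty answer.
    if not text:
        raise ValueError("Empty answer after cleaning")
    return text
```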
5. Dual LLM Strategy
Why: Reliability and cost optimization
How: Primary (GPT-4o) with fallback (GPT-4o-mini)
Benefit: Continues working if the primary model fails
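The fallback strategy reduces to "try the primary, catch, retry with the cheaper model." A generic sketch (LangChain users can achieve the same with a runnable's `with_fallbacks()`); the function name is illustrative:

```python
# Dual-LLM sketch: invoke the primary model, fall back on any failure.
def invoke_with_fallback(primary, fallback, prompt: str) -> str:
    try:
        return primary(prompt)       # e.g. GPT-4o
    except Exception:
        return fallback(prompt)      # e.g. GPT-4o-mini
```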
6. Tool-First Architecture
Why: Many questions require external data
How: Rich tool suite with 15+ specialized tools
Benefit: Can handle diverse question types
Configuration
Environment Variables (.env)
```
OPENAI_API_KEY=sk-...   # Required for GPT-4o
GEMINI_API_KEY=...      # Required for video/audio/image analysis
TAVILY_API_KEY=...      # Optional for advanced search
HF_TOKEN=...            # For HuggingFace API access
```
Configurable Parameters
In BasicAgent:
- `log_to_file` - Enable/disable logging (default: True)
- `use_cache` - Enable/disable caching (default: True)
- `cache_file` - Cache file path (default: "agent_cache.json")
In LangGraph:
- `recursion_limit` - Max iterations (default: 50)
- `temperature` - LLM temperature (default: 0.0)
- `max_retries` - Rate limit retries (default: 5)
File Structure
```
Final_Assignment_Template/
├── app.py                 # Main application
├── agent/
│   ├── __init__.py
│   ├── basic_agent.py     # Agent wrapper with caching
│   ├── graph.py           # LangGraph workflow
│   └── tools.py           # Tool implementations
├── downloads/             # Downloaded files (gitignored)
│   ├── file1.xlsx
│   ├── audio.mp3
│   └── image.png
├── agent_cache.json       # Persistent cache (gitignored)
├── agent_run_*.log        # Log files (gitignored)
├── requirements.txt       # Python dependencies
├── .env                   # Environment variables (gitignored)
├── .gitignore
├── ARCHITECTURE.md        # This file
└── README.md              # User documentation
```
Performance Characteristics
Typical Question Processing Time
- Simple (cached): < 0.1 seconds
- Simple (web search): 2-5 seconds
- Medium (file analysis): 5-15 seconds
- Complex (multi-step): 15-60 seconds
- Multimedia (video/audio): 30-120 seconds
LLM Token Usage (per question)
- Simple: 500-2,000 tokens
- Medium: 2,000-8,000 tokens
- Complex: 8,000-20,000 tokens
Cost Estimates (GPT-4o)
- Per question (avg): $0.01-0.05
- 20 questions: $0.20-1.00
- With caching (re-runs): $0.00
Error Handling
Graceful Degradation
- Cache file corrupted: Start with empty cache
- File download fails: Continue without file, agent handles gracefully
- Tool execution fails: Return error message, agent tries alternative
- LLM rate limit: Exponential backoff, retry up to 5 times
- Primary LLM fails: Fallback to GPT-4o-mini
- Recursion limit hit: Return best answer so far
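The rate-limit handling above follows a standard exponential-backoff pattern, sketched here under the assumption of a doubling delay between attempts (the wrapper name and delay schedule are illustrative):

```python
# Exponential-backoff sketch: retry a callable up to max_retries times,
# doubling the wait after each failure; re-raise after the last attempt.
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise                      # out of retries
            time.sleep(base_delay * (2 ** attempt))
```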
Validation
- All answers validated (never empty)
- File paths validated before access
- API responses validated before processing
- Tool parameters validated before execution
Testing & Development
Local Testing
```shell
# Run full evaluation
python app.py

# Check logs
tail -f agent_run_*.log

# View cache
cat agent_cache.json

# Clear cache for a fresh run
rm agent_cache.json
```
Debugging
- All tool calls logged with arguments and results
- Agent reasoning logged at each step
- Errors logged with full stack traces
- Cache hits/misses logged
Future Enhancements
Potential Improvements
- Pattern-based answering - Skip LLM for simple questions
- Parallel tool execution - Run independent tools simultaneously
- Smarter caching - Fuzzy matching for similar questions
- Cost tracking - Log token usage and costs
- A/B testing - Compare different prompts/strategies
- Streaming responses - Show progress in real-time
Scalability Considerations
- Cache can grow large (consider size limits or TTL)
- Multiple concurrent runs need separate cache files
- Rate limiting may need adjustment for production
- Consider database instead of JSON for large-scale caching
Dependencies
Core
- `langchain` - LLM framework
- `langgraph` - Workflow orchestration
- `langchain-openai` - OpenAI integration
- `langchain-community` - Community tools
Tools
- `google-generativeai` - Gemini API
- `tavily-python` - Advanced search
- `duckduckgo-search` - Web search
- `youtube-transcript-api` - YouTube transcripts
- `pandas` - Data analysis
- `openpyxl` - Excel files
Utilities
- `requests` - HTTP requests
- `python-dotenv` - Environment variables
- `gradio` - Web UI (optional)
Security Considerations
API Keys
- Stored in `.env` file (gitignored)
- Never hardcoded in source
- Loaded via `python-dotenv`
Code Execution
- `python_repl` uses an AST-based REPL (safer than `eval`)
- `execute_python_file` runs in a subprocess with a timeout
- No shell injection vulnerabilities
File Access
- All file operations use Path validation
- No arbitrary file system access
- Downloads isolated to the `downloads/` directory
Monitoring & Observability
Logs
- Timestamped log files for each run
- Structured logging with emojis for easy parsing
- Tool calls logged with full context
- Errors logged with stack traces
Metrics (available in logs)
- Questions processed
- Cache hit rate
- Tool usage frequency
- LLM calls per question
- Execution time per question
- Error rate
Version: 1.0
Last Updated: 2025-09-30
Author: Leon Woo
License: MIT