# Agent Architecture Documentation
## Overview
This is a LangGraph-based AI agent designed for the GAIA (General AI Assistants) benchmark evaluation. The agent uses GPT-4o/GPT-4o-mini with tool-calling capabilities to answer complex multi-step questions involving web search, file analysis, multimedia processing, and reasoning.
## System Architecture
```
User Request (20 GAIA Questions)
                ↓
app.py
  β€’ Fetches questions from API
  β€’ Downloads attached files (Excel, MP3, images, Python)
  β€’ Saves files to downloads/ directory
  β€’ Calls BasicAgent for each question
  β€’ Submits answers to evaluation API
                ↓
BasicAgent  (agent/basic_agent.py)
  1. Check cache (agent_cache.json)
     └─ If cached: return answer instantly βœ…
  2. If not cached:
     └─ Invoke LangGraph workflow
  3. Clean & validate answer
     └─ Remove JSON, code blocks, explanations
  4. Cache answer to disk
     └─ Save to agent_cache.json for future use
                ↓
LangGraph Workflow  (agent/graph.py)
  Agent Node (GPT-4o)      ← Decides next action
    β€’ Analyze question
    β€’ Choose tool(s)
    β€’ Generate response
                ↓
  Tools Node               ← Executes tools
    β€’ Search, calculate, read files
    β€’ Returns results
                ↓
  Agent Node (GPT-4o)      ← Processes results
    β€’ Analyzes tool output
    β€’ Decides: more tools or final answer?
                ↓
  Loop back through Tools Node as needed (max 50 iterations)

  Final Answer β†’ Return to BasicAgent
```
## Core Components
### 1. **app.py** - Main Application
**Responsibilities:**
- Fetch questions from evaluation API
- Download attached files from `/files/{task_id}` endpoint
- Orchestrate agent execution for all questions
- Submit answers to evaluation API
- Display results
**Key Functions:**
- `run_and_submit_all()` - Main evaluation loop
- File download with error handling
- Results aggregation and submission
### 2. **agent/basic_agent.py** - Agent Wrapper
**Responsibilities:**
- Manage agent lifecycle
- Implement caching system (persistent to disk)
- Clean and validate answers
- Logging to file
**Key Features:**
- **Persistent Caching:** Saves answers to `agent_cache.json`
- **Answer Cleaning:** Removes JSON, code blocks, explanations
- **Validation:** Ensures no empty answers submitted
- **Logging:** All output saved to timestamped log files
**Cache System:**
A flat JSON object mapping the question text to the final answer:
```json
{
  "How many albums...": "4",
  "What is 2+2?": "4"
}
```
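The load/save half of this scheme is small; below is a hedged sketch of the pattern (function names are hypothetical, not necessarily the ones in `basic_agent.py`):

```python
import json
from pathlib import Path

CACHE_FILE = Path("agent_cache.json")  # default path; configurable in BasicAgent

def load_cache() -> dict:
    """Load the question->answer cache, falling back to empty on a missing or corrupted file."""
    try:
        return json.loads(CACHE_FILE.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}  # graceful degradation: a bad cache file just means a fresh start

def save_answer(cache: dict, question: str, answer: str) -> None:
    """Record an answer and persist immediately, so a crash mid-run loses nothing."""
    cache[question] = answer
    CACHE_FILE.write_text(json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8")
```

Writing after every answer (rather than once at shutdown) is what makes interrupted runs resumable.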
### 3. **agent/graph.py** - LangGraph Workflow
**Responsibilities:**
- Define agent workflow (nodes and edges)
- Initialize LLM chains (primary + fallback)
- Initialize and manage tools
- Route between agent and tools nodes
**Workflow Structure:**
```
START β†’ Agent Node β†’ [Tools Node] β†’ Agent Node β†’ END
            ↑___________________________β”‚
       (loop until answer found or max iterations)
```
**Key Components:**
- `agent_node()` - LLM decision making
- `tool_node()` - Tool execution
- `should_continue()` - Routing logic
- System prompt with detailed instructions
**LLM Configuration:**
- **Primary:** GPT-4o (with tools)
- **Fallback:** GPT-4o-mini (with tools)
- **Recursion Limit:** 50 iterations
- **Rate Limiting:** Exponential backoff (5 retries)
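Stripped of the framework, the control flow that `graph.py` encodes with LangGraph nodes and edges reduces to a loop. Below is a pure-Python sketch with hypothetical names; the real code wires `agent_node`, `tool_node`, and `should_continue` into a `StateGraph` instead:

```python
from typing import Callable

State = dict  # sketch: {"messages": [...], "answer": str | None}

def run_workflow(agent_step: Callable[[State], State],
                 tool_step: Callable[[State], State],
                 question: str,
                 recursion_limit: int = 50) -> str:
    """Agent decides, tools execute, repeat until an answer or the iteration limit."""
    state: State = {"messages": [question], "answer": None}
    for _ in range(recursion_limit):
        state = agent_step(state)         # agent node: request a tool or emit the answer
        if state["answer"] is not None:   # where should_continue() would route to END
            return state["answer"]
        state = tool_step(state)          # tools node: run tools, append results
    return state["messages"][-1]          # recursion limit hit: best effort so far
```

In LangGraph the same flow comes from `add_conditional_edges("agent", should_continue)` plus a plain edge from `tools` back to `agent`.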
### 4. **agent/tools.py** - Tool Implementations
**Responsibilities:**
- Implement all tools available to the agent
- Handle file path resolution (current dir + downloads/)
- Integrate with external APIs (Gemini, search engines)
**Available Tools:**
#### Search & Research (5 tools)
- `duckduckgo_search` - Web search
- `tavily_search` - Advanced web search
- `wikipedia` - Wikipedia lookup
- `youtube_transcript` - Get YouTube transcripts
- `arxiv_search` - Academic paper search
#### File Operations (5 tools)
- `list_files` - List files in current/downloads directory
- `read_file` - Read text files
- `read_excel` - Read and analyze Excel files
- `download_file` - Download files from URLs
- `execute_python_file` - Run Python scripts
#### Multimedia Analysis (3 tools - Gemini-powered)
- `understand_video` - Analyze YouTube videos
- `understand_audio` - Transcribe and analyze MP3/audio
- `analyze_image` - Analyze images (chess, diagrams, text)
#### Computation (2 tools)
- `calculator` - Safe math evaluation
- `python_repl` - Execute Python code
**File Path Resolution:**
All file tools use the `find_file()` helper, which checks:
1. The current directory
2. The `downloads/` directory
3. If the file is found in neither, it returns the `downloads/` path as a best-effort default
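A minimal sketch of such a helper (the repo's actual `find_file()` may differ in details):

```python
from pathlib import Path

DOWNLOADS = Path("downloads")

def find_file(name: str) -> Path:
    """Resolve a bare filename against the current dir first, then downloads/."""
    for candidate in (Path(name), DOWNLOADS / name):
        if candidate.exists():
            return candidate
    # Neither location has the file: return the downloads/ path anyway,
    # so downstream error messages point at the expected location.
    return DOWNLOADS / name
```

This is why the agent can just say `read_excel("data.xlsx")` without knowing where the download landed.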
## Data Flow
### Question Processing Flow
```
1. API Request
   └─ GET /questions
      └─ Returns: [{task_id, question, Level, file_name}, ...]

2. File Download (if file_name exists)
   └─ GET /files/{task_id}
      └─ Save to: downloads/{file_name}

3. Agent Invocation
   β”œβ”€ Check cache
   β”‚  └─ If hit: Return cached answer (0 LLM calls)
   β”‚
   └─ If miss:
      β”œβ”€ Create initial state with question
      β”œβ”€ Invoke LangGraph workflow
      β”‚  β”œβ”€ Agent decides action
      β”‚  β”œβ”€ Execute tools
      β”‚  β”œβ”€ Agent processes results
      β”‚  └─ Loop until answer or max iterations
      β”‚
      β”œβ”€ Extract answer from final message
      β”œβ”€ Clean answer (remove JSON, explanations)
      β”œβ”€ Validate answer (ensure not empty)
      └─ Cache to disk

4. Answer Submission
   └─ POST /submit
      └─ Body: {username, answers: [{task_id, submitted_answer}]}
```
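The API half of this flow can be sketched with `requests` (a project dependency). The base URL below is a hypothetical placeholder; the real one is configured in `app.py`:

```python
import requests

API_BASE = "https://example-scoring-api"  # hypothetical; see app.py for the real base URL

def build_submission(username: str, answers: list) -> dict:
    """Shape (task_id, answer) pairs into the POST /submit body shown above."""
    return {
        "username": username,
        "answers": [{"task_id": t, "submitted_answer": a} for t, a in answers],
    }

def fetch_questions() -> list:
    """GET /questions -> [{task_id, question, Level, file_name}, ...]"""
    resp = requests.get(f"{API_BASE}/questions", timeout=30)
    resp.raise_for_status()
    return resp.json()

def download_attachment(task_id: str, file_name: str) -> str:
    """GET /files/{task_id} and save the raw bytes to downloads/{file_name}."""
    resp = requests.get(f"{API_BASE}/files/{task_id}", timeout=60)
    resp.raise_for_status()
    path = f"downloads/{file_name}"
    with open(path, "wb") as f:
        f.write(resp.content)
    return path
```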
## Tool Execution Flow
```
Agent Node (GPT-4o)
        ↓
Decides: "I need to use list_files tool"
        ↓
Tool Node
  β”œβ”€ Finds tool by name
  β”œβ”€ Validates parameters
  β”œβ”€ Executes tool._run()
  β”‚  └─ Example: list_files()
  β”‚     β”œβ”€ Check current directory
  β”‚     β”œβ”€ Check downloads/ directory
  β”‚     └─ Return: "Files found:\n./app.py\ndownloads/data.xlsx"
  └─ Returns ToolMessage with result
        ↓
Agent Node (GPT-4o)
  β”œβ”€ Receives tool output
  β”œβ”€ Analyzes results
  └─ Decides: Use another tool OR provide final answer
```
## Key Design Decisions
### 1. **Persistent Caching**
**Why:** Reduce costs and enable fast re-runs
**How:** JSON file on disk, loaded at startup, saved after each answer
**Benefit:** 100% cost savings on repeated questions
### 2. **File Path Resolution**
**Why:** Files can be in current directory or downloads/
**How:** `find_file()` helper checks both locations
**Benefit:** Agent doesn't need to know exact file location
### 3. **Gemini for Multimedia**
**Why:** GPT-4o doesn't support direct video/audio analysis
**How:** Upload files to Gemini API, get analysis
**Benefit:** Can handle YouTube videos, MP3 files, images
### 4. **Answer Cleaning Pipeline**
**Why:** LLMs often return verbose explanations or JSON
**How:** Multi-stage cleaning (JSON removal, pattern matching, validation)
**Benefit:** Clean, concise answers that match expected format
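A hedged sketch of what such a multi-stage pipeline can look like. The exact rules live in `basic_agent.py` and may differ; the JSON field names checked here (`final_answer`, `answer`) are assumptions:

```python
import json
import re

def clean_answer(raw: str) -> str:
    """Strip code fences, unwrap JSON objects, and drop explanation prefixes."""
    text = raw.strip()
    # Stage 1: remove markdown code fences, keeping their contents.
    text = re.sub(r"`{3}[a-zA-Z]*\n?|`{3}", "", text).strip()
    # Stage 2: if the model returned a JSON object, pull out a likely answer field.
    if text.startswith("{"):
        try:
            obj = json.loads(text)
            for key in ("final_answer", "answer"):  # hypothetical field names
                if key in obj:
                    text = str(obj[key])
                    break
        except json.JSONDecodeError:
            pass  # not valid JSON after all; keep the raw text
    # Stage 3: drop common boilerplate prefixes.
    text = re.sub(r"^(final answer|answer)\s*:\s*", "", text, flags=re.IGNORECASE)
    return text.strip()
```

GAIA scoring is exact-match, so a stray "Final Answer: " prefix is the difference between a correct and an incorrect result.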
### 5. **Dual LLM Strategy**
**Why:** Reliability and cost optimization
**How:** Primary (GPT-4o) with fallback (GPT-4o-mini)
**Benefit:** Continues working if primary fails
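With LangChain runnables this is a single call, `primary_llm.with_fallbacks([fallback_llm])`. Framework aside, the underlying pattern is just:

```python
from typing import Callable

def with_fallback(primary: Callable[[str], str],
                  fallback: Callable[[str], str]) -> Callable[[str], str]:
    """Return a callable that tries primary and, on any error, retries on fallback."""
    def invoke(prompt: str) -> str:
        try:
            return primary(prompt)
        except Exception:
            return fallback(prompt)  # e.g. GPT-4o-mini picking up for GPT-4o
    return invoke
```

The fallback must be bound to the same tools as the primary, or the workflow changes behavior mid-run.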
### 6. **Tool-First Architecture**
**Why:** Many questions require external data
**How:** Rich tool suite with 15+ specialized tools
**Benefit:** Can handle diverse question types
## Configuration
### Environment Variables (.env)
```bash
OPENAI_API_KEY=sk-... # Required for GPT-4o
GEMINI_API_KEY=... # Required for video/audio/image analysis
TAVILY_API_KEY=... # Optional for advanced search
HF_TOKEN=... # For HuggingFace API access
```
### Configurable Parameters
**In BasicAgent:**
- `log_to_file` - Enable/disable logging (default: True)
- `use_cache` - Enable/disable caching (default: True)
- `cache_file` - Cache file path (default: "agent_cache.json")
**In LangGraph:**
- `recursion_limit` - Max iterations (default: 50)
- `temperature` - LLM temperature (default: 0.0)
- `max_retries` - Rate limit retries (default: 5)
## File Structure
```
Final_Assignment_Template/
β”œβ”€β”€ app.py # Main application
β”œβ”€β”€ agent/
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ basic_agent.py # Agent wrapper with caching
β”‚ β”œβ”€β”€ graph.py # LangGraph workflow
β”‚ └── tools.py # Tool implementations
β”œβ”€β”€ downloads/ # Downloaded files (gitignored)
β”‚ β”œβ”€β”€ file1.xlsx
β”‚ β”œβ”€β”€ audio.mp3
β”‚ └── image.png
β”œβ”€β”€ agent_cache.json # Persistent cache (gitignored)
β”œβ”€β”€ agent_run_*.log # Log files (gitignored)
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ .env # Environment variables (gitignored)
β”œβ”€β”€ .gitignore
β”œβ”€β”€ ARCHITECTURE.md # This file
└── README.md # User documentation
```
## Performance Characteristics
### Typical Question Processing Time
- **Simple (cached):** < 0.1 seconds
- **Simple (web search):** 2-5 seconds
- **Medium (file analysis):** 5-15 seconds
- **Complex (multi-step):** 15-60 seconds
- **Multimedia (video/audio):** 30-120 seconds
### LLM Token Usage (per question)
- **Simple:** 500-2,000 tokens
- **Medium:** 2,000-8,000 tokens
- **Complex:** 8,000-20,000 tokens
### Cost Estimates (GPT-4o)
- **Per question (avg):** $0.01-0.05
- **20 questions:** $0.20-1.00
- **With caching (re-runs):** $0.00
## Error Handling
### Graceful Degradation
1. **Cache file corrupted:** Start with empty cache
2. **File download fails:** Continue without file, agent handles gracefully
3. **Tool execution fails:** Return error message, agent tries alternative
4. **LLM rate limit:** Exponential backoff, retry up to 5 times
5. **Primary LLM fails:** Fallback to GPT-4o-mini
6. **Recursion limit hit:** Return best answer so far
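The backoff in item 4 follows the standard exponential-with-jitter pattern. In this sketch `RuntimeError` is a stand-in for the API client's actual rate-limit exception (e.g. the OpenAI client's `RateLimitError`):

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn() on rate-limit errors, doubling the delay each attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for the client's rate-limit error class
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # 1s, 2s, 4s, 8s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```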
### Validation
- All answers validated (never empty)
- File paths validated before access
- API responses validated before processing
- Tool parameters validated before execution
## Testing & Development
### Local Testing
```bash
# Run full evaluation
python app.py
# Check logs
tail -f agent_run_*.log
# View cache
cat agent_cache.json
# Clear cache for fresh run
rm agent_cache.json
```
### Debugging
- All tool calls logged with arguments and results
- Agent reasoning logged at each step
- Errors logged with full stack traces
- Cache hits/misses logged
## Future Enhancements
### Potential Improvements
1. **Pattern-based answering** - Skip LLM for simple questions
2. **Parallel tool execution** - Run independent tools simultaneously
3. **Smarter caching** - Fuzzy matching for similar questions
4. **Cost tracking** - Log token usage and costs
5. **A/B testing** - Compare different prompts/strategies
6. **Streaming responses** - Show progress in real-time
### Scalability Considerations
- Cache can grow large (consider size limits or TTL)
- Multiple concurrent runs need separate cache files
- Rate limiting may need adjustment for production
- Consider database instead of JSON for large-scale caching
## Dependencies
### Core
- `langchain` - LLM framework
- `langgraph` - Workflow orchestration
- `langchain-openai` - OpenAI integration
- `langchain-community` - Community tools
### Tools
- `google-generativeai` - Gemini API
- `tavily-python` - Advanced search
- `duckduckgo-search` - Web search
- `youtube-transcript-api` - YouTube transcripts
- `pandas` - Data analysis
- `openpyxl` - Excel files
### Utilities
- `requests` - HTTP requests
- `python-dotenv` - Environment variables
- `gradio` - Web UI (optional)
## Security Considerations
### API Keys
- Stored in `.env` file (gitignored)
- Never hardcoded in source
- Loaded via `python-dotenv`
### Code Execution
- `python_repl` uses an AST-based REPL (safer than raw `eval`)
- `execute_python_file` runs scripts in a subprocess with a timeout
- No shell is invoked, so there is no shell-injection surface
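A sketch consistent with the bullets above (the 30-second timeout is an assumption, not the repo's configured value):

```python
import subprocess
import sys

def execute_python_file(path: str, timeout: int = 30) -> str:
    """Run a script in a fresh interpreter process, bounded by a timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, path],  # argv list, no shell=True: nothing for a shell to parse
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout if proc.returncode == 0 else f"Error: {proc.stderr}"
    except subprocess.TimeoutExpired:
        return f"Error: script exceeded {timeout}s timeout"
```

The subprocess boundary also means a crashing or infinite-looping script cannot take the agent down with it.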
### File Access
- All file operations use Path validation
- No arbitrary file system access
- Downloads isolated to `downloads/` directory
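One way to enforce that isolation, shown as a sketch rather than the repo's exact check: resolve the requested name against `downloads/` and refuse anything that escapes it.

```python
from pathlib import Path

DOWNLOADS = Path("downloads").resolve()

def safe_path(name: str) -> Path:
    """Reject names that escape downloads/ (e.g. '../.env') before any file access."""
    target = (DOWNLOADS / name).resolve()  # collapses any ../ components
    if DOWNLOADS not in target.parents and target != DOWNLOADS:
        raise ValueError(f"access outside downloads/ refused: {name}")
    return target
```

Resolving *before* the containment check is the important part; comparing unresolved strings is what path-traversal bugs exploit.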
## Monitoring & Observability
### Logs
- Timestamped log files for each run
- Structured log lines with emoji markers for quick scanning
- Tool calls logged with full context
- Errors logged with stack traces
### Metrics (available in logs)
- Questions processed
- Cache hit rate
- Tool usage frequency
- LLM calls per question
- Execution time per question
- Error rate
---
**Version:** 1.0
**Last Updated:** 2025-09-30
**Author:** Leon Woo
**License:** MIT