---
title: Agentbee | GAIA Project | HuggingFace Course
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

Check out the configuration reference at

## Project Overview

**Project Name:** Final_Assignment_Template

**Purpose:** Course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). This project serves as a learning-focused workspace to support iterative agent development and experimentation.

**Target Users:** Students learning agent development through hands-on implementation

**Key Objectives:**
- Build production-ready code that passes GAIA test requirements
- Learn agent development through discovery-based implementation
- Develop a systematic approach to complex AI task solving
- Document the learning process and key decisions

## Project Architecture

**Technology Stack:**
- **Platform:** Hugging Face Spaces with OAuth integration
- **UI Framework:** Gradio 5.x with OAuth support
- **Agent Framework:** LangGraph (state graph orchestration)
- **LLM Providers (4-tier fallback):**
  - Google Gemini 2.0 Flash (free tier)
  - HuggingFace Inference API (free tier)
  - Groq (Llama 3.1 70B / Qwen 3 32B, free tier)
  - Anthropic Claude Sonnet 4.5 (paid tier)
- **Tools:**
  - Web Search: Tavily API / Exa API
  - File Parser: PyPDF2, openpyxl, python-docx, pillow
  - Calculator: Safe expression evaluator
  - Vision: Multimodal LLM (Gemini/Claude)
- **Language:** Python 3.12+
- **Package Manager:** uv

**Project Structure:**
```
Final_Assignment_Template/
├── archive/               # Reference materials, previous solutions, static resources
├── input/                 # Input files, configuration, raw data
├── output/                # Generated files, results, processed data
├── test/                  # Testing files, test scripts (99 tests)
├── dev/                   # Development records (permanent knowledge packages)
├── src/                   # Source code
│   ├── agent/             # Agent orchestration
│   │   ├── graph.py       # LangGraph state machine
│   │   └── llm_client.py  # Multi-provider LLM integration with retry logic
│   └── tools/             # Agent tools
│       ├── __init__.py    # Tool registry
│       ├── web_search.py  # Tavily/Exa web search
│       ├── file_parser.py # Multi-format file reader
│       ├── calculator.py  # Safe math evaluator
│       └── vision.py      # Multimodal image/video analysis
├── app.py                 # Gradio UI with OAuth, LLM provider selection
├── pyproject.toml         # uv package management
├── requirements.txt       # Python dependencies (generated from pyproject.toml)
├── .env                   # Local environment variables (API keys, config)
├── README.md              # Project overview, architecture, workflow, specification
├── CLAUDE.md              # Project-specific AI instructions
├── PLAN.md                # Active implementation plan (temporary workspace)
├── TODO.md                # Active task tracking (temporary workspace)
└── CHANGELOG.md           # Session changelog (temporary workspace)
```

**Core Components:**
- **GAIAAgent class** (src/agent/graph.py): LangGraph-based agent with state machine orchestration
  - Planning node: Analyze question and generate execution plan
  - Tool selection node: LLM function calling for dynamic tool selection
  - Tool execution node: Execute selected tools with timeout and error handling
  - Answer synthesis node: Generate factoid answer from evidence
- **LLM Client** (src/agent/llm_client.py): Multi-provider LLM integration
  - 4-tier fallback chain: Gemini → HuggingFace → Groq → Claude
  - Exponential backoff retry logic (3 attempts per provider)
  - Runtime config for UI-based provider selection
  - Few-shot prompting for improved tool selection
- **Tool System** (src/tools/):
  - Web Search: Tavily/Exa API with query optimization
  - File Parser: Multi-format support (PDF, Excel, Word, CSV, images)
  - Calculator: Safe expression evaluator with graceful error handling
  - Vision: Multimodal analysis for images/videos
- **Gradio UI** (app.py):
  - Test & Debug tab: Single question testing with LLM provider dropdown
  - Full Evaluation tab: Run all GAIA questions with provider selection
  - Results export:
JSON file download for analysis
  - OAuth integration for submission
- **Evaluation Infrastructure**: Pre-built orchestration (question fetching, submission, scoring)

**System Architecture Diagram:**

```mermaid
---
config:
  layout: elk
---
graph TB
    subgraph "UI Layer"
        GradioUI[Gradio UI
LLM Provider Selection
Test & Full Evaluation]
        OAuth[HF OAuth
User authentication]
    end
    subgraph "Agent Orchestration (LangGraph)"
        GAIAAgent[GAIAAgent
State Machine]
        PlanNode[Planning Node
Analyze question]
        ToolSelectNode[Tool Selection Node
LLM function calling]
        ToolExecNode[Tool Execution Node
Run selected tools]
        SynthesizeNode[Answer Synthesis Node
Generate factoid]
    end
    subgraph "LLM Layer (4-Tier Fallback)"
        LLMClient[LLM Client
Retry + Fallback]
        Gemini[Gemini 2.0 Flash
Free Tier 1]
        HF[HuggingFace API
Free Tier 2]
        Groq[Groq Llama/Qwen
Free Tier 3]
        Claude[Claude Sonnet 4.5
Paid Tier 4]
    end
    subgraph "Tool Layer"
        WebSearch[Web Search
Tavily/Exa]
        FileParser[File Parser
PDF/Excel/Word]
        Calculator[Calculator
Safe eval]
        Vision[Vision
Multimodal LLM]
    end
    subgraph "External Services"
        API[GAIA Scoring API]
        QEndpoint["/questions endpoint"]
        SEndpoint["/submit endpoint"]
    end
    GradioUI --> OAuth
    OAuth -->|Authenticated| GAIAAgent
    GAIAAgent --> PlanNode
    PlanNode --> ToolSelectNode
    ToolSelectNode --> ToolExecNode
    ToolExecNode --> SynthesizeNode
    PlanNode --> LLMClient
    ToolSelectNode --> LLMClient
    SynthesizeNode --> LLMClient
    LLMClient -->|Try 1| Gemini
    LLMClient -->|Fallback 2| HF
    LLMClient -->|Fallback 3| Groq
    LLMClient -->|Fallback 4| Claude
    ToolExecNode --> WebSearch
    ToolExecNode --> FileParser
    ToolExecNode --> Calculator
    ToolExecNode --> Vision
    GAIAAgent -->|Answers| API
    API --> QEndpoint
    API --> SEndpoint
    SEndpoint -->|Score| GradioUI
    style GAIAAgent fill:#ffcccc
    style LLMClient fill:#fff4cc
    style GradioUI fill:#cce5ff
    style API fill:#d9f2d9
```

## Project Specification

**Project Context:** This is a course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). The project was recently started as a learning-focused workspace to support iterative agent development and experimentation.

**Current State:**
- **Status:** Stage 5 Complete - Performance Optimization
- **Development Progress:**
  - Stage 1-2: Basic infrastructure and LangGraph setup ✅
  - Stage 3: Multi-provider LLM integration ✅
  - Stage 4: Tool system and MVP (10% GAIA score: 2/20 questions) ✅
  - Stage 5: Performance optimization (retry logic, Groq integration, improved prompts) ✅
- **Current Performance:** Testing in progress (target: 25% accuracy, 5/20 questions)
- **Next Stage:** Stage 6 - Advanced optimizations based on Stage 5 results

**Data & Workflows:**
- **Input Data:** GAIA test questions fetched from the external scoring API (`agents-course-unit4-scoring.hf.space`)
- **Processing:** BasicAgent class processes questions and generates answers
- **Output:** Agent responses submitted to the scoring endpoint for evaluation
- **Development Workflow:**
  1. Local development and testing
  2.
Deploy to Hugging Face Space
  3. Submit via integrated evaluation UI

**User Workflow Diagram:**

```mermaid
---
config:
  layout: fixed
---
flowchart TB
    Start(["Student starts assignment"]) --> Clone["Clone HF Space template"]
    Clone --> LocalDev["Local development:
Implement BasicAgent logic"]
    LocalDev --> LocalTest{"Test locally?"}
    LocalTest -- Yes --> RunLocal["Run app locally"]
    RunLocal --> Debug{"Works?"}
    Debug -- No --> LocalDev
    Debug -- Yes --> Deploy["Deploy to HF Space"]
    LocalTest -- Skip --> Deploy
    Deploy --> Login["Login with HF OAuth"]
    Login --> RunEval@{ label: "Click 'Run Evaluation'
button in UI" }
    RunEval --> FetchQ["System fetches GAIA
questions from API"]
    FetchQ --> RunAgent["Agent processes
each question"]
    RunAgent --> Submit["Submit answers
to scoring API"]
    Submit --> Display["Display score
and results"]
    Display --> Iterate{"Satisfied with
score?"}
    Iterate -- "No - improve agent" --> LocalDev
    Iterate -- Yes --> Complete(["Assignment complete"])
    RunEval@{ shape: rect}
    style Start fill:#e1f5e1
    style LocalDev fill:#fff4e1
    style Deploy fill:#e1f0ff
    style RunAgent fill:#ffe1f0
    style Complete fill:#e1f5e1
```

**Technical Architecture:**
- **Platform:** Hugging Face Spaces with OAuth integration
- **Framework:** Gradio for UI, Requests for API communication
- **Core Component:** BasicAgent class (student-customizable template)
- **Evaluation Infrastructure:** Pre-built orchestration (question fetching, submission, scoring display)
- **Deployment:** HF Space with environment variables (SPACE_ID, SPACE_HOST)

**Requirements & Constraints:**
- **Constraint Type:** Minimal at current stage
- **Infrastructure:** Must run on the Hugging Face Spaces platform
- **Integration:** Fixed scoring API endpoints (cannot modify the evaluation system)
- **Flexibility:** Students have full freedom to design agent capabilities

**Integration Points:**
- **External API:** `https://agents-course-unit4-scoring.hf.space`
  - `/questions` endpoint: Fetch GAIA test questions
  - `/submit` endpoint: Submit answers and receive scores
- **Authentication:** Hugging Face OAuth for student identification
- **Deployment:** HF Space runtime environment variables

**Development Goals:**
- **Primary:** Achieve competitive GAIA benchmark performance through systematic optimization
- **Focus:** Multi-tier LLM architecture with free-tier prioritization to minimize costs
- **Key Features:**
  - 4-tier LLM fallback for quota resilience (Gemini → HF → Groq → Claude)
  - Exponential backoff retry logic for quota/rate limit errors
  - UI-based LLM provider selection for easy A/B testing in the cloud
  - Comprehensive tool system (web search, file parsing, calculator, vision)
  - Graceful error handling and degradation
  - Extensive test coverage (99 tests)
- **Documentation:** Full dev record workflow tracking all decisions and changes

## Key Features

### LLM Provider Selection (UI-Based)

**Local Testing (.env configuration):**
```bash
LLM_PROVIDER=gemini        # Options: gemini, huggingface, groq, claude
ENABLE_LLM_FALLBACK=false  # Disable fallback for debugging a single provider
```

**Cloud Testing (HuggingFace Spaces):**
- Use the UI dropdowns in the Test & Debug tab or the Full Evaluation tab
- Select from: Gemini, HuggingFace, Groq, Claude
- Toggle fallback behavior with a checkbox
- No environment variable changes needed; provider switching is instant

**Benefits:**
- Easy A/B testing between providers
- Clear visibility into which LLM is used
- Isolated testing for debugging
- Production safety with fallback enabled

### Retry Logic

- **Exponential backoff:** 3 attempts with 1s, 2s, 4s delays
- **Error detection:** 429 status, quota errors, rate limits
- **Scope:** All LLM calls (planning, tool selection, synthesis)

### Tool System

**Web Search (Tavily/Exa):**
- Factual information, current events, statistics
- Wikipedia, company info, people

**File Parser:**
- PDF, Excel, Word, CSV, Text, Images
- Handles uploaded files and local paths

**Calculator:**
- Safe expression evaluation
- Arithmetic, algebra, trigonometry, logarithms
- Functions: sqrt, sin, cos, log, abs, etc.

**Vision:**
- Multimodal image/video analysis
- Describe content, identify objects, read text
- YouTube video understanding

### Performance Optimizations (Stage 5)

- Few-shot prompting for improved tool selection
- Graceful vision question skip when quota is exhausted
- Relaxed calculator validation (returns error dicts instead of crashing)
- Improved tool descriptions with "Use when..." guidance
- Config-based provider debugging

## GAIA Benchmark Results

**Baseline (Stage 4):** 10% accuracy (2/20 questions correct)

**Stage 5 Target:** 25% accuracy (5/20 questions correct)
- Status: Testing in progress
- Expected improvements from retry logic, Groq integration, improved prompts

**Test Coverage:** 99 passing tests (~2min 40sec runtime)

> **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See the [GAIA Submission Guide](../agentbee/docs/gaia_submission_guide.md) for the distinction between the Course and Official GAIA leaderboards.

## Workflow

### Dev Record Workflow

**Philosophy:** Dev records are the single source of truth. CHANGELOG/PLAN/TODO are temporary workspace files.

**Dev Record Types:**
- 🐞 **Issue:** Problem-solving, bug fixes, error resolution
- 🔨 **Development:** Feature development, enhancements, new functionality

### Session Start Workflow

#### Phase 1: Planning (Explicit)

1. **Create or identify dev record:** `dev/dev_YYMMDD_##_concise_title.md`
   - Choose type: 🐞 Issue or 🔨 Development
2. **Create PLAN.md ONLY:** Use the `/plan` command or write directly
   - Document implementation approach, steps, files to modify
   - DO NOT create TODO.md or CHANGELOG.md yet

#### Phase 2: Development (Automatic)

3. **Create TODO.md:** Automatically populate as you start implementing
   - Track tasks in real-time using the TodoWrite tool
   - Mark in_progress/completed as you work
4. **Create CHANGELOG.md:** Automatically populate as you make changes
   - Record file modifications/creations/deletions as they happen
5. **Work on solution:** Update all three files during development

### Session End Workflow

#### Phase 3: Completion (Manual)

After AI completes all work and updates PLAN/TODO/CHANGELOG:
- AI stops and waits for user review (Checkpoint 3)
- User reviews PLAN.md, TODO.md, and CHANGELOG.md
- User manually runs `/update-dev dev_YYMMDD_##` when satisfied

When /update-dev runs:
1. Distills PLAN decisions → dev record "Key Decisions" section
2. Distills TODO deliverables → dev record "Outcome" section
3. Distills CHANGELOG changes → dev record "Changelog" section
4. Empties PLAN.md, TODO.md, CHANGELOG.md back to templates
5. Marks dev record status as ✅ Resolved

### AI Context Loading Protocol

**MANDATORY - Execute in exact order. NO delegating to sub-agents for initial context.**

**Phase 1: Current State (What's happening NOW)**

1. **Read workspace files:**
   - `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
   - `PLAN.md` - Current implementation plan (if exists)
   - `TODO.md` - Active task tracking (if exists)
2. **Read actual outputs (CRITICAL - verify claims, don't trust summaries):**
   - Latest files in the `output/` folder (sorted by timestamp, newest first)
   - For GAIA projects: Read the latest `output/gaia_results_*.json` completely
     - Check `metadata.score_percent` and `metadata.correct_count`
     - Read ALL `results[].submitted_answer` to understand failure patterns
     - Identify error categories (vision failures, tool errors, wrong answers)
   - For test projects: Read the latest test output logs
   - **Purpose:** Ground truth of what ACTUALLY happened, not what was claimed

**Phase 2: Recent History (What was done recently)**

3. **Read the last 3 dev records from the `dev/` folder:**
   - Sort by filename (newest `dev_YYMMDD_##_title.md` first)
   - Read: Problem Description, Key Decisions, Outcome, Changelog
   - **Cross-verify:** Compare dev record claims with actual output files
   - **Red flag:** If a dev record says "25% accuracy" but the latest JSON shows "0%", prioritize the JSON truth

**Phase 3: Project Structure (How it works)**

4. **Read README.md sections in order:**
   - Section 1: Overview (purpose, objectives)
   - Section 2: Architecture (tech stack, components, diagrams)
   - Section 3: Specification (current state, workflows, requirements)
   - Section 4: Workflow (this protocol)
5. **Read CLAUDE.md:**
   - Project-specific coding standards
   - Usually empty (inherits from global ~/.claude/CLAUDE.md)

**Phase 4: Code Structure (Critical files)**

6. **Identify critical files from the README.md Architecture section:**
   - Note main entry points (e.g., `app.py`)
   - Note core logic files (e.g., `src/agent/graph.py`, `src/agent/llm_client.py`)
   - Note tool implementations (e.g., `src/tools/*.py`)
   - **DO NOT read these yet** - only note their locations for later reference

**Verification Checklist (Before claiming "I have context"):**
- [ ] I personally read CHANGELOG.md, PLAN.md, TODO.md (not delegated)
- [ ] I personally read the latest output files (JSON results, test logs, etc.)
- [ ] I know the ACTUAL current accuracy/status from output files
- [ ] I read the last 3 dev records and cross-verified claims with output data
- [ ] I read README.md sections 1-4 completely
- [ ] I can answer: "What is the current status and why?"
- [ ] I can answer: "What were the last 3 major changes and their outcomes?"
- [ ] I can answer: "What specific problems exist based on the latest outputs?"

**Anti-Patterns (NEVER do these):**
- ❌ Delegate initial context loading to Explore/Task agents
- ❌ Trust dev record claims without verifying against output files
- ❌ Skip reading actual output data (JSON results, logs, test outputs)
- ❌ Claim "I have context" after only reading summaries
- ❌ Read code files before understanding current state from outputs

**Context Priority:** Latest Outputs (ground truth) > CHANGELOG (active work) > Dev Records (history) > README (structure)
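The Phase 1 ground-truth check on `output/` can be sketched in a few lines of Python. This is a minimal sketch, assuming the `output/gaia_results_*.json` layout described above (`metadata.score_percent`, `metadata.correct_count`, `results[].submitted_answer`); the helper names `latest_results` and `summarize` are hypothetical, not part of the project's code.

```python
import glob
import json
import os

def latest_results(output_dir="output"):
    """Parse the newest gaia_results_*.json, or return None if none exist."""
    paths = sorted(glob.glob(os.path.join(output_dir, "gaia_results_*.json")))
    if not paths:
        return None
    with open(paths[-1], encoding="utf-8") as f:
        return json.load(f)

def summarize(data):
    """Ground truth: the actual score plus every submitted answer."""
    meta = data.get("metadata", {})
    lines = [f"score: {meta.get('score_percent')}% ({meta.get('correct_count')} correct)"]
    for r in data.get("results", []):
        lines.append(f"- {r.get('task_id')}: {r.get('submitted_answer')!r}")
    return "\n".join(lines)
```

Reading a summary produced this way, rather than trusting a dev record's claim, is exactly the "prioritize the JSON truth" cross-check described in Phase 2.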
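As a closing illustration, the retry-plus-fallback behavior described under Key Features (4-tier provider chain, exponential backoff on quota errors) can be sketched as below. This is a hedged sketch, not the project's actual `llm_client.py`: the `call` dispatch function is a hypothetical stand-in, and `RuntimeError` stands in for whatever 429/quota exception a real provider SDK raises.

```python
import time

PROVIDERS = ["gemini", "huggingface", "groq", "claude"]  # 4-tier order

def call_with_fallback(prompt, call, providers=PROVIDERS, attempts=3, base_delay=1.0):
    """Try each provider in order; retry each up to `attempts` times with
    exponential backoff (1s, 2s, 4s by default) before falling through."""
    last_error = None
    for provider in providers:
        for attempt in range(attempts):
            try:
                return call(provider, prompt)  # hypothetical provider dispatch
            except RuntimeError as exc:        # stand-in for 429/quota errors
                last_error = exc
                time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all providers exhausted: {last_error}")
```

With a stub that fails on the earlier free tiers, the chain degrades in order (Gemini → HuggingFace → Groq) and returns the first successful response, which is the quota-resilience property the fallback design aims for.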