---
title: Agentbee | GAIA Project | HuggingFace Course
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

Check out the configuration reference at <https://huggingface.co/docs/hub/spaces-config-reference>

## Project Overview

**Project Name:** Final_Assignment_Template

**Purpose:** Course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). This project serves as a learning-focused workspace to support iterative agent development and experimentation.

**Target Users:** Students learning agent development through hands-on implementation

**Key Objectives:**

- Build production-ready code that passes GAIA test requirements
- Learn agent development through discovery-based implementation
- Develop a systematic approach to complex AI task solving
- Document learning process and key decisions

## Project Architecture

**Technology Stack:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **UI Framework:** Gradio 6.x (pinned by `sdk_version: 6.3.0` in the Space config) with OAuth support
- **Agent Framework:** LangGraph (state graph orchestration)
- **LLM Providers (4-tier fallback):**
  - Google Gemini 2.0 Flash (free tier)
  - HuggingFace Inference API (free tier)
  - Groq (Llama 3.1 70B / Qwen 3 32B, free tier)
  - Anthropic Claude Sonnet 4.5 (paid tier)
- **Tools:**
  - Web Search: Tavily API / Exa API
  - File Parser: PyPDF2, openpyxl, python-docx, pillow
  - Calculator: Safe expression evaluator
  - Vision: Multimodal LLM (Gemini/Claude)
- **Language:** Python 3.12+
- **Package Manager:** uv

**Project Structure:**

```
Final_Assignment_Template/
├── archive/             # Reference materials, previous solutions, static resources
├── input/               # Input files, configuration, raw data
├── output/              # Generated files, results, processed data
├── test/                # Testing files, test scripts (99 tests)
├── dev/                 # Development records (permanent knowledge packages)
├── src/                 # Source code
│   ├── agent/           # Agent orchestration
│   │   ├── graph.py     # LangGraph state machine
│   │   └── llm_client.py # Multi-provider LLM integration with retry logic
│   └── tools/           # Agent tools
│       ├── __init__.py  # Tool registry
│       ├── web_search.py    # Tavily/Exa web search
│       ├── file_parser.py   # Multi-format file reader
│       ├── calculator.py    # Safe math evaluator
│       └── vision.py        # Multimodal image/video analysis
├── app.py               # Gradio UI with OAuth, LLM provider selection
├── pyproject.toml       # uv package management
├── requirements.txt     # Python dependencies (generated from pyproject.toml)
├── .env                 # Local environment variables (API keys, config)
├── README.md            # Project overview, architecture, workflow, specification
├── CLAUDE.md            # Project-specific AI instructions
├── PLAN.md              # Active implementation plan (temporary workspace)
├── TODO.md              # Active task tracking (temporary workspace)
└── CHANGELOG.md         # Session changelog (temporary workspace)
```

**Core Components:**

- **GAIAAgent class** (src/agent/graph.py): LangGraph-based agent with state machine orchestration
  - Planning node: Analyze question and generate execution plan
  - Tool selection node: LLM function calling for dynamic tool selection
  - Tool execution node: Execute selected tools with timeout and error handling
  - Answer synthesis node: Generate factoid answer from evidence
- **LLM Client** (src/agent/llm_client.py): Multi-provider LLM integration
  - 4-tier fallback chain: Gemini → HuggingFace → Groq → Claude
  - Exponential backoff retry logic (3 attempts per provider)
  - Runtime config for UI-based provider selection
  - Few-shot prompting for improved tool selection
- **Tool System** (src/tools/):
  - Web Search: Tavily/Exa API with query optimization
  - File Parser: Multi-format support (PDF, Excel, Word, CSV, images)
  - Calculator: Safe expression evaluator with graceful error handling
  - Vision: Multimodal analysis for images/videos
- **Gradio UI** (app.py):
  - Test & Debug tab: Single question testing with LLM provider dropdown
  - Full Evaluation tab: Run all GAIA questions with provider selection
  - Results export: JSON file download for analysis
  - OAuth integration for submission
- **Evaluation Infrastructure**: Pre-built orchestration (question fetching, submission, scoring)
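
The four-node flow described above can be sketched in plain Python. This is a minimal illustration, not the project's `GAIAAgent`: it does not use LangGraph, the LLM calls are replaced by placeholders, and every name in it (`AgentState`, `run_pipeline`, the one-entry tool registry) is hypothetical.

```python
from typing import Callable, TypedDict


class AgentState(TypedDict, total=False):
    """Hypothetical state passed from node to node."""
    question: str
    plan: str
    tool_calls: list[dict]
    evidence: list[str]
    answer: str


def plan_node(state: AgentState) -> AgentState:
    # Placeholder for the LLM call that drafts an execution plan.
    return {**state, "plan": f"Find evidence for: {state['question']}"}


def select_tools_node(state: AgentState) -> AgentState:
    # Placeholder for LLM function calling; this stub always picks the calculator.
    call = {"tool": "calculator", "args": {"expression": "2 + 2"}}
    return {**state, "tool_calls": [call]}


def execute_tools_node(state: AgentState) -> AgentState:
    # Hypothetical one-entry registry standing in for src/tools/__init__.py.
    tools: dict[str, Callable[..., str]] = {
        "calculator": lambda expression: str(sum(int(x) for x in expression.split("+"))),
    }
    evidence = []
    for call in state.get("tool_calls", []):
        try:
            evidence.append(tools[call["tool"]](**call["args"]))
        except Exception as exc:  # degrade gracefully instead of aborting the run
            evidence.append(f"ERROR: {exc}")
    return {**state, "evidence": evidence}


def synthesize_node(state: AgentState) -> AgentState:
    # Placeholder for the LLM call that condenses evidence into a factoid answer.
    evidence = state.get("evidence", [])
    return {**state, "answer": evidence[0] if evidence else "Unable to answer"}


def run_pipeline(question: str) -> AgentState:
    """Run the nodes in the order shown in the architecture diagram."""
    state: AgentState = {"question": question}
    for node in (plan_node, select_tools_node, execute_tools_node, synthesize_node):
        state = node(state)
    return state
```

Calling `run_pipeline("What is 2 + 2?")` walks the state through planning, tool selection, tool execution, and synthesis in that order. In a graph framework each function would be registered as a node and the `for` loop replaced by edges.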

**System Architecture Diagram:**

```mermaid
---
config:
  layout: elk
---
graph TB
    subgraph "UI Layer"
        GradioUI[Gradio UI<br/>LLM Provider Selection<br/>Test & Full Evaluation]
        OAuth[HF OAuth<br/>User authentication]
    end

    subgraph "Agent Orchestration (LangGraph)"
        GAIAAgent[GAIAAgent<br/>State Machine]
        PlanNode[Planning Node<br/>Analyze question]
        ToolSelectNode[Tool Selection Node<br/>LLM function calling]
        ToolExecNode[Tool Execution Node<br/>Run selected tools]
        SynthesizeNode[Answer Synthesis Node<br/>Generate factoid]
    end

    subgraph "LLM Layer (4-Tier Fallback)"
        LLMClient[LLM Client<br/>Retry + Fallback]
        Gemini[Gemini 2.0 Flash<br/>Free Tier 1]
        HF[HuggingFace API<br/>Free Tier 2]
        Groq[Groq Llama/Qwen<br/>Free Tier 3]
        Claude[Claude Sonnet 4.5<br/>Paid Tier 4]
    end

    subgraph "Tool Layer"
        WebSearch[Web Search<br/>Tavily/Exa]
        FileParser[File Parser<br/>PDF/Excel/Word]
        Calculator[Calculator<br/>Safe eval]
        Vision[Vision<br/>Multimodal LLM]
    end

    subgraph "External Services"
        API[GAIA Scoring API]
        QEndpoint["/questions endpoint"]
        SEndpoint["/submit endpoint"]
    end

    GradioUI --> OAuth
    OAuth -->|Authenticated| GAIAAgent
    GAIAAgent --> PlanNode
    PlanNode --> ToolSelectNode
    ToolSelectNode --> ToolExecNode
    ToolExecNode --> SynthesizeNode

    PlanNode --> LLMClient
    ToolSelectNode --> LLMClient
    SynthesizeNode --> LLMClient

    LLMClient -->|Try 1| Gemini
    LLMClient -->|Fallback 2| HF
    LLMClient -->|Fallback 3| Groq
    LLMClient -->|Fallback 4| Claude

    ToolExecNode --> WebSearch
    ToolExecNode --> FileParser
    ToolExecNode --> Calculator
    ToolExecNode --> Vision

    GAIAAgent -->|Answers| API
    API --> QEndpoint
    API --> SEndpoint
    SEndpoint -->|Score| GradioUI

    style GAIAAgent fill:#ffcccc
    style LLMClient fill:#fff4cc
    style GradioUI fill:#cce5ff
    style API fill:#d9f2d9
```

## Project Specification

**Project Context:**

This project started from the course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). It is a learning-focused workspace for iterative agent development and experimentation.

**Current State:**

- **Status:** Stage 5 Complete - Performance Optimization
- **Development Progress:**
  - Stage 1-2: Basic infrastructure and LangGraph setup ✅
  - Stage 3: Multi-provider LLM integration ✅
  - Stage 4: Tool system and MVP (10% GAIA score: 2/20 questions) ✅
  - Stage 5: Performance optimization (retry logic, Groq integration, improved prompts) ✅
- **Current Performance:** Testing in progress (target: 25% accuracy, 5/20 questions)
- **Next Stage:** Stage 6 - Advanced optimizations based on Stage 5 results

**Data & Workflows:**

- **Input Data:** GAIA test questions fetched from external scoring API (`agents-course-unit4-scoring.hf.space`)
- **Processing:** The agent (`BasicAgent` in the course template, implemented here as `GAIAAgent` in `src/agent/graph.py`) processes each question and generates an answer
- **Output:** Agent responses submitted to scoring endpoint for evaluation
- **Development Workflow:**
  1. Local development and testing
  2. Deploy to Hugging Face Space
  3. Submit via integrated evaluation UI

**User Workflow Diagram:**

```mermaid
---
config:
  layout: fixed
---
flowchart TB
    Start(["Student starts assignment"]) --> Clone["Clone HF Space template"]
    Clone --> LocalDev["Local development:<br>Implement BasicAgent logic"]
    LocalDev --> LocalTest{"Test locally?"}
    LocalTest -- Yes --> RunLocal["Run app locally"]
    RunLocal --> Debug{"Works?"}
    Debug -- No --> LocalDev
    Debug -- Yes --> Deploy["Deploy to HF Space"]
    LocalTest -- Skip --> Deploy
    Deploy --> Login["Login with HF OAuth"]
    Login --> RunEval@{ label: "Click 'Run Evaluation'<br>button in UI" }
    RunEval --> FetchQ["System fetches GAIA<br>questions from API"]
    FetchQ --> RunAgent["Agent processes<br>each question"]
    RunAgent --> Submit["Submit answers<br>to scoring API"]
    Submit --> Display["Display score<br>and results"]
    Display --> Iterate{"Satisfied with<br>score?"}
    Iterate -- "No - improve agent" --> LocalDev
    Iterate -- Yes --> Complete(["Assignment complete"])

    RunEval@{ shape: rect}
    style Start fill:#e1f5e1
    style LocalDev fill:#fff4e1
    style Deploy fill:#e1f0ff
    style RunAgent fill:#ffe1f0
    style Complete fill:#e1f5e1
```

**Technical Architecture:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **Framework:** Gradio for UI, Requests for API communication
- **Core Component:** BasicAgent class (student-customizable template)
- **Evaluation Infrastructure:** Pre-built orchestration (question fetching, submission, scoring display)
- **Deployment:** HF Space with environment variables (SPACE_ID, SPACE_HOST)

**Requirements & Constraints:**

- **Constraint Type:** Minimal at current stage
- **Infrastructure:** Must run on Hugging Face Spaces platform
- **Integration:** Fixed scoring API endpoints (cannot modify evaluation system)
- **Flexibility:** Students have full freedom to design agent capabilities

**Integration Points:**

- **External API:** `https://agents-course-unit4-scoring.hf.space`
  - `/questions` endpoint: Fetch GAIA test questions
  - `/submit` endpoint: Submit answers and receive scores
- **Authentication:** Hugging Face OAuth for student identification
- **Deployment:** HF Space runtime environment variables
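
A minimal sketch of talking to the two endpoints listed above. It uses the standard library's `urllib` instead of Requests so that it has no third-party dependencies. The payload field names (`username`, `agent_code`, `answers`, `task_id`, `submitted_answer`) are assumptions for illustration and should be checked against the evaluation code in `app.py`; the helper names are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://agents-course-unit4-scoring.hf.space"


def endpoint(path: str) -> str:
    """Join the API base URL and an endpoint path such as 'questions'."""
    return f"{API_BASE}/{path.lstrip('/')}"


def fetch_questions(timeout: float = 15.0) -> list:
    """GET /questions and decode the JSON body."""
    with urllib.request.urlopen(endpoint("questions"), timeout=timeout) as response:
        return json.load(response)


def build_submission(username: str, agent_code: str, answers: dict[str, str]) -> dict:
    """Assemble a submission payload.

    ASSUMPTION: the field names below are illustrative; confirm them
    against the pre-built evaluation code before relying on them.
    """
    return {
        "username": username,
        "agent_code": agent_code,
        "answers": [
            {"task_id": task_id, "submitted_answer": answer}
            for task_id, answer in answers.items()
        ],
    }


def submit(payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload to /submit as JSON and decode the response."""
    request = urllib.request.Request(
        endpoint("submit"),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping payload construction separate from the network call makes the payload testable offline.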

**Development Goals:**

- **Primary:** Achieve competitive GAIA benchmark performance through systematic optimization
- **Focus:** Multi-tier LLM architecture with free-tier prioritization to minimize costs
- **Key Features:**
  - 4-tier LLM fallback for quota resilience (Gemini → HF → Groq → Claude)
  - Exponential backoff retry logic for quota/rate limit errors
  - UI-based LLM provider selection for easy A/B testing in the cloud
  - Comprehensive tool system (web search, file parsing, calculator, vision)
  - Graceful error handling and degradation
  - Extensive test coverage (99 tests)
- **Documentation:** Full dev record workflow tracking all decisions and changes

## Key Features

### LLM Provider Selection (UI-Based)

**Local Testing (.env configuration):**

```bash
LLM_PROVIDER=gemini          # Options: gemini, huggingface, groq, claude
ENABLE_LLM_FALLBACK=false    # Disable fallback for debugging single provider
```
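
One way these two variables could be turned into an ordered list of providers to try is sketched below. The defaults (`gemini`, fallback enabled) and the rule that the selected provider goes first, followed by the others in README order, are assumptions; `provider_chain` is a hypothetical helper, not a function from `llm_client.py`.

```python
import os

# Fallback order as documented in this README: Gemini → HuggingFace → Groq → Claude.
FALLBACK_ORDER = ["gemini", "huggingface", "groq", "claude"]


def provider_chain(env=None) -> list[str]:
    """Return the providers to try, in order, based on environment settings."""
    env = os.environ if env is None else env
    primary = env.get("LLM_PROVIDER", "gemini").strip().lower()  # assumed default
    if primary not in FALLBACK_ORDER:
        raise ValueError(f"Unknown LLM_PROVIDER: {primary!r}")

    flag = env.get("ENABLE_LLM_FALLBACK", "true").strip().lower()  # assumed default
    if flag not in {"1", "true", "yes"}:
        # Fallback disabled: only the selected provider is used.
        return [primary]

    # Fallback enabled: selected provider first, then the rest in documented order.
    return [primary] + [name for name in FALLBACK_ORDER if name != primary]
```

With `LLM_PROVIDER=groq` and `ENABLE_LLM_FALLBACK=false` the chain is `["groq"]`, which keeps a debugging run isolated to a single provider.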

**Cloud Testing (HuggingFace Spaces):**

- Use UI dropdowns in Test & Debug tab or Full Evaluation tab
- Select from: Gemini, HuggingFace, Groq, Claude
- Toggle fallback behavior with checkbox
- No environment variable changes needed; provider switching takes effect immediately

**Benefits:**

- Easy A/B testing between providers
- Clear visibility into which LLM is used
- Isolated testing for debugging
- Production safety with fallback enabled

### Retry Logic

- **Exponential backoff:** 3 attempts with 1s, 2s, 4s delays
- **Error detection:** 429 status, quota errors, rate limits
- **Scope:** All LLM calls (planning, tool selection, synthesis)
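
The retry behavior can be sketched as follows. This is an illustrative version, not the code in `llm_client.py`: detecting rate limits by substring-matching the error message is an assumption, as is sleeping after every failed attempt, including the last one before the caller moves on to the next provider. The names `call_with_backoff` and `is_rate_limit_error` are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# ASSUMPTION: quota/rate-limit errors are recognised from the error message text.
RATE_LIMIT_MARKERS = ("429", "quota", "rate limit")


def is_rate_limit_error(exc: Exception) -> bool:
    message = str(exc).lower()
    return any(marker in message for marker in RATE_LIMIT_MARKERS)


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call fn, retrying on rate-limit errors with delays of 1s, 2s, 4s."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_exc: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_rate_limit_error(exc):
                raise  # unrelated errors are not retried
            last_exc = exc
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
    assert last_exc is not None
    raise last_exc  # all attempts exhausted; caller may fall back to another provider
```

The `sleep` parameter is injectable so the function can be tested without real waiting.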

### Tool System

**Web Search (Tavily/Exa):**

- Factual information, current events, statistics
- Wikipedia, company info, people

**File Parser:**

- PDF, Excel, Word, CSV, Text, Images
- Handles uploaded files and local paths
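
Multi-format parsing is commonly organised as a dispatch table keyed by file extension. The sketch below covers only text and CSV using the standard library; handlers for PDF, Excel, Word, and images would be registered the same way, backed by the libraries named in the Technology Stack. The names `PARSERS` and `parse_file` and the result-dict keys are hypothetical.

```python
import csv
from pathlib import Path


def parse_text(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def parse_csv(path: str) -> list[list[str]]:
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Hypothetical registry: extension -> handler. Other formats plug in here.
PARSERS = {
    ".txt": parse_text,
    ".csv": parse_csv,
}


def parse_file(path: str) -> dict:
    """Pick a parser by extension and return a result dict instead of raising."""
    suffix = Path(path).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        return {"success": False, "error": f"Unsupported file type: {suffix or '(none)'}"}
    try:
        return {"success": True, "content": parser(path)}
    except OSError as exc:
        return {"success": False, "error": str(exc)}
```

Returning a dict for both outcomes lets the tool execution node record a failure as evidence rather than crash.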

**Calculator:**

- Safe expression evaluation
- Arithmetic, algebra, trigonometry, logarithms
- Functions: sqrt, sin, cos, log, abs, etc.
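
A common way to evaluate expressions safely is to parse them with `ast` and allow only a fixed set of node types, operators, and functions. The sketch below shows that technique; it is not the implementation in `calculator.py`, and the result-dict keys are assumptions. It does not bound exponent size, so a production version would also need limits on inputs such as `9 ** 9 ** 9`.

```python
import ast
import math
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCTIONS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos, "log": math.log, "abs": abs}
_CONSTANTS = {"pi": math.pi, "e": math.e}


def _eval(node):
    """Recursively evaluate only allow-listed AST nodes."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    elif isinstance(node, ast.Name) and node.id in _CONSTANTS:
        return _CONSTANTS[node.id]
    elif (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in _FUNCTIONS
        and not node.keywords
    ):
        return _FUNCTIONS[node.func.id](*[_eval(arg) for arg in node.args])
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str) -> dict:
    """Evaluate a math expression; return an error dict rather than raising."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"success": True, "result": _eval(tree.body)}
    except Exception as exc:
        return {"success": False, "error": str(exc)}
```

Anything outside the allow-list, such as attribute access or a call to `__import__`, is rejected and reported as an error dict.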

**Vision:**

- Multimodal image/video analysis
- Describe content, identify objects, read text
- YouTube video understanding

### Performance Optimizations (Stage 5)

- Few-shot prompting for improved tool selection
- Gracefully skip vision questions when quota is exhausted
- Relaxed calculator validation (returns error dicts instead of crashes)
- Improved tool descriptions with "Use when..." guidance
- Config-based provider debugging

## GAIA Benchmark Results

**Baseline (Stage 4):** 10% accuracy (2/20 questions correct)

**Stage 5 Target:** 25% accuracy (5/20 questions correct)

- Status: Testing in progress
- Expected improvements from retry logic, Groq integration, improved prompts

**Test Coverage:** 99 passing tests (~2 min 40 s runtime)

> **Note:** This project targets the **Course Leaderboard** (20 questions, 30% target). See [GAIA Submission Guide](../agentbee/docs/gaia_submission_guide.md) for the distinction between the Course and Official GAIA leaderboards.

## Workflow

### Dev Record Workflow

**Philosophy:** Dev records are the single source of truth. CHANGELOG/PLAN/TODO are temporary workspace files.

**Dev Record Types:**

- 🐞 **Issue:** Problem-solving, bug fixes, error resolution
- 🔨 **Development:** Feature development, enhancements, new functionality

### Session Start Workflow

#### Phase 1: Planning (Explicit)

1. **Create or identify dev record:** `dev/dev_YYMMDD_##_concise_title.md`
   - Choose type: 🐞 Issue or 🔨 Development
2. **Create PLAN.md ONLY:** Use `/plan` command or write directly
   - Document implementation approach, steps, files to modify
   - DO NOT create TODO.md or CHANGELOG.md yet

#### Phase 2: Development (Automatic)

3. **Create TODO.md:** Automatically populate as you start implementing
   - Track tasks in real-time using TodoWrite tool
   - Mark in_progress/completed as you work
4. **Create CHANGELOG.md:** Automatically populate as you make changes
   - Record file modifications/creations/deletions as they happen
5. **Work on solution:** Update all three files during development

### Session End Workflow

#### Phase 3: Completion (Manual)

After AI completes all work and updates PLAN/TODO/CHANGELOG:

- AI stops and waits for user review (Checkpoint 3)
- User reviews PLAN.md, TODO.md, and CHANGELOG.md
- User manually runs `/update-dev dev_YYMMDD_##` when satisfied

When `/update-dev` runs:

1. Distills PLAN decisions → dev record "Key Decisions" section
2. Distills TODO deliverables → dev record "Outcome" section
3. Distills CHANGELOG changes → dev record "Changelog" section
4. Empties PLAN.md, TODO.md, CHANGELOG.md back to templates
5. Marks dev record status as ✅ Resolved

### AI Context Loading Protocol

**MANDATORY - Execute in exact order. NO delegating to sub-agents for initial context.**

**Phase 1: Current State (What's happening NOW)**

1. **Read workspace files:**

   - `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
   - `PLAN.md` - Current implementation plan (if exists)
   - `TODO.md` - Active task tracking (if exists)

2. **Read actual outputs (CRITICAL - verify claims, don't trust summaries):**
   - Latest files in `output/` folder (sorted by timestamp, newest first)
   - For GAIA projects: Read latest `output/gaia_results_*.json` completely
     - Check `metadata.score_percent` and `metadata.correct_count`
     - Read ALL `results[].submitted_answer` to understand failure patterns
     - Identify error categories (vision failures, tool errors, wrong answers)
   - For test projects: Read latest test output logs
   - **Purpose:** Ground truth of what ACTUALLY happened, not what was claimed
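
The output check in step 2 can be scripted. The sketch below locates the newest `output/gaia_results_*.json` and reads the fields named above (`metadata.score_percent`, `metadata.correct_count`, `results[].submitted_answer`). Counting answers that start with `ERROR` as failures is an assumed convention, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def latest_results_file(output_dir: str = "output") -> Optional[Path]:
    """Return the most recently modified gaia_results_*.json, or None."""
    files = sorted(
        Path(output_dir).glob("gaia_results_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    return files[-1] if files else None


def summarize_results(path) -> dict:
    """Pull the score metadata and every submitted answer from a results file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    metadata = data.get("metadata", {})
    answers = [item.get("submitted_answer", "") for item in data.get("results", [])]
    # ASSUMPTION: failed questions are recorded as answers starting with "ERROR".
    error_like = sum(1 for answer in answers if str(answer).upper().startswith("ERROR"))
    return {
        "score_percent": metadata.get("score_percent"),
        "correct_count": metadata.get("correct_count"),
        "answer_count": len(answers),
        "error_like": error_like,
        "answers": answers,
    }
```

A summary like this is a starting point only; the protocol still calls for reading every submitted answer to identify failure patterns.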

**Phase 2: Recent History (What was done recently)**

3. **Read last 3 dev records from `dev/` folder:**
   - Sort by filename (newest `dev_YYMMDD_##_title.md` first)
   - Read: Problem Description, Key Decisions, Outcome, Changelog
   - **Cross-verify:** Compare dev record claims with actual output files
   - **Red flag:** If dev record says "25% accuracy" but latest JSON shows "0%", prioritize JSON truth
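
Picking the last three dev records by filename can be sketched as below. It assumes `##` is a two-digit sequence number and that `YYMMDD` dates sort correctly as strings; `latest_dev_records` is a hypothetical helper.

```python
import re

# ASSUMPTION: file names follow dev_YYMMDD_##_title.md with a two-digit sequence.
DEV_RECORD = re.compile(r"^dev_(\d{6})_(\d{2})_.+\.md$")


def latest_dev_records(names, count: int = 3) -> list[str]:
    """Return the newest dev record file names, newest first."""
    keyed = []
    for name in names:
        match = DEV_RECORD.match(name)
        if match:
            # Sort key: (date, sequence number), both fixed-width digit strings.
            keyed.append((match.group(1), match.group(2), name))
    keyed.sort(reverse=True)
    return [name for _, _, name in keyed[:count]]
```

For a real folder, pass the file names from the `dev/` directory, for example `latest_dev_records(p.name for p in Path("dev").iterdir())` with `Path` imported from `pathlib`.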

**Phase 3: Project Structure (How it works)**

4. **Read README.md sections in order:**

   - Section 1: Overview (purpose, objectives)
   - Section 2: Architecture (tech stack, components, diagrams)
   - Section 3: Specification (current state, workflows, requirements)
   - Section 4: Workflow (this protocol)

5. **Read CLAUDE.md:**
   - Project-specific coding standards
   - Usually empty (inherits from global ~/.claude/CLAUDE.md)

**Phase 4: Code Structure (Critical files)**

6. **Identify critical files from README.md Architecture section:**
   - Note main entry points (e.g., `app.py`)
   - Note core logic files (e.g., `src/agent/graph.py`, `src/agent/llm_client.py`)
   - Note tool implementations (e.g., `src/tools/*.py`)
   - **DO NOT read these yet** - only note their locations for later reference

**Verification Checklist (Before claiming "I have context"):**

- [ ] I personally read CHANGELOG.md, PLAN.md, TODO.md (not delegated)
- [ ] I personally read latest output files (JSON results, test logs, etc.)
- [ ] I know the ACTUAL current accuracy/status from output files
- [ ] I read last 3 dev records and cross-verified claims with output data
- [ ] I read README.md sections 1-4 completely
- [ ] I can answer: "What is the current status and why?"
- [ ] I can answer: "What were the last 3 major changes and their outcomes?"
- [ ] I can answer: "What specific problems exist based on latest outputs?"

**Anti-Patterns (NEVER do these):**

- ❌ Delegate initial context loading to Explore/Task agents
- ❌ Trust dev record claims without verifying against output files
- ❌ Skip reading actual output data (JSON results, logs, test outputs)
- ❌ Claim "I have context" after only reading summaries
- ❌ Read code files before understanding current state from outputs

**Context Priority:** Latest Outputs (ground truth) > CHANGELOG (active work) > Dev Records (history) > README (structure)