agentbee

metadata

title: Agentbee | GAIA Project | HuggingFace Course
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Project Overview

Project Name: Final_Assignment_Template

Purpose: Course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). This project serves as a learning-focused workspace to support iterative agent development and experimentation.

Target Users: Students learning agent development through hands-on implementation

Key Objectives:

Build production-ready code that passes GAIA test requirements
Learn agent development through discovery-based implementation
Develop a systematic approach to complex AI task solving
Document learning process and key decisions

Project Architecture

Technology Stack:

Platform: Hugging Face Spaces with OAuth integration
UI Framework: Gradio 6.x (sdk_version 6.3.0 in the Space metadata) with OAuth support
Agent Framework: LangGraph (state graph orchestration)
LLM Providers (4-tier fallback):
- Google Gemini 2.0 Flash (free tier)
- HuggingFace Inference API (free tier)
- Groq (Llama 3.1 70B / Qwen 3 32B, free tier)
- Anthropic Claude Sonnet 4.5 (paid tier)
Tools:
- Web Search: Tavily API / Exa API
- File Parser: PyPDF2, openpyxl, python-docx, pillow
- Calculator: Safe expression evaluator
- Vision: Multimodal LLM (Gemini/Claude)
Language: Python 3.12+
Package Manager: uv

Project Structure:

Final_Assignment_Template/
├── archive/             # Reference materials, previous solutions, static resources
├── input/               # Input files, configuration, raw data
├── output/              # Generated files, results, processed data
├── test/                # Testing files, test scripts (99 tests)
├── dev/                 # Development records (permanent knowledge packages)
├── src/                 # Source code
│   ├── agent/           # Agent orchestration
│   │   ├── graph.py     # LangGraph state machine
│   │   └── llm_client.py # Multi-provider LLM integration with retry logic
│   └── tools/           # Agent tools
│       ├── __init__.py  # Tool registry
│       ├── web_search.py    # Tavily/Exa web search
│       ├── file_parser.py   # Multi-format file reader
│       ├── calculator.py    # Safe math evaluator
│       └── vision.py        # Multimodal image/video analysis
├── app.py               # Gradio UI with OAuth, LLM provider selection
├── pyproject.toml       # uv package management
├── requirements.txt     # Python dependencies (generated from pyproject.toml)
├── .env                 # Local environment variables (API keys, config)
├── README.md            # Project overview, architecture, workflow, specification
├── CLAUDE.md            # Project-specific AI instructions
├── PLAN.md              # Active implementation plan (temporary workspace)
├── TODO.md              # Active task tracking (temporary workspace)
└── CHANGELOG.md         # Session changelog (temporary workspace)

Core Components:

GAIAAgent class (src/agent/graph.py): LangGraph-based agent with state machine orchestration
- Planning node: Analyze question and generate execution plan
- Tool selection node: LLM function calling for dynamic tool selection
- Tool execution node: Execute selected tools with timeout and error handling
- Answer synthesis node: Generate factoid answer from evidence
LLM Client (src/agent/llm_client.py): Multi-provider LLM integration
- 4-tier fallback chain: Gemini → HuggingFace → Groq → Claude
- Exponential backoff retry logic (3 attempts per provider)
- Runtime config for UI-based provider selection
- Few-shot prompting for improved tool selection
Tool System (src/tools/):
- Web Search: Tavily/Exa API with query optimization
- File Parser: Multi-format support (PDF, Excel, Word, CSV, images)
- Calculator: Safe expression evaluator with graceful error handling
- Vision: Multimodal analysis for images/videos
Gradio UI (app.py):
- Test & Debug tab: Single question testing with LLM provider dropdown
- Full Evaluation tab: Run all GAIA questions with provider selection
- Results export: JSON file download for analysis
- OAuth integration for submission
Evaluation Infrastructure: Pre-built orchestration (question fetching, submission, scoring)
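
The four nodes listed above can be pictured as a linear pipeline over a shared state. The following is a minimal plain-Python sketch, not the LangGraph code in src/agent/graph.py: the `AgentState` fields, the node function names, and the toy calculator tool are all hypothetical, and the LLM calls are replaced with fixed stand-ins. Timeouts are omitted for brevity.

```python
from typing import Callable, TypedDict


class AgentState(TypedDict):
    # Hypothetical state shape; the real graph state may differ.
    question: str
    plan: str
    tool_calls: list
    evidence: list
    answer: str


def plan_node(state: AgentState) -> AgentState:
    # Stand-in: a real planning node would ask the LLM for an execution plan.
    return {**state, "plan": f"Find evidence for: {state['question']}"}


def select_tools_node(state: AgentState) -> AgentState:
    # Stand-in for LLM function calling: always picks the toy calculator.
    call = {"tool": "calculator", "args": {"expression": "2 + 2"}}
    return {**state, "tool_calls": [call]}


def execute_tools_node(state: AgentState) -> AgentState:
    # Toy tool registry: the calculator only sums integers joined by "+".
    tools: dict[str, Callable[..., str]] = {
        "calculator": lambda expression: str(
            sum(int(part) for part in expression.split("+"))
        ),
    }
    evidence = []
    for call in state["tool_calls"]:
        try:
            evidence.append(tools[call["tool"]](**call["args"]))
        except Exception as exc:
            # Record the failure as evidence instead of aborting the run.
            evidence.append(f"ERROR from {call['tool']}: {exc}")
    return {**state, "evidence": evidence}


def synthesize_node(state: AgentState) -> AgentState:
    # Stand-in: a real synthesis node would ask the LLM for a factoid answer.
    answer = state["evidence"][0] if state["evidence"] else "Unable to answer"
    return {**state, "answer": answer}


PIPELINE = [plan_node, select_tools_node, execute_tools_node, synthesize_node]


def run_agent(question: str) -> AgentState:
    state: AgentState = {
        "question": question, "plan": "", "tool_calls": [],
        "evidence": [], "answer": "",
    }
    for node in PIPELINE:
        state = node(state)
    return state
```

Each node takes the state and returns an updated copy, which mirrors how a state graph passes data between nodes while keeping every step easy to test in isolation.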

System Architecture Diagram:

---
config:
  layout: elk
---
graph TB
    subgraph "UI Layer"
        GradioUI[Gradio UI<br/>LLM Provider Selection<br/>Test & Full Evaluation]
        OAuth[HF OAuth<br/>User authentication]
    end

    subgraph "Agent Orchestration (LangGraph)"
        GAIAAgent[GAIAAgent<br/>State Machine]
        PlanNode[Planning Node<br/>Analyze question]
        ToolSelectNode[Tool Selection Node<br/>LLM function calling]
        ToolExecNode[Tool Execution Node<br/>Run selected tools]
        SynthesizeNode[Answer Synthesis Node<br/>Generate factoid]
    end

    subgraph "LLM Layer (4-Tier Fallback)"
        LLMClient[LLM Client<br/>Retry + Fallback]
        Gemini[Gemini 2.0 Flash<br/>Free Tier 1]
        HF[HuggingFace API<br/>Free Tier 2]
        Groq[Groq Llama/Qwen<br/>Free Tier 3]
        Claude[Claude Sonnet 4.5<br/>Paid Tier 4]
    end

    subgraph "Tool Layer"
        WebSearch[Web Search<br/>Tavily/Exa]
        FileParser[File Parser<br/>PDF/Excel/Word]
        Calculator[Calculator<br/>Safe eval]
        Vision[Vision<br/>Multimodal LLM]
    end

    subgraph "External Services"
        API[GAIA Scoring API]
        QEndpoint["/questions endpoint"]
        SEndpoint["/submit endpoint"]
    end

    GradioUI --> OAuth
    OAuth -->|Authenticated| GAIAAgent
    GAIAAgent --> PlanNode
    PlanNode --> ToolSelectNode
    ToolSelectNode --> ToolExecNode
    ToolExecNode --> SynthesizeNode

    PlanNode --> LLMClient
    ToolSelectNode --> LLMClient
    SynthesizeNode --> LLMClient

    LLMClient -->|Try 1| Gemini
    LLMClient -->|Fallback 2| HF
    LLMClient -->|Fallback 3| Groq
    LLMClient -->|Fallback 4| Claude

    ToolExecNode --> WebSearch
    ToolExecNode --> FileParser
    ToolExecNode --> Calculator
    ToolExecNode --> Vision

    GAIAAgent -->|Answers| API
    API --> QEndpoint
    API --> SEndpoint
    SEndpoint -->|Score| GradioUI

    style GAIAAgent fill:#ffcccc
    style LLMClient fill:#fff4cc
    style GradioUI fill:#cce5ff
    style API fill:#d9f2d9

Project Specification

Project Context:

This is a course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). It serves as a learning-focused workspace for iterative agent development and experimentation.

Current State:

Status: Stage 5 Complete - Performance Optimization
Development Progress:
- Stage 1-2: Basic infrastructure and LangGraph setup ✅
- Stage 3: Multi-provider LLM integration ✅
- Stage 4: Tool system and MVP (10% GAIA score: 2/20 questions) ✅
- Stage 5: Performance optimization (retry logic, Groq integration, improved prompts) ✅
Current Performance: Testing in progress (target: 25% accuracy, 5/20 questions)
Next Stage: Stage 6 - Advanced optimizations based on Stage 5 results

Data & Workflows:

Input Data: GAIA test questions fetched from external scoring API (agents-course-unit4-scoring.hf.space)
Processing: The agent class (BasicAgent in the course template; GAIAAgent in src/agent/graph.py for this project) processes questions and generates answers
Output: Agent responses submitted to scoring endpoint for evaluation
Development Workflow:
1. Local development and testing
2. Deploy to Hugging Face Space
3. Submit via integrated evaluation UI

User Workflow Diagram:

---
config:
  layout: fixed
---
flowchart TB
    Start(["Student starts assignment"]) --> Clone["Clone HF Space template"]
    Clone --> LocalDev["Local development:<br>Implement BasicAgent logic"]
    LocalDev --> LocalTest{"Test locally?"}
    LocalTest -- Yes --> RunLocal["Run app locally"]
    RunLocal --> Debug{"Works?"}
    Debug -- No --> LocalDev
    Debug -- Yes --> Deploy["Deploy to HF Space"]
    LocalTest -- Skip --> Deploy
    Deploy --> Login["Login with HF OAuth"]
    Login --> RunEval@{ label: "Click 'Run Evaluation'<br>button in UI" }
    RunEval --> FetchQ["System fetches GAIA<br>questions from API"]
    FetchQ --> RunAgent["Agent processes<br>each question"]
    RunAgent --> Submit["Submit answers<br>to scoring API"]
    Submit --> Display["Display score<br>and results"]
    Display --> Iterate{"Satisfied with<br>score?"}
    Iterate -- "No - improve agent" --> LocalDev
    Iterate -- Yes --> Complete(["Assignment complete"])

    RunEval@{ shape: rect}
    style Start fill:#e1f5e1
    style LocalDev fill:#fff4e1
    style Deploy fill:#e1f0ff
    style RunAgent fill:#ffe1f0
    style Complete fill:#e1f5e1

Technical Architecture:

Platform: Hugging Face Spaces with OAuth integration
Framework: Gradio for UI, Requests for API communication
Core Component: BasicAgent class (student-customizable template)
Evaluation Infrastructure: Pre-built orchestration (question fetching, submission, scoring display)
Deployment: HF Space with environment variables (SPACE_ID, SPACE_HOST)

Requirements & Constraints:

Constraint Type: Minimal at current stage
Infrastructure: Must run on Hugging Face Spaces platform
Integration: Fixed scoring API endpoints (cannot modify evaluation system)
Flexibility: Students have full freedom to design agent capabilities

Integration Points:

External API: https://agents-course-unit4-scoring.hf.space
- /questions endpoint: Fetch GAIA test questions
- /submit endpoint: Submit answers and receive scores
Authentication: Hugging Face OAuth for student identification
Deployment: HF Space runtime environment variables
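
As a rough illustration of the two endpoints above, the sketch below fetches questions and builds a submission payload. It uses the standard library's `urllib` rather than Requests so that it is self-contained. The payload field names (`username`, `agent_code`, `answers`, `task_id`, `submitted_answer`) and the `agent_code` URL pattern are assumptions that this README does not confirm; check them against app.py before relying on them.

```python
import json
import urllib.request

API_BASE = "https://agents-course-unit4-scoring.hf.space"


def fetch_questions(timeout=15):
    """GET /questions and return the decoded JSON body."""
    with urllib.request.urlopen(f"{API_BASE}/questions", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def build_submission(username, space_id, answers):
    """Build a submission body; `answers` maps task_id -> answer text.

    All field names here are assumed, not taken from this README.
    """
    return {
        "username": username.strip(),
        "agent_code": f"https://huggingface.co/spaces/{space_id}/tree/main",
        "answers": [
            {"task_id": task_id, "submitted_answer": answer}
            for task_id, answer in answers.items()
        ],
    }


def submit(payload, timeout=60):
    """POST the payload to /submit and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{API_BASE}/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping payload construction separate from the network calls makes it possible to unit-test the payload shape without touching the scoring API.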

Development Goals:

Primary: Achieve competitive GAIA benchmark performance through systematic optimization
Focus: Multi-tier LLM architecture with free-tier prioritization to minimize costs
Key Features:
- 4-tier LLM fallback for quota resilience (Gemini → HF → Groq → Claude)
- Exponential backoff retry logic for quota/rate limit errors
- UI-based LLM provider selection for easy A/B testing in the cloud
- Comprehensive tool system (web search, file parsing, calculator, vision)
- Graceful error handling and degradation
- Extensive test coverage (99 tests)
Documentation: Full dev record workflow tracking all decisions and changes

Key Features

LLM Provider Selection (UI-Based)

Local Testing (.env configuration):

LLM_PROVIDER=gemini          # Options: gemini, huggingface, groq, claude
ENABLE_LLM_FALLBACK=false    # Disable fallback for debugging single provider

Cloud Testing (HuggingFace Spaces):

Use UI dropdowns in Test & Debug tab or Full Evaluation tab
Select from: Gemini, HuggingFace, Groq, Claude
Toggle fallback behavior with checkbox
No environment variable changes needed; provider switching is instant

Benefits:

Easy A/B testing between providers
Clear visibility into which LLM is used
Isolated testing for debugging
Production safety with fallback enabled
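
One way the `LLM_PROVIDER` and `ENABLE_LLM_FALLBACK` settings could be turned into a provider order is sketched below. The function name `resolve_provider_order`, the defaults (`gemini`, fallback enabled), and the accepted truthy strings are assumptions for illustration, not the project's actual parsing rules.

```python
import os

# Provider names as listed in the .env options above.
VALID_PROVIDERS = ("gemini", "huggingface", "groq", "claude")


def resolve_provider_order(env=None):
    """Return the list of providers to try, primary first."""
    env = os.environ if env is None else env
    primary = env.get("LLM_PROVIDER", "gemini").strip().lower()
    if primary not in VALID_PROVIDERS:
        raise ValueError(f"Unknown LLM_PROVIDER: {primary!r}")

    # Assumed convention: anything other than 1/true/yes disables fallback.
    flag = env.get("ENABLE_LLM_FALLBACK", "true").strip().lower()
    if flag not in ("1", "true", "yes"):
        return [primary]

    # With fallback on, keep the remaining providers in their default order.
    return [primary] + [name for name in VALID_PROVIDERS if name != primary]
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise with the same values the UI dropdown and checkbox would supply.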

Retry Logic

Exponential backoff: 3 attempts with 1s, 2s, 4s delays
Error detection: 429 status, quota errors, rate limits
Scope: All LLM calls (planning, tool selection, synthesis)
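
A minimal sketch of this retry behaviour follows. `RateLimitError`, `is_retryable`, and `retry_with_backoff` are hypothetical names, and the string matching on error messages is an assumed detection heuristic. Note that this version sleeps only between attempts, so three attempts produce waits of 1s and 2s; whether the project also waits 4s after the final attempt is not stated here.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's 429 or quota error."""


def is_retryable(exc):
    # Assumed heuristic: match the error kinds named in the README by message text.
    text = str(exc).lower()
    markers = ("429", "quota", "rate limit")
    return isinstance(exc, RateLimitError) or any(m in text for m in markers)


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on retryable errors wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise  # Non-retryable, or out of attempts: let the caller decide.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.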

Tool System

Web Search (Tavily/Exa):

Factual information, current events, statistics
Wikipedia, company info, people

File Parser:

PDF, Excel, Word, CSV, Text, Images
Handles uploaded files and local paths

Calculator:

Safe expression evaluation
Arithmetic, algebra, trigonometry, logarithms
Functions: sqrt, sin, cos, log, abs, etc.
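
A common way to evaluate expressions safely is to parse them with `ast` and allow only a whitelist of node types. The sketch below follows that approach and returns a dict instead of raising; it is not the code in src/tools/calculator.py, and the `safe_calculate` name, the result keys, and the exponent cap are assumptions.

```python
import ast
import math
import operator


def _safe_pow(base, exponent):
    # Assumed guard: cap the exponent so inputs like 9**9**9 cannot hang the tool.
    if abs(exponent) > 1000:
        raise ValueError("Exponent too large")
    return operator.pow(base, exponent)


_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Mod: operator.mod, ast.Pow: _safe_pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_FUNCTIONS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
              "log": math.log, "abs": abs}
_CONSTANTS = {"pi": math.pi, "e": math.e}


def _eval(node):
    # Only numeric literals, whitelisted operators, names, and calls are allowed.
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    if isinstance(node, ast.Name) and node.id in _CONSTANTS:
        return _CONSTANTS[node.id]
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCTIONS and not node.keywords):
        return _FUNCTIONS[node.func.id](*[_eval(arg) for arg in node.args])
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


def safe_calculate(expression):
    """Evaluate a math expression; report failures as an error dict."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"success": True, "result": _eval(tree.body)}
    except Exception as exc:
        return {"success": False, "error": f"{type(exc).__name__}: {exc}"}
```

Because anything outside the whitelist raises inside `_eval`, attribute access, imports, and arbitrary function calls are rejected and surface as an error dict rather than a crash.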

Vision:

Multimodal image/video analysis
Describe content, identify objects, read text
YouTube video understanding

Performance Optimizations (Stage 5)

Few-shot prompting for improved tool selection
Vision questions are skipped gracefully when quota is exhausted
Relaxed calculator validation (returns error dicts instead of raising exceptions)
Improved tool descriptions with "Use when..." guidance
Config-based provider debugging

GAIA Benchmark Results

Baseline (Stage 4): 10% accuracy (2/20 questions correct)

Stage 5 Target: 25% accuracy (5/20 questions correct)

Status: Testing in progress
Expected improvements from retry logic, Groq integration, improved prompts

Test Coverage: 99 passing tests (~2min 40sec runtime)

Note: This project targets the Course Leaderboard (20 questions, 30% target). See the GAIA Submission Guide for the distinction between the Course and Official GAIA leaderboards.

Workflow

Dev Record Workflow

Philosophy: Dev records are the single source of truth. CHANGELOG/PLAN/TODO are temporary workspace files.

Dev Record Types:

🐞 Issue: Problem-solving, bug fixes, error resolution
🔨 Development: Feature development, enhancements, new functionality

Session Start Workflow

Phase 1: Planning (Explicit)

Create or identify dev record: dev/dev_YYMMDD_##_concise_title.md
- Choose type: 🐞 Issue or 🔨 Development
Create PLAN.md ONLY: Use /plan command or write directly
- Document implementation approach, steps, files to modify
- DO NOT create TODO.md or CHANGELOG.md yet

Phase 2: Development (Automatic)

Create TODO.md: Automatically populate as you start implementing
- Track tasks in real-time using TodoWrite tool
- Mark in_progress/completed as you work
Create CHANGELOG.md: Automatically populate as you make changes
- Record file modifications/creations/deletions as they happen
Work on solution: Update all three files during development

Session End Workflow

Phase 3: Completion (Manual)

After AI completes all work and updates PLAN/TODO/CHANGELOG:

AI stops and waits for user review (Checkpoint 3)
User reviews PLAN.md, TODO.md, and CHANGELOG.md
User manually runs /update-dev dev_YYMMDD_## when satisfied

When /update-dev runs:

Distills PLAN decisions → dev record "Key Decisions" section
Distills TODO deliverables → dev record "Outcome" section
Distills CHANGELOG changes → dev record "Changelog" section
Empties PLAN.md, TODO.md, CHANGELOG.md back to templates
Marks dev record status as ✅ Resolved

AI Context Loading Protocol

MANDATORY - Execute in exact order. NO delegating to sub-agents for initial context.

Phase 1: Current State (What's happening NOW)

Read workspace files:
- CHANGELOG.md - Active session changes (reverse chronological, newest first)
- PLAN.md - Current implementation plan (if exists)
- TODO.md - Active task tracking (if exists)
Read actual outputs (CRITICAL - verify claims, don't trust summaries):
- Latest files in output/ folder (sorted by timestamp, newest first)
- For GAIA projects: Read latest output/gaia_results_*.json completely
  - Check metadata.score_percent and metadata.correct_count
  - Read ALL results[].submitted_answer to understand failure patterns
  - Identify error categories (vision failures, tool errors, wrong answers)
- For test projects: Read latest test output logs
- Purpose: Ground truth of what ACTUALLY happened, not what was claimed
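
The Phase 1 check of the latest results file could be scripted roughly as follows. The `metadata.score_percent`, `metadata.correct_count`, and `results[].submitted_answer` keys come from the steps above; the function names, the use of file modification time for "latest", and the simple answered/error/empty categories are assumptions for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def load_latest_results(output_dir="output"):
    """Return the parsed newest gaia_results_*.json, or None if there is none."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return None
    files = sorted(directory.glob("gaia_results_*.json"),
                   key=lambda path: path.stat().st_mtime)
    if not files:
        return None
    return json.loads(files[-1].read_text(encoding="utf-8"))


def summarize_results(data):
    """Pull the score metadata and bucket every submitted answer."""
    metadata = data.get("metadata", {})
    counts = Counter()
    for item in data.get("results", []):
        answer = str(item.get("submitted_answer", "")).strip()
        if not answer:
            counts["empty"] += 1
        elif answer.upper().startswith("ERROR"):
            # Assumed convention: failed questions are stored as "ERROR ..." text.
            counts["error"] += 1
        else:
            counts["answered"] += 1
    return {
        "score_percent": metadata.get("score_percent"),
        "correct_count": metadata.get("correct_count"),
        "answer_categories": dict(counts),
    }
```

A summary like this is only a starting point; the protocol still calls for reading every `submitted_answer` to understand the failure patterns.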

Phase 2: Recent History (What was done recently)

Read last 3 dev records from dev/ folder:
- Sort by filename (newest dev_YYMMDD_##_title.md first)
- Read: Problem Description, Key Decisions, Outcome, Changelog
- Cross-verify: Compare dev record claims with actual output files
- Red flag: If dev record says "25% accuracy" but latest JSON shows "0%", prioritize JSON truth
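
Selecting the last three dev records by filename can be sketched as below. The pattern assumes that `##` in `dev_YYMMDD_##_title.md` is a two-digit sequence number; `latest_dev_records` is a hypothetical helper name.

```python
import re

# dev_YYMMDD_##_title.md, with ## assumed to be exactly two digits.
DEV_RECORD_RE = re.compile(r"^dev_(\d{6})_(\d{2})_(.+)\.md$")


def latest_dev_records(names, limit=3):
    """Return up to `limit` dev record filenames, newest first."""
    parsed = []
    for name in names:
        match = DEV_RECORD_RE.match(name)
        if match:
            # Sort key: (date, sequence number). YYMMDD sorts correctly as text
            # as long as all records fall within the same century.
            parsed.append(((match.group(1), match.group(2)), name))
    parsed.sort(reverse=True)
    return [name for _, name in parsed[:limit]]
```

In practice the names would come from the dev/ folder, for example `latest_dev_records(p.name for p in Path("dev").iterdir())` with `pathlib.Path` imported.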

Phase 3: Project Structure (How it works)

Read README.md sections in order:
- Section 1: Overview (purpose, objectives)
- Section 2: Architecture (tech stack, components, diagrams)
- Section 3: Specification (current state, workflows, requirements)
- Section 4: Workflow (this protocol)
Read CLAUDE.md:
- Project-specific coding standards
- Usually empty (inherits from global ~/.claude/CLAUDE.md)

Phase 4: Code Structure (Critical files)

Identify critical files from README.md Architecture section:
- Note main entry points (e.g., app.py)
- Note core logic files (e.g., src/agent/graph.py, src/agent/llm_client.py)
- Note tool implementations (e.g., src/tools/*.py)
- DO NOT read these yet - only note their locations for later reference

Verification Checklist (Before claiming "I have context"):

I personally read CHANGELOG.md, PLAN.md, TODO.md (not delegated)
I personally read latest output files (JSON results, test logs, etc.)
I know the ACTUAL current accuracy/status from output files
I read last 3 dev records and cross-verified claims with output data
I read README.md sections 1-4 completely
I can answer: "What is the current status and why?"
I can answer: "What were the last 3 major changes and their outcomes?"
I can answer: "What specific problems exist based on latest outputs?"

Anti-Patterns (NEVER do these):

❌ Delegate initial context loading to Explore/Task agents
❌ Trust dev record claims without verifying against output files
❌ Skip reading actual output data (JSON results, logs, test outputs)
❌ Claim "I have context" after only reading summaries
❌ Read code files before understanding current state from outputs

Context Priority: Latest Outputs (ground truth) > CHANGELOG (active work) > Dev Records (history) > README (structure)