---
title: Agentbee | GAIA Project | HuggingFace Course
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
## Project Overview

**Project Name:** Final_Assignment_Template

**Purpose:** Course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). This project serves as a learning-focused workspace to support iterative agent development and experimentation.

**Target Users:** Students learning agent development through hands-on implementation

**Key Objectives:**
- Build production-ready code that passes GAIA test requirements
- Learn agent development through discovery-based implementation
- Develop systematic approach to complex AI task solving
- Document learning process and key decisions
## Project Architecture

**Technology Stack:**

- Platform: Hugging Face Spaces with OAuth integration
- UI Framework: Gradio 5.x with OAuth support
- Agent Framework: LangGraph (state graph orchestration)
- LLM Providers (4-tier fallback):
  - Google Gemini 2.0 Flash (free tier)
  - HuggingFace Inference API (free tier)
  - Groq (Llama 3.1 70B / Qwen 3 32B, free tier)
  - Anthropic Claude Sonnet 4.5 (paid tier)
- Tools:
  - Web Search: Tavily API / Exa API
  - File Parser: PyPDF2, openpyxl, python-docx, pillow
  - Calculator: Safe expression evaluator
  - Vision: Multimodal LLM (Gemini/Claude)
- Language: Python 3.12+
- Package Manager: uv
**Project Structure:**

```
Final_Assignment_Template/
├── archive/              # Reference materials, previous solutions, static resources
├── input/                # Input files, configuration, raw data
├── output/               # Generated files, results, processed data
├── test/                 # Testing files, test scripts (99 tests)
├── dev/                  # Development records (permanent knowledge packages)
├── src/                  # Source code
│   ├── agent/            # Agent orchestration
│   │   ├── graph.py      # LangGraph state machine
│   │   └── llm_client.py # Multi-provider LLM integration with retry logic
│   └── tools/            # Agent tools
│       ├── __init__.py   # Tool registry
│       ├── web_search.py # Tavily/Exa web search
│       ├── file_parser.py # Multi-format file reader
│       ├── calculator.py # Safe math evaluator
│       └── vision.py     # Multimodal image/video analysis
├── app.py                # Gradio UI with OAuth, LLM provider selection
├── pyproject.toml        # uv package management
├── requirements.txt      # Python dependencies (generated from pyproject.toml)
├── .env                  # Local environment variables (API keys, config)
├── README.md             # Project overview, architecture, workflow, specification
├── CLAUDE.md             # Project-specific AI instructions
├── PLAN.md               # Active implementation plan (temporary workspace)
├── TODO.md               # Active task tracking (temporary workspace)
└── CHANGELOG.md          # Session changelog (temporary workspace)
```
**Core Components:**

- **GAIAAgent class** (`src/agent/graph.py`): LangGraph-based agent with state machine orchestration (a minimal wiring sketch follows this list)
  - Planning node: Analyzes the question and generates an execution plan
  - Tool selection node: LLM function calling for dynamic tool selection
  - Tool execution node: Executes selected tools with timeout and error handling
  - Answer synthesis node: Generates a factoid answer from the gathered evidence
- **LLM Client** (`src/agent/llm_client.py`): Multi-provider LLM integration
  - 4-tier fallback chain: Gemini → HuggingFace → Groq → Claude
  - Exponential backoff retry logic (3 attempts per provider)
  - Runtime config for UI-based provider selection
  - Few-shot prompting for improved tool selection
- **Tool System** (`src/tools/`):
  - Web Search: Tavily/Exa API with query optimization
  - File Parser: Multi-format support (PDF, Excel, Word, CSV, images)
  - Calculator: Safe expression evaluator with graceful error handling
  - Vision: Multimodal analysis for images/videos
- **Gradio UI** (`app.py`):
  - Test & Debug tab: Single-question testing with an LLM provider dropdown
  - Full Evaluation tab: Runs all GAIA questions with provider selection
  - Results export: JSON file download for analysis
  - OAuth integration for submission
- **Evaluation Infrastructure**: Pre-built orchestration (question fetching, submission, scoring)
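The four-node flow above maps directly onto a LangGraph state graph. Below is a minimal wiring sketch; the `AgentState` fields and node bodies are illustrative placeholders, not the actual contents of `src/agent/graph.py`.

```python
# Minimal sketch of the four-node LangGraph wiring described above.
# Node bodies are placeholders; real logic lives in src/agent/graph.py.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    plan: str
    selected_tools: list
    evidence: list
    answer: str

def plan_node(state: AgentState) -> dict:
    return {"plan": f"steps for: {state['question']}"}   # placeholder

def tool_select_node(state: AgentState) -> dict:
    return {"selected_tools": ["web_search"]}            # placeholder

def tool_exec_node(state: AgentState) -> dict:
    return {"evidence": ["tool output"]}                 # placeholder

def synthesize_node(state: AgentState) -> dict:
    return {"answer": "factoid"}                         # placeholder

builder = StateGraph(AgentState)
builder.add_node("plan", plan_node)
builder.add_node("select_tools", tool_select_node)
builder.add_node("execute_tools", tool_exec_node)
builder.add_node("synthesize", synthesize_node)
builder.set_entry_point("plan")
builder.add_edge("plan", "select_tools")
builder.add_edge("select_tools", "execute_tools")
builder.add_edge("execute_tools", "synthesize")
builder.add_edge("synthesize", END)
graph = builder.compile()
```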
**System Architecture Diagram:**

```mermaid
---
config:
  layout: elk
---
graph TB
    subgraph "UI Layer"
        GradioUI[Gradio UI<br/>LLM Provider Selection<br/>Test & Full Evaluation]
        OAuth[HF OAuth<br/>User authentication]
    end
    subgraph "Agent Orchestration (LangGraph)"
        GAIAAgent[GAIAAgent<br/>State Machine]
        PlanNode[Planning Node<br/>Analyze question]
        ToolSelectNode[Tool Selection Node<br/>LLM function calling]
        ToolExecNode[Tool Execution Node<br/>Run selected tools]
        SynthesizeNode[Answer Synthesis Node<br/>Generate factoid]
    end
    subgraph "LLM Layer (4-Tier Fallback)"
        LLMClient[LLM Client<br/>Retry + Fallback]
        Gemini[Gemini 2.0 Flash<br/>Free Tier 1]
        HF[HuggingFace API<br/>Free Tier 2]
        Groq[Groq Llama/Qwen<br/>Free Tier 3]
        Claude[Claude Sonnet 4.5<br/>Paid Tier 4]
    end
    subgraph "Tool Layer"
        WebSearch[Web Search<br/>Tavily/Exa]
        FileParser[File Parser<br/>PDF/Excel/Word]
        Calculator[Calculator<br/>Safe eval]
        Vision[Vision<br/>Multimodal LLM]
    end
    subgraph "External Services"
        API[GAIA Scoring API]
        QEndpoint["/questions endpoint"]
        SEndpoint["/submit endpoint"]
    end
    GradioUI --> OAuth
    OAuth -->|Authenticated| GAIAAgent
    GAIAAgent --> PlanNode
    PlanNode --> ToolSelectNode
    ToolSelectNode --> ToolExecNode
    ToolExecNode --> SynthesizeNode
    PlanNode --> LLMClient
    ToolSelectNode --> LLMClient
    SynthesizeNode --> LLMClient
    LLMClient -->|Try 1| Gemini
    LLMClient -->|Fallback 2| HF
    LLMClient -->|Fallback 3| Groq
    LLMClient -->|Fallback 4| Claude
    ToolExecNode --> WebSearch
    ToolExecNode --> FileParser
    ToolExecNode --> Calculator
    ToolExecNode --> Vision
    GAIAAgent -->|Answers| API
    API --> QEndpoint
    API --> SEndpoint
    SEndpoint -->|Score| GradioUI
    style GAIAAgent fill:#ffcccc
    style LLMClient fill:#fff4cc
    style GradioUI fill:#cce5ff
    style API fill:#d9f2d9
```
## Project Specification

**Project Context:**

This is a course assignment template for building an AI agent that passes the GAIA benchmark (General AI Assistants). The project was recently started as a learning-focused workspace to support iterative agent development and experimentation.
**Current State:**

- Status: Stage 5 Complete - Performance Optimization
- Development Progress:
  - Stage 1-2: Basic infrastructure and LangGraph setup ✅
  - Stage 3: Multi-provider LLM integration ✅
  - Stage 4: Tool system and MVP (10% GAIA score: 2/20 questions) ✅
  - Stage 5: Performance optimization (retry logic, Groq integration, improved prompts) ✅
- Current Performance: Testing in progress (target: 25% accuracy, 5/20 questions)
- Next Stage: Stage 6 - Advanced optimizations based on Stage 5 results
**Data & Workflows:**

- Input Data: GAIA test questions fetched from the external scoring API (`agents-course-unit4-scoring.hf.space`)
- Processing: BasicAgent class processes questions and generates answers
- Output: Agent responses submitted to the scoring endpoint for evaluation
- Development Workflow:
  - Local development and testing
  - Deploy to Hugging Face Space
  - Submit via the integrated evaluation UI
**User Workflow Diagram:**

```mermaid
---
config:
  layout: fixed
---
flowchart TB
    Start(["Student starts assignment"]) --> Clone["Clone HF Space template"]
    Clone --> LocalDev["Local development:<br>Implement BasicAgent logic"]
    LocalDev --> LocalTest{"Test locally?"}
    LocalTest -- Yes --> RunLocal["Run app locally"]
    RunLocal --> Debug{"Works?"}
    Debug -- No --> LocalDev
    Debug -- Yes --> Deploy["Deploy to HF Space"]
    LocalTest -- Skip --> Deploy
    Deploy --> Login["Login with HF OAuth"]
    Login --> RunEval@{ label: "Click 'Run Evaluation'<br>button in UI" }
    RunEval --> FetchQ["System fetches GAIA<br>questions from API"]
    FetchQ --> RunAgent["Agent processes<br>each question"]
    RunAgent --> Submit["Submit answers<br>to scoring API"]
    Submit --> Display["Display score<br>and results"]
    Display --> Iterate{"Satisfied with<br>score?"}
    Iterate -- "No - improve agent" --> LocalDev
    Iterate -- Yes --> Complete(["Assignment complete"])
    RunEval@{ shape: rect}
    style Start fill:#e1f5e1
    style LocalDev fill:#fff4e1
    style Deploy fill:#e1f0ff
    style RunAgent fill:#ffe1f0
    style Complete fill:#e1f5e1
```
**Technical Architecture:**

- Platform: Hugging Face Spaces with OAuth integration
- Framework: Gradio for UI, Requests for API communication
- Core Component: BasicAgent class (student-customizable template)
- Evaluation Infrastructure: Pre-built orchestration (question fetching, submission, scoring display)
- Deployment: HF Space with environment variables (SPACE_ID, SPACE_HOST)

**Requirements & Constraints:**

- Constraint Type: Minimal at the current stage
- Infrastructure: Must run on the Hugging Face Spaces platform
- Integration: Fixed scoring API endpoints (the evaluation system cannot be modified)
- Flexibility: Students have full freedom to design agent capabilities
**Integration Points:**

- External API: `https://agents-course-unit4-scoring.hf.space`
  - `/questions` endpoint: Fetch GAIA test questions
  - `/submit` endpoint: Submit answers and receive scores
- Authentication: Hugging Face OAuth for student identification
- Deployment: HF Space runtime environment variables
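For illustration, here is a minimal sketch of both API calls using `requests`. The submission payload field names (`username`, `agent_code`, `task_id`, `submitted_answer`) are assumptions based on the course template, not guarantees of this project's code.

```python
# Sketch of talking to the two scoring endpoints listed above;
# payload field names are assumed from the course template.
import requests

BASE_URL = "https://agents-course-unit4-scoring.hf.space"

questions = requests.get(f"{BASE_URL}/questions", timeout=30).json()

answers = [
    {"task_id": q["task_id"], "submitted_answer": "placeholder"}
    for q in questions
]

payload = {
    "username": "your-hf-username",  # assumed field name
    "agent_code": "https://huggingface.co/spaces/you/your-space/tree/main",
    "answers": answers,
}
result = requests.post(f"{BASE_URL}/submit", json=payload, timeout=60).json()
print(result.get("score"), result.get("correct_count"))
```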
**Development Goals:**

- Primary: Achieve competitive GAIA benchmark performance through systematic optimization
- Focus: Multi-tier LLM architecture with free-tier prioritization to minimize costs
- Key Features:
  - 4-tier LLM fallback for quota resilience (Gemini → HF → Groq → Claude)
  - Exponential backoff retry logic for quota/rate-limit errors
  - UI-based LLM provider selection for easy A/B testing in the cloud
  - Comprehensive tool system (web search, file parsing, calculator, vision)
  - Graceful error handling and degradation
  - Extensive test coverage (99 tests)
- Documentation: Full dev-record workflow tracking all decisions and changes
## Key Features

### LLM Provider Selection (UI-Based)

**Local Testing (`.env` configuration):**

```
LLM_PROVIDER=gemini        # Options: gemini, huggingface, groq, claude
ENABLE_LLM_FALLBACK=false  # Disable fallback when debugging a single provider
```
**Cloud Testing (HuggingFace Spaces):**
- Use UI dropdowns in Test & Debug tab or Full Evaluation tab
- Select from: Gemini, HuggingFace, Groq, Claude
- Toggle fallback behavior with checkbox
- No environment variable changes needed; provider switching is instant (see the sketch below)
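For reference, a small sketch of how these settings might be resolved at startup, assuming `python-dotenv` is used; the `resolve_provider` helper is hypothetical.

```python
# Sketch of loading the .env settings above; a UI dropdown value,
# when present, overrides the environment default.
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()

def resolve_provider(ui_choice: str | None = None) -> tuple[str, bool]:
    """Return (provider, fallback_enabled), preferring the UI selection."""
    provider = ui_choice or os.getenv("LLM_PROVIDER", "gemini")
    fallback = os.getenv("ENABLE_LLM_FALLBACK", "true").lower() == "true"
    return provider, fallback
```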
**Benefits:**

- Easy A/B testing between providers
- Clear visibility into which LLM is in use
- Isolated testing for debugging
- Production safety with fallback enabled
### Retry Logic
- Exponential backoff: 3 attempts with 1s, 2s, 4s delays
- Error detection: 429 status, quota errors, rate limits
- Scope: All LLM calls (planning, tool selection, synthesis)
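A minimal sketch of this retry-plus-fallback pattern follows; the `call_provider` callable and the error classification are simplified stand-ins for the real `llm_client.py` logic.

```python
# Sketch: try each provider in order; retry quota/rate-limit errors
# with exponential backoff (1s, 2s, 4s) before falling back.
import time

PROVIDER_CHAIN = ["gemini", "huggingface", "groq", "claude"]

def is_quota_error(exc: Exception) -> bool:
    msg = str(exc).lower()
    return "429" in msg or "quota" in msg or "rate limit" in msg

def call_with_fallback(prompt: str, call_provider) -> str:
    for provider in PROVIDER_CHAIN:
        for attempt in range(3):
            try:
                return call_provider(provider, prompt)
            except Exception as exc:
                if not is_quota_error(exc):
                    break  # non-quota error: skip to the next provider
                time.sleep(2 ** attempt)  # 1s, 2s, 4s
    raise RuntimeError("All providers exhausted")
```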
### Tool System

**Web Search (Tavily/Exa):**
- Factual information, current events, statistics
- Wikipedia, company info, people
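A sketch of what the Tavily-backed path might look like with the `tavily-python` client; the `web_search` wrapper and its return shape are illustrative, not this project's exact API.

```python
# Sketch of a Tavily search call; result keys follow the tavily-python client.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def web_search(query: str, max_results: int = 5) -> list[dict]:
    response = client.search(query, max_results=max_results)
    return [
        {"title": r["title"], "url": r["url"], "content": r["content"]}
        for r in response["results"]
    ]
```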
**File Parser:**
- PDF, Excel, Word, CSV, Text, Images
- Handles uploaded files and local paths
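A sketch of extension-based dispatch over these formats; the `parse_file` helper is hypothetical, though the library calls (PyPDF2, openpyxl, python-docx) are standard.

```python
# Sketch: route a file to the right parser based on its extension.
from pathlib import Path

def parse_file(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        from PyPDF2 import PdfReader
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".xlsx":
        from openpyxl import load_workbook
        ws = load_workbook(path, read_only=True).active
        return "\n".join(
            ",".join(str(c) for c in row)
            for row in ws.iter_rows(values_only=True)
        )
    if suffix == ".docx":
        from docx import Document
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix in (".csv", ".txt"):
        return Path(path).read_text(encoding="utf-8", errors="replace")
    raise ValueError(f"Unsupported format: {suffix}")
```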
**Calculator:**
- Safe expression evaluation
- Arithmetic, algebra, trigonometry, logarithms
- Functions: sqrt, sin, cos, log, abs, etc.
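A minimal sketch of an AST-whitelisting safe evaluator that returns error dicts rather than raising, in the spirit of the graceful handling described here; the real `calculator.py` may differ.

```python
# Sketch: evaluate arithmetic safely by whitelisting AST node types,
# operators, and math functions; never exec/eval raw input.
import ast, math, operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
          "log": math.log, "abs": abs}

def safe_eval(expr: str) -> dict:
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS):
            return _FUNCS[node.func.id](*[_eval(a) for a in node.args])
        raise ValueError("Disallowed expression")
    try:
        return {"result": _eval(ast.parse(expr, mode="eval").body)}
    except Exception as exc:
        return {"error": str(exc)}  # graceful: error dict instead of a crash
```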
**Vision:**
- Multimodal image/video analysis
- Describe content, identify objects, read text
- YouTube video understanding
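A sketch of a multimodal call via the `google-generativeai` SDK; the `describe_image` helper and model choice are illustrative, not the project's actual vision tool.

```python
# Sketch: send an image plus a question to a Gemini multimodal model.
import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def describe_image(path: str, question: str) -> str:
    image = Image.open(path)
    response = model.generate_content([question, image])
    return response.text
```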
### Performance Optimizations (Stage 5)
- Few-shot prompting for improved tool selection
- Graceful vision question skip when quota exhausted
- Relaxed calculator validation (returns error dicts instead of crashes)
- Improved tool descriptions with "Use when..." guidance
- Config-based provider debugging
## GAIA Benchmark Results

**Baseline (Stage 4):** 10% accuracy (2/20 questions correct)

**Stage 5 Target:** 25% accuracy (5/20 questions correct)

- Status: Testing in progress
- Expected improvements from retry logic, Groq integration, and improved prompts

**Test Coverage:** 99 passing tests (~2 min 40 sec runtime)

Note: This project targets the Course Leaderboard (20 questions, 30% target). See the GAIA Submission Guide for the distinction between the Course and Official GAIA leaderboards.
## Workflow

### Dev Record Workflow

**Philosophy:** Dev records are the single source of truth. CHANGELOG/PLAN/TODO are temporary workspace files.

**Dev Record Types:**

- 🐛 Issue: Problem-solving, bug fixes, error resolution
- 🔨 Development: Feature development, enhancements, new functionality
### Session Start Workflow

**Phase 1: Planning (Explicit)**

- Create or identify the dev record: `dev/dev_YYMMDD_##_concise_title.md`
  - Choose type: 🐛 Issue or 🔨 Development
- Create PLAN.md ONLY: use the `/plan` command or write it directly
  - Document the implementation approach, steps, and files to modify
- DO NOT create TODO.md or CHANGELOG.md yet
**Phase 2: Development (Automatic)**

- Create TODO.md: populate it automatically as you start implementing
  - Track tasks in real time using the TodoWrite tool
  - Mark tasks in_progress/completed as you work
- Create CHANGELOG.md: populate it automatically as you make changes
  - Record file modifications/creations/deletions as they happen
- Work on the solution: update all three files during development
### Session End Workflow

**Phase 3: Completion (Manual)**

After the AI completes all work and updates PLAN/TODO/CHANGELOG:

- AI stops and waits for user review (Checkpoint 3)
- User reviews PLAN.md, TODO.md, and CHANGELOG.md
- User manually runs `/update-dev dev_YYMMDD_##` when satisfied

When `/update-dev` runs:

- Distills PLAN decisions → dev record "Key Decisions" section
- Distills TODO deliverables → dev record "Outcome" section
- Distills CHANGELOG changes → dev record "Changelog" section
- Empties PLAN.md, TODO.md, and CHANGELOG.md back to templates
- Marks dev record status as ✅ Resolved
### AI Context Loading Protocol

**MANDATORY** - Execute in exact order. NO delegating to sub-agents for initial context.

**Phase 1: Current State (What's happening NOW)**

Read workspace files:

- `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
- `PLAN.md` - Current implementation plan (if it exists)
- `TODO.md` - Active task tracking (if it exists)

Read actual outputs (CRITICAL - verify claims, don't trust summaries):

- Latest files in the `output/` folder (sorted by timestamp, newest first)
- For GAIA projects: read the latest `output/gaia_results_*.json` completely
  - Check `metadata.score_percent` and `metadata.correct_count`
  - Read ALL `results[].submitted_answer` entries to understand failure patterns
  - Identify error categories (vision failures, tool errors, wrong answers)
- For test projects: read the latest test output logs
- Purpose: ground truth of what ACTUALLY happened, not what was claimed
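As a concrete example of this step, a short sketch that loads the newest results file and prints the fields named above; the paths and JSON keys follow the layout described in this protocol.

```python
# Sketch: find the newest GAIA results file and surface its ground truth.
import json
from pathlib import Path

latest = max(Path("output").glob("gaia_results_*.json"),
             key=lambda p: p.stat().st_mtime)
data = json.loads(latest.read_text())

print("score:", data["metadata"]["score_percent"])
print("correct:", data["metadata"]["correct_count"])
for r in data["results"]:
    print(r.get("task_id"), "->", r.get("submitted_answer"))
```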
**Phase 2: Recent History (What was done recently)**

- Read the last 3 dev records from the `dev/` folder:
  - Sort by filename (newest `dev_YYMMDD_##_title.md` first)
  - Read: Problem Description, Key Decisions, Outcome, Changelog
- Cross-verify: compare dev record claims against actual output files
  - Red flag: if a dev record claims "25% accuracy" but the latest JSON shows "0%", trust the JSON
**Phase 3: Project Structure (How it works)**

Read README.md sections in order:

- Section 1: Overview (purpose, objectives)
- Section 2: Architecture (tech stack, components, diagrams)
- Section 3: Specification (current state, workflows, requirements)
- Section 4: Workflow (this protocol)

Read CLAUDE.md:

- Project-specific coding standards
- Usually empty (inherits from the global `~/.claude/CLAUDE.md`)
**Phase 4: Code Structure (Critical files)**

- Identify critical files from the README.md Architecture section:
  - Note main entry points (e.g., `app.py`)
  - Note core logic files (e.g., `src/agent/graph.py`, `src/agent/llm_client.py`)
  - Note tool implementations (e.g., `src/tools/*.py`)
- DO NOT read these yet - only note their locations for later reference
**Verification Checklist** (before claiming "I have context"):

- I personally read CHANGELOG.md, PLAN.md, and TODO.md (not delegated)
- I personally read the latest output files (JSON results, test logs, etc.)
- I know the ACTUAL current accuracy/status from output files
- I read the last 3 dev records and cross-verified their claims against output data
- I read README.md sections 1-4 completely
- I can answer: "What is the current status and why?"
- I can answer: "What were the last 3 major changes and their outcomes?"
- I can answer: "What specific problems exist based on the latest outputs?"
**Anti-Patterns** (NEVER do these):

- ❌ Delegate initial context loading to Explore/Task agents
- ❌ Trust dev record claims without verifying against output files
- ❌ Skip reading actual output data (JSON results, logs, test outputs)
- ❌ Claim "I have context" after only reading summaries
- ❌ Read code files before understanding current state from outputs
**Context Priority:** Latest Outputs (ground truth) > CHANGELOG (active work) > Dev Records (history) > README (structure)