---
title: Agentbee | GAIA Project | HuggingFace Course
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

Check out the configuration reference at <https://huggingface.co/docs/hub/spaces-config-reference>

## Project Overview

**Project Name:** Final_Assignment_Template

**Purpose:** Course assignment template for building an AI agent that passes the GAIA (General AI Assistants) benchmark. The project serves as a learning-focused workspace for iterative agent development and experimentation.

**Target Users:** Students learning agent development through hands-on implementation

**Key Objectives:**

- Build production-ready code that passes GAIA test requirements
- Learn agent development through discovery-based implementation
- Develop a systematic approach to solving complex AI tasks
- Document the learning process and key decisions

## Project Architecture

**Technology Stack:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **UI Framework:** Gradio 6.x with OAuth support
- **Agent Framework:** LangGraph (state graph orchestration)
- **LLM Providers (4-tier fallback):**
  - Google Gemini 2.0 Flash (free tier)
  - HuggingFace Inference API (free tier)
  - Groq (Llama 3.1 70B / Qwen 3 32B, free tier)
  - Anthropic Claude Sonnet 4.5 (paid tier)
- **Tools:**
  - Web Search: Tavily API / Exa API
  - File Parser: PyPDF2, openpyxl, python-docx, Pillow
  - Calculator: Safe expression evaluator
  - Vision: Multimodal LLM (Gemini/Claude)
- **Language:** Python 3.12+
- **Package Manager:** uv

**Project Structure:**

```
Final_Assignment_Template/
├── archive/              # Reference materials, previous solutions, static resources
├── input/                # Input files, configuration, raw data
├── output/               # Generated files, results, processed data
├── test/                 # Testing files, test scripts (99 tests)
├── dev/                  # Development records (permanent knowledge packages)
├── src/                  # Source code
│   ├── agent/            # Agent orchestration
│   │   ├── graph.py      # LangGraph state machine
│   │   └── llm_client.py # Multi-provider LLM integration with retry logic
│   └── tools/            # Agent tools
│       ├── __init__.py   # Tool registry
│       ├── web_search.py # Tavily/Exa web search
│       ├── file_parser.py # Multi-format file reader
│       ├── calculator.py # Safe math evaluator
│       └── vision.py     # Multimodal image/video analysis
├── app.py                # Gradio UI with OAuth, LLM provider selection
├── pyproject.toml        # uv package management
├── requirements.txt      # Python dependencies (generated from pyproject.toml)
├── .env                  # Local environment variables (API keys, config)
├── README.md             # Project overview, architecture, workflow, specification
├── CLAUDE.md             # Project-specific AI instructions
├── PLAN.md               # Active implementation plan (temporary workspace)
├── TODO.md               # Active task tracking (temporary workspace)
└── CHANGELOG.md          # Session changelog (temporary workspace)
```

**Core Components:**

- **GAIAAgent class** (src/agent/graph.py): LangGraph-based agent with state machine orchestration
  - Planning node: Analyze the question and generate an execution plan
  - Tool selection node: LLM function calling for dynamic tool selection
  - Tool execution node: Execute selected tools with timeout and error handling
  - Answer synthesis node: Generate a factoid answer from the gathered evidence
- **LLM Client** (src/agent/llm_client.py): Multi-provider LLM integration
  - 4-tier fallback chain: Gemini → HuggingFace → Groq → Claude
  - Exponential backoff retry logic (3 attempts per provider)
  - Runtime config for UI-based provider selection
  - Few-shot prompting for improved tool selection
- **Tool System** (src/tools/):
  - Web Search: Tavily/Exa API with query optimization
  - File Parser: Multi-format support (PDF, Excel, Word, CSV, images)
  - Calculator: Safe expression evaluator with graceful error handling
  - Vision: Multimodal analysis for images/videos
- **Gradio UI** (app.py):
  - Test & Debug tab: Single-question testing with an LLM provider dropdown
  - Full Evaluation tab: Run all GAIA questions with provider selection
  - Results export: JSON file download for analysis
  - OAuth integration for submission
- **Evaluation Infrastructure:** Pre-built orchestration (question fetching, submission, scoring)

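The four-node flow above can be sketched, without the LangGraph dependency, as plain functions chained in order. This is an illustrative analogue, not the project's actual graph code; the state fields (`question`, `plan`, `tool_results`, `answer`) are assumptions.

```python
# Framework-free sketch of the planning -> tool selection -> execution ->
# synthesis pipeline. Each node takes the shared state dict and returns an
# updated copy, mirroring how LangGraph nodes transform graph state.
def plan_node(state: dict) -> dict:
    # Planning node: analyze the question and record an execution plan.
    return {**state, "plan": f"look up: {state['question']}"}

def select_tools_node(state: dict) -> dict:
    # Tool selection node: in the real agent, an LLM picks tools here.
    return {**state, "selected_tools": ["web_search"]}

def execute_tools_node(state: dict) -> dict:
    # Tool execution node: run each selected tool and collect evidence.
    results = {tool: f"<{tool} output>" for tool in state["selected_tools"]}
    return {**state, "tool_results": results}

def synthesize_node(state: dict) -> dict:
    # Answer synthesis node: distill the evidence into a short factoid answer.
    evidence = "; ".join(state["tool_results"].values())
    return {**state, "answer": evidence}

def run_agent(question: str) -> dict:
    state = {"question": question}
    for node in (plan_node, select_tools_node, execute_tools_node, synthesize_node):
        state = node(state)
    return state
```

The real implementation wires the same four nodes into a LangGraph `StateGraph`, which also allows conditional edges (e.g., looping back to tool selection when evidence is insufficient).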
**System Architecture Diagram:**

```mermaid
---
config:
  layout: elk
---
graph TB
    subgraph "UI Layer"
        GradioUI[Gradio UI<br/>LLM Provider Selection<br/>Test & Full Evaluation]
        OAuth[HF OAuth<br/>User authentication]
    end

    subgraph "Agent Orchestration (LangGraph)"
        GAIAAgent[GAIAAgent<br/>State Machine]
        PlanNode[Planning Node<br/>Analyze question]
        ToolSelectNode[Tool Selection Node<br/>LLM function calling]
        ToolExecNode[Tool Execution Node<br/>Run selected tools]
        SynthesizeNode[Answer Synthesis Node<br/>Generate factoid]
    end

    subgraph "LLM Layer (4-Tier Fallback)"
        LLMClient[LLM Client<br/>Retry + Fallback]
        Gemini[Gemini 2.0 Flash<br/>Free Tier 1]
        HF[HuggingFace API<br/>Free Tier 2]
        Groq[Groq Llama/Qwen<br/>Free Tier 3]
        Claude[Claude Sonnet 4.5<br/>Paid Tier 4]
    end

    subgraph "Tool Layer"
        WebSearch[Web Search<br/>Tavily/Exa]
        FileParser[File Parser<br/>PDF/Excel/Word]
        Calculator[Calculator<br/>Safe eval]
        Vision[Vision<br/>Multimodal LLM]
    end

    subgraph "External Services"
        API[GAIA Scoring API]
        QEndpoint["/questions endpoint"]
        SEndpoint["/submit endpoint"]
    end

    GradioUI --> OAuth
    OAuth -->|Authenticated| GAIAAgent
    GAIAAgent --> PlanNode
    PlanNode --> ToolSelectNode
    ToolSelectNode --> ToolExecNode
    ToolExecNode --> SynthesizeNode

    PlanNode --> LLMClient
    ToolSelectNode --> LLMClient
    SynthesizeNode --> LLMClient

    LLMClient -->|Try 1| Gemini
    LLMClient -->|Fallback 2| HF
    LLMClient -->|Fallback 3| Groq
    LLMClient -->|Fallback 4| Claude

    ToolExecNode --> WebSearch
    ToolExecNode --> FileParser
    ToolExecNode --> Calculator
    ToolExecNode --> Vision

    GAIAAgent -->|Answers| API
    API --> QEndpoint
    API --> SEndpoint
    SEndpoint -->|Score| GradioUI

    style GAIAAgent fill:#ffcccc
    style LLMClient fill:#fff4cc
    style GradioUI fill:#cce5ff
    style API fill:#d9f2d9
```

## Project Specification

**Project Context:**

This is a course assignment template for building an AI agent that passes the GAIA (General AI Assistants) benchmark. The project was recently started as a learning-focused workspace to support iterative agent development and experimentation.

**Current State:**

- **Status:** Stage 5 Complete - Performance Optimization
- **Development Progress:**
  - Stage 1-2: Basic infrastructure and LangGraph setup ✅
  - Stage 3: Multi-provider LLM integration ✅
  - Stage 4: Tool system and MVP (10% GAIA score: 2/20 questions) ✅
  - Stage 5: Performance optimization (retry logic, Groq integration, improved prompts) ✅
- **Current Performance:** Testing in progress (target: 25% accuracy, 5/20 questions)
- **Next Stage:** Stage 6 - advanced optimizations based on Stage 5 results

**Data & Workflows:**

- **Input Data:** GAIA test questions fetched from the external scoring API (`agents-course-unit4-scoring.hf.space`)
- **Processing:** The BasicAgent class processes questions and generates answers
- **Output:** Agent responses submitted to the scoring endpoint for evaluation
- **Development Workflow:**
  1. Local development and testing
  2. Deploy to the Hugging Face Space
  3. Submit via the integrated evaluation UI

**User Workflow Diagram:**

```mermaid
---
config:
  layout: fixed
---
flowchart TB
    Start(["Student starts assignment"]) --> Clone["Clone HF Space template"]
    Clone --> LocalDev["Local development:<br>Implement BasicAgent logic"]
    LocalDev --> LocalTest{"Test locally?"}
    LocalTest -- Yes --> RunLocal["Run app locally"]
    RunLocal --> Debug{"Works?"}
    Debug -- No --> LocalDev
    Debug -- Yes --> Deploy["Deploy to HF Space"]
    LocalTest -- Skip --> Deploy
    Deploy --> Login["Login with HF OAuth"]
    Login --> RunEval@{ label: "Click 'Run Evaluation'<br>button in UI" }
    RunEval --> FetchQ["System fetches GAIA<br>questions from API"]
    FetchQ --> RunAgent["Agent processes<br>each question"]
    RunAgent --> Submit["Submit answers<br>to scoring API"]
    Submit --> Display["Display score<br>and results"]
    Display --> Iterate{"Satisfied with<br>score?"}
    Iterate -- "No - improve agent" --> LocalDev
    Iterate -- Yes --> Complete(["Assignment complete"])

    RunEval@{ shape: rect}
    style Start fill:#e1f5e1
    style LocalDev fill:#fff4e1
    style Deploy fill:#e1f0ff
    style RunAgent fill:#ffe1f0
    style Complete fill:#e1f5e1
```

**Technical Architecture:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **Framework:** Gradio for the UI, Requests for API communication
- **Core Component:** BasicAgent class (student-customizable template)
- **Evaluation Infrastructure:** Pre-built orchestration (question fetching, submission, scoring display)
- **Deployment:** HF Space with environment variables (SPACE_ID, SPACE_HOST)

**Requirements & Constraints:**

- **Constraint Type:** Minimal at the current stage
- **Infrastructure:** Must run on the Hugging Face Spaces platform
- **Integration:** Fixed scoring API endpoints (the evaluation system cannot be modified)
- **Flexibility:** Students have full freedom to design agent capabilities

**Integration Points:**

- **External API:** `https://agents-course-unit4-scoring.hf.space`
  - `/questions` endpoint: Fetch GAIA test questions
  - `/submit` endpoint: Submit answers and receive scores
- **Authentication:** Hugging Face OAuth for student identification
- **Deployment:** HF Space runtime environment variables

**Development Goals:**

- **Primary:** Achieve competitive GAIA benchmark performance through systematic optimization
- **Focus:** Multi-tier LLM architecture with free-tier prioritization to minimize costs
- **Key Features:**
  - 4-tier LLM fallback for quota resilience (Gemini → HF → Groq → Claude)
  - Exponential backoff retry logic for quota/rate-limit errors
  - UI-based LLM provider selection for easy A/B testing in the cloud
  - Comprehensive tool system (web search, file parsing, calculator, vision)
  - Graceful error handling and degradation
  - Extensive test coverage (99 tests)
- **Documentation:** Full dev record workflow tracking all decisions and changes

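The 4-tier fallback described above reduces to a simple loop: try each provider in order and return the first success. A minimal sketch, where the provider callables and the `ProviderError` type are illustrative placeholders rather than the project's actual client code:

```python
# Try providers in priority order (free tiers first, paid tier last) and
# return the first successful response along with the provider that served it.
PROVIDER_ORDER = ["gemini", "huggingface", "groq", "claude"]

class ProviderError(Exception):
    """Stand-in for quota / rate-limit / availability failures."""

def call_with_fallback(providers: dict, prompt: str):
    last_error = None
    for name in PROVIDER_ORDER:
        try:
            return name, providers[name](prompt)  # first success wins
        except ProviderError as exc:
            last_error = exc  # quota exhausted: fall through to the next tier
    raise RuntimeError(f"all providers failed: {last_error}")
```

In the real client this loop composes with per-provider retry, so each tier gets its three backoff attempts before the chain moves on.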
## Key Features

### LLM Provider Selection (UI-Based)

**Local Testing (.env configuration):**

```bash
LLM_PROVIDER=gemini        # Options: gemini, huggingface, groq, claude
ENABLE_LLM_FALLBACK=false  # Disable fallback when debugging a single provider
```

**Cloud Testing (HuggingFace Spaces):**

- Use the UI dropdowns in the Test & Debug tab or the Full Evaluation tab
- Select from: Gemini, HuggingFace, Groq, Claude
- Toggle fallback behavior with a checkbox
- No environment variable changes needed; providers switch instantly

**Benefits:**

- Easy A/B testing between providers
- Clear visibility into which LLM is used
- Isolated testing for debugging
- Production safety with fallback enabled

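At startup, the two `.env` toggles above can be read with a few lines of config code. This sketch uses the variable names from this README; the defaults and validation are assumptions, not the project's actual loader.

```python
# Read LLM_PROVIDER and ENABLE_LLM_FALLBACK from the environment,
# validating the provider name against the four supported tiers.
import os

VALID_PROVIDERS = {"gemini", "huggingface", "groq", "claude"}

def load_llm_config() -> dict:
    provider = os.environ.get("LLM_PROVIDER", "gemini").lower()
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    fallback = os.environ.get("ENABLE_LLM_FALLBACK", "true").lower() == "true"
    return {"provider": provider, "fallback": fallback}
```

In the cloud, the UI dropdown simply overrides this config at request time, which is why no environment change is needed there.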
### Retry Logic

- **Exponential backoff:** 3 attempts with 1s, 2s, 4s delays
- **Error detection:** 429 status, quota errors, rate limits
- **Scope:** All LLM calls (planning, tool selection, synthesis)

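The retry policy above can be sketched as follows; the string-based error check is a simplified stand-in for the project's real detection logic, and the injectable `sleep` parameter is an assumption added for testability.

```python
# Retry a callable up to 3 times on quota/rate-limit errors, doubling
# the delay each attempt (1s, 2s, 4s). Non-retryable errors re-raise
# immediately, as does the final failed attempt.
import time

def is_retryable(exc: Exception) -> bool:
    text = str(exc).lower()
    return "429" in text or "quota" in text or "rate limit" in text

def call_with_retry(call, attempts: int = 3, base_delay: float = 1.0,
                    sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            if not is_retryable(exc) or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, then 2s, then 4s
```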
### Tool System

**Web Search (Tavily/Exa):**

- Factual information, current events, statistics
- Wikipedia, company info, people

**File Parser:**

- PDF, Excel, Word, CSV, text, images
- Handles uploaded files and local paths

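Multi-format support of this kind is typically an extension-based dispatch table. A sketch of that pattern, with only a stdlib text handler registered (the real module wires in PyPDF2, openpyxl, and python-docx; the registry and handler names here are illustrative):

```python
# Map file extensions to parser functions; unknown types return an error
# string instead of raising, matching the project's graceful-degradation style.
from pathlib import Path

_PARSERS = {}

def register(*extensions):
    def wrap(fn):
        for ext in extensions:
            _PARSERS[ext] = fn
        return fn
    return wrap

@register(".txt", ".csv")
def parse_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")

def parse_file(path: str) -> str:
    p = Path(path)
    parser = _PARSERS.get(p.suffix.lower())
    if parser is None:
        return f"error: unsupported file type {p.suffix!r}"
    return parser(p)
```

Adding a new format is then a single decorated function, e.g. a `@register(".pdf")` handler built on PyPDF2.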
**Calculator:**

- Safe expression evaluation
- Arithmetic, algebra, trigonometry, logarithms
- Functions: sqrt, sin, cos, log, abs, etc.

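One common way to build such a safe evaluator is to walk the parsed AST and allow only arithmetic nodes plus a whitelist of math functions, rather than calling `eval()` on raw input. The details below are an assumption about how a calculator like this could work, not the project's actual code:

```python
# Evaluate arithmetic expressions safely: only numeric constants, the
# listed operators, and whitelisted math functions are permitted.
import ast
import math
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
          "log": math.log, "abs": abs}

def _eval(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCS):
        return _FUNCS[node.func.id](*[_eval(a) for a in node.args])
    raise ValueError("disallowed expression")

def calculate(expression: str) -> dict:
    # Return an error dict instead of raising, per Stage 5's relaxed validation.
    try:
        return {"result": _eval(ast.parse(expression, mode="eval").body)}
    except (ValueError, SyntaxError, ZeroDivisionError) as exc:
        return {"error": str(exc)}
```

Anything outside the whitelist, such as attribute access or `__import__`, falls through to the `ValueError` branch and comes back as an error dict.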
**Vision:**

- Multimodal image/video analysis
- Describe content, identify objects, read text
- YouTube video understanding

### Performance Optimizations (Stage 5)

- Few-shot prompting for improved tool selection
- Graceful skip of vision questions when quota is exhausted
- Relaxed calculator validation (returns error dicts instead of crashing)
- Improved tool descriptions with "Use when..." guidance
- Config-based provider debugging

## GAIA Benchmark Results

**Baseline (Stage 4):** 10% accuracy (2/20 questions correct)

**Stage 5 Target:** 25% accuracy (5/20 questions correct)

- Status: Testing in progress
- Expected improvements from retry logic, Groq integration, and improved prompts

**Test Coverage:** 99 passing tests (~2 min 40 s runtime)

> **Note:** This project targets the **Course Leaderboard** (20 questions, 30% target). See the [GAIA Submission Guide](../agentbee/docs/gaia_submission_guide.md) for the distinction between the Course and Official GAIA leaderboards.

## Workflow

### Dev Record Workflow

**Philosophy:** Dev records are the single source of truth. CHANGELOG/PLAN/TODO are temporary workspace files.

**Dev Record Types:**

- 🔍 **Issue:** Problem-solving, bug fixes, error resolution
- 🔨 **Development:** Feature development, enhancements, new functionality

### Session Start Workflow

#### Phase 1: Planning (Explicit)

1. **Create or identify a dev record:** `dev/dev_YYMMDD_##_concise_title.md`
   - Choose a type: 🔍 Issue or 🔨 Development
2. **Create PLAN.md ONLY:** Use the `/plan` command or write it directly
   - Document the implementation approach, steps, and files to modify
   - DO NOT create TODO.md or CHANGELOG.md yet

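The `dev_YYMMDD_##_concise_title.md` naming convention above can be generated mechanically; this helper is a sketch, and the slug rules (lowercase, underscores) are assumptions about the convention:

```python
# Build a dev-record filename from a sequence number and a title,
# formatting the date as YYMMDD and slugging the title with underscores.
from datetime import date

def dev_record_name(seq: int, title: str, day=None) -> str:
    day = day or date.today()
    slug = "_".join(title.lower().split())
    return f"dev_{day:%y%m%d}_{seq:02d}_{slug}.md"
```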
#### Phase 2: Development (Automatic)

3. **Create TODO.md:** Populate it automatically as you start implementing
   - Track tasks in real time using the TodoWrite tool
   - Mark items in_progress/completed as you work
4. **Create CHANGELOG.md:** Populate it automatically as you make changes
   - Record file modifications/creations/deletions as they happen
5. **Work on the solution:** Update all three files during development

### Session End Workflow

#### Phase 3: Completion (Manual)

After the AI completes all work and updates PLAN/TODO/CHANGELOG:

- The AI stops and waits for user review (Checkpoint 3)
- The user reviews PLAN.md, TODO.md, and CHANGELOG.md
- The user manually runs `/update-dev dev_YYMMDD_##` when satisfied

When `/update-dev` runs, it:

1. Distills PLAN decisions → dev record "Key Decisions" section
2. Distills TODO deliverables → dev record "Outcome" section
3. Distills CHANGELOG changes → dev record "Changelog" section
4. Empties PLAN.md, TODO.md, and CHANGELOG.md back to their templates
5. Marks the dev record status as ✅ Resolved

### AI Context Loading Protocol

**MANDATORY - Execute in exact order. NO delegating to sub-agents for initial context.**

**Phase 1: Current State (What's happening NOW)**

1. **Read workspace files:**

   - `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
   - `PLAN.md` - Current implementation plan (if it exists)
   - `TODO.md` - Active task tracking (if it exists)

2. **Read actual outputs (CRITICAL - verify claims, don't trust summaries):**
   - Latest files in the `output/` folder (sorted by timestamp, newest first)
   - For GAIA projects: Read the latest `output/gaia_results_*.json` completely
     - Check `metadata.score_percent` and `metadata.correct_count`
     - Read ALL `results[].submitted_answer` entries to understand failure patterns
     - Identify error categories (vision failures, tool errors, wrong answers)
   - For test projects: Read the latest test output logs
   - **Purpose:** Ground truth of what ACTUALLY happened, not what was claimed

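Step 2 can be scripted. This sketch locates the newest results file and pulls out the metadata fields named above; the path and key names follow this README, while the minimal error handling and return shape are assumptions:

```python
# Find the newest gaia_results_*.json in output/ (lexicographic sort works
# for timestamped names) and summarize its ground-truth metadata.
import json
from pathlib import Path

def latest_gaia_results(output_dir: str = "output"):
    files = sorted(Path(output_dir).glob("gaia_results_*.json"))
    if not files:
        return None  # no ground truth recorded yet
    data = json.loads(files[-1].read_text(encoding="utf-8"))
    meta = data["metadata"]
    return {
        "file": files[-1].name,
        "score_percent": meta["score_percent"],
        "correct_count": meta["correct_count"],
        "answers": [r["submitted_answer"] for r in data["results"]],
    }
```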
**Phase 2: Recent History (What was done recently)**

3. **Read the last 3 dev records from the `dev/` folder:**
   - Sort by filename (newest `dev_YYMMDD_##_title.md` first)
   - Read: Problem Description, Key Decisions, Outcome, Changelog
   - **Cross-verify:** Compare dev record claims with actual output files
   - **Red flag:** If a dev record says "25% accuracy" but the latest JSON shows "0%", prioritize the JSON truth

**Phase 3: Project Structure (How it works)**

4. **Read README.md sections in order:**

   - Section 1: Overview (purpose, objectives)
   - Section 2: Architecture (tech stack, components, diagrams)
   - Section 3: Specification (current state, workflows, requirements)
   - Section 4: Workflow (this protocol)

5. **Read CLAUDE.md:**
   - Project-specific coding standards
   - Usually empty (inherits from the global ~/.claude/CLAUDE.md)

**Phase 4: Code Structure (Critical files)**

6. **Identify critical files from the README.md Architecture section:**
   - Note the main entry points (e.g., `app.py`)
   - Note core logic files (e.g., `src/agent/graph.py`, `src/agent/llm_client.py`)
   - Note tool implementations (e.g., `src/tools/*.py`)
   - **DO NOT read these yet** - only note their locations for later reference

**Verification Checklist (Before claiming "I have context"):**

- [ ] I personally read CHANGELOG.md, PLAN.md, and TODO.md (not delegated)
- [ ] I personally read the latest output files (JSON results, test logs, etc.)
- [ ] I know the ACTUAL current accuracy/status from the output files
- [ ] I read the last 3 dev records and cross-verified their claims against output data
- [ ] I read README.md sections 1-4 completely
- [ ] I can answer: "What is the current status and why?"
- [ ] I can answer: "What were the last 3 major changes and their outcomes?"
- [ ] I can answer: "What specific problems exist based on the latest outputs?"

**Anti-Patterns (NEVER do these):**

- ❌ Delegate initial context loading to Explore/Task agents
- ❌ Trust dev record claims without verifying against output files
- ❌ Skip reading actual output data (JSON results, logs, test outputs)
- ❌ Claim "I have context" after only reading summaries
- ❌ Read code files before understanding the current state from the outputs

**Context Priority:** Latest Outputs (ground truth) > CHANGELOG (active work) > Dev Records (history) > README (structure)