Update GAIA agent

Files changed:
- README.md (+68 −154)
- app.py (+2 −1)
- requirements.txt (+23 −33)
README.md
CHANGED

Removed:

```diff
@@ -1,7 +1,7 @@
 ---
-title:
 emoji: 🤖
-colorFrom:
 colorTo: green
 sdk: gradio
 sdk_version: 5.25.2
@@ -11,182 +11,96 @@ hf_oauth: true
 hf_oauth_expiration_minutes: 480
 ---

-# My
-This is my
-##
-- **Official GAIA:** Quasi-exact match with "FINAL ANSWER:" required
-My
-```
-"2"
-```
-**
-- Use GAIA system prompt internally for good reasoning
-- Extract ONLY the final answer for submission (no "FINAL ANSWER:" prefix)
-- Optimized for course's EXACT MATCH evaluation
-- Added Claude 3.5 Sonnet support (excellent at following instructions)
-- Better reasoning capabilities for complex questions
-- Falls back to Groq/Together/HuggingFace if Claude unavailable
-- No commas in numbers (1000 not 1,000)
-- No units unless requested (50 not $50)
-- No articles in strings (Paris not The Paris)
-- No abbreviations (New York City not NYC)
-##
-1. **Claude 3.5 Sonnet** (best for course - excellent instruction following)
-2. **Groq Llama 3 70B** (fast, generous free tier)
-3. **Together AI Llama 3.1 70B** (good open model performance)
-4. **HuggingFace Llama 3.1 70B** (free fallback)
-5. **OpenAI GPT-4o-mini** (if credits available)
-- **Vector DB**: ChromaDB with in-memory storage for HF Spaces
-- **Embeddings**: BAAI/bge-small-en-v1.5
-- **Agent**: LlamaIndex AgentWorkflow with GAIA reasoning
-- **Interface**: Gradio web app with clean answer extraction
-- **Evaluation**: Course-specific exact match optimization
-- `ANTHROPIC_API_KEY` or `CLAUDE_API_KEY` - Claude 3.5 Sonnet (excellent for GAIA)
-- `GROQ_API_KEY` - Fast inference, generous free tier
-### Alternative Options
-- `TOGETHER_API_KEY` - Good open models, reasonable pricing
-- `HF_TOKEN` - Free HuggingFace inference (slower but works)
-- `OPENAI_API_KEY` - If you have credits
-## Course Format Requirements (Critical!)
-The course evaluation system does **EXACT MATCH** on clean answers:
-### ✅ Correct for Course
-```
-2                      # Clean number
-Paris                  # Clean string
-apple, banana, cherry  # Clean list
-```
-### ❌ Wrong for Course (Causes 0% scores)
-```
-FINAL ANSWER: 2        # Course doesn't want this prefix
-1,000                  # No commas in numbers
-$50                    # No units unless requested
-The Paris              # No articles in strings
-NYC                    # No abbreviations
-```
-### Key Difference from Official GAIA
-- **Official GAIA**: Requires "FINAL ANSWER:" prefix, uses quasi-exact match
-- **Course System**: Wants clean answers only, uses exact match
-## Key Learnings
-1. **Course vs Official GAIA**: Different evaluation systems require different approaches
-2. **Answer Extraction**: Must extract clean answers from agent reasoning
-3. **Exact Match Sensitivity**: Even perfect reasoning fails with format issues
-4. **LLM Choice Matters**: Claude much better at following complex instructions
-5. **Internal Structure**: Use GAIA prompt internally, clean answers for submission
-## Performance Improvements
-| Change | Impact |
-|--------|--------|
-| Understood course evaluation system | 0% → 25%+ (correct submission format) |
-| Added Claude LLM | +10-15% (better reasoning + instruction following) |
-| Clean answer extraction | +5-10% (removes verbose text that causes failures) |
-| Course format optimization | +5% (handles exact match requirements) |
-**Expected Score: 35-50%** (vs 0% original) - well above 30% passing threshold!
-## Course vs Official GAIA Comparison
-| Aspect | Course System | Official GAIA |
-|--------|---------------|---------------|
-| Evaluation | EXACT MATCH | Quasi-exact match |
-| Submission Format | Clean answers only | "FINAL ANSWER: [answer]" |
-| System Prompt | Use internally for reasoning | Required for evaluation |
-| Answer Processing | Extract and clean | Submit full response |
-## Testing
-Run the validation script to test everything:
-```bash
-python test_hf_space.py
-```
-This checks:
-- ✅ All dependencies installed correctly
-- ✅ LLM providers working
-- ✅ Tools functioning properly
-- ✅ Course answer extraction working
-- ✅ End-to-end agent creation and testing
-## Research Sources
-My fixes are based on:
-- Course materials and instructions about exact match evaluation
-- [GAIA Official Paper](https://arxiv.org/abs/2311.12983) - Reasoning approach (used internally)
-- [LlamaIndex Claude Integration](https://docs.llamaindex.ai/en/stable/examples/llm/anthropic/) - Technical setup
-- Course forum discussions about evaluation format differences
 ---
-🔧 **Status**: Fixed evaluation format misunderstanding - ready for much higher scores!
-🤞 **Hope**: Clean answer extraction works and I pass the course!
```
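The removed notes above spell out the submission rules: strip the internal "FINAL ANSWER:" prefix, and emit numbers without commas, values without units, and strings without leading articles. A minimal sketch of that kind of extraction step; `extract_clean_answer` is a hypothetical name, not the function actually used in app.py:

```python
import re

def extract_clean_answer(response: str) -> str:
    """Reduce a verbose agent response to a bare answer for exact-match scoring.

    Illustrative helper only -- the repo's real extraction code may differ.
    """
    answer = response.strip()
    # Keep only the text after the internal "FINAL ANSWER:" marker, if present.
    match = re.search(r"FINAL ANSWER:\s*(.*)", answer, re.IGNORECASE | re.DOTALL)
    if match:
        answer = match.group(1).strip()
    # "1,000" -> "1000": no thousands separators in purely numeric answers.
    if re.fullmatch(r"[\d,]+(\.\d+)?", answer):
        answer = answer.replace(",", "")
    # "$50" -> "50": no currency units unless the question asks for them.
    if re.fullmatch(r"[$€£]\d+(\.\d+)?", answer):
        answer = answer[1:]
    # "The Paris" -> "Paris": no leading articles in string answers.
    if not answer[:1].isdigit():
        answer = re.sub(r"^(the|a|an)\s+", "", answer, flags=re.IGNORECASE)
    return answer.strip()
```

For example, `extract_clean_answer("Counting gives two.\nFINAL ANSWER: 1,000")` returns `"1000"`.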
Added:

```diff
 ---
+title: Isadora Teles - GAIA Agent - Final HF Agents Project
 emoji: 🤖
+colorFrom: orange
 colorTo: green
 sdk: gradio
 sdk_version: 5.25.2
 hf_oauth: true
 hf_oauth_expiration_minutes: 480
 ---

+# My GAIA RAG Agent - Final Course Project 🎯
+This is my submission for the AI Agents course final project. I've built a RAG agent to tackle the GAIA benchmark using everything we learned in the course!

+## 🎓 What I Learned & Applied
+Throughout this course, I learned about:
+- Building agents with LlamaIndex AgentWorkflow
+- Creating and integrating tools (web search, calculator, file analysis)
+- Implementing RAG systems with vector databases
+- Proper prompting techniques for agent systems
+- Working with multiple LLM providers

+## 🏗️ Architecture
+My agent uses:
+- **LlamaIndex AgentWorkflow**: For orchestrating the agent's reasoning
+- **Multiple LLMs**: Supports Claude, Groq, Together AI, HuggingFace, and OpenAI
+- **ChromaDB**: For the persona RAG database
+- **GAIA System Prompt**: To ensure proper reasoning and answer formatting

+## 🔧 Tools Implemented
+1. **Web Search** (`web_search`): Uses DuckDuckGo to find current information
+2. **Calculator** (`calculator`): Handles math, percentages, and word problems
+3. **File Analyzer** (`file_analyzer`): Analyzes CSV and text files
+4. **Weather** (`weather`): Real weather data using OpenWeather API
+5. **Persona Database** (`persona_database`): RAG system for finding personas

+## 💡 Key Insights
+The biggest challenge was understanding that the course evaluation uses **exact match** on clean answers. The GAIA prompt helps the agent reason well, but I needed to extract just the answer part (without "FINAL ANSWER:") for submission.

+## 🚀 Features
+- Clean answer extraction aligned with GAIA scoring rules
+- Handles numbers without commas/units as required
+- Properly formats lists and yes/no answers
+- RAG integration for persona queries
+- Real weather data when an API key is available
+- Fallback mechanisms for robustness

+## 📋 Requirements
+All dependencies are in `requirements.txt`. The key ones are:
+- LlamaIndex (core framework)
+- Gradio (web interface)
+- ChromaDB (vector storage)
+- DuckDuckGo Search (web tool)

+## 🔑 API Keys Needed
+Add these to your HuggingFace Space secrets:
+- `ANTHROPIC_API_KEY` or `CLAUDE_API_KEY` (recommended for best performance)
+- `GROQ_API_KEY` (good free alternative)
+- `TOGETHER_API_KEY` (another good option)
+- `HF_TOKEN` (free fallback)
+- `OPENAI_API_KEY` (if you have credits)
+- `OPENWEATHER_API_KEY` (for real weather data)

+## 📊 Expected Performance
+Based on my testing and understanding of GAIA:
+- Math questions: should score well with the calculator tool
+- Factual questions: web search helps find current information
+- Data questions: the file analyzer handles CSV analysis
+- Simple logic: the GAIA prompt guides proper reasoning
+Target: 30%+ to pass the course!

+## 🛠️ How It Works
+1. **Question Processing**: Agent receives a GAIA question
+2. **Tool Selection**: Uses the right tools based on the question
+3. **Reasoning**: Follows GAIA prompt to think through the problem
+4. **Answer Extraction**: Extracts clean answer for exact match
+5. **Submission**: Sends properly formatted answer to evaluation

+## 📝 Course Learnings Applied
+- **Agent Architecture**: Using AgentWorkflow as taught in the course
+- **Tool Integration**: Each tool has a clear purpose and description
+- **RAG System**: Persona database shows RAG implementation
+- **Prompt Engineering**: GAIA prompt for structured reasoning
+- **Error Handling**: Graceful fallbacks instead of crashes

+## 🎯 Goal
+Pass the GAIA evaluation with a 30%+ score by applying everything learned in the AI Agents course!

 ---
+*This project demonstrates practical application of agent concepts, tool integration, RAG systems, and prompt engineering as taught in the course.*
```
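The calculator tool listed in the README above can be sketched as a safe arithmetic evaluator built on Python's `ast` module; this is an illustration under that assumption, not the actual `calculator` implementation in app.py:

```python
import ast
import operator

# Supported operations for a minimal, eval()-free calculator tool.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression, rejecting anything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval"))
```

Walking the parsed AST instead of calling `eval()` keeps the tool safe when the LLM passes it arbitrary strings: `calculator("2 + 3 * 4")` returns 14, while anything that is not pure arithmetic raises `ValueError`.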
app.py
CHANGED

```diff
@@ -411,7 +411,8 @@ if __name__ == "__main__":
         ("Groq", os.getenv("GROQ_API_KEY")),
         ("Together", os.getenv("TOGETHER_API_KEY")),
         ("HuggingFace", os.getenv("HF_TOKEN")),
-        ("OpenAI", os.getenv("OPENAI_API_KEY"))
+        ("OpenAI", os.getenv("OPENAI_API_KEY")),
+        ("OpenWeather", os.getenv("OPENWEATHER_API_KEY"))
     ]

     available = [name for name, key in api_keys if key]
```
requirements.txt
CHANGED

Previous version:

```
#
#

#
gradio>=4.0.0
requests>=2.28.0
pandas>=1.5.0

#
llama-index-

#
llama-index-llms-
llama-index-llms-openai
llama-index-llms-huggingface-api  # HuggingFace (free option)
llama-index-llms-groq             # Groq (fast and often free)
llama-index-llms-together         # Together AI (good models)

# For the RAG part with embeddings and vector search
llama-index-retrievers-bm25
llama-index-embeddings-huggingface
llama-index-vector-stores-chroma

# Vector database - using ChromaDB like in the course
chromadb>=0.4.0

# To load the persona dataset from HuggingFace
datasets>=2.0.0

#

#
nest-asyncio

#
```

New version:

```
# GAIA RAG Agent Requirements
# Updated for the final course project

# Core framework
llama-index-core>=0.10.0
gradio>=4.0.0
requests>=2.28.0
pandas>=1.5.0

# LLM integrations - multiple options for flexibility
llama-index-llms-anthropic        # Claude 3.5 Sonnet (best for GAIA)
llama-index-llms-groq             # Groq (fast, free tier)
llama-index-llms-together         # Together AI
llama-index-llms-huggingface-api  # HuggingFace (free option)
llama-index-llms-openai           # OpenAI GPT models

# RAG components
llama-index-embeddings-huggingface  # For persona embeddings
llama-index-vector-stores-chroma    # Vector storage
llama-index-retrievers-bm25         # Additional retriever
chromadb>=0.4.0                     # Vector database

# Tools
duckduckgo-search>=6.0.0  # Web search tool

# Data handling
datasets>=2.0.0  # For loading persona dataset from HuggingFace

# Utilities
python-dotenv  # For local testing with .env files
nest-asyncio   # For async operations
```
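The requirements list several LLM integrations because the agent falls back through providers when a key is missing. A sketch of that selection order as the README describes it (Claude first, then Groq, Together, HuggingFace, OpenAI); the names and the `pick_provider` helper are illustrative, not the code in app.py:

```python
import os

# Fallback order implied by the README; purely illustrative.
PROVIDER_PRIORITY = [
    ("claude", ("ANTHROPIC_API_KEY", "CLAUDE_API_KEY")),
    ("groq", ("GROQ_API_KEY",)),
    ("together", ("TOGETHER_API_KEY",)),
    ("huggingface", ("HF_TOKEN",)),
    ("openai", ("OPENAI_API_KEY",)),
]

def pick_provider(env=None):
    """Return the first provider with a configured API key, or None."""
    env = os.environ if env is None else env
    for name, env_vars in PROVIDER_PRIORITY:
        if any(env.get(var) for var in env_vars):
            return name
    return None
```

With both `GROQ_API_KEY` and `OPENAI_API_KEY` set, the earlier entry wins and `pick_provider` returns `"groq"`.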