Update GAIA agent

Files changed:
- README.md (+68 −154)
- app.py (+2 −1)
- requirements.txt (+23 −33)
README.md
CHANGED

Removed:

```diff
@@ -1,7 +1,7 @@
 ---
-title:
 emoji: 🤖
-colorFrom:
 colorTo: green
 sdk: gradio
 sdk_version: 5.25.2
@@ -11,182 +11,96 @@ hf_oauth: true
 hf_oauth_expiration_minutes: 480
 ---

-# My
-This is my
-##
-- **Official GAIA:** Quasi-exact match with "FINAL ANSWER:" required
-My
-```
-"2"
-```
-**
-- Use GAIA system prompt internally for good reasoning
-- Extract ONLY the final answer for submission (no "FINAL ANSWER:" prefix)
-- Optimized for course's EXACT MATCH evaluation
-- Added Claude 3.5 Sonnet support (excellent at following instructions)
-- Better reasoning capabilities for complex questions
-- Falls back to Groq/Together/HuggingFace if Claude unavailable
-- No commas in numbers (1000 not 1,000)
-- No units unless requested (50 not $50)
-- No articles in strings (Paris not The Paris)
-- No abbreviations (New York City not NYC)
-##
-1. **Claude 3.5 Sonnet** (best for course - excellent instruction following)
-2. **Groq Llama 3 70B** (fast, generous free tier)
-3. **Together AI Llama 3.1 70B** (good open model performance)
-4. **HuggingFace Llama 3.1 70B** (free fallback)
-5. **OpenAI GPT-4o-mini** (if credits available)
-- **Vector DB**: ChromaDB with in-memory storage for HF Spaces
-- **Embeddings**: BAAI/bge-small-en-v1.5
-- **Agent**: LlamaIndex AgentWorkflow with GAIA reasoning
-- **Interface**: Gradio web app with clean answer extraction
-- **Evaluation**: Course-specific exact match optimization
-- `ANTHROPIC_API_KEY` or `CLAUDE_API_KEY` - Claude 3.5 Sonnet (excellent for GAIA)
-- `GROQ_API_KEY` - Fast inference, generous free tier
-### Alternative Options
-- `TOGETHER_API_KEY` - Good open models, reasonable pricing
-- `HF_TOKEN` - Free HuggingFace inference (slower but works)
-- `OPENAI_API_KEY` - If you have credits
-## Course Format Requirements (Critical!)
-The course evaluation system does **EXACT MATCH** on clean answers:
-### ✅ Correct for Course
-```
-2                      # Clean number
-Paris                  # Clean string
-apple, banana, cherry  # Clean list
-```
-### ❌ Wrong for Course (Causes 0% scores)
-```
-FINAL ANSWER: 2        # Course doesn't want this prefix
-1,000                  # No commas in numbers
-$50                    # No units unless requested
-The Paris              # No articles in strings
-NYC                    # No abbreviations
-```
-### Key Difference from Official GAIA
-- **Official GAIA**: Requires "FINAL ANSWER:" prefix, uses quasi-exact match
-- **Course System**: Wants clean answers only, uses exact match
-## Key Learnings
-1. **Course vs Official GAIA**: Different evaluation systems require different approaches
-2. **Answer Extraction**: Must extract clean answers from agent reasoning
-3. **Exact Match Sensitivity**: Even perfect reasoning fails with format issues
-4. **LLM Choice Matters**: Claude much better at following complex instructions
-5. **Internal Structure**: Use GAIA prompt internally, clean answers for submission
-## Performance Improvements
-| Change | Impact |
-|--------|--------|
-| Understood course evaluation system | 0% → 25%+ (correct submission format) |
-| Added Claude LLM | +10-15% (better reasoning + instruction following) |
-| Clean answer extraction | +5-10% (removes verbose text that causes failures) |
-| Course format optimization | +5% (handles exact match requirements) |
-**Expected Score: 35-50%** (vs 0% original) - well above 30% passing threshold!
-## Course vs Official GAIA Comparison
-| Aspect | Course System | Official GAIA |
-|--------|---------------|---------------|
-| Evaluation | EXACT MATCH | Quasi-exact match |
-| Submission Format | Clean answers only | "FINAL ANSWER: [answer]" |
-| System Prompt | Use internally for reasoning | Required for evaluation |
-| Answer Processing | Extract and clean | Submit full response |
-## Testing
-Run the validation script to test everything:
-```bash
-python test_hf_space.py
-```
-This checks:
-- ✅ All dependencies installed correctly
-- ✅ LLM providers working
-- ✅ Tools functioning properly
-- ✅ Course answer extraction working
-- ✅ End-to-end agent creation and testing
-## Research Sources
-My fixes are based on:
-- Course materials and instructions about exact match evaluation
-- [GAIA Official Paper](https://arxiv.org/abs/2311.12983) - Reasoning approach (used internally)
-- [LlamaIndex Claude Integration](https://docs.llamaindex.ai/en/stable/examples/llm/anthropic/) - Technical setup
-- Course forum discussions about evaluation format differences
 ---
-🔧 **Status**: Fixed evaluation format misunderstanding - ready for much higher scores!
-🤞 **Hope**: Clean answer extraction works and I pass the course!
```
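The removed notes above spell out the submission rules: strip the internal "FINAL ANSWER:" prefix, and emit numbers without commas, values without units, and strings without leading articles. A minimal sketch of that kind of extraction step; `extract_clean_answer` is a hypothetical name, not the function actually used in app.py:

```python
import re

def extract_clean_answer(response: str) -> str:
    """Reduce a verbose agent response to a bare answer for exact-match scoring.

    Illustrative helper only -- the repo's real extraction code may differ.
    """
    answer = response.strip()
    # Keep only the text after the internal "FINAL ANSWER:" marker, if present.
    match = re.search(r"FINAL ANSWER:\s*(.*)", answer, re.IGNORECASE | re.DOTALL)
    if match:
        answer = match.group(1).strip()
    # "1,000" -> "1000": no thousands separators in purely numeric answers.
    if re.fullmatch(r"[\d,]+(\.\d+)?", answer):
        answer = answer.replace(",", "")
    # "$50" -> "50": no currency units unless the question asks for them.
    if re.fullmatch(r"[$€£]\d+(\.\d+)?", answer):
        answer = answer[1:]
    # "The Paris" -> "Paris": no leading articles in string answers.
    if not answer[:1].isdigit():
        answer = re.sub(r"^(the|a|an)\s+", "", answer, flags=re.IGNORECASE)
    return answer.strip()
```

For example, `extract_clean_answer("Counting gives two.\nFINAL ANSWER: 1,000")` returns `"1000"`.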
Added:

```diff
 ---
+title: Isadora Teles - GAIA Agent - Final HF Agents Project
 emoji: 🤖
+colorFrom: orange
 colorTo: green
 sdk: gradio
 sdk_version: 5.25.2
 hf_oauth: true
 hf_oauth_expiration_minutes: 480
 ---

+# My GAIA RAG Agent - Final Course Project 🎯
+This is my submission for the AI Agents course final project. I've built a RAG agent to tackle the GAIA benchmark using everything we learned in the course!

+## 🎓 What I Learned & Applied
+Throughout this course, I learned about:
+- Building agents with LlamaIndex AgentWorkflow
+- Creating and integrating tools (web search, calculator, file analysis)
+- Implementing RAG systems with vector databases
+- Proper prompting techniques for agent systems
+- Working with multiple LLM providers

+## 🏗️ Architecture
+My agent uses:
+- **LlamaIndex AgentWorkflow**: For orchestrating the agent's reasoning
+- **Multiple LLMs**: Supports Claude, Groq, Together AI, HuggingFace, and OpenAI
+- **ChromaDB**: For the persona RAG database
+- **GAIA System Prompt**: To ensure proper reasoning and answer formatting

+## 🔧 Tools Implemented
+1. **Web Search** (`web_search`): Uses DuckDuckGo to find current information
+2. **Calculator** (`calculator`): Handles math, percentages, and word problems
+3. **File Analyzer** (`file_analyzer`): Analyzes CSV and text files
+4. **Weather** (`weather`): Real weather data using OpenWeather API
+5. **Persona Database** (`persona_database`): RAG system for finding personas

+## 💡 Key Insights
+The biggest challenge was understanding that the course evaluation uses **exact match** on clean answers. The GAIA prompt helps the agent reason well, but I needed to extract just the answer part (without "FINAL ANSWER:") for submission.

+## 🚀 Features
+- Clean answer extraction aligned with GAIA scoring rules
+- Handles numbers without commas/units as required
+- Properly formats lists and yes/no answers
+- RAG integration for persona queries
+- Real weather data when an API key is available
+- Fallback mechanisms for robustness

+## 📋 Requirements
+All dependencies are in `requirements.txt`. The key ones are:
+- LlamaIndex (core framework)
+- Gradio (web interface)
+- ChromaDB (vector storage)
+- DuckDuckGo Search (web tool)

+## 🔑 API Keys Needed
+Add these to your HuggingFace Space secrets:
+- `ANTHROPIC_API_KEY` or `CLAUDE_API_KEY` (recommended for best performance)
+- `GROQ_API_KEY` (good free alternative)
+- `TOGETHER_API_KEY` (another good option)
+- `HF_TOKEN` (free fallback)
+- `OPENAI_API_KEY` (if you have credits)
+- `OPENWEATHER_API_KEY` (for real weather data)

+## 📊 Expected Performance
+Based on my testing and understanding of GAIA:
+- Math questions: should score well with the calculator tool
+- Factual questions: web search helps find current information
+- Data questions: the file analyzer handles CSV analysis
+- Simple logic: the GAIA prompt guides proper reasoning
+Target: 30%+ to pass the course!

+## 🛠️ How It Works
+1. **Question Processing**: Agent receives a GAIA question
+2. **Tool Selection**: Uses the right tools based on the question
+3. **Reasoning**: Follows GAIA prompt to think through the problem
+4. **Answer Extraction**: Extracts clean answer for exact match
+5. **Submission**: Sends properly formatted answer to evaluation

+## 📝 Course Learnings Applied
+- **Agent Architecture**: Using AgentWorkflow as taught in the course
+- **Tool Integration**: Each tool has a clear purpose and description
+- **RAG System**: Persona database shows RAG implementation
+- **Prompt Engineering**: GAIA prompt for structured reasoning
+- **Error Handling**: Graceful fallbacks instead of crashes

+## 🎯 Goal
+Pass the GAIA evaluation with a 30%+ score by applying everything learned in the AI Agents course!

 ---
+*This project demonstrates practical application of agent concepts, tool integration, RAG systems, and prompt engineering as taught in the course.*
```
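The calculator tool listed in the README above can be sketched as a safe arithmetic evaluator built on Python's `ast` module; this is an illustration under that assumption, not the actual `calculator` implementation in app.py:

```python
import ast
import operator

# Supported operations for a minimal, eval()-free calculator tool.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression, rejecting anything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval"))
```

Walking the parsed AST instead of calling `eval()` keeps the tool safe when the LLM passes it arbitrary strings: `calculator("2 + 3 * 4")` returns 14, while anything that is not pure arithmetic raises `ValueError`.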
app.py
CHANGED

```diff
@@ -411,7 +411,8 @@ if __name__ == "__main__":
         ("Groq", os.getenv("GROQ_API_KEY")),
         ("Together", os.getenv("TOGETHER_API_KEY")),
         ("HuggingFace", os.getenv("HF_TOKEN")),
-        ("OpenAI", os.getenv("OPENAI_API_KEY"))
+        ("OpenAI", os.getenv("OPENAI_API_KEY")),
+        ("OpenWeather", os.getenv("OPENWEATHER_API_KEY"))
     ]

     available = [name for name, key in api_keys if key]
```
requirements.txt
CHANGED

Previous version:

```
#
#

#
gradio>=4.0.0
requests>=2.28.0
pandas>=1.5.0

#
llama-index-

#
llama-index-llms-
llama-index-llms-openai
llama-index-llms-huggingface-api  # HuggingFace (free option)
llama-index-llms-groq             # Groq (fast and often free)
llama-index-llms-together         # Together AI (good models)

# For the RAG part with embeddings and vector search
llama-index-retrievers-bm25
llama-index-embeddings-huggingface
llama-index-vector-stores-chroma

# Vector database - using ChromaDB like in the course
chromadb>=0.4.0

# To load the persona dataset from HuggingFace
datasets>=2.0.0

#

#
nest-asyncio

#
```

New version:

```
# GAIA RAG Agent Requirements
# Updated for the final course project

# Core framework
llama-index-core>=0.10.0
gradio>=4.0.0
requests>=2.28.0
pandas>=1.5.0

# LLM integrations - multiple options for flexibility
llama-index-llms-anthropic        # Claude 3.5 Sonnet (best for GAIA)
llama-index-llms-groq             # Groq (fast, free tier)
llama-index-llms-together         # Together AI
llama-index-llms-huggingface-api  # HuggingFace (free option)
llama-index-llms-openai           # OpenAI GPT models

# RAG components
llama-index-embeddings-huggingface  # For persona embeddings
llama-index-vector-stores-chroma    # Vector storage
llama-index-retrievers-bm25         # Additional retriever
chromadb>=0.4.0                     # Vector database

# Tools
duckduckgo-search>=6.0.0  # Web search tool

# Data handling
datasets>=2.0.0  # For loading persona dataset from HuggingFace

# Utilities
python-dotenv  # For local testing with .env files
nest-asyncio   # For async operations
```
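The requirements list several LLM integrations because the agent falls back through providers when a key is missing. A sketch of that selection order as the README describes it (Claude first, then Groq, Together, HuggingFace, OpenAI); the names and the `pick_provider` helper are illustrative, not the code in app.py:

```python
import os

# Fallback order implied by the README; purely illustrative.
PROVIDER_PRIORITY = [
    ("claude", ("ANTHROPIC_API_KEY", "CLAUDE_API_KEY")),
    ("groq", ("GROQ_API_KEY",)),
    ("together", ("TOGETHER_API_KEY",)),
    ("huggingface", ("HF_TOKEN",)),
    ("openai", ("OPENAI_API_KEY",)),
]

def pick_provider(env=None):
    """Return the first provider with a configured API key, or None."""
    env = os.environ if env is None else env
    for name, env_vars in PROVIDER_PRIORITY:
        if any(env.get(var) for var in env_vars):
            return name
    return None
```

With both `GROQ_API_KEY` and `OPENAI_API_KEY` set, the earlier entry wins and `pick_provider` returns `"groq"`.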