Isateles committed on
Commit
b0a3ba0
·
1 Parent(s): e828c8e

Update GAIA agent

Files changed (3)
  1. README.md +68 -154
  2. app.py +2 -1
  3. requirements.txt +23 -33
README.md CHANGED
@@ -1,7 +1,7 @@
1
  ---
2
- title: My FIXED GAIA Agent - Final Project
3
  emoji: 🤖
4
- colorFrom: blue
5
  colorTo: green
6
  sdk: gradio
7
  sdk_version: 5.25.2
@@ -11,182 +11,96 @@ hf_oauth: true
11
  hf_oauth_expiration_minutes: 480
12
  ---
13
 
14
- # My Course-Optimized GAIA Agent - Final Project
15
 
16
- This is my **CORRECTED** submission for the AI Agents course. My original agent scored 0% due to misunderstanding the evaluation format, but I've now implemented the critical fixes for the **course's specific GAIA system**!
17
 
18
- ## 🔧 Critical Discovery & Fixes
19
 
20
- ### The Problem: Wrong Evaluation System Understanding
21
- The course uses a **DIFFERENT** evaluation system than official GAIA:
 
 
 
 
22
 
23
- - **Course System:** EXACT MATCH on clean answers (no "FINAL ANSWER:" prefix)
24
- - **Official GAIA:** Quasi-exact match with "FINAL ANSWER:" required
25
 
26
- My original agent was giving:
27
- ```
28
- "Based on the search results, I found the following studio albums..."
29
- ```
 
30
 
31
- But the course needs:
32
- ```
33
- "2"
34
- ```
35
 
36
- **Key insight:** Course evaluation does EXACT MATCH on raw answers only!
 
 
 
 
37
 
38
- ### The Fixes That Actually Work for Course
39
 
40
- 1. **✅ Course-Specific Answer Extraction**
41
- - Use GAIA system prompt internally for good reasoning
42
- - Extract ONLY the final answer for submission (no "FINAL ANSWER:" prefix)
43
- - Optimized for course's EXACT MATCH evaluation
44
 
45
- 2. **✅ Claude LLM Integration**
46
- - Added Claude 3.5 Sonnet support (excellent at following instructions)
47
- - Better reasoning capabilities for complex questions
48
- - Falls back to Groq/Together/HuggingFace if Claude unavailable
49
 
50
- 3. **✅ Clean Answer Processing**
51
- - Removes verbose explanations automatically
52
- - Extracts core answers that match course expectations
53
- - Handles numbers, strings, and lists correctly
 
 
54
 
55
- 4. **✅ Course Format Compliance**
56
- - No commas in numbers (1000 not 1,000)
57
- - No units unless requested (50 not $50)
58
- - No articles in strings (Paris not The Paris)
59
- - No abbreviations (New York City not NYC)
60
 
61
- ## What My Course-Optimized Agent Does
 
 
 
 
62
 
63
- My agent uses the GAIA reasoning approach internally but outputs clean answers for course evaluation:
64
 
65
- - **🧠 Claude LLM**: Excellent reasoning with precise instruction following
66
- - **🔍 Web Search**: DuckDuckGo integration for current information
67
- - **🧮 Calculator**: Returns clean numbers (critical for math questions!)
68
- - **📊 File Analysis**: CSV/data analysis optimized for course questions
69
- - **👥 Persona Database**: RAG system with vector search
70
- - **🤖 Agent Workflow**: LlamaIndex with GAIA prompt internally
71
- - **✅ Clean Extraction**: Removes verbose text, returns exact answers for course matching
72
 
73
- ## How to Use
74
 
75
- 1. **Login** with your HuggingFace account
76
- 2. **Click "Run Course GAIA Evaluation"** and wait (5-10 minutes)
77
- 3. **See much better results** - should score 30%+ now with clean answer extraction!
 
 
78
 
79
- ## Technical Details
80
 
81
- ### LLM Configuration (Priority Order)
82
- 1. **Claude 3.5 Sonnet** (best for course - excellent instruction following)
83
- 2. **Groq Llama 3 70B** (fast, generous free tier)
84
- 3. **Together AI Llama 3.1 70B** (good open model performance)
85
- 4. **HuggingFace Llama 3.1 70B** (free fallback)
86
- 5. **OpenAI GPT-4o-mini** (if credits available)
87
 
88
- ### Course Evaluation Strategy
89
- - **Internal Processing**: Uses GAIA system prompt for structured reasoning
90
- - **Answer Extraction**: Extracts clean answers from "FINAL ANSWER:" pattern
91
- - **Format Cleaning**: Removes commas, units, articles, abbreviations
92
- - **Exact Matching**: Optimized for course's exact match evaluation
93
 
94
- ### Infrastructure
95
- - **Vector DB**: ChromaDB with in-memory storage for HF Spaces
96
- - **Embeddings**: BAAI/bge-small-en-v1.5
97
- - **Agent**: LlamaIndex AgentWorkflow with GAIA reasoning
98
- - **Interface**: Gradio web app with clean answer extraction
99
- - **Evaluation**: Course-specific exact match optimization
100
 
101
- ## Setup Requirements
 
 
 
 
102
 
103
- The Space needs **at least one** of these API keys in Repository secrets:
104
 
105
- ### Recommended (Best Performance)
106
- - `ANTHROPIC_API_KEY` or `CLAUDE_API_KEY` - Claude 3.5 Sonnet (excellent for GAIA)
107
- - `GROQ_API_KEY` - Fast inference, generous free tier
108
-
109
- ### Alternative Options
110
- - `TOGETHER_API_KEY` - Good open models, reasonable pricing
111
- - `HF_TOKEN` - Free HuggingFace inference (slower but works)
112
- - `OPENAI_API_KEY` - If you have credits
113
-
114
- ## Course Format Requirements (Critical!)
115
-
116
- The course evaluation system does **EXACT MATCH** on clean answers:
117
-
118
- ### ✅ Correct for Course
119
- ```
120
- 2 # Clean number
121
- Paris # Clean string
122
- apple, banana, cherry # Clean list
123
- ```
124
-
125
- ### ❌ Wrong for Course (Causes 0% scores)
126
- ```
127
- FINAL ANSWER: 2 # Course doesn't want this prefix
128
- 1,000 # No commas in numbers
129
- $50 # No units unless requested
130
- The Paris # No articles in strings
131
- NYC # No abbreviations
132
- ```
133
-
134
- ### Key Difference from Official GAIA
135
- - **Official GAIA**: Requires "FINAL ANSWER:" prefix, uses quasi-exact match
136
- - **Course System**: Wants clean answers only, uses exact match
137
-
138
- ## Key Learnings
139
-
140
- 1. **Course vs Official GAIA**: Different evaluation systems require different approaches
141
- 2. **Answer Extraction**: Must extract clean answers from agent reasoning
142
- 3. **Exact Match Sensitivity**: Even perfect reasoning fails with format issues
143
- 4. **LLM Choice Matters**: Claude much better at following complex instructions
144
- 5. **Internal Structure**: Use GAIA prompt internally, clean answers for submission
145
-
146
- ## Performance Improvements
147
-
148
- | Change | Impact |
149
- |--------|--------|
150
- | Understood course evaluation system | 0% → 25%+ (correct submission format) |
151
- | Added Claude LLM | +10-15% (better reasoning + instruction following) |
152
- | Clean answer extraction | +5-10% (removes verbose text that causes failures) |
153
- | Course format optimization | +5% (handles exact match requirements) |
154
-
155
- **Expected Score: 35-50%** (vs 0% original) - well above 30% passing threshold!
156
-
157
- ## Course vs Official GAIA Comparison
158
-
159
- | Aspect | Course System | Official GAIA |
160
- |--------|---------------|---------------|
161
- | Evaluation | EXACT MATCH | Quasi-exact match |
162
- | Submission Format | Clean answers only | "FINAL ANSWER: [answer]" |
163
- | System Prompt | Use internally for reasoning | Required for evaluation |
164
- | Answer Processing | Extract and clean | Submit full response |
165
-
166
- ## Testing
167
-
168
- Run the validation script to test everything:
169
- ```bash
170
- python test_hf_space.py
171
- ```
172
-
173
- This checks:
174
- - ✅ All dependencies installed correctly
175
- - ✅ LLM providers working
176
- - ✅ Tools functioning properly
177
- - ✅ Course answer extraction working
178
- - ✅ End-to-end agent creation and testing
179
-
180
- ## Research Sources
181
-
182
- My fixes are based on:
183
- - Course materials and instructions about exact match evaluation
184
- - [GAIA Official Paper](https://arxiv.org/abs/2311.12983) - Reasoning approach (used internally)
185
- - [LlamaIndex Claude Integration](https://docs.llamaindex.ai/en/stable/examples/llm/anthropic/) - Technical setup
186
- - Course forum discussions about evaluation format differences
187
 
188
  ---
189
 
190
- 🎯 **Goal**: Score 30%+ on course GAIA evaluation
191
- 🔧 **Status**: Fixed evaluation format misunderstanding - ready for much higher scores!
192
- 🤞 **Hope**: Clean answer extraction works and I pass the course!
 
1
  ---
2
+ title: Isadora Teles - GAIA Agent - Final HF Agents Project
3
  emoji: 🤖
4
+ colorFrom: orange
5
  colorTo: green
6
  sdk: gradio
7
  sdk_version: 5.25.2
 
11
  hf_oauth_expiration_minutes: 480
12
  ---
13
 
14
+ # My GAIA RAG Agent - Final Course Project 🎯
15
 
16
+ This is my submission for the AI Agents course final project. I've built a RAG agent to tackle the GAIA benchmark using everything we learned in the course!
17
 
18
+ ## 🎓 What I Learned & Applied
19
 
20
+ Throughout this course, I learned about:
21
+ - Building agents with LlamaIndex AgentWorkflow
22
+ - Creating and integrating tools (web search, calculator, file analysis)
23
+ - Implementing RAG systems with vector databases
24
+ - Proper prompting techniques for agent systems
25
+ - Working with multiple LLM providers
26
 
27
+ ## 🏗️ Architecture
 
28
 
29
+ My agent uses:
30
+ - **LlamaIndex AgentWorkflow**: For orchestrating the agent's reasoning
31
+ - **Multiple LLMs**: Supports Claude, Groq, Together AI, HuggingFace, and OpenAI
32
+ - **ChromaDB**: For the persona RAG database
33
+ - **GAIA System Prompt**: To ensure proper reasoning and answer formatting
34
 
35
+ ## 🔧 Tools Implemented
 
 
 
36
 
37
+ 1. **Web Search** (`web_search`): Uses DuckDuckGo to find current information
38
+ 2. **Calculator** (`calculator`): Handles math, percentages, and word problems
39
+ 3. **File Analyzer** (`file_analyzer`): Analyzes CSV and text files
40
+ 4. **Weather** (`weather`): Real weather data using OpenWeather API
41
+ 5. **Persona Database** (`persona_database`): RAG system for finding personas
42
 
43
+ ## 💡 Key Insights
44
 
45
+ The biggest challenge was understanding that the course evaluation uses **exact match** on clean answers. The GAIA prompt helps the agent reason well, but I needed to extract just the answer part (without "FINAL ANSWER:") for submission.
 
 
 
46
 
47
+ ## 🚀 Features
 
 
 
48
 
49
+ - Clean answer extraction aligned with GAIA scoring rules
50
+ - Handles numbers without commas/units as required
51
+ - Properly formats lists and yes/no answers
52
+ - RAG integration for persona queries
53
+ - Real weather data when API key is available
54
+ - Fallback mechanisms for robustness
55
 
56
+ ## 📋 Requirements
 
 
 
 
57
 
58
+ All dependencies are in `requirements.txt`. The key ones are:
59
+ - LlamaIndex (core framework)
60
+ - Gradio (web interface)
61
+ - ChromaDB (vector storage)
62
+ - DuckDuckGo Search (web tool)
63
 
64
+ ## 🔑 API Keys Needed
65
 
66
+ Add these to your HuggingFace Space secrets:
67
+ - `ANTHROPIC_API_KEY` or `CLAUDE_API_KEY` (recommended for best performance)
68
+ - `GROQ_API_KEY` (good free alternative)
69
+ - `TOGETHER_API_KEY` (another good option)
70
+ - `HF_TOKEN` (free fallback)
71
+ - `OPENAI_API_KEY` (if you have credits)
72
+ - `OPENWEATHER_API_KEY` (for real weather data)
73
 
74
+ ## 📊 Expected Performance
75
 
76
+ Based on my testing and understanding of GAIA:
77
+ - Math questions: Should score well with the calculator tool
78
+ - Factual questions: Web search helps find current information
79
+ - Data questions: File analyzer handles CSV analysis
80
+ - Simple logic: GAIA prompt guides proper reasoning
81
 
82
+ Target: 30%+ to pass the course!
83
 
84
+ ## 🛠️ How It Works
 
 
 
 
 
85
 
86
+ 1. **Question Processing**: Agent receives a GAIA question
87
+ 2. **Tool Selection**: Uses the right tools based on the question
88
+ 3. **Reasoning**: Follows GAIA prompt to think through the problem
89
+ 4. **Answer Extraction**: Extracts clean answer for exact match
90
+ 5. **Submission**: Sends properly formatted answer to evaluation
91
 
92
+ ## 📝 Course Learnings Applied
 
 
 
 
 
93
 
94
+ - **Agent Architecture**: Using AgentWorkflow as taught in the course
95
+ - **Tool Integration**: Each tool has a clear purpose and description
96
+ - **RAG System**: Persona database shows RAG implementation
97
+ - **Prompt Engineering**: GAIA prompt for structured reasoning
98
+ - **Error Handling**: Graceful fallbacks instead of crashes
99
 
100
+ ## 🎯 Goal
101
 
102
+ Pass the GAIA evaluation with 30%+ score by applying everything learned in the AI Agents course!
103
 
104
  ---
105
 
106
+ *This project demonstrates practical application of agent concepts, tool integration, RAG systems, and prompt engineering as taught in the course.*
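
Review note on the README change: the new "Key Insights" and "Features" sections describe extracting the text after "FINAL ANSWER:" and submitting numbers without commas or units. The extraction code itself is not part of this commit, so the following is only a minimal sketch of that idea. The function name `extract_clean_answer` and the normalization rules are assumptions for illustration, not the project's actual implementation.

```python
import re

# Hypothetical helper: matches a plain number with an optional leading "$"
# and optional thousands separators, e.g. "$1,000" or "3.5".
_NUMBER = re.compile(r"^\$?(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?$")


def extract_clean_answer(response: str) -> str:
    """Return the text after the last 'FINAL ANSWER:' marker, lightly normalized.

    If no marker is present, the whole response is treated as the answer.
    """
    matches = re.findall(r"FINAL ANSWER:\s*(.*)", response, flags=re.IGNORECASE)
    answer = (matches[-1] if matches else response).strip().strip("\"'").strip()

    # Only rewrite the answer when it is purely numeric, so lists such as
    # "apple, banana, cherry" keep their commas.
    number = _NUMBER.match(answer)
    if number:
        answer = number.group(1).replace(",", "") + (number.group(2) or "")
    return answer


if __name__ == "__main__":
    print(extract_clean_answer("I counted the albums.\nFINAL ANSWER: 2"))
    print(extract_clean_answer("FINAL ANSWER: $1,000"))
```

Restricting the comma and unit stripping to fully numeric answers is a deliberate choice in this sketch: an exact-match grader would also penalize a list answer whose separators were removed.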
 
 
app.py CHANGED
@@ -411,7 +411,8 @@ if __name__ == "__main__":
411
  ("Groq", os.getenv("GROQ_API_KEY")),
412
  ("Together", os.getenv("TOGETHER_API_KEY")),
413
  ("HuggingFace", os.getenv("HF_TOKEN")),
414
- ("OpenAI", os.getenv("OPENAI_API_KEY"))
 
415
  ]
416
 
417
  available = [name for name, key in api_keys if key]
 
411
  ("Groq", os.getenv("GROQ_API_KEY")),
412
  ("Together", os.getenv("TOGETHER_API_KEY")),
413
  ("HuggingFace", os.getenv("HF_TOKEN")),
414
+ ("OpenAI", os.getenv("OPENAI_API_KEY")),
415
+ ("OpenWeather", os.getenv("OPENWEATHER_API_KEY"))
416
  ]
417
 
418
  available = [name for name, key in api_keys if key]
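
Review note on the app.py hunk: `OPENWEATHER_API_KEY` now sits in the same list as the LLM provider keys. If `available` is later used to decide whether any LLM is configured (an assumption, since the rest of `app.py` is not shown in this diff), a Space with only the weather key set would pass that check. A possible sketch that keeps the two groups apart is shown below; `split_available`, `LLM_KEYS`, and `TOOL_KEYS` are hypothetical names, and the environment variable names are taken from the README's key list.

```python
import os

# Hypothetical grouping: (display name, environment variable).
LLM_KEYS = [
    ("Claude", "ANTHROPIC_API_KEY"),
    ("Claude", "CLAUDE_API_KEY"),
    ("Groq", "GROQ_API_KEY"),
    ("Together", "TOGETHER_API_KEY"),
    ("HuggingFace", "HF_TOKEN"),
    ("OpenAI", "OPENAI_API_KEY"),
]
TOOL_KEYS = [
    ("OpenWeather", "OPENWEATHER_API_KEY"),
]


def split_available(env=None):
    """Return (llm_names, tool_names) for keys that are set and non-empty."""
    env = os.environ if env is None else env
    llms = []
    for name, var in LLM_KEYS:
        # Avoid listing a provider twice when both of its variables are set.
        if env.get(var) and name not in llms:
            llms.append(name)
    tools = [name for name, var in TOOL_KEYS if env.get(var)]
    return llms, tools


if __name__ == "__main__":
    llms, tools = split_available()
    print("LLM providers:", llms or "none configured")
    print("Optional tools:", tools or "none configured")
```

With this split, a startup check can fail only when `llms` is empty while still reporting whether the weather tool has a real key.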
requirements.txt CHANGED
@@ -1,41 +1,31 @@
1
- # My FIXED GAIA Agent Requirements
2
- # These are all the packages I need for my final project with CRITICAL FIXES
3
 
4
- # Basic stuff for the web interface
 
5
  gradio>=4.0.0
6
- requests>=2.28.0
7
  pandas>=1.5.0
8
 
9
- # Main LlamaIndex stuff - this is the core framework we learned about
10
- llama-index-core>=0.10.0
11
-
12
- # Multiple LLM options - UPDATED with Claude support for GAIA
13
- llama-index-llms-anthropic # CLAUDE - NEW! Best for GAIA formatting
14
- llama-index-llms-openai # OpenAI (if I have credits)
15
- llama-index-llms-huggingface-api # HuggingFace (free option)
16
- llama-index-llms-groq # Groq (fast and often free)
17
- llama-index-llms-together # Together AI (good models)
18
-
19
- # For the RAG part with embeddings and vector search
20
- llama-index-retrievers-bm25
21
- llama-index-embeddings-huggingface
22
- llama-index-vector-stores-chroma
23
-
24
- # Vector database - using ChromaDB like in the course
25
- chromadb>=0.4.0
26
-
27
- # To load the persona dataset from HuggingFace
28
- datasets>=2.0.0
29
 
30
- # Web search tool
31
- duckduckgo-search>=6.0.0
 
 
 
32
 
33
- # CRITICAL: Pydantic for structured responses (GAIA format validation)
34
- pydantic>=2.0.0
35
 
36
- # Helper packages
37
- python-dotenv
38
- nest-asyncio
39
 
40
- # Additional packages for better GAIA performance
41
- typing-extensions # For better type hints in validation
 
 
1
+ # GAIA RAG Agent Requirements
2
+ # Updated for the final course project
3
 
4
+ # Core framework
5
+ llama-index-core>=0.10.0
6
  gradio>=4.0.0
7
+ requests>=2.28.0
8
  pandas>=1.5.0
9
 
10
+ # LLM integrations - multiple options for flexibility
11
+ llama-index-llms-anthropic # Claude 3.5 Sonnet (best for GAIA)
12
+ llama-index-llms-groq # Groq (fast, free tier)
13
+ llama-index-llms-together # Together AI
14
+ llama-index-llms-huggingface-api # HuggingFace (free option)
15
+ llama-index-llms-openai # OpenAI GPT models
16
 
17
+ # RAG components
18
+ llama-index-embeddings-huggingface # For persona embeddings
19
+ llama-index-vector-stores-chroma # Vector storage
20
+ llama-index-retrievers-bm25 # Additional retriever
21
+ chromadb>=0.4.0 # Vector database
22
 
23
+ # Tools
24
+ duckduckgo-search>=6.0.0 # Web search tool
25
 
26
+ # Data handling
27
+ datasets>=2.0.0 # For loading persona dataset from HuggingFace
 
28
 
29
+ # Utilities
30
+ python-dotenv # For local testing with .env files
31
+ nest-asyncio # For async operations