AheedTahir committed on
Commit 223e45d · 1 Parent(s): 81917a3

Final Working Implementation

Files changed (7)
  1. .env.example +24 -0
  2. .gitignore +49 -0
  3. README.md +89 -6
  4. agent.py +356 -0
  5. evaluation_app.py +217 -0
  6. requirements.txt +16 -1
  7. test_agent.py +84 -0
.env.example ADDED
@@ -0,0 +1,24 @@
+ # API Keys Configuration
+ # Copy this file to .env and fill in your actual API keys
+
+ # Groq API Key (for LLM)
+ # Get from: https://console.groq.com
+ GROQ_API_KEY=gsk_your_groq_api_key_here
+
+ # Tavily API Key (for web search)
+ # Get from: https://tavily.com
+ TAVILY_API_KEY=tvly-your_tavily_api_key_here
+
+ # Optional: Supabase (if using vector database)
+ SUPABASE_URL=your_supabase_url_here
+ SUPABASE_SERVICE_ROLE_KEY=your_supabase_key_here
+
+ # Optional: HuggingFace (if using HF models)
+ HUGGINGFACEHUB_API_TOKEN=hf_your_token_here
+
+ # Optional: LangSmith (for debugging/tracing)
+ LANGSMITH_API_KEY=lsv2_your_key_here
+ LANGSMITH_TRACING=true
+ LANGSMITH_PROJECT=ai_agent_course
+ LANGSMITH_ENDPOINT=https://api.smith.langchain.com
+
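The keys above are loaded at runtime with python-dotenv (listed in requirements.txt). As a rough illustration of what that loading amounts to, here is a stdlib-only sketch; the `load_env_file` helper is hypothetical and not part of this repo:

```python
import os

def load_env_file(path: str) -> dict:
    """Parse KEY=VALUE lines from a .env-style file, skipping comments and blanks."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Ignore blank lines and comment lines like the ones in .env.example
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```

In practice `load_dotenv()` from python-dotenv does this (and more) and exports the values into `os.environ`, which is what `test_agent.py` relies on.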
.gitignore ADDED
@@ -0,0 +1,49 @@
+ # Environment variables
+ .env
+ .env.local
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
+ venv/
+ ENV/
+ env/
+ .venv
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Logs
+ *.log
+
+ # Test outputs
+ test_results/
+
README.md CHANGED
@@ -1,15 +1,98 @@
  ---
- title: Template Final Assignment
- emoji: 🕵🏻‍♂️
+ title: GAIA Agent - Certification
+ emoji: 🤖
  colorFrom: indigo
- colorTo: indigo
+ colorTo: purple
  sdk: gradio
  sdk_version: 5.25.2
- app_file: app.py
+ app_file: evaluation_app.py
  pinned: false
  hf_oauth: true
- # optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
  hf_oauth_expiration_minutes: 480
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # GAIA Agent - Hugging Face Agents Course Certification
+
+ This is a LangGraph-based AI agent built to answer questions from the GAIA benchmark for the Hugging Face Agents Course Unit 4 certification.
+
+ ## Goal
+
+ Achieve **30%+ accuracy** on the GAIA benchmark to earn the certification.
+
+ ## Agent Architecture
+
+ The agent is built using:
+ - **LLM**: Groq's Llama 3.3 70B (fast and free)
+ - **Framework**: LangGraph for agent orchestration
+ - **Tools**: 5 essential tools for maximum coverage
+
+ ### Tools Implemented
+
+ 1. **Web Search** (Tavily) - Search the internet for current information
+ 2. **Wikipedia Search** - Access encyclopedic knowledge (Wikipedia API)
+ 3. **Calculator** - Perform mathematical calculations
+ 4. **Python Executor** - Execute Python code for complex computations
+ 5. **File Reader** - Read CSV, JSON, and text files
+
+ ## Answer Format Rules
+
+ The agent follows GAIA's strict formatting requirements:
+ - **Numbers**: No commas, no units (unless requested)
+ - **Text**: No articles (a, an, the), no abbreviations
+ - **Lists**: Comma-separated with one space after commas
+ - **Dates**: ISO format (YYYY-MM-DD) unless specified
+
+ ## Usage
+
+ ### Local Testing
+
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Set up environment variables in .env
+ GROQ_API_KEY=your_key_here
+ TAVILY_API_KEY=your_key_here
+
+ # Test the agent
+ python test_agent.py
+ ```
+
+ ### Running Evaluation
+
+ 1. Open the Space URL
+ 2. Log in with your Hugging Face account
+ 3. Click "Run Evaluation & Submit All Answers"
+ 4. Wait for results (takes ~1-2 hours due to rate limiting)
+
+ ## Project Structure
+
+ ```
+ .
+ ├── agent.py             # Main agent implementation
+ ├── evaluation_app.py    # Gradio app for evaluation
+ ├── test_agent.py        # Local testing script
+ ├── requirements.txt     # Python dependencies
+ ├── .env                 # API keys (not committed)
+ └── README.md            # This file
+ ```
+
+ ## Required API Keys
+
+ - **GROQ_API_KEY**: Get from [console.groq.com](https://console.groq.com)
+ - **TAVILY_API_KEY**: Get from [tavily.com](https://tavily.com)
+
+ ## Expected Performance
+
+ With the current tool set:
+ - **Web Search + Wikipedia + Calculator**: ~25-30%
+ - **+ File Processing**: ~35-40%
+ - **+ Python Execution**: ~40-45%
+
+ ## Course Information
+
+ This project is part of the [Hugging Face Agents Course](https://huggingface.co/learn/agents-course) Unit 4 certification.
+
+ ## License
+
+ MIT License - feel free to use and modify for your own certification!
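The answer-format rules the README describes are enforced only through the system prompt in agent.py. A post-processing sketch of two of those rules (no thousands separators in numbers, comma-plus-one-space lists); the helper names are hypothetical and not present in the repo:

```python
def normalize_number(text: str) -> str:
    """GAIA number rule: drop thousands separators and surrounding whitespace."""
    return text.strip().replace(",", "")

def normalize_list(items) -> str:
    """GAIA list rule: comma-separated with exactly one space after each comma."""
    return ", ".join(str(item).strip() for item in items)
```

Applying such helpers to the model's final message would guard against formatting slips that the prompt alone cannot guarantee, e.g. `normalize_number("1,000")` giving `"1000"`.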
agent.py ADDED
@@ -0,0 +1,356 @@
+ """
+ GAIA Agent with Essential Tools for 30%+ Accuracy
+ Built with LangGraph and Groq LLM
+ """
+ import os
+ import re
+ import json
+ from typing import Annotated
+ from langchain_core.tools import tool
+ from langchain_core.messages import SystemMessage
+ from langchain_community.tools.tavily_search import TavilySearchResults
+ from langchain_community.document_loaders import WikipediaLoader
+ from langchain_groq import ChatGroq
+ from langgraph.graph import StateGraph, MessagesState, START, END
+ from langgraph.prebuilt import ToolNode, tools_condition
+ from langgraph.checkpoint.memory import MemorySaver
+
+ # Initialize LLM
+ def get_llm():
+     """Get Groq LLM instance"""
+     return ChatGroq(
+         model="llama-3.3-70b-versatile",
+         temperature=0,
+         max_tokens=8000,
+         timeout=60,
+         max_retries=2
+     )
+
+ # ============================================================================
+ # TOOL DEFINITIONS
+ # ============================================================================
+
+ @tool
+ def web_search(query: str) -> str:
+     """
+     Search the web for current information using Tavily.
+     Use this for finding recent information, facts, statistics, or any data not in your training.
+
+     Args:
+         query: The search query string
+
+     Returns:
+         Search results as formatted text
+     """
+     try:
+         tavily = TavilySearchResults(
+             max_results=5,
+             search_depth="advanced",
+             include_answer=True,
+             include_raw_content=False
+         )
+         results = tavily.invoke(query)
+
+         if not results:
+             return "No results found."
+
+         # Format results nicely
+         formatted = []
+         for i, result in enumerate(results, 1):
+             title = result.get('title', 'No title')
+             content = result.get('content', 'No content')
+             url = result.get('url', '')
+             formatted.append(f"Result {i}:\nTitle: {title}\nContent: {content}\nURL: {url}\n")
+
+         return "\n".join(formatted)
+     except Exception as e:
+         return f"Error searching web: {str(e)}"
+
+
+ @tool
+ def wikipedia_search(query: str) -> str:
+     """
+     Search Wikipedia for encyclopedic information.
+     Use this for historical facts, biographies, scientific concepts, etc.
+
+     Args:
+         query: The Wikipedia search query
+
+     Returns:
+         Wikipedia article content
+     """
+     try:
+         loader = WikipediaLoader(query=query, load_max_docs=2, doc_content_chars_max=4000)
+         docs = loader.load()
+
+         if not docs:
+             return f"No Wikipedia article found for '{query}'"
+
+         # Combine the documents
+         content = "\n\n---\n\n".join([doc.page_content for doc in docs])
+         return f"Wikipedia results for '{query}':\n\n{content}"
+     except Exception as e:
+         return f"Error searching Wikipedia: {str(e)}"
+
+
+ @tool
+ def calculate(expression: str) -> str:
+     """
+     Evaluate a mathematical expression safely.
+     Supports basic arithmetic: +, -, *, /, //, %, **, parentheses.
+     Also supports common math functions: abs, round, min, max, sum, sqrt, log, and trigonometry.
+
+     Args:
+         expression: Mathematical expression as a string (e.g., "2 + 2", "sqrt(16)", "10 ** 2")
+
+     Returns:
+         The calculated result
+     """
+     try:
+         # Import math for advanced functions
+         import math
+
+         # Create a safe namespace with math functions
+         safe_dict = {
+             'abs': abs, 'round': round, 'min': min, 'max': max, 'sum': sum,
+             'sqrt': math.sqrt, 'pow': pow, 'log': math.log, 'log10': math.log10,
+             'sin': math.sin, 'cos': math.cos, 'tan': math.tan,
+             'pi': math.pi, 'e': math.e, 'ceil': math.ceil, 'floor': math.floor
+         }
+
+         # Clean the expression
+         expression = expression.strip()
+
+         # Evaluate safely (builtins stripped, only the whitelist above is visible)
+         result = eval(expression, {"__builtins__": {}}, safe_dict)
+         return str(result)
+     except Exception as e:
+         return f"Error calculating '{expression}': {str(e)}"
+
+
+ @tool
+ def python_executor(code: str) -> str:
+     """
+     Execute Python code safely for data processing and calculations.
+     Use this for complex calculations, data manipulation, or multi-step computations.
+     The code should print its output.
+
+     Args:
+         code: Python code to execute
+
+     Returns:
+         The output of the code execution
+     """
+     import io
+     import sys
+     import math
+     import json
+     from datetime import datetime, timedelta
+
+     # Capture stdout
+     old_stdout = sys.stdout
+     sys.stdout = buffer = io.StringIO()
+
+     try:
+         # Create safe execution environment
+         safe_globals = {
+             '__builtins__': {
+                 'print': print, 'len': len, 'range': range, 'str': str,
+                 'int': int, 'float': float, 'list': list, 'dict': dict,
+                 'set': set, 'tuple': tuple, 'sorted': sorted, 'sum': sum,
+                 'min': min, 'max': max, 'abs': abs, 'round': round,
+                 'enumerate': enumerate, 'zip': zip, 'map': map, 'filter': filter,
+             },
+             'math': math,
+             'json': json,
+             'datetime': datetime,
+             'timedelta': timedelta,
+         }
+
+         # Execute code
+         exec(code, safe_globals)
+
+         # Get output
+         output = buffer.getvalue()
+         return output if output else "Code executed successfully (no output)"
+     except Exception as e:
+         return f"Error executing code: {str(e)}"
+     finally:
+         # Always restore stdout, even when exec raises
+         sys.stdout = old_stdout
+
+
+ @tool
+ def read_file(filepath: str) -> str:
+     """
+     Read and return the contents of a file.
+     Supports text files, CSV, JSON, and basic file formats.
+
+     Args:
+         filepath: Path to the file to read
+
+     Returns:
+         File contents as string
+     """
+     try:
+         # Check if file exists
+         if not os.path.exists(filepath):
+             return f"File not found: {filepath}"
+
+         # Read based on file extension
+         if filepath.endswith('.json'):
+             with open(filepath, 'r', encoding='utf-8') as f:
+                 data = json.load(f)
+             return json.dumps(data, indent=2)
+
+         elif filepath.endswith('.csv'):
+             try:
+                 import pandas as pd
+                 df = pd.read_csv(filepath)
+                 return f"CSV file with {len(df)} rows and {len(df.columns)} columns:\n\n{df.to_string()}"
+             except ImportError:
+                 # Fallback if pandas is not available
+                 with open(filepath, 'r', encoding='utf-8') as f:
+                     return f.read()
+
+         else:
+             # Read as text
+             with open(filepath, 'r', encoding='utf-8') as f:
+                 content = f.read()
+             return content
+     except Exception as e:
+         return f"Error reading file '{filepath}': {str(e)}"
+
+
+ # ============================================================================
+ # SYSTEM PROMPT - GAIA Specific Instructions
+ # ============================================================================
+
+ GAIA_SYSTEM_PROMPT = """You are a helpful AI assistant designed to answer questions from the GAIA benchmark.
+
+ CRITICAL ANSWER FORMAT RULES:
+ 1. For numbers: NO commas, NO units (unless explicitly requested)
+    - CORRECT: "1000" or "1000 meters" (if units requested)
+    - WRONG: "1,000" or "1000 meters" (if units not requested)
+
+ 2. For text answers: No articles (a, an, the), no abbreviations
+    - CORRECT: "United States"
+    - WRONG: "The United States" or "USA"
+
+ 3. For lists: Comma-separated with one space after each comma
+    - CORRECT: "apple, banana, orange"
+    - WRONG: "apple,banana,orange" or "apple, banana, orange."
+
+ 4. For dates: Use the format specified in the question
+    - If not specified, use ISO format: YYYY-MM-DD
+
+ 5. Be precise and concise - answer ONLY what is asked
+
+ APPROACH:
+ 1. Read the question carefully and identify what information is needed
+ 2. Use tools to gather information (web search, Wikipedia, calculations)
+ 3. For multi-step questions, break down the problem and solve step by step
+ 4. Verify your answer matches the format requirements above
+ 5. Return ONLY the final answer in the correct format
+
+ AVAILABLE TOOLS:
+ - web_search: Search the internet for current information
+ - wikipedia_search: Search Wikipedia for encyclopedic knowledge
+ - calculate: Perform mathematical calculations
+ - python_executor: Execute Python code for complex computations
+ - read_file: Read files (CSV, JSON, text)
+
+ Remember: Your final response should be ONLY the answer in the correct format, nothing else.
+ """
+
+ # ============================================================================
+ # AGENT GRAPH CONSTRUCTION
+ # ============================================================================
+
+ def build_graph():
+     """Build the LangGraph agent with tools"""
+
+     # Initialize LLM
+     llm = get_llm()
+
+     # Define tools
+     tools = [
+         web_search,
+         wikipedia_search,
+         calculate,
+         python_executor,
+         read_file
+     ]
+
+     # Bind tools to LLM
+     llm_with_tools = llm.bind_tools(tools)
+
+     # Define the assistant node
+     def assistant(state: MessagesState):
+         """Assistant node that calls the LLM"""
+         messages = state["messages"]
+
+         # Add system message if not present
+         if not any(isinstance(msg, SystemMessage) for msg in messages):
+             messages = [SystemMessage(content=GAIA_SYSTEM_PROMPT)] + messages
+
+         response = llm_with_tools.invoke(messages)
+         return {"messages": [response]}
+
+     # Build the graph
+     builder = StateGraph(MessagesState)
+
+     # Add nodes
+     builder.add_node("assistant", assistant)
+     builder.add_node("tools", ToolNode(tools))
+
+     # Add edges
+     builder.add_edge(START, "assistant")
+     builder.add_conditional_edges(
+         "assistant",
+         tools_condition,
+     )
+     builder.add_edge("tools", "assistant")
+
+     # Compile with memory
+     memory = MemorySaver()
+     graph = builder.compile(checkpointer=memory)
+
+     return graph
+
+
+ # ============================================================================
+ # TESTING
+ # ============================================================================
+
+ if __name__ == "__main__":
+     # Test the agent with sample questions
+     from langchain_core.messages import HumanMessage
+
+     # Build agent
+     print("Building agent...")
+     agent = build_graph()
+
+     # Test questions
+     test_questions = [
+         "What is 25 * 4 + 100?",
+         "Who was the first president of the United States?",
+         "Search for the population of Tokyo in 2024"
+     ]
+
+     for i, question in enumerate(test_questions, 1):
+         print(f"\n{'='*60}")
+         print(f"Test {i}: {question}")
+         print('='*60)
+
+         try:
+             config = {"configurable": {"thread_id": f"test_{i}"}}
+             result = agent.invoke(
+                 {"messages": [HumanMessage(content=question)]},
+                 config=config
+             )
+             answer = result['messages'][-1].content
+             print(f"Answer: {answer}")
+         except Exception as e:
+             print(f"Error: {e}")
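`python_executor` in agent.py captures output by swapping `sys.stdout` by hand and restoring it afterwards. An equivalent sketch using `contextlib.redirect_stdout`, which restores stdout automatically even when the executed code raises; the `run_and_capture` helper is illustrative, not part of the repo, and uses a much smaller builtins whitelist than the real tool:

```python
import contextlib
import io

def run_and_capture(code: str) -> str:
    """Execute code and return whatever it printed; stdout is restored automatically."""
    buffer = io.StringIO()
    try:
        # redirect_stdout undoes the redirection on exit, error or not
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__builtins__": {"print": print, "range": range, "sum": sum}})
    except Exception as e:
        return f"Error executing code: {e}"
    return buffer.getvalue() or "Code executed successfully (no output)"
```

The context-manager form avoids the subtle failure mode of manual swapping, where an exception raised before the restore line leaves stdout pointing at the buffer.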
evaluation_app.py ADDED
@@ -0,0 +1,217 @@
+ """Basic Agent Evaluation Runner"""
+ import os
+ import inspect
+ import gradio as gr
+ import requests
+ import pandas as pd
+ import time
+ from langchain_core.messages import HumanMessage
+ from agent import build_graph
+
+
+ # --- Constants ---
+ DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
+
+ # --- Basic Agent Definition ---
+ # ----- THIS IS WHERE YOU CAN BUILD WHAT YOU WANT ------
+
+
+ class BasicAgent:
+     """A LangGraph agent."""
+     def __init__(self):
+         print("BasicAgent initialized.")
+         self.graph = build_graph()
+
+     def __call__(self, question: str) -> str:
+         print(f"Agent received question (first 50 chars): {question[:50]}...")
+         # Wrap the question in a HumanMessage from langchain_core
+         messages = [HumanMessage(content=question)]
+         config = {"configurable": {"thread_id": "evaluation"}}
+         result = self.graph.invoke({"messages": messages}, config=config)
+         answer = result['messages'][-1].content
+
+         # Extract the final answer if it has a "Final Answer:" prefix
+         if "Final Answer:" in answer:
+             answer = answer.split("Final Answer:")[-1].strip()
+
+         return answer
+
+
+ def run_and_submit_all(profile: gr.OAuthProfile | None):
+     """
+     Fetches all questions, runs the BasicAgent on them, submits all answers,
+     and displays the results.
+     """
+     # --- Determine HF Space Runtime URL and Repo URL ---
+     space_id = os.getenv("SPACE_ID")  # SPACE_ID is used to send a link to the code
+
+     if profile:
+         username = f"{profile.username}"
+         print(f"User logged in: {username}")
+     else:
+         print("User not logged in.")
+         return "Please log in to Hugging Face with the button.", None
+
+     api_url = DEFAULT_API_URL
+     questions_url = f"{api_url}/questions"
+     submit_url = f"{api_url}/submit"
+
+     # 1. Instantiate Agent (modify this part to create your agent)
+     try:
+         agent = BasicAgent()
+     except Exception as e:
+         print(f"Error instantiating agent: {e}")
+         return f"Error initializing agent: {e}", None
+     # For an app running as a Hugging Face Space, this link points to your codebase (useful for others, so please keep it public)
+     agent_code = f"https://huggingface.co/spaces/{space_id}/tree/main"
+     print(agent_code)
+
+     # 2. Fetch Questions
+     print(f"Fetching questions from: {questions_url}")
+     try:
+         response = requests.get(questions_url, timeout=15)
+         response.raise_for_status()
+         questions_data = response.json()
+         if not questions_data:
+             print("Fetched questions list is empty.")
+             return "Fetched questions list is empty or invalid format.", None
+         print(f"Fetched {len(questions_data)} questions.")
+     except requests.exceptions.RequestException as e:
+         print(f"Error fetching questions: {e}")
+         return f"Error fetching questions: {e}", None
+     except requests.exceptions.JSONDecodeError as e:
+         print(f"Error decoding JSON response from questions endpoint: {e}")
+         print(f"Response text: {response.text[:500]}")
+         return f"Error decoding server response for questions: {e}", None
+     except Exception as e:
+         print(f"An unexpected error occurred fetching questions: {e}")
+         return f"An unexpected error occurred fetching questions: {e}", None
+
+     # 3. Run your Agent
+     results_log = []
+     answers_payload = []
+     print(f"Running agent on {len(questions_data)} questions...")
+     for item in questions_data:
+         task_id = item.get("task_id")
+         question_text = item.get("question")
+         if not task_id or question_text is None:
+             print(f"Skipping item with missing task_id or question: {item}")
+             continue
+
+         # Pause between questions to stay within the LLM provider's rate limits
+         time.sleep(30)
+
+         try:
+             submitted_answer = agent(question_text)
+             answers_payload.append({"task_id": task_id, "submitted_answer": submitted_answer})
+             results_log.append({"Task ID": task_id, "Question": question_text, "Submitted Answer": submitted_answer})
+         except Exception as e:
+             print(f"Error running agent on task {task_id}: {e}")
+             results_log.append({"Task ID": task_id, "Question": question_text, "Submitted Answer": f"AGENT ERROR: {e}"})
+
+     if not answers_payload:
+         print("Agent did not produce any answers to submit.")
+         return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)
+
+     # 4. Prepare Submission
+     submission_data = {"username": username.strip(), "agent_code": agent_code, "answers": answers_payload}
+     status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
+     print(status_update)
+
+     # 5. Submit
+     print(f"Submitting {len(answers_payload)} answers to: {submit_url}")
+     try:
+         response = requests.post(submit_url, json=submission_data, timeout=60)
+         response.raise_for_status()
+         result_data = response.json()
+         final_status = (
+             f"Submission Successful!\n"
+             f"User: {result_data.get('username')}\n"
+             f"Overall Score: {result_data.get('score', 'N/A')}% "
+             f"({result_data.get('correct_count', '?')}/{result_data.get('total_attempted', '?')} correct)\n"
+             f"Message: {result_data.get('message', 'No message received.')}"
+         )
+         print("Submission successful.")
+         results_df = pd.DataFrame(results_log)
+         return final_status, results_df
+     except requests.exceptions.HTTPError as e:
+         error_detail = f"Server responded with status {e.response.status_code}."
+         try:
+             error_json = e.response.json()
+             error_detail += f" Detail: {error_json.get('detail', e.response.text)}"
+         except requests.exceptions.JSONDecodeError:
+             error_detail += f" Response: {e.response.text[:500]}"
+         status_message = f"Submission Failed: {error_detail}"
+         print(status_message)
+         results_df = pd.DataFrame(results_log)
+         return status_message, results_df
+     except requests.exceptions.Timeout:
+         status_message = "Submission Failed: The request timed out."
+         print(status_message)
+         results_df = pd.DataFrame(results_log)
+         return status_message, results_df
+     except requests.exceptions.RequestException as e:
+         status_message = f"Submission Failed: Network error - {e}"
+         print(status_message)
+         results_df = pd.DataFrame(results_log)
+         return status_message, results_df
+     except Exception as e:
+         status_message = f"An unexpected error occurred during submission: {e}"
+         print(status_message)
+         results_df = pd.DataFrame(results_log)
+         return status_message, results_df
+
+
+ # --- Build Gradio Interface using Blocks ---
+ with gr.Blocks() as demo:
+     gr.Markdown("# Basic Agent Evaluation Runner")
+     gr.Markdown(
+         """
+         **Instructions:**
+         1. Please clone this Space, then modify the code to define your agent's logic, the tools, the necessary packages, etc.
+         2. Log in to your Hugging Face account using the button below. This uses your HF username for submission.
+         3. Click 'Run Evaluation & Submit All Answers' to fetch questions, run your agent, submit answers, and see the score.
+         ---
+         **Disclaimers:**
+         Once you click the submit button, it can take quite some time (this is how long the agent takes to work through all the questions).
+         This Space provides a basic setup and is intentionally sub-optimal to encourage you to develop your own, more robust solution. For instance, to avoid the long-running submit button, you could cache the answers and submit them in a separate action, or answer the questions asynchronously.
+         """
+     )
+
+     gr.LoginButton()
+
+     run_button = gr.Button("Run Evaluation & Submit All Answers")
+
+     status_output = gr.Textbox(label="Run Status / Submission Result", lines=5, interactive=False)
+     # Removed max_rows=10 from DataFrame constructor
+     results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
+
+     run_button.click(
+         fn=run_and_submit_all,
+         outputs=[status_output, results_table]
+     )
+
+ if __name__ == "__main__":
+     print("\n" + "-"*30 + " App Starting " + "-"*30)
+     # Check for SPACE_HOST and SPACE_ID at startup for information
+     space_host_startup = os.getenv("SPACE_HOST")
+     space_id_startup = os.getenv("SPACE_ID")  # Get SPACE_ID at startup
+
+     if space_host_startup:
+         print(f"✅ SPACE_HOST found: {space_host_startup}")
+         print(f"   Runtime URL should be: https://{space_host_startup}.hf.space")
+     else:
+         print("ℹ️ SPACE_HOST environment variable not found (running locally?).")
+
+     if space_id_startup:  # Print repo URLs if SPACE_ID is found
+         print(f"✅ SPACE_ID found: {space_id_startup}")
+         print(f"   Repo URL: https://huggingface.co/spaces/{space_id_startup}")
+         print(f"   Repo Tree URL: https://huggingface.co/spaces/{space_id_startup}/tree/main")
+     else:
+         print("ℹ️ SPACE_ID environment variable not found (running locally?). Repo URL cannot be determined.")
+
+     print("-"*(60 + len(" App Starting ")) + "\n")
+
+     print("Launching Gradio Interface for Basic Agent Evaluation...")
+     demo.launch(debug=True, share=False)
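The evaluation loop above paces itself with a fixed `time.sleep(30)` before every question, which is the simplest way to respect the LLM provider's rate limits. A sketch of retry-with-exponential-backoff, a common alternative that only waits when a call actually fails; the `with_backoff` helper is hypothetical and not part of the app:

```python
import time

def with_backoff(fn, retries: int = 3, base_delay: float = 2.0):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                # Out of retries: surface the last error to the caller
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping `agent(question_text)` in such a helper would recover from intermittent rate-limit errors without stalling for 30 seconds on every question, at the cost of slightly more complex error handling.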
requirements.txt CHANGED
@@ -1,2 +1,17 @@
+ # Core dependencies
  gradio
- requests
+ requests
+ pandas
+
+ # LangChain and LangGraph
+ langchain-core
+ langchain-community
+ langchain-groq
+ langgraph
+
+ # Tools and APIs
+ tavily-python
+ wikipedia
+
+ # Utilities
+ python-dotenv
test_agent.py ADDED
@@ -0,0 +1,84 @@
+ """
+ Simple test script for the GAIA agent
+ """
+ import os
+ from dotenv import load_dotenv
+ from langchain_core.messages import HumanMessage
+ from agent import build_graph
+
+ # Load environment variables
+ load_dotenv()
+
+ # Verify API keys are set
+ print("Checking API keys...")
+ groq_key = os.getenv("GROQ_API_KEY")
+ tavily_key = os.getenv("TAVILY_API_KEY")
+
+ if not groq_key:
+     print("❌ GROQ_API_KEY not found in environment")
+ else:
+     print(f"✅ GROQ_API_KEY found: {groq_key[:10]}...")
+
+ if not tavily_key:
+     print("❌ TAVILY_API_KEY not found in environment")
+ else:
+     print(f"✅ TAVILY_API_KEY found: {tavily_key[:10]}...")
+
+ print("\n" + "="*60)
+ print("Building agent...")
+ print("="*60)
+
+ try:
+     agent = build_graph()
+     print("✅ Agent built successfully!")
+ except Exception as e:
+     print(f"❌ Error building agent: {e}")
+     exit(1)
+
+ # Test questions (simple ones to verify functionality)
+ test_questions = [
+     {
+         "question": "What is 25 * 4?",
+         "expected_type": "number",
+         "description": "Simple calculation test"
+     },
+     {
+         "question": "Who was the first president of the United States? Answer with just the name.",
+         "expected_type": "text",
+         "description": "Simple knowledge test"
+     }
+ ]
+
+ print("\n" + "="*60)
+ print("Running tests...")
+ print("="*60)
+
+ for i, test in enumerate(test_questions, 1):
+     print(f"\n{'='*60}")
+     print(f"Test {i}: {test['description']}")
+     print(f"Question: {test['question']}")
+     print('='*60)
+
+     try:
+         config = {"configurable": {"thread_id": f"test_{i}"}}
+         result = agent.invoke(
+             {"messages": [HumanMessage(content=test['question'])]},
+             config=config
+         )
+         answer = result['messages'][-1].content
+
+         # Extract the final answer if it has a "Final Answer:" prefix
+         if "Final Answer:" in answer:
+             answer = answer.split("Final Answer:")[-1].strip()
+
+         print(f"✅ Answer: {answer}")
+
+     except Exception as e:
+         print(f"❌ Error: {e}")
+         import traceback
+         traceback.print_exc()
+
+ print("\n" + "="*60)
+ print("Tests completed!")
+ print("="*60)