carolinacon committed commit a4b0424 · 1 Parent(s): fc1b83d

updated chess tool and prompting

README.md CHANGED
@@ -16,7 +16,7 @@ hf_oauth_expiration_minutes: 480
 
 ## Background
 Created as a final project for the HuggingFace Agents course (https://huggingface.co/learn/agents-course).
-Aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 75%.
+Aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 90%.
 ### What is GAIA
 
 GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:
@@ -81,7 +81,7 @@ the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` mod
 **Chess Board Picture Analysis - Challenges and Limitations** 🆘
 I tried both `gpt-4.1` and `gemini-2.5-flash` models for chess piece coordinate extraction, but I obtained inconsistent results (there
 are times when they get it right, but also instances when they don't).
-At least for OpenAI there is a limitation listed on their website (see [here](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):
+At least for OpenAI there is a limitation listed on their website (see [https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):
 >Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
 
 The tool queries both models and arbitrates their results, querying them again only on pieces with conflicting positions.
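The arbitration step the README describes could look roughly like this sketch (function and variable names are hypothetical, not the repo's actual API): compare the piece maps returned by the two vision models and keep only the squares on which they disagree, so the follow-up query targets just those squares.

```python
# Hypothetical sketch of dual-model "arbitrage": find squares where the two
# analyses disagree, so only those squares need to be re-queried.

def find_conflicts(map_a: dict, map_b: dict) -> list:
    """Return the squares where the two piece maps disagree."""
    squares = set(map_a) | set(map_b)
    return sorted(sq for sq in squares if map_a.get(sq) != map_b.get(sq))

# Illustrative outputs from two models (square -> piece letter):
piece_map_a = {"e1": "K", "e8": "k", "d4": "Q"}
piece_map_b = {"e1": "K", "e8": "k", "d4": "N", "a1": "R"}
print(find_conflicts(piece_map_a, piece_map_b))  # ['a1', 'd4']
```

Agreeing squares are accepted as consensus; only the conflicting ones go back to the models.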
config/prompts.yaml CHANGED
@@ -2,7 +2,7 @@ prompts:
   base_system:
     content: |
       You are a general AI assistant tasked with answering complex questions.
-      Make sure you think step by step in order to answer the given question.
+      Break down the problem into smaller, manageable sub-problems. Then, solve each sub-problem step by step to reach the final answer.
 
       Here is the question you received:
       <question>
@@ -15,13 +15,15 @@ prompts:
       <summary>
       {{summary}}
       </summary>
+
+      Important: note that the last two messages were not yet included in this summary.
 
       For mathematical questions or problems delegate them to the math_tool.
       For chess related questions use the chess_analysis_tool.
 
       Include citations for all the information you retrieve, ensuring you know exactly where the data comes from.
-      If you have the information inside your knowledge, still call a tool in order to confirm it.
-
+      If you have the information inside your knowledge, still call a tool in order to confirm it.
+
       **Guidelines for Answering Questions:**
 
       * **Citations:** Always support findings with source URLs, clearly provided as in-text citations.
@@ -33,8 +35,6 @@ prompts:
       * **Observation:** Analyze obtained results.
      * Repeat Thought/Action/Observation cycles as needed.
       * **Final Answer:** Synthesize and present findings with citations in markdown format.
-
-      Break down a problem into sub-problems and solve it step by step.
 
       If the value of chunked_last_tool_call is true, this means that the last tool execution returns a result formed from the concatenation
       of multiple chunks.
@@ -50,9 +50,9 @@ prompts:
      Process the answer and extract YOUR FINAL ANSWER to be provided to the user. Make sure it respects the following guidelines.
      YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
      If you are asked for a number, don't use commas in your number nor units such as $ or a percent sign unless specified otherwise.
-     If you are asked for a string, don't use articles or abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise.
+     If you are asked for a string, don't use articles and don't use abbreviations (e.g. for city names), and write the digits in plain text unless specified otherwise.
      If you are asked for a comma separated list, apply the above rules depending on whether the element to be put in the list is a number or a string.
-     The rule for a comma-separated list is to always add a space after the comma, but never before it.
+     Very Important: The rule for a comma-separated list is to always add a space after the comma, but never before it.
    type: answer_refinement
    variables: []
    version: 1.0
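The comma-spacing rule the refined prompt emphasizes is easy to enforce mechanically; a hypothetical post-processing helper (not part of the repo) could normalize a list answer like so:

```python
# Illustrative normalizer for the "space after the comma, never before it"
# rule in the answer_refinement prompt.
def format_final_list(items) -> str:
    """Join items with ', ' after stripping stray whitespace."""
    return ", ".join(str(i).strip() for i in items)

print(format_final_list(["Paris", " Lyon", "Nice "]))  # Paris, Lyon, Nice
```

Doing this in code rather than relying on the LLM removes one source of benchmark-scoring mismatches.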
core/agent.py CHANGED
@@ -1,16 +1,15 @@
 from typing import Optional
 
 from langchain_core.messages import HumanMessage
+from langgraph.graph import START, StateGraph, END
 from langgraph.graph.state import CompiledStateGraph
+from langgraph.prebuilt import ToolNode
+from langgraph.prebuilt import tools_condition
 
 from core.messages import Attachment
 from core.state import State
 from nodes.nodes import assistant, optimize_memory, response_processing, pre_processor
-from tools.tavily_tools import llm_tools
-
-from langgraph.graph import START, StateGraph, END
-from langgraph.prebuilt import tools_condition
-from langgraph.prebuilt import ToolNode
+from tools.tavily_tools import web_search_tools
 
 
 class GaiaAgent:
@@ -23,7 +22,7 @@ class GaiaAgent:
         # Define nodes: these do the work
         builder.add_node("pre_processor", pre_processor)
         builder.add_node("assistant", assistant)
-        builder.add_node("tools", ToolNode(llm_tools))
+        builder.add_node("tools", ToolNode(web_search_tools))
         builder.add_node("optimize_memory", optimize_memory)
         builder.add_node("response_processing", response_processing)
@@ -49,7 +48,7 @@ class GaiaAgent:
         if attachment:
             initial_state["file_reference"] = attachment.file_path
 
-        messages = self.react_graph.invoke(initial_state)
+        messages = self.react_graph.invoke(initial_state, {"recursion_limit": 30})
         # for m in messages['messages']:
         #     m.pretty_print()
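For context on the added invoke config: LangGraph's `recursion_limit` caps the number of graph super-steps before the run is aborted (LangGraph raises a `GraphRecursionError`), which stops a tool-calling loop from running forever. A toy stand-in loop (not LangGraph's implementation) shows the idea:

```python
# Conceptual sketch of a step-capped agent loop. step_fn is a hypothetical
# function returning (new_state, done); the cap mirrors recursion_limit=30.
def run_react_loop(step_fn, state, recursion_limit=30):
    for _ in range(recursion_limit):
        state, done = step_fn(state)
        if done:
            return state
    raise RuntimeError("recursion limit reached")

# A toy step function that finishes after three iterations:
result = run_react_loop(lambda s: (s + 1, s + 1 >= 3), 0)
print(result)  # 3
```

A limit of 30 leaves room for several Thought/Action/Observation cycles while bounding cost when the model loops on a tool.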
core/state.py CHANGED
@@ -1,8 +1,7 @@
-from langgraph.graph import MessagesState
 import operator
-from typing_extensions import TypedDict, Annotated, List, Sequence
-from langchain_core.messages import BaseMessage
-from langgraph.graph.message import add_messages
+
+from langgraph.graph import MessagesState
+from typing_extensions import Annotated, List
 
 
 class State(MessagesState):
nodes/chunking_node.py CHANGED
@@ -58,15 +58,20 @@ class OversizedContentHandler:
     def process_oversized_message(self, message: BaseMessage, query: str) -> bool:
         chunked = False
         # At this point we are chunking only tavily_extract results messages
+        json_content = None
         if isinstance(message, ToolMessage) and message.name == "tavily_extract":
-            json_content = json.loads(message.content)
-            result = json_content['results'][0]
-            raw_content = result['raw_content']
+            try:
+                json_content = json.loads(message.content)
+            except Exception as e:
+                print("cannot parse message")
+        if json_content:
+            result = json_content['results'][0]
+            raw_content = result['raw_content']
 
-            content_size = self.count_tokens(raw_content)
-            if content_size > config.MAX_CONTEXT_TOKENS:
-                print(f"Proceed with chunking, evaluated no of tokens {content_size} for message {message.id}")
-                chunked = True
-                result['raw_content'] = self.extract_relevant_chunks(raw_content, query=query)
-                message.content = json.dumps(json_content)
+            content_size = self.count_tokens(raw_content)
+            if content_size > config.MAX_CONTEXT_TOKENS:
+                print(f"Proceed with chunking, evaluated no of tokens {content_size} for message {message.id}")
+                chunked = True
+                result['raw_content'] = self.extract_relevant_chunks(raw_content, query=query)
+                message.content = json.dumps(json_content)
         return chunked
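The diff calls `extract_relevant_chunks`, which is not shown here. One plausible sketch (illustrative scoring only, not the repo's implementation) splits the oversized content into fixed-size chunks and keeps the ones sharing the most terms with the query:

```python
# Hypothetical chunk selection: split raw_content, score each chunk by naive
# term overlap with the query, and keep the top_k best-scoring chunks.
def extract_relevant_chunks(raw_content: str, query: str,
                            chunk_size: int = 200, top_k: int = 2) -> str:
    chunks = [raw_content[i:i + chunk_size]
              for i in range(0, len(raw_content), chunk_size)]
    terms = set(query.lower().split())
    # Stable sort: ties keep original document order.
    scored = sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))
    return "\n".join(scored[:top_k])
```

A production version would more likely use embeddings or token-aware splitting, but the shape — chunk, score against the query, keep the best — is the same.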
nodes/nodes.py CHANGED
@@ -13,17 +13,17 @@ from tools.chess_tool import chess_analysis_tool
 from tools.excel_tool import query_excel_file
 from tools.math_agent import math_tool
 from tools.python_executor import execute_python_code
-from tools.tavily_tools import llm_tools
+from tools.tavily_tools import web_search_tools
 from utils.prompt_manager import prompt_mgmt
 
 model = ChatOpenAI(model="gpt-4.1")
 response_processing_model = ChatOpenAI(model="gpt-4.1-mini")
-llm_tools.append(query_audio)
-llm_tools.append(query_excel_file)
-llm_tools.append(execute_python_code)
-llm_tools.append(math_tool)
-llm_tools.append(chess_analysis_tool)
-model = model.bind_tools(llm_tools, parallel_tool_calls=False)
+web_search_tools.append(query_audio)
+web_search_tools.append(query_excel_file)
+web_search_tools.append(execute_python_code)
+web_search_tools.append(math_tool)
+web_search_tools.append(chess_analysis_tool)
+model = model.bind_tools(web_search_tools, parallel_tool_calls=False)
 
 
 # Node
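One design note on the pattern above: `web_search_tools` is imported from `tools.tavily_tools` and then mutated with `append`, so every other importer of that module sees the extra tools too — convenient here, but easy to trip over. A minimal illustration of the shared-list behavior (module names stand in for the real imports):

```python
# `from tools.tavily_tools import web_search_tools` binds the *same* list
# object; in-place appends are therefore visible to every importer.
web_search_tools = ["tavily_search", "tavily_extract"]  # as in tools.tavily_tools

imported_alias = web_search_tools   # what the `from ... import` yields elsewhere
imported_alias.append("math_tool")  # the append done in nodes/nodes.py

print(web_search_tools)  # ['tavily_search', 'tavily_extract', 'math_tool']
```

Copying the list first (`tools = [*web_search_tools, math_tool]`) would keep the web-search module's list purely about web search.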
tools/chess_tool.py CHANGED
@@ -118,7 +118,8 @@ class ChessVisionAnalyzer:
         HumanMessage(content=[
             {
                 "type": "text",
-                "text": "Analyze this chess board image and return the chess board orientation. "
+                "text": f"Analyze this chess board image and return the chess board orientation. I know that the "
+                        f"active color is {active_color}"
 
             },
             {
@@ -134,7 +135,7 @@ class ChessVisionAnalyzer:
         response = self.llm1.invoke(messages)
         return response.content
 
-    def analyze_board_from_image(self, active_color: str, image_path: str, llm_no: int,
+    def analyze_board_from_image(self, board_orientation: str, image_path: str, llm_no: int,
                                  squares: Optional[list] = None) -> Optional[ChessBoardAnalysis]:
         """Analyze chess board image and return FEN notation"""
         base64_image = encode_image_to_base64(image_path)
@@ -153,10 +154,8 @@ class ChessVisionAnalyzer:
             {
                 "type": "text",
                 "text": f"""Analyze this chess board image and return the pieces positions.
-                The chess board orientation is from **Black's perspective**.
-                - The files are labeled from **h to a** (left to right).
-                - The ranks are labeled from **8 to 1** (bottom to top).
-                This matches the standard orientation for Black's perspective.
+                {board_orientation}
+
                 {squares_text}
                 Return the positions of the pieces in JSON format.
                 Use the following schema for each piece:
@@ -190,14 +189,14 @@ class ChessVisionAnalyzer:
         return self._parse_llm_response(response.content)
 
     def analyze_board(self, active_color: str, file_reference: str) -> str:
-        first_analysis_res = self.analyze_board_from_image(active_color, file_reference, 1)
-        second_analysis_res = self.analyze_board_from_image(active_color, file_reference, 2)
+        board_orientation = self.analyze_board_orientation(active_color, file_reference)
+        first_analysis_res = self.analyze_board_from_image(board_orientation, file_reference, 1)
+        second_analysis_res = self.analyze_board_from_image(board_orientation, file_reference, 2)
 
         result = self.compare_analyses(first_analysis_res, second_analysis_res)
         if result['conflicts'] is not None and len(result['conflicts']) > 0:
-            arbitrage_result = self.arbitrate_conflicts(result, active_color, file_reference, 3)
+            arbitrage_result = self.arbitrate_conflicts(result, board_orientation, file_reference, 3)
 
-            # todo: if there are still conflicts let one of the llms win
             return arbitrage_result.get("consensus").to_fen(active_color)
         else:
             result.get("consensus").to_fen(active_color)
@@ -226,8 +225,7 @@ class ChessVisionAnalyzer:
         return None
 
     def compare_analyses(self, analysis_1: ChessBoardAnalysis, analysis_2: ChessBoardAnalysis) -> dict:
-        """Compare two analyses and identify conflicts"""
-        print("Comparing analyses")
+        """Compare the given analyses and identify conflicts"""
 
         if not analysis_1 or not analysis_2:
             return {"conflicts": [], "consensus": None, "need_arbitration": False}
@@ -264,7 +262,7 @@ class ChessVisionAnalyzer:
             "need_arbitration": need_arbitration
         }
 
-    def arbitrate_conflicts(self, state: dict, active_color: str, image_path: str, depth: int = 1) -> dict:
+    def arbitrate_conflicts(self, state: dict, board_orientation: str, image_path: str, depth: int = 1) -> dict:
         """Arbitrate conflicting piece positions"""
         print(f"Arbitrating conflicts with depth {depth}")
@@ -278,14 +276,14 @@ class ChessVisionAnalyzer:
 
         print("Pieces with conflicts:", conflicts_sqares)
 
-        first_analysis_res = self.analyze_board_from_image(active_color, image_path, 1, conflicts_sqares)
-        second_analysis_res = self.analyze_board_from_image(active_color, image_path, 2, conflicts_sqares)
+        first_analysis_res = self.analyze_board_from_image(board_orientation, image_path, 1, conflicts_sqares)
+        second_analysis_res = self.analyze_board_from_image(board_orientation, image_path, 2, conflicts_sqares)
         result = self.compare_analyses(first_analysis_res, second_analysis_res)
         result.get("consensus").merge_with(state.get("consensus"))
         if result['conflicts'] is not None and len(result['conflicts']) > 0:
             if depth > 0:
                 depth -= 1
-                result = self.arbitrate_conflicts(result, active_color, image_path, depth)
+                result = self.arbitrate_conflicts(result, board_orientation, image_path, depth)
             else:
                 print("Arbitrage completed with conflicts. took llm2 as ground truth")
                 result.get("consensus").merge_with(second_analysis_res)
@@ -343,7 +341,7 @@ class ChessMoveExplainer:
         5. Keep it concise but informative for an intermediate player
         """
 
-        response = self.llm([HumanMessage(content=prompt)])
+        response = self.llm.invoke([HumanMessage(content=prompt)])
         return response.content
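The `to_fen` method called on the consensus analysis is not shown in this diff. A hedged sketch (hypothetical names, not the repo's implementation) of building the piece-placement field of a FEN string from a square-to-piece map, with uppercase letters for White and the castling/en-passant fields stubbed as `-`:

```python
# Build the board-placement part of FEN from {"e1": "K", ...}; ranks are
# emitted 8 down to 1, files a to h, runs of empty squares as digits.
def to_fen(pieces: dict, active_color: str) -> str:
    rows = []
    for rank in range(8, 0, -1):
        row, empty = "", 0
        for file in "abcdefgh":
            piece = pieces.get(f"{file}{rank}")
            if piece:
                if empty:
                    row += str(empty)
                    empty = 0
                row += piece
            else:
                empty += 1
        if empty:
            row += str(empty)
        rows.append(row)
    # Castling and en-passant stubbed; move counters fixed, as a sketch.
    return "/".join(rows) + f" {active_color} - - 0 1"

print(to_fen({"e1": "K", "e8": "k"}, "w"))  # 4k3/8/8/8/8/8/8/4K3 w - - 0 1
```

Passing `active_color` through, as the diff does, matters because FEN's second field decides whose move a downstream engine analyzes.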
tools/tavily_tools.py CHANGED
@@ -1,21 +1,17 @@
 from langchain_tavily import TavilySearch
 from langchain_tavily import TavilyExtract
-from langchain_tavily import TavilyCrawl
 
 # Initialize Tavily Search Tool
 tavily_search_tool = TavilySearch(
     max_results=10,
     topic="general",
     # Make sure to avoid retrieving the response from a dataset or a space
-    exclude_domains =["https://huggingface.co/datasets", "https://huggingface.co/spaces"]
+    exclude_domains=["https://huggingface.co/datasets", "https://huggingface.co/spaces"]
 )
 
 # Define the LangChain extract tool
 tavily_extract_tool = TavilyExtract(extract_depth="basic")
 
-# Define the LangChain crawl tool
-tavily_crawl_tool = TavilyCrawl()
-
-llm_tools = [
-    tavily_search_tool, tavily_extract_tool, tavily_crawl_tool
-]
+web_search_tools = [
+    tavily_search_tool, tavily_extract_tool
+]
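One caveat on the config above: Tavily's `exclude_domains` is typically documented as taking domain names (e.g. `huggingface.co`) rather than full URLs with paths, so path-qualified entries like these may not filter as intended. A hypothetical client-side fallback that drops results under the excluded URL prefixes, independent of how the API treats them:

```python
# Hypothetical post-filter on search results (each a dict with a "url" key):
# drop any hit whose URL starts with an excluded prefix.
EXCLUDED_PREFIXES = ("https://huggingface.co/datasets", "https://huggingface.co/spaces")

def filter_results(results: list) -> list:
    return [r for r in results if not r["url"].startswith(EXCLUDED_PREFIXES)]

hits = [
    {"url": "https://huggingface.co/datasets/gaia-benchmark"},
    {"url": "https://example.com/article"},
]
print(filter_results(hits))  # [{'url': 'https://example.com/article'}]
```

This keeps GAIA answers from leaking in via the benchmark's own dataset and Spaces pages even if the API-side exclusion is ignored.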