Commit d1bfef5 (parent: 0555e07): updated README file

README.md (changed)
Aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 65%.

### What is GAIA

GAIA is a benchmark that evaluates AI assistants on real-world tasks that involve:
- multimodal reasoning (e.g., analyzing images, audio, documents)
- multi-hop retrieval of interdependent facts
- Python code execution
**Nodes**

- **Pre-processor**: Pre-processes the question.
- **Assistant**: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`.
- **Tools**: The invocation of a tool.
- **Optimize memory**: Summarizes and then removes all messages except the last 2. If the last message is the response of a web extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
- **Response Processing**: Brings the answer into the concise format required by GAIA.
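The memory-optimization step can be sketched as follows (a minimal illustration; the function name and message shapes are assumptions, and in the real agent the summary comes from an LLM call):

```python
def optimize_memory(messages: list[dict], summarize, keep_last: int = 2) -> list[dict]:
    """Replace the conversation head with a summary, keeping the last messages."""
    if len(messages) <= keep_last + 1:
        return messages                       # nothing worth summarizing yet
    head, tail = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(head)                 # an LLM call in the real agent
    return [{"role": "system", "content": f"Summary so far: {summary}"}] + tail
```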

**Tasks and corresponding Tools**
**Web Searches** are undertaken by the `tavily` `search` and `extract` tools.

- **Chunking**: The content returned by the `extract` tool might be too large to be analyzed at once by a model (depending on the chosen model's context window size or on the rate limits), so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are further analyzed.
- **Text Splitting**: First a hierarchical split (using LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (using LangChain's `RecursiveCharacterTextSplitter`).
- **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`.
- **Vector DB**: a `FAISS` vector DB.
- **Retrieval**: `FAISS` similarity search.
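The size-threshold chunking can be sketched as a plain sliding window (a simplified stand-in for the LangChain splitters named above; the function name and sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split oversized text into overlapping chunks with a sliding window."""
    if len(text) <= chunk_size:
        return [text]                         # under the threshold: keep as-is
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap         # slide forward, keeping overlap
    return chunks
```

In the real pipeline each chunk would then be embedded and ranked by FAISS similarity search, and only the top chunks passed back to the model.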
- **Python code executor**: executes a snippet of Python code provided as input.
- **Think tool**: used for strategic reflection on the progress of the solving process.

**Python Code Executor** can run either a snippet of Python code or a given Python file. The code snippet is executed using `langchain_experimental.tools.PythonREPLTool`. The Python file is executed in a sub-process.
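Running a Python file in a sub-process amounts to something like this (a sketch, not the project's actual code; the helper name and timeout are assumptions):

```python
import subprocess
import sys

def run_python_file(path: str, timeout: int = 30) -> str:
    """Run a Python file in a sub-process and return its output,
    so a crash or runaway script cannot take down the agent itself."""
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        return f"Error: {result.stderr.strip()}"
    return result.stdout
```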
**Spreadsheets**: analyzes `excel` files using the pandas dataframe agent `langchain_experimental.agents.create_pandas_dataframe_agent` and the `gpt-4.1` model.
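Under the hood, the dataframe agent lets the model generate and execute `pandas` code against the loaded sheet. For a question like "what are the total sales?", the generated code amounts to something like this (illustrative data and column names, not the project's actual prompts):

```python
import pandas as pd

# Illustrative data standing in for a loaded excel sheet
df = pd.DataFrame({
    "category": ["food", "drink", "food"],
    "sales": [10, 5, 20],
})

# For "what are the total sales?" the agent would generate and run:
total = df["sales"].sum()

# For "what are the total sales per category?":
by_category = df.groupby("category")["sales"].sum()
```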

**Chess Move Recommendation**: Given a chess board and an active color, this tool suggests the best move for the active color.
- **Picture analysis**: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` are used to extract the coordinates, and an arbitrage is performed on their outcomes.
- **Move suggestion**: the best move is suggested by a `stockfish` chess engine.
- **Move interpretation**: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
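Computing the FEN piece-placement field from agreed coordinates is purely mechanical; a minimal sketch (the helper name and input format are assumptions):

```python
def board_to_fen_placement(pieces: dict[str, str]) -> str:
    """Build the FEN piece-placement field from square -> piece-letter
    coordinates, e.g. {"e1": "K", "e8": "k"} (white pieces uppercase)."""
    ranks = []
    for rank in range(8, 0, -1):              # FEN lists rank 8 first
        row, empty = "", 0
        for file in "abcdefgh":
            piece = pieces.get(f"{file}{rank}")
            if piece:
                if empty:
                    row += str(empty)         # flush the run of empty squares
                empty = 0
                row += piece
            else:
                empty += 1
        if empty:
            row += str(empty)
        ranks.append(row)
    return "/".join(ranks)
```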
**Chess Board Picture Analysis - Challenges and Limitations**

I tried both `gpt-4.1` and `gemini-2.5-flash` for chess piece coordinate extraction, but I obtained inconsistent results (there are times when they get it right, but also instances when they don't).
At least for OpenAI, there is a limitation listed on their website (see [here](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):

> Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
I questioned both models and chose to do an arbitrage on their results: I invoked both again, but only on the conflicting positions. This process continues for a limited number of rounds. My aim was to reduce the number of objects that each model focuses on.
But still, the inconsistencies remain.

Things I want to try further: improving the prompt, and, if conflicts occur, retrying with conflicting pieces instead of positions (e.g. if a white pawn got conflicting positions, ask for the positions of all white pawns).
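The arbitrage loop described above can be sketched like this (a simplified stand-in: `reread` represents a vision-model call, and all names are assumptions):

```python
def arbitrate(read_a: dict, read_b: dict, reread, max_rounds: int = 3) -> dict:
    """Merge two square -> piece readings, re-querying only the squares
    where the two models disagree."""
    # Keep every square the two readings already agree on
    merged = {sq: p for sq, p in read_a.items() if read_b.get(sq) == p}
    conflicts = {sq for sq in set(read_a) | set(read_b)
                 if read_a.get(sq) != read_b.get(sq)}
    for _ in range(max_rounds):
        if not conflicts:
            break
        again_a = reread("model_a", conflicts)   # vision-model calls in the real tool
        again_b = reread("model_b", conflicts)
        agreed = {sq: again_a[sq] for sq in conflicts
                  if sq in again_a and again_a[sq] == again_b.get(sq)}
        merged.update(agreed)
        conflicts -= set(agreed)
    return merged
```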
**YouTube Videos Analysis**: This is work in progress.

So far, the agent is able to respond to questions about the conversation inside a YouTube video. There is no dedicated tool for this; the assistant fetches the transcripts using the `tavily` `extract` tool.
TODO: analyze YouTube videos and answer questions about objects in the video.
## Future work and improvements

- **Evaluation**: Implement an automated evaluation for the reference set of questions. Evaluate the agent against other questions from the GAIA validation set.
- **Large Web Extracts**: Try other chunking strategies.
- **Audio Analysis**: Use a less expensive model (like `whisper`) to get the transcripts; if that is not enough to answer the question and more sophisticated processing is needed for other sounds (music, barks, or other kinds of audio), then use a better model.
- **Python File Execution**: Improve safety when executing Python code or Python files.
- **Video Analysis**: Answer questions about objects in a video.
- **Chessboard Images Analysis**: Correctly detect all pieces on a chess board image.
## References

The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research