carolinacon committed
Commit d1bfef5 · 1 Parent(s): 0555e07

updated README file

Files changed (1): README.md +29 -38
README.md CHANGED
@@ -19,7 +19,7 @@ Created as a final project for the HuggingFace Agents course ( https://huggingfa
  Aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 65%.
  ### What is GAIA
 
- GAIA is a benchmark for AI assistants evaluation on real-world tasks that involves:
  - multimodal reasoning (e.g., analyzing images, audio, documents)
  - multi-hop retrieval of interdependent facts
  - python code execution
@@ -40,20 +40,20 @@ GAIA introductory paper ["GAIA: A Benchmark for General AI Assistants"](http
  **Nodes**
 
  - **Pre-processor**: Pre-processing of the question.
- - **Assistant**: The brain of the agent. Decides which tool to call.
- - **Tools**: The invocation of a tool
  - **Optimize memory**: This step summarizes and then removes all the messages except the last 2. If the last message is the response of a web
- extract that exceed a threshold, it chunks the response and keeps only the most relevant chunks.
  - **Response Processing**: Brings the answer into the concise format required by GAIA.
 
- **Tools**
 
- **Web Search** 🔎 uses `tavily` search and extract tools.
 
- - **Chunking**: The content returned by the `extract` tool might be too large to be further analyzed at once by a model (depending on the chosen model context window size or on the rate limitation),
- so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks further analyzed.
- - **Text Splitting**: First by markdown (used Langchain's `MarkdownHeaderTextSplitter`) and then further by size with a sliding window (used LangChain's `RecursiveCharacterTextSplitter`).
- - **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`
  - **Vector DB**: `FAISS` vector db.
  - **Retrieval**: `FAISS` similarity search
 
@@ -66,26 +66,23 @@ so if its size exceeds a pre-configured threshold, it is chunked and only the mo
  - **Python code executor**: executes a snippet of python code provided as input
  - **Think tool**: used for strategic reflection on the progress of the solving process
 
- I chose to implement it as a code agent
- It has the following states:
-
- **Python code Executor**⚙️ can run either a snippet of python code or a python file. The python file is executed in a sub-process.
 
  **Spreadsheets**📊: analyzes `excel` files using the pandas dataframe agent `langchain_experimental.agents.create_pandas_dataframe_agent`
  and the `gpt-4.1` model.
 
- **Chess**♟️ Given a chess board and the active color, this tool is able to suggest the best move to be performed by the active color.
 
- - **Picture analysis**: the tool must detect the location of each piece on the chess board. Once the coordinates are retrieved, the FEN of
  the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` models are used to extract the coordinates and an arbitrage is performed on their outcomes.
  - **Move suggestion**: the best move is suggested by a `stockfish` chess engine
- - **Move interpretation**: the move is then interpreted and transcribed into the algebraic notation. Used `gpt-4` for this.
 
- **Chess Board picture analysis Challenges and Limitations** 🆘
- - I tried both `gpt-4.1` and `gemini-2.5-flash` models for chess pieces coordinates extraction, but I obtained inconsistent results (there
  are times when they get it right, but also instances when they don't).
- At least for openai I see there is a limitation on their website:
- >Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions (see [here](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)).
 
  I questioned both models and chose to do an arbitrage on their results. I invoked both further but only on the conflicting positions.
  This process continues for a limited number of times. My aim was to reduce the number of objects that the model focuses on.
@@ -94,30 +91,24 @@ But still, the inconsistencies remain ⚠️.
  Things that I want to try further: improving the prompt, if conflicts occur, retry with conflicting pieces instead of positions (e.g. if a white pawn got conflicting positions, ask for the positions of all white pawns).
 
- 🎥 **Videos** 🚧
-
 
- ## Challenges 🆘
- 1. Chess Board picture anaysis.
- 2. Video analysis.
 
  ## Future work and improvements 🔜
- #### 1. Evaluation
- Implement an automated evaluation for the reference set of questions.
- Evaluate the agent againts other questions from the validation test.
- Draw a comparison between different models and this agent.
- #### 2. Chunking
- Try other chunking strategies as well.
- #### 3. Audio Analysis
- Use a lesser expensive model to get the transcripts (like whisper) and if this is not enough to answer the question and more sophiticated processing is needed
  for other sounds like music, barks or other types of sounds, then use a better model.
- #### 4 Python File execution
- Make sure ...safety
- #### 5. Video Analysis
- #### 6. Chessboard Images analysis
 
  ## References 📚
  The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research
 
Aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 65%.

### What is GAIA

GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:
- multimodal reasoning (e.g., analyzing images, audio, documents)
- multi-hop retrieval of interdependent facts
- python code execution
 
**Nodes**

- **Pre-processor**: Pre-processing of the question.
- **Assistant**: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`.
- **Tools**: The invocation of a tool.
- **Optimize memory**: This step summarizes and then removes all the messages except the last 2. If the last message is the response of a web
extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
- **Response Processing**: Brings the answer into the concise format required by GAIA.
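The memory-optimization step above can be sketched as follows; the message format and the `summarize` helper are simplified stand-ins, not the actual implementation:

```python
# Illustrative sketch only: condense everything except the last two messages
# into a single summary entry. In the real agent the summary would come from
# an LLM call; here summarize() is a trivial stand-in.
def summarize(messages: list[str]) -> str:
    # Stand-in for an LLM summarization call over the older messages.
    return f"summary of {len(messages)} earlier messages"

def optimize_memory(messages: list[str], keep_last: int = 2) -> list[str]:
    # Nothing to do if the history is already short enough.
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    return [summarize(older)] + recent
```

So a four-message history collapses to the summary entry plus the last two messages.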

**Tasks and corresponding Tools**

**Web Searches** 🔎 are undertaken by the `tavily` `search` and `extract` tools.

- **Chunking**: The content returned by the `extract` tool might be too large to be analyzed at once by a model (depending on the chosen model's context window size or on the rate limits),
so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are further analyzed.
- **Text Splitting**: First a hierarchical split by markdown headers (using LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (using LangChain's `RecursiveCharacterTextSplitter`).
- **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`.
- **Vector DB**: `FAISS` vector db.
- **Retrieval**: `FAISS` similarity search.
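The two-stage split can be sketched with stdlib code alone (the real pipeline uses the LangChain splitters named above; the chunk sizes here are illustrative):

```python
# Illustrative stand-in for the two-stage split: MarkdownHeaderTextSplitter
# breaks the page at headings, then RecursiveCharacterTextSplitter windows
# each oversized section by character count with overlap.
import re

def split_by_headers(markdown: str) -> list[str]:
    # Stage 1: one section per markdown heading.
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def sliding_window(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Stage 2: fixed-size windows that overlap so context is not cut mid-thought.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_page(markdown: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    chunks = []
    for section in split_by_headers(markdown):
        if len(section) <= chunk_size:
            chunks.append(section)  # small sections pass through whole
        else:
            chunks.extend(sliding_window(section, chunk_size, overlap))
    return chunks
```

Each resulting chunk is then embedded and indexed in the FAISS store, and only the top-scoring chunks for the question survive.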
 
 
- **Python code executor**: executes a snippet of python code provided as input
- **Think tool**: used for strategic reflection on the progress of the solving process

**Python code Executor**⚙️ can run either a snippet of python code or a given python file. The python code snippet is executed using `langchain_experimental.tools.PythonREPLTool`. The python file is executed in a sub-process.
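The file-execution path might look like the following sketch; the helper name and timeout are assumptions, only the sub-process isolation is from the description above:

```python
# Hypothetical sketch: snippets go through PythonREPLTool; files are run in a
# separate interpreter process so a crash cannot take down the agent itself.
import subprocess
import sys
import tempfile

def run_python_file(path: str, timeout: int = 30) -> str:
    """Execute a python file in a sub-process and return its stdout (or stderr)."""
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,  # guard against scripts that never terminate
    )
    return result.stdout if result.returncode == 0 else result.stderr

# Usage: write a trivial script and run it in isolation.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print(6 * 7)")
out = run_python_file(f.name)  # → "42\n"
```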
 
 
 

**Spreadsheets**📊: analyzes `excel` files using the pandas dataframe agent `langchain_experimental.agents.create_pandas_dataframe_agent`
and the `gpt-4.1` model.
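Under the hood, the dataframe agent turns a natural-language question into pandas code that it runs against the loaded sheet; the data and question below are made up for illustration:

```python
# Illustrative only: the kind of pandas expression the dataframe agent
# generates and executes. A real run would start from pd.read_excel(...).
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "US", "EU", "US"],
    "sales": [100, 200, 50, 25],
})

# Question: "What are the total sales per region?"
totals = df.groupby("region")["sales"].sum()  # EU: 150, US: 225
```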

**Chess Move Recommendation**♟️ Given a chess board and an active color, this tool suggests the best move for the active color.

- **Picture analysis**: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of
the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` models are used to extract the coordinates and an arbitrage is performed on their outcomes.
- **Move suggestion**: the best move is suggested by the `stockfish` chess engine.
- **Move interpretation**: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
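Once the piece coordinates are agreed on, computing the FEN is mechanical; a stdlib-only sketch of the board-placement field (the input format and function name here are hypothetical):

```python
# Hypothetical sketch: turn detected piece coordinates (e.g. {"e1": "K"}) into
# the board-placement field of a FEN string. White pieces are upper-case,
# black pieces lower-case, and runs of empty squares are counted as digits.
def placement_to_fen(pieces: dict[str, str]) -> str:
    rows = []
    for rank in range(8, 0, -1):          # FEN lists rank 8 first
        row, empty = "", 0
        for file in "abcdefgh":
            piece = pieces.get(f"{file}{rank}")
            if piece is None:
                empty += 1
            else:
                if empty:
                    row += str(empty)
                    empty = 0
                row += piece
        if empty:
            row += str(empty)
        rows.append(row)
    return "/".join(rows)

# Usage: a lone white king on e1 and a black king on e8.
fen = placement_to_fen({"e1": "K", "e8": "k"})  # → "4k3/8/8/8/8/8/8/4K3"
```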

**Chess Board Picture Analysis - Challenges and Limitations** 🆘
I tried both the `gpt-4.1` and `gemini-2.5-flash` models for chess piece coordinate extraction, but I obtained inconsistent results (there
are times when they get it right, but also instances when they don't).
At least for OpenAI, there is a limitation listed on their website (see [here](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):
>Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

I questioned both models and chose to do an arbitrage on their results. I invoked both further, but only on the conflicting positions.
This process continues for a limited number of times. My aim was to reduce the number of objects that the model focuses on.
But still, the inconsistencies remain ⚠️.
 
Things that I want to try further: improving the prompt; if conflicts occur, retrying with conflicting pieces instead of positions (e.g. if a white pawn got conflicting positions, ask for the positions of all white pawns).

**YouTube Videos Analysis**🎥 This is work in progress 🚧

So far, the agent is able to respond to questions about the conversation inside a YouTube video. There is no dedicated tool for this.
The assistant searches for the transcripts by using the `tavily` `extract` tool.
TODO: analyze YouTube videos and answer questions about objects in the video.

## Future work and improvements 🔜
- **Evaluation**: Implement an automated evaluation for the reference set of questions.
Evaluate the agent against other questions from the GAIA validation set.
- **Large Web Extracts**: Try other chunking strategies.
- **Audio Analysis**: Use a less expensive model (like whisper) to get the transcripts, and if that is not enough to answer the question and more sophisticated processing is needed
for other sounds like music, barks or other types of sounds, then use a better model.
- **Python File Execution**: Improved safety when executing python code or python files.
- **Video Analysis**: Answer questions about objects in the video.
- **Chessboard Images Analysis**: Detect correctly all pieces on a chess board image.
 

## References 📚
The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research