Commit d1bfef5 (parent: 0555e07): updated README file

README.md (changed)
Aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 65%.

### What is GAIA

GAIA is a benchmark that evaluates AI assistants on real-world tasks that involve:
- multimodal reasoning (e.g., analyzing images, audio, documents)
- multi-hop retrieval of interdependent facts
- Python code execution
**Nodes**

- **Pre-processor**: Pre-processes the question.
- **Assistant**: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`.
- **Tools**: The invocation of a tool.
- **Optimize memory**: Summarizes and then removes all messages except the last 2. If the last message is the response of a web extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
- **Response Processing**: Brings the answer into the concise format required by GAIA.
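The memory-optimization step can be sketched as follows (a minimal illustration; the function name and message shapes are assumptions, and in the real agent the summary comes from an LLM call):

```python
def optimize_memory(messages: list[dict], summarize, keep_last: int = 2) -> list[dict]:
    """Replace the conversation head with a summary, keeping the last messages."""
    if len(messages) <= keep_last + 1:
        return messages                       # nothing worth summarizing yet
    head, tail = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(head)                 # an LLM call in the real agent
    return [{"role": "system", "content": f"Summary so far: {summary}"}] + tail
```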

**Tasks and corresponding Tools**
**Web Searches** are undertaken by the `tavily` `search` and `extract` tools.

- **Chunking**: The content returned by the `extract` tool might be too large to be analyzed at once by a model (depending on the chosen model's context window size or on the rate limits), so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are further analyzed.
- **Text Splitting**: First a hierarchical split (using LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (using LangChain's `RecursiveCharacterTextSplitter`).
- **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`.
- **Vector DB**: a `FAISS` vector DB.
- **Retrieval**: `FAISS` similarity search.
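The size-threshold chunking can be sketched as a plain sliding window (a simplified stand-in for the LangChain splitters named above; the function name and sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split oversized text into overlapping chunks with a sliding window."""
    if len(text) <= chunk_size:
        return [text]                         # under the threshold: keep as-is
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap         # slide forward, keeping overlap
    return chunks
```

In the real pipeline each chunk would then be embedded and ranked by FAISS similarity search, and only the top chunks passed back to the model.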
- **Python code executor**: executes a snippet of Python code provided as input.
- **Think tool**: used for strategic reflection on the progress of the solving process.

**Python Code Executor** can run either a snippet of Python code or a given Python file. The code snippet is executed using `langchain_experimental.tools.PythonREPLTool`. The Python file is executed in a sub-process.
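Running a Python file in a sub-process amounts to something like this (a sketch, not the project's actual code; the helper name and timeout are assumptions):

```python
import subprocess
import sys

def run_python_file(path: str, timeout: int = 30) -> str:
    """Run a Python file in a sub-process and return its output,
    so a crash or runaway script cannot take down the agent itself."""
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        return f"Error: {result.stderr.strip()}"
    return result.stdout
```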
**Spreadsheets**: analyzes `excel` files using the pandas dataframe agent `langchain_experimental.agents.create_pandas_dataframe_agent` and the `gpt-4.1` model.
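Under the hood, the dataframe agent lets the model generate and execute `pandas` code against the loaded sheet. For a question like "what are the total sales?", the generated code amounts to something like this (illustrative data and column names, not the project's actual prompts):

```python
import pandas as pd

# Illustrative data standing in for a loaded excel sheet
df = pd.DataFrame({
    "category": ["food", "drink", "food"],
    "sales": [10, 5, 20],
})

# For "what are the total sales?" the agent would generate and run:
total = df["sales"].sum()

# For "what are the total sales per category?":
by_category = df.groupby("category")["sales"].sum()
```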

**Chess Move Recommendation**: Given a chess board and an active color, this tool suggests the best move for the active color.
- **Picture analysis**: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` are used to extract the coordinates, and an arbitrage is performed on their outcomes.
- **Move suggestion**: the best move is suggested by a `stockfish` chess engine.
- **Move interpretation**: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
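Computing the FEN piece-placement field from agreed coordinates is purely mechanical; a minimal sketch (the helper name and input format are assumptions):

```python
def board_to_fen_placement(pieces: dict[str, str]) -> str:
    """Build the FEN piece-placement field from square -> piece-letter
    coordinates, e.g. {"e1": "K", "e8": "k"} (white pieces uppercase)."""
    ranks = []
    for rank in range(8, 0, -1):              # FEN lists rank 8 first
        row, empty = "", 0
        for file in "abcdefgh":
            piece = pieces.get(f"{file}{rank}")
            if piece:
                if empty:
                    row += str(empty)         # flush the run of empty squares
                empty = 0
                row += piece
            else:
                empty += 1
        if empty:
            row += str(empty)
        ranks.append(row)
    return "/".join(ranks)
```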
**Chess Board Picture Analysis - Challenges and Limitations**

I tried both `gpt-4.1` and `gemini-2.5-flash` for chess piece coordinate extraction, but I obtained inconsistent results (there are times when they get it right, but also instances when they don't).
At least for OpenAI, there is a limitation listed on their website (see [here](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):

> Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
I questioned both models and chose to do an arbitrage on their results: I invoked both again, but only on the conflicting positions. This process continues for a limited number of rounds. My aim was to reduce the number of objects that each model focuses on.
But still, the inconsistencies remain.

Things I want to try further: improving the prompt, and, if conflicts occur, retrying with conflicting pieces instead of positions (e.g. if a white pawn got conflicting positions, ask for the positions of all white pawns).
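The arbitrage loop described above can be sketched like this (a simplified stand-in: `reread` represents a vision-model call, and all names are assumptions):

```python
def arbitrate(read_a: dict, read_b: dict, reread, max_rounds: int = 3) -> dict:
    """Merge two square -> piece readings, re-querying only the squares
    where the two models disagree."""
    # Keep every square the two readings already agree on
    merged = {sq: p for sq, p in read_a.items() if read_b.get(sq) == p}
    conflicts = {sq for sq in set(read_a) | set(read_b)
                 if read_a.get(sq) != read_b.get(sq)}
    for _ in range(max_rounds):
        if not conflicts:
            break
        again_a = reread("model_a", conflicts)   # vision-model calls in the real tool
        again_b = reread("model_b", conflicts)
        agreed = {sq: again_a[sq] for sq in conflicts
                  if sq in again_a and again_a[sq] == again_b.get(sq)}
        merged.update(agreed)
        conflicts -= set(agreed)
    return merged
```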
**YouTube Videos Analysis**: This is work in progress.

So far, the agent is able to respond to questions about the conversation inside a YouTube video. There is no dedicated tool for this; the assistant fetches the transcripts using the `tavily` `extract` tool.
TODO: analyze YouTube videos and answer questions about objects in the video.
## Future work and improvements

- **Evaluation**: Implement an automated evaluation for the reference set of questions. Evaluate the agent against other questions from the GAIA validation set.
- **Large Web Extracts**: Try other chunking strategies.
- **Audio Analysis**: Use a less expensive model (like `whisper`) to get the transcripts; if that is not enough to answer the question and more sophisticated processing is needed for other sounds (music, barks, or other kinds of audio), then use a better model.
- **Python File Execution**: Improve safety when executing Python code or Python files.
- **Video Analysis**: Answer questions about objects in a video.
- **Chessboard Images Analysis**: Correctly detect all pieces on a chess board image.
## References

The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research