|
|
--- |
|
|
title: Template Final Assignment |
|
|
emoji: 🕵🏻‍♀️
|
|
colorFrom: indigo |
|
|
colorTo: indigo |
|
|
sdk: gradio |
|
|
sdk_version: 5.25.2 |
|
|
app_file: app.py |
|
|
pinned: false |
|
|
hf_oauth: true |
|
|
|
|
|
hf_oauth_expiration_minutes: 480 |
|
|
--- |
|
|
|
|
|
# General AI Assistant 🔮
|
|
|
|
|
## Background |
|
|
Created as the final project for the [Hugging Face Agents course](https://huggingface.co/learn/agents-course).
|
|
The agent aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 90%.
|
|
### What is GAIA?
|
|
|
|
|
GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:
|
|
- multimodal reasoning (e.g., analyzing images, audio, documents) |
|
|
- multi-hop retrieval of interdependent facts |
|
|
- Python code execution
|
|
- a structured response format |
|
|
|
|
|
(see https://huggingface.co/learn/agents-course/unit4/what-is-gaia). |
|
|
|
|
|
GAIA was introduced in the paper [“GAIA: A Benchmark for General AI Assistants”](https://arxiv.org/abs/2311.12983).
|
|
|
|
|
|
|
|
## Implementation Highlights 🛠️
|
|
|
|
|
|
|
|
**The agent** is implemented using the LangGraph framework. |
|
|
|
|
|
 |
|
|
|
|
|
**Nodes** |
|
|
|
|
|
- **Pre-processor**: Initializes and prepares the state, and handles the input: attached pictures are sent directly to the assistant node and to the model.
Other types of attachments are loaded only by the tools.
|
|
- **Assistant**: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`. |
|
|
- **Tools**: Invokes the tool selected by the assistant.
|
|
- **Optimize memory**: Summarizes and then removes all messages except the last two. If the last message is the response of a web
extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
|
|
- **Response Processing**: Brings the answer into the concise format required by GAIA. |
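The memory-optimization step above can be sketched roughly like this (a minimal illustration with plain strings instead of LangChain messages; `optimize_memory` and `KEEP_LAST` are hypothetical names, not the actual implementation):

```python
# Keep only the most recent messages; condense everything older.
KEEP_LAST = 2  # assumed value, matching "the last 2" above

def optimize_memory(messages: list[str]) -> list[str]:
    """Summarize older messages and keep only the most recent ones."""
    if len(messages) <= KEEP_LAST:
        return messages
    older, recent = messages[:-KEEP_LAST], messages[-KEEP_LAST:]
    # A real implementation would ask the LLM for a summary of `older`;
    # here we just record how many messages were condensed.
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + recent

history = ["q1", "a1", "tool result", "follow-up", "final answer"]
print(optimize_memory(history))
# ['[summary of 3 earlier messages]', 'follow-up', 'final answer']
```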
|
|
|
|
|
**Tasks and corresponding Tools** |
|
|
|
|
|
**Web Searches** 🔍 are performed with the Tavily `search` and `extract` tools.
|
|
|
|
|
- **Chunking**: The content returned by the `extract` tool might be too large to be analyzed at once by a model (depending on the model's context window size or on rate limits),
so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are analyzed further.
|
|
- **Text Splitting**: First a hierarchical split (LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (LangChain's `RecursiveCharacterTextSplitter`).
|
|
- **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`. |
|
|
- **Vector DB**: `FAISS` vector db. |
|
|
- **Retrieval**: `FAISS` similarity search.
|
|
|
|
|
The original `extract` tool response message content is then updated to contain only the relevant chunks.
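The size-gated chunking can be illustrated with a minimal sketch (the threshold, chunk size, and overlap values here are made up, and the real pipeline uses LangChain splitters plus FAISS retrieval rather than a plain sliding window):

```python
# Hypothetical limits; the real values are configuration-dependent.
THRESHOLD = 100   # max content size analyzed at once
CHUNK_SIZE = 50
OVERLAP = 10

def maybe_chunk(text: str) -> list[str]:
    """Return the text unchanged if small enough, else overlapping chunks."""
    if len(text) <= THRESHOLD:
        return [text]
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]
```

In the real agent the resulting chunks are embedded and stored in FAISS, and only the chunks most similar to the question are kept.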
|
|
|
|
|
**Audio Analyzer** 🎧 uses `gpt-4o-audio-preview` to analyze the input.
|
|
|
|
|
**Math Problems Solver** 🧮 is a subagent that uses `gpt-5` equipped with the following tools:
|
|
|
|
|
- **Python code executor**: executes a snippet of Python code provided as input
|
|
- **Think tool**: used for strategic reflection on the progress of the solving process |
|
|
|
|
|
At this point, the agent appears to prefer answering the mathematical question from the test set by invoking the Python code executor instead, and
the question is answered correctly. I decided not to remove this tool yet, until I test the agent on other mathematical questions from the GAIA validation set.
|
|
|
|
|
**Python Code Executor** ⚙️ can run either a snippet of Python code or a given Python file. The code snippet is executed using `langchain_experimental.tools.PythonREPLTool`. The Python file is executed in a sub-process.
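The file-execution path can be sketched as follows (the timeout value and error formatting are assumptions, not the exact implementation):

```python
import subprocess
import sys

def run_python_file(path: str, timeout: int = 30) -> str:
    """Run a Python file in a sub-process and return its output."""
    result = subprocess.run(
        [sys.executable, path],   # use the current interpreter
        capture_output=True,
        text=True,
        timeout=timeout,          # guard against runaway scripts
    )
    if result.returncode != 0:
        return f"Error: {result.stderr.strip()}"
    return result.stdout.strip()
```

Running in a sub-process keeps a crashing or long-running script from taking the agent down with it, though it is not a full sandbox (see the future-work notes on safety).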
|
|
|
|
|
**Spreadsheets** 📊: analyzes Excel files using the pandas dataframe agent (`langchain_experimental.agents.create_pandas_dataframe_agent`)
and the `gpt-4.1` model.
|
|
|
|
|
**Chess Move Recommendation** ♟️ Given a chess board and an active color, this tool suggests the best move for the active color.
|
|
|
|
|
- **Picture analysis**: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of
the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` are used to extract the coordinates, and an arbitrage is performed on their outcomes.
|
|
- **Move suggestion**: the best move is suggested by a `stockfish` chess engine. |
|
|
- **Move interpretation**: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
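The programmatic FEN computation mentioned above can be sketched like this (only the piece-placement field of the FEN; `board_to_fen` is a hypothetical helper, with row 0 = rank 8, column 0 = file a, and uppercase letters for white pieces, per FEN conventions):

```python
def board_to_fen(pieces: dict[tuple[int, int], str]) -> str:
    """Build the FEN placement field from detected piece coordinates."""
    rows = []
    for r in range(8):
        row, empty = "", 0
        for c in range(8):
            piece = pieces.get((r, c))
            if piece is None:
                empty += 1            # count consecutive empty squares
            else:
                if empty:
                    row += str(empty)  # flush the empty-square run
                    empty = 0
                row += piece
        if empty:
            row += str(empty)
        rows.append(row)
    return "/".join(rows)

# Lone black king on e8, white king on e1:
print(board_to_fen({(0, 4): "k", (7, 4): "K"}))
# 4k3/8/8/8/8/8/8/4K3
```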
|
|
|
|
|
**Chess Board Picture Analysis - Challenges and Limitations** 🔍
|
|
I tried both the `gpt-4.1` and `gemini-2.5-flash` models for chess piece coordinate extraction, but obtained inconsistent results (there
are times when they get it right, but also instances when they don't).
|
|
At least for OpenAI, there is a limitation listed on their website (see [https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):
|
|
>Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions. |
|
|
|
|
|
The tool questions both models and performs an arbitrage on their results: it queries both models again, but only on the pieces with conflicting positions.
This process is repeated a limited number of times. If conflicts still remain at the end, the answers provided by one of
the models (in this case `gemini-2.5-flash`) are considered the ground truth. The aim of this approach was to reduce the number of objects that the model focuses on.
From what I observed, this approach improved the chances of correctly identifying the pieces on the board ⚠️.
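The arbitrage loop can be sketched as follows (`ask_gpt` and `ask_gemini` stand in for the real model calls, and `MAX_ROUNDS` is an assumed retry limit):

```python
MAX_ROUNDS = 3  # hypothetical limit on re-query rounds

def arbitrate(ask_gpt, ask_gemini, squares):
    """Query both models per square; re-query only where they disagree."""
    answers = {}
    pending = list(squares)
    for _ in range(MAX_ROUNDS):
        if not pending:
            break
        still_conflicting = []
        for sq in pending:
            a, b = ask_gpt(sq), ask_gemini(sq)
            if a == b:
                answers[sq] = a           # agreement: accept the answer
            else:
                still_conflicting.append(sq)
        pending = still_conflicting       # narrow the focus each round
    for sq in pending:
        answers[sq] = ask_gemini(sq)      # unresolved: trust gemini
    return answers
```

Each round narrows the set of squares the models must focus on, which is the point of the approach described above.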
|
|
|
|
|
|
|
|
|
|
|
**YouTube Videos Analysis** 🎥 This is work in progress 🚧
|
|
|
|
|
So far, the agent is able to answer questions about the conversation inside a YouTube video. There is no dedicated tool for this;
the assistant retrieves the transcripts using the `tavily extract` tool.

TODO: analyze YouTube videos and answer questions about objects in the video.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Future work and improvements 🚀
|
|
- **Evaluation**: Evaluate the agent against other questions from the GAIA validation set. |
|
|
- **Large Web Extracts**: Try other chunking strategies. |
|
|
- **Audio Analysis**: Use a less expensive model (like `whisper`) to get the transcripts; if that is not enough to answer the question and more sophisticated processing is needed
for other sounds (music, barks, or other types of sounds), then use a better model.
|
|
- **Python File Execution**: Improve safety when executing Python code or Python files.
|
|
- **Video Analysis**: Answer questions about objects in the video. |
|
|
- **Chessboard Images Analysis**: Detect correctly all pieces on a chess board image. |
|
|
|
|
|
|
|
|
## How to use |
|
|
Please check the `.env.example` file for the environment variables that need to be configured. If running in a Hugging Face Space, you can
set them as secrets; otherwise, rename `.env.example` to `.env`.
|
|
|
|
|
The `CHESS_ENGINE_PATH` variable needs to be configured only when running on a Windows machine and must point
to the `stockfish` executable; otherwise, the `stockfish` installation is detected automatically.
|
|
|
|
|
The `LANGSMITH_*` properties need to be configured only if you want to enable observability with LangSmith. |
|
|
|
|
|
The `SUBMISSION_MODE_ON` flag indicates whether the application runs in submission mode (the 20 questions are fetched and the answers are
submitted for agent evaluation) or not (the agent accepts a question and an attachment).
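Reading the flag could be as simple as the following sketch (the accepted truthy values are an assumption):

```python
import os

def submission_mode_on() -> bool:
    """Interpret the SUBMISSION_MODE_ON environment variable as a boolean."""
    return os.getenv("SUBMISSION_MODE_ON", "false").strip().lower() in ("1", "true", "yes")
```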
|
|
|
|
|
## References 📚
|
|
The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference |