---
title: Template Final Assignment
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
# optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
hf_oauth_expiration_minutes: 480
---
# General AI Assistant 🔮
## Background
Created as a final project for the HuggingFace Agents course (https://huggingface.co/learn/agents-course).
It aims to answer Level 1 questions from the **GAIA** validation set and was tested on 20 such questions with a success rate of 90%.
### What is GAIA
GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:
- multimodal reasoning (e.g., analyzing images, audio, documents)
- multi-hop retrieval of interdependent facts
- Python code execution
- a structured response format
(see https://huggingface.co/learn/agents-course/unit4/what-is-gaia).
GAIA introductory paper: ["GAIA: A Benchmark for General AI Assistants"](https://arxiv.org/abs/2311.12983).
## Implementation Highlights ๐Ÿ› ๏ธ
**The agent** is implemented using the LangGraph framework.
![img.png](img.png)
**Nodes**
- **Pre-processor**: Initializes and prepares the state, and handles the input: attached pictures are sent directly to the assistant node and to the model;
other types of attachments are loaded only by the tools.
- **Assistant**: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`.
- **Tools**: Invokes the tool chosen by the assistant.
- **Optimize memory**: Summarizes and then removes all messages except the last two. If the last message is the response of a web
extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
- **Response Processing**: Brings the answer into the concise format required by GAIA.
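For illustration, here is a minimal sketch of how such a graph could be wired in LangGraph. The node bodies are hypothetical stubs and the routing is an assumption based on the descriptions above (the actual graph is shown in the image), not the real implementation:

```python
# Minimal LangGraph wiring sketch; node bodies are hypothetical stubs.
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]

# Stubs standing in for the real node implementations described above.
def pre_processor(state: AgentState): return {}     # init state, route attachments
def assistant(state: AgentState): return {}         # gpt-4.1 picks the next tool
def optimize_memory(state: AgentState): return {}   # summarize / prune messages
def process_response(state: AgentState): return {}  # GAIA-style concise answer

builder = StateGraph(AgentState)
builder.add_node("pre_processor", pre_processor)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode([]))             # the real tool list goes here
builder.add_node("optimize_memory", optimize_memory)
builder.add_node("process_response", process_response)

builder.add_edge(START, "pre_processor")
builder.add_edge("pre_processor", "assistant")
# Tool call emitted -> run the tool; otherwise format the final answer.
builder.add_conditional_edges("assistant", tools_condition,
                              {"tools": "tools", END: "process_response"})
builder.add_edge("tools", "optimize_memory")
builder.add_edge("optimize_memory", "assistant")
builder.add_edge("process_response", END)
app = builder.compile()
```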
**Tasks and corresponding Tools**
**Web Searches** 🔎 are handled by the Tavily `search` and `extract` tools.
- **Chunking**: The content returned by the `extract` tool might be too large to analyze at once with a model (depending on the chosen model's context window size or on rate limits),
so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are analyzed further.
- **Text Splitting**: First a hierarchical split (LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (LangChain's `RecursiveCharacterTextSplitter`).
- **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`.
- **Vector DB**: `FAISS` vector db.
- **Retrieval**: `FAISS` similarity search.
The original `extract` tool response message content is updated to contain only the relevant chunks; a sketch of this pipeline follows below.
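A minimal sketch of the chunk-and-retrieve step under the assumptions above; the threshold, chunk sizes, and helper name are illustrative, not the actual configuration:

```python
# Sketch of the chunk-and-retrieve step for oversized web extracts.
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

MAX_EXTRACT_CHARS = 20_000  # illustrative threshold

def relevant_chunks(extract_text: str, question: str, k: int = 4) -> str:
    if len(extract_text) <= MAX_EXTRACT_CHARS:
        return extract_text  # small enough to analyze as-is
    # 1) Hierarchical split on markdown headers...
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    sections = header_splitter.split_text(extract_text)
    # 2) ...then size-based splitting with a sliding (overlapping) window.
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    chunks = splitter.split_documents(sections)
    # 3) Embed the chunks, index them in FAISS, keep only the top matches.
    store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    best = store.similarity_search(question, k=k)
    return "\n\n".join(doc.page_content for doc in best)
```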
**Audio Analyzer** 🔉 uses `gpt-4o-audio-preview` to analyze the input.
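A sketch of how an audio attachment can be passed to `gpt-4o-audio-preview` with the OpenAI SDK; the file name and question are illustrative:

```python
# Sketch: answer a question about an audio attachment with gpt-4o-audio-preview.
import base64
from openai import OpenAI

client = OpenAI()
with open("attachment.mp3", "rb") as f:  # illustrative file name
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is said in this recording?"},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "mp3"}},
        ],
    }],
)
print(response.choices[0].message.content)
```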
**Math Problems Solver** 🧮 is a subagent that uses `gpt-5` equipped with the following tools:
- **Python code executor**: executes a snippet of Python code provided as input
- **Think tool**: used for strategic reflection on the progress of the solving process
At this point, the agent seems to prefer answering the mathematical question from the test set by invoking the Python code executor directly instead of this subagent.
The question is answered correctly either way. I decided not to remove this tool yet, until I test the agent on other mathematical questions from the GAIA validation set.
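A minimal sketch of such a subagent, assuming a hypothetical `think` tool body and an illustrative question; the actual wiring may differ:

```python
# Sketch of the math subagent: a ReAct-style agent with a code executor
# and a "think" tool. The think tool body is a hypothetical stand-in.
from langchain_core.tools import tool
from langchain_experimental.tools import PythonREPLTool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def think(reflection: str) -> str:
    """Record a strategic reflection on the progress of the solving process."""
    return reflection  # the value exists only for the agent to re-read

math_agent = create_react_agent(
    ChatOpenAI(model="gpt-5"),
    tools=[PythonREPLTool(), think],
)
result = math_agent.invoke(
    {"messages": [("user", "What is the sum of the first 100 primes?")]}
)
```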
**Python Code Executor** ⚙️ can run either a snippet of Python code or a given Python file. Code snippets are executed using `langchain_experimental.tools.PythonREPLTool`; Python files are executed in a subprocess.
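A sketch of the two execution paths; the timeout value and helper name are illustrative assumptions:

```python
# Snippet path: LangChain's REPL tool. File path: a separate interpreter process.
import subprocess
import sys

from langchain_experimental.tools import PythonREPLTool

repl = PythonREPLTool()
print(repl.run("print(2 ** 10)"))  # snippet path -> "1024"

def run_python_file(path: str, timeout: int = 60) -> str:
    # File path: executed in a subprocess, isolated from the agent process.
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr
```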
**Spreadsheets** 📊 are analyzed as `excel` files using the pandas dataframe agent (`langchain_experimental.agents.create_pandas_dataframe_agent`)
and the `gpt-4.1` model.
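A minimal sketch of this path; the file name and question are illustrative:

```python
# Sketch: let the pandas dataframe agent answer a spreadsheet question.
import pandas as pd
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI

df = pd.read_excel("attachment.xlsx")  # illustrative file name
agent = create_pandas_dataframe_agent(
    ChatOpenAI(model="gpt-4.1", temperature=0),
    df,
    allow_dangerous_code=True,  # the agent executes generated pandas code
)
answer = agent.invoke({"input": "What is the total of the Sales column?"})
```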
**Chess Move Recommendation** ♟️ Given a chess board and the active color, this tool suggests the best move for the active color.
- **Picture analysis**: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of
the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` are used to extract the coordinates, and an arbitration is performed on their outputs.
- **Move suggestion**: the best move is suggested by a `stockfish` chess engine.
- **Move interpretation**: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
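A sketch of the move-suggestion step with `python-chess` and Stockfish; the FEN is an illustrative position, and `python-chess` is used here to render the SAN, whereas the tool described above uses `gpt-4` for that step:

```python
# Sketch: ask Stockfish for the best move from a FEN position.
import os
import chess
import chess.engine

# Position after 1.e4 e5 2.Nf3 Nc6 (illustrative).
fen = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"
board = chess.Board(fen)

engine_path = os.getenv("CHESS_ENGINE_PATH", "stockfish")  # see "How to use"
engine = chess.engine.SimpleEngine.popen_uci(engine_path)
result = engine.play(board, chess.engine.Limit(time=1.0))
print(board.san(result.move))  # best move in algebraic notation
engine.quit()
```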
**Chess Board Picture Analysis - Challenges and Limitations** 🆘
I tried both `gpt-4.1` and `gemini-2.5-flash` for chess-piece coordinate extraction, but obtained inconsistent results (sometimes
they get it right, sometimes they don't).
At least for OpenAI, there is a limitation listed on their website (see [https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):
>Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
The tool queries both models and arbitrates their results: it queries both models again, but only about the pieces with conflicting positions,
which reduces the number of objects each model has to focus on. This process is repeated a limited number of times; if conflicts remain at the end,
the answers provided by one of the models (in this case `gemini-2.5-flash`) are taken as the ground truth.
From what I observed, this approach improved the chances of correctly identifying the pieces on the board ⚠️.
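A sketch of this arbitration loop; `ask_model` is a hypothetical helper returning a `{square: piece}` map from a vision model, and the round limit is illustrative:

```python
# Sketch of the arbitration loop over piece coordinates.
MAX_ROUNDS = 3  # illustrative limit

def arbitrate(image) -> dict:
    a = ask_model("gpt-4.1", image)           # full-board pass, model A
    b = ask_model("gemini-2.5-flash", image)  # full-board pass, model B
    for _ in range(MAX_ROUNDS):
        conflicts = {sq for sq in a.keys() | b.keys() if a.get(sq) != b.get(sq)}
        if not conflicts:
            break
        # Re-query both models, but only about the conflicting squares,
        # so each model focuses on fewer objects.
        a.update(ask_model("gpt-4.1", image, squares=conflicts))
        b.update(ask_model("gemini-2.5-flash", image, squares=conflicts))
    # Remaining disagreements: gemini-2.5-flash wins in the merge below.
    return {**a, **b}
```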
**YouTube Video Analysis** 🎥 This is work in progress 🚧
So far, the agent is able to answer questions about the conversation in a YouTube video. There is no dedicated tool for this;
the assistant retrieves the transcripts using the Tavily `extract` tool.
TODO: analyze YouTube videos and answer questions about objects in the video.
## Future work and improvements 🔜
- **Evaluation**: Evaluate the agent against other questions from the GAIA validation set.
- **Large Web Extracts**: Try other chunking strategies.
- **Audio Analysis**: Use a less expensive model (like `whisper`) to get the transcripts, and fall back to a better model only if that is not enough to answer the question
and more sophisticated processing is needed for other sounds (music, barks, etc.).
- **Python File Execution**: Improve safety when executing Python code or Python files.
- **Video Analysis**: Answer questions about objects in the video.
- **Chessboard Image Analysis**: Correctly detect all pieces on a chess board image.
## How to use
Please check the `.env.example` file for the environment variables that need to be configured. If running in a Huggingface Space, you can
set them as secrets; otherwise, rename `.env.example` to `.env`.
The `CHESS_ENGINE_PATH` variable needs to be configured only when running on a Windows machine and must point
to the `stockfish` executable; otherwise, the `stockfish` installation is detected automatically.
The `LANGSMITH_*` properties need to be configured only if you want to enable observability with LangSmith.
The `SUBMISSION_MODE_ON` flag indicates whether the application runs in submission mode (the 20 questions are fetched and the answers are
submitted for agent evaluation) or in interactive mode (the agent accepts a question and an attachment).
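For reference, a hypothetical `.env` based only on the variables named above; the exact variable names (including the specific `LANGSMITH_*` keys) and the full list are in `.env.example`:

```
# Placeholders only - see .env.example for the authoritative list
CHESS_ENGINE_PATH=C:\stockfish\stockfish.exe  # Windows only
LANGSMITH_API_KEY=...                         # optional; one of the LANGSMITH_* observability settings
SUBMISSION_MODE_ON=true                       # false -> interactive question + attachment mode
```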
## References 📚
The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference