---
title: Template Final Assignment
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
# optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
hf_oauth_expiration_minutes: 480
---

# General AI Assistant 🔮

## Background
Created as the final project for the [Hugging Face Agents course](https://huggingface.co/learn/agents-course).
The agent aims to answer Level 1 questions from the **GAIA** validation set. It was tested on 20 such questions with a success rate of 90%.
### What is GAIA

GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:
- multimodal reasoning (e.g., analyzing images, audio, documents)
- multi-hop retrieval of interdependent facts
- Python code execution
- a structured response format

(see https://huggingface.co/learn/agents-course/unit4/what-is-gaia).

GAIA was introduced in the paper ["GAIA: A Benchmark for General AI Assistants"](https://arxiv.org/abs/2311.12983).


## Implementation Highlights 🛠️


**The agent** is implemented using the LangGraph framework.

![img.png](img.png)

**Nodes**

- **Pre-processor**: Initializes and prepares the state and handles the input: attached pictures are sent directly to the assistant node and on to the model;
other types of attachments are loaded only by the tools.
- **Assistant**: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`.
- **Tools**: The invocation of a tool.
- **Optimize memory**: Summarizes and then removes all messages except the last 2. If the last message is the response of a web
extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
- **Response Processing**: Brings the answer into the concise format required by GAIA.
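The memory-optimization step can be sketched roughly as follows. This is an illustrative stand-in, not the actual implementation: the message structure, the `summarize` callback, and the threshold value are all assumptions.

```python
# Illustrative sketch of the "Optimize memory" node: summarize older messages,
# keep only the last two, and flag an oversized web-extract response for chunking.
EXTRACT_SIZE_THRESHOLD = 4_000  # characters; illustrative value


def optimize_memory(messages, summarize):
    """Summarize all but the last two messages; report if the last one needs chunking."""
    older, recent = messages[:-2], messages[-2:]
    summary = summarize(older) if older else ""
    last = recent[-1] if recent else None
    needs_chunking = (
        last is not None
        and last.get("source") == "web_extract"
        and len(last.get("content", "")) > EXTRACT_SIZE_THRESHOLD
    )
    return summary, recent, needs_chunking


summary, kept, chunk_it = optimize_memory(
    [{"content": "q"}, {"content": "a"},
     {"content": "x" * 5000, "source": "web_extract"}],
    summarize=lambda msgs: f"{len(msgs)} earlier message(s) summarized",
)
# kept holds the last 2 messages; chunk_it is True for the oversized extract
```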

**Tasks and corresponding Tools**

**Web Searches** 🔎 are undertaken by the `tavily` `search` and `extract` tools.

- **Chunking**: The content returned by the `extract` tool might be too large to be analyzed at once by a model (depending on the chosen model's context window size or on rate limits),
so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are analyzed further.
    - **Text Splitting**: First a hierarchical split (LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (LangChain's `RecursiveCharacterTextSplitter`).
    - **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`.
    - **Vector DB**: `FAISS`.
    - **Retrieval**: `FAISS` similarity search.

  The original `extract` tool response message content is updated to contain only the relevant chunks.
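The splitting-and-retrieval idea can be illustrated with a minimal, dependency-free sketch. The real implementation uses LangChain's splitters, OpenAI embeddings, and FAISS; here a plain sliding window and a naive word-overlap score stand in for them, and all names and sizes are illustrative.

```python
# Sliding-window splitting with overlap, plus a naive word-overlap score as a
# stand-in for the embedding/FAISS similarity search used in the project.
def split_sliding(text, chunk_size=200, overlap=50):
    """Cut `text` into chunks of `chunk_size` characters that overlap by `overlap`."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]


def top_chunks(chunks, query, k=2):
    """Return the k chunks sharing the most words with the query."""
    query_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(query_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]
```

In the real pipeline the relevance score comes from cosine similarity over embedding vectors rather than word overlap, but the shape of the flow (split, score, keep top-k) is the same.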

**Audio Analyzer** 🔉 uses `gpt-4o-audio-preview` to analyze the input.

**Math Problems Solver** 🧮 is a subagent that uses `gpt-5` equipped with the following tools:

- **Python code executor**: executes a snippet of Python code provided as input
- **Think tool**: used for strategic reflection on the progress of the solving process

At this point the agent seems to prefer answering the mathematical question from the test set by invoking the Python code executor instead,
and it answers the question correctly. I decided not to remove this tool yet, until I test the agent on other mathematical questions from the GAIA validation set.

**Python Code Executor** ⚙️ can run either a snippet of Python code or a given Python file. The code snippet is executed using `langchain_experimental.tools.PythonREPLTool`; the Python file is executed in a sub-process.
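The sub-process approach for files can be sketched like this. The function name, timeout, and error handling are illustrative assumptions, not the project's actual code.

```python
import subprocess
import sys


def run_python_file(path, timeout=30):
    """Run a Python file in a sub-process and capture its output.

    Sketch of the sub-process execution described above: the script runs
    under the current interpreter, with its stdout/stderr captured and a
    timeout as a minimal safety measure.
    """
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        return f"Error: {result.stderr.strip()}"
    return result.stdout.strip()
```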

**Spreadsheets** 📊 analyzes `excel` files using the pandas dataframe agent `langchain_experimental.agents.create_pandas_dataframe_agent`
and the `gpt-4.1` model.

**Chess Move Recommendation** ♟️ Given a chess board and an active color, this tool suggests the best move for the active color.

- **Picture analysis**: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of
the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` are used to extract the coordinates, and an arbitrage is performed on their outcomes.
- **Move suggestion**: the best move is suggested by a `stockfish` chess engine.
- **Move interpretation**: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
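Computing the FEN programmatically from identified piece coordinates could look like the sketch below. The coordinate-dict format and the fixed castling/en-passant fields are assumptions made for illustration, not the project's actual intermediate representation.

```python
def board_to_fen(pieces, active_color="w"):
    """Build a FEN piece-placement string from piece coordinates.

    `pieces` maps (row, col) -> piece letter, with row 0 = rank 8 and
    col 0 = file a; uppercase = white, lowercase = black. This dict layout
    is an assumed intermediate format for illustration only.
    """
    ranks = []
    for row in range(8):
        fen_rank, empty = "", 0
        for col in range(8):
            piece = pieces.get((row, col))
            if piece is None:
                empty += 1
            else:
                if empty:               # flush the run of empty squares
                    fen_rank += str(empty)
                    empty = 0
                fen_rank += piece
        if empty:
            fen_rank += str(empty)
        ranks.append(fen_rank)
    # Castling and en-passant fields are left generic for a mid-game snapshot.
    return "/".join(ranks) + f" {active_color} - - 0 1"
```

For example, kings alone on e1 and e8 produce `4k3/8/8/8/8/8/8/4K3 w - - 0 1`, which can be fed directly to `stockfish`.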
  
**Chess Board Picture Analysis - Challenges and Limitations** 🆘

I tried both `gpt-4.1` and `gemini-2.5-flash` for chess piece coordinate extraction, but obtained inconsistent results (there
are times when they get it right, but also instances when they don't).
At least for OpenAI there is a limitation listed on their website (see [https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):
>Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

The tool questions both models and performs an arbitrage on their results: it queries both models further, but only on pieces with conflicting positions.
This process continues for a limited number of rounds. If conflicts remain at the end, the answers provided by one of
the models (in this case `gemini-2.5-flash`) are considered the ground truth. The aim of this approach was to reduce the number of objects each model focuses on.
From what I observed, it improved the chances of correctly identifying the pieces on the board ⚠️.
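The two-model arbitrage loop can be sketched as follows. The function names, the round limit, and the `{square: piece}` answer format are illustrative stubs, not the actual implementation.

```python
def arbitrate(ask_gpt, ask_gemini, squares, max_rounds=3):
    """Cross-check two vision models and re-query only conflicting squares.

    `ask_gpt` / `ask_gemini` take a list of squares and return a
    {square: piece} mapping. After a limited number of rounds, gemini's
    reading is used as the tie-breaker, as described above.
    """
    a, b = ask_gpt(squares), ask_gemini(squares)
    for _ in range(max_rounds):
        conflicts = [s for s in squares if a.get(s) != b.get(s)]
        if not conflicts:
            break
        # Narrow the models' focus to the disputed squares only.
        a.update(ask_gpt(conflicts))
        b.update(ask_gemini(conflicts))
    # Remaining conflicts: trust gemini's reading as ground truth.
    return {s: b.get(s, a.get(s)) for s in squares}
```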



**YouTube Video Analysis** 🎥 This is work in progress 🚧

So far, the agent is able to answer questions about the conversation inside a YouTube video. There is no dedicated tool for this;
the assistant searches for the transcripts using the `tavily extract` tool.
TODO: analyze YouTube videos and answer questions about objects in the video.




## Future work and improvements 🔜
- **Evaluation**: Evaluate the agent against other questions from the GAIA validation set.
- **Large Web Extracts**: Try other chunking strategies.
- **Audio Analysis**: Use a less expensive model to get the transcripts (like `whisper`) and, if that is not enough to answer the question and more sophisticated processing is needed
for other sounds (music, barks, etc.), only then use a better model.
- **Python File Execution**: Improve safety when executing Python code or Python files.
- **Video Analysis**: Answer questions about objects in the video.
- **Chessboard Image Analysis**: Correctly detect all pieces on a chess board image.


## How to use
Please check the `.env.example` file for the environment variables that need to be configured. If running in a Hugging Face Space, you can
set them as secrets; otherwise, rename `.env.example` to `.env`.

`CHESS_ENGINE_PATH` needs to be configured only when running on a Windows machine and must point
to the `stockfish` executable; otherwise the `stockfish` installation is detected automatically.

The `LANGSMITH_*` properties need to be configured only if you want to enable observability with LangSmith.

The `SUBMISSION_MODE_ON` flag indicates whether the application runs in submission mode (the 20 questions are fetched and the answers are
submitted for agent evaluation) or not (the agent accepts a question and an attachment).
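As an illustration only, a `.env` based on the variables mentioned above might look like the sketch below; the values are placeholders, and `.env.example` remains the authoritative list.

```shell
# Placeholder values for illustration; see .env.example for the full list.
SUBMISSION_MODE_ON=false                            # true: fetch the 20 questions and submit answers
CHESS_ENGINE_PATH=C:\tools\stockfish\stockfish.exe  # Windows only; omit elsewhere
# LANGSMITH_* variables: set only if LangSmith observability is wanted
```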

## References 📚
The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research







Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference