---
title: Template Final Assignment
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---
# General AI Assistant 🔮
## Background
This agent was created as the final project for the Hugging Face Agents course (https://huggingface.co/learn/agents-course). It aims to answer Level 1 questions from the GAIA validation set. It was tested on 20 such questions with a success rate of 90%.
## What is GAIA
GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:
- multimodal reasoning (e.g., analyzing images, audio, documents)
- multi-hop retrieval of interdependent facts
- Python code execution
- a structured response format
(see https://huggingface.co/learn/agents-course/unit4/what-is-gaia and the GAIA introductory paper, "GAIA: A Benchmark for General AI Assistants").
## Implementation Highlights 🛠️
The agent is implemented using the LangGraph framework.
### Nodes
- Pre-processor: Initializes and prepares the state, and handles input: attached pictures are sent directly to the assistant node and to the model; other types of attachments are loaded only by the tools.
- Assistant: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`.
- Tools: The invocation of a tool.
- Optimize memory: This step summarizes and then removes all messages except the last two. If the last message is a web-extract response whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
- Response Processing: Brings the answer into the concise format required by GAIA.
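The node flow above can be sketched as plain functions. This is a minimal stdlib-only illustration of the memory-optimization and response-processing steps, not the actual implementation (which wires them as LangGraph nodes); the threshold, message shapes, and answer format here are placeholder assumptions.

```python
# Minimal stdlib-only sketch of two of the nodes described above.
# The real agent wires these steps as LangGraph nodes; here they are
# plain functions, and the constants are illustrative assumptions.

MEMORY_KEEP_LAST = 2          # messages kept by the optimize-memory step
EXTRACT_SIZE_THRESHOLD = 50   # illustrative size threshold (characters)

def optimize_memory(messages):
    """Summarize and drop everything except the last few messages."""
    if len(messages) <= MEMORY_KEEP_LAST:
        return messages
    summary = f"[summary of {len(messages) - MEMORY_KEEP_LAST} earlier messages]"
    kept = messages[-MEMORY_KEEP_LAST:]
    # If the newest message is an oversized web extract, keep only a
    # placeholder for its most relevant chunks (chunking not shown here).
    if len(kept[-1]) > EXTRACT_SIZE_THRESHOLD:
        kept[-1] = kept[-1][:EXTRACT_SIZE_THRESHOLD] + " [...relevant chunks only]"
    return [summary] + kept

def respond(answer):
    """Response processing: reduce to a concise GAIA-style answer."""
    return answer.strip().rstrip(".")

state = ["question", "tool call", "x" * 80, "tool call", "FINAL ANSWER: 42."]
state = optimize_memory(state)
print(respond(state[-1]))  # -> FINAL ANSWER: 42
```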
### Tasks and corresponding Tools
**Web Searches** 🔍 are handled by the Tavily search and extract tools.
- Chunking: The content returned by the `extract` tool might be too large to be analyzed at once by a model (depending on the chosen model's context window size or on its rate limits), so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are analyzed further.
  - Text Splitting: first a hierarchical split (LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (LangChain's `RecursiveCharacterTextSplitter`).
  - Embeddings: `langchain_community.embeddings.OpenAIEmbeddings`
  - Vector DB: a `FAISS` vector store
  - Retrieval: `FAISS` similarity search
- The original `extract` tool response message content is updated to contain only the relevant chunks.
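The chunk-and-retrieve step can be sketched as follows. This stdlib-only sketch substitutes a sliding window for the LangChain splitters and a naive word-overlap score for the OpenAI-embeddings/FAISS similarity search; the sizes and sample text are made up for illustration.

```python
# Stdlib-only sketch of the chunk-and-retrieve step. The real pipeline
# uses MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter,
# OpenAI embeddings, and FAISS; here a sliding window and a naive
# word-overlap score stand in for the splitter and the similarity search.

def sliding_window_chunks(text, size=100, overlap=20):
    """Split text into fixed-size chunks with an overlapping stride."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def top_k_chunks(chunks, query, k=2):
    """Rank chunks by words shared with the query (embedding stand-in)."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

page = ("The Eiffel Tower is 330 metres tall. " * 5
        + "Paris hosts many museums. " * 5)
chunks = sliding_window_chunks(page, size=80, overlap=10)
best = top_k_chunks(chunks, "How tall is the Eiffel Tower?", k=2)
print(len(chunks), best[0][:20])
```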
**Audio Analyzer** 🔉 uses `gpt-4o-audio-preview` to analyze the input.
**Math Problems Solver** 🧮 is a subagent that uses `gpt-5` equipped with the following tools:
- Python code executor: executes a snippet of Python code provided as input
- Think tool: used for strategic reflection on the progress of the solving process

At this point it looks like the agent prefers to answer the mathematical question from the test set by invoking the Python code executor directly instead. The question is answered correctly either way, so I decided not to remove this tool yet, until I test the agent on other mathematical questions from the GAIA validation set.
**Python Code Executor** ⚙️ can run either a snippet of Python code or a given Python file. The code snippet is executed using `langchain_experimental.tools.PythonREPLTool`; the Python file is executed in a sub-process.
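The file-execution path can be sketched with the standard library alone: running the file in a sub-process isolates it from the agent process. The helper name and timeout below are illustrative, not from the actual codebase (whose snippet path goes through `PythonREPLTool` instead).

```python
# Stdlib-only sketch of the file-execution path described above: running
# a Python file in a sub-process so it cannot crash the agent process.
import subprocess
import sys
import tempfile

def run_python_file(path, timeout=30):
    """Execute a Python file in a sub-process and capture its output."""
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr

# Demo: write a tiny script to a temp file and run it.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print(2 + 3)\n")
    script = f.name

code, out, err = run_python_file(script)
print(code, out.strip())  # -> 0 5
```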
**Spreadsheets** 📊 analyzes Excel files using the pandas DataFrame agent (`langchain_experimental.agents.create_pandas_dataframe_agent`) and the `gpt-4.1` model.
**Chess Move Recommendation** ♟️ Given a chess board and an active color, this tool suggests the best move for the active color.
- Picture analysis: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` are used to extract the coordinates, and an arbitrage is performed on their outcomes.
- Move suggestion: the best move is suggested by a `stockfish` chess engine.
- Move interpretation: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
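The "FEN computed programmatically" step can be sketched as follows: given a map from squares to pieces (as extracted by the vision models), build the piece-placement field of a FEN string. The function name and sample position are illustrative, not taken from the actual codebase.

```python
# Stdlib-only sketch of turning extracted piece coordinates into the
# piece-placement field of a FEN string. Square names map to FEN letters
# (uppercase = white, lowercase = black).

FILES = "abcdefgh"

def placement_to_fen(pieces):
    """pieces maps squares like 'e1' to FEN letters like 'K' or 'k'."""
    rows = []
    for rank in range(8, 0, -1):          # FEN lists rank 8 first
        row, empty = "", 0
        for file in FILES:
            piece = pieces.get(f"{file}{rank}")
            if piece is None:
                empty += 1
            else:
                if empty:
                    row += str(empty)     # flush the run of empty squares
                    empty = 0
                row += piece
        if empty:
            row += str(empty)
        rows.append(row)
    return "/".join(rows)

# A king-and-rook endgame position, for illustration.
pieces = {"e1": "K", "h1": "R", "e8": "k"}
print(placement_to_fen(pieces))  # -> 4k3/8/8/8/8/8/8/4K2R
```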
### Chess Board Picture Analysis - Challenges and Limitations 🔍
I tried both `gpt-4.1` and `gemini-2.5-flash` for chess piece coordinate extraction, but I obtained inconsistent results (there are times when they get it right, but also instances when they don't).
At least for OpenAI, there is a limitation listed on their website (see https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations):

> Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
The tool questions both models and does an arbitrage on their results: it queries both models again, but only about the pieces with conflicting positions, which reduces the number of objects each model has to focus on. This process is repeated a limited number of times. If conflicts still remain at the end, the answers provided by one of the models (in this case `gemini-2.5-flash`) are considered the ground truth.
From what I observed, this approach improved the chances of a correct identification of the pieces on the board ⚠️.
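The arbitrage loop above can be sketched in plain Python. The two "models" here are stand-in functions returning square-to-piece maps (in the real tool they are `gpt-4.1` and `gemini-2.5-flash` vision calls), and the per-square fallback is an assumption about how the ground-truth model's answer is applied.

```python
# Stdlib-only sketch of the arbitrage loop: compare two models' answers,
# re-query only the disputed squares, and after a bounded number of
# rounds fall back to one model's answer for anything still in conflict.

def conflicts(a, b):
    """Squares where the two answers disagree (including missing pieces)."""
    return {sq for sq in set(a) | set(b) if a.get(sq) != b.get(sq)}

def arbitrate(ask_a, ask_b, max_rounds=3):
    answer_a, answer_b = ask_a(None), ask_b(None)   # None = full-board query
    for _ in range(max_rounds):
        disputed = conflicts(answer_a, answer_b)
        if not disputed:
            return answer_a
        # Re-query both models, but only about the disputed squares.
        answer_a.update(ask_a(disputed))
        answer_b.update(ask_b(disputed))
    # Still conflicting: model B's answer is taken as ground truth.
    answer_a.update({sq: answer_b[sq]
                     for sq in conflicts(answer_a, answer_b) if sq in answer_b})
    return answer_a

# Stand-ins: the models persistently disagree about e2.
ask_a = lambda disputed: {"e1": "K", "e2": "P"} if disputed is None else {"e2": "P"}
ask_b = lambda disputed: {"e1": "K", "e2": "Q"} if disputed is None else {"e2": "Q"}
print(arbitrate(ask_a, ask_b))  # -> {'e1': 'K', 'e2': 'Q'}
```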
**YouTube Video Analysis** 🎥 This is work in progress 🚧.
So far, the agent is able to answer questions about the conversation inside a YouTube video. There is no dedicated tool for this: the assistant searches for the transcripts using the Tavily extract tool.

TODO: analyze YouTube videos and answer questions about objects in the video.
## Future work and improvements 🚀
- Evaluation: Evaluate the agent against other questions from the GAIA validation set.
- Large Web Extracts: Try other chunking strategies.
- Audio Analysis: Use a less expensive model to get the transcripts (such as `whisper`); if that is not enough to answer the question and more sophisticated processing is needed for other sounds (music, barking, etc.), then use a better model.
- Python File Execution: Improve safety when executing Python code or Python files.
- Video Analysis: Answer questions about objects in the video.
- Chessboard Images Analysis: Detect correctly all pieces on a chess board image.
## How to use
Please check the `.env.example` file for the environment variables that need to be configured. If running in a Hugging Face Space, you can set them as secrets; otherwise, rename `.env.example` to `.env`.

The `CHESS_ENGINE_PATH` needs to be configured only if running on a Windows machine, and it must point to the `stockfish` executable; otherwise, the `stockfish` installation is detected automatically.

The `LANGSMITH_*` properties need to be configured only if you want to enable observability with LangSmith.

The `SUBMISSION_MODE_ON` flag indicates whether the application runs in submission mode (the 20 questions are fetched and the answers are submitted for agent evaluation) or not (the agent accepts a question and an attachment).
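As a rough illustration only, a `.env` file for this agent might look like the sketch below. Apart from `CHESS_ENGINE_PATH`, `LANGSMITH_*`, and `SUBMISSION_MODE_ON`, which are documented above, the variable names are assumptions inferred from the services the agent uses; `.env.example` is the authoritative list.

```shell
# Hypothetical .env sketch -- variable names other than the three
# documented ones are assumptions based on the services used.
OPENAI_API_KEY=...                 # assumed: gpt-4.1 / gpt-4o-audio-preview
TAVILY_API_KEY=...                 # assumed: Tavily search/extract tools
GOOGLE_API_KEY=...                 # assumed: gemini-2.5-flash calls
LANGSMITH_API_KEY=...              # optional: LangSmith observability
CHESS_ENGINE_PATH=C:\stockfish\stockfish.exe   # Windows only
SUBMISSION_MODE_ON=true
```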
## References 📚
The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
