---
title: Template Final Assignment
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

General AI Assistant 🔮

Background

Created as the final project for the Hugging Face Agents course (https://huggingface.co/learn/agents-course). It aims to answer Level 1 questions from the GAIA validation set and was tested on 20 such questions with a success rate of 90%.

What is GAIA

GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:

  • multimodal reasoning (e.g., analyzing images, audio, documents)
  • multi-hop retrieval of interdependent facts
  • Python code execution
  • a structured response format

(see https://huggingface.co/learn/agents-course/unit4/what-is-gaia).

GAIA introductory paper: "GAIA: A Benchmark for General AI Assistants".

Implementation Highlights 🛠️

The agent is implemented using the LangGraph framework.

(Agent graph diagram: img.png)

Nodes

  • Pre-processor: Initializes and prepares the state and handles the input: attached pictures are sent directly to the assistant node and to the model; other types of attachments are loaded only by the tools.
  • Assistant: The brain of the agent. Decides which tool to call. Uses gpt-4.1.
  • Tools: The invocation of a tool.
  • Optimize memory: This step summarizes and then removes all messages except the last two. If the last message is the response of a web extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
  • Response Processing: Brings the answer into the concise format required by GAIA.
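The control flow through these nodes can be sketched in plain Python. This is an illustrative stand-in only: the actual implementation uses LangGraph's graph abstractions, and the function and key names below (`run_agent`, `optimize_memory`, the `decision` dict) are assumptions, not the real code.

```python
def optimize_memory(state, keep_last=2):
    """Summarize older messages and keep only the most recent ones."""
    msgs = state["messages"]
    if len(msgs) > keep_last:
        summary = "summary of %d earlier messages" % (len(msgs) - keep_last)
        state["messages"] = [summary] + msgs[-keep_last:]
    return state

def run_agent(question, assistant, tools, max_steps=10):
    """Pre-process -> assistant -> tool -> optimize-memory loop, ending
    with response processing into a concise answer."""
    state = {"messages": [question]}               # pre-processor: initialize state
    for _ in range(max_steps):
        decision = assistant(state)                # the "brain": pick a tool or answer
        if decision["type"] == "final_answer":
            return decision["answer"].strip()      # response processing: concise format
        result = tools[decision["tool"]](decision["args"])
        state["messages"].append(result)
        state = optimize_memory(state)             # summarize, keep last 2 messages
    raise RuntimeError("agent did not converge")
```

The loop terminates either when the assistant emits a final answer or when the step budget is exhausted, mirroring the assistant/tools cycle in the graph above.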

Tasks and corresponding Tools

Web Searches 🔎 are undertaken by the Tavily search and extract tools.

  • Chunking: The content returned by the extract tool might be too large to be analyzed at once by a model (depending on the chosen model's context window size or on rate limits), so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are analyzed further.

    • Text Splitting: First a hierarchical split (LangChain's MarkdownHeaderTextSplitter), then further splitting by size with a sliding window (LangChain's RecursiveCharacterTextSplitter).
    • Embeddings: langchain_community.embeddings.OpenAIEmbeddings.
    • Vector DB: FAISS vector db.
    • Retrieval: FAISS similarity search.

    The original extract tool's response message content is updated to contain only the relevant chunks.
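The size-based splitting step can be sketched with stdlib Python. This is a minimal sketch, not the actual pipeline: the real code uses LangChain splitters, OpenAI embeddings, and a FAISS index, whereas here a word-overlap score stands in for embedding similarity, and the function names are made up for illustration.

```python
def split_sliding(text, chunk_size=500, overlap=100):
    """Split text by size with a sliding window (stand-in for
    RecursiveCharacterTextSplitter; the real code splits on markdown
    headers first)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap      # windows overlap to preserve context
    return chunks

def top_chunks(query, chunks, k=3):
    """Keep only the most relevant chunks. Word overlap stands in for the
    FAISS embedding similarity search used in the real implementation."""
    def score(chunk):
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    return sorted(chunks, key=score, reverse=True)[:k]
```

Only the chunks returned by `top_chunks` would replace the oversized extract response in the message history.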

Audio Analyzer 🔉 uses gpt-4o-audio-preview to analyze the input.

Math Problems Solver 🧮 is a subagent that uses gpt-5 equipped with the following tools:

  • Python code executor: executes a snippet of Python code provided as input
  • Think tool: used for strategic reflection on the progress of the solving process

At this point, the agent seems to prefer answering the mathematical question from the test set by invoking the Python code executor directly, and it answers the question correctly. I decided not to remove this tool yet, until I have tested the agent on other mathematical questions from the GAIA validation set.

Python Code Executor ⚙️ can run either a snippet of Python code or a given Python file. The code snippet is executed using langchain_experimental.tools.PythonREPLTool; the Python file is executed in a sub-process.
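The file-execution path can be sketched with the stdlib alone. This is a hedged sketch, not the actual tool: the function name, timeout value, and error format are assumptions; only the idea of running the file in a sub-process with the current interpreter comes from the text above.

```python
import subprocess
import sys
import tempfile
import os

def run_python_file(path, timeout=30):
    """Execute a Python file in a sub-process and capture its output.
    Capturing stderr and enforcing a timeout are basic safety measures;
    the snippet path (PythonREPLTool) is not shown here."""
    result = subprocess.run(
        [sys.executable, path],          # run with the current interpreter
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        return "error: " + result.stderr.strip()
    return result.stdout.strip()
```

A real deployment would need stronger isolation (see "Future work" below on safety).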

Spreadsheets 📊: analyzes Excel files using the pandas dataframe agent (langchain_experimental.agents.create_pandas_dataframe_agent) and the gpt-4.1 model.

Chess Move Recommendation ♟️ Given a chess board and an active color, this tool suggests the best move for the active color.

  • Picture analysis: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of the game is computed programmatically. Both gpt-4.1 and gemini-2.5-flash are used to extract the coordinates, and an arbitrage is performed on their outcomes.
  • Move suggestion: the best move is suggested by the Stockfish chess engine.
  • Move interpretation: the move is then interpreted and transcribed into algebraic notation with the help of gpt-4.
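The "FEN computed programmatically" step can be sketched as follows. This is a simplified illustration, assuming detected pieces arrive as a square-to-letter mapping; it emits only the piece-placement field plus the active color, omitting the castling, en-passant, and move-counter fields of a full FEN string.

```python
def board_to_fen(pieces, active_color="w"):
    """Compute the piece-placement part of a FEN string from detected
    coordinates. `pieces` maps squares like "e1" to piece letters like
    "K" (uppercase = white, lowercase = black)."""
    rows = []
    for rank in range(8, 0, -1):              # FEN lists ranks 8 down to 1
        row, empty = "", 0
        for file in "abcdefgh":
            piece = pieces.get(file + str(rank))
            if piece is None:
                empty += 1                     # count consecutive empty squares
            else:
                if empty:
                    row += str(empty)
                    empty = 0
                row += piece
        if empty:
            row += str(empty)
        rows.append(row)
    return "/".join(rows) + " " + active_color
```

For example, lone kings on e1 and e8 yield `4k3/8/8/8/8/8/8/4K3 w`.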

Chess Board Picture Analysis - Challenges and Limitations 🆘 I tried both gpt-4.1 and gemini-2.5-flash for chess piece coordinate extraction, but obtained inconsistent results (sometimes they get it right, sometimes they don't). At least for OpenAI, there is a limitation listed on their website (see https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations):

Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

The tool questions both models and performs an arbitrage on their results: it queries both models again, but only on pieces with conflicting positions. This process repeats a limited number of times; if conflicts remain at the end, the answers of one of the models (in this case gemini-2.5-flash) are taken as ground truth. The aim of this approach is to reduce the number of objects the model focuses on, and from what I observed it improved the chances of correctly identifying the pieces on the board ⚠️.
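The arbitrage loop can be sketched like this. The model callables are stand-ins for the gpt-4.1 and gemini-2.5-flash vision calls, and the function signature is an assumption for illustration; only the re-query-conflicts-only strategy and the "trust model B on ties" rule come from the description above.

```python
def arbitrate(model_a, model_b, squares, max_rounds=3):
    """Query two vision models for piece positions and re-query only the
    squares where they disagree. Each model maps a list of squares to a
    {square: piece} dict."""
    a, b = model_a(squares), model_b(squares)
    for _ in range(max_rounds):
        conflicts = [s for s in squares if a.get(s) != b.get(s)]
        if not conflicts:
            break
        a.update(model_a(conflicts))       # narrower focus: only disputed squares
        b.update(model_b(conflicts))
    # unresolved conflicts: model_b (gemini in the real tool) wins
    return {s: (a[s] if a.get(s) == b.get(s) else b[s]) for s in squares}
```

Shrinking the query to the disputed squares is what reduces the number of objects each model must localize at once.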

YouTube Videos Analysis 🎥 This is work in progress 🚧

So far, the agent is able to answer questions about the conversation inside a YouTube video. There is no dedicated tool for this: the assistant searches for the transcripts using the Tavily extract tool. TODO: analyze YouTube videos and answer questions about objects in the video.

Future work and improvements 🔜

  • Evaluation: Evaluate the agent against other questions from the GAIA validation set.
  • Large Web Extracts: Try other chunking strategies.
  • Audio Analysis: Use a less expensive model (like Whisper) to get the transcripts; if that is not enough to answer the question and more sophisticated processing of other sounds (music, barking, etc.) is needed, then fall back to a better model.
  • Python File Execution: Improve safety when executing Python code or Python files.
  • Video Analysis: Answer questions about objects in the video.
  • Chessboard Images Analysis: Detect correctly all pieces on a chess board image.

How to use

Please check the .env.example file for the environment variables that need to be configured. If running in a Hugging Face Space, you can set them as secrets; otherwise, rename .env.example to .env.

The CHESS_ENGINE_PATH needs to be configured only if running on a Windows machine, and it must point to the stockfish executable; otherwise the stockfish installation is detected automatically.

The LANGSMITH_* properties need to be configured only if you want to enable observability with LangSmith.

The SUBMISSION_MODE_ON flag indicates whether the application will run in submission mode (when the 20 questions are fetched and the answers are submitted for agent evaluation) or not (the agent accepts a question and an attachment).
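A hypothetical .env sketch based on the variables described above; the exact keys and values shown here are illustrative, so check .env.example in the repo for the authoritative list.

```shell
# Illustrative .env fragment -- names/values are assumptions, see .env.example
CHESS_ENGINE_PATH=C:\path\to\stockfish.exe   # Windows only; auto-detected elsewhere
LANGSMITH_TRACING=true                       # optional: LangSmith observability
SUBMISSION_MODE_ON=false                     # true = fetch the 20 questions and submit answers
```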

References 📚

The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference