---
title: Template Final Assignment
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
# optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
hf_oauth_expiration_minutes: 480
---
# General AI Assistant 🔮
## Background
Created as a final project for the HuggingFace Agents course (https://huggingface.co/learn/agents-course).
It aims to answer Level 1 questions from the **GAIA** validation set and was tested on 20 such questions with a success rate of 90%.
### What is GAIA
GAIA is a benchmark for evaluating AI assistants on real-world tasks that involve:
- multimodal reasoning (e.g., analyzing images, audio, documents)
- multi-hop retrieval of interdependent facts
- Python code execution
- a structured response format
(see https://huggingface.co/learn/agents-course/unit4/what-is-gaia).
GAIA introductory paper: ["GAIA: A Benchmark for General AI Assistants"](https://arxiv.org/abs/2311.12983).
## Implementation Highlights ๐Ÿ› ๏ธ
**The agent** is implemented using the LangGraph framework.
![img.png](img.png)
**Nodes**
- **Pre-processor**: Initializes and prepares the state, and handles the input: attached pictures are sent directly to the assistant node and to the model;
other types of attachments are loaded only by the tools.
- **Assistant**: The brain of the agent. Decides which tool to call. Uses `gpt-4.1`.
- **Tools**: Invokes the tool chosen by the assistant.
- **Optimize memory**: Summarizes and then removes all messages except the last two. If the last message is the response of a web
extract whose size exceeds a threshold, it is chunked and replaced with only the most relevant chunks.
- **Response Processing**: Brings the answer into the concise format required by GAIA.
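For illustration, here is a minimal sketch of how such a graph could be wired in LangGraph. The node bodies are hypothetical stubs and the routing is an assumption based on the descriptions above (the actual graph is shown in the image), not the real implementation:

```python
# Minimal LangGraph wiring sketch; node bodies are hypothetical stubs.
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]

# Stubs standing in for the real node implementations described above.
def pre_processor(state: AgentState): return {}     # init state, route attachments
def assistant(state: AgentState): return {}         # gpt-4.1 picks the next tool
def optimize_memory(state: AgentState): return {}   # summarize / prune messages
def process_response(state: AgentState): return {}  # GAIA-style concise answer

builder = StateGraph(AgentState)
builder.add_node("pre_processor", pre_processor)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode([]))             # the real tool list goes here
builder.add_node("optimize_memory", optimize_memory)
builder.add_node("process_response", process_response)

builder.add_edge(START, "pre_processor")
builder.add_edge("pre_processor", "assistant")
# Tool call emitted -> run the tool; otherwise format the final answer.
builder.add_conditional_edges("assistant", tools_condition,
                              {"tools": "tools", END: "process_response"})
builder.add_edge("tools", "optimize_memory")
builder.add_edge("optimize_memory", "assistant")
builder.add_edge("process_response", END)
app = builder.compile()
```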
**Tasks and corresponding Tools**
**Web Searches** 🔎 are handled by the Tavily `search` and `extract` tools.
- **Chunking**: The content returned by the `extract` tool might be too large to analyze at once with a model (depending on the chosen model's context window size or on rate limits),
so if its size exceeds a pre-configured threshold, it is chunked and only the most relevant chunks are analyzed further.
- **Text Splitting**: First a hierarchical split (LangChain's `MarkdownHeaderTextSplitter`), then further splitting by size with a sliding window (LangChain's `RecursiveCharacterTextSplitter`).
- **Embeddings**: `langchain_community.embeddings.OpenAIEmbeddings`.
- **Vector DB**: `FAISS` vector db.
- **Retrieval**: `FAISS` similarity search.
The original `extract` tool response message content is updated to contain only the relevant chunks; a sketch of this pipeline follows below.
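A minimal sketch of the chunk-and-retrieve step under the assumptions above; the threshold, chunk sizes, and helper name are illustrative, not the actual configuration:

```python
# Sketch of the chunk-and-retrieve step for oversized web extracts.
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

MAX_EXTRACT_CHARS = 20_000  # illustrative threshold

def relevant_chunks(extract_text: str, question: str, k: int = 4) -> str:
    if len(extract_text) <= MAX_EXTRACT_CHARS:
        return extract_text  # small enough to analyze as-is
    # 1) Hierarchical split on markdown headers...
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    sections = header_splitter.split_text(extract_text)
    # 2) ...then size-based splitting with a sliding (overlapping) window.
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    chunks = splitter.split_documents(sections)
    # 3) Embed the chunks, index them in FAISS, keep only the top matches.
    store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    best = store.similarity_search(question, k=k)
    return "\n\n".join(doc.page_content for doc in best)
```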
**Audio Analyzer** 🔉 uses `gpt-4o-audio-preview` to analyze the input.
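A sketch of how an audio attachment can be passed to `gpt-4o-audio-preview` with the OpenAI SDK; the file name and question are illustrative:

```python
# Sketch: answer a question about an audio attachment with gpt-4o-audio-preview.
import base64
from openai import OpenAI

client = OpenAI()
with open("attachment.mp3", "rb") as f:  # illustrative file name
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is said in this recording?"},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "mp3"}},
        ],
    }],
)
print(response.choices[0].message.content)
```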
**Math Problems Solver** 🧮 is a subagent that uses `gpt-5` equipped with the following tools:
- **Python code executor**: executes a snippet of Python code provided as input
- **Think tool**: used for strategic reflection on the progress of the solving process
At this point, the agent seems to prefer answering the mathematical question from the test set by invoking the Python code executor directly instead of this subagent.
The question is answered correctly either way. I decided not to remove this tool yet, until I test the agent on other mathematical questions from the GAIA validation set.
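A minimal sketch of such a subagent, assuming a hypothetical `think` tool body and an illustrative question; the actual wiring may differ:

```python
# Sketch of the math subagent: a ReAct-style agent with a code executor
# and a "think" tool. The think tool body is a hypothetical stand-in.
from langchain_core.tools import tool
from langchain_experimental.tools import PythonREPLTool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def think(reflection: str) -> str:
    """Record a strategic reflection on the progress of the solving process."""
    return reflection  # the value exists only for the agent to re-read

math_agent = create_react_agent(
    ChatOpenAI(model="gpt-5"),
    tools=[PythonREPLTool(), think],
)
result = math_agent.invoke(
    {"messages": [("user", "What is the sum of the first 100 primes?")]}
)
```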
**Python Code Executor** ⚙️ can run either a snippet of Python code or a given Python file. Code snippets are executed using `langchain_experimental.tools.PythonREPLTool`; Python files are executed in a subprocess.
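A sketch of the two execution paths; the timeout value and helper name are illustrative assumptions:

```python
# Snippet path: LangChain's REPL tool. File path: a separate interpreter process.
import subprocess
import sys

from langchain_experimental.tools import PythonREPLTool

repl = PythonREPLTool()
print(repl.run("print(2 ** 10)"))  # snippet path -> "1024"

def run_python_file(path: str, timeout: int = 60) -> str:
    # File path: executed in a subprocess, isolated from the agent process.
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr
```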
**Spreadsheets** 📊 are analyzed as `excel` files using the pandas dataframe agent (`langchain_experimental.agents.create_pandas_dataframe_agent`)
and the `gpt-4.1` model.
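A minimal sketch of this path; the file name and question are illustrative:

```python
# Sketch: let the pandas dataframe agent answer a spreadsheet question.
import pandas as pd
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI

df = pd.read_excel("attachment.xlsx")  # illustrative file name
agent = create_pandas_dataframe_agent(
    ChatOpenAI(model="gpt-4.1", temperature=0),
    df,
    allow_dangerous_code=True,  # the agent executes generated pandas code
)
answer = agent.invoke({"input": "What is the total of the Sales column?"})
```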
**Chess Move Recommendation** ♟️ Given a chess board and the active color, this tool suggests the best move for the active color.
- **Picture analysis**: identifies the location of each chess piece on the board. Once the coordinates are identified, the FEN of
the game is computed programmatically. Both `gpt-4.1` and `gemini-2.5-flash` are used to extract the coordinates, and an arbitration is performed on their outputs.
- **Move suggestion**: the best move is suggested by a `stockfish` chess engine.
- **Move interpretation**: the move is then interpreted and transcribed into algebraic notation with the help of `gpt-4`.
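A sketch of the move-suggestion step with `python-chess` and Stockfish; the FEN is an illustrative position, and `python-chess` is used here to render the SAN, whereas the tool described above uses `gpt-4` for that step:

```python
# Sketch: ask Stockfish for the best move from a FEN position.
import os
import chess
import chess.engine

# Position after 1.e4 e5 2.Nf3 Nc6 (illustrative).
fen = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"
board = chess.Board(fen)

engine_path = os.getenv("CHESS_ENGINE_PATH", "stockfish")  # see "How to use"
engine = chess.engine.SimpleEngine.popen_uci(engine_path)
result = engine.play(board, chess.engine.Limit(time=1.0))
print(board.san(result.move))  # best move in algebraic notation
engine.quit()
```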
**Chess Board Picture Analysis - Challenges and Limitations** 🆘
I tried both `gpt-4.1` and `gemini-2.5-flash` for chess-piece coordinate extraction, but obtained inconsistent results (sometimes
they get it right, sometimes they don't).
At least for OpenAI, there is a limitation listed on their website (see [https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations](https://platform.openai.com/docs/guides/images-vision?api-mode=responses#limitations)):
>Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
The tool queries both models and arbitrates their results: it queries both models again, but only about the pieces with conflicting positions,
which reduces the number of objects each model has to focus on. This process is repeated a limited number of times; if conflicts remain at the end,
the answers provided by one of the models (in this case `gemini-2.5-flash`) are taken as the ground truth.
From what I observed, this approach improved the chances of correctly identifying the pieces on the board ⚠️.
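A sketch of this arbitration loop; `ask_model` is a hypothetical helper returning a `{square: piece}` map from a vision model, and the round limit is illustrative:

```python
# Sketch of the arbitration loop over piece coordinates.
MAX_ROUNDS = 3  # illustrative limit

def arbitrate(image) -> dict:
    a = ask_model("gpt-4.1", image)           # full-board pass, model A
    b = ask_model("gemini-2.5-flash", image)  # full-board pass, model B
    for _ in range(MAX_ROUNDS):
        conflicts = {sq for sq in a.keys() | b.keys() if a.get(sq) != b.get(sq)}
        if not conflicts:
            break
        # Re-query both models, but only about the conflicting squares,
        # so each model focuses on fewer objects.
        a.update(ask_model("gpt-4.1", image, squares=conflicts))
        b.update(ask_model("gemini-2.5-flash", image, squares=conflicts))
    # Remaining disagreements: gemini-2.5-flash wins in the merge below.
    return {**a, **b}
```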
**YouTube Video Analysis** 🎥 This is work in progress 🚧
So far, the agent is able to answer questions about the conversation in a YouTube video. There is no dedicated tool for this;
the assistant retrieves the transcripts using the Tavily `extract` tool.
TODO: analyze YouTube videos and answer questions about objects in the video.
## Future work and improvements 🔜
- **Evaluation**: Evaluate the agent against other questions from the GAIA validation set.
- **Large Web Extracts**: Try other chunking strategies.
- **Audio Analysis**: Use a less expensive model (like `whisper`) to get the transcripts, and fall back to a better model only if that is not enough to answer the question
and more sophisticated processing is needed for other sounds (music, barks, etc.).
- **Python File Execution**: Improve safety when executing Python code or Python files.
- **Video Analysis**: Answer questions about objects in the video.
- **Chessboard Image Analysis**: Correctly detect all pieces on a chess board image.
## How to use
Please check the `.env.example` file for the environment variables that need to be configured. If running in a Huggingface Space, you can
set them as secrets; otherwise, rename `.env.example` to `.env`.
The `CHESS_ENGINE_PATH` variable needs to be configured only when running on a Windows machine and must point
to the `stockfish` executable; otherwise, the `stockfish` installation is detected automatically.
The `LANGSMITH_*` properties need to be configured only if you want to enable observability with LangSmith.
The `SUBMISSION_MODE_ON` flag indicates whether the application runs in submission mode (the 20 questions are fetched and the answers are
submitted for agent evaluation) or in interactive mode (the agent accepts a question and an attachment).
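For reference, a hypothetical `.env` based only on the variables named above; the exact variable names (including the specific `LANGSMITH_*` keys) and the full list are in `.env.example`:

```
# Placeholders only - see .env.example for the authoritative list
CHESS_ENGINE_PATH=C:\stockfish\stockfish.exe  # Windows only
LANGSMITH_API_KEY=...                         # optional; one of the LANGSMITH_* observability settings
SUBMISSION_MODE_ON=true                       # false -> interactive question + attachment mode
```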
## References 📚
The math tool implementation was inspired by this repo: https://github.com/langchain-ai/open_deep_research
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference