---
title: Human In The Loop Multimodal Agent
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: true
---

# 📝 Technical Model Card: Human-in-the-Loop (HITL) Multimodal Agent

## 🚀 Overview

This project demonstrates a production-ready Vision-Language pipeline designed to handle model uncertainty through a Human-in-the-Loop (HITL) architecture. Instead of treating the AI as a standalone oracle, this system implements an active feedback loop where users audit model outputs to improve future performance.

## 🛠️ The Tech Stack

- **Model:** PaliGemma-3B-pt-224, a vision-language model by Google.
- **Interface:** Gradio for real-time multimodal interaction.
- **Backend:** Transformers (PyTorch) for inference.
- **Data Flywheel:** Hugging Face Datasets integration for persistent storage of "flagged" edge cases.

## 🧠 System Architecture

The application follows a four-stage lifecycle:

1. **Multimodal Inference:** The user provides an image and a natural language query. The model processes the visual features and text tokens simultaneously.
2. **Uncertainty Quantification:** The system evaluates the model's confidence (in this demo, a threshold check is simulated).
3. **Human Audit:** Users can "Flag" responses that are incorrect, hallucinated, or low-confidence.
4. **Data Persistence:** Flagged samples, including the original image, the prompt, and the AI's failed response, are automatically streamed to a connected Hugging Face Dataset for future supervised fine-tuning (SFT).
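Stages 2–4 of the lifecycle above can be sketched as a small gating helper. This is a minimal sketch, not the app's actual code: the 0.75 threshold, the per-token log-prob interface, and the record schema are all illustrative assumptions.

```python
import math

# Hypothetical cut-off; a real deployment would tune this on validation data.
CONFIDENCE_THRESHOLD = 0.75

def mean_token_confidence(token_logprobs):
    """Stage 2: average per-token probability from the model's log-probs."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def needs_human_audit(token_logprobs, threshold=CONFIDENCE_THRESHOLD):
    """Stage 3 gate: route low-confidence answers to a human reviewer."""
    return mean_token_confidence(token_logprobs) < threshold

def build_flag_record(image_path, prompt, response, token_logprobs):
    """Stage 4: package a flagged sample for the SFT dataset (assumed schema)."""
    return {
        "image": image_path,
        "prompt": prompt,
        "response": response,
        "confidence": mean_token_confidence(token_logprobs),
    }
```

In the live Space, a record like this would be pushed to the connected Hugging Face Dataset each time the user clicks "Flag for Human Review".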

## 🎯 Key Features for Recruitment

- **Active Learning Foundation:** Shows an understanding of how to reduce model hallucinations in a business context.
- **Data Governance:** Demonstrates secure handling of user data and automated dataset curation.
- **Scalability:** The architecture is designed to be model-agnostic; the PaliGemma backbone can be swapped for larger models (like Idefics2) or smaller ones (like Moondream2) based on latency requirements.
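One way the model-agnostic swap could look is a small backbone registry. The registry itself is an illustrative assumption; the three Hub checkpoint ids are real public repositories mentioned above, but the aliases and comments are hypothetical.

```python
# Hypothetical backbone registry illustrating the model-agnostic design.
BACKBONES = {
    "paligemma": "google/paligemma-3b-pt-224",  # default backbone
    "idefics2": "HuggingFaceM4/idefics2-8b",    # larger, higher quality
    "moondream2": "vikhyatk/moondream2",        # smaller, lower latency
}

def resolve_backbone(name: str) -> str:
    """Map a short alias to a Hub checkpoint id, failing loudly on typos."""
    try:
        return BACKBONES[name]
    except KeyError:
        raise ValueError(
            f"Unknown backbone {name!r}; choose from {sorted(BACKBONES)}"
        )
```

Swapping models then becomes a one-line config change rather than a code rewrite.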

## 📈 Future Roadmap

- Implement LoRA (Low-Rank Adaptation) fine-tuning scripts to retrain the model monthly on the collected "flagged" dataset.
- Integrate LLM-as-a-judge to automatically pre-screen flagged data before human review.
- Add support for multiple image uploads to allow for comparative visual reasoning.
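The LLM-as-a-judge pre-screen from the roadmap could be sketched as a filter over flagged records. Everything here is an assumption for illustration: in production the `judge` callable would wrap an LLM call; the stub below just uses response length as a crude stand-in.

```python
def prescreen_flags(records, judge, min_severity=0.5):
    """Keep only flagged samples the judge scores as genuine failures.

    `judge` is any callable returning a severity score in [0, 1].
    """
    return [record for record in records if judge(record) >= min_severity]

def stub_judge(record):
    """Hypothetical stand-in judge: treat one-word answers as likely failures."""
    return 1.0 if len(record["response"].split()) <= 1 else 0.2
```

Human reviewers would then see only the records that survive the pre-screen, cutting audit load before each fine-tuning round.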

## 👨‍💻 How to Run

1. Upload an image (e.g., a photo of a room).
2. Ask: "What color is the sofa?" or "Are there any safety hazards?"
3. If the AI misses a detail, click "Flag for Human Review" to help improve the model!