---
title: Human In The Loop Multimodal Agent
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: true
---
# 📝 Technical Model Card: Human-in-the-Loop (HITL) Multimodal Agent

## 🚀 Overview
This project demonstrates a production-oriented Vision-Language pipeline designed to handle model uncertainty through a Human-in-the-Loop (HITL) architecture. Instead of treating the model as a standalone oracle, the system runs an active feedback loop in which users audit model outputs, and their corrections are captured to improve future performance.
## 🛠️ The Tech Stack
- **Model:** PaliGemma-3B-pt-224, a vision-language model from Google.
- **Interface:** Gradio for real-time multimodal interaction.
- **Backend:** Transformers (PyTorch) for inference.
- **Data flywheel:** Hugging Face Datasets integration for persistent storage of flagged edge cases.
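A minimal inference sketch of this stack (the model id is the real Hub checkpoint; `build_prompt` and `run_inference` are illustrative names, not the app's actual code, and the `answer en` task prefix follows one common PaliGemma prompting convention):

```python
def build_prompt(question: str) -> str:
    """Format a user question as a PaliGemma-style VQA prompt.

    The "answer en" task prefix is one convention used with PaliGemma
    pretrained checkpoints; adjust the prefix for other tasks.
    """
    return f"answer en {question.strip()}"


def run_inference(image_path: str, question: str) -> str:
    """Answer a question about an image (imports kept lazy so the
    helper above stays importable without torch/transformers)."""
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-pt-224"
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=build_prompt(question), images=image,
                       return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32)
    # Strip the prompt tokens before decoding the answer.
    answer_ids = output[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(answer_ids, skip_special_tokens=True)
```

Note that the checkpoint is gated on the Hub, so a logged-in token with access is required at load time.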
## 🧠 System Architecture
The application follows a four-stage lifecycle:

1. **Multimodal inference:** The user provides an image and a natural-language query. The model processes the visual features and text tokens jointly.
2. **Uncertainty quantification:** The system evaluates the model's confidence (in this demo, a simulated threshold check).
3. **Human audit:** Users can "Flag" responses that are incorrect, hallucinated, or low-confidence.
4. **Data persistence:** Flagged samples (the original image, the prompt, and the model's failed response) are automatically streamed to a connected Hugging Face Dataset for future supervised fine-tuning (SFT).
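Stages 2–4 of the lifecycle can be sketched as follows. The threshold value and the names (`sequence_confidence`, `make_flag_record`, `flags.jsonl`) are illustrative assumptions, not the app's actual code; a JSONL buffer is one simple append-only format that pushes cleanly to a Hugging Face Dataset later:

```python
import json
import math
import time

CONFIDENCE_THRESHOLD = 0.6  # illustrative cutoff, tune on held-out data


def sequence_confidence(token_logprobs: list[float]) -> float:
    """Collapse per-token log-probabilities into one [0, 1] score
    (geometric mean of the token probabilities)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def needs_human_review(token_logprobs: list[float]) -> bool:
    """Stage 2: route low-confidence generations to the audit queue."""
    return sequence_confidence(token_logprobs) < CONFIDENCE_THRESHOLD


def make_flag_record(image_path: str, prompt: str, response: str) -> dict:
    """Stage 3/4: package a flagged sample for persistence."""
    return {
        "image": image_path,
        "prompt": prompt,
        "response": response,
        "flagged_at": time.time(),
    }


def append_flag(record: dict, buffer_path: str = "flags.jsonl") -> None:
    """Stage 4: one JSON line per flagged sample; the buffer can be
    uploaded to the connected Dataset repo on a schedule."""
    with open(buffer_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```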
## 🎯 Key Features for Recruitment
- **Active learning foundation:** Shows an understanding of how to reduce model hallucinations in a business context.
- **Data governance:** Demonstrates secure handling of user data and automated dataset curation.
- **Scalability:** The architecture is designed to be model-agnostic; the PaliGemma backbone can be swapped for a larger model (like Idefics2) or a smaller one (like Moondream2) based on latency requirements.
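One way to make the backbone swap concrete is a small registry keyed by deployment tier. The model ids are real Hub checkpoints, but the registry itself, the tier labels, and the approximate parameter counts are illustrative assumptions:

```python
# Illustrative backbone registry; parameter counts are approximate.
BACKBONES = {
    "google/paligemma-3b-pt-224": {"params_b": 3.0, "tier": "balanced"},
    "HuggingFaceM4/idefics2-8b":  {"params_b": 8.0, "tier": "quality"},
    "vikhyatk/moondream2":        {"params_b": 1.9, "tier": "latency"},
}


def select_backbone(tier: str) -> str:
    """Pick a checkpoint by deployment tier. The rest of the pipeline
    only ever sees the returned model id, which keeps it model-agnostic."""
    for model_id, meta in BACKBONES.items():
        if meta["tier"] == tier:
            return model_id
    raise KeyError(f"no backbone registered for tier {tier!r}")
```

Because the processor and model classes are resolved from the model id via `Auto*` classes, swapping tiers needs no changes elsewhere in the pipeline.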
## 📈 Future Roadmap
- Implement LoRA (Low-Rank Adaptation) fine-tuning scripts to retrain the model monthly on the collected flagged dataset.
- Integrate an LLM-as-a-judge to automatically pre-screen flagged data before human review.
- Add support for multiple image uploads to allow comparative visual reasoning.
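The LLM-as-a-judge pre-screen could route flagged samples into queues before any human sees them. A sketch under stated assumptions: `judge` is a hypothetical callable (e.g. wrapping an API call) that scores how likely a flag is valid, and the thresholds and queue names are placeholders:

```python
from typing import Callable


def prescreen(samples: list[dict],
              judge: Callable[[dict], float],
              keep_above: float = 0.3,
              accept_above: float = 0.9) -> dict:
    """Split flagged samples into three queues by judge score:
    likely-spurious flags are discarded, clear-cut failures go straight
    into the SFT pool, and ambiguous cases go to human review."""
    queues = {"discard": [], "human_review": [], "accept_for_sft": []}
    for sample in samples:
        score = judge(sample)  # 0-1: how likely the flag is valid
        if score >= accept_above:
            queues["accept_for_sft"].append(sample)
        elif score >= keep_above:
            queues["human_review"].append(sample)
        else:
            queues["discard"].append(sample)
    return queues
```

This keeps human reviewers focused on the ambiguous middle band, which is where their labels add the most value.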
## 👨‍💻 How to Run
1. Upload an image (e.g., a photo of a room).
2. Ask: "What color is the sofa?" or "Are there any safety hazards?"
3. If the AI misses a detail, click "Flag for Human Review" to help improve the model!