Update README.md
README.md
---
base_model:
pipeline_tag: image-text-to-text
---

# User-VLM 360°

## Overview

**User-VLM 360°** is a series of personalized Vision-Language Models (VLMs) designed for social human-robot interaction. The model introduces **user-aware tuning**, which addresses the **semantic gap** arising from the misalignment between a user's query and the scene observed by the robot's camera. Unlike traditional instruction tuning, which adds latency and reduces performance, **User-VLM 360°** enables **real-time, robust adaptation** in dynamic robotic environments by inherently aligning cross-modal user representations.
This model allows for **customization of open-weight VLMs** to produce **persona**…

## Training Details

**Base Model:** User-VLM 360° is built on **PaliGemma 2**, which pairs a **SigLIP vision encoder** with **Gemma 2** as the language model.
### Fine-tuning Process:

1. **Base Model Tuning:**
   - Tuned the **MLP layer** to provide **user and scene descriptions** over **1 epoch**.
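The card does not include training code, so as a rough sketch only: step 1 above (updating just the MLP that connects the vision encoder to the language model, while everything else stays frozen) can be illustrated with toy PyTorch modules. The class, module names, and sizes below are invented stand-ins, not PaliGemma 2's real architecture or the authors' actual pipeline.

```python
import torch.nn as nn


class ToyVLM(nn.Module):
    """Toy stand-in for a PaliGemma-2-style VLM: vision encoder,
    MLP projector, and language model. All names/sizes are illustrative."""

    def __init__(self, vis_dim: int = 32, txt_dim: int = 48):
        super().__init__()
        self.vision_encoder = nn.Linear(64, vis_dim)   # stands in for SigLIP
        self.projector = nn.Sequential(                # the MLP that gets tuned
            nn.Linear(vis_dim, txt_dim),
            nn.GELU(),
            nn.Linear(txt_dim, txt_dim),
        )
        self.language_model = nn.Linear(txt_dim, 100)  # stands in for Gemma 2


def freeze_all_but_projector(model: nn.Module) -> list:
    """Freeze every parameter except the projector's; return trainable names."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("projector")
        if param.requires_grad:
            trainable.append(name)
    return trainable


model = ToyVLM()
trainable_names = freeze_all_but_projector(model)
```

With this freezing pattern, gradients and optimizer state exist only for the small projector MLP, which is what keeps a 1-epoch projector-only pass cheap relative to full instruction tuning.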