iFlyBot committed on
Commit
d2a6c1e
·
1 Parent(s): 5099798

fix README.md

Files changed (1): README.md (+11 −17)
README.md CHANGED
@@ -9,32 +9,26 @@ license: mit
 
 We introduce IflyBotVLM, a general-purpose Vision-Language Model (VLM) specifically engineered for the domain of Embodied Intelligence. The primary objective of this model is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robot motion control. It achieves this by abstracting complex scene information into an "Operational Language" that is body-agnostic and transferable, thus enabling seamless perception-to-action closed-loop coordination.
 
-The architecture of IflyBotVLM is designed to realize four critical functional capabilities in the embodied domain:
-
-**🧭 Spatial Understanding and Metric**: Provides the model with the capacity to understand spatial relationships and perform relative position estimation among objects in the environment.
-
-**🎯 Interactive Target Grounding**: Supports diverse grounding mechanisms, including 2D/3D object detection in the visual modality, language-based object and spatial referring, and the prediction of critical object affordance regions.
-
-**🤖 Action Abstraction and Control Parameter Generation**: Generates outputs directly relevant to the manipulation domain, providing grasp poses and manipulation trajectories.
-
-**📋 Task Planning**: Leveraging the current scene comprehension, this module performs multi-step prediction to decompose complex tasks into a sequence of atomic skills, fundamentally supporting the robust execution of long-horizon tasks.
+The architecture of IflyBotVLM is designed to realize four critical functional capabilities in the embodied domain:
+**🧭 Spatial Understanding and Metric**: Provides the model with the capacity to understand spatial relationships and perform relative position estimation among objects in the environment.
+**🎯 Interactive Target Grounding**: Supports diverse grounding mechanisms, including 2D/3D object detection in the visual modality, language-based object and spatial referring, and the prediction of critical object affordance regions.
+**🤖 Action Abstraction and Control Parameter Generation**: Generates outputs directly relevant to the manipulation domain, providing grasp poses and manipulation trajectories.
+**📋 Task Planning**: Leveraging the current scene comprehension, this module performs multi-step prediction to decompose complex tasks into a sequence of atomic skills, fundamentally supporting the robust execution of long-horizon tasks.
 
 We anticipate that iFlyBotVLM will serve as an efficient and scalable foundation model, driving the advancement of embodied AI from single-task capabilities toward generalist intelligent agents.
 
 
 <div style="display: flex; gap: 1em; max-width: 100%;">
-  <!-- First image: auto-scale, preserve aspect ratio, no cropping -->
-  <img
-    src="https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/radar_performance.png"
-    style="flex: 1; max-width: 50%; height: auto; object-fit: contain;"
-    alt="iFlyBotVLM Performance"
-  >
-  <!-- Second image: same as above -->
   <img
     src="https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/smart_donut_chart.png"
-    style="flex: 1; max-width: 50%; height: auto; object-fit: contain;"
+    style="flex: 1; max-width: 60%; height: auto; object-fit: contain;"
     alt="iFlyBotVLM Training Data"
   >
+  <img
+    src="https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/radar_performance.png"
+    style="flex: 1; max-width: 40%; height: auto; object-fit: contain;"
+    alt="iFlyBotVLM Performance"
+  >
 </div>
 