iFlyBot committed
Commit edf7793 · 1 Parent(s): d2a6c1e

fix README.md

Files changed (1): README.md (+10 -10)
README.md CHANGED
@@ -3,13 +3,13 @@ license: mit
 ---
 
 
-# IflyBotVLM
+# iFlyBotVLM
 
 ## 🔥Introduction
 
-We introduce IflyBotVLM, a general-purpose Vision-Language Model (VLM) specifically engineered for the domain of Embodied Intelligence. The primary objective of this model is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robot motion control. It achieves this by abstracting complex scene information into an "Operational Language" that is body-agnostic and transferable, thus enabling seamless perception-to-action closed-loop coordination.
+We introduce iFlyBotVLM, a general-purpose Vision-Language Model (VLM) specifically engineered for the domain of Embodied Intelligence. The primary objective of this model is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robot motion control. It achieves this by abstracting complex scene information into an "Operational Language" that is body-agnostic and transferable, thus enabling seamless perception-to-action closed-loop coordination.
 
-The architecture of IflyBotVLM is designed to realize four critical functional capabilities in the embodied domain:
+The architecture of iFlyBotVLM is designed to realize four critical functional capabilities in the embodied domain:
 **🧭Spatial Understanding and Metric**: Provides the model with the capacity to understand spatial relationships and perform relative position estimation among objects in the environment.
 **🎯Interactive Target Grounding**: Supports diverse grounding mechanisms, including 2D/3D object detection in the visual modality, language-based object and spatial referring, and the prediction of critical object affordance regions.
 **🤖Action Abstraction and Control Parameter Generation**: Generates outputs directly relevant to the manipulation domain, providing grasp poses and manipulation trajectories.
@@ -20,12 +20,12 @@ We anticipate that iFlyBotVLM will serve as an efficient and scalable foundation
 
 <div style="display: flex; gap: 1em; max-width: 100%;">
   <img
-    src="https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/smart_donut_chart.png"
+    src="https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/smart_donut_chart.png"
     style="flex: 1; max-width: 60%; height: auto; object-fit: contain;"
     alt="iFlyBotVLM Training Data"
   >
   <img
-    src="https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/radar_performance.png"
+    src="https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/radar_performance.png"
     style="flex: 1; max-width: 40%; height: auto; object-fit: contain;"
     alt="iFlyBotVLM Performance"
   >
@@ -38,15 +38,15 @@ iFlyBotVLM inherits the robust, three-stage "ViT-Projector-LLM" paradigm from es
 
 The core enhancement lies in the ViT's Positional Encoding (PE) layer. Instead of relying solely on the original 448-dimensional PE, we employ bicubic interpolation to upsample the learned positional embeddings from 448 to an enriched dimension of 896. This approach, termed Dimension-Expanded Position Embedding (DEPE), provides a significantly more nuanced spatial context vector for each visual token. This dimensional enrichment allows the model to capture more complex positional and relative spatial information without increasing the sequence length, thereby enhancing the model's ability to perform fine-grained visual reasoning and detailed localization tasks.
 
-![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/architecture.png)
+![image/png](https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/architecture.png)
 
 ## 📊Model Performance
 
 iFlyBotVLM demonstrates superior performance across various challenging benchmarks.
 
-![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/benchmark_performance.png)
+![image/png](https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/benchmark_performance.png)
 
-![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/table-performances.png)
+![image/png](https://huggingface.co/datasets/iFlyBot/iFlyBotVLM-Repo/resolve/main/images/table-performances.png)
 
 iFlyBotVLM-8B achieves state-of-the-art (SOTA) or near-SOTA performance on ten spatial comprehension, spatial perception, and temporal task planning benchmarks: Where2Place, RefSpatial-Bench, ShareRobot-affordance, ShareRobot-trajectory, BLINK (spatial), EmbSpatial, ERQA, CVBench, SAT, and EgoPlan2.
@@ -177,7 +177,7 @@ class IflyRoboInference:
 
 
 def test_spatial_from_blink():
-    hf_path = "IflyBot/IflyBotVLM"
+    hf_path = "iFlyBot/iFlyBotVLM"
     ifly_robo_infer = IflyRoboInference(hf_path)
     question = {
         "idx": "val_Spatial_Relation_143",
@@ -191,7 +191,7 @@ def test_spatial_from_blink():
 
 
 def test_visual_correspondence_from_blink():
-    hf_path = "IflyBot/IflyBotVLM"
+    hf_path = "iFlyBot/iFlyBotVLM"
    ifly_robo_infer = IflyRoboInference(hf_path)
     question = {
         "idx": "val_Visual_Correspondence_1",
 