---
license: mit
---

# IflyBotVLM

## 🔥Introduction

We introduce IflyBotVLM, a general-purpose Vision-Language Model (VLM) specifically engineered for the domain of Embodied Intelligence. The primary objective of this model is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robot motion control. It achieves this by abstracting complex scene information into an "Operational Language" that is body-agnostic and transferable, thus enabling seamless perception-to-action closed-loop coordination.

The architecture of IflyBotVLM is designed to realize four critical functional capabilities in the embodied domain:

**🧠Spatial Understanding and Metric Estimation**: Equips the model to understand spatial relationships and estimate relative positions among objects in the environment.

**🎯Interactive Target Grounding**: Supports diverse grounding mechanisms, including 2D/3D object detection in the visual modality, language-based object and spatial referring, and the prediction of critical object affordance regions.

**🤖Action Abstraction and Control Parameter Generation**: Generates outputs directly relevant to the manipulation domain, providing grasp poses and manipulation trajectories.

**📋Task Planning**: Leveraging comprehension of the current scene, the model performs multi-step prediction to decompose complex tasks into a sequence of atomic skills, supporting the robust execution of long-horizon tasks (a minimal prompting sketch follows this list).
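
To make the planning capability concrete, here is a minimal sketch of how a long-horizon instruction could be decomposed into atomic skills. The prompt wording, the `parse_skill_sequence` helper, and the JSON output schema are illustrative assumptions, not IflyBotVLM's documented interface:

```python
import json

# Hypothetical illustration only: the prompt and the JSON reply schema
# below are assumptions, not IflyBotVLM's documented format.
PLAN_PROMPT = (
    "Decompose the task into a sequence of atomic skills. "
    "Answer with a JSON list of strings.\n"
    "Task: {task}"
)

def parse_skill_sequence(model_reply: str) -> list[str]:
    """Extract the atomic-skill sequence from a JSON-list reply."""
    start, end = model_reply.find("["), model_reply.rfind("]") + 1
    return json.loads(model_reply[start:end])

# The kind of decomposition the Task Planning capability targets:
reply = '["open(drawer)", "pick(apple)", "place(apple, drawer)", "close(drawer)"]'
print(parse_skill_sequence(reply))
# ['open(drawer)', 'pick(apple)', 'place(apple, drawer)', 'close(drawer)']
```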

We anticipate that IflyBotVLM will serve as an efficient and scalable foundation model, driving the advancement of embodied AI from single-task capabilities toward generalist intelligent agents.

## 🏗️Model Architecture

IflyBotVLM inherits the robust, three-stage "ViT-Projector-LLM" paradigm from established Vision-Language Models. It integrates a dedicated, incrementally pre-trained Visual Encoder with an advanced Language Model via a simple, randomly initialized MLP projector for efficient feature alignment.

The core enhancement lies in the ViT's Positional Encoding (PE) layer.

## 📊Model Performance

IflyBotVLM demonstrates superior performance across various challenging benchmarks.

IflyBotVLM-8B achieves state-of-the-art (SOTA) or near-SOTA performance on ten spatial comprehension, spatial perception, and temporal task planning benchmarks: Where2Place, Refspatial-bench, ShareRobot-affordance, ShareRobot-trajectory, BLINK(spatial), EmbSpatial, ERQA, CVBench, SAT, EgoPlan2.

## 🚀Quick Start

### Using 🤗 Transformers to Chat
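
The chat snippet itself is not included in this diff, so the following is a minimal, hypothetical sketch of talking to the model through 🤗 Transformers. The repository ID, the processor's chat template, and the generation settings are assumptions; check the model card for the authoritative usage:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "iflytek/IflyBotVLM-8B"  # hypothetical repo id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("scene.jpg")  # any RGB image of the robot's workspace
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Where should the gripper grasp the mug?"},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```

If the repository ships custom remote code, the model card may instead expose a dedicated chat helper; prefer whatever interface the card documents.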