iFlyBot committed
Commit b62a609 · 1 Parent(s): cf676a5

update README.md

Files changed (1)
  1. README.md +17 -5
README.md CHANGED
@@ -19,13 +19,25 @@ The architecture of IflyBotVLM is designed to realize four critical functional c

**📋Task Planning**: Leveraging the current scene comprehension, this module performs multi-step prediction to decompose complex tasks into a sequence of atomic skills, fundamentally supporting the robust execution of long-horizon tasks.

- We anticipate that IflyBotVLM will serve as an efficient and scalable foundation model, driving the advancement of embodied AI from single-task capabilities toward generalist intelligent agents.
+ We anticipate that iFlyBotVLM will serve as an efficient and scalable foundation model, driving the advancement of embodied AI from single-task capabilities toward generalist intelligent agents.
+
+ <div style="display: flex; gap: 20px; height: 300px;"> <!-- Uniform container height (optional) -->
+ <img
+ src="https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/radar_performance.png"
+ style="flex: 1; max-width: 50%; height: 100%; object-fit: cover;"
+ alt="iFlyBotVLM Performance"
+ >
+ <img
+ src="https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/smart_donut_chart.png"
+ style="flex: 1; max-width: 50%; height: 100%; object-fit: cover;"
+ alt="iFlyBotVLM Training Data"
+ >
+ </div>

- ![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/radar_performance.png)

## 🏗️Model Architecture

- IflyBotVLM inherits the robust, three-stage "ViT-Projector-LLM" paradigm from established Vision-Language Models. It integrates a dedicated, incrementally pre-trained Visual Encoder with an advanced Language Model via a simple, randomly initialized MLP projector for efficient feature alignment.
+ iFlyBotVLM inherits the robust, three-stage "ViT-Projector-LLM" paradigm from established Vision-Language Models. It integrates a dedicated, incrementally pre-trained Visual Encoder with an advanced Language Model via a simple, randomly initialized MLP projector for efficient feature alignment.

The core enhancement lies in the ViT's Positional Encoding (PE) layer. Instead of relying solely on the original 448-dimension PE, we employ bicubic interpolation to upsample the learned positional embeddings from 448 to an enriched dimension of 896. This approach, termed Dimension-Expanded Position Embedding (DEPE), provides a significantly more nuanced spatial context vector for each visual token. This dimensional enrichment allows the model to capture more complex positional and relative spatial information without increasing the sequence length, thereby enhancing the model's ability to perform fine-grained visual reasoning and detailed localization tasks.

@@ -33,13 +45,13 @@ The core enhancement lies in the ViT's Positional Encoding (PE) layer. Instead o

## 📊Model Performance

- IflyBotVLM demonstrates superior performance across various challenging benchmarks.
+ iFlyBotVLM demonstrates superior performance across various challenging benchmarks.

![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/benchmark_performance.png)

![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/table-performances.png)

- IflyBotVLM-8B achieves state-of-the-art (SOTA) or near-SOTA performance on ten spatial comprehension, spatial perception, and temporal task planning benchmarks: Where2Place, Refspatial-bench, ShareRobot-affordance, ShareRobot-trajectory, BLINK (spatial), EmbSpatial, ERQA, CVBench, SAT, EgoPlan2.
+ iFlyBotVLM-8B achieves state-of-the-art (SOTA) or near-SOTA performance on ten spatial comprehension, spatial perception, and temporal task planning benchmarks: Where2Place, Refspatial-bench, ShareRobot-affordance, ShareRobot-trajectory, BLINK (spatial), EmbSpatial, ERQA, CVBench, SAT, EgoPlan2.

## 🚀Quick Start
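
As an illustration of the three-stage "ViT-Projector-LLM" paradigm described in the diff above, here is a minimal PyTorch sketch of how a randomly initialized MLP projector aligns visual features with an LLM's embedding space. The class name, layer count, and hidden sizes are assumptions for illustration, not iFlyBotVLM's released implementation.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Hypothetical projector: maps ViT patch features into the LLM's
    embedding space. Sizes and depth are illustrative assumptions."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Randomly initialized two-layer MLP, trained during feature alignment.
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, vit_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(visual_tokens)

# Usage: project ViT features, then prepend them to the text embeddings
# that feed the language model.
vit_features = torch.randn(1, 256, 1024)       # hypothetical ViT output
visual_embeds = MLPProjector()(vit_features)   # aligned to the LLM width
text_embeds = torch.randn(1, 32, 4096)         # hypothetical text embeddings
llm_inputs = torch.cat([visual_embeds, text_embeds], dim=1)
```

Keeping the projector a simple MLP means only a small number of parameters must be trained from scratch; the pre-trained encoder and language model can stay frozen or be tuned lightly during alignment.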
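
The DEPE paragraph is the most technical claim in this README: learned positional embeddings are bicubically upsampled from 448 to 896 while the visual sequence length stays fixed. Read literally, one way to realize that is to resize only the channel axis of the PE table, as in the sketch below. This is a speculative reconstruction from the README's wording alone, assuming a (num_positions, 448) table and `torch.nn.functional.interpolate`; the released code may implement DEPE differently.

```python
import torch
import torch.nn.functional as F

def dimension_expanded_pe(pos_emb: torch.Tensor, new_dim: int = 896) -> torch.Tensor:
    """Speculative DEPE sketch: bicubically upsample the channel axis of a
    learned position-embedding table from old_dim (e.g. 448) to new_dim
    (e.g. 896), keeping the number of positions -- and therefore the
    visual sequence length -- unchanged.

    pos_emb: (num_positions, old_dim)  ->  (num_positions, new_dim)
    """
    num_pos, _ = pos_emb.shape
    # Treat the table as a 1-channel 2-D image: (1, 1, num_pos, old_dim).
    table = pos_emb.unsqueeze(0).unsqueeze(0)
    # Bicubic resize of the channel axis only; positions are left untouched.
    table = F.interpolate(table, size=(num_pos, new_dim),
                          mode="bicubic", align_corners=False)
    return table.squeeze(0).squeeze(0)

pe_448 = torch.randn(1024, 448)           # hypothetical learned PE table
pe_896 = dimension_expanded_pe(pe_448)    # same 1024 positions, 896-dim each
print(pe_896.shape)                       # torch.Size([1024, 896])
```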