iFlyBot committed · Commit 9e022f5 · 1 Parent(s): b01eeba

update README.md

Files changed (1): README.md (+6 −1)
README.md CHANGED
@@ -5,14 +5,19 @@ license: mit
 
 # IflyBotVLM
 
-
 ## Introduction
 
 IflyBotVLM is an 8B open-source vision-language model (VLM) designed to serve as an embodied brain.
 
+![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/radar_performance.png)
+
 ## Model Architecture
 
+IflyBotVLM inherits the robust three-stage "ViT-Projector-LLM" paradigm from established vision-language models: a dedicated, incrementally pre-trained visual encoder is coupled to an advanced language model through a simple, randomly initialized MLP projector for efficient feature alignment.
+
+The core enhancement lies in the ViT's positional-encoding (PE) layer. Instead of relying solely on the original $448$-dimension PE, we use bicubic interpolation to upsample the learned positional embeddings from $448$ to an enriched dimension of $896$. This approach, termed Dimension-Expanded Position Embedding (DEPE), provides a more nuanced spatial-context vector for each visual token, allowing the model to capture richer positional and relative spatial information without increasing the sequence length and thereby strengthening fine-grained visual reasoning and detailed localization.
+
+![image/png](https://huggingface.co/datasets/IflyBot/IflyBotVLM-Repo/resolve/main/images/architecture.png)
 
 ## Model Performance
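
The DEPE step added in this diff can be sketched as follows. This is a minimal illustration of upsampling a positional-embedding table from 448 to 896 dimensions with bicubic interpolation, not IflyBotVLM's actual code; the function name, token count, and tensor layout are assumptions.

```python
# Illustrative sketch of Dimension-Expanded Position Embedding (DEPE):
# upsample learned ViT positional embeddings along the embedding axis
# (448 -> 896) with bicubic interpolation. Names/shapes are hypothetical.
import torch
import torch.nn.functional as F

def expand_position_embedding(pos_embed: torch.Tensor, new_dim: int = 896) -> torch.Tensor:
    """Upsample a (num_tokens, old_dim) table to (num_tokens, new_dim)."""
    num_tokens, old_dim = pos_embed.shape
    # F.interpolate's bicubic mode expects a 4D (N, C, H, W) tensor, so we
    # treat the table as a 1-channel image and resize only the width axis.
    as_image = pos_embed.unsqueeze(0).unsqueeze(0)   # (1, 1, num_tokens, old_dim)
    expanded = F.interpolate(
        as_image,
        size=(num_tokens, new_dim),
        mode="bicubic",
        align_corners=False,
    )
    return expanded.squeeze(0).squeeze(0)            # (num_tokens, new_dim)

pe = torch.randn(1024, 448)        # hypothetical ViT position table
pe2 = expand_position_embedding(pe)
print(pe2.shape)                   # torch.Size([1024, 896])
```

Note that the token count (sequence length) is left untouched; only the per-token embedding dimension grows, matching the README's claim that DEPE enriches spatial context without lengthening the sequence.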