# IflyBotVLM
## Introduction
IflyBotVLM is an 8B open-source vision-language model (VLM) designed to serve as an embodied brain.
## Model Architecture
IflyBotVLM inherits the robust, three-component "ViT-Projector-LLM" paradigm from established vision-language models. It integrates a dedicated, incrementally pre-trained visual encoder with an advanced language model via a simple, randomly initialized MLP projector for efficient feature alignment.
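The projector's role can be sketched as follows. This is a minimal illustration, not the released implementation: the two-layer MLP shape and the feature dimensions (1024 for the ViT, 4096 for the LLM) are assumptions for the example.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps ViT output features into the LLM embedding space (illustrative sketch)."""
    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vit_dim) -> (batch, num_tokens, llm_dim)
        return self.net(visual_tokens)

proj = MLPProjector(vit_dim=1024, llm_dim=4096)
visual_tokens = torch.randn(1, 256, 1024)   # one image's worth of ViT tokens
llm_inputs = proj(visual_tokens)            # shape: (1, 256, 4096)
```

Because the projector is randomly initialized while the encoder and language model are pre-trained, the alignment stage only has to learn this lightweight mapping.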
The core enhancement lies in the ViT's positional encoding (PE) layer. Instead of relying solely on the original $448$-dimensional PE, we employ bicubic interpolation to upsample the learned positional embeddings from $448$ to an enriched dimension of $896$. This approach, termed Dimension-Expanded Position Embedding (DEPE), provides a significantly more nuanced spatial-context vector for each visual token. The dimensional enrichment allows the model to capture more complex positional and relative spatial information without increasing the sequence length, thereby enhancing its ability to perform fine-grained visual reasoning and detailed localization.
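The upsampling step can be sketched with `torch.nn.functional.interpolate` in bicubic mode. This is a hedged illustration of the idea, not the released code: the token count (1024) and the choice of `align_corners=False` are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def expand_position_embedding(pe: torch.Tensor, new_dim: int = 896) -> torch.Tensor:
    """Upsample a learned PE table along the embedding dimension (DEPE sketch)."""
    num_tokens, old_dim = pe.shape
    # F.interpolate's bicubic mode expects a 4D (N, C, H, W) tensor, so we
    # treat the PE table as a 1-channel image and resize only the last axis.
    pe_4d = pe.unsqueeze(0).unsqueeze(0)                 # (1, 1, num_tokens, old_dim)
    expanded = F.interpolate(
        pe_4d,
        size=(num_tokens, new_dim),
        mode="bicubic",
        align_corners=False,
    )
    return expanded.squeeze(0).squeeze(0)                # (1024, 896) for the call below

pe = torch.randn(1024, 448)            # learned 448-dim PE; token count is illustrative
depe = expand_position_embedding(pe)   # enriched 896-dim PE, same number of tokens
```

Note that the sequence length (`num_tokens`) is preserved; only each token's positional vector grows, which is what keeps the attention cost unchanged.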
## Model Performance