# IflyBotVLM
## Introduction
IflyBotVLM is an 8B open-source vision-language model (VLM) designed to serve as an embodied brain.
## Model Architecture
IflyBotVLM inherits the robust, three-component "ViT-Projector-LLM" paradigm from established vision-language models. It integrates a dedicated, incrementally pre-trained visual encoder with an advanced language model via a simple, randomly initialized MLP projector for efficient feature alignment.
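The projector's role can be sketched as follows. This is a minimal illustration, not the released implementation: the two-layer MLP shape and the feature dimensions (1024 for the ViT, 4096 for the LLM) are assumptions for the example.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps ViT output features into the LLM embedding space (illustrative sketch)."""
    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vit_dim) -> (batch, num_tokens, llm_dim)
        return self.net(visual_tokens)

proj = MLPProjector(vit_dim=1024, llm_dim=4096)
visual_tokens = torch.randn(1, 256, 1024)   # one image's worth of ViT tokens
llm_inputs = proj(visual_tokens)            # shape: (1, 256, 4096)
```

Because the projector is randomly initialized while the encoder and language model are pre-trained, the alignment stage only has to learn this lightweight mapping.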
The core enhancement lies in the ViT's positional encoding (PE) layer. Instead of relying solely on the original $448$-dimensional PE, we employ bicubic interpolation to upsample the learned positional embeddings from $448$ to an enriched dimension of $896$. This approach, termed Dimension-Expanded Position Embedding (DEPE), provides a significantly more nuanced spatial-context vector for each visual token. The dimensional enrichment allows the model to capture more complex positional and relative spatial information without increasing the sequence length, thereby enhancing its ability to perform fine-grained visual reasoning and detailed localization.
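The upsampling step can be sketched with `torch.nn.functional.interpolate` in bicubic mode. This is a hedged illustration of the idea, not the released code: the token count (1024) and the choice of `align_corners=False` are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def expand_position_embedding(pe: torch.Tensor, new_dim: int = 896) -> torch.Tensor:
    """Upsample a learned PE table along the embedding dimension (DEPE sketch)."""
    num_tokens, old_dim = pe.shape
    # F.interpolate's bicubic mode expects a 4D (N, C, H, W) tensor, so we
    # treat the PE table as a 1-channel image and resize only the last axis.
    pe_4d = pe.unsqueeze(0).unsqueeze(0)                 # (1, 1, num_tokens, old_dim)
    expanded = F.interpolate(
        pe_4d,
        size=(num_tokens, new_dim),
        mode="bicubic",
        align_corners=False,
    )
    return expanded.squeeze(0).squeeze(0)                # (1024, 896) for the call below

pe = torch.randn(1024, 448)            # learned 448-dim PE; token count is illustrative
depe = expand_position_embedding(pe)   # enriched 896-dim PE, same number of tokens
```

Note that the sequence length (`num_tokens`) is preserved; only each token's positional vector grows, which is what keeps the attention cost unchanged.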
## Model Performance