Improve model card for Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
This PR significantly enhances the model card for the `Don't Blind Your VLA` LoRA adapter.
Key improvements include:
* Adding `robotics` as the `pipeline_tag` to correctly categorize the model for its domain.
* Specifying the `apache-2.0` license.
* Populating the model card content with detailed descriptions from the paper abstract and GitHub README.
* Adding direct links to the Hugging Face paper page, the project page, the GitHub repository, and the Hugging Face collection.
* Including the installation instructions and relevant code snippets from the GitHub README for Visual Representation Alignment integration and LoRA fine-tuning.
* Adding the academic citation and acknowledgements.
* Adding the primary image from the GitHub README for visual context.
These updates make the model card more comprehensive, discoverable, and user-friendly for the community.

---
base_model: openvla/openvla-7b
library_name: peft
pipeline_tag: robotics
license: apache-2.0
---

# Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

This model is a LoRA adapter presented in the paper [Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization](https://huggingface.co/papers/2510.25616).

To address the degradation of visual-language (VL) representations during VLA supervised fine-tuning (SFT), we introduce **Visual Representation Alignment**. During SFT, we pull a VLA’s visual tokens toward a frozen teacher’s patch features using cosine similarity through a lightweight frozen projector. This keeps perception anchored while the model learns to act — improving OOD generalization with almost no added cost.
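
Concretely, writing $h_i$ for the VLA's visual-token hidden states at a chosen layer, $\phi$ for the frozen projector into the teacher's feature space, and $t_i$ for the teacher's patch features, the alignment objective is the negative mean cosine similarity (notation ours; it mirrors the code snippet below):

$$
\mathcal{L}_{\text{align}} = -\frac{1}{N}\sum_{i=1}^{N} \cos\!\left(\phi(h_i),\; t_i\right),
\qquad
\mathcal{L} = \mathcal{L}_{\text{SFT}} + \lambda \, \mathcal{L}_{\text{align}},
$$

where $\lambda$ is the alignment weight (`align_coeff` in the fine-tuning config).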

<img width="5567" height="4133" alt="method_1" src="https://huggingface.co/CognitiveAISystems/BlindVLA/resolve/main/figs/method_1.png" />

## Model Details

### Model Description

The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. This work systematically studies representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, the authors probe VLA's hidden representations and analyze attention maps, and design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. They introduce a simple yet effective method, **Visual Representation Alignment**, that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios.

The paper also introduces the **VL-Think Task Suite**, a diagnostic suite assessing the transfer of VL understanding and knowledge from VLMs to VLAs independently of low-level control. This suite focuses on whether models retain the ability to interpret visual symbols, compositional cues, and categorical relations rather than pure manipulation skills.

- **Developed by:** Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
- **Model type:** Vision-Language-Action (VLA) model (LoRA adapter)
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from model:** [openvla/openvla-7b](https://huggingface.co/openvla/openvla-7b)

### Model Sources

- **Repository:** [https://github.com/CognitiveAISystems/BlindVLA](https://github.com/CognitiveAISystems/BlindVLA)
- **Paper:** [https://huggingface.co/papers/2510.25616](https://huggingface.co/papers/2510.25616)
- **Project Page:** [https://blind-vla-paper.github.io/](https://blind-vla-paper.github.io/)
- **Hugging Face Collection:** [https://huggingface.co/collections/tttonyalpha/dont-blind-your-vla](https://huggingface.co/collections/tttonyalpha/dont-blind-your-vla)

## Uses

### Direct Use

This model is intended for research in Vision-Language-Action (VLA) models, particularly for understanding and improving out-of-distribution (OOD) generalization in robotic and agent control tasks through visual representation alignment. Researchers can use this adapter and methodology to fine-tune base VLA models and explore the impact of representation degradation.

### Out-of-Scope Use

As a research artifact, this model is not intended for deployment in real-world, safety-critical applications without further rigorous testing, validation, and adaptation. It focuses on studying and mitigating specific representation issues in VLAs rather than serving as a production-ready agent.

## How to Get Started with the Model

### Installation

Use the environment setup commands below to get started:

```bash
# Create and activate conda environment
conda create -n blindvla python=3.10 -y
conda activate blindvla

# Install PyTorch. Below is a sample command to do this, but you should check the following link
# to find installation instructions that are specific to your compute platform:
# https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio

# Clone and install the BlindVLA repo
git clone https://github.com/CognitiveAISystems/BlindVLA.git
cd BlindVLA
pip install -e ./openvla

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
# =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip3 install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
pip install diffusers==0.33.0

pip install -e ./ManiSkill
pip install -e ./SimplerEnv
pip install -U "typeguard>=3"
```

The pretrained OpenVLA model is warmed up for 2k steps on 140 episodes collected with Octo-Small and a motion planner. You can download the training dataset (1.4k episodes) [here](https://huggingface.co/datasets/tttonyalpha/openvla_1k-dataset) and the warm-up checkpoint [here](https://huggingface.co/tttonyalpha/openvla-7b-warmup-checkpoint_lora_002000).
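
To load the warm-up adapter for inference, here is a minimal sketch; it assumes the checkpoint uses a standard PEFT adapter layout on top of `openvla/openvla-7b`:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForVision2Seq, AutoProcessor

# Base OpenVLA model; trust_remote_code is required for its custom modeling classes
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

# Attach the warm-up LoRA adapter linked above
vla = PeftModel.from_pretrained(base, "tttonyalpha/openvla-7b-warmup-checkpoint_lora_002000")
vla.eval()
```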

### Sample Usage (LoRA Fine-tuning with Visual Representation Alignment)

Below is a minimal example from the GitHub README of how you can integrate Visual Representation Alignment into your VLA's training pipeline. Just plug in these few lines right after your forward pass; no architecture changes are needed.

```python
import torch
import torch.nn.functional as F

# ....
# out = vla.forward(..., output_hidden_states=True)
# pixel_values = preprocessor(image, ...)
# ....

n_vis = out.projector_features.shape[1]
pos, pos_end = 1, 1 + n_vis  # assumed: visual tokens sit right after the BOS token, spanning n_vis positions

# 1. Extract VLA's visual features from a specific layer and project to the teacher's feature dimension
vla_features = out.hidden_states[align_layer][:, pos:pos_end]
vla_features = alignment_projector(vla_features)

# 2. Get teacher patch features
with torch.no_grad():
    teacher_features = teacher_vision_backbone(pixel_values)

# 3. Compute cosine alignment loss
emb_t = F.normalize(teacher_features, dim=-1)
emb_s = F.normalize(vla_features, dim=-1)

cossim = (emb_t * emb_s).sum(dim=-1)
align_loss = (-cossim).mean()

loss += cfg.align_coeff * align_loss
```
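
The snippet leaves `alignment_projector` and `teacher_vision_backbone` abstract; both come from your training setup. As a rough illustration of the "lightweight frozen projector" described above (the architecture and dimensions below are assumptions, not taken from the repo):

```python
import torch.nn as nn

# Illustrative sizes: 4096 matches the Llama-2-7B hidden width used by OpenVLA;
# 1024 stands in for the teacher's patch-feature width.
alignment_projector = nn.Sequential(
    nn.Linear(4096, 2048),
    nn.GELU(),
    nn.Linear(2048, 1024),
)
alignment_projector.requires_grad_(False)  # kept frozen during SFT, per the method description
alignment_projector.eval()
```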

You can run LoRA fine-tuning with Visual Representation Alignment using this script:

```bash
openvla_path="tttonyalpha/openvla-7b-warmup-checkpoint_merged_002000_lora_002000"

torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vla_path "$openvla_path" \
  --data_root_dir "datasets" \
  --dataset_name "sft" \
  --run_root_dir "runs" \
  --lora_rank 32 \
  --batch_size 8 \
  --max_steps 60000 \
  --eval_steps 200 \
  --save_steps "0,5000,10000,20000,30000,40000,50000,60000" \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug True
```
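
The `_merged_002000_lora_002000` name above suggests the warm-up LoRA weights were merged into the base model before this fine-tuning stage. Assuming a standard PEFT adapter, that step might look like this sketch (the output path is illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForVision2Seq

base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)
vla = PeftModel.from_pretrained(base, "tttonyalpha/openvla-7b-warmup-checkpoint_lora_002000")

# Fold the LoRA deltas into the base weights and save a plain checkpoint
merged = vla.merge_and_unload()
merged.save_pretrained("openvla-7b-warmup-merged")  # illustrative output path
```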

## Training Details

### Training Data

The model is warmed up for 2k steps on 140 episodes collected with Octo-Small and a motion planner. A larger training dataset (1.4k episodes) is available [here](https://huggingface.co/datasets/tttonyalpha/openvla_1k-dataset).

### Training Procedure

#### Training Hyperparameters

The model is fine-tuned using LoRA with the following hyperparameters, as set in the `finetune.py` command above:

- **LoRA rank:** 32
- **Batch size:** 8
- **Max steps:** 60000
- **Evaluation steps:** 200
- **Save steps:** 0, 5000, 10000, 20000, 30000, 40000, 50000, 60000
- **Gradient accumulation steps:** 1
- **Learning rate:** 5e-4
- **Image augmentation:** True

## Evaluation

### Testing Data, Factors & Metrics

The model is evaluated using the **VL-Think Task Suite**, a diagnostic suite assessing the transfer of VL understanding and knowledge from VLMs to VLAs independently of low-level control. The suite includes various tasks, focusing on the ability to interpret visual symbols, compositional cues, and categorical relations.

Examples of tasks include:

* `PutOnShapeInSceneMultiColor-v1` (13 shapes)
* `PutOnColorInSceneMulti-v1` (8 colors)
* `PutOnLaundryIconInSceneMulti-v1` (17 laundry icons)
* `PutOnNumberInSceneParity-v1` (8 numbers)
* `PutOnPublicInfoSignInSceneMulti-v1` (14 public info signs)
* `PutOnSignTrafficInSceneMulti-v1` (24 traffic signs)
* `PutOnWeatherIconInSceneMulti-v1` (9 weather icons)
* `PutOnArrowSignInSceneMulti-v1` (4 directions)

Evaluation is performed using batched environments for efficient parallel processing.
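
The evaluation harness lives in the repo; as a rough sketch of what a batched rollout looks like (assuming the bundled ManiSkill fork registers the VL-Think task IDs and supports ManiSkill 3's vectorized `gym.make(..., num_envs=N)` interface):

```python
import gymnasium as gym
import mani_skill.envs  # noqa: F401 -- assumed to register the task IDs above

# 16 parallel instances of one VL-Think task with RGB observations
env = gym.make("PutOnShapeInSceneMultiColor-v1", num_envs=16, obs_mode="rgb")
obs, info = env.reset(seed=0)

for _ in range(200):
    action = env.action_space.sample()  # stand-in for VLA policy actions
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```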

### Results

The paper demonstrates that Visual Representation Alignment mitigates degradation of visual representations and yields improved generalization to out-of-distribution (OOD) scenarios. For detailed results, refer to the [paper](https://huggingface.co/papers/2510.25616).

## Citation

If you find our code useful, please cite [our paper](https://arxiv.org/abs/2510.25616):

```bibtex
@misc{kachaev2025dontblindvlaaligning,
      title={Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization},
      author={Nikita Kachaev and Mikhail Kolosov and Daniil Zelezetsky and Alexey K. Kovalev and Aleksandr I. Panov},
      year={2025},
      eprint={2510.25616},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.25616},
}
```

## Acknowledgement

BlindVLA builds on [RL4VLA](https://github.com/gen-robot/RL4VLA), [Simpler](https://github.com/simpler-env/SimplerEnv), [REPA](https://github.com/sihyun-yu/REPA), and [OpenVLA](https://github.com/openvla/openvla). Many thanks for their awesome work!
|