---

# BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

[[paper]](https://arxiv.org/abs/2506.07530) [[model]](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e) [[code]](https://github.com/ustcwhy/BitVLA)

- June 2025: [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](https://arxiv.org/abs/2506.07530)

## Open Source Plan

- ✅ Paper, pre-trained VLM, and evaluation code.
- ✅ Fine-tuned VLA code and models.
- 🧭 Pre-training code and VLA.

## Contents

- [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](#bitvla-1-bit-vision-language-action-models-for-robotics-manipulation)
  - [Contents](#contents)
  - [Checkpoints](#checkpoints)
  - [Vision-Language](#vision-language)
    - [Evaluation on VQA](#evaluation-on-vqa)
  - [Vision-Language-Action](#vision-language-action)
    - [OFT Training](#oft-training)
      - [1. Preparing OFT](#1-preparing-oft)
      - [2. OFT fine-tuning](#2-oft-fine-tuning)
    - [Evaluation on LIBERO](#evaluation-on-libero)
  - [Acknowledgement](#acknowledgement)
  - [Citation](#citation)
  - [License](#license)
  - [Contact Information](#contact-information)

## Checkpoints

| Model | Path |
| -------------- | ----- |
| BitVLA | [hongyuw/bitvla-bitsiglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) |
| BitVLA fine-tuned on LIBERO-Spatial | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16) |
| BitVLA fine-tuned on LIBERO-Object | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16) |
| BitVLA fine-tuned on LIBERO-Goal | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16) |
| BitVLA fine-tuned on LIBERO-Long | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16) |
| BitVLA w/ BF16 SigLIP | [hongyuw/bitvla-siglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16) |

*Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.*

*Due to limited resources, we have not yet pre-trained BitVLA on a large-scale robotics dataset. We are actively working to secure additional compute resources to conduct this pre-training.*
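
These are standard Hugging Face checkpoints. To load one for quick inspection, something like the sketch below should work, assuming the LLaVA-style classes of the bundled transformers fork (the exact entry point may differ; see the evaluation instructions):

```
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# hypothetical usage; requires the modified transformers fork from this repo
model_id = "hongyuw/bitvla-bitsiglipL-224px-bf16"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
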
## Vision-Language

### Evaluation on VQA

We use the [LMM-Eval](https://github.com/ustcwhy/BitVLA/tree/main/lmms-eval) toolkit to conduct evaluations on VQA tasks. We provide the [transformers repo](https://github.com/ustcwhy/BitVLA/tree/main/transformers) in which we modify the [modeling_llava.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/llava/modeling_llava.py) and [modeling_siglip.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/siglip/modeling_siglip.py) to support the W1.58-A8 quantization.
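
For example, to evaluate the variant with the BF16 SigLIP encoder:

```
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16
```
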
Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.
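
As a reference for what the W1.58-A8 scheme computes, below is a minimal PyTorch sketch of the BitNet b1.58 recipe (absmean ternary quantization for weights, per-token absmax 8-bit quantization for activations); the function names are illustrative, not the exact API of this repo:

```
import torch

def quant_weight_ternary(w: torch.Tensor):
    # absmean scaling, then round every weight to {-1, 0, +1}
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1), scale

def quant_act_int8(x: torch.Tensor):
    # per-token absmax scaling into the signed 8-bit range
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    return (x / scale).round().clamp(-128, 127), scale

def w158_a8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # quantize both operands, multiply, then undo both scales
    w_q, w_scale = quant_weight_ternary(w)
    x_q, x_scale = quant_act_int8(x)
    return (x_q @ w_q.t()) * x_scale * w_scale
```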

## Vision-Language-Action

### OFT Training

#### 1. Preparing OFT

We fine-tune BitVLA with the OFT recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/tree/main). First, set up the environment as required by that project; see [SETUP.md](https://github.com/moojink/openvla-oft/blob/main/SETUP.md) and [LIBERO.md](https://github.com/moojink/openvla-oft/blob/main/LIBERO.md) for detailed instructions.

```
conda create -n bitvla python=3.10 -y
conda activate bitvla
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124

# or use the provided docker
# docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity

cd BitVLA
pip install -e openvla-oft/
pip install -e transformers

cd openvla-oft/

# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO/
# in BitVLA
pip install -r experiments/robot/libero/libero_requirements.txt

# install bitvla
pip install -e bitvla/
```
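
Optionally, confirm that the installed PyTorch build can see your GPUs before launching training:

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```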

We adopt the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download the dataset from [HuggingFace](https://huggingface.co/datasets/openvla/modified_libero_rlds).

```
git clone git@hf.co:datasets/openvla/modified_libero_rlds
```
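
If the clone only fetches small pointer files, pull the large RLDS shards explicitly (this assumes the dataset repo stores them with Git LFS):

```
cd modified_libero_rlds
git lfs install
git lfs pull
```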

#### 2. OFT fine-tuning

First, convert the [BitVLA](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) checkpoint to a format compatible with the VLA codebase.

```
python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
```

After that, you can fine-tune BitVLA with the following command. Here we take LIBERO-Spatial as an example:

```
torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune_bitnet.py \
  --vla_path /path/to/bitvla-bitsiglipL-224px-bf16 \
  --data_root_dir /path/to/modified_libero_rlds/ \
  --dataset_name libero_spatial_no_noops \
  --run_root_dir /path/to/save/your/ckpt \
  --use_l1_regression True \
  --warmup_steps 375 \
  --use_lora False \
  --num_images_in_input 2 \
  --use_proprio True \
  --batch_size 2 \
  --grad_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --max_steps 10001 \
  --save_freq 10000 \
  --save_latest_checkpoint_only False \
  --image_aug True \
  --run_id_note your_id
```
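
With the settings above, the effective global batch size is 2 × 8 × 4 = 64 (per-GPU batch × gradient-accumulation steps × GPUs); if you train on a different number of GPUs, adjust `--batch_size` and `--grad_accumulation_steps` accordingly.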

### Evaluation on LIBERO

You can download our fine-tuned BitVLA models from [HuggingFace](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e). Taking the LIBERO-Spatial suite as an example, run the following script for evaluation:

```
python experiments/robot/libero/run_libero_eval_bitnet.py \
  --pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
  --task_suite_name libero_spatial \
  --info_in_path "information you want to show in path" \
  --model_family "bitnet"
```
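
The same script covers the other suites: pass the matching fine-tuned checkpoint and the corresponding `--task_suite_name` (`libero_object`, `libero_goal`, or `libero_10` for LIBERO-Long, assuming the OpenVLA-OFT suite naming).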

## Acknowledgement

This repository is built on [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [Hugging Face's transformers](https://github.com/huggingface/transformers), and [OpenVLA-OFT](https://github.com/moojink/openvla-oft).

## Citation

If you find this repository useful, please consider citing our work:

## License

This project is licensed under the MIT License.

### Contact Information

For help or issues using models, please submit a GitHub issue.