Cannot run code snippet with an actual image
#1
by kkatodus - opened
Hi, thank you for releasing the model! I'm very excited to use it. I have a slightly modified version of the code snippet from the model card, in which I just feed in a (320, 256) image with a text prompt. However, when I run it, I get an error about the channels not matching (see below). Could you point me to an image that can be fed into the model to get some inference output?
RuntimeError: Given groups=1, weight of size [1152, 3, 14, 14], expected input[1, 6, 224, 224] to have 3 channels, but got 6 channels instead
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
from numpy import asarray
INSTRUCTION = 'pick up the red block and place it on the green block'
# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")
# Grab image input & format prompt (note inclusion of system prompt due to Vicuña base model)
image_path = "./your_file.jpeg"
image = Image.open(image_path)
system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"{system_prompt} USER: What action should the robot take to {INSTRUCTION}? ASSISTANT:"
# Predict Action (7-DoF; un-normalize for BridgeV2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
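For anyone hitting the same error, here is a minimal sketch of how the shapes in the RuntimeError can be read. The helper `conv_channel_mismatch` is hypothetical (it is not part of `transformers` or OpenVLA), and the shapes are copied from the message above. Note that the input in the error is already 224x224, so the original (320, 256) image size does not look like the problem; the mismatch is in the channel dimension only.

```python
def conv_channel_mismatch(weight_shape, input_shape):
    """Return (expected, actual) input channels for a 2D convolution.

    weight_shape: (out_channels, in_channels, kernel_h, kernel_w)
    input_shape:  (batch, channels, height, width)
    Assumes groups=1, as stated in the error message.
    """
    expected = weight_shape[1]
    actual = input_shape[1]
    return expected, actual


# Shapes copied from the RuntimeError in the post.
weight_shape = (1152, 3, 14, 14)
input_shape = (1, 6, 224, 224)

expected, actual = conv_channel_mismatch(weight_shape, input_shape)
print(f"model expects {expected} channels, input has {actual}")

# If the actual count is an exact multiple of the expected one, the input
# may be several 3-channel images stacked along the channel axis.
# This is an assumption drawn from the numbers, not a confirmed diagnosis.
stacked_images = actual // expected if actual % expected == 0 else None
print(f"possible number of stacked images: {stacked_images}")
```

One thing that might be worth checking, as a guess rather than a confirmed cause: the snippet loads the processor from `openvla/openvla-7b` but the model from `openvla/openvla-v01-7b`. If the two checkpoints expect differently shaped image inputs, loading the processor from the same repository as the model could resolve the mismatch. Printing the shape of the image tensor the processor returns (assuming it exposes one, for example under a `pixel_values` key) before calling `predict_action` would show whether the 6 channels come from the processor.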