|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- IPEC-COMMUNITY/OpenFly |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- Success rate |
|
|
base_model: |
|
|
- openvla/openvla-7b-prismatic |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- UAV |
|
|
- Navigation |
|
|
- VLN |
|
|
- visual-language-navigation |
|
|
--- |
|
|
|
|
|
# OpenFly |
|
|
|
|
|
OpenFly, a platform comprising a versatile toolchain and large-scale benchmark for aerial VLN. The code is purely huggingFace-based and concise, with efficient performance. |
|
|
|
|
|
For full details, please read [our paper](https://arxiv.org/abs/2502.18041) and see [our project page](https://shailab-ipec.github.io/openfly/). |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** The OpenFly team consisting of researchers from Shanghai AI Laboratory. |
|
|
- **Model type:** vision-language-navigation (language, image => uav actions) |
|
|
- **Language(s) (NLP):** en |
|
|
- **License:** MIT |
|
|
- **Pretraining Dataset:** [OpenFly](https://huggingface.co/datasets/IPEC-COMMUNITY/OpenFly) |
|
|
- **Repository:** [https://github.com/SHAILAB-IPEC/OpenFly-Platform](https://github.com/SHAILAB-IPEC/OpenFly-Platform) |
|
|
- **Paper:** [OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation](https://arxiv.org/abs/2502.18041) |
|
|
- **Project Page & Videos:** [https://shailab-ipec.github.io/openfly/](https://shailab-ipec.github.io/openfly/) |
|
|
|
|
|
|
|
|
## Uses |
|
|
|
|
|
OpenFly relies solely on HuggingFace Transformers 🤗, making deployment extremely easy. If your environment supports `transformers >= 4.47.0`, you can directly use the following code to load the model and perform inference. |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
```python |
|
|
|
|
|
from typing import Dict, List, Optional, Union |
|
|
from pathlib import Path |
|
|
import numpy as np |
|
|
import torch |
|
|
from PIL import Image |
|
|
from transformers import LlamaTokenizerFast |
|
|
from transformers import AutoConfig, AutoImageProcessor, AutoModelForVision2Seq, AutoProcessor |
|
|
import os, json |
|
|
from model.prismatic import PrismaticVLM |
|
|
from model.overwatch import initialize_overwatch |
|
|
from model.action_tokenizer import ActionTokenizer |
|
|
from model.vision_backbone import DinoSigLIPViTBackbone, DinoSigLIPImageTransform |
|
|
from model.llm_backbone import LLaMa2LLMBackbone |
|
|
from extern.hf.configuration_prismatic import OpenFlyConfig |
|
|
from extern.hf.modeling_prismatic import OpenVLAForActionPrediction |
|
|
from extern.hf.processing_prismatic import PrismaticImageProcessor, PrismaticProcessor |
|
|
|
|
|
AutoConfig.register("openvla", OpenFlyConfig) |
|
|
AutoImageProcessor.register(OpenFlyConfig, PrismaticImageProcessor) |
|
|
AutoProcessor.register(OpenFlyConfig, PrismaticProcessor) |
|
|
AutoModelForVision2Seq.register(OpenFlyConfig, OpenVLAForActionPrediction) |
|
|
|
|
|
model_name_or_path="IPEC-COMMUNITY/openfly-agent-7b" |
|
|
processor = AutoProcessor.from_pretrained(model_name_or_path) |
|
|
model = AutoModelForVision2Seq.from_pretrained( |
|
|
model_name_or_path, |
|
|
attn_implementation="flash_attention_2", # [Optional] Requires `flash_attn` |
|
|
torch_dtype=torch.bfloat16, |
|
|
low_cpu_mem_usage=True, |
|
|
trust_remote_code=True, |
|
|
).to("cuda:0") |
|
|
|
|
|
image = Image.fromarray(cv2.imread("example.png")) |
|
|
prompt = "Take off, go straight pass the river" |
|
|
inputs = processor(prompt, [image, image, image]).to("cuda:0", dtype=torch.bfloat16) |
|
|
action = model.predict_action(**inputs, unnorm_key="vln_norm", do_sample=False) |
|
|
print(action) |
|
|
``` |
|
|
|
|
|
|