	---
	license: mit
	datasets:
	- IPEC-COMMUNITY/OpenFly
	language:
	- en
	metrics:
	- Success rate
	base_model:
	- openvla/openvla-7b-prismatic
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- UAV
	- Navigation
	- VLN
	- visual-language-navigation
	---

	# OpenFly

	OpenFly is a platform comprising a versatile toolchain and a large-scale benchmark for aerial vision-language navigation (VLN). The code is built entirely on HuggingFace and is concise and efficient.

	For full details, please read [our paper](https://arxiv.org/abs/2502.18041) and see [our project page](https://shailab-ipec.github.io/openfly/).

	## Model Details

	### Model Description

	- Developed by: The OpenFly team, consisting of researchers from Shanghai AI Laboratory.
	- Model type: vision-language-navigation (language, image => UAV actions)
	- Language(s) (NLP): en
	- License: MIT
	- Pretraining Dataset: [OpenFly](https://huggingface.co/datasets/IPEC-COMMUNITY/OpenFly)
	- Repository: [https://github.com/SHAILAB-IPEC/OpenFly-Platform](https://github.com/SHAILAB-IPEC/OpenFly-Platform)
	- Paper: [OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation](https://arxiv.org/abs/2502.18041)
	- Project Page & Videos: [https://shailab-ipec.github.io/openfly/](https://shailab-ipec.github.io/openfly/)


	## Uses

	OpenFly relies solely on HuggingFace Transformers 🤗, which makes deployment straightforward. If your environment provides `transformers >= 4.47.0`, you can use the following code directly to load the model and run inference.
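Before running the snippet under Direct Use, it can help to confirm the version requirement stated above. The following is a minimal sketch, not part of the OpenFly code base: the helper names `parse_release`, `meets_minimum`, and `installed_version` are hypothetical, and the comparison deliberately ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_TRANSFORMERS = "4.47.0"  # minimum version stated in this model card


def parse_release(version: str) -> Tuple[int, ...]:
    """Return (major, minor, patch) as integers, ignoring pre-release suffixes."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as "4.47" to three components.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed: str, required: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed release is at least the required release."""
    return parse_release(installed) >= parse_release(required)


def installed_version(package: str = "transformers") -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("transformers is not installed")
elif meets_minimum(found):
    print(f"transformers {found} satisfies >= {MIN_TRANSFORMERS}")
else:
    print(f"transformers {found} is older than {MIN_TRANSFORMERS}")
```

The check uses only the standard library, so it runs even in an environment where `transformers` has not been installed yet.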

	### Direct Use

	```python
	from typing import Dict, List, Optional, Union
	from pathlib import Path
	import os, json
	import numpy as np
	import torch
	from PIL import Image
	from transformers import LlamaTokenizerFast
	from transformers import AutoConfig, AutoImageProcessor, AutoModelForVision2Seq, AutoProcessor
	# Project-local modules; the `model` and `extern` packages must be importable from the working directory.
	from model.prismatic import PrismaticVLM
	from model.overwatch import initialize_overwatch
	from model.action_tokenizer import ActionTokenizer
	from model.vision_backbone import DinoSigLIPViTBackbone, DinoSigLIPImageTransform
	from model.llm_backbone import LLaMa2LLMBackbone
	from extern.hf.configuration_prismatic import OpenFlyConfig
	from extern.hf.modeling_prismatic import OpenVLAForActionPrediction
	from extern.hf.processing_prismatic import PrismaticImageProcessor, PrismaticProcessor

	# Register the OpenFly classes so the Auto* factories can resolve the "openvla" model type.
	AutoConfig.register("openvla", OpenFlyConfig)
	AutoImageProcessor.register(OpenFlyConfig, PrismaticImageProcessor)
	AutoProcessor.register(OpenFlyConfig, PrismaticProcessor)
	AutoModelForVision2Seq.register(OpenFlyConfig, OpenVLAForActionPrediction)

	# Load the processor and the model in bfloat16, then move the model to the first GPU.
	model_name_or_path = "IPEC-COMMUNITY/openfly-agent-7b"
	processor = AutoProcessor.from_pretrained(model_name_or_path)
	model = AutoModelForVision2Seq.from_pretrained(
	    model_name_or_path,
	    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
	    torch_dtype=torch.bfloat16,
	    low_cpu_mem_usage=True,
	    trust_remote_code=True,
	).to("cuda:0")

	# Load the observation as an RGB PIL image (the original used cv2.imread without importing cv2,
	# and cv2 returns BGR arrays, which would swap the color channels).
	image = Image.open("example.png").convert("RGB")
	prompt = "Take off, go straight pass the river"
	# The processor takes the instruction and a list of three images; this example repeats one frame.
	inputs = processor(prompt, [image, image, image]).to("cuda:0", dtype=torch.bfloat16)
	# Greedy decoding (do_sample=False).
	action = model.predict_action(**inputs, unnorm_key="vln_norm", do_sample=False)
	print(action)
	```
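The example above passes the same image three times. If those three slots are meant to hold a short history of camera frames (an assumption; this card does not say so), a small rolling buffer could supply them during a flight. The sketch below is illustrative only: `FrameHistory` is a hypothetical helper, not part of OpenFly, and plain strings stand in for PIL images.

```python
from collections import deque
from typing import Deque, Generic, List, TypeVar

T = TypeVar("T")


class FrameHistory(Generic[T]):
    """Hypothetical helper that keeps the most recent `size` frames."""

    def __init__(self, size: int = 3) -> None:
        if size < 1:
            raise ValueError("size must be at least 1")
        self.size = size
        self._frames: Deque[T] = deque(maxlen=size)

    def push(self, frame: T) -> List[T]:
        """Add a frame and return exactly `size` frames, oldest first."""
        self._frames.append(frame)
        frames = list(self._frames)
        # Until the buffer fills, pad the front by repeating the oldest frame.
        padding = [frames[0]] * (self.size - len(frames))
        return padding + frames


# Strings stand in for images here; any object type works.
history = FrameHistory[str](size=3)
first = history.push("frame0")
second = history.push("frame1")
print(first)
print(second)
```

Under that assumption, the list returned by `push` would take the place of `[image, image, image]` in the processor call, with the newest frame last. Whether the model expects that ordering is not stated here and should be checked against the repository.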