---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Sky-VLM: Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

[![License](https://img.shields.io/badge/License-Apache%202.0-9BDFDF)](https://github.com/linglingxiansen/SpatialSky/blob/main/LICENSE)
[![hf_checkpoint](https://img.shields.io/badge/🤗-Checkpoint-FBD49F.svg)](https://huggingface.co/llxs/Sky-VLM)
[![arXiv](https://img.shields.io/badge/Arxiv-2511.13269-E69191.svg?logo=arXiv)](https://arxiv.org/abs/2511.13269)

This repository hosts **Sky-VLM**, a Vision-Language Model specialized for UAV spatial reasoning across multiple granularities and contexts. It was introduced in the paper [Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation](https://huggingface.co/papers/2511.13269). The project's code is available on GitHub: [https://github.com/linglingxiansen/SpatialSKy](https://github.com/linglingxiansen/SpatialSKy).

## 🚀 Sample Usage

First, install the `transformers` library and the other dependencies described in the [GitHub repository](https://github.com/linglingxiansen/SpatialSky#installation):

```bash
pip install git+https://github.com/huggingface/transformers accelerate torch torchvision qwen-vl-utils openai pillow tqdm nltk scipy
```

Then run inference with `Sky-VLM` using the following Python code:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # provided by the `qwen-vl-utils` package

# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "llxs/Sky-VLM", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("llxs/Sky-VLM")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",  # placeholder image path
            },
            {
                "type": "text",
                "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?",
            },
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# Expected output example:
# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
```
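
If `qwen_vl_utils` cannot be installed, a plain `PIL` image can be passed to the processor instead. This is a minimal sketch of that fallback, assuming a single local image and no video inputs:

```python
from PIL import Image

# Fallback without qwen_vl_utils: load the image manually and hand it to the
# processor directly (single local image, no video inputs).
image = Image.open("./examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png")
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to(model.device)
```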
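
The decoded answer wraps the grounded object and its bounding box in special tokens, as in the expected output above. If you need the raw coordinates, a small regex helper can extract them; `parse_box` below is a hypothetical sketch, not part of the repository:

```python
import re

# Hypothetical helper (not from the Sky-VLM repo): pull "(x1,y1),(x2,y2)"
# out of the <|box_start|>...<|box_end|> span in the decoded string.
def parse_box(decoded: str):
    match = re.search(
        r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>", decoded
    )
    if match is None:
        return None
    x1, y1, x2, y2 = map(int, match.groups())
    return (x1, y1), (x2, y2)

print(parse_box(output_text[0]))
# ((576, 12), (592, 42)) for the expected output shown above
```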