Add model card with metadata and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +81 -0
README.md ADDED
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Sky-VLM: Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

[![License](https://img.shields.io/badge/License-Apache%202.0-9BDFDF)](https://github.com/linglingxiansen/SpatialSky/blob/main/LICENSE)
[![hf_checkpoint](https://img.shields.io/badge/🤗-Checkpoint-FBD49F.svg)](https://huggingface.co/llxs/Sky-VLM)
[![arXiv](https://img.shields.io/badge/Arxiv-2511.13269-E69191.svg?logo=arXiv)](https://arxiv.org/abs/2511.13269)

This repository hosts the **Sky-VLM** model, a specialized Vision-Language Model designed for UAV spatial reasoning across multiple granularities and contexts. It was introduced in the paper [Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation](https://huggingface.co/papers/2511.13269).

The project's code is available on GitHub: [https://github.com/linglingxiansen/SpatialSky](https://github.com/linglingxiansen/SpatialSky).

## 🚀 Sample Usage

First, install the `transformers` library and other dependencies as described in the [GitHub repository](https://github.com/linglingxiansen/SpatialSky#installation):

```bash
pip install git+https://github.com/huggingface/transformers accelerate torch torchvision openai pillow tqdm nltk scipy
```

Then, you can use the following Python code for inference with the `Sky-VLM` model:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # install separately: pip install qwen-vl-utils

# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "llxs/Sky-VLM", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("llxs/Sky-VLM")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",  # Placeholder image path
            },
            {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# process_vision_info extracts the image/video inputs referenced in the messages.
# If qwen_vl_utils is unavailable, you can instead pass a list of PIL images
# directly as `images=` to the processor.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# Expected output example: <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
```
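The grounded answer above encodes the referred object and its bounding box with special tokens. As a minimal sketch (assuming the `<|object_ref_start|>…<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|>` format shown in the expected output; `parse_grounded_box` is a hypothetical helper, not part of the repository), the name and coordinates can be recovered with a regular expression:

```python
import re

def parse_grounded_box(text):
    """Extract (name, (x1, y1, x2, y2)) from a grounded model answer,
    or return None if no box token pair is present. Note: depending on the
    model, coordinates may be in a normalized space rather than raw pixels."""
    ref = re.search(r"<\|object_ref_start\|>(.*?)<\|object_ref_end\|>", text)
    box = re.search(r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>", text)
    if box is None:
        return None
    name = ref.group(1) if ref else ""
    x1, y1, x2, y2 = (int(v) for v in box.groups())
    return name, (x1, y1, x2, y2)

sample = "<|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>"
print(parse_grounded_box(sample))  # ('language switch', (576, 12, 592, 42))
```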