# **Spatial-VU**

> The **Spatial-VU** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, tailored for **spatial reasoning and vision understanding**. It is designed to generate highly detailed, descriptive captions across a broad range of visual categories, including images with complex, sensitive, or nuanced content, and across varying aspect ratios and resolutions.

# Key Highlights

* **Spatial Reasoning & Vision Understanding**: Fine-tuned to provide accurate and descriptive visual interpretations, enabling deeper understanding of spatial relationships, structures, and context.
* **High-Fidelity Descriptions**: Generates comprehensive captions for general, artistic, technical, abstract, and low-context images.
* **Robust Across Aspect Ratios**: Accurately captions images with wide, tall, square, and irregular dimensions.
* **Variational Detail Control**: Produces both high-level summaries and fine-grained descriptions, as the prompt requires.
* **Foundation on the Qwen2.5-VL Architecture**: Leverages the strengths of the Qwen2.5-VL-7B multimodal model for visual reasoning, comprehension, and instruction following.
* **Multilingual Output Capability**: Supports multilingual descriptions (English by default), adaptable via prompt engineering.

# Quick Start with Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s).
# The base checkpoint id is shown as a stand-in; replace it with the
# Spatial-VU repository id.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then trim the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
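The highlights above state that detail level and output language are steered purely through the text prompt. The sketch below shows one way to parameterize that prompt; the `build_messages` helper and its exact phrasings are illustrative assumptions, not part of the model's API. The returned list can be passed to `processor.apply_chat_template` exactly as in the quick-start snippet.

```python
def build_messages(image_url, detail="detailed", language="English"):
    """Build a Qwen2.5-VL chat message list, conditioning the caption's
    detail level and output language via the text prompt.

    The prompt phrasings here are illustrative, not prescribed by the model.
    """
    if detail == "concise":
        instruction = "Summarize this image in one or two sentences."
    else:
        instruction = "Describe this image in detail."
    if language != "English":
        instruction += f" Respond in {language}."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": instruction},
            ],
        }
    ]

messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    detail="concise",
    language="German",
)
print(messages[0]["content"][1]["text"])
# → Summarize this image in one or two sentences. Respond in German.
```

Because conditioning happens entirely in the prompt, no model reload or extra configuration is needed to switch between languages or detail levels.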
# Intended Use

* Generating detailed and unfiltered image captions for general-purpose or artistic datasets.
* Spatial reasoning and vision-understanding research.
* Content-moderation research, red-teaming, and generative safety evaluations.
* Enabling descriptive captioning for visual datasets typically excluded from mainstream models.
* Creative applications (e.g., storytelling, art generation) that benefit from rich descriptive captions.
* Captioning for non-standard aspect ratios and stylized visual content.

# Limitations

* May produce explicit, sensitive, or offensive descriptions depending on the image content and prompts.
* Not suitable for deployment in production systems that require content filtering or moderation.
* Caption tone and style can vary with the phrasing of the input prompt.
* Accuracy may vary for unfamiliar or synthetic visual styles.