	---
	license: apache-2.0
	library_name: transformers
	---
	# VisionMaster-Pro
	<!-- markdownlint-disable first-line-h1 -->
	<!-- markdownlint-disable html -->
	<!-- markdownlint-disable no-duplicate-header -->

	<div align="center">
	<img src="figures/fig1.png" width="60%" alt="VisionMaster-Pro" />
	</div>
	<hr>

	<div align="center" style="line-height: 1;">
	<a href="LICENSE" style="margin: 2px;">
	<img alt="License" src="figures/fig2.png" style="display: inline-block; vertical-align: middle;"/>
	</a>
	</div>

	## 1. Introduction

	VisionMaster-Pro is a transformer-based model for visual understanding tasks. This release incorporates enhanced attention mechanisms designed to capture fine-grained visual detail while remaining robust across diverse imaging conditions.

	<p align="center">
	<img width="80%" src="figures/fig3.png">
	</p>

	Compared to our previous VisionMaster release, this Pro version demonstrates substantial improvements in handling complex visual scenarios. For instance, on the ImageNet-1K benchmark, accuracy has increased from 82.3% to 89.7%. This advancement stems from our novel multi-scale attention fusion mechanism and improved training methodology using progressive resolution scaling.

	Beyond core recognition tasks, VisionMaster-Pro also features enhanced robustness to domain shifts and improved zero-shot transfer capabilities.

	## 2. Evaluation Results

	### Comprehensive Benchmark Results

	<div align="center">

	| Category | Benchmark | ModelA | ModelB | ModelC | VisionMaster-Pro |
	|---|---|---|---|---|---|
	| Detection Tasks | Object Detection | 0.721 | 0.745 | 0.751 | 0.557 |
	| | Instance Segmentation | 0.683 | 0.701 | 0.712 | 0.639 |
	| | Semantic Segmentation | 0.756 | 0.771 | 0.780 | 0.750 |
	| Recognition Tasks | Image Classification | 0.823 | 0.847 | 0.858 | 0.693 |
	| | Face Recognition | 0.912 | 0.925 | 0.931 | 0.864 |
	| | Action Recognition | 0.678 | 0.695 | 0.708 | 0.683 |
	| | Scene Understanding | 0.701 | 0.718 | 0.729 | 0.625 |
	| Perception Tasks | Depth Estimation | 0.645 | 0.667 | 0.678 | 0.493 |
	| | Pose Estimation | 0.712 | 0.728 | 0.741 | 0.683 |
	| | Edge Detection | 0.823 | 0.835 | 0.846 | 0.844 |
	| | OCR Accuracy | 0.867 | 0.882 | 0.891 | 0.820 |
	| Advanced Capabilities | Visual QA | 0.589 | 0.612 | 0.628 | 0.451 |
	| | Image Captioning | 0.634 | 0.651 | 0.668 | 0.590 |
	| | Anomaly Detection | 0.756 | 0.773 | 0.785 | 0.806 |
	| | Zero-Shot Transfer | 0.523 | 0.548 | 0.567 | 0.484 |

	</div>

	### Overall Performance Summary
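	In the table above, VisionMaster-Pro achieves the highest score on Anomaly Detection (0.806) and comes within 0.002 of the best baseline on Edge Detection (0.844 vs. 0.846). On the remaining benchmarks it trails ModelC, with the largest gaps on Object Detection, Depth Estimation, and Visual QA.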
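
The comparison with the baselines can be checked directly from the table. The sketch below copies the scores from the table above, lists the benchmarks on which VisionMaster-Pro matches or exceeds the best of ModelA, ModelB, and ModelC, and ranks the largest deficits. The variable and function names are illustrative and are not part of any VisionMaster-Pro API.

```python
# Scores copied from the benchmark table: (ModelA, ModelB, ModelC, VisionMaster-Pro).
SCORES = {
    "Object Detection": (0.721, 0.745, 0.751, 0.557),
    "Instance Segmentation": (0.683, 0.701, 0.712, 0.639),
    "Semantic Segmentation": (0.756, 0.771, 0.780, 0.750),
    "Image Classification": (0.823, 0.847, 0.858, 0.693),
    "Face Recognition": (0.912, 0.925, 0.931, 0.864),
    "Action Recognition": (0.678, 0.695, 0.708, 0.683),
    "Scene Understanding": (0.701, 0.718, 0.729, 0.625),
    "Depth Estimation": (0.645, 0.667, 0.678, 0.493),
    "Pose Estimation": (0.712, 0.728, 0.741, 0.683),
    "Edge Detection": (0.823, 0.835, 0.846, 0.844),
    "OCR Accuracy": (0.867, 0.882, 0.891, 0.820),
    "Visual QA": (0.589, 0.612, 0.628, 0.451),
    "Image Captioning": (0.634, 0.651, 0.668, 0.590),
    "Anomaly Detection": (0.756, 0.773, 0.785, 0.806),
    "Zero-Shot Transfer": (0.523, 0.548, 0.567, 0.484),
}


def benchmarks_leading(scores):
    """Return benchmarks where the last column is >= the best baseline."""
    return [
        name
        for name, (*baselines, ours) in scores.items()
        if ours >= max(baselines)
    ]


def largest_gaps(scores, top_n=3):
    """Return the top_n benchmarks with the largest deficit to the best baseline."""
    gaps = {name: round(max(row[:-1]) - row[-1], 3) for name, row in scores.items()}
    return sorted(gaps, key=gaps.get, reverse=True)[:top_n]


if __name__ == "__main__":
    print(benchmarks_leading(SCORES))
    print(largest_gaps(SCORES))
```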

	## 3. Demo & API Platform
	We offer a demo interface and API for you to interact with VisionMaster-Pro. Please check our official website for more details.

	## 4. How to Run Locally

	Please refer to our code repository for more information about running VisionMaster-Pro locally.

	Compared to previous versions, the usage recommendations for VisionMaster-Pro have changed as follows:

	1. Multi-scale input is supported natively.
	2. Automatic image preprocessing is enabled by default.

	The model architecture of VisionMaster-Pro-Lite is optimized for edge deployment, but it shares the same feature extraction configuration as the main VisionMaster-Pro.

	### Input Configuration
	We recommend using the following preprocessing settings.
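	```python
	from torchvision import transforms

	# Preprocessing at 384x384 with the standard ImageNet mean and std.
	transform = transforms.Compose([
	    transforms.Resize(384),      # resize the shorter side to 384 px
	    transforms.CenterCrop(384),  # keep the central 384x384 region
	    transforms.ToTensor(),       # PIL image -> float tensor in [0, 1]
	    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
	])
	```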
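
To make the effect of these settings concrete, the sketch below works through the arithmetic in plain Python: how a shorter-side resize to 384 changes the image size, where the 384x384 center crop falls, and how a pixel value in [0, 1] is normalized. The helper names are hypothetical, and the rounding is a simplification that may differ from torchvision's by a pixel.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def resized_size(width, height, target=384):
    """Scale so the shorter side equals target, keeping the aspect ratio."""
    if width <= height:
        return target, int(target * height / width)
    return int(target * width / height), target


def center_crop_box(width, height, crop=384):
    """Return (left, top, right, bottom) of the central crop x crop region."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to an RGB pixel scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))


if __name__ == "__main__":
    w, h = resized_size(640, 480)          # a 640x480 input becomes 512x384
    print((w, h), center_crop_box(w, h))   # crop box is (64, 0, 448, 384)
    print(normalize_pixel(MEAN))           # the mean color maps to zeros
```

Because the crop keeps only the central square, content near the left and right edges of a wide image is discarded before it reaches the model.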

	### Inference Settings
	We recommend the following inference settings for optimal performance:
	- Batch size: 32 (adjust based on GPU memory)
	- Mixed precision: FP16 for inference
	- Image resolution: 384x384 for best accuracy
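
As a rough guide for adjusting the batch size, the sketch below estimates the memory taken by the input tensor alone and splits a list of inputs into batches. It assumes 3-channel images at 384x384 and 2 bytes per value for FP16; model weights and activations, which usually dominate GPU memory, are not counted. The function names and file names are illustrative.

```python
def input_batch_bytes(batch_size=32, channels=3, size=384, bytes_per_value=2):
    """Bytes for one input batch (FP16 uses 2 bytes per value, FP32 uses 4)."""
    return batch_size * channels * size * size * bytes_per_value


def iter_batches(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    mib = input_batch_bytes() / 2**20
    print(f"FP16 input batch: {mib:.0f} MiB")      # 27 MiB at the settings above
    paths = [f"img_{i}.jpg" for i in range(70)]    # placeholder file names
    print([len(b) for b in iter_batches(paths)])   # [32, 32, 6]
```

Halving the batch size halves this figure, so reducing it stepwise is a simple way to fit a smaller GPU.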

	## 5. License
	This code repository is licensed under the [Apache License 2.0](LICENSE). The use of VisionMaster-Pro models is also subject to the [Apache License 2.0](LICENSE). Commercial use is permitted.

	## 6. Contact
	If you have any questions, please raise an issue on our GitHub repository or contact us at vision@visionmaster.ai.