Add pipeline tag and sample usage
#1
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,7 +1,6 @@
|
|
| 1 |
---
|
| 2 |
license: cc-by-nc-sa-4.0
|
| 3 |
-
|
| 4 |
-
- arxiv:2606.14024
|
| 5 |
---
|
| 6 |
|
| 7 |
# ViT-Up
|
|
@@ -10,18 +9,45 @@ tags:
|
|
| 10 |
|
| 11 |
This repository provides pretrained ViT-Up weights for DINOv3-S+ and DINOv3-B.
|
| 12 |
|
| 13 |
-
- Paper: https://
|
| 14 |
-
-
|
| 15 |
-
-
|
| 16 |
-
- Code: https://github.com/krispinwandel/vit-up
|
| 17 |
|
| 18 |
|
| 19 |
## Citation
|
| 20 |
|
| 21 |
```bibtex
|
| 22 |
@misc{wandel2026vitupfaithfulfeatureupsampling,
|
| 23 |
title={ViT-Up: Faithful Feature Upsampling for Vision Transformers},
|
| 24 |
-
author={Krispin Wandel and Jingchuan Wang and Hesheng Wang},
|
| 25 |
year={2026},
|
| 26 |
eprint={2606.14024},
|
| 27 |
archivePrefix={arXiv},
|
|
|
|
| 1 |
---
|
| 2 |
license: cc-by-nc-sa-4.0
|
| 3 |
+
pipeline_tag: image-feature-extraction
|
|
|
|
| 4 |
---
|
| 5 |
|
| 6 |
# ViT-Up
|
|
|
|
| 9 |
|
| 10 |
This repository provides pretrained ViT-Up weights for DINOv3-S+ and DINOv3-B.
|
| 11 |
|
| 12 |
+
- **Paper**: [ViT-Up: Faithful Feature Upsampling for Vision Transformers](https://huggingface.co/papers/2606.14024)
|
| 13 |
+
- **Project page**: https://vitup.papers.discuna.com/
|
| 14 |
+
- **Code**: https://github.com/krispinwandel/vit-up
|
|
|
|
| 15 |
|
| 16 |
+
## Sample Usage
|
| 17 |
+
|
| 18 |
+
ViT-Up models can be loaded directly with `torch.hub.load`. The Hub entry points download ViT-Up weights from Hugging Face and load the matching DINOv3 backbone.
|
| 19 |
+
|
| 20 |
+
```python
|
| 21 |
+
import torch
|
| 22 |
+
|
| 23 |
+
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 24 |
+
|
| 25 |
+
# Available entry points:
|
| 26 |
+
# - vit_up_dinov3_splus
|
| 27 |
+
# - vit_up_dinov3_base
|
| 28 |
+
model = torch.hub.load(
|
| 29 |
+
"krispinwandel/vit-up",
|
| 30 |
+
"vit_up_dinov3_splus",
|
| 31 |
+
pretrained=True,
|
| 32 |
+
trust_repo=True,
|
| 33 |
+
device=device,
|
| 34 |
+
).eval()
|
| 35 |
+
|
| 36 |
+
images = torch.randn(1, 3, 448, 448, device=device)
|
| 37 |
+
query_coords = torch.rand(1, 100, 2, device=device) # normalized (x, y) in [0, 1]
|
| 38 |
+
|
| 39 |
+
with torch.no_grad():
|
| 40 |
+
features = model(images, query_coords)
|
| 41 |
+
|
| 42 |
+
print(features.shape) # (B, N_queries, D)
|
| 43 |
+
```
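
The snippet above queries random coordinates. For dense upsampling one would more likely query a regular grid. The following is a minimal sketch, assuming only what the comment above states: coordinates are normalized (x, y) pairs in [0, 1]. `make_query_grid` is a hypothetical helper, not part of the ViT-Up API, and sampling at pixel centers is an assumption; check the repository for the exact convention.

```python
def make_query_grid(height, width):
    """Return row-major [x, y] pairs in [0, 1], one per pixel center.

    Hypothetical helper, not part of the ViT-Up API. The pixel-center
    convention, (index + 0.5) / size, is an assumption.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return [
        [(col + 0.5) / width, (row + 0.5) / height]
        for row in range(height)
        for col in range(width)
    ]


grid = make_query_grid(2, 4)
print(len(grid))  # number of query points, height * width
print(grid[0])    # query point for the top-left pixel center
```

To pass the grid to the model, it could be wrapped as `torch.tensor(grid, device=device).unsqueeze(0)`, which gives shape `(1, H*W, 2)`. If the model preserves query order, the returned features could then be reshaped to `(1, H, W, D)`.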
|
| 44 |
|
| 45 |
## Citation
|
| 46 |
|
| 47 |
```bibtex
|
| 48 |
@misc{wandel2026vitupfaithfulfeatureupsampling,
|
| 49 |
title={ViT-Up: Faithful Feature Upsampling for Vision Transformers},
|
| 50 |
+
author={Krispin Wandel and Jingchuan Wang and Hesheng Wang},
|
| 51 |
year={2026},
|
| 52 |
eprint={2606.14024},
|
| 53 |
archivePrefix={arXiv},
|