Cloned from facebook/dinov2-with-registers-base
- README.md +75 -0
- config.json +50 -0
- model.safetensors +3 -0
- preprocessor_config.json +27 -0
README.md
ADDED
---
library_name: transformers
pipeline_tag: image-feature-extraction
license: apache-2.0
tags:
- dino
- vision
inference: false
---

# Vision Transformer (base-sized model) trained using DINOv2, with registers

Vision Transformer (ViT) model introduced in the paper [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) by Darcet et al. and first released in [this repository](https://github.com/facebookresearch/dinov2).

Disclaimer: The team releasing DINOv2 with registers did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) [originally introduced](https://arxiv.org/abs/2010.11929) for supervised image classification on ImageNet.

Subsequently, it was shown that ViTs can also learn meaningful image features (also called embeddings) in a self-supervised way, i.e. without requiring any labels. Example papers include [DINOv2](https://huggingface.co/papers/2304.07193) and [MAE](https://arxiv.org/abs/2111.06377).

The authors of DINOv2 noticed that ViTs exhibit artifacts in their attention maps, caused by the model repurposing some image patches as "registers" (scratch space for internal computations). The proposed fix is to add a few dedicated learnable tokens (called "register" tokens) to the input sequence for the model to use instead; their outputs are simply discarded. This results in:
- no artifacts
- interpretable attention maps
- improved performance.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dinov2_with_registers_visualization.png" alt="drawing" width="600"/>

<small> Visualization of attention maps of various models trained with vs. without registers. Taken from the <a href="https://arxiv.org/abs/2309.16588">original paper</a>. </small>

Note that this model does not include any fine-tuned heads.

Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.
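As an illustration (not part of the original model card; `images`, `labels`, and `num_classes` below are placeholders for your own labeled data), a minimal linear-probe sketch on top of the frozen encoder could look like this:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

# Linear-probe sketch: freeze the pre-trained backbone and train only a
# linear classifier on the [CLS] embedding.
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-with-registers-base')
backbone = AutoModel.from_pretrained('facebook/dinov2-with-registers-base')
backbone.eval()  # encoder stays frozen

num_classes = 10  # placeholder: number of classes in your dataset
classifier = torch.nn.Linear(backbone.config.hidden_size, num_classes)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

def probe_step(images, labels):
    # `images`: list of PIL images, `labels`: tensor of class indices (placeholders)
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        cls_embedding = backbone(**inputs).last_hidden_state[:, 0]  # [CLS] token
    logits = classifier(cls_embedding)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```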
## Intended uses & limitations

You can use the raw model for feature extraction. See the [model hub](https://huggingface.co/models?other=dinov2_with_registers) to look for fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model:

```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-with-registers-base')
model = AutoModel.from_pretrained('facebook/dinov2-with-registers-base')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
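`last_hidden_states` has shape `(batch_size, sequence_length, hidden_size)` and contains one embedding per token. Assuming the usual DINOv2 token layout — the [CLS] token first, then the register tokens, then one token per image patch (worth verifying against your `transformers` version) — a global image embedding and per-patch features can be separated as follows (a sketch, not part of the original card):

```python
# Continuing from the snippet above.
# Assumed token layout: [CLS] token, then `num_register_tokens` register
# tokens, then one token per 14x14 image patch.
num_registers = model.config.num_register_tokens  # 4 for this checkpoint

cls_embedding = last_hidden_states[:, 0]                      # global image embedding
patch_embeddings = last_hidden_states[:, 1 + num_registers:]  # per-patch features; register outputs are discarded

print(cls_embedding.shape, patch_embeddings.shape)
```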
### BibTeX entry and citation info

```bibtex
@misc{darcet2024visiontransformersneedregisters,
      title={Vision Transformers Need Registers},
      author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2024},
      eprint={2309.16588},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2309.16588},
}
```
config.json
ADDED
{
  "apply_layernorm": true,
  "architectures": [
    "Dinov2WithRegistersModel"
  ],
  "attention_probs_dropout_prob": 0.0,
  "drop_path_rate": 0.0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 518,
  "initializer_range": 0.02,
  "interpolate_antialias": true,
  "interpolate_offset": 0.0,
  "layer_norm_eps": 1e-06,
  "layerscale_value": 1.0,
  "mlp_ratio": 4,
  "model_type": "dinov2_with_registers",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "num_register_tokens": 4,
  "out_features": [
    "stage12"
  ],
  "out_indices": [
    12
  ],
  "patch_size": 14,
  "qkv_bias": true,
  "reshape_hidden_states": true,
  "stage_names": [
    "stem",
    "stage1",
    "stage2",
    "stage3",
    "stage4",
    "stage5",
    "stage6",
    "stage7",
    "stage8",
    "stage9",
    "stage10",
    "stage11",
    "stage12"
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.48.0.dev0",
  "use_swiglu_ffn": false
}
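As a quick check of what this configuration implies (a sketch, not part of the repository files): with 14x14 patches and the 224x224 crop from `preprocessor_config.json`, the encoder sees (224/14)² = 256 patch tokens plus one [CLS] token and 4 register tokens, i.e. 261 tokens of width 768.

```python
from transformers import AutoConfig

# Sketch: derive the expected token count from the config values above.
config = AutoConfig.from_pretrained('facebook/dinov2-with-registers-base')

crop = 224  # matches crop_size in preprocessor_config.json
patches_per_side = crop // config.patch_size             # 224 // 14 = 16
num_patches = patches_per_side ** 2                      # 256
seq_len = 1 + config.num_register_tokens + num_patches   # [CLS] + registers + patches

print(seq_len, config.hidden_size)  # 261 768
```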
model.safetensors
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7a6f7b3b9fa4b8732e707476a03cd6cdce210048582f21aafb7991c17d98e362
size 346358296
preprocessor_config.json
ADDED
{
  "crop_size": {
    "height": 224,
    "width": 224
  },
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_processor_type": "BitImageProcessor",
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "shortest_edge": 256
  }
}
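For reference, this preprocessing (resize the shortest edge to 256 with bicubic resampling, center-crop to 224x224, rescale by 1/255, normalize with the ImageNet mean and std) could be approximated with torchvision — a sketch under those assumptions, not the exact `BitImageProcessor` implementation:

```python
from torchvision import transforms

# Approximate equivalent of the image processor settings above, assuming
# RGB PIL images as input. resample=3 is PIL's bicubic filter; the
# rescale_factor of 1/255 is already applied by ToTensor().
preprocess = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),  # shortest edge -> 256
    transforms.CenterCrop(224),                                                   # crop_size 224x224
    transforms.ToTensor(),                                                        # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```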