Upload folder using huggingface_hub
- models--Intel--dpt-hybrid-midas/refs/main +1 -0
- models--Intel--dpt-hybrid-midas/snapshots/11eaf7a1cf4bd70740697dbc216f98980c0aeb03/README.md +166 -0
- models--Intel--dpt-hybrid-midas/snapshots/11eaf7a1cf4bd70740697dbc216f98980c0aeb03/config.json +459 -0
- models--Intel--dpt-hybrid-midas/snapshots/11eaf7a1cf4bd70740697dbc216f98980c0aeb03/pytorch_model.bin +3 -0
models--Intel--dpt-hybrid-midas/refs/main
ADDED
@@ -0,0 +1 @@
11eaf7a1cf4bd70740697dbc216f98980c0aeb03
models--Intel--dpt-hybrid-midas/snapshots/11eaf7a1cf4bd70740697dbc216f98980c0aeb03/README.md
ADDED
@@ -0,0 +1,166 @@
---
license: apache-2.0
tags:
- vision
- depth-estimation
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
model-index:
- name: dpt-hybrid-midas
  results:
  - task:
      type: monocular-depth-estimation
      name: Monocular Depth Estimation
    dataset:
      type: MIX-6
      name: MIX-6
    metrics:
    - type: Zero-shot transfer
      value: 11.06
      name: Zero-shot transfer
      config: Zero-shot transfer
      verified: false
---

## Model Details: DPT-Hybrid (also known as MiDaS 3.0)

The Dense Prediction Transformer (DPT) model was trained on 1.4 million images for monocular depth estimation.
It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [this repository](https://github.com/isl-org/DPT).
DPT uses the Vision Transformer (ViT) as a backbone and adds a neck + head on top for monocular depth estimation.

This repository hosts the "hybrid" version of the model described in the paper. DPT-Hybrid diverges from DPT by using [ViT-hybrid](https://huggingface.co/google/vit-hybrid-base-bit-384) as a backbone and taking some activations from the backbone.

The model card was written jointly by the Hugging Face team and Intel.

| Model Detail | Description |
| ----------- | ----------- |
| Model Authors - Company | Intel |
| Date | December 22, 2022 |
| Version | 1 |
| Type | Computer Vision - Monocular Depth Estimation |
| Paper or Other Resources | [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) and [GitHub Repo](https://github.com/isl-org/DPT) |
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/dpt-hybrid-midas/discussions) and [Intel Developers Discord](https://discord.gg/rv2Gp55UJQ) |

| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt) to look for fine-tuned versions on a task that interests you. |
| Primary intended users | Anyone doing monocular depth estimation |
| Out-of-scope uses | In most cases, this model will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

### How to use

Here is how to use this model for zero-shot depth estimation on an image:

```python
from PIL import Image
import numpy as np
import requests
import torch

from transformers import DPTImageProcessor, DPTForDepthEstimation

image_processor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas", low_cpu_mem_usage=True)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size; PIL's image.size is (width, height),
# so it is reversed to the (height, width) that interpolate expects
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction as an 8-bit grayscale image
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth.show()
```

For more code examples, we refer to the [documentation](https://huggingface.co/docs/transformers/master/en/model_doc/dpt).
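As a quick alternative, the same model can be driven through the high-level `pipeline` API. This is a minimal sketch, not part of the original card, assuming a `transformers` version recent enough to ship the `depth-estimation` pipeline:

```python
from transformers import pipeline

# wraps the image processor + DPTForDepthEstimation behind one call;
# accepts a URL, a local path, or a PIL image
pipe = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
result = pipe("http://images.cocodataset.org/val2017/000000039769.jpg")

result["predicted_depth"]  # raw torch.Tensor depth map
result["depth"].show()     # PIL image already rescaled for viewing
```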
| Factors | Description |
| ----------- | ----------- |
| Groups | Multiple datasets compiled together |
| Instrumentation | - |
| Environment | Inference completed on an Intel Xeon Platinum 8280 CPU @ 2.70GHz with 8 physical cores and an NVIDIA RTX 2080 GPU. |
| Card Prompts | Model deployment on alternate hardware and software will change model performance. |

| Metrics | Description |
| ----------- | ----------- |
| Model performance measures | Zero-shot Transfer |
| Decision thresholds | - |
| Approaches to uncertainty and variability | - |

| Training and Evaluation Data | Description |
| ----------- | ----------- |
| Datasets | The dataset is called MIX 6 and contains around 1.4M images. The model was initialized with ImageNet-pretrained weights. |
| Motivation | To build a robust monocular depth prediction network |
| Preprocessing | "We resize the image such that the longer side is 384 pixels and train on random square crops of size 384. ... We perform random horizontal flips for data augmentation." See [Ranftl et al. (2021)](https://arxiv.org/abs/2103.13413) for more details. |

## Quantitative Analyses

| Model | Training set | DIW WHDR | ETH3D AbsRel | Sintel AbsRel | KITTI δ>1.25 | NYU δ>1.25 | TUM δ>1.25 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DPT - Large | MIX 6 | 10.82 (-13.2%) | 0.089 (-31.2%) | 0.270 (-17.5%) | 8.46 (-64.6%) | 8.32 (-12.9%) | 9.97 (-30.3%) |
| DPT - Hybrid | MIX 6 | 11.06 (-11.2%) | 0.093 (-27.6%) | 0.274 (-16.2%) | 11.56 (-51.6%) | 8.69 (-9.0%) | 10.89 (-23.2%) |
| MiDaS | MIX 6 | 12.95 (+3.9%) | 0.116 (-10.5%) | 0.329 (+0.5%) | 16.08 (-32.7%) | 8.71 (-8.8%) | 12.51 (-12.5%) |
| MiDaS [30] | MIX 5 | 12.46 | 0.129 | 0.327 | 23.90 | 9.55 | 14.29 |
| Li [22] | MD [22] | 23.15 | 0.181 | 0.385 | 36.29 | 27.52 | 29.54 |
| Li [21] | MC [21] | 26.52 | 0.183 | 0.405 | 47.94 | 18.57 | 17.71 |
| Wang [40] | WS [40] | 19.09 | 0.205 | 0.390 | 31.92 | 29.57 | 20.18 |
| Xian [45] | RW [45] | 14.59 | 0.186 | 0.422 | 34.08 | 27.00 | 25.02 |
| Casser [5] | CS [8] | 32.80 | 0.235 | 0.422 | 21.15 | 39.58 | 37.18 |

Table 1. Comparison to the state of the art on monocular depth estimation. We evaluate zero-shot cross-dataset transfer according to the protocol defined in [30]. Relative performance is computed with respect to the original MiDaS model [30]; for example, DPT-Hybrid lowers DIW WHDR from MiDaS's 12.46 to 11.06, a relative change of -11.2%. Lower is better for all metrics. ([Ranftl et al., 2021](https://arxiv.org/abs/2103.13413))

| Ethical Considerations | Description |
| ----------- | ----------- |
| Data | The training data come from multiple image datasets compiled together. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. It is an aggregated set of monocular depth image datasets. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | The extent of the risks involved in using the model remains unknown. |
| Use cases | - |

| Caveats and Recommendations |
| ----------- |
| Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. There are no additional caveats or recommendations for this model. |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2103-13413,
  author     = {Ren{\'{e}} Ranftl and
                Alexey Bochkovskiy and
                Vladlen Koltun},
  title      = {Vision Transformers for Dense Prediction},
  journal    = {CoRR},
  volume     = {abs/2103.13413},
  year       = {2021},
  url        = {https://arxiv.org/abs/2103.13413},
  eprinttype = {arXiv},
  eprint     = {2103.13413},
  timestamp  = {Wed, 07 Apr 2021 15:31:46 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
models--Intel--dpt-hybrid-midas/snapshots/11eaf7a1cf4bd70740697dbc216f98980c0aeb03/config.json
ADDED
@@ -0,0 +1,459 @@
{
  "_commit_hash": null,
  "architectures": [
    "DPTForDepthEstimation"
  ],
  "attention_probs_dropout_prob": 0.0,
  "auxiliary_loss_weight": 0.4,
  "backbone_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "depths": [3, 4, 9],
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "early_stopping": false,
    "embedding_dynamic_padding": true,
    "embedding_size": 64,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "global_padding": "SAME",
    "hidden_act": "relu",
    "hidden_sizes": [256, 512, 1024, 2048],
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "layer_type": "bottleneck",
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "bit",
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_groups": 32,
    "num_return_sequences": 1,
    "out_features": ["stage1", "stage2", "stage3"],
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "output_stride": 32,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "stage_names": ["stem", "stage1", "stage2", "stage3"],
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.26.0.dev0",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "width_factor": 1
  },
  "backbone_featmap_shape": [1, 1024, 24, 24],
  "backbone_out_indices": [2, 5, 8, 11],
  "fusion_hidden_size": 256,
  "head_in_index": -1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0", "1": "LABEL_1", "2": "LABEL_2", "3": "LABEL_3", "4": "LABEL_4",
    "5": "LABEL_5", "6": "LABEL_6", "7": "LABEL_7", "8": "LABEL_8", "9": "LABEL_9",
    "10": "LABEL_10", "11": "LABEL_11", "12": "LABEL_12", "13": "LABEL_13", "14": "LABEL_14",
    "15": "LABEL_15", "16": "LABEL_16", "17": "LABEL_17", "18": "LABEL_18", "19": "LABEL_19",
    "20": "LABEL_20", "21": "LABEL_21", "22": "LABEL_22", "23": "LABEL_23", "24": "LABEL_24",
    "25": "LABEL_25", "26": "LABEL_26", "27": "LABEL_27", "28": "LABEL_28", "29": "LABEL_29",
    "30": "LABEL_30", "31": "LABEL_31", "32": "LABEL_32", "33": "LABEL_33", "34": "LABEL_34",
    "35": "LABEL_35", "36": "LABEL_36", "37": "LABEL_37", "38": "LABEL_38", "39": "LABEL_39",
    "40": "LABEL_40", "41": "LABEL_41", "42": "LABEL_42", "43": "LABEL_43", "44": "LABEL_44",
    "45": "LABEL_45", "46": "LABEL_46", "47": "LABEL_47", "48": "LABEL_48", "49": "LABEL_49",
    "50": "LABEL_50", "51": "LABEL_51", "52": "LABEL_52", "53": "LABEL_53", "54": "LABEL_54",
    "55": "LABEL_55", "56": "LABEL_56", "57": "LABEL_57", "58": "LABEL_58", "59": "LABEL_59",
    "60": "LABEL_60", "61": "LABEL_61", "62": "LABEL_62", "63": "LABEL_63", "64": "LABEL_64",
    "65": "LABEL_65", "66": "LABEL_66", "67": "LABEL_67", "68": "LABEL_68", "69": "LABEL_69",
    "70": "LABEL_70", "71": "LABEL_71", "72": "LABEL_72", "73": "LABEL_73", "74": "LABEL_74",
    "75": "LABEL_75", "76": "LABEL_76", "77": "LABEL_77", "78": "LABEL_78", "79": "LABEL_79",
    "80": "LABEL_80", "81": "LABEL_81", "82": "LABEL_82", "83": "LABEL_83", "84": "LABEL_84",
    "85": "LABEL_85", "86": "LABEL_86", "87": "LABEL_87", "88": "LABEL_88", "89": "LABEL_89",
    "90": "LABEL_90", "91": "LABEL_91", "92": "LABEL_92", "93": "LABEL_93", "94": "LABEL_94",
    "95": "LABEL_95", "96": "LABEL_96", "97": "LABEL_97", "98": "LABEL_98", "99": "LABEL_99",
    "100": "LABEL_100", "101": "LABEL_101", "102": "LABEL_102", "103": "LABEL_103", "104": "LABEL_104",
    "105": "LABEL_105", "106": "LABEL_106", "107": "LABEL_107", "108": "LABEL_108", "109": "LABEL_109",
    "110": "LABEL_110", "111": "LABEL_111", "112": "LABEL_112", "113": "LABEL_113", "114": "LABEL_114",
    "115": "LABEL_115", "116": "LABEL_116", "117": "LABEL_117", "118": "LABEL_118", "119": "LABEL_119",
    "120": "LABEL_120", "121": "LABEL_121", "122": "LABEL_122", "123": "LABEL_123", "124": "LABEL_124",
    "125": "LABEL_125", "126": "LABEL_126", "127": "LABEL_127", "128": "LABEL_128", "129": "LABEL_129",
    "130": "LABEL_130", "131": "LABEL_131", "132": "LABEL_132", "133": "LABEL_133", "134": "LABEL_134",
    "135": "LABEL_135", "136": "LABEL_136", "137": "LABEL_137", "138": "LABEL_138", "139": "LABEL_139",
    "140": "LABEL_140", "141": "LABEL_141", "142": "LABEL_142", "143": "LABEL_143", "144": "LABEL_144",
    "145": "LABEL_145", "146": "LABEL_146", "147": "LABEL_147", "148": "LABEL_148", "149": "LABEL_149"
  },
  "image_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_hybrid": true,
  "label2id": {
    "LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4,
    "LABEL_5": 5, "LABEL_6": 6, "LABEL_7": 7, "LABEL_8": 8, "LABEL_9": 9,
    "LABEL_10": 10, "LABEL_11": 11, "LABEL_12": 12, "LABEL_13": 13, "LABEL_14": 14,
    "LABEL_15": 15, "LABEL_16": 16, "LABEL_17": 17, "LABEL_18": 18, "LABEL_19": 19,
    "LABEL_20": 20, "LABEL_21": 21, "LABEL_22": 22, "LABEL_23": 23, "LABEL_24": 24,
    "LABEL_25": 25, "LABEL_26": 26, "LABEL_27": 27, "LABEL_28": 28, "LABEL_29": 29,
    "LABEL_30": 30, "LABEL_31": 31, "LABEL_32": 32, "LABEL_33": 33, "LABEL_34": 34,
    "LABEL_35": 35, "LABEL_36": 36, "LABEL_37": 37, "LABEL_38": 38, "LABEL_39": 39,
    "LABEL_40": 40, "LABEL_41": 41, "LABEL_42": 42, "LABEL_43": 43, "LABEL_44": 44,
    "LABEL_45": 45, "LABEL_46": 46, "LABEL_47": 47, "LABEL_48": 48, "LABEL_49": 49,
    "LABEL_50": 50, "LABEL_51": 51, "LABEL_52": 52, "LABEL_53": 53, "LABEL_54": 54,
    "LABEL_55": 55, "LABEL_56": 56, "LABEL_57": 57, "LABEL_58": 58, "LABEL_59": 59,
    "LABEL_60": 60, "LABEL_61": 61, "LABEL_62": 62, "LABEL_63": 63, "LABEL_64": 64,
    "LABEL_65": 65, "LABEL_66": 66, "LABEL_67": 67, "LABEL_68": 68, "LABEL_69": 69,
    "LABEL_70": 70, "LABEL_71": 71, "LABEL_72": 72, "LABEL_73": 73, "LABEL_74": 74,
    "LABEL_75": 75, "LABEL_76": 76, "LABEL_77": 77, "LABEL_78": 78, "LABEL_79": 79,
    "LABEL_80": 80, "LABEL_81": 81, "LABEL_82": 82, "LABEL_83": 83, "LABEL_84": 84,
    "LABEL_85": 85, "LABEL_86": 86, "LABEL_87": 87, "LABEL_88": 88, "LABEL_89": 89,
    "LABEL_90": 90, "LABEL_91": 91, "LABEL_92": 92, "LABEL_93": 93, "LABEL_94": 94,
    "LABEL_95": 95, "LABEL_96": 96, "LABEL_97": 97, "LABEL_98": 98, "LABEL_99": 99,
    "LABEL_100": 100, "LABEL_101": 101, "LABEL_102": 102, "LABEL_103": 103, "LABEL_104": 104,
    "LABEL_105": 105, "LABEL_106": 106, "LABEL_107": 107, "LABEL_108": 108, "LABEL_109": 109,
    "LABEL_110": 110, "LABEL_111": 111, "LABEL_112": 112, "LABEL_113": 113, "LABEL_114": 114,
    "LABEL_115": 115, "LABEL_116": 116, "LABEL_117": 117, "LABEL_118": 118, "LABEL_119": 119,
    "LABEL_120": 120, "LABEL_121": 121, "LABEL_122": 122, "LABEL_123": 123, "LABEL_124": 124,
    "LABEL_125": 125, "LABEL_126": 126, "LABEL_127": 127, "LABEL_128": 128, "LABEL_129": 129,
    "LABEL_130": 130, "LABEL_131": 131, "LABEL_132": 132, "LABEL_133": 133, "LABEL_134": 134,
    "LABEL_135": 135, "LABEL_136": 136, "LABEL_137": 137, "LABEL_138": 138, "LABEL_139": 139,
    "LABEL_140": 140, "LABEL_141": 141, "LABEL_142": 142, "LABEL_143": 143, "LABEL_144": 144,
    "LABEL_145": 145, "LABEL_146": 146, "LABEL_147": 147, "LABEL_148": 148, "LABEL_149": 149
  },
  "layer_norm_eps": 1e-12,
  "model_type": "dpt",
  "neck_hidden_sizes": [256, 512, 768, 768],
  "neck_ignore_stages": [0, 1],
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "patch_size": 16,
  "qkv_bias": true,
  "readout_type": "project",
  "reassemble_factors": [1, 1, 1, 0.5],
  "semantic_classifier_dropout": 0.1,
  "semantic_loss_ignore_index": 255,
  "torch_dtype": "float32",
  "transformers_version": null,
  "use_auxiliary_head": true,
  "use_batch_norm_in_fusion_residual": false
}
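For orientation, the file above is what `DPTConfig` deserializes when the model is loaded. A minimal sketch, assuming only that a `transformers` version with DPT support is installed (this note is not part of the commit):

```python
from transformers import DPTConfig, DPTForDepthEstimation

# the nested "backbone_config" ("model_type": "bit") configures the BiT/ResNet
# stem of the hybrid backbone; "is_hybrid": true selects that embedding path
config = DPTConfig.from_pretrained("Intel/dpt-hybrid-midas")
print(config.is_hybrid, config.backbone_out_indices)  # True [2, 5, 8, 11]

# building from the config alone yields the architecture with random weights;
# from_pretrained would additionally load pytorch_model.bin below
model = DPTForDepthEstimation(config)
```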
models--Intel--dpt-hybrid-midas/snapshots/11eaf7a1cf4bd70740697dbc216f98980c0aeb03/pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b6c4d44f9d96ca3fa76dd3bbb153989a60b4ad5526559f3c598562a368d687ec
size 489648389
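The three lines above are a Git LFS pointer: the ~490 MB weight file itself is stored out of band and addressed by its sha256 oid. The `models--Intel--dpt-hybrid-midas/snapshots/<commit>/` layout in this upload mirrors the local cache that `huggingface_hub` builds; a minimal sketch, assuming the default cache location:

```python
from huggingface_hub import snapshot_download

# fetches (or reuses) every file at this revision and returns the local
# snapshot directory, e.g.
# ~/.cache/huggingface/hub/models--Intel--dpt-hybrid-midas/snapshots/11eaf7a...
path = snapshot_download(
    "Intel/dpt-hybrid-midas",
    revision="11eaf7a1cf4bd70740697dbc216f98980c0aeb03",
)
print(path)
```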