update model card
- README.md +131 -49
- assets/architecture.jpg +0 -0
- assets/depth-pro-datasets.jpeg +0 -0
- assets/depth-pro-results-boundary.png +0 -0
- assets/depth-pro-results-depth.png +0 -0
- assets/depth-pro-results-fov.png +0 -0
- assets/depth-pro-training-hyper-parameters.jpeg +0 -0
- assets/tiger.jpg +0 -0
README.md CHANGED
---
library_name: transformers
license: apple-ascl
tags:
- vision
- depth-estimation
pipeline_tag: depth-estimation
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---

# DepthPro: Monocular Depth Estimation

## Table of Contents

- [DepthPro: Monocular Depth Estimation](#depthpro-monocular-depth-estimation)
  - [Table of Contents](#table-of-contents)
  - [Model Details](#model-details)
    - [Model Sources](#model-sources)
  - [How to Get Started with the Model](#how-to-get-started-with-the-model)
  - [Training Details](#training-details)
    - [Training Data](#training-data)
    - [Preprocessing](#preprocessing)
    - [Training Hyperparameters](#training-hyperparameters)
  - [Evaluation](#evaluation)
    - [Model Architecture and Objective](#model-architecture-and-objective)
  - [Citation](#citation)
  - [Model Card Authors](#model-card-authors)

## Model Details
DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

The abstract from the paper is the following:

> We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.

This is the model card of a 🤗 [transformers](https://huggingface.co/docs/transformers/index) model that has been pushed to the Hub.

- **Developed by:** Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun.
- **Model type:** [DepthPro](https://huggingface.co/docs/transformers/main/en/model_doc/depth_pro)
- **License:** Apple-ASCL

### Model Sources

<!-- Provide the basic links for the model. -->

- **HF Docs:** [DepthPro](https://huggingface.co/docs/transformers/main/en/model_doc/depth_pro)
- **Repository:** https://github.com/apple/ml-depth-pro
- **Paper:** https://arxiv.org/abs/2410.02073

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import requests
from PIL import Image
import torch
from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation

# load an example image
url = 'https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# load the image processor and the model
image_processor = DepthProImageProcessorFast.from_pretrained("geetu040/DepthPro")
model = DepthProForDepthEstimation.from_pretrained("geetu040/DepthPro")

# prepare inputs and run inference
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# interpolate the prediction back to the original image size
post_processed_output = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)],
)

# estimated field of view and metric depth map
fov = post_processed_output[0]["fov"]
depth = post_processed_output[0]["predicted_depth"]

# scale the depth map to [0, 255] for visualization
depth = (depth - depth.min()) / (depth.max() - depth.min())
depth = depth * 255.
depth = depth.detach().cpu().numpy()
depth = Image.fromarray(depth.astype("uint8"))
```
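
To run inference on a GPU, the model and inputs can be moved to a CUDA device first; this is a minimal sketch that reuses the objects created above and falls back to the CPU when CUDA is not available:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# move the model and the preprocessed inputs to the selected device
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
```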
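The checkpoint may also work with the high-level `pipeline` API; this is an untested sketch that assumes the installed `transformers` version registers DepthPro for the `depth-estimation` pipeline task:

```python
from transformers import pipeline

# hypothetical usage via the depth-estimation pipeline
pipe = pipeline("depth-estimation", model="geetu040/DepthPro")
result = pipe("https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg")
result["depth"].save("tiger_depth.png")  # "depth" is a PIL image, "predicted_depth" is the raw tensor
```
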
## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The DepthPro model was trained on the following datasets:

![DepthPro training datasets](assets/depth-pro-datasets.jpeg)

### Preprocessing

Images go through the following preprocessing steps (a rough equivalent is sketched after this list):
- rescaled by `1/255.`
- normalized with `mean=[0.5, 0.5, 0.5]` and `std=[0.5, 0.5, 0.5]`
- resized to `1536x1536` pixels
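
For illustration, roughly the same preprocessing can be reproduced with torchvision; this is a sketch of the steps listed above, not a replacement for `DepthProImageProcessorFast`, and the exact order of operations inside the processor may differ:

```python
import torch
from PIL import Image
from torchvision.transforms import v2

# resize -> rescale by 1/255 -> normalize, using the values listed above
preprocess = v2.Compose([
    v2.ToImage(),                               # PIL image -> tensor
    v2.Resize((1536, 1536)),                    # resize to 1536x1536 pixels
    v2.ToDtype(torch.float32, scale=True),      # rescale pixel values by 1/255
    v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

image = Image.new("RGB", (640, 480))            # any RGB image
pixel_values = preprocess(image).unsqueeze(0)   # shape: (1, 3, 1536, 1536)
```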

### Training Hyperparameters

![DepthPro training hyperparameters](assets/depth-pro-training-hyper-parameters.jpeg)

## Evaluation

![DepthPro depth estimation results](assets/depth-pro-results-depth.png)

![DepthPro boundary estimation results](assets/depth-pro-results-boundary.png)

![DepthPro field-of-view estimation results](assets/depth-pro-results-fov.png)

### Model Architecture and Objective

The `DepthProForDepthEstimation` model uses a `DepthProEncoder` to encode the input image and a `FeatureFusionStage` to fuse the output features from the encoder.

The `DepthProEncoder` further uses two encoders:
- `patch_encoder`
  - The input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
  - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
  - These patches are processed by the **`patch_encoder`**.
- `image_encoder`
  - The input image is also rescaled to `patch_size` and processed by the **`image_encoder`**.

Both of these encoders can be configured via `patch_model_config` and `image_model_config` respectively, each of which is a separate `Dinov2Model` by default. These settings are exposed on the model configuration, as in the sketch below.
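
As an illustration, the multi-scale settings described above can be inspected or overridden through `DepthProConfig`; this is a minimal sketch, and the checkpoint's own configuration remains the authoritative source of the actual values:

```python
from transformers import DepthProConfig, DepthProForDepthEstimation

# load the configuration shipped with the checkpoint and inspect the multi-scale settings
config = DepthProConfig.from_pretrained("geetu040/DepthPro")
print(config.scaled_images_ratios)          # ratios used to build the image pyramid
print(config.scaled_images_overlap_ratios)  # overlap between neighbouring patches
print(config.patch_size)                    # size of the patches fed to the patch_encoder

# the same fields can be overridden to instantiate a randomly initialized model
custom_config = DepthProConfig(scaled_images_ratios=[0.25, 0.5, 1.0], patch_size=384)
model = DepthProForDepthEstimation(custom_config)
```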

Outputs from both encoders (`last_hidden_state`) and selected intermediate states (`hidden_states`) from the **`patch_encoder`** are fused by a `DPT`-based `FeatureFusionStage` for depth estimation.

The network is supplemented with a focal length estimation head. A small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field of view (the `fov` value in the example above; see the sketch below).
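
For downstream use, the predicted field of view can be converted to a focal length in pixels with the standard pinhole relation; this sketch assumes `fov` is the horizontal field of view in degrees and reuses `fov` and `image` from the example above:

```python
import math

# pinhole model: focal_length_px = (image_width / 2) / tan(fov / 2)
fov_radians = math.radians(float(fov))
focal_length_px = (image.width / 2) / math.tan(fov_radians / 2)
print(f"estimated focal length: {focal_length_px:.1f} px")
```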
## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@misc{bochkovskii2024depthprosharpmonocular,
      title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
      author={Aleksei Bochkovskii and Amaël Delaunoy and Hugo Germain and Marcel Santos and Yichao Zhou and Stephan R. Richter and Vladlen Koltun},
      year={2024},
      eprint={2410.02073},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.02073},
}
```
## Model Card Authors

[Armaghan Shakir](https://huggingface.co/geetu040)

assets/architecture.jpg DELETED
Binary file (85 kB)

assets/depth-pro-datasets.jpeg ADDED
assets/depth-pro-results-boundary.png ADDED
assets/depth-pro-results-depth.png ADDED
assets/depth-pro-results-fov.png ADDED
assets/depth-pro-training-hyper-parameters.jpeg ADDED

assets/tiger.jpg DELETED
Binary file (433 kB)