| # Marigold Computer Vision | |
|  | |
| Marigold was proposed in | |
| [Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), | |
| a CVPR 2024 Oral paper by | |
| [Bingxin Ke](http://www.kebingxin.com/), | |
| [Anton Obukhov](https://www.obukhov.ai/), | |
| [Shengyu Huang](https://shengyuh.github.io/), | |
| [Nando Metzger](https://nandometzger.github.io/), | |
| [Rodrigo Caye Daudt](https://rcdaudt.github.io/), and | |
| [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en). | |
| The core idea is to **repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional | |
| computer vision tasks**. | |
| This approach was explored by fine-tuning Stable Diffusion for **Monocular Depth Estimation**, as demonstrated in the | |
| teaser above. | |
| Marigold was later extended in the follow-up paper, | |
| [Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis](https://huggingface.co/papers/2505.09358), | |
| authored by | |
| [Bingxin Ke](http://www.kebingxin.com/), | |
| [Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US), | |
| [Tianfu Wang](https://tianfwang.github.io/), | |
| [Nando Metzger](https://nandometzger.github.io/), | |
| [Shengyu Huang](https://shengyuh.github.io/), | |
| [Bo Li](https://www.linkedin.com/in/bobboli0202/), | |
| [Anton Obukhov](https://www.obukhov.ai/), and | |
| [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en). | |
| This work expanded Marigold to support new modalities such as **Surface Normals** and **Intrinsic Image Decomposition** | |
| (IID), introduced a training protocol for **Latent Consistency Models** (LCM), and demonstrated **High-Resolution** (HR) | |
| processing capability. | |
| > [!TIP] | |
| > The early Marigold models (`v1-0` and earlier) were optimized for best results with at least 10 inference steps. | |
| > LCM models were later developed to enable high-quality inference in just 1 to 4 steps. | |
| > Marigold models `v1-1` and later use the DDIM scheduler to achieve optimal | |
| > results in as few as 1 to 4 steps. | |
| ## Available Pipelines | |
| Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a | |
| corresponding prediction. | |
| Currently, the following computer vision tasks are implemented: | |
| | Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities | | |
| |---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | |
| | [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | | |
| | [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | | |
| | [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1),[prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) | | |
| ## Available Checkpoints | |
| All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face. | |
| They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train | |
| new model checkpoints. | |
| The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps. | |
| | Checkpoint | Modality | Comment | | |
| |-----------------------------------------------------------------------------------------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | |
| | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. | | |
| | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen-space camera frame, with values in the range from -1 to 1. | |
| | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition is comprised of Albedo and two BRDF material properties: Roughness and Metallicity. | | |
| | [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | HyperSim decomposition of an image $I$ is comprised of Albedo $A$, Diffuse shading $S$, and Non-diffuse residual $R$: $I = A*S+R$. | | |
| > [!TIP] | |
| > Check out the [Schedulers guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff | |
| > between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to | |
| > efficiently load the same components into multiple pipelines. | |
| > To learn more about reducing the memory usage of this pipeline, refer to the | |
| > [Reduce memory usage](./stable_diffusion/svd#reduce-memory-usage) section. | |
| > [!WARNING] | |
| > Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint. | |
| > The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases. | |
| > To accommodate this, the `num_inference_steps` parameter in the pipeline's `__call__` method defaults to `None` (see the | |
| > API reference). | |
| > Unless set explicitly, it inherits the value from the `default_denoising_steps` field in the checkpoint configuration | |
| > file (`model_index.json`). | |
| > This ensures high-quality predictions when invoking the pipeline with only the `image` argument. | |
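
The fallback described in the warning above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the function name `resolve_num_inference_steps` and the `example_config` dictionary are hypothetical, and only the `default_denoising_steps` field name and the `None` default come from the text.

```python
def resolve_num_inference_steps(num_inference_steps, checkpoint_config):
    """Return the explicit step count, or fall back to the checkpoint default."""
    if num_inference_steps is not None:
        if num_inference_steps < 1:
            raise ValueError("num_inference_steps must be a positive integer")
        return num_inference_steps
    # No explicit value: inherit the value stored with the checkpoint.
    default_steps = checkpoint_config.get("default_denoising_steps")
    if default_steps is None:
        raise ValueError("checkpoint config has no default_denoising_steps")
    return default_steps


# Hypothetical excerpt of a checkpoint's model_index.json (value is illustrative).
example_config = {"default_denoising_steps": 4}

resolved_default = resolve_num_inference_steps(None, example_config)
resolved_explicit = resolve_num_inference_steps(1, example_config)
```

Passing an explicit value always wins; the checkpoint default is consulted only when the argument is left as `None`.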
| The examples below are mostly given for depth prediction, but they can be universally applied to other supported | |
| modalities. | |
| We showcase the predictions using the same input image of Albert Einstein generated by Midjourney. | |
| This makes it easier to compare visualizations of the predictions across various modalities and checkpoints. | |
| Example input image for all Marigold pipelines | |
| ## Depth Prediction | |
| To get a depth prediction, load the `prs-eth/marigold-depth-v1-1` checkpoint into [MarigoldDepthPipeline](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.MarigoldDepthPipeline), | |
| put the image through the pipeline, and save the predictions: | |
| ```python | |
| import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe(image) | |
| vis = pipe.image_processor.visualize_depth(depth.prediction) | |
| vis[0].save("einstein_depth.png") | |
| depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction) | |
| depth_16bit[0].save("einstein_depth_16bit.png") | |
| ``` | |
| The [visualize_depth()](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_depth) function applies one of | |
| [matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]` | |
| depth range into an RGB image. | |
| With the `Spectral` colormap, pixels with near depth are painted red, and far pixels are blue. | |
| The 16-bit PNG file stores the single channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`. | |
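
The linear mapping into 16-bit integers can be written out explicitly. The sketch below is an illustration in plain Python and is not taken from `export_depth_to_16bit_png`; the helper names are hypothetical, and rounding to the nearest integer is an assumption.

```python
def depth_to_uint16(value):
    """Map a depth value from [0, 1] linearly to an integer in [0, 65535]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("depth value must lie in [0, 1]")
    # Round to the nearest integer (assumed rounding mode).
    return int(value * 65535 + 0.5)


def uint16_to_depth(stored):
    """Invert the mapping when reading the 16-bit PNG back."""
    return stored / 65535


near_plane = depth_to_uint16(0.0)
far_plane = depth_to_uint16(1.0)
roundtrip = uint16_to_depth(depth_to_uint16(0.25))
```

The round trip loses at most half a quantization step, which is why the 16-bit export is preferable to an 8-bit image when the depth values are consumed downstream.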
| Below are the raw and the visualized predictions. The darker and closer areas (mustache) are easier to distinguish in | |
| the visualization. | |
| Predicted depth (16-bit PNG) | |
| Predicted depth visualization (Spectral) | |
| ## Surface Normals Estimation | |
| Load the `prs-eth/marigold-normals-v1-1` checkpoint into [MarigoldNormalsPipeline](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.MarigoldNormalsPipeline), put the image through the | |
| pipeline, and save the predictions: | |
| ```python | |
| import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldNormalsPipeline.from_pretrained( | |
| "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| normals = pipe(image) | |
| vis = pipe.image_processor.visualize_normals(normals.prediction) | |
| vis[0].save("einstein_normals.png") | |
| ``` | |
| The [visualize_normals()](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_normals) maps the three-dimensional | |
| prediction with pixel values in the range `[-1, 1]` into an RGB image. | |
| The visualization function supports flipping surface normals axes to make the visualization compatible with other | |
| choices of the frame of reference. | |
| Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where `X` axis | |
| points right, `Y` axis points up, and `Z` axis points at the viewer. | |
| Below is the visualized prediction: | |
| Predicted surface normals visualization | |
| In this example, the nose tip almost certainly has a point on the surface at which the surface normal vector points | |
| straight at the viewer, meaning that its coordinates are `[0, 0, 1]`. | |
| This vector maps to the RGB `[128, 128, 255]`, which corresponds to the violet-blue color. | |
| Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the | |
| red hue. | |
| Surface normals on the shoulders point up, and their large `Y` component promotes the green color. | |
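
The color examples above follow from a simple affine mapping of each component from `[-1, 1]` to `[0, 255]`. The sketch below illustrates that arithmetic; the helper `normal_to_rgb` is hypothetical, the rounding mode is an assumption, and the axis-flipping options of `visualize_normals()` are not modeled.

```python
def normal_to_rgb(normal):
    """Map a surface normal with components in [-1, 1] to an 8-bit RGB triple."""
    for component in normal:
        if not -1.0 <= component <= 1.0:
            raise ValueError("normal components must lie in [-1, 1]")
    # Shift [-1, 1] to [0, 1], scale to [0, 255], round to nearest (assumed).
    return tuple(int((c + 1.0) / 2.0 * 255.0 + 0.5) for c in normal)


facing_viewer = normal_to_rgb((0.0, 0.0, 1.0))  # Z points at the viewer
facing_right = normal_to_rgb((1.0, 0.0, 0.0))   # X points right
facing_up = normal_to_rgb((0.0, 1.0, 0.0))      # Y points up
```

A normal facing the viewer lands on `(128, 128, 255)`, while normals facing right or up saturate the red or green channel, respectively.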
| ## Intrinsic Image Decomposition | |
| Marigold provides two models for Intrinsic Image Decomposition (IID): "Appearance" and "Lighting". | |
| Each model produces Albedo maps, derived from InteriorVerse and Hypersim annotations, respectively. | |
| - The "Appearance" model also estimates Material properties: Roughness and Metallicity. | |
| - The "Lighting" model generates Diffuse Shading and Non-diffuse Residual. | |
| Here is the sample code saving predictions made by the "Appearance" model: | |
| ```python | |
| import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( | |
| "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| intrinsics = pipe(image) | |
| vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) | |
| vis[0]["albedo"].save("einstein_albedo.png") | |
| vis[0]["roughness"].save("einstein_roughness.png") | |
| vis[0]["metallicity"].save("einstein_metallicity.png") | |
| ``` | |
| Another example demonstrating the predictions made by the "Lighting" model: | |
| ```python | |
| import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( | |
| "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| intrinsics = pipe(image) | |
| vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) | |
| vis[0]["albedo"].save("einstein_albedo.png") | |
| vis[0]["shading"].save("einstein_shading.png") | |
| vis[0]["residual"].save("einstein_residual.png") | |
| ``` | |
| Both models share the same pipeline while supporting different decomposition types. | |
| The exact decomposition parameterization (e.g., sRGB vs. linear space) is stored in the | |
| `pipe.target_properties` dictionary, which is passed into the | |
| [visualize_intrinsics()](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_intrinsics) function. | |
| Below are some examples showcasing the predicted decomposition outputs. | |
| All modalities can be inspected in the | |
| [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) Space. | |
| Predicted albedo ("Appearance" model) | |
| Predicted diffuse shading ("Lighting" model) | |
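
For the "Lighting" model, the checkpoint table states the relation $I = A*S+R$ between the image, Albedo, Diffuse shading, and Non-diffuse residual. The sketch below applies it to a single pixel as a sanity check; `recompose_pixel` is a hypothetical helper, and it assumes all three quantities are given per color channel in the same linear color space.

```python
def recompose_pixel(albedo, shading, residual):
    """Recombine one pixel per channel as I = A * S + R."""
    if not (len(albedo) == len(shading) == len(residual)):
        raise ValueError("all inputs must have the same number of channels")
    return tuple(a * s + r for a, s, r in zip(albedo, shading, residual))


# Toy values: gray albedo, uniform shading, a reddish non-diffuse residual.
pixel = recompose_pixel(
    albedo=(0.5, 0.5, 0.5),
    shading=(0.8, 0.8, 0.8),
    residual=(0.1, 0.0, 0.0),
)
```

If the predictions are stored in sRGB rather than linear space, they would need to be converted before applying the relation; `pipe.target_properties` records the parameterization.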
| ## Speeding up inference | |
| The above quick start snippets are already optimized for quality and speed, loading the checkpoint, utilizing the | |
| `fp16` variant of weights and computation, and performing the default number (4) of denoising diffusion steps. | |
| The first step to accelerate inference, at the expense of prediction quality, is to reduce the denoising diffusion | |
| steps to the minimum: | |
| ```diff | |
| import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| - depth = pipe(image) | |
| + depth = pipe(image, num_inference_steps=1) | |
| ``` | |
| With this change, the `pipe` call completes in 280ms on an RTX 3090 GPU. | |
| Internally, the input image is first encoded using the Stable Diffusion VAE encoder, followed by a single denoising | |
| step performed by the U-Net. | |
| Finally, the prediction latent is decoded with the VAE decoder into pixel space. | |
| In this setup, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM. | |
| Since Marigold's latent space is compatible with Stable Diffusion 2.0, inference can be accelerated by more than 3x, | |
| reducing the call time to 85ms on an RTX 3090, by using a [lightweight replacement of the SD VAE](../models/autoencoder_tiny). | |
| Note that using a lightweight VAE may slightly reduce the visual quality of the predictions. | |
| ```diff | |
| import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| + pipe.vae = diffusers.AutoencoderTiny.from_pretrained( | |
| + "madebyollin/taesd", torch_dtype=torch.float16 | |
| + ).cuda() | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe(image, num_inference_steps=1) | |
| ``` | |
| So far, we have optimized the number of diffusion steps and model components. Self-attention operations account for a | |
| significant portion of computations. | |
| Speeding them up can be achieved by using a more efficient attention processor: | |
| ```diff | |
| import diffusers | |
| import torch | |
| + from diffusers.models.attention_processor import AttnProcessor2_0 | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| + pipe.vae.set_attn_processor(AttnProcessor2_0()) | |
| + pipe.unet.set_attn_processor(AttnProcessor2_0()) | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe(image, num_inference_steps=1) | |
| ``` | |
| Finally, as suggested in [Optimizations](../../optimization/fp16#torchcompile), enabling `torch.compile` can further enhance performance depending on | |
| the target hardware. | |
| However, compilation incurs a significant overhead during the first pipeline invocation, making it beneficial only when | |
| the same pipeline instance is called repeatedly, such as within a loop. | |
| ```diff | |
| import diffusers | |
| import torch | |
| from diffusers.models.attention_processor import AttnProcessor2_0 | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| pipe.vae.set_attn_processor(AttnProcessor2_0()) | |
| pipe.unet.set_attn_processor(AttnProcessor2_0()) | |
| + pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True) | |
| + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe(image, num_inference_steps=1) | |
| ``` | |
| ## Maximizing Precision and Ensembling | |
| Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents. | |
| This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion. | |
| The ensembling path is activated automatically when the `ensemble_size` argument is set to a value greater than or equal to `3`. | |
| When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`. | |
| The recommended values vary across checkpoints but primarily depend on the scheduler type. | |
| The effect of ensembling is particularly visible with surface normals: | |
| ```diff | |
| import diffusers | |
| pipe = diffusers.MarigoldNormalsPipeline.from_pretrained("prs-eth/marigold-normals-v1-1").to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| - normals = pipe(image) | |
| + normals = pipe(image, num_inference_steps=10, ensemble_size=5) | |
| vis = pipe.image_processor.visualize_normals(normals.prediction) | |
| vis[0].save("einstein_normals.png") | |
| ``` | |
| Surface normals, no ensembling | |
| Surface normals, with ensembling | |
| As can be seen, all areas with fine-grained structures, such as hair, received more conservative and, on average, more | |
| correct predictions. | |
| Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction. | |
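
To make the idea of combining predictions concrete, here is a minimal per-pixel reduction for scalar maps such as depth. It is only a sketch: the function `ensemble_pixelwise` is hypothetical, the `"median"` and `"mean"` names mirror the `reduction` option listed under `ensembling_kwargs`, and any alignment of the individual predictions performed by the real pipelines is omitted.

```python
from statistics import mean, median


def ensemble_pixelwise(predictions, reduction="median"):
    """Reduce equally sized 2D maps (lists of rows) into one map, pixel by pixel."""
    if reduction not in ("median", "mean"):
        raise ValueError("reduction must be 'median' or 'mean'")
    reduce_fn = median if reduction == "median" else mean
    height, width = len(predictions[0]), len(predictions[0][0])
    return [
        [reduce_fn([p[y][x] for p in predictions]) for x in range(width)]
        for y in range(height)
    ]


# Three toy 1x2 maps; the third contains an outlier in the second pixel.
toy_predictions = [[[0.20, 0.50]], [[0.22, 0.52]], [[0.21, 0.90]]]
ensembled_median = ensemble_pixelwise(toy_predictions, "median")
ensembled_mean = ensemble_pixelwise(toy_predictions, "mean")
```

The median ignores the single outlier, whereas the mean is pulled toward it. This also shows why at least three predictions are needed for the reduction to be meaningful.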
| ## Frame-by-frame Video Processing with Temporal Consistency | |
| Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent | |
| initialization. | |
| This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the | |
| following videos: | |
| Input video | |
| Marigold Depth applied to input video frames independently | |
| To address this issue, it is possible to pass the `latents` argument to the pipelines, which defines the starting point of | |
| diffusion. | |
| Empirically, we found that a convex combination of the very same starting point noise latent and the latent | |
| corresponding to the previous frame prediction gives sufficiently smooth results, as implemented in the snippet below: | |
| ```python | |
| import imageio | |
| import diffusers | |
| import torch | |
| from diffusers.models.attention_processor import AttnProcessor2_0 | |
| from PIL import Image | |
| from tqdm import tqdm | |
| device = "cuda" | |
| path_in = "https://huggingface.co/spaces/prs-eth/marigold-lcm/resolve/c7adb5427947d2680944f898cd91d386bf0d4924/files/video/obama.mp4" | |
| path_out = "obama_depth.gif" | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to(device) | |
| pipe.vae = diffusers.AutoencoderTiny.from_pretrained( | |
| "madebyollin/taesd", torch_dtype=torch.float16 | |
| ).to(device) | |
| pipe.unet.set_attn_processor(AttnProcessor2_0()) | |
| pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True) | |
| pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) | |
| pipe.set_progress_bar_config(disable=True) | |
| with imageio.get_reader(path_in) as reader: | |
|     size = reader.get_meta_data()['size']  # (width, height) of the video frames | |
|     last_frame_latent = None | |
|     # Fixed starting noise shared by all frames | |
|     latent_common = torch.randn( | |
|         (1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size))) | |
|     ).to(device=device, dtype=torch.float16) | |
|     out = [] | |
|     for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"): | |
|         frame = Image.fromarray(frame) | |
|         latents = latent_common | |
|         if last_frame_latent is not None: | |
|             # Blend the shared noise with the latent predicted for the previous frame | |
|             latents = 0.9 * latents + 0.1 * last_frame_latent | |
|         depth = pipe( | |
|             frame, | |
|             num_inference_steps=1, | |
|             match_input_resolution=False, | |
|             latents=latents, | |
|             output_latent=True, | |
|         ) | |
|         last_frame_latent = depth.latent | |
|         out.append(pipe.image_processor.visualize_depth(depth.prediction)[0]) | |
|     diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()['fps']) | |
| ``` | |
| Here, the diffusion process starts from the given computed latent. | |
| The pipeline call sets `output_latent=True` so that `depth.latent` is available and can contribute to the next frame's | |
| latent initialization. | |
| The result is much more stable now: | |
| Marigold Depth applied to input video frames independently | |
| Marigold Depth with forced latents initialization | |
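
The two pieces of arithmetic in the video snippet, the latent shape and the convex combination, can be checked in isolation. The sketch below uses plain Python lists instead of tensors; the helper names are hypothetical, and the constants 768 (processing resolution), 8 (VAE downscaling factor), and 4 (latent channels) are read off the snippet above rather than queried from the checkpoint.

```python
def latent_shape(frame_size, processing_resolution=768, vae_downscale=8, latent_channels=4):
    """Latent shape for a frame whose longer side is resized to processing_resolution."""
    width, height = frame_size
    longer_side = max(width, height)
    return (
        1,
        latent_channels,
        processing_resolution * height // (vae_downscale * longer_side),
        processing_resolution * width // (vae_downscale * longer_side),
    )


def blend_latents(common, previous, previous_weight=0.1):
    """Convex combination of the shared noise and the previous frame's latent."""
    if not 0.0 <= previous_weight <= 1.0:
        raise ValueError("previous_weight must lie in [0, 1]")
    return [(1.0 - previous_weight) * c + previous_weight * p for c, p in zip(common, previous)]


shape_720p = latent_shape((1280, 720))
blended = blend_latents([1.0, 0.0], [0.0, 1.0])
```

A larger `previous_weight` ties each frame more strongly to its predecessor; the value 0.1 is the one used in the snippet.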
| ## Marigold for ControlNet | |
| A very common application for depth prediction with diffusion models comes in conjunction with ControlNet. | |
| Depth crispness plays a crucial role in obtaining high-quality results from ControlNet. | |
| As seen in comparisons with other methods above, Marigold excels at that task. | |
| The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format: | |
| ```python | |
| import torch | |
| import diffusers | |
| device = "cuda" | |
| generator = torch.Generator(device=device).manual_seed(2024) | |
| image = diffusers.utils.load_image( | |
| "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png" | |
| ) | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", torch_dtype=torch.float16, variant="fp16" | |
| ).to(device) | |
| depth_image = pipe(image, generator=generator).prediction | |
| depth_image = pipe.image_processor.visualize_depth(depth_image, color_map="binary") | |
| depth_image[0].save("motorcycle_controlnet_depth.png") | |
| controlnet = diffusers.ControlNetModel.from_pretrained( | |
| "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16, variant="fp16" | |
| ).to(device) | |
| pipe = diffusers.StableDiffusionXLControlNetPipeline.from_pretrained( | |
| "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnet | |
| ).to(device) | |
| pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True) | |
| controlnet_out = pipe( | |
| prompt="high quality photo of a sports bike, city", | |
| negative_prompt="", | |
| guidance_scale=6.5, | |
| num_inference_steps=25, | |
| image=depth_image, | |
| controlnet_conditioning_scale=0.7, | |
| control_guidance_end=0.7, | |
| generator=generator, | |
| ).images | |
| controlnet_out[0].save("motorcycle_controlnet_out.png") | |
| ``` | |
| Input image | |
| Depth in the format compatible with ControlNet | |
| ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city" | |
| ## Quantitative Evaluation | |
| To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), | |
| follow the evaluation protocol outlined in the paper: load the full precision fp32 model and use appropriate values | |
| for `num_inference_steps` and `ensemble_size`. | |
| Optionally seed randomness to ensure reproducibility. | |
| Maximizing `batch_size` will deliver maximum device utilization. | |
| ```python | |
| import diffusers | |
| import torch | |
| device = "cuda" | |
| seed = 2024 | |
| generator = torch.Generator(device=device).manual_seed(seed) | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device) | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe( | |
| image, | |
| num_inference_steps=4, # set according to the evaluation protocol from the paper | |
| ensemble_size=10, # set according to the evaluation protocol from the paper | |
| generator=generator, | |
| ) | |
| # evaluate metrics | |
| ``` | |
| ## Using Predictive Uncertainty | |
| The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random | |
| latents. | |
| As a side effect, it can be used to quantify epistemic (model) uncertainty; simply set `ensemble_size` to a value | |
| greater than or equal to 3 and set `output_uncertainty=True`. | |
| The resulting uncertainty will be available in the `uncertainty` field of the output. | |
| It can be visualized as follows: | |
| ```python | |
| import diffusers | |
| import torch | |
| pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| depth = pipe( | |
| image, | |
| ensemble_size=10, # any number >= 3 | |
| output_uncertainty=True, | |
| ) | |
| uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty) | |
| uncertainty[0].save("einstein_depth_uncertainty.png") | |
| ``` | |
| Depth uncertainty | |
| Surface normals uncertainty | |
| Albedo uncertainty | |
| The interpretation of uncertainty is straightforward: higher values (white) correspond to pixels where the model struggles to | |
| make consistent predictions. | |
| - The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly. | |
| - The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the | |
| collar area. | |
| - Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel, | |
| unlike depth and surface normals. It is also higher in shaded regions and at discontinuities. | |
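
One way to picture how an ensemble yields a per-pixel uncertainty is to measure the spread of the individual predictions around their consensus. The sketch below uses the median absolute deviation as the spread; this is an illustrative choice and an assumption, not a statement of what the pipelines compute, and `median_and_spread` is a hypothetical helper.

```python
from statistics import median


def median_and_spread(samples):
    """Return the median of the samples and their median absolute deviation."""
    center = median(samples)
    spread = median([abs(s - center) for s in samples])
    return center, spread


# Five ensemble members at a pixel on a smooth surface: predictions agree.
smooth_center, smooth_spread = median_and_spread([0.30, 0.31, 0.29, 0.30, 0.31])
# Five ensemble members at a pixel on a depth discontinuity: predictions disagree.
edge_center, edge_spread = median_and_spread([0.30, 0.80, 0.31, 0.79, 0.55])
```

The pixel on the discontinuity receives a much larger spread, matching the observation that depth uncertainty concentrates where object depth changes abruptly.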
| ## Marigold Depth Prediction API[[diffusers.MarigoldDepthPipeline]] | |
| #### diffusers.MarigoldDepthPipeline[[diffusers.MarigoldDepthPipeline]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py#L104) | |
| Pipeline for monocular depth estimation using the Marigold method: https://marigoldmonodepth.github.io. | |
| This model inherits from [DiffusionPipeline](/docs/diffusers/pr_13745/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the | |
| library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) | |
| #### __call__[[diffusers.MarigoldDepthPipeline.__call__]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py#L347) | |
| `__call__(image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor], num_inference_steps: int | None = None, ensemble_size: int = 1, processing_resolution: int | None = None, match_input_resolution: bool = True, resample_method_input: str = 'bilinear', resample_method_output: str = 'bilinear', batch_size: int = 1, ensembling_kwargs: dict[str, typing.Any] | None = None, latents: torch.Tensor | list[torch.Tensor] | None = None, generator: torch._C.Generator | list[torch._C.Generator] | None = None, output_type: str = 'np', output_uncertainty: bool = False, output_latent: bool = False, return_dict: bool = True)` | |
| - **image** (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`) -- | |
| An input image or images used as an input for the depth estimation task. For | |
| arrays and tensors, the expected value range is between `[0, 1]`. Passing a batch of images is possible | |
| by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or | |
| three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the | |
| same width and height. | |
| - **num_inference_steps** (`int`, *optional*, defaults to `None`) -- | |
| Number of denoising diffusion steps during inference. The default value `None` results in automatic | |
| selection. | |
| - **ensemble_size** (`int`, defaults to `1`) -- | |
| Number of ensemble predictions. Higher values result in measurable accuracy improvements, but the predictions may become visually less crisp. | |
| - **processing_resolution** (`int`, *optional*, defaults to `None`) -- | |
| Effective processing resolution. When set to `0`, matches the larger input image dimension. This | |
| produces crisper predictions, but may also lead to the overall loss of global context. The default | |
| value `None` resolves to the optimal value from the model config. | |
| - **match_input_resolution** (`bool`, *optional*, defaults to `True`) -- | |
| When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer | |
| side of the output will be equal to `processing_resolution`. | |
| - **resample_method_input** (`str`, *optional*, defaults to `"bilinear"`) -- | |
| Resampling method used to resize input images to `processing_resolution`. The accepted values are: | |
| `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`. | |
| - **resample_method_output** (`str`, *optional*, defaults to `"bilinear"`) -- | |
| Resampling method used to resize output predictions to match the input resolution. The accepted values | |
| are `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`. | |
| - **batch_size** (`int`, *optional*, defaults to `1`) -- | |
| Batch size; only matters when setting `ensemble_size` or passing a tensor of images. | |
| - **ensembling_kwargs** (`dict`, *optional*, defaults to `None`) -- | |
| Extra dictionary with arguments for precise ensembling control. The following options are available: | |
| - reduction (`str`, *optional*, defaults to `"median"`): Defines the ensembling function applied in | |
| every pixel location, can be either `"median"` or `"mean"`. | |
| - regularizer_strength (`float`, *optional*, defaults to `0.02`): Strength of the regularizer that | |
| pulls the aligned predictions to the unit range from 0 to 1. | |
| - max_iter (`int`, *optional*, defaults to `2`): Maximum number of the alignment solver steps. Refer to | |
| `scipy.optimize.minimize` function, `options` argument. | |
| - tol (`float`, *optional*, defaults to `1e-3`): Alignment solver tolerance. The solver stops when the | |
| tolerance is reached. | |
| - max_res (`int`, *optional*, defaults to `None`): Resolution at which the alignment is performed; | |
| `None` matches the `processing_resolution`. | |
| - **latents** (`torch.Tensor`, or `list[torch.Tensor]`, *optional*, defaults to `None`) -- | |
| Latent noise tensors to replace the random initialization. These can be taken from the previous | |
| function call's output. | |
| - **generator** (`torch.Generator`, or `list[torch.Generator]`, *optional*, defaults to `None`) -- | |
| Random number generator object to ensure reproducibility. | |
| - **output_type** (`str`, *optional*, defaults to `"np"`) -- | |
| Preferred format of the output's `prediction` and the optional `uncertainty` fields. The accepted | |
| values are: `"np"` (numpy array) or `"pt"` (torch tensor). | |
| - **output_uncertainty** (`bool`, *optional*, defaults to `False`) -- | |
| When enabled, the output's `uncertainty` field contains the predictive uncertainty map, provided that | |
| the `ensemble_size` argument is set to a value above 2. | |
| - **output_latent** (`bool`, *optional*, defaults to `False`) -- | |
| When enabled, the output's `latent` field contains the latent codes corresponding to the predictions | |
| within the ensemble. These codes can be saved, modified, and used for subsequent calls with the | |
| `latents` argument. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a [MarigoldDepthOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldDepthOutput) instead of a plain tuple. | |
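The interplay of `processing_resolution` and `match_input_resolution` described above can be illustrated with a small helper. This is a minimal sketch, not the pipeline's code: the function name `plan_resolution` is hypothetical, the floor-division rounding of the scaled sides is an assumption, and the `None` case (which resolves to a value from the model config) is not handled.

```python
def plan_resolution(width, height, processing_resolution, match_input_resolution=True):
    """Return ((proc_w, proc_h), (out_w, out_h)) for an input of size width x height.

    Hypothetical helper; rounding by floor division is an assumption.
    """
    if processing_resolution < 0:
        raise ValueError("processing_resolution must be non-negative")
    longer = max(width, height)
    if processing_resolution == 0:
        # 0 means: process at the native size, so the longer side is unchanged.
        proc_w, proc_h = width, height
    else:
        # Scale both sides so that the longer side equals processing_resolution.
        proc_w = width * processing_resolution // longer
        proc_h = height * processing_resolution // longer
    # With match_input_resolution, the prediction is resized back to the input size.
    out = (width, height) if match_input_resolution else (proc_w, proc_h)
    return (proc_w, proc_h), out
```

For a 1024×512 input and `processing_resolution=768`, the sketch processes at 768×384 and returns a 1024×512 prediction unless `match_input_resolution` is disabled.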
| Function invoked when calling the pipeline. | |
| Examples: | |
| ```py | |
| >>> import diffusers | |
| >>> import torch | |
| >>> pipe = diffusers.MarigoldDepthPipeline.from_pretrained( | |
| ... "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ... ).to("cuda") | |
| >>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| >>> depth = pipe(image) | |
| >>> vis = pipe.image_processor.visualize_depth(depth.prediction) | |
| >>> vis[0].save("einstein_depth.png") | |
| >>> depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction) | |
| >>> depth_16bit[0].save("einstein_depth_16bit.png") | |
| ``` | |
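The ensembling and uncertainty arguments documented above can be combined as in the following sketch. The helper name `predict_with_uncertainty` is hypothetical and not part of diffusers; it relies only on the documented `ensemble_size`, `output_uncertainty`, `generator`, and `return_dict` arguments, so it works with any callable that follows the same calling convention.

```python
def predict_with_uncertainty(pipe, image, ensemble_size=5, generator=None):
    """Run a Marigold-style pipeline and return (prediction, uncertainty).

    Hypothetical convenience wrapper, not part of diffusers.
    """
    # The parameter docs state that uncertainty needs an ensemble size above 2.
    if ensemble_size <= 2:
        raise ValueError("ensemble_size must be above 2 to estimate uncertainty")
    # With return_dict=False the call returns a (prediction, uncertainty, latent) tuple.
    prediction, uncertainty, _latent = pipe(
        image,
        ensemble_size=ensemble_size,
        output_uncertainty=True,
        generator=generator,
        return_dict=False,
    )
    return prediction, uncertainty
```

With a loaded pipeline, a reproducible call could look like `predict_with_uncertainty(pipe, image, generator=torch.Generator(device="cuda").manual_seed(0))`.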
| **Parameters:** | |
| unet (`UNet2DConditionModel`) : Conditional U-Net to denoise the depth latent, conditioned on image latent. | |
| vae (`AutoencoderKL`) : Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations. | |
| scheduler (`DDIMScheduler` or `LCMScheduler`) : A scheduler to be used in combination with `unet` to denoise the encoded image latents. | |
| text_encoder (`CLIPTextModel`) : Text-encoder, for empty text embedding. | |
| tokenizer (`CLIPTokenizer`) : CLIP tokenizer. | |
| prediction_type (`str`, *optional*) : Type of predictions made by the model. | |
| scale_invariant (`bool`, *optional*) : A model property specifying whether the predicted depth maps are scale-invariant. This value must be set in the model config. When used together with the `shift_invariant=True` flag, the model is also called "affine-invariant". NB: overriding this value is not supported. | |
| shift_invariant (`bool`, *optional*) : A model property specifying whether the predicted depth maps are shift-invariant. This value must be set in the model config. When used together with the `scale_invariant=True` flag, the model is also called "affine-invariant". NB: overriding this value is not supported. | |
| default_denoising_steps (`int`, *optional*) : The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting `num_inference_steps`, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (`LCMScheduler`) and those with full diffusion schedules (`DDIMScheduler`). | |
| default_processing_resolution (`int`, *optional*) : The recommended value of the `processing_resolution` parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting `processing_resolution`, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values. | |
| **Returns:** | |
| [MarigoldDepthOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldDepthOutput) or `tuple` | |
| If `return_dict` is `True`, [MarigoldDepthOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldDepthOutput) is returned, otherwise a | |
| `tuple` is returned where the first element is the prediction, the second element is the uncertainty | |
| (or `None`), and the third is the latent (or `None`). | |
| #### diffusers.pipelines.marigold.MarigoldDepthOutput[[diffusers.pipelines.marigold.MarigoldDepthOutput]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py#L83) | |
| Output class for Marigold monocular depth prediction pipeline. | |
| **Parameters:** | |
| prediction (`np.ndarray`, `torch.Tensor`) : Predicted depth maps with values in the range [0, 1]. The shape is `numimages × 1 × height × width` for `torch.Tensor` or `numimages × height × width × 1` for `np.ndarray`. | |
| uncertainty (`None`, `np.ndarray`, `torch.Tensor`) : Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `numimages × 1 × height × width` for `torch.Tensor` or `numimages × height × width × 1` for `np.ndarray`. | |
| latent (`None`, `torch.Tensor`) : Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline. The shape is `numimages * numensemble × 4 × latentheight × latentwidth`. | |
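The shape conventions listed above differ between `torch.Tensor` (channels first) and `np.ndarray` (channels last) outputs. The following sketch restates them as code, which can be handy for sanity checks; both function names are hypothetical, and the latent height and width are taken as inputs because their relation to the image size is not specified here.

```python
def expected_depth_shape(num_images, height, width, output_type="np"):
    """Shape of `prediction` and `uncertainty` as described for MarigoldDepthOutput."""
    if output_type == "pt":
        return (num_images, 1, height, width)  # channels-first torch.Tensor
    if output_type == "np":
        return (num_images, height, width, 1)  # channels-last np.ndarray
    raise ValueError("output_type must be 'np' or 'pt'")


def expected_latent_shape(num_images, ensemble_size, latent_height, latent_width):
    """Shape of `latent`: one 4-channel code per ensemble member of every image."""
    return (num_images * ensemble_size, 4, latent_height, latent_width)
```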
| #### diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_depth[[diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_depth]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/marigold_image_processing.py#L387) | |
| Visualizes depth maps, such as predictions of the `MarigoldDepthPipeline`. | |
| Returns: `list[PIL.Image.Image]` with depth maps visualization. | |
| **Parameters:** | |
| depth (`PIL.Image.Image | np.ndarray | torch.Tensor | list[PIL.Image.Image] | list[np.ndarray] | list[torch.Tensor]`) : Depth maps. | |
| val_min (`float`, *optional*, defaults to `0.0`) : Minimum value of the visualized depth range. | |
| val_max (`float`, *optional*, defaults to `1.0`) : Maximum value of the visualized depth range. | |
| color_map (`str`, *optional*, defaults to `"Spectral"`) : Color map used to convert a single-channel depth prediction into a colored representation. | |
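How `val_min` and `val_max` bound the visualized range can be illustrated by the normalization step that typically precedes color mapping. The sketch below is an assumption about that step rather than the library's implementation: `normalize_depth_range` is a hypothetical name, and clamping out-of-range values to the ends of the range is an assumed behavior.

```python
def normalize_depth_range(values, val_min=0.0, val_max=1.0):
    """Rescale depth values so that [val_min, val_max] maps onto [0, 1].

    Hypothetical helper; values outside the range are clamped (an assumption).
    """
    if val_max <= val_min:
        raise ValueError("val_max must be greater than val_min")
    span = val_max - val_min
    return [min(1.0, max(0.0, (v - val_min) / span)) for v in values]
```

Narrowing the range, for example to `val_min=0.0, val_max=0.5`, spends the whole color map on the nearer half of the depth range.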
| ## Marigold Normals Estimation API[[diffusers.MarigoldNormalsPipeline]] | |
| #### diffusers.MarigoldNormalsPipeline[[diffusers.MarigoldNormalsPipeline]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py#L99) | |
| Pipeline for monocular normals estimation using the Marigold method: https://marigoldmonodepth.github.io. | |
| This model inherits from [DiffusionPipeline](/docs/diffusers/pr_13745/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the | |
| library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) | |
| #### diffusers.MarigoldNormalsPipeline.__call__[[diffusers.MarigoldNormalsPipeline.__call__]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py#L332) | |
| - **image** (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`) -- | |
| An input image or images used as an input for the normals estimation task. For | |
| arrays and tensors, the expected value range is between `[0, 1]`. Passing a batch of images is possible | |
| by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or | |
| three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the | |
| same width and height. | |
| - **num_inference_steps** (`int`, *optional*, defaults to `None`) -- | |
| Number of denoising diffusion steps during inference. The default value `None` results in automatic | |
| selection. | |
| - **ensemble_size** (`int`, defaults to `1`) -- | |
| Number of ensemble predictions. Higher values result in measurable improvements and visual degradation. | |
| - **processing_resolution** (`int`, *optional*, defaults to `None`) -- | |
| Effective processing resolution. When set to `0`, matches the larger input image dimension. This | |
| produces crisper predictions, but may also lead to the overall loss of global context. The default | |
| value `None` resolves to the optimal value from the model config. | |
| - **match_input_resolution** (`bool`, *optional*, defaults to `True`) -- | |
| When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer | |
| side of the output will be equal to `processing_resolution`. | |
| - **resample_method_input** (`str`, *optional*, defaults to `"bilinear"`) -- | |
| Resampling method used to resize input images to `processing_resolution`. The accepted values are: | |
| `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`. | |
| - **resample_method_output** (`str`, *optional*, defaults to `"bilinear"`) -- | |
| Resampling method used to resize output predictions to match the input resolution. The accepted values | |
| are `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`. | |
| - **batch_size** (`int`, *optional*, defaults to `1`) -- | |
| Batch size; only matters when setting `ensemble_size` or passing a tensor of images. | |
| - **ensembling_kwargs** (`dict`, *optional*, defaults to `None`) -- | |
| Extra dictionary with arguments for precise ensembling control. The following options are available: | |
| - reduction (`str`, *optional*, defaults to `"closest"`): Defines the ensembling function applied in | |
| every pixel location, can be either `"closest"` or `"mean"`. | |
| - **latents** (`torch.Tensor`, *optional*, defaults to `None`) -- | |
| Latent noise tensors to replace the random initialization. These can be taken from the previous | |
| function call's output. | |
| - **generator** (`torch.Generator`, or `list[torch.Generator]`, *optional*, defaults to `None`) -- | |
| Random number generator object to ensure reproducibility. | |
| - **output_type** (`str`, *optional*, defaults to `"np"`) -- | |
| Preferred format of the output's `prediction` and the optional `uncertainty` fields. The accepted | |
| values are: `"np"` (numpy array) or `"pt"` (torch tensor). | |
| - **output_uncertainty** (`bool`, *optional*, defaults to `False`) -- | |
| When enabled, the output's `uncertainty` field contains the predictive uncertainty map, provided that | |
| the `ensemble_size` argument is set to a value above 2. | |
| - **output_latent** (`bool`, *optional*, defaults to `False`) -- | |
| When enabled, the output's `latent` field contains the latent codes corresponding to the predictions | |
| within the ensemble. These codes can be saved, modified, and used for subsequent calls with the | |
| `latents` argument. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a [MarigoldNormalsOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldNormalsOutput) instead of a plain tuple. | |
| Function invoked when calling the pipeline. | |
| Examples: | |
| ```py | |
| >>> import diffusers | |
| >>> import torch | |
| >>> pipe = diffusers.MarigoldNormalsPipeline.from_pretrained( | |
| ... "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ... ).to("cuda") | |
| >>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| >>> normals = pipe(image) | |
| >>> vis = pipe.image_processor.visualize_normals(normals.prediction) | |
| >>> vis[0].save("einstein_normals.png") | |
| ``` | |
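One plausible reading of the `"closest"` reduction in `ensembling_kwargs` is: average the ensemble members, renormalize the average, and return the member most similar to it. The sketch below implements that reading for a single pixel. It is an assumption about the reduction, not the library's code, and the names `reduce_normals` and `_normalize` are hypothetical.

```python
import math


def _normalize(vector):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / norm for c in vector)


def reduce_normals(candidates, reduction="closest"):
    """Ensemble the normals predicted for one pixel by several ensemble members."""
    unit = [_normalize(v) for v in candidates]
    # Mean direction of the ensemble, renormalized to unit length.
    mean = _normalize(tuple(sum(v[i] for v in unit) / len(unit) for i in range(3)))
    if reduction == "mean":
        return mean
    if reduction == "closest":
        # Pick the member with the highest cosine similarity to the mean direction.
        return max(unit, key=lambda v: sum(a * b for a, b in zip(v, mean)))
    raise ValueError("reduction must be 'closest' or 'mean'")
```

Unlike `"mean"`, this reading of `"closest"` always returns one of the actual ensemble predictions, so the result is never a blend of disagreeing members.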
| **Parameters:** | |
| unet (`UNet2DConditionModel`) : Conditional U-Net to denoise the normals latent, conditioned on image latent. | |
| vae (`AutoencoderKL`) : Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations. | |
| scheduler (`DDIMScheduler` or `LCMScheduler`) : A scheduler to be used in combination with `unet` to denoise the encoded image latents. | |
| text_encoder (`CLIPTextModel`) : Text-encoder, for empty text embedding. | |
| tokenizer (`CLIPTokenizer`) : CLIP tokenizer. | |
| prediction_type (`str`, *optional*) : Type of predictions made by the model. | |
| use_full_z_range (`bool`, *optional*) : Whether the normals predicted by this model utilize the full range of the Z dimension, or only its positive half. | |
| default_denoising_steps (`int`, *optional*) : The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting `num_inference_steps`, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (`LCMScheduler`) and those with full diffusion schedules (`DDIMScheduler`). | |
| default_processing_resolution (`int`, *optional*) : The recommended value of the `processing_resolution` parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting `processing_resolution`, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values. | |
| **Returns:** | |
| [MarigoldNormalsOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldNormalsOutput) or `tuple` | |
| If `return_dict` is `True`, [MarigoldNormalsOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldNormalsOutput) is returned, otherwise a | |
| `tuple` is returned where the first element is the prediction, the second element is the uncertainty | |
| (or `None`), and the third is the latent (or `None`). | |
| #### diffusers.pipelines.marigold.MarigoldNormalsOutput[[diffusers.pipelines.marigold.MarigoldNormalsOutput]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py#L78) | |
| Output class for Marigold monocular normals prediction pipeline. | |
| **Parameters:** | |
| prediction (`np.ndarray`, `torch.Tensor`) : Predicted normals with values in the range [-1, 1]. The shape is `numimages × 3 × height × width` for `torch.Tensor` or `numimages × height × width × 3` for `np.ndarray`. | |
| uncertainty (`None`, `np.ndarray`, `torch.Tensor`) : Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `numimages × 1 × height × width` for `torch.Tensor` or `numimages × height × width × 1` for `np.ndarray`. | |
| latent (`None`, `torch.Tensor`) : Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline. The shape is `numimages * numensemble × 4 × latentheight × latentwidth`. | |
| #### diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_normals[[diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_normals]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/marigold_image_processing.py#L486) | |
| Visualizes surface normals, such as predictions of the `MarigoldNormalsPipeline`. | |
| Returns: `list[PIL.Image.Image]` with surface normals visualization. | |
| **Parameters:** | |
| normals (`np.ndarray | torch.Tensor | list[np.ndarray] | list[torch.Tensor]`) : Surface normals. | |
| flip_x (`bool`, *optional*, defaults to `False`) : Flips the X axis of the normals frame of reference. Default direction is right. | |
| flip_y (`bool`, *optional*, defaults to `False`) : Flips the Y axis of the normals frame of reference. Default direction is top. | |
| flip_z (`bool`, *optional*, defaults to `False`) : Flips the Z axis of the normals frame of reference. Default direction is facing the observer. | |
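The effect of the flip flags can be illustrated with the common convention of mapping normal components from [-1, 1] to 8-bit color channels. The sketch below assumes that convention and truncating conversion to integers; it is not taken from the library, and `normal_to_rgb` is a hypothetical name.

```python
def normal_to_rgb(normal, flip_x=False, flip_y=False, flip_z=False):
    """Map a normal with components in [-1, 1] to an 8-bit RGB triple.

    Hypothetical helper; the (n + 1) / 2 color mapping is an assumed convention.
    """
    signs = (
        -1.0 if flip_x else 1.0,
        -1.0 if flip_y else 1.0,
        -1.0 if flip_z else 1.0,
    )
    rgb = []
    for component, sign in zip(normal, signs):
        value = (component * sign + 1.0) * 0.5  # [-1, 1] -> [0, 1]
        value = min(1.0, max(0.0, value))       # guard against slight overshoot
        rgb.append(int(value * 255))            # truncate to an 8-bit channel
    return tuple(rgb)
```

Under this convention, a normal facing the observer, `(0, 0, 1)`, maps to the familiar bluish tone, and `flip_z=True` turns its blue channel to zero.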
| ## Marigold Intrinsic Image Decomposition API[[diffusers.MarigoldIntrinsicsPipeline]] | |
| #### diffusers.MarigoldIntrinsicsPipeline[[diffusers.MarigoldIntrinsicsPipeline]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py#L120) | |
| Pipeline for Intrinsic Image Decomposition (IID) using the Marigold method: | |
| https://marigoldcomputervision.github.io. | |
| This model inherits from [DiffusionPipeline](/docs/diffusers/pr_13745/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the | |
| library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) | |
| #### diffusers.MarigoldIntrinsicsPipeline.__call__[[diffusers.MarigoldIntrinsicsPipeline.__call__]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py#L359) | |
| - **image** (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`) -- | |
| An input image or images used as an input for the intrinsic decomposition task. | |
| For arrays and tensors, the expected value range is between `[0, 1]`. Passing a batch of images is | |
| possible by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or | |
| three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the | |
| same width and height. | |
| - **num_inference_steps** (`int`, *optional*, defaults to `None`) -- | |
| Number of denoising diffusion steps during inference. The default value `None` results in automatic | |
| selection. | |
| - **ensemble_size** (`int`, defaults to `1`) -- | |
| Number of ensemble predictions. Higher values result in measurable improvements and visual degradation. | |
| - **processing_resolution** (`int`, *optional*, defaults to `None`) -- | |
| Effective processing resolution. When set to `0`, matches the larger input image dimension. This | |
| produces crisper predictions, but may also lead to the overall loss of global context. The default | |
| value `None` resolves to the optimal value from the model config. | |
| - **match_input_resolution** (`bool`, *optional*, defaults to `True`) -- | |
| When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer | |
| side of the output will be equal to `processing_resolution`. | |
| - **resample_method_input** (`str`, *optional*, defaults to `"bilinear"`) -- | |
| Resampling method used to resize input images to `processing_resolution`. The accepted values are: | |
| `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`. | |
| - **resample_method_output** (`str`, *optional*, defaults to `"bilinear"`) -- | |
| Resampling method used to resize output predictions to match the input resolution. The accepted values | |
| are `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`. | |
| - **batch_size** (`int`, *optional*, defaults to `1`) -- | |
| Batch size; only matters when setting `ensemble_size` or passing a tensor of images. | |
| - **ensembling_kwargs** (`dict`, *optional*, defaults to `None`) -- | |
| Extra dictionary with arguments for precise ensembling control. The following options are available: | |
| - reduction (`str`, *optional*, defaults to `"median"`): Defines the ensembling function applied in | |
| every pixel location, can be either `"median"` or `"mean"`. | |
| - **latents** (`torch.Tensor`, *optional*, defaults to `None`) -- | |
| Latent noise tensors to replace the random initialization. These can be taken from the previous | |
| function call's output. | |
| - **generator** (`torch.Generator`, or `list[torch.Generator]`, *optional*, defaults to `None`) -- | |
| Random number generator object to ensure reproducibility. | |
| - **output_type** (`str`, *optional*, defaults to `"np"`) -- | |
| Preferred format of the output's `prediction` and the optional `uncertainty` fields. The accepted | |
| values are: `"np"` (numpy array) or `"pt"` (torch tensor). | |
| - **output_uncertainty** (`bool`, *optional*, defaults to `False`) -- | |
| When enabled, the output's `uncertainty` field contains the predictive uncertainty map, provided that | |
| the `ensemble_size` argument is set to a value above 2. | |
| - **output_latent** (`bool`, *optional*, defaults to `False`) -- | |
| When enabled, the output's `latent` field contains the latent codes corresponding to the predictions | |
| within the ensemble. These codes can be saved, modified, and used for subsequent calls with the | |
| `latents` argument. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a [MarigoldIntrinsicsOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) instead of a plain tuple. | |
| Function invoked when calling the pipeline. | |
| Examples: | |
| ```py | |
| >>> import diffusers | |
| >>> import torch | |
| >>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( | |
| ... "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ... ).to("cuda") | |
| >>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| >>> intrinsics = pipe(image) | |
| >>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) | |
| >>> vis[0]["albedo"].save("einstein_albedo.png") | |
| >>> vis[0]["roughness"].save("einstein_roughness.png") | |
| >>> vis[0]["metallicity"].save("einstein_metallicity.png") | |
| ``` | |
| ```py | |
| >>> import diffusers | |
| >>> import torch | |
| >>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained( | |
| ... "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16 | |
| ... ).to("cuda") | |
| >>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") | |
| >>> intrinsics = pipe(image) | |
| >>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties) | |
| >>> vis[0]["albedo"].save("einstein_albedo.png") | |
| >>> vis[0]["shading"].save("einstein_shading.png") | |
| >>> vis[0]["residual"].save("einstein_residual.png") | |
| ``` | |
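The `reduction` option of `ensembling_kwargs` is described as a function applied at every pixel location across the ensemble. A minimal sketch of such a per-pixel reduction, using plain Python lists in place of image tensors, could look as follows; the name `reduce_ensemble` and the flat-list representation are illustrative assumptions.

```python
from statistics import mean, median


def reduce_ensemble(members, reduction="median"):
    """Combine ensemble members pixel by pixel.

    `members` holds one flat list of pixel values per ensemble member.
    """
    if reduction not in ("median", "mean"):
        raise ValueError("reduction must be 'median' or 'mean'")
    if not members or len({len(m) for m in members}) != 1:
        raise ValueError("members must be non-empty lists of equal length")
    reducer = median if reduction == "median" else mean
    # zip(*members) groups the values that all members predicted for the same pixel.
    return [reducer(pixel_values) for pixel_values in zip(*members)]
```

The median is less sensitive to a single outlying ensemble member than the mean, which may be why it is the documented default.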
| **Parameters:** | |
| unet (`UNet2DConditionModel`) : Conditional U-Net to denoise the targets latent, conditioned on image latent. | |
| vae (`AutoencoderKL`) : Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent representations. | |
| scheduler (`DDIMScheduler` or `LCMScheduler`) : A scheduler to be used in combination with `unet` to denoise the encoded image latents. | |
| text_encoder (`CLIPTextModel`) : Text-encoder, for empty text embedding. | |
| tokenizer (`CLIPTokenizer`) : CLIP tokenizer. | |
| prediction_type (`str`, *optional*) : Type of predictions made by the model. | |
| target_properties (`dict[str, Any]`, *optional*) : Properties of the predicted modalities, such as `target_names`, a `list[str]` used to define the number, order and names of the predicted modalities, and any other metadata that may be required to interpret the predictions. | |
| default_denoising_steps (`int`, *optional*) : The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable quality with the given model. This value must be set in the model config. When the pipeline is called without explicitly setting `num_inference_steps`, the default value is used. This is required to ensure reasonable results with various model flavors compatible with the pipeline, such as those relying on very short denoising schedules (`LCMScheduler`) and those with full diffusion schedules (`DDIMScheduler`). | |
| default_processing_resolution (`int`, *optional*) : The recommended value of the `processing_resolution` parameter of the pipeline. This value must be set in the model config. When the pipeline is called without explicitly setting `processing_resolution`, the default value is used. This is required to ensure reasonable results with various model flavors trained with varying optimal processing resolution values. | |
| **Returns:** | |
| [MarigoldIntrinsicsOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) or `tuple` | |
| If `return_dict` is `True`, [MarigoldIntrinsicsOutput](/docs/diffusers/pr_13745/en/api/pipelines/marigold#diffusers.pipelines.marigold.MarigoldIntrinsicsOutput) is returned, otherwise a | |
| `tuple` is returned where the first element is the prediction, the second element is the uncertainty | |
| (or `None`), and the third is the latent (or `None`). | |
| #### diffusers.pipelines.marigold.MarigoldIntrinsicsOutput[[diffusers.pipelines.marigold.MarigoldIntrinsicsOutput]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py#L96) | |
| Output class for Marigold Intrinsic Image Decomposition pipeline. | |
| **Parameters:** | |
| prediction (`np.ndarray`, `torch.Tensor`) : Predicted image intrinsics with values in the range [0, 1]. The shape is `(numimages * numtargets) × 3 × height × width` for `torch.Tensor` or `(numimages * numtargets) × height × width × 3` for `np.ndarray`, where `numtargets` corresponds to the number of predicted target modalities of the intrinsic image decomposition. | |
| uncertainty (`None`, `np.ndarray`, `torch.Tensor`) : Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is `(numimages * numtargets) × 3 × height × width` for `torch.Tensor` or `(numimages * numtargets) × height × width × 3` for `np.ndarray`. | |
| latent (`None`, `torch.Tensor`) : Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline. The shape is `(numimages * numensemble) × (numtargets * 4) × latentheight × latentwidth`. | |
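Because the leading dimension of `prediction` has size `numimages * numtargets`, downstream code often needs to regroup it per image. The sketch below does so under the assumption that entries are stored image by image, with the targets of one image adjacent to each other; this ordering is not stated above and should be verified against the pipeline. The name `split_intrinsics` is hypothetical.

```python
def split_intrinsics(prediction, target_names):
    """Group a flat sequence of per-target predictions into one dict per image.

    Assumes image-major order: all targets of image 0, then all targets of image 1, ...
    """
    num_targets = len(target_names)
    if num_targets == 0 or len(prediction) % num_targets != 0:
        raise ValueError("prediction length must be a multiple of the number of targets")
    return [
        dict(zip(target_names, prediction[start:start + num_targets]))
        for start in range(0, len(prediction), num_targets)
    ]
```

The helper only relies on `len` and slicing along the first dimension, so it applies to lists as well as to arrays or tensors; the target names would come from `pipe.target_properties["target_names"]`.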
| #### diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_intrinsics[[diffusers.pipelines.marigold.MarigoldImageProcessor.visualize_intrinsics]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/marigold/marigold_image_processing.py#L542) | |
| Visualizes intrinsic image decomposition, such as predictions of the `MarigoldIntrinsicsPipeline`. | |
| Returns: `list[dict[str, PIL.Image.Image]]` with intrinsic image decomposition visualization. | |
| **Parameters:** | |
| prediction (`np.ndarray | torch.Tensor | list[np.ndarray] | list[torch.Tensor]`) : Intrinsic image decomposition. | |
| target_properties (`dict[str, Any]`) : Decomposition properties. Expected entries: `target_names: list[str]` and a dictionary with keys `prediction_space: str`, `sub_target_names: list[str | Null]` (must have 3 entries, null for missing modalities), `up_to_scale: bool`, one for each target and sub-target. | |
| color_map (`str | dict[str, str]`, *optional*, defaults to `"Spectral"`) : Color map used to convert single-channel predictions into colored representations. When a dictionary is passed, each modality can be colored with its own color map. | |