Update README.md

The LDM3D model was proposed in ["LDM3D: Latent Diffusion Model for 3D"](https://arxiv.org/abs/2305.10853).

LDM3D got accepted to [CVPRW'23](https://cvpr2023.thecvf.com/).

This checkpoint was finetuned on two panoramic-image datasets, [polyhaven](https://polyhaven.com/) and [ihdri](https://www.ihdri.com/hdri-skies-outdoor/), detailed in the Finetuning section below.

These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13852 training samples and 1606 validation samples.

Here is how to use this model to get the features of a given text in PyTorch:

```python
from diffusers import StableDiffusionLDM3DPipeline

pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
pipe.to("cuda")

prompt = "360 view of a large bedroom"
name = "bedroom_pano"

output = pipe(prompt, width=1024, height=512)
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save(name+"_ldm3d_rgb.jpg")
depth_image[0].save(name+"_ldm3d_depth.png")
```

This is the result:



### Limitations and bias

### Finetuning

This checkpoint finetunes the previous [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c) on two panoramic-image datasets:
- [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set
- [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set.

These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13852 training samples and 1606 validation samples.

To generate the depth maps for these samples, we used [DPT-large](https://github.com/isl-org/MiDaS), and to generate the captions we used [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2).
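
For illustration only, here is a minimal sketch of this kind of preprocessing using the Hugging Face transformers pipelines. This is not the authors' exact setup: the checkpoints `Intel/dpt-large` and `Salesforce/blip2-opt-2.7b` and the input filename are assumptions.

```python
# Illustrative preprocessing sketch (not the authors' exact pipeline).
# The checkpoints "Intel/dpt-large" and "Salesforce/blip2-opt-2.7b" are
# assumptions; any DPT / BLIP-2 checkpoints could be substituted.
from PIL import Image
from transformers import pipeline

image = Image.open("sample_pano.jpg")  # hypothetical training image

# Depth map via monocular depth estimation with DPT
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth = depth_estimator(image)["depth"]  # PIL image of predicted depth
depth.save("sample_pano_depth.png")

# Caption via BLIP-2
captioner = pipeline("image-to-text", model="Salesforce/blip2-opt-2.7b")
caption = captioner(image)[0]["generated_text"]
print(caption)
```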

### BibTeX entry and citation info

```bibtex