Title: Exploring Plain Diffusion Transformers for 3D Shape Generation

URL Source: https://arxiv.org/html/2307.01831

Shentong Mo¹  Enze Xie²  Ruihang Chu³  Lewei Yao²

Lanqing Hong²  Matthias Nießner⁴  Zhenguo Li²

¹MBZUAI, ²Huawei Noah’s Ark Lab, ³CUHK, ⁴TUM

https://DiT-3D.github.io

Abstract

Recent Diffusion Transformers (e.g., DiT [Peebles2022DiT]) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape generation, namely DiT-3D, which can directly operate the denoising process on voxelized point clouds using plain Transformers. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher-quality generations. Specifically, DiT-3D adopts the design philosophy of DiT [Peebles2022DiT] but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. To reduce the computational cost of self-attention in 3D shape generation, we incorporate 3D window attention into the Transformer blocks, as the increased 3D token length resulting from the additional voxel dimension can lead to high computation. Finally, linear and devoxelization layers are used to predict the denoised point clouds. In addition, our transformer architecture supports efficient fine-tuning from 2D to 3D, where a DiT-2D checkpoint pre-trained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy of the state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.


Figure 1: Examples of high-fidelity and diverse 3D point clouds produced from DiT-3D.

1 Introduction

In recent times, there has been growing interest in exploring the potential of diffusion transformers for high-fidelity image generation, as evinced by a series of scholarly works [Peebles2022DiT; bao2022all; bao2023transformer; xie2023difffit]. Notably, a seminal work by Peebles et al. [Peebles2022DiT] proposed replacing the widely used U-Net backbone with a scalable transformer. Specifically, the proposed method operates on latent patches by training latent 2D diffusion models. However, the efficacy of plain diffusion transformers for 3D shape generation has yet to be explored, as most existing 3D diffusion approaches continue to adopt the U-Net backbone.

Generating high-fidelity point clouds for 3D shape generation is a challenging and significant problem. Early generative methods [Fan2017a; Groueix2018a; Kurenkov2018DeformNet] addressed this problem by directly optimizing heuristic loss objectives, such as Chamfer Distance (CD) and Earth Mover’s Distance (EMD). More recent works [achlioptas2018learning; yang2019pointflow; Kim2020SoftFlowPF; Klokov2020dpfnet] have explored generative adversarial network (GAN)-based and flow-based models to generate 3D point clouds from a probabilistic perspective. Recently, researchers [zhou2021pvd; zeng2022lion; gao2022get3d; liu2023meshdiffusion] have turned to various denoising diffusion probabilistic models (DDPMs) to generate entire shapes from random noise. For instance, PVD [zhou2021pvd] employed the point-voxel representation of 3D shapes as input to DDPMs and reversed the diffusion process from observed point clouds to Gaussian noise by optimizing a variational lower bound on the likelihood function. Recently, the Diffusion Transformer (DiT) [Peebles2022DiT; bao2022all] has been shown to surpass the U-Net architecture in 2D image generation, owing to its simple design and superior generative performance. Consequently, we investigate the potential of the Diffusion Transformer for 3D generation. However, extending the 2D DiT to 3D poses two significant challenges: (1) point clouds are intrinsically unordered, unlike images where pixels are ordered; and (2) tokens in 3D space have an additional dimension compared to 2D images, resulting in a substantial increase in computational cost.

This work introduces DiT-3D, a novel diffusion transformer architecture designed for 3D shape generation that leverages the denoising process of DDPM on 3D point clouds. The proposed model inherits the simple design of the modules in DiT-2D, with only minor adaptations to enable it to generalize to 3D generation tasks. To tackle the challenge posed by the unordered data structure of point clouds, we convert the point cloud into a voxel representation. DiT-3D employs 3D positional embedding and 3D patch embedding on the voxelized point clouds to extract point-voxel features and effectively process the unordered data. Furthermore, to address the computational cost associated with the large number of tokens in 3D space, we introduce a 3D window attention operator in place of the vanilla global attention in DiT-2D. This operator significantly reduces training time and memory usage, making DiT-3D feasible for large-scale 3D generation tasks. Finally, we utilize linear and devoxelization layers in the reverse process to predict the noise on the point clouds and generate the final 3D shapes.

To further reduce training cost, we also introduce a parameter-efficient tuning method that utilizes a DiT-2D model pre-trained on ImageNet as initialization for DiT-3D (window attention shares its parameters with vanilla attention). Benefiting from the substantial similarity between the network structure and parameters of DiT-3D and DiT-2D, the representations learned on ImageNet significantly improve 3D generation, despite the significant domain disparity between 2D images and 3D point clouds. To our knowledge, we are the first to achieve parameter-efficient fine-tuning from 2D ImageNet pre-trained weights for high-fidelity and diverse 3D shape generation. In particular, we reduce the trainable parameters from 32.8 MB to only 0.09 MB.

We present a comprehensive evaluation of DiT-3D on a diverse set of object classes in the ShapeNet benchmark, where it achieves state-of-the-art performance compared to previous non-DDPM and DDPM-based 3D shape generation methods. Qualitative visualizations further emphasize the efficacy of DiT-3D in generating high-fidelity 3D shapes. Extensive ablation studies confirm the significance of 3D positional embeddings, window attention, and 2D pre-training in 3D shape generation. Moreover, we demonstrate that DiT-3D is easily scalable regarding patch sizes, voxel sizes, and model sizes. Our findings align with those of DiT-2D, where increasing the model size leads to continuous performance improvements. In addition, our parameter-efficient fine-tuning from DiT-2D ImageNet pre-trained weights greatly reduces the trainable parameters while achieving competitive generation performance. By training only 0.09 MB of parameters when transferring from a source class to a target class, we also achieve comparable quality and diversity in terms of all metrics.

Our main contributions can be summarized as follows:

  • We present DiT-3D, the first plain diffusion transformer architecture for point cloud shape generation that can effectively perform denoising operations on voxelized point clouds.

  • We make several simple yet effective modifications in DiT-3D, including 3D positional and patch embeddings, 3D window attention, and 2D pre-training on ImageNet. These modifications significantly improve the performance of DiT-3D while maintaining efficiency.

  • Extensive experiments on the ShapeNet dataset demonstrate the state-of-the-art superiority of DiT-3D over previous non-DDPM and DDPM baselines in generating high-fidelity shapes.

2 Related Work

3D Shape Generation. 3D shape generation aims to synthesize high-fidelity point clouds or meshes using generative models, such as variational autoencoders [Yang2018foldingnet; gadelha2018multiresolution; Kim2021SetVAE], generative adversarial networks [valsesia2019learning; achlioptas2018learning; Shu2019pointcloud], and normalizing flows [yang2019pointflow; Kim2020SoftFlowPF; Klokov2020dpfnet]. Typically, PointFlow [yang2019pointflow] utilized a probabilistic framework based on continuous normalizing flows to generate 3D point clouds from two-level hierarchical distributions. ShapeGF [cai2020learning] trained a score-matching energy-based network to learn the distribution of points across gradient fields using Langevin dynamics. More recently, GET3D [gao2022get3d] leveraged a signed distance field (SDF) and a texture field as two latent codes to learn a generative model that directly generates 3D meshes. In this work, we mainly focus on denoising diffusion probabilistic models for generating high-fidelity 3D point clouds starting from random noise, where point and shape distributions are not separated.

Diffusion Models. Diffusion models [ho2020denoising; song2021scorebased; song2021denoisingdi] have been demonstrated to be effective in many generative tasks, such as image generation [saharia2022photorealistic], image restoration [saharia2021image], speech generation [kong2021diffwave], and video generation [ho2022imagen]. Denoising diffusion probabilistic models (DDPMs) [ho2020denoising; song2021scorebased] utilize a forward noising process that gradually adds Gaussian noise to images and train a reverse process that inverts the forward process. In recent years, researchers [luo2021dpm; zhou2021pvd; zeng2022lion; nam20223dldm; liu2023meshdiffusion; li2023diffusionsdf; chu2023diffcomplete] have explored diverse pipelines based on diffusion probabilistic models for 3D shape generation. For example, PVD [zhou2021pvd] applied DDPM based on PVCNNs [liu2019pvcnn] to the point-voxel representation of 3D shapes, bringing structured locality into point clouds. To improve generation quality, LION [zeng2022lion] used two DDPMs to separately learn a hierarchical latent space based on a global shape latent representation and a point-structured latent space. Different from them, our approach solves the 3D shape generation problem by designing a plain transformer-based backbone to replace the U-Net backbone for reversing the diffusion process from observed point clouds to Gaussian noise. Meanwhile, our plain 3D diffusion transformer supports multi-class training with learnable class embeddings as the condition, as well as parameter-efficient fine-tuning with modality and domain transferability, which differs from the DDPM-based 3D generation approaches discussed above.

Transformers in Diffusion Generation. Diffusion Transformers [Peebles2022DiT; bao2022all; bao2023transformer; xie2023difffit] have recently shown an impressive capacity to generate high-fidelity images. For instance, the Diffusion Transformer (DiT) [Peebles2022DiT] proposed a plain diffusion Transformer architecture to learn the denoising diffusion process on latent patches from a pre-trained variational autoencoder, as in Stable Diffusion [Rombach2022highresolution]. U-ViT [bao2022all] incorporated the time, condition, and noisy image patches as tokens and utilized a Vision Transformer (ViT) [Dosovitskiy2021vit]-based architecture with long skip connections between shallow and deep layers. More recently, UniDiffuser [bao2023transformer] designed a unified transformer for diffusion models to handle inputs of different modalities by learning all distributions simultaneously. While these diffusion transformer approaches achieve promising performance in 2D image generation, how a plain diffusion transformer performs on 3D shape generation remains an open question. In contrast, we develop a novel plain diffusion transformer for 3D shape generation that can effectively perform denoising operations on voxelized point clouds. Furthermore, the proposed DiT-3D supports parameter-efficient fine-tuning with transferability across modality and domain.


Figure 2: Illustration of the proposed Diffusion Transformer (DiT-3D) for 3D shape generation. The plain diffusion transformer takes voxelized point clouds as input, and a patchification operator generates token-level patch embeddings, to which 3D positional embeddings are added. Then, multiple transformer blocks based on 3D window attention extract point-voxel representations from all input tokens. Finally, the unpatchified voxel tensor output from a linear layer is devoxelized to predict the noise in the point cloud space.

3 Method

Given a set of 3D point clouds, we aim to learn a plain diffusion transformer for synthesizing new high-fidelity point clouds. We propose a novel diffusion transformer, namely DiT-3D, that operates the denoising process of DDPM on voxelized point clouds. It consists of two main parts: the design of DiT for 3D point cloud generation (Section 3.2) and efficient modality/domain transfer with parameter-efficient fine-tuning (Section 3.3).

3.1 Preliminaries

In this section, we first describe the problem setup and notations and then revisit denoising diffusion probabilistic models (DDPMs) for 3D shape generation and diffusion transformers on 2D images.

Problem Setup and Notations. Given a set $\mathcal{S}=\{\mathbf{p}_i\}_{i=1}^{S}$ of 3D shapes with $M$ classes, our goal is to train a plain diffusion transformer on these point clouds to generate high-fidelity point clouds. Each point cloud $\mathbf{p}_i\in\mathbb{R}^{N\times 3}$ consists of $N$ points with $x,y,z$ coordinates. Each 3D shape $\mathbf{p}_i$ also has a ground-truth class label $y_i\in\{1,\dots,M\}$. During training, we take the class label as input to achieve classifier-free guidance in conditional diffusion models, following the prior diffusion transformer (i.e., DiT [Peebles2022DiT]) on images.

Revisit DDPMs on 3D Shape Generation. To solve the 3D shape generation problem, previous work [zhou2021pvd] based on denoising diffusion probabilistic models (DDPMs) defines a forward noising process that gradually applies noise to real data $\mathbf{x}_0$ as $q(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_t;\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\beta_t\mathbf{I})$, where $\beta_t$ is a noise-schedule value between 0 and 1. The denoising process produces a series of shape variables with decreasing levels of noise, denoted $\mathbf{x}_T,\mathbf{x}_{T-1},\dots,\mathbf{x}_0$, where $\mathbf{x}_T$ is sampled from a Gaussian prior and $\mathbf{x}_0$ is the final output. With the reparameterization trick, we have $\mathbf{x}_t=\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, $\alpha_t=1-\beta_t$, and $\bar{\alpha}_t=\prod_{i=1}^{t}\alpha_i$.
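As a brief illustration (not the authors' code), the closed-form forward sample above can be sketched in NumPy; the linear `betas` schedule and the point-cloud size are assumptions for the example:

```python
import numpy as np

def forward_sample(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form via the reparameterization trick."""
    alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
    alpha_bar = np.cumprod(alphas)       # alpha_bar_t = prod_{i<=t} alpha_i
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2048, 3))      # one point cloud with N = 2048 points
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule, T = 1000
xt, eps = forward_sample(x0, 500, betas, rng)
```

As `t` grows, `alpha_bar[t]` shrinks toward zero, so `xt` interpolates from the clean shape toward pure Gaussian noise.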

For the reverse process, diffusion models are trained to learn a denoising network $\boldsymbol{\theta}$ that inverts the forward-process corruption as $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}(\mathbf{x}_{t-1};\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t,t),\sigma_t^2\mathbf{I})$. The training objective minimizes a variational bound on the negative log data likelihood that involves all of $\mathbf{x}_0,\dots,\mathbf{x}_T$:

$$\mathcal{L}=\sum_t -\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)+\mathcal{D}_{\text{KL}}\big(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)\big) \qquad (1)$$

where $\mathcal{D}_{\text{KL}}(\cdot\|\cdot)$ denotes the KL divergence measuring the distance between two distributions. Since both $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ and $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ are Gaussian, we can reparameterize $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t,t)$ to predict the noise $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}_t,t)$. The training objective then reduces to a simple mean-squared error between the model output $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}_t,t)$ and the ground-truth Gaussian noise $\boldsymbol{\epsilon}$: $\mathcal{L}_{\text{simple}}=\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}_t,t)\|^2$. Once $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is trained, new point clouds can be generated by initializing $\mathbf{x}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and progressively sampling $\mathbf{x}_{t-1}\sim p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ via the reparameterization trick.
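The simplified objective and one ancestral sampling step can be sketched as follows; the zero-noise "predictor" is a hypothetical stand-in for the trained transformer, used only to make the example self-contained:

```python
import numpy as np

def simple_loss(eps_pred, eps):
    """L_simple = || eps - eps_theta(x_t, t) ||^2 (mean over elements)."""
    return np.mean((eps - eps_pred) ** 2)

def p_sample_step(xt, t, eps_pred, betas, rng):
    """One reverse step x_{t-1} ~ p_theta(x_{t-1} | x_t) using the DDPM posterior mean."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:                            # final step is deterministic
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
xt = rng.standard_normal((2048, 3))       # x_T ~ N(0, I)
eps_pred = np.zeros_like(xt)              # toy predictor: always predicts zero noise
loss = simple_loss(eps_pred, rng.standard_normal(xt.shape))
x_prev = p_sample_step(xt, 999, eps_pred, betas, rng)
```

Iterating `p_sample_step` from `t = T-1` down to `t = 0` with a real noise predictor yields the generated point cloud $\mathbf{x}_0$.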

Revisit Diffusion Transformer (DiT) on 2D Image Generation. To generate high-fidelity 2D images, DiT trains latent diffusion models (LDMs) with a Transformer backbone, consisting of two stages. An autoencoder with an encoder $f_{\text{enc}}(\cdot)$ and a decoder $f_{\text{dec}}(\cdot)$ first extracts a latent code $\mathbf{z}=f_{\text{enc}}(\mathbf{x})$ from an image sample $\mathbf{x}$; the decoder reconstructs the image from the latent code as $\hat{\mathbf{x}}=f_{\text{dec}}(\mathbf{z})$. A latent diffusion transformer with multiple designed blocks is then trained on the latent codes $\mathbf{z}$, conditioned on a time embedding $\mathbf{t}$ and a class embedding $\mathbf{c}$, where each block contains a self-attention and a feed-forward module. A patchification operator on the latent code $\mathbf{z}$ extracts a sequence of patch embeddings, and a depatchification operator predicts the denoised latent code.

Although DDPMs have achieved promising performance on 3D shape generation, they can only handle single-class training, using PVCNNs [liu2019pvcnn] as the encoder to extract 3D representations, and they cannot learn explicit class-conditional embeddings. Furthermore, their single-class pre-trained models cannot be directly transferred to new classes with parameter-efficient fine-tuning. Meanwhile, we empirically observe that a direct extension of DiT [Peebles2022DiT] to point clouds does not work. To address this problem, we propose a novel plain diffusion transformer for 3D shape generation that can effectively perform the denoising process on voxelized point clouds, as illustrated in Figure 2.

3.2 Diffusion Transformer for 3D Point Cloud Generation

To enable denoising operations with a plain diffusion transformer, we propose several adaptations for 3D point cloud generation (Figure 2) within the framework of DiT [Peebles2022DiT]. Specifically, our DiT-3D model accepts voxelized point clouds as input and employs a patchification operator to generate token-level patch embeddings. We add 3D positional embeddings to these embeddings and extract point-voxel representations from all input tokens using multiple transformer blocks based on 3D window attention. Finally, the unpatchified voxel output of a linear layer is devoxelized to predict the noise in the point cloud space.

Denoising on Voxelized Point Clouds. Point clouds are inherently unordered, unlike images where pixels follow a specific order. Our attempts to train a diffusion transformer directly on point coordinates were hindered by the sparse distribution of points in the 3D embedding space. To address this issue, we voxelize the point clouds into dense representations, allowing the diffusion transformer to extract point-voxel features. Our approach differs from DiT [Peebles2022DiT], which utilizes latent codes $\mathbf{z}$ to train a latent diffusion transformer; instead, we directly train the denoising process on voxelized point clouds. For each point cloud $\mathbf{p}_i\in\mathbb{R}^{N\times 3}$ with $N$ points for $x,y,z$ coordinates, we first voxelize it as input $\mathbf{v}_i\in\mathbb{R}^{V\times V\times V\times 3}$.
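The voxelization step can be sketched as scattering each point into a dense $V\times V\times V\times 3$ grid and averaging the coordinates of points that fall into the same cell. This is a simplified stand-in for the paper's point-voxel conversion; the grid size and min-max normalization are assumptions:

```python
import numpy as np

def voxelize(points, V=32):
    """Average point coordinates into a dense V x V x V x 3 grid."""
    # Normalize points into [0, 1), then map to integer voxel indices.
    p = (points - points.min(0)) / (np.ptp(points, 0) + 1e-8)
    idx = np.clip((p * V).astype(int), 0, V - 1)
    grid = np.zeros((V, V, V, 3))
    count = np.zeros((V, V, V, 1))
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), points)   # scatter-add coordinates
    np.add.at(count, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)     # count points per cell
    return grid / np.maximum(count, 1.0)  # mean coordinate per occupied voxel

rng = np.random.default_rng(0)
pc = rng.uniform(-1, 1, size=(2048, 3))   # N = 2048 points
vox = voxelize(pc, V=32)
```

Empty voxels stay zero, giving the transformer a dense, ordered tensor in place of the unordered point set.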

3D Positional and Patch Embeddings. Given the voxel input $\mathbf{v}_i\in\mathbb{R}^{V\times V\times V\times 3}$, we introduce a patchification operator with a patch size $p\times p\times p$ to generate a sequence of patch tokens $\mathbf{t}\in\mathbb{R}^{L\times 3}$, where $L=(V/p)^3$ denotes the total number of patchified tokens. A 3D convolution layer is applied to the patch tokens to extract patch embeddings $\mathbf{e}\in\mathbb{R}^{L\times D}$, where $D$ is the embedding dimension. To adapt to our voxelized point clouds, we add frequency-based sine-cosine 3D positional embeddings, instead of the 2D version in DiT [Peebles2022DiT], to all input tokens. On top of these patch-level tokens, we introduce time embeddings $\mathbf{t}$ and class embeddings $\mathbf{c}$ as input to achieve multi-class training with learnable class embeddings as the condition, which differs from existing 3D generation approaches with a U-Net backbone.
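One plausible construction of the frequency-based sine-cosine 3D positional embedding, sketched here under the assumption that the embedding dimension is split evenly across the three axes (so $D$ must be divisible by 6), concatenates a standard 1D sin-cos embedding per axis:

```python
import numpy as np

def sincos_1d(pos, dim):
    """Standard 1D sine-cosine embedding for integer positions; dim must be even."""
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    out = pos[:, None] * omega[None, :]            # (num_pos, dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

def posemb_3d(grid, dim):
    """Concatenate one 1D embedding per (x, y, z) axis over a grid^3 token lattice."""
    coords = np.arange(grid)
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    embs = [sincos_1d(a.ravel(), dim // 3) for a in (x, y, z)]
    return np.concatenate(embs, axis=1)            # (grid^3, dim)

# Example: V = 32, patch size p = 4 -> (V/p)^3 = 512 tokens with D = 384 channels.
pe = posemb_3d(8, 384)
```

Each token's embedding encodes its $(x, y, z)$ patch coordinates at multiple frequencies, mirroring the 2D version used in DiT.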

3D Window Attention. Due to the increased token length resulting from the additional dimension in 3D space, the computational cost of 3D Transformers can be significantly high. To address this issue, we introduce efficient 3D window attention into the Transformer blocks to propagate point-voxel features with efficient memory usage. In the original multi-head self-attention, each head's $Q,K,V$ has the same dimensions $L\times D$, where $L=(V/p)^3$ is the length of the input tokens, and the attention operator is:

$$\text{Attention}(Q,K,V)=\text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{D_h}}\right)V \qquad (2)$$

where $D_h$ is the dimension size of each head. The computational complexity of this process is $\mathcal{O}(L^2)$, which becomes prohibitively expensive for high voxel resolutions. Inspired by [Li2022ExploringPV], we extend the 2D window attention operator to a 3D one for 3D input tokens, instead of using vanilla global attention. This process uses a window size $R$ to reduce the length of the total input tokens as

$$\hat{K}=\text{Reshape}\left(\frac{L}{R^3},\,D\cdot R^3\right),\qquad K=\text{Linear}(D\cdot R^3,\,D)(\hat{K}) \qquad (3)$$

where $K$ denotes the input tokens to be reduced. $\mbox{Reshape}\left(\frac{L}{R^3}, D\cdot R^3\right)$ reshapes $K$ into a tensor of shape $\frac{L}{R^3}\times(D\cdot R^3)$, and $\mbox{Linear}(C_{in}, C_{out})(\cdot)$ denotes a linear layer that takes a $C_{in}$-dimensional tensor as input and produces a $C_{out}$-dimensional tensor as output. The new $K$ thus has shape $\frac{L}{R^3}\times D$. As a result, the complexity of the self-attention operator in Equation (2) is reduced from $\mathcal{O}(L^2)$ to $\mathcal{O}\left(\frac{L^2}{R^3}\right)$. In our experiments, we set $R$ to 4 by default.
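The token reduction in Equation (3) can be sketched in a few lines of PyTorch. The helper names (`make_token_reducer`, `reduce_tokens`) are our own for illustration; the sketch only shows the Reshape-then-Linear pattern the text describes, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

def make_token_reducer(dim: int, window: int) -> nn.Linear:
    # Linear(D * R^3, D): projects each flattened window of R^3 tokens back to D channels.
    return nn.Linear(dim * window ** 3, dim)

def reduce_tokens(x: torch.Tensor, reducer: nn.Linear, window: int) -> torch.Tensor:
    """Reshape (B, L, D) tokens into L/R^3 windows of R^3 tokens each,
    then mix every flattened window down to one D-dim token (Eq. 3)."""
    B, L, D = x.shape
    R3 = window ** 3
    assert L % R3 == 0, "token length must be divisible by R^3"
    x = x.reshape(B, L // R3, D * R3)   # Reshape(L/R^3, D * R^3)
    return reducer(x)                    # Linear(D * R^3, D)
```

With the default setting $V{=}32$, $p{=}4$, there are $L=(32/4)^3=512$ tokens, and $R{=}4$ shrinks the keys/values attended over to $512/4^3=8$ tokens, which is where the $\mathcal{O}(L^2/R^3)$ saving comes from.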

Devoxelized Prediction. Since the Transformer blocks operate on voxelized point clouds, we cannot directly use a standard linear decoder to predict the output noise $\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_t,t)$ in point-cloud space. Instead, we devoxelize the output tokens of the linear decoder. We first apply the final layer norm and linearly decode each of the $L$ tokens into a $p\times p\times p\times 3$ tensor. We then unpatchify the decoded tokens into a voxel tensor of shape $V\times V\times V\times 3$. Finally, the unpatchified voxel tensor is devoxelized into an $N\times 3$ tensor as the output noise $\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_t,t)$, matching the ground-truth Gaussian noise $\bm{\epsilon}$ in point-cloud space.
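The unpatchify-then-devoxelize step above can be sketched as follows. `unpatchify` follows the shape bookkeeping in the text; `devoxelize_nearest` is a deliberately simplified stand-in that looks up the nearest voxel per point, whereas the paper's devoxelization (following point-voxel representations such as PVCNN [33]) would use trilinear interpolation. Both function names are ours.

```python
import torch

def unpatchify(tokens: torch.Tensor, voxel: int, patch: int) -> torch.Tensor:
    """(B, L, p^3 * 3) decoded tokens -> (B, 3, V, V, V) voxel tensor, L = (V/p)^3."""
    B, L, _ = tokens.shape
    g = voxel // patch                        # patches per axis
    assert L == g ** 3
    x = tokens.reshape(B, g, g, g, patch, patch, patch, 3)
    x = x.permute(0, 7, 1, 4, 2, 5, 3, 6)     # (B, 3, g, p, g, p, g, p)
    return x.reshape(B, 3, voxel, voxel, voxel)

def devoxelize_nearest(vox: torch.Tensor, coords: torch.Tensor, voxel: int) -> torch.Tensor:
    """Nearest-voxel lookup (a simplification of trilinear devoxelization):
    coords in [0, 1]^3 of shape (B, N, 3) -> per-point noise of shape (B, N, 3)."""
    idx = (coords * (voxel - 1)).round().long().clamp(0, voxel - 1)
    out = []
    for b in range(vox.shape[0]):
        i, j, k = idx[b].unbind(-1)           # three (N,) index vectors
        out.append(vox[b][:, i, j, k].T)      # (3, N) -> (N, 3)
    return torch.stack(out)
```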

Model Scaling.

Our DiT-3D is designed to be scalable, adapting to varying voxel sizes, patch sizes, and model sizes. Specifically, it can flexibly accommodate voxel sizes of 16, 32, 64, patch sizes of 2, 4, 8, and model complexities ranging from Small and Base to Large and Extra Large, as demonstrated in DiT [1]. For instance, DiT-3D-S/4 denotes the Small configuration of the DiT model [1] with a patch size $p$ of 4.
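For reference, the naming scheme can be unpacked as below. The depth/hidden/head values mirror the S/B/L/XL configurations reported in the original DiT paper [1]; DiT-3D's exact hyperparameters may differ, so treat this table as an assumption. `model_spec` also computes the token length $L=(V/p)^3$ used throughout Section 3.

```python
# Configurations mirroring DiT-S/B/L/XL [1]; hypothetical for DiT-3D itself.
DIT_CONFIGS = {
    "S":  dict(depth=12, hidden=384,  heads=6),
    "B":  dict(depth=12, hidden=768,  heads=12),
    "L":  dict(depth=24, hidden=1024, heads=16),
    "XL": dict(depth=28, hidden=1152, heads=16),
}

def model_spec(name: str, voxel: int) -> dict:
    """Parse a name like 'S/4' and attach patch size and token length L = (V/p)^3."""
    size, patch = name.split("/")
    cfg = dict(DIT_CONFIGS[size])
    cfg["patch"] = int(patch)
    cfg["tokens"] = (voxel // int(patch)) ** 3
    return cfg
```

For example, `model_spec("S/4", 32)` yields the default backbone with 512 input tokens.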

3.3 Efficient Modality/Domain Transfer with Parameter-efficient Fine-tuning

Leveraging the scalability of the plain diffusion transformer, we investigate parameter-efficient fine-tuning for achieving modality and domain transferability. To facilitate modality transfer from 2D to 3D, we leverage the knowledge pre-trained on large-scale 2D images with DiT [1]. For domain transfer from a source class to target classes, we train DiT-3D on a single class (e.g., chair) and transfer the model's parameters to other classes (e.g., airplane, car).

Modality Transfer: 2D (ImageNet) → 3D (ShapeNet). As large-scale pre-trained DiT checkpoints (https://github.com/facebookresearch/DiT/tree/main/diffusion) are readily available, we can skip training our diffusion transformer from scratch. Instead, we load most of the weights from the DiT [1] pre-trained on ImageNet [37] into our DiT-3D and continue with fine-tuning. To further optimize training efficiency, we adopt the parameter-efficient fine-tuning approach described in the recent work DiffFit [4], which freezes the majority of parameters and trains only the newly-added scale factors, bias terms, normalization, and class-condition modules. Note that we initialize $\gamma$ to 1, which is then multiplied with the frozen layers.
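The freezing scheme described above can be sketched as a name-based filter over the model's parameters. This is a minimal approximation of the DiffFit recipe [4] under the assumption that the trainable modules are identifiable by the substrings below (`class_embed` and `gamma` are hypothetical parameter names, not confirmed by the source).

```python
import torch.nn as nn

def apply_difffit_style_freezing(model: nn.Module) -> int:
    """Freeze all parameters except biases, normalization layers, and
    (assumed) class-embedding / gamma scale-factor parameters, in the
    spirit of DiffFit. Returns the number of trainable parameters.
    Note: any learnable gamma scale would be initialized to 1, so the
    frozen layers start out unchanged."""
    trainable_keys = ("bias", "norm", "class_embed", "gamma")
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in trainable_keys)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Applied to a full DiT backbone, this is what shrinks the trainable parameter count from 32.8 MB to 0.09 MB in Table 3.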

Domain Transfer: Source Class → Target Class. Given a DiT-3D model pre-trained on chair data, we can use the same parameter-efficient fine-tuning approach to extend it to new categories. Specifically, following the methodology described above, we leverage the DiffFit fine-tuning strategy and obtain satisfactory generation results.

3.4 Relationship to DiT [1]

Our DiT-3D contains multiple distinct and efficient designs for 3D shape generation compared with DiT [1] for 2D image generation:

  • We perform diffusion directly on voxelized point clouds, while DiT requires latent codes from a pre-trained variational autoencoder as the denoising target.

  • Our plain diffusion transformer is the first to combine frequency-based sine-cosine 3D positional embeddings with patch embeddings to capture the locality of the voxel structure.

  • We are the first to introduce efficient 3D window attention into the Transformer blocks, reducing the complexity of the self-attention operator in DiT.

  • We append a devoxelization operator to the output of DiT's final linear layer to predict the denoising noise in point-cloud space.

4 Experiments

4.1 Experimental Setup

Datasets. Following most previous works [12, 13], we use the ShapeNet [38] Chair, Airplane, and Car categories as our primary datasets for 3D shape generation. For each 3D shape, we sample 2,048 points from the 5,000 points provided in [38] for training and testing. We also use the same dataset splits and pre-processing as PointFlow [9], which normalizes the data globally across the whole dataset.

Evaluation Metrics. For comprehensive comparisons, we follow prior work [12, 13] and use Chamfer Distance (CD) and Earth Mover's Distance (EMD) as the distance metrics in computing 1-Nearest Neighbor Accuracy (1-NNA) and Coverage (COV), our main measures of generative quality. 1-NNA calculates the leave-one-out accuracy of a 1-NN classifier between generated and reference point clouds; it is robust and correlates with both generation quality and diversity, and a lower score is better. COV measures the fraction of reference point clouds matched to at least one generated shape, correlating with generation diversity. A higher COV score is better, but COV alone does not measure quality, since low-quality yet diverse generations can still achieve high COV scores.
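Given precomputed pairwise distance matrices (under CD or EMD), the two metrics can be sketched as below. The function names are ours; this follows the standard definitions the text describes, with 50% being the ideal 1-NNA (generated and reference sets indistinguishable).

```python
import numpy as np

def one_nna(d_gg: np.ndarray, d_rr: np.ndarray, d_gr: np.ndarray) -> float:
    """Leave-one-out 1-NN accuracy between generated (G) and reference (R) sets.
    d_gg: (G, G), d_rr: (R, R), d_gr: (G, R) pairwise distances. Lower is better."""
    n_g, n_r = d_gr.shape
    d_gg = d_gg.copy(); np.fill_diagonal(d_gg, np.inf)  # exclude self-matches
    d_rr = d_rr.copy(); np.fill_diagonal(d_rr, np.inf)
    # a sample is classified "correctly" if its nearest neighbor lies in its own set
    correct_g = d_gg.min(axis=1) < d_gr.min(axis=1)
    correct_r = d_rr.min(axis=1) < d_gr.min(axis=0)
    return (correct_g.sum() + correct_r.sum()) / (n_g + n_r)

def coverage(d_gr: np.ndarray) -> float:
    """Fraction of reference shapes that are the nearest neighbor of at
    least one generated shape. Higher is better."""
    n_r = d_gr.shape[1]
    return len(set(d_gr.argmin(axis=1).tolist())) / n_r
```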

Implementation. Our implementation is based on the PyTorch [39] framework. The input voxel size is $32\times 32\times 32\times 3$, i.e., $V=32$. The final linear layer is initialized with zeros, and the other weights follow the standard initialization of ViT [35]. The models were trained for 10,000 epochs using the Adam optimizer [40] with a learning rate of $1e{-}4$ and a batch size of 128. We set $T=1000$ for all experiments. In the default setting, we use S/4 with patch size $p=4$ as the backbone. Note that we utilize 3D window attention in a subset of blocks (i.e., blocks 0, 3, 6, 9) and global attention in the others.
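A single training step of the $\epsilon$-prediction DDPM objective used here can be sketched as follows; `ddpm_training_step` is our name, and the model is assumed to take the noised input and the timestep, as in standard DDPM training [22].

```python
import torch

def ddpm_training_step(model, x0: torch.Tensor, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Sample a timestep, noise the clean (voxelized) point cloud x0, and
    regress the added Gaussian noise with an MSE loss."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))   # broadcast over shape dims
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise             # forward diffusion q(x_t | x_0)
    pred = model(x_t, t)                                     # eps_theta(x_t, t)
    return torch.nn.functional.mse_loss(pred, noise)
```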

Table 1: Comparison results (%) on shape metrics of our DiT-3D and baseline models.

| Method | Chair 1-NNA (CD↓) | Chair 1-NNA (EMD↓) | Chair COV (CD↑) | Chair COV (EMD↑) | Airplane 1-NNA (CD↓) | Airplane 1-NNA (EMD↓) | Airplane COV (CD↑) | Airplane COV (EMD↑) | Car 1-NNA (CD↓) | Car 1-NNA (EMD↓) | Car COV (CD↑) | Car COV (EMD↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r-GAN [8] | 83.69 | 99.70 | 24.27 | 15.13 | 98.40 | 96.79 | 30.12 | 14.32 | 94.46 | 99.01 | 19.03 | 6.539 |
| l-GAN (CD) [8] | 68.58 | 83.84 | 41.99 | 29.31 | 87.30 | 93.95 | 38.52 | 21.23 | 66.49 | 88.78 | 38.92 | 23.58 |
| l-GAN (EMD) [8] | 71.90 | 64.65 | 38.07 | 44.86 | 89.49 | 76.91 | 38.27 | 38.52 | 71.16 | 66.19 | 37.78 | 45.17 |
| PointFlow [9] | 62.84 | 60.57 | 42.90 | 50.00 | 75.68 | 70.74 | 47.90 | 46.41 | 58.10 | 56.25 | 46.88 | 50.00 |
| SoftFlow [10] | 59.21 | 60.05 | 41.39 | 47.43 | 76.05 | 65.80 | 46.91 | 47.90 | 64.77 | 60.09 | 42.90 | 44.60 |
| SetVAE [18] | 58.84 | 60.57 | 46.83 | 44.26 | 76.54 | 67.65 | 43.70 | 48.40 | 59.94 | 59.94 | 49.15 | 46.59 |
| DPF-Net [11] | 62.00 | 58.53 | 44.71 | 48.79 | 75.18 | 65.55 | 46.17 | 48.89 | 62.35 | 54.48 | 45.74 | 49.43 |
| DPM [29] | 60.05 | 74.77 | 44.86 | 35.50 | 76.42 | 86.91 | 48.64 | 33.83 | 68.89 | 79.97 | 44.03 | 34.94 |
| PVD [12] | 57.09 | 60.87 | 36.68 | 49.24 | 73.82 | 64.81 | 48.88 | 52.09 | 54.55 | 53.83 | 41.19 | 50.56 |
| LION [13] | 53.70 | 52.34 | 48.94 | 52.11 | 67.41 | 61.23 | 47.16 | 49.63 | 53.41 | 51.14 | 50.00 | 56.53 |
| GET3D [14] | 75.26 | 72.49 | 43.36 | 42.77 | – | – | – | – | 75.26 | 72.49 | 15.04 | 18.38 |
| MeshDiffusion [15] | 53.69 | 57.63 | 46.00 | 46.71 | 66.44 | 76.26 | 47.34 | 42.15 | 81.43 | 87.84 | 34.07 | 25.85 |
| DiT-3D (ours) | 49.11 | 50.73 | 52.45 | 54.32 | 62.35 | 58.67 | 53.16 | 54.39 | 48.24 | 49.35 | 50.00 | 56.38 |

4.2 Comparison to State-of-the-art Works

In this work, we propose a novel and effective diffusion transformer for 3D shape generation. To validate the effectiveness of the proposed DiT-3D, we comprehensively compare it to previous non-DDPM and DDPM baselines: 1) r-GAN, l-GAN [8] (2018 ICML): GAN-based generative models trained on raw point clouds (r-GAN) and latent variables (l-GAN); 2) PointFlow [9] (2019 ICCV): a probabilistic framework that generates 3D point clouds from a two-level hierarchy of distributions with continuous normalizing flows; 3) SoftFlow [10] (2020 NeurIPS): a probabilistic framework for training normalizing flows on manifolds to estimate the distribution of various shapes; 4) SetVAE [18] (2021 CVPR): a hierarchical variational autoencoder for sets that learns latent variables for coarse-to-fine dependency and permutation invariance; 5) DPF-Net [11] (2020 ECCV): a discrete latent variable network that builds on normalizing flows with affine coupling layers; 6) DPM [29] (2021 ICCV): the first DDPM approach to learn the reverse diffusion process for point clouds as a Markov chain conditioned on a shape latent; 7) PVD [12] (2021 ICCV): a strong DDPM baseline based on the point-voxel representation of 3D shapes; 8) LION [13] (2022 NeurIPS): a recent method based on two hierarchical DDPMs in global-latent and latent-point spaces; 9) GET3D [14] (2022 NeurIPS): a generative model that directly generates explicit textured 3D meshes from two latent codes (a 3D SDF and a texture field); 10) MeshDiffusion [15] (2023 ICLR): a very recent DDPM method using the graph structure of meshes and a deformable tetrahedral grid parametrization of 3D mesh shapes.

For chair generation, we report the quantitative comparison results in Table 1. As can be seen, we achieve the best performance on all metrics compared to previous non-DDPM and DDPM baselines. In particular, the proposed DiT-3D significantly outperforms DPF-Net [11], the current state-of-the-art normalizing-flow baseline, decreasing 1-NNA by 12.89 @CD and 7.80 @EMD and increasing COV by 7.74 @CD and 3.8 @EMD. Moreover, we achieve superior performance compared to MeshDiffusion [15], the current state-of-the-art DDPM baseline on meshes, which underlines the importance of replacing the U-Net with a plain diffusion transformer operating on observed point clouds for generating high-fidelity 3D shapes. Meanwhile, our DiT-3D outperforms LION [13] by a large margin, with gains of 4.59 1-NNA@CD and 1.61 1-NNA@EMD, and 3.51 COV@CD and 2.21 COV@EMD. These significant improvements demonstrate the superiority of our method in 3D shape generation. In addition, significant gains on airplane and car generation can be observed in Table 1. The qualitative results in Figure 3 further showcase the effectiveness of applying a plain diffusion transformer to the denoising process over point clouds for generating high-fidelity and diverse shapes.


Figure 3: Qualitative visualizations of high-fidelity and diverse 3D point cloud generation.

4.3 Experimental Analysis

In this section, we performed ablation studies to demonstrate the benefit of introducing three main 3D design components (voxel diffusion, 3D positional embeddings, and 3D window attention) in 3D shape generation. We also conducted extensive experiments to explore the efficiency of 3D window attention, modality and domain transferability, and scalability.

Table 2: Ablation studies on 3D adaptation components of our DiT-3D.

| Voxel Diffusion | 3D Pos Embed | 3D Window Attention | Training Cost (hours) | 1-NNA (CD↓) | 1-NNA (EMD↓) | COV (CD↑) | COV (EMD↑) |
|---|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 86.53 | 99.86 | 99.93 | 7.768 | 4.653 |
| ✓ | ✗ | ✗ | 91.85 | 67.46 | 69.47 | 38.97 | 41.74 |
| ✓ | ✓ | ✗ | 91.85 | 51.99 | 49.94 | 54.76 | 57.37 |
| ✓ | ✓ | ✓ | 41.67 | 49.11 | 50.73 | 52.45 | 54.32 |

Ablation on 3D Design Components. To validate the effectiveness of the introduced 3D adaptation components (voxel diffusion, 3D positional embeddings, and 3D window attention), we ablate each module and report the quantitative results in Table 2. Note that without voxel diffusion, we directly perform the denoising process on point coordinates, without voxelized point clouds or devoxelization prediction. We observe that adding voxel diffusion to the vanilla baseline substantially decreases 1-NNA (by 32.40 @CD and 30.46 @EMD) and increases COV (by 31.202 @CD and 37.087 @EMD), demonstrating the benefit of voxelized point clouds and devoxelization prediction in the denoising process for 3D shape generation. Meanwhile, introducing 3D positional embeddings into the baseline with voxel diffusion further improves shape generation on all metrics. More importantly, incorporating 3D window attention together with the two previous modules significantly reduces the training cost by 44.86 hours, lowers 1-NNA by 50.75 @CD and 49.2 @EMD, and raises COV by 44.682 @CD and 49.667 @EMD. These results validate the importance of the proposed 3D adaptation components in enabling the plain diffusion transformer to operate the denoising process on observed point clouds for 3D shape generation.

Table 3: Transferability studies on modality and domain with parameter-efficient fine-tuning.

| ImageNet Pre-train | Efficient Fine-tuning | Params (MB) | 1-NNA (CD↓) | 1-NNA (EMD↓) | COV (CD↑) | COV (EMD↑) |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 32.8 | 51.99 | 49.94 | 54.76 | 57.37 |
| ✓ | ✗ | 32.8 | 49.07 | 49.76 | 53.26 | 55.75 |
| ✓ | ✓ | 0.09 | 50.87 | 50.23 | 52.59 | 55.36 |

(a) Modality transfer.

| Source Domain | Target Domain | Params (MB) | 1-NNA (CD↓) | 1-NNA (EMD↓) | COV (CD↑) | COV (EMD↑) |
|---|---|---|---|---|---|---|
| Chair | Chair | 32.8 | 51.99 | 49.94 | 54.76 | 57.37 |
| Airplane | Chair | 0.09 | 52.56 | 50.75 | 53.71 | 56.32 |
| Airplane | Airplane | 32.8 | 62.81 | 58.31 | 55.04 | 54.58 |
| Chair | Airplane | 0.09 | 63.58 | 59.17 | 53.25 | 53.68 |

(b) Domain transfer.

Table 4: Scalability studies on flexible patch, voxel, and model sizes.

| Patch Size | 1-NNA (CD↓) | 1-NNA (EMD↓) | COV (CD↑) | COV (EMD↑) |
|---|---|---|---|---|
| 8 | 53.84 | 51.20 | 50.01 | 52.49 |
| 4 | 51.99 | 49.94 | 54.76 | 57.37 |
| 2 | 51.78 | 49.69 | 54.54 | 55.94 |

(c) Patch size.

| Voxel Size | 1-NNA (CD↓) | 1-NNA (EMD↓) | COV (CD↑) | COV (EMD↑) |
|---|---|---|---|---|
| 16 | 54.00 | 50.60 | 50.73 | 52.26 |
| 32 | 51.99 | 49.94 | 54.76 | 57.37 |
| 64 | 50.32 | 49.73 | 55.45 | 57.32 |

(d) Voxel size.

| Model Size | Params (MB) | 1-NNA (CD↓) | 1-NNA (EMD↓) | COV (CD↑) | COV (EMD↑) |
|---|---|---|---|---|---|
| S/4 | 32.8 | 56.31 | 55.82 | 47.21 | 50.75 |
| B/4 | 130.2 | 55.59 | 54.91 | 50.09 | 52.80 |
| L/4 | 579.0 | 52.96 | 53.57 | 51.88 | 54.41 |
| XL/4 | 674.7 | 51.95 | 52.50 | 52.71 | 54.31 |

(e) Model size.

Influence of 2D Pre-training (ImageNet). To show the modality transferability of the proposed approach from 2D ImageNet pre-trained weights to 3D generation with parameter-efficient fine-tuning, we report the ablation results of ImageNet pre-training and efficient fine-tuning on chair generation in Table 3(a). Two main observations can be derived from the comparisons: 1) with initialization from 2D ImageNet pre-trained weights, the proposed DiT-3D improves the quality of shape generation, decreasing 1-NNA by 2.92 @CD and 0.18 @EMD; 2) incorporating parameter-efficient fine-tuning on top of 2D ImageNet pre-trained weights drastically reduces the number of trainable parameters while achieving competitive generation performance.

Transferability in Domain. In addition, we explore parameter-efficient fine-tuning for domain transferability in Table 3(b). By training only 0.09 MB of model parameters from the source class to the target class, we achieve comparable quality and diversity on all metrics. These results indicate that our DiT-3D supports flexible transferability across modality and domain, which differs from previous 3D generation methods [12, 13] based on a U-Net backbone for DDPMs.

Scaling Patch Size, Voxel Size, and Model Size. To explore the scalability of our plain diffusion transformer across flexible designs, we ablate the patch size over {2, 4, 8}, the voxel size over {16, 32, 64}, and the model size over {S/4, B/4, L/4, XL/4}. As seen in Table 3(c), the proposed DiT-3D achieves the best performance with a patch size of 2; the same trend is observed in the original DiT [1] work for 2D image generation. In addition, increasing the voxel size from 16 to 64 for the input of the diffusion denoising process improves performance on all metrics, as shown in Table 3(d). More importantly, we still observe performance gains when scaling the proposed plain diffusion transformer up to XL/4 with models trained for 2,000 epochs. These promising results further demonstrate the strong scalability of our DiT-3D across patch sizes, voxel sizes, and model sizes for generating high-fidelity 3D shapes.
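A back-of-envelope calculation makes clear why window attention matters when scaling these knobs: the number of attended token pairs grows as $L^2$ with $L=(V/p)^3$, so halving the patch size or doubling the voxel size multiplies the cost by 64. The helper name below is ours.

```python
def attention_token_cost(voxel: int, patch: int, window=None) -> int:
    """Number of token pairs in self-attention: O(L^2) for global attention,
    O(L^2 / R^3) when keys/values are reduced with a 3D window of size R."""
    L = (voxel // patch) ** 3
    pairs = L * L
    return pairs if window is None else pairs // window ** 3
```

For the default V=32, p=4, global attention scores 512² = 262,144 pairs, while R=4 cuts this to 4,096; at V=64, p=2 the global cost exceeds 10⁹ pairs, which is where the 44.86-hour training saving in Table 2 comes from.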

5 Conclusion

In this work, we present DiT-3D, a novel plain diffusion transformer for 3D shape generation, which can directly operate the denoising process on voxelized point clouds. Compared to existing U-Net approaches, our DiT-3D is more scalable in model size and produces much higher quality generations. Specifically, we incorporate 3D positional and patch embeddings to aggregate input from voxelized point clouds. We then incorporate 3D window attention into Transformer blocks to reduce the computational cost of 3D Transformers, which can be significantly high due to the increased token length resulting from the additional dimension in 3D. Finally, we leverage linear and devoxelization layers to predict the denoised point clouds. Due to the scalability of the Transformer, DiT-3D can easily support parameter-efficient fine-tuning with modality and domain transferability. Empirical results demonstrate the state-of-the-art performance of the proposed DiT-3D in high-fidelity and diverse 3D point cloud generation.

References

  • (1) William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  • (2) Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • (3) Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. arXiv preprint arXiv:2303.06555, 2023.
  • (4) Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. arXiv preprint arXiv:2304.06648, 2023.
  • (5) Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 605–613, 2017.
  • (6) Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 216–224, 2018.
  • (7) Andrey Kurenkov, Jingwei Ji, Animesh Garg, Viraj Mehta, JunYoung Gwak, Christopher Bongsoo Choy, and Silvio Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. In Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 858–866, 2017.
  • (8) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
  • (9) Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4541–4550, 2019.
  • (10) Hyeongju Kim, Hyeonseung Lee, Woohyun Kang, Joun Yeop Lee, and Nam Soo Kim. Softflow: Probabilistic framework for normalizing flow on manifolds. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • (11) Roman Klokov, Edmond Boyer, and Jakob Verbeek. Discrete point flow networks for efficient point cloud generation. In Proceedings of the European Conference on Computer Vision (ECCV), page 694–710, 2020.
  • (12) Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5826–5835, 2021.
  • (13) Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • (14) Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. In Proceedings of Advances In Neural Information Processing Systems (NeurIPS), 2022.
  • (15) Zhen Liu, Yao Feng, Michael J. Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. Meshdiffusion: Score-based generative 3d mesh modeling. In Proceedings of International Conference on Learning Representations (ICLR), 2023.
  • (16) Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 206–215, 2018.
  • (17) Matheus Gadelha, Rui Wang, and Subhransu Maji. Multiresolution tree networks for 3d point cloud processing. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • (18) Jinwoo Kim, Jaehoon Yoo, Juho Lee, and Seunghoon Hong. Setvae: Learning hierarchical composition for generative modeling of set-structured data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15059–15068, 2021.
  • (19) Diego Valsesia, Giulia Fracastoro, and Enrico Magli. Learning localized generative models for 3d point clouds via graph convolution. In Proceedings of International Conference on Learning Representations (ICLR), 2019.
  • (20) Dong Wook Shu, Sung Woo Park, and Junseok Kwon. 3d point cloud generative adversarial network based on tree structured graph convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3859–3868, 2019.
  • (21) Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • (22) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of Advances In Neural Information Processing Systems (NeurIPS), pages 6840–6851, 2020.
  • (23) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of International Conference on Learning Representations (ICLR), 2021.
  • (24) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of International Conference on Learning Representations (ICLR), 2021.
  • (25) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • (26) Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
  • (27) Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In Proceedings of International Conference on Learning Representations (ICLR), 2021.
  • (28) Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • (29) Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2837–2845, 2021.
  • (30) Gimin Nam, Mariem Khlifi, Andrew Rodriguez, Alberto Tono, Linqi Zhou, and Paul Guerrero. 3d-ldm: Neural implicit 3d shape generation with latent diffusion models. arXiv preprint arXiv:2212.00842, 2022.
  • (31) Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. Diffusion-sdf: Text-to-shape via voxelized diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • (32) Ruihang Chu, Enze Xie, Shentong Mo, Zhenguo Li, Matthias Nießner, Chi-Wing Fu, and Jiaya Jia. Diffcomplete: Diffusion-based generative 3d shape completion. arXiv preprint arXiv:2306.16329, 2023.
  • (33) Zhijian Liu, Haotian Tang, Yujun Lin, and Song Han. Point-voxel cnn for efficient 3d deep learning. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (34) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  • (35) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of International Conference on Learning Representations (ICLR), 2021.
  • (36) Yanghao Li, Hanzi Mao, Ross B. Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  • (37) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • (38) Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • (39) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 8026–8037, 2019.
  • (40) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Appendix

In this appendix, we first provide additional experimental analyses on multi-class training and DDPM sampling steps in Section A. Furthermore, we showcase qualitative visualizations for comparisons with state-of-the-art works, visualizations of the diffusion process, and more high-fidelity visualizations in Section B. Finally, we thoroughly discuss this work's limitations and broader impact in Section C.

Appendix A Additional Experimental Analyses

A.1 Results on Multi-class Training

To show the effectiveness of our DiT-3D under multi-class training, we vary the training classes over {Chair}, {Chair, Car}, and {Chair, Car, Airplane}, and test on the chair class in Table 5. We observe that, by using learnable class embeddings as the condition, the proposed diffusion transformer achieves generation results competitive with category-specific models on all metrics after multi-class training. This allows us to train only one global model for all classes simultaneously instead of training class-specific models multiple times, which differs from previous DDPM-based approaches that do not involve class embeddings.

Table 5: Exploration studies on multi-class training. One global model for all three classes achieves competitive results against category-specific models trained on only one class.

| Train Class | Test Class | 1-NNA (CD↓) | 1-NNA (EMD↓) | COV (CD↑) | COV (EMD↑) |
|---|---|---|---|---|---|
| Chair | Chair | 51.99 | 49.94 | 54.76 | 57.37 |
| Chair, Car | Chair | 52.68 | 50.62 | 54.15 | 56.83 |
| Chair, Car, Airplane | Chair | 53.35 | 51.84 | 52.81 | 55.30 |
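The learnable class-embedding conditioning used for multi-class training can be sketched as a lookup table added to the timestep embedding. This is a minimal illustration under that assumption; the module name and how the summed embedding enters the blocks are not specified by the source.

```python
import torch
import torch.nn as nn

class ClassCondition(nn.Module):
    """Learnable per-class embedding added to the timestep embedding,
    so one global model can condition on the target category (sketch)."""
    def __init__(self, n_classes: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(n_classes, dim)

    def forward(self, t_emb: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # t_emb: (B, dim) timestep embedding, y: (B,) class indices
        return t_emb + self.embed(y)
```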


Figure 4: Effect of sampling steps on 3D shape generation (Chair) during the inference stage.

A.2 Effect of Sampling Steps

Furthermore, we explore the effect of the number of DDPM sampling steps $T$ on the final performance during the inference stage in Figure 4. As can be seen, the proposed DiT-3D achieves the best results (lowest 1-NNA and highest COV) on all metrics (CD and EMD) when the number of sampling steps is set to 1000. This trend is consistent with conclusions in the prior DDPM work [13].
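The sampling procedure being varied here is standard ancestral DDPM sampling [22], where the step count is the length of the noise schedule; a minimal sketch (function name ours, with $\sigma_t^2=\beta_t$ as the variance choice) is:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas: torch.Tensor) -> torch.Tensor:
    """Ancestral DDPM sampling with len(betas) steps. Fewer steps trade
    fidelity for speed, matching the trend in Figure 4."""
    alphas = 1.0 - betas
    acp = torch.cumprod(alphas, dim=0)                 # cumulative alpha-bar
    x = torch.randn(shape)                             # start from pure noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.full((shape[0],), t))     # predicted noise
        coef = (1 - alphas[t]) / (1 - acp[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()        # posterior mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(shape)
    return x
```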


Figure 5: Qualitative comparisons with state-of-the-art works. The proposed DiT-3D generates high-fidelity and diverse point clouds of 3D shapes for each category.

Appendix B Qualitative Visualizations

B.1 Comparisons with State-of-the-art Works

To qualitatively evaluate the generated 3D shapes, we compare the proposed DiT-3D with SetVAE [18], DPM [29], and PVD [12] on generated 3D point clouds of all three classes in Figure 5. From these comparisons, we observe that the 3D point clouds generated by our framework are superior in quality to those of SetVAE [18], a hierarchical variational autoencoder for sets that learns latent variables for coarse-to-fine dependency and permutation invariance. We also achieve much better results than DPM [29], the first denoising diffusion probabilistic model for point cloud generation. More importantly, the proposed DiT-3D achieves high-fidelity and diverse results compared to PVD [12], a strong diffusion model based on point voxels. These visualizations further showcase the superiority of our DiT-3D in generating high-fidelity and diverse shapes by using a plain diffusion transformer for the denoising process on point clouds.

B.2 Visualizations of Diffusion Process

Furthermore, we visualize the diffusion process of generated Chair shapes over 1000 sampling steps in Figure 6, generating 6 shapes from random noise to the final 3D shapes. From left to right, we can observe that our DiT-3D follows a meaningful diffusion process to produce high-fidelity and diverse shapes. As the number of sampling steps approaches 1000, the generated shapes become more realistic, whereas they resemble random noise in the initial few sampling steps. These qualitative results further showcase the effectiveness of applying a plain diffusion transformer to generate high-fidelity and diverse shapes. The diffusion processes for Airplane and Car shapes generated over 1000 sampling steps are reported in Figures 7 and 8.

Figure 6: Qualitative visualizations of the diffusion process on Chair shape generation. The results of generating from random noise to final 3D shapes are shown in left-to-right order.

Figure 7: Qualitative visualizations of the diffusion process on Airplane shape generation. The results of generating from random noise to final 3D shapes are shown in left-to-right order.

Figure 8: Qualitative visualizations of the diffusion process on Car shape generation. The results of generating from random noise to final 3D shapes are shown in left-to-right order.

B.3 More Visualizations of Generated Shapes

To qualitatively showcase the high fidelity and diversity of the generated shapes, we visualize more generated samples from all three classes in Figures 9, 10, and 11. These qualitative visualizations demonstrate the effectiveness of the proposed 3D design components in a plain diffusion transformer, which performs the denoising process directly on point clouds of the three categories to produce high-fidelity and diverse shapes.

Figure 9: Qualitative visualizations of high-fidelity and diverse results on Chair shape generation.

Figure 10: Qualitative visualizations of high-fidelity and diverse results on Airplane shape generation.

Figure 11: Qualitative visualizations of high-fidelity and diverse results on Car shape generation.

Appendix C Discussion

Limitation & Future Work. This work thoroughly explores the plain diffusion transformer on point clouds for generating high-fidelity and diverse 3D shapes. However, we have yet to explore the potential of other 3D modalities, such as signed distance fields (SDFs) and meshes, or to scale our DiT-3D to large-scale training on more 3D shapes. These directions are promising, and we leave them as future work.

Broader Impact. The proposed DiT-3D generates high-fidelity and diverse 3D shapes from training samples in the existing ShapeNet benchmark, which might cause the model to learn biases present in the data. Such biases should be carefully addressed before deployment in real applications.
