Title: Compositional Generative Modeling: A Single Model is Not All You Need
Abstract
Large monolithic generative models trained on massive amounts of data have become an increasingly dominant approach in AI research. We argue that we should instead construct large generative systems by composing smaller generative models together. We show how such a compositional generative approach enables us to learn distributions in a more data-efficient manner, generalizing to parts of the data distribution unseen at training time. We further show how this enables us to program and construct new generative models for tasks completely unseen at training time. Finally, we show that in many cases, we can discover the compositional components directly from data.
Keywords: Generative Modeling, Modularity, Compositionality, Energy-Based Models
1 Introduction
In the past two years, increasingly large generative models have become a dominant force in AI research, with compelling results in natural language (Brown et al., 2020), computer vision (Rombach et al., 2022), and decision-making (Reed et al., 2022). Much of the AI research field has now focused on scaling and constructing increasingly large generative models (Hoffmann et al., 2022), developing tools to build even larger models (Dao et al., 2022; Kwon et al., 2023), and studying how properties emerge as these models scale in size (Lu et al., 2023; Schaeffer et al., 2023).
Despite significant scaling in generative models, existing models remain far from intelligent, exhibiting poor reasoning ability (Tamkin et al., 2021), extensive hallucinations (Zhang et al., 2023b), and poor understanding of commonsense relationships in images (Figure 2) (Majumdar et al., 2023). At the same time, large models have already been trained on most of the existing data on the Internet and have reached the limits of modern computational hardware, costing hundreds of millions of dollars to train (Figure 1). Inference costs of such gigantic models are also prohibitive, requiring large computational clusters and costing several dollars for longer queries and answers (OpenAI).
In addition, adapting such large models to new task distributions is difficult. Directly fine-tuning large models is often prohibitively expensive, requiring a large computation cluster and an often difficult-to-acquire fine-tuning dataset. Other works have explored leveraging language and a set of in-context examples to teach models new distributions, but such adaptation is limited to settings that are well expressed using a set of language instructions and that are roughly similar to the distributions already seen during training (Yadlowsky et al., 2023).
In this paper, we argue that as an alternative to studying how to scale and construct increasingly large monolithic generative models, we should instead construct complex generative models compositionally from simpler models. Each constituent model captures the probability distribution over a subset of the variables of the distribution of interest, and these models are combined to model the more complex full distribution. Individual distributions are therefore much simpler, can be represented with fewer parameters, and can be learned from less data. Furthermore, the combined model can generalize to unseen portions of the data distribution as long as each constituent dimension is locally in distribution.
Figure 1: Rising Size and Cost of Models. While much of AI research has focused on constructing increasingly larger monolithic models, training costs are rising exponentially, by roughly a factor of 3 every year, with current models already costing several hundred million dollars per training run. Data from (Epoch, 2023).
Such compositional generative modeling enables us to effectively represent the sparsity and symmetry naturally found in nature. Sparsity of interactions, for instance between an agent and the external environment dynamics, can be encoded by representing each with a separate generative model. Sources of symmetry can be captured using multiple instances of the same independent generative component to represent each occurrence of the symmetry, for instance by tiling a patch-level generative model over the patches in an image. Compositional structure is widely used in existing work to tractably represent high-dimensional distributions in Probabilistic Graphical Models (PGMs) (Koller & Friedman, 2009), and even in existing generative models, e.g., autoregressive models, which factorize distributions into a set of conditional probability distributions (represented by a single model).
Compositional generative modeling further enables us to effectively program and construct new generative systems for unseen task distributions. Individual generative models can be composed in new ways, with each model specifying a set of constraints and probabilistic composition serving as a communication language among models, ensuring a distribution is constructed so that all constraints are satisfied to form the task distribution of interest. Such programming requires no explicit training or data, enabling generalization at inference time even on distributions with no previously seen data. We illustrate how such recombination enables generalization to new task distributions in decision-making and in image and video synthesis.
Figure 2: Limited Compositionality in Multimodal Models. Existing large multimodal models such as GPT-4V and DALL-E 3 still struggle with simple textual queries, often falling back to biases in data.
The underlying compositional components in generative modeling can in many cases be directly inferred and discovered in an unsupervised manner from data, representing compositional structure such as objects and relations. Such discovered components can then be similarly recombined to form new distributions – for instance, object components discovered by one generative model on one dataset can be combined with components discovered by a separate generative model on another dataset to form hybrid scenes with objects from both datasets. We illustrate the efficacy of such discovered compositional structure across domains in images and trajectory dynamics.
Overall, in this paper, we advocate for the idea that we should construct complex generative systems by representing them as a compositional system of simpler components and illustrate its benefits across various domains.
2 Data Efficient Generative Modeling
The predominant paradigm for training generative models has been to construct increasingly larger monolithic models trained with greater amounts of data and computational power. While language models have demonstrated significant improvements with increased scale (albeit still with difficulty in compositionality (Dziri et al., 2023)), current multimodal models such as DALL-E 3 and GPT-4V remain unable to take advantage of even simple forms of compositionality (Figure 2). Such models may be unable to accurately generate images given combinations of relations rarely seen in training data, or fail to understand simple spatial relations in images, despite being trained on a very significant portion of the existing Internet.
One difficulty is that the underlying sample complexity of learning generative models over joint distributions of variables increases dramatically with the number of variables. As an example, consider learning probability distributions by maximizing log-likelihood over a set of random variables $A$, $B$, $C$, $D$, each of which can take $K$ values. Directly learning a distribution over a single variable $A$ requires $O(K)$ samples (Canonne, 2020). The data required to learn distributions over a joint set of variables generally increases exponentially, so that learning a joint distribution $p(A, B, C, D)$ requires $O(K^4)$ samples (Canonne, 2020).
Constructing large multimodal generative models such as GPT-4V or DALL-E 3 falls into the same difficulty – as the number of modalities jointly modeled increases, the number of samples required to see and learn the entire data distribution increases exponentially. This is particularly challenging in the multimodal setting, as the existing data on the Internet used to train these models is often highly non-uniform, with many combinations of natural language and images unseen.
One approach to significantly reduce the data necessary to learn generative models over complex joint distributions is factorization – if we know that a distribution exhibits an independence structure between variables such as
$$p(A, B, C, D) \propto p(A)\, p(B)\, p(C, D),$$
we can substantially reduce the data requirements by only needing to learn these factors, composing them together to form the more complex distribution. This also enables our learned joint distribution to generalize to unseen combinations of variables so long as each local variable combination is in distribution (illustrated in Figure 3). Even in settings where distributions are not accurately modeled as a product of independent factors, such a factorization can still lead to a better model given limited data by reducing the hypothesis space (Murphy, 2022). This idea of factorizing probability distributions has led to substantial work in probabilistic graphical models (PGMs) (Koller & Friedman, 2009).
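To make the data-efficiency argument concrete, the following is a minimal sketch in Python, using a toy ground-truth distribution invented purely for illustration, that compares a monolithic estimate of the full $K^4$ joint table against a compositional estimate that learns the factors $p(A)$, $p(B)$, and $p(C,D)$ separately and multiplies them back together:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10            # values per variable
n_samples = 500   # training samples for each estimator

# Toy ground truth with the assumed structure p(A,B,C,D) = p(A) p(B) p(C,D).
pA = rng.dirichlet(np.ones(K))
pB = rng.dirichlet(np.ones(K))
pCD = rng.dirichlet(np.ones(K * K)).reshape(K, K)

A = rng.choice(K, size=n_samples, p=pA)
B = rng.choice(K, size=n_samples, p=pB)
CD = rng.choice(K * K, size=n_samples, p=pCD.ravel())
C, D = CD // K, CD % K

# Monolithic estimate: one table over all K^4 joint outcomes (Laplace smoothed).
joint = np.ones((K, K, K, K))
np.add.at(joint, (A, B, C, D), 1)
joint /= joint.sum()

# Compositional estimate: learn each factor separately, then recombine.
fA = np.bincount(A, minlength=K) + 1.0
fB = np.bincount(B, minlength=K) + 1.0
fCD = np.ones((K, K))
np.add.at(fCD, (C, D), 1)
factored = ((fA / fA.sum())[:, None, None, None]
            * (fB / fB.sum())[None, :, None, None]
            * (fCD / fCD.sum())[None, None, :, :])

true = pA[:, None, None, None] * pB[None, :, None, None] * pCD[None, None, :, :]
print("monolithic total-variation error:   ", 0.5 * np.abs(joint - true).sum())
print("compositional total-variation error:", 0.5 * np.abs(factored - true).sum())
```

With a few hundred samples, the compositional estimate is far closer to the ground truth, since each factor only needs $O(K)$ or $O(K^2)$ samples rather than $O(K^4)$.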
Below, we illustrate across four settings how representing a target distribution $p(x)$ in a factorized manner can substantially improve generative modeling performance from a limited amount of data:
Figure 3: Generalizing Outside Training Data. Given a narrow slice of training data, we can learn generative models that generalize outside the data through composition. We learn separate generative models to model each axis of the data – the composition of models can then cover the entire data space.
Figure 4: Distribution Composition – When modeling simple product (top) or mixture (bottom) compositions, learning two compositional models on the factors is more data efficient than learning a single monolithic model on the product distribution. The monolithic model is trained on twice as much data as individual factors.
Figure 5: Compositional Trajectory Generation – By factorizing a trajectory generative model into a set of components, models are able to more accurately simulate dynamics from limited trajectories (a) and train in fewer training iterations (b).
Simple Distribution Composition. In Figure 4, we consider modeling a distribution $p(x)$ that is a product $p(x) \propto p_1(x)\, p_2(x)$ or mixture $p(x) \propto p_1(x) + p_2(x)$ of two factors $p_1(x)$ and $p_2(x)$. We compare training either a single model on $p(x)$ or two generative models on the factors $p_1(x)$ and $p_2(x)$. We find that training compositional models leads to more accurate distribution modeling if the same amount of data is used to learn $p(x)$ as is used to learn both $p_1(x)$ and $p_2(x)$. Even when modeling simple distributions, the data complexity of modeling each factor is lower than that of representing the joint distribution.
Trajectory Modeling. Next, we consider modeling a probability distribution $p(\tau)$ over trajectories $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$, which many recent works have typically modeled using a single joint distribution $p(s_0, a_0, \ldots, s_T, a_T)$ (Janner et al., 2022; Ajay et al., 2022). In contrast to a monolithic generative distribution, given structural knowledge of the environment, i.e., that it is a Markov Decision Process, a more factorized generative model represents the distribution as a product
$$p(\tau) \propto \prod_{i} p(s_i \mid s_{i-1}, a_{i-1}).$$
In Figure 5, we explore the efficacy of compositional and monolithic models in characterizing trajectories in Maze2D, which consists of a 4D state space (2D position and velocity) and a 2D action space (2D forces), using the model in (Janner et al., 2022) (with the compositional model representing trajectory chunks of size 8 to ensure compatibility with the architecture). We plot the accuracy of generated trajectories at unseen start states as a function of the number of agent episodes used to train models, where each episode has a length of approximately 10,000 timesteps. As seen in Figure 5(a), given only a very limited number of agent episodes in an environment, a factorized model can more accurately simulate trajectory dynamics. In addition, we found that a single joint generative model also took a substantially larger number of iterations to train than the factorized model, as illustrated in Figure 5(b).
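Sampling from the factorized trajectory distribution reduces to chaining a single-step model. The sketch below illustrates the simplest per-step form of this factorization; note the experiment in Figure 5 instead composes trajectory chunks within a diffusion model, and `step_model` here is a hypothetical stand-in for a learned single-step generative component:

```python
import torch

def rollout_factorized(step_model, s0, actions):
    """Sample a trajectory from p(tau) ∝ ∏_i p(s_i | s_{i-1}, a_{i-1}).

    step_model(s, a) is an assumed learned single-step model returning a sample
    of the next state; chaining it over the action sequence samples from the
    factorized trajectory distribution without modeling the full joint
    trajectory directly.
    """
    states = [s0]
    for a in actions:
        states.append(step_model(states[-1], a))
    return torch.stack(states)
```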
Figure 6: Compositional Visual Synthesis. By composing a set of generative models modeling conditional image distributions given a sentence description, we can more accurately synthesize images given paragraph-level text descriptions. Figure adapted from (Liu et al., 2022)
Compositional Visual Generation. We further consider modeling a probability distribution $p(x \mid T)$ in text-to-image synthesis, where $x$ is an image and $T$ is a complex text description. While this distribution is usually characterized by a single generative model, we can factor the generation as a product of distributions (Liu et al., 2022) given sentences $t_1$, $t_2$, and $t_3$ in the description $T$:
$$p(x \mid T) \propto p(x \mid t_1)\, p(x \mid t_2)\, p(x \mid t_3).$$
This representation of the distribution is more data efficient: we only need to see the full distribution of images given single sentences. In addition, it enables us to generalize to unseen regions of $p(x \mid T)$, such as unseen combinations of sentences and longer text descriptions. In Figure 6, we illustrate the efficacy of such an approach.
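In score-based implementations of this idea, such as composable diffusion (Liu et al., 2022), the product of sentence-conditioned distributions can be approximated by summing conditional score offsets. The sketch below is a hedged illustration; `eps_model` and the embedding arguments are assumed interfaces of a trained text-conditioned diffusion model, not a specific library API:

```python
import torch

def composed_noise_prediction(eps_model, x_t, t, sentence_embs, uncond_emb, weight=1.0):
    """Approximate the score of p(x|t_1) p(x|t_2) p(x|t_3) at noise level t.

    eps_model(x, t, c) is an assumed noise-prediction network. Starting from the
    unconditional prediction, each sentence contributes the offset between its
    conditional prediction and the unconditional one, mirroring a
    product-of-experts combination of the per-sentence distributions.
    """
    eps_uncond = eps_model(x_t, t, uncond_emb)
    eps = eps_uncond.clone()
    for c in sentence_embs:
        eps = eps + weight * (eps_model(x_t, t, c) - eps_uncond)
    return eps
```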
Composing Language Models. Finally, we consider modeling a probability distribution $p(x)$ over a language sequence $x$. Similar to the previous examples, we can represent the likelihood as a composition $p(x) \propto \prod_i p_i(x)$, where each distribution $p_i(x)$ is parameterized by a separate language model. However, directly sampling from such a composition of language models is difficult, as it requires intermediate access to the output logits of each model, which are often unavailable for proprietary models. One approach to avoid this issue is to combine the outputs of individual language models $p_i(x)$ in language space and use the result as context for representing the final distribution $p(x)$ over language sequences (Du et al., 2023b).
In Du et al. (2023b), this compositional approach is found to effectively improve the performance of base language models. For instance, on the MATH dataset (Hendrycks et al., 2021), by composing 5 instances of a GPT-3.5 model, we can obtain a final accuracy of 58.0 ± 2.8%, outperforming even a much larger and more expensive GPT-4 model, which obtains a performance of 55.0 ± 2.9%.
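Because the composition here happens in language space rather than logit space, it can be implemented purely through black-box model queries. The sketch below assumes a hypothetical `query_llm(prompt)` helper (not a real library function) and loosely follows the iterative procedure of Du et al. (2023b):

```python
def compose_language_models(query_llm, question, n_models=5, n_rounds=2):
    """Language-space composition of several LLM instances (a sketch).

    query_llm(prompt) is an assumed helper returning one model instance's text
    answer. Each round, every instance sees the other instances' answers as
    context and revises its own, approximating sampling from the composed
    distribution without access to any model's logits.
    """
    answers = [query_llm(question) for _ in range(n_models)]
    for _ in range(n_rounds):
        context = "\n".join(f"- {a}" for a in answers)
        prompt = (
            f"Question: {question}\n"
            f"Answers proposed by other models:\n{context}\n"
            "Considering these as additional context, give your updated answer."
        )
        answers = [query_llm(prompt) for _ in range(n_models)]
    return answers
```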
3 Generalization to New Distributions
In the previous section, we illustrated how composition can enable us to effectively model a distribution $p(x)$, including regions where we have not seen any data. In this section, we further illustrate how composition enables generalization, allowing us to re-purpose a generative model $p(x)$ to solve a new task by constructing a new generative model $q(x)$.
Figure 7: Planning through Probability Composition. By composing a probability density $p_{\text{traj}}(\tau)$ trained to model dynamics in an environment with a probability density $p_{\text{goal}}(\tau, g)$ which specifies a particular goal state, we can sample plans from a specified start condition to a goal condition. Figure from (Janner et al., 2022), where the horizontal axis illustrates the progression of sampling.
Consider the task of planning, where we wish to construct a generative model $q(\tau)$ which samples plans that reach a goal state $g$ starting from a start state $s$. Given a generative model $p(\tau)$, which samples legal, but otherwise unconstrained, state sequences in an environment, we can construct an additional generative model $r(\tau, s, g)$ which has high likelihood when $\tau$ has start state $s$ and goal state $g$ and low likelihood everywhere else. By composing the two distributions
$$q(\tau) \propto p(\tau)\, r(\tau, s, g), \qquad (1)$$
we can construct our desired planning distribution $q(\tau)$, exploiting the fact that probability can be treated as a “currency” to combine models, enabling us to selectively choose trajectories that satisfy the constraints in both distributions.
Below, we illustrate a set of applications where we can construct new compositional generative models $q(x)$ to solve tasks in planning, constraint satisfaction, hierarchical decision-making, and image and video generation.
Planning with Trajectory Composition. We first consider constructing $q(\tau)$ representing planning as described in Equation 1. In Figure 7, we illustrate how sampling from this composed distribution enables successful planning from start to goal states. Quantitatively, this approach also performs well, as illustrated in (Janner et al., 2022).
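A minimal sketch of sampling from Equation 1 with a trajectory diffusion model is shown below. Here $r(\tau, s, g)$ is treated as a hard constraint on the endpoint states, enforced by clamping those states at every reverse diffusion step; `denoise_step` is an assumed single reverse step of a trained trajectory diffusion model, not a specific library call:

```python
import torch

def plan_with_endpoint_constraint(denoise_step, n_diffusion_steps, horizon,
                                  state_dim, start_state, goal_state):
    """Sample a plan from q(tau) ∝ p(tau) r(tau, s, g) (a sketch).

    The trajectory model p(tau) is queried through denoise_step(tau, t), an
    assumed reverse-diffusion update, while the constraint r is enforced by
    overwriting the first and last states with the start and goal at every
    step of sampling.
    """
    tau = torch.randn(horizon, state_dim)            # initialize from noise
    for t in reversed(range(n_diffusion_steps)):
        tau[0], tau[-1] = start_state, goal_state    # enforce r(tau, s, g)
        tau = denoise_step(tau, t)                   # one reverse step of p(tau)
    tau[0], tau[-1] = start_state, goal_state
    return tau
```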
Figure 8: Manipulation through Constraint Composition. New object manipulation problems can be converted into a graph of constraints between variables. Each constraint can be represented as a low-dimensional factor of the joint distribution, with sampling from the composition of distributions corresponding to solving the arrangement problem. Figure adapted from (Yang et al., 2023b).
Manipulation through Constraint Satisfaction. We next illustrate how we can construct a generative model $q(V)$ to solve a variety of robotic object arrangement tasks. As illustrated in Figure 8, many object arrangement tasks can be formulated as continuous constraint satisfaction problems consisting of a graph $\mathcal{G} = \langle \mathcal{V}, \mathcal{U}, \mathcal{C} \rangle$, where each $v \in \mathcal{V}$ is a decision variable (such as the pose of an object), each $u \in \mathcal{U}$ is a conditioning variable (such as the geometry of an object), and each $c \in \mathcal{C}$ is a constraint such as being collision-free. Given such a specification, we can solve the robotics task by sampling from the composed distribution
$$q(V) \propto \prod_{c \in \mathcal{C}} p_c(\mathcal{V}^c \mid \mathcal{U}^c),$$
corresponding to solving the constraint satisfaction problem. Such an approach enables effective generalization to new problems (Yang et al., 2023b), to temporally extended plans (Mishra et al., 2023), and to combinations of heterogeneous policies (Wang et al., 2024).
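One hedged way to picture this composition is as a sum of per-constraint energies over the variables each constraint touches; the interfaces below (`energy_fn` and the index lists) are assumptions for illustration rather than the actual parameterization used in (Yang et al., 2023b):

```python
import torch

def composed_constraint_energy(constraint_factors, V, U):
    """Energy of q(V) ∝ ∏_c p_c(V^c | U^c), written as a sum of factor energies.

    constraint_factors is a list of (energy_fn, var_idx, cond_idx) triples, where
    energy_fn is an assumed learned energy for one constraint type and the index
    lists select the decision and conditioning variables it touches. Sampling
    from q then amounts to running MCMC (or optimization) on this summed energy.
    """
    total = torch.zeros(())
    for energy_fn, var_idx, cond_idx in constraint_factors:
        total = total + energy_fn(V[var_idx], U[cond_idx])
    return total
```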
Figure 9: Hierarchical Planning through Composition. By composing a set of foundation models trained on Internet data (language, videos, action), we can zero-shot construct a hierarchical planning system. Figure adapted from (Ajay et al., 2023).
Figure 10: Image Tapestries through Composition. By composing a set of probability distributions defined over different spatial regions in an image, we can construct detailed image tapestries. Figure adapted from (Du et al., 2023a).
Hierarchical Planning with Foundation Models. We further illustrate how we can construct a generative model that functions as a hierarchical planner for long-horizon tasks. We construct $q(\tau_{\text{text}}, \tau_{\text{image}}, \tau_{\text{action}})$, which jointly models the distribution over a text plan $\tau_{\text{text}}$, image plan $\tau_{\text{image}}$, and action plan $\tau_{\text{action}}$ given a natural language goal $g$ and image observation $o$, by combining pre-existing foundation models trained on Internet knowledge. We formulate $q(\tau_{\text{text}}, \tau_{\text{image}}, \tau_{\text{action}})$ through the composition
$$p_{\text{LLM}}(\tau_{\text{text}}, g)\, p_{\text{Video}}(\tau_{\text{image}}, \tau_{\text{text}}, o)\, p_{\text{Action}}(\tau_{\text{action}}, \tau_{\text{image}}).$$
This distribution assigns a high likelihood to sequences of natural-language instructions $\tau_{\text{text}}$ that are plausible ways to reach a final goal $g$ (leveraging textual knowledge embedded in an LLM), which are consistent with visual plans $\tau_{\text{image}}$ starting from image $o$ (leveraging visual dynamics information embedded in a video model), which are further consistent with execution via actions $\tau_{\text{action}}$ (leveraging action information in a large action model). Sampling from this distribution then corresponds to finding sequences $\tau_{\text{text}}, \tau_{\text{image}}, \tau_{\text{action}}$ that are mutually consistent with all constraints and thus constitute successful hierarchical plans to accomplish the task. We provide an illustration of this composition in Figure 9, with the efficacy of this approach demonstrated in (Ajay et al., 2023).
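A rough sketch of one way to draw approximately consistent samples from this composition is shown below; `llm_propose`, `video_likelihood`, `video_generate`, and `action_infer` are assumed interfaces to the three foundation models rather than actual APIs, and candidate reranking is used here as a simple stand-in for full joint inference:

```python
def hierarchical_plan(llm_propose, video_likelihood, video_generate,
                      action_infer, goal, obs, n_candidates=5):
    """Approximate sampling from p_LLM * p_Video * p_Action (a sketch).

    Text plans are proposed by the LLM conditioned on the goal, reweighted by
    how consistent the video model judges them to be with the observed image,
    then grounded into an image plan and finally into actions.
    """
    candidates = [llm_propose(goal, obs) for _ in range(n_candidates)]
    text_plan = max(candidates, key=lambda plan: video_likelihood(plan, obs))
    image_plan = video_generate(text_plan, obs)
    action_plan = action_infer(image_plan)
    return text_plan, image_plan, action_plan
```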
Controllable Image Synthesis. Composition also allows us to construct a generative model $q(x \mid D)$ to generate images $x$ from a detailed scene description $D$ consisting of text and bounding-box descriptions $\{\text{text}_i, \text{bbox}_i\}_{i=1:N}$. This compositional distribution is
$$q(x \mid D) \propto \prod_{i \in \{1, \ldots, N\}} p(x_{\text{bbox}_i} \mid \text{text}_i),$$
where each distribution is defined over bounding boxes in an image. In Figure 10, we illustrate the efficacy of this approach for constructing complex images. This approach enables the synthesis of image tapestries (Du et al., 2023a) and collages (Zhang et al., 2023a).
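One hedged way to implement this region-wise product with a diffusion model is to apply each text condition's score offset only inside its bounding box, analogously to the sentence composition above; `eps_model` and the region masks are assumptions for illustration, and the actual implementations in (Du et al., 2023a; Zhang et al., 2023a) may differ in detail:

```python
import torch

def tapestry_noise_prediction(eps_model, x_t, t, regions, uncond_emb, weight=1.0):
    """Approximate the score of ∏_i p(x_bbox_i | text_i) at noise level t.

    regions is a list of (mask, text_emb) pairs, where mask is a {0,1} tensor over
    pixels marking bbox_i; eps_model(x, t, c) is an assumed noise-prediction
    network. Each region's conditional offset is applied only within its box.
    """
    eps_uncond = eps_model(x_t, t, uncond_emb)
    eps = eps_uncond.clone()
    for mask, text_emb in regions:
        eps = eps + weight * mask * (eps_model(x_t, t, text_emb) - eps_uncond)
    return eps
```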
Figure 11: Video Stylization through Composition. By composing one video model with a model specifying style, we can stylize video generations. Figure adapted from (Yang et al., 2023a).
Style Adaptation of Video Models. Finally, composition can be used to construct a generative model $q(\tau)$ that synthesizes video in new styles. Given a pretrained video model $p_{\text{pretrained}}(\tau \mid \text{text})$ and a small video model of a particular style $p_{\text{adapt}}(\tau \mid \text{text})$, we can sample videos $\tau$ from the compositional distribution
$$p_{\text{pretrained}}(\tau \mid \text{text})\, p_{\text{adapt}}(\tau \mid \text{text})$$
to generate new videos in different specified styles. The efficacy of using composition to adapt the style of a video model is illustrated in (Yang et al., 2023a).
4 Generative Modeling with Learned Compositional Structure
A limitation of compositional generative modeling discussed in the earlier sections is that it requires a priori knowledge about the independence structure of the distribution we wish to model. However, these compositional components can also be discovered jointly while learning a probability distribution by formulating maximum likelihood estimation as maximizing the likelihood of the factorized distribution
$$p_\theta(x) \propto \prod_i p_\theta^i(x).$$
Similar to the previous two sections, the discovery of the learned components $p_\theta^i(x)$ enables more data-efficient learning of the generative model as well as the ability to generate samples from new task distributions. Here, we illustrate three examples of how different factors can be discovered in an unsupervised manner.
Figure 12: Composition of Discovered Objects. Probabilistic components corresponding to individual objects in a scene are discovered unsupervised in two datasets using two separate models. Discovered components (illustrated with yellow boxes) can be multiplied together to form new scenes with a hybrid composition of objects. Figure adapted from (Su et al., 2024).
Discovering Factors from an Input Image. Given an input image $x$ of a scene, we can parameterize a probability distribution over the pixel values of the image as a product of compositional generative models
$$p_\theta(x) \propto \prod_i p_\theta(x \mid \text{Enc}_i(x)),$$
where $\text{Enc}_i(\cdot)$ is a learned neural encoder with a low-dimensional latent output that encourages each component to capture a distinct region of the image. By training models to autoencode images with this likelihood expression, each component distribution $p_\theta(x \mid \text{Enc}_i(x))$ finds an interpretable decomposition of images corresponding to individual objects in a scene, as well as global factors of variation in the scene such as lighting (Du et al., 2021; Su et al., 2024). In Figure 12, we illustrate how discovered components $p_\theta(x \mid z_1)$ and $p_\theta(x \mid z_2)$ from a model trained on cubes and spheres, and $p_\phi(x \mid z_3)$ and $p_\phi(x \mid z_4)$ from a separate model trained on trucks and boots, can be composed together to form the distribution
$$p_\theta(x \mid z_1)\, p_\theta(x \mid z_2)\, p_\phi(x \mid z_3)\, p_\phi(x \mid z_4),$$
to construct hybrid scenes with objects from both datasets.
Figure 13: Composition of Discovered Relation Potentials. In a particle dataset, particles exhibit potentials corresponding to invisible springs between particles (Col. 1) or charges between particles (Col. 2). By swapping discovered probabilistic components between each pair of objects across particle systems, we can recombine the trajectories framed in green with the pair of edge potentials from the trajectories framed in red in Col. 3. Figure adapted from (Comas et al., 2023).
Discovering Relational Potentials. Given a trajectory $\tau$ of $N$ particles, we can similarly parameterize a probability distribution over the reconstruction of the particle system as a product of components defined over each pairwise interaction between particles
$$p_\theta(\tau) \propto \prod_{i, j\, \forall j \neq i} p_\theta(\tau \mid \text{Enc}_{ij}(\tau)),$$
where $\text{Enc}_{ij}(\tau)$ corresponds to a latent encoding of the interaction between particles $i$ and $j$. In Figure 13, we illustrate how relational potentials discovered on one particle system can be composed with relational potentials discovered on a separate set of forces to simulate those forces on the particle system.
Figure 14: Discovering Image Classes. Given a distribution of images drawn from 5 image classes in ImageNet, discovered components correspond to each image class. Components can further be composed together to form new images. Figure adapted from (Liu et al., 2023).
Discovering Object Classes from Image Distributions. Given a distribution of images $p(x)$ drawn from different classes in ImageNet, we can model the likelihood of the distribution as a composition
$$p_\theta(x) \propto p_\phi(w \mid x) \prod_i p_\theta^i(x)^{w_i},$$
where $w_i$ refers to the weighting coefficient for each component. In Figure 14, we illustrate that the discovered components in this setting represent each of the original ImageNet classes in the input distribution of images. We further illustrate how these discovered components can be composed together to generate images with multiple classes of objects.
5 Implementing Compositional Generation
In this section, we discuss some challenges with implementing compositional sampling with common generative model parameterizations and discuss a generative model parameterization that enables effective compositional generation. We then present some practical implementations of compositional sampling in both continuous and discrete domains.
5.1 Challenges With Sampling from Compositional Distributions
Given two probability densities $p_1(x)$ and $p_2(x)$, it is often difficult to directly sample from the product density $p_1(x)\, p_2(x)$. Existing generative models typically represent probability distributions in a factorized manner to enable efficient learning and sampling, such as at the token level in autoregressive models (Van Den Oord et al., 2016) or across various noise levels in diffusion models (Sohl-Dickstein et al., 2015). However, depending on the form of the factorization, the models may not be straightforward to compose.
For instance, consider two learned autoregressive factorizations $p_1(x_i \mid x_{0:i-1})$ and $p_2(x_i \mid x_{0:i-1})$ over sequences $x_{0:T}$. The autoregressive factorization of the product distribution $p_{\text{product}}(x) \propto p_1(x)\, p_2(x)$ corresponds to
$$p_{\text{product}}(x_i \mid x_{0:i-1}) = \sum_{x_{i+1:T}} p_1(x_{i+1:T} \mid x_{0:i})\, p_1(x_i \mid x_{0:i-1})\, p_2(x_{i+1:T} \mid x_{0:i})\, p_2(x_i \mid x_{0:i-1}),$$
where we need to marginalize over all possible future values of $x_{i+1:T}$. Since this marginalization differs depending on the value of $x_i$, $p_{\text{product}}(x_i \mid x_{0:i-1})$ is not equivalent to $p_1(x_i \mid x_{0:i-1})\, p_2(x_i \mid x_{0:i-1})$, and therefore autoregressive factorizations are not directly compositional. Similarly, two learned score functions from diffusion models are not directly composable, as they do not correspond to the noisy gradient of the product distribution (Du et al., 2023a).
While it is often difficult to combine generative models, representing the probability density explicitly enables us to combine models by manipulating their densities. One such approach is to represent the probability density as an Energy-Based Model (EBM), $p_i(x) \propto e^{-E_i(x)}$ (Hinton, 2002; Du & Mordatch, 2019). Under this parameterization, by definition, we can construct the product density
$$e^{-(E_1(x) + E_2(x))} \propto e^{-E_1(x)}\, e^{-E_2(x)}, \qquad (2)$$
corresponding to a new EBM with energy $E_1(x) + E_2(x)$. It is important to observe that EBMs generally represent probability densities in an unnormalized manner, and the product of two normalized probability densities $p_1(x)$ and $p_2(x)$ will be an unnormalized probability density as well (where the normalization constant is intractable to compute, as it requires marginalization over the sample space). Additional operations between probability densities such as mixtures and inversions of distributions can also be expressed as combinations of energy functions (Du et al., 2020a).
To generate samples from an EBM distribution, it is necessary to run Markov Chain Monte Carlo (MCMC) to iteratively refine a starting sample into one that has high likelihood (low energy) under the EBM. We present practical MCMC algorithms for sampling from composed distributions with EBMs in continuous spaces in Section 5.2 and in discrete spaces in Section 5.3. Recently, new methods for implementing compositional sampling using separately trained classifiers to efficiently specify each conditioned factor have also been developed (Garipov et al., 2023), which we encourage the reader to explore.
5.2 Effective Compositional Sampling on Continuous Distributions
Given a composed distribution represented as an EBM $E(x)$ defined over inputs $x \in \mathbb{R}^D$, directly finding a low-energy sample through MCMC becomes increasingly inefficient as the data dimension $D$ rises. To more effectively find low-energy samples of EBMs in high-dimensional continuous spaces, we can use the gradient of the energy function to help guide sampling. In Du & Mordatch (2019), Langevin dynamics is used to implement efficient sampling, where a sample can be repeatedly optimized using the expression
$$x_t = x_{t-1} - \lambda \nabla_x E(x_{t-1}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma),$$
where $x_0$ is initialized from uniform noise. By converting different operations such as products, mixtures, and inversions of probability distributions into composite energy functions, the above sampling procedure allows us to effectively sample from composed distributions (Du et al., 2020a).
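As a concrete illustration, the sketch below runs Langevin dynamics on the sum of two toy analytic energies (two unit-variance Gaussians, an assumption made purely for illustration), whose product is itself a Gaussian centered at the origin:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_composed_energy(x):
    # Gradient of E1(x) + E2(x) for two toy Gaussian energies centered at -1 and +1.
    return (x + 1.0) + (x - 1.0)

def langevin_sample(n_steps=2000, step=0.01, dim=2):
    x = rng.uniform(-3, 3, size=dim)      # x_0 initialized from uniform noise
    noise_scale = np.sqrt(2 * step)       # standard ULA noise scale
    for _ in range(n_steps):
        x = x - step * grad_composed_energy(x) + noise_scale * rng.normal(size=dim)
    return x

samples = np.stack([langevin_sample() for _ in range(200)])
# The product of N(-1, I) and N(+1, I) is a Gaussian centered at 0.
print("empirical mean of composed samples:", samples.mean(axis=0))
```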
There has been a substantial body of recent work on improving learning in EBMs (Du & Mordatch, 2019; Nijkamp et al., 2019; Grathwohl et al., 2019; Du et al., 2020b; Grathwohl et al., 2021), but EBMs still lag behind other generative approaches in efficiency and scalability of training. By leveraging the close connection between diffusion models and EBMs (Song & Ermon, 2019), we can also implement the compositional operations of EBMs with diffusion models (Du et al., 2023a), which we briefly describe below.
Given a diffusion model representing a distribution $p(x)$, we can interpret the $T$ learned denoising functions $\epsilon(x, t)$ of the diffusion model as representing $T$ separate EBM distributions $e^{-E(x,t)}$, where $\nabla_x E(x, t) = \epsilon(x, t)$. This sequence of EBM distributions transitions from $e^{-E(x,T)}$, representing the Gaussian distribution $\mathcal{N}(0, 1)$, to $e^{-E(x,0)}$, representing the target distribution $p(x)$. We can draw samples from this sequence of EBMs using annealed importance sampling (Du et al., 2023a), where we initialize a sample from Gaussian noise and sequentially run several steps of MCMC on each EBM distribution, starting at $e^{-E(x,T)}$ and ending at $e^{-E(x,0)}$.
This EBM interpretation of diffusion models allows them to be composed using operations such as Equation 2 by applying the operation to each intermediate EBM corresponding to the component diffusion distributions, for instance $e^{-(E_1(x,k) + E_2(x,k))}$. We can then use an annealed importance sampling procedure on this sequence of composite EBMs. Note that this annealed sampling procedure is necessary for accurate compositional sampling: using the reverse diffusion process directly on the composed score does not sample from the composed distribution (Du et al., 2023a).
A variety of MCMC samplers such as ULA, MALA, U-HMC, and HMC can be used as intermediate samplers for this sequence of EBM distributions. One transition kernel that is both easy to implement and easy to understand is the standard reverse diffusion sampling kernel applied at a fixed noise level. We illustrate in Appendix A that this is equivalent to running a ULA MCMC sampling step. This allows compositional sampling in diffusion models to be easily implemented by simply constructing the score function corresponding to the composite distribution we wish to sample from and then using the standard diffusion sampling procedure, but with the reverse diffusion step applied multiple times at each noise level.
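The overall procedure can be summarized by the sketch below, where `reverse_step(x, t)` is an assumed single reverse-diffusion step using the composed score (e.g., the sum of the component models' noise predictions) and `renoise(x, t)` is an assumed forward-diffusion step from level t-1 back to level t; repeating the pair at a fixed level acts as MCMC at that noise level:

```python
import torch

def compositional_diffusion_sample(reverse_step, renoise, n_levels, shape, n_inner=4):
    """Annealed compositional sampling with a diffusion model (a sketch).

    At each noise level t, the composed reverse kernel is applied n_inner times;
    after each application except the last, the sample is diffused back up to
    level t so the repeated kernel stays at a fixed noise level, and only the
    final repetition proceeds to the next (lower) level.
    """
    x = torch.randn(shape)
    for t in reversed(range(n_levels)):
        for k in range(n_inner):
            x_prev = reverse_step(x, t)
            x = renoise(x_prev, t) if (k < n_inner - 1 and t > 0) else x_prev
    return x
```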
5.3 Effective Compositional Sampling on Discrete Distributions
Given an EBM representing a composed distribution $E(x)$ on a high-dimensional discrete landscape, we can use Gibbs sampling to sample from the resultant distribution, where we repeatedly resample values of individual dimensions of $x$ using the marginal energy function $E(x_i \mid x_{-i})$. However, this process is increasingly inefficient as the underlying dimensionality of the data increases.
Using the gradient of the energy function $E(x)$ to accelerate sampling in a discrete landscape is difficult, as the gradient operation is not well defined in discrete space (though there are promising discrete analogs of gradient samplers (Grathwohl et al., 2021)). However, we can leverage our learned generative distributions to accelerate sampling by using one generative model as a proposal distribution and the remaining energy functions to implement a Metropolis-Hastings step (Li et al., 2022; Verkuil et al., 2022).
As an example, to sample from an energy function $E(x) = E_1(x) + E_2(x)$, given a current MCMC sample $x_t$, we can draw a new sample $x_{t+1}$ by sampling from the learned distribution $e^{-E_1(x)}$ and accepting the new sample $x_{t+1}$ with the Metropolis acceptance rate
$$a(x_{t+1}) = \text{clip}\!\left(e^{E_2(x_t) - E_2(x_{t+1})}, 0, 1\right).$$
This procedure allows us to leverage $e^{-E_1(x)}$ to guide sampling from $e^{-E(x)}$.
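A minimal sketch of this proposal-and-accept scheme is shown below; `sample_from_p1` is an assumed sampler for the first model (e.g., an autoregressive generator) and `E2` is the energy of the second factor:

```python
import numpy as np

def metropolis_compose(sample_from_p1, E2, x_init, n_steps=1000, seed=0):
    """Sample from e^{-(E1 + E2)} using p1 ∝ e^{-E1} as an independence proposal.

    Proposals drawn from p1 are accepted with probability
    min(1, exp(E2(x_t) - E2(x_{t+1}))), which is the Metropolis-Hastings rule for
    this proposal and leaves the composed distribution invariant.
    """
    rng = np.random.default_rng(seed)
    x = x_init
    for _ in range(n_steps):
        x_proposed = sample_from_p1()
        accept_prob = min(1.0, np.exp(E2(x) - E2(x_proposed)))
        if rng.random() < accept_prob:
            x = x_proposed
    return x
```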
6 Discussion and Future Directions
Most recent research on building generative models has focused on increasing the computational scale and data on which models are trained. We have presented an orthogonal direction to constructing complex generative systems: building them compositionally, combining simpler generative models to form more complex ones. We have illustrated how such systems can be more data- and computation-efficient to learn, how they enable flexible reprogramming to new tasks, and how their components can be discovered from raw data.
Figure 15: Decentralized Decision Making. By composing generative models operating over various modalities we can construct decentralized architectures for intelligent agents. Communication between models is induced by inference over the joint distribution.
Such compositional systems have additional benefits in terms of both buildability and interpretability. As individual models are responsible for independent subsets of data, each model can be built separately and modularly by different institutions. Simultaneously, at execution time, it is significantly easier to understand and monitor the execution of each simpler constituent model than a single large monolithic model.
In addition, such compositional systems can be more environmentally friendly and easier to deploy than large monolithic models. Since individual models are substantially smaller, they can run efficiently using small amounts of computation. In addition, it is more straightforward to deploy separate models across separate computational machines.
In the setting of constructing an artificially intelligent agent, such a compositional architecture may take the form of the decentralized decision-making system shown in Figure 15. In this system, separate generative models are responsible for processing each modality the agent receives, while other models are responsible for decision-making. Sampling from the composed generative distribution of these models corresponds to message passing between them, inducing cross-communication similar to a set of daemons communicating with each other (Selfridge, 1988). Individual generative models in this architecture can be substituted with existing models, such as LLMs for proposing plausible plans of action and text-to-video models for predicting future world states; a minimal sketch of this inference-as-communication view is given below.
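The following sketch is an illustrative addition of ours, not an implementation of Figure 15: two hypothetical per-modality models each expose only a score (energy gradient) over a shared plan variable, and Langevin sampling of the composed distribution reconciles them without any explicit message protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality models, each exposing only a score (energy gradient)
# over a shared "plan" vector z; the models never talk to each other directly.
vision_target = np.array([1.0, 0.0])    # plan the vision model finds plausible
language_target = np.array([0.0, 1.0])  # plan the language model finds plausible

def vision_score(z):
    # Gradient of E_vision(z) = 0.5 * ||z - vision_target||^2.
    return z - vision_target

def language_score(z):
    # Gradient of E_language(z) = 0.5 * ||z - language_target||^2.
    return z - language_target

def sample_plan(n_steps=2000, eta=1e-2, burn_in=1000):
    """Langevin sampling of p(z) proportional to exp(-(E_vision(z) + E_language(z)))."""
    z = rng.normal(size=2)
    trace = []
    for step in range(n_steps):
        # Each model contributes its gradient; summing them is the only "communication".
        grad = vision_score(z) + language_score(z)
        z = z - eta * grad + np.sqrt(2 * eta) * rng.normal(size=2)
        if step >= burn_in:
            trace.append(z)
    return np.mean(trace, axis=0)

# The estimated posterior mean is [0.5, 0.5]: a plan reconciling both models.
print(sample_plan())
```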
Finally, while we have provided a few promising results on applications of compositional generative modeling, there are many limitations to address in future work. First, the current work on compositional modeling assumes a fixed prespecified structure through which models are composed, limiting generalization to new distributions. To flexibly apply compositional models across new tasks, it would be important to construct systems that can instead automatically discover the correct compositional structure between models as well as the appropriate per-model weighting.
Second, current work on discovering compositional structure assumes that data naturally factorizes into an independent product of components. In many real-world settings, gathered data will often exhibit spurious correlations that violate such independence assumptions, causing existing algorithms to fail to discover the correct structure. Exploring more robust approaches to discovering compositional structure, such as incorporating prior knowledge or actively intervening in the environment, is a rich direction for future work.
Lastly, while our focus in this position paper has been on combining separately trained generative models, it would be valuable to theoretically characterize compositional generalization in such systems and to explore alternative approaches for improving it. Past theoretical work has characterized compositional generalization in additive models (Wiedemer et al., 2024; Lachapelle et al., 2024), and it would be interesting to extend such analysis to compositional generative modeling. It would also be worthwhile to explore adding explicit compositional structure to individual models to improve compositional generalization (Misino et al., 2022; Sehgal et al., 2023).
Acknowledgements
We acknowledge support from NSF grant 2214177; from AFOSR grant FA9550-22-1-0249; from ONR MURI grant N00014-22-1-2740; and from ARO grant W911NF-23-1-0034. Yilun Du is supported by an NSF Graduate Fellowship.
Impact Statement
In this paper, we argue that generative models should be built compositionally from simpler individual parts, and we illustrate how this enables more data-efficient generative modeling. As generative models are increasingly deployed in production, we believe such an approach can significantly broaden their impact, enabling them to be deployed in a variety of domains with limited data.
References
- Ajay et al. (2022) Ajay, A., Du, Y., Gupta, A., Tenenbaum, J., Jaakkola, T., and Agrawal, P. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
- Ajay et al. (2023) Ajay, A., Han, S., Du, Y., Li, S., Gupta, A., Jaakkola, T., Tenenbaum, J., Kaelbling, L., Srivastava, A., and Agrawal, P. Compositional foundation models for hierarchical planning. arXiv preprint arXiv:2309.08587, 2023.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
- Canonne (2020) Canonne, C.L. A short note on learning discrete distributions. arXiv preprint arXiv:2002.11457, 2020.
- Comas et al. (2023) Comas, A., Du, Y., Lopez, C.F., Ghimire, S., Sznaier, M., Tenenbaum, J.B., and Camps, O. Inferring relational potentials in interacting systems. In International Conference on Machine Learning, pp. 6364–6383. PMLR, 2023.
- Dao et al. (2022) Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Du & Mordatch (2019) Du, Y. and Mordatch, I. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
- Du et al. (2020a) Du, Y., Li, S., and Mordatch, I. Compositional visual generation with energy based models. Advances in Neural Information Processing Systems, 33:6637–6647, 2020a.
- Du et al. (2020b) Du, Y., Li, S., Tenenbaum, J., and Mordatch, I. Improved contrastive divergence training of energy based models. arXiv preprint arXiv:2012.01316, 2020b.
- Du et al. (2021) Du, Y., Li, S., Sharma, Y., Tenenbaum, B.J., and Mordatch, I. Unsupervised learning of compositional energy concepts. In Advances in Neural Information Processing Systems, 2021.
- Du et al. (2023a) Du, Y., Durkan, C., Strudel, R., Tenenbaum, J.B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W.S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International Conference on Machine Learning, pp. 8489–8510. PMLR, 2023a.
- Du et al. (2023b) Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023b.
- Dziri et al. (2023) Dziri, N., Lu, X., Sclar, M., Li, X.L., Jian, L., Lin, B.Y., West, P., Bhagavatula, C., Bras, R.L., Hwang, J.D., et al. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654, 2023.
- Epoch (2023) Epoch. Key trends and figures in machine learning, 2023. URL https://epochai.org/trends. Accessed: 2024-01-23.
- Garipov et al. (2023) Garipov, T., De Peuter, S., Yang, G., Garg, V., Kaski, S., and Jaakkola, T. Compositional sculpting of iterative generative processes. arXiv preprint arXiv:2309.16115, 2023.
- Grathwohl et al. (2019) Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.
- Grathwohl et al. (2021) Grathwohl, W., Swersky, K., Hashemi, M., Duvenaud, D., and Maddison, C. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pp. 3831–3841. PMLR, 2021.
- Hendrycks et al. (2021) Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Hinton (2002) Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
- Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
- Hoffmann et al. (2022) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d.L., Hendricks, L.A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Janner et al. (2022) Janner, M., Du, Y., Tenenbaum, J.B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
- Koller & Friedman (2009) Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.
- Kwon et al. (2023) Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626, 2023.
- Lachapelle et al. (2024) Lachapelle, S., Mahajan, D., Mitliagkas, I., and Lacoste-Julien, S. Additive decoders for latent variables identification and cartesian-product extrapolation. Advances in Neural Information Processing Systems, 36, 2024.
- Li et al. (2022) Li, S., Du, Y., Tenenbaum, J.B., Torralba, A., and Mordatch, I. Composing ensembles of pre-trained models via iterative consensus. arXiv preprint arXiv:2210.11522, 2022.
- Liu et al. (2022) Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J.B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp. 423–439. Springer, 2022.
- Liu et al. (2023) Liu, N., Du, Y., Li, S., Tenenbaum, J.B., and Torralba, A. Unsupervised compositional concepts discovery with text-to-image generative models. arXiv preprint arXiv:2306.05357, 2023.
- Lu et al. (2023) Lu, S., Bigoulaeva, I., Sachdeva, R., Madabushi, H.T., and Gurevych, I. Are emergent abilities in large language models just in-context learning? arXiv preprint arXiv:2309.01809, 2023.
- Majumdar et al. (2023) Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, S., and Rajeswaran, A. Openeqa: Embodied question answering in the era of foundation models. 2023.
- Mishra et al. (2023) Mishra, U.A., Xue, S., Chen, Y., and Xu, D. Generative skill chaining: Long-horizon skill planning with diffusion models. In Conference on Robot Learning, pp. 2905–2925. PMLR, 2023.
- Misino et al. (2022) Misino, E., Marra, G., and Sansone, E. Vael: Bridging variational autoencoders and probabilistic logic programming. Advances in Neural Information Processing Systems, 35:4667–4679, 2022.
- Murphy (2022) Murphy, K.P. Probabilistic machine learning: an introduction. MIT press, 2022.
- Nijkamp et al. (2019) Nijkamp, E., Hill, M., Han, T., Zhu, S.-C., and Wu, Y.N. On the anatomy of mcmc-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370, 2019.
- OpenAI. URL https://openai.com/pricing.
- Reed et al. (2022) Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S.G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J.T., et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
- Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models, 2022.
- Schaeffer et al. (2023) Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
- Sehgal et al. (2023) Sehgal, A., Grayeli, A., Sun, J.J., and Chaudhuri, S. Neurosymbolic grounding for compositional world models. arXiv preprint arXiv:2310.12690, 2023.
- Selfridge (1988) Selfridge, O.G. Pandemonium: A paradigm for learning. In Neurocomputing: Foundations of research, pp. 115–122. 1988.
- Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E.A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
- Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
- Su et al. (2024) Su, J., Liu, N., Tenenbaum, J.B., and Du, Y. Compositional image decomposition with diffusion models, 2024. URL https://openreview.net/forum?id=88FcNOwNvM.
- Tamkin et al. (2021) Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021.
- Van Den Oord et al. (2016) Van Den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International conference on machine learning, pp. 1747–1756. PMLR, 2016.
- Verkuil et al. (2022) Verkuil, R., Kabeli, O., Du, Y., Wicky, B.I., Milles, L.F., Dauparas, J., Baker, D., Ovchinnikov, S., Sercu, T., and Rives, A. Language models generalize beyond natural proteins. bioRxiv, pp. 2022–12, 2022.
- Vincent (2011) Vincent, P. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
- Wang et al. (2024) Wang, L., Zhao, J., Du, Y., Adelson, E.H., and Tedrake, R. Poco: Policy composition from and for heterogeneous robot learning. arXiv preprint arXiv:2402.02511, 2024.
- Wiedemer et al. (2024) Wiedemer, T., Mayilvahanan, P., Bethge, M., and Brendel, W. Compositional generalization from first principles. Advances in Neural Information Processing Systems, 36, 2024.
- Yadlowsky et al. (2023) Yadlowsky, S., Doshi, L., and Tripuraneni, N. Pretraining data mixtures enable narrow model selection capabilities in transformer models. arXiv preprint arXiv:2311.00871, 2023.
- Yang et al. (2023a) Yang, M., Du, Y., Dai, B., Schuurmans, D., Tenenbaum, J.B., and Abbeel, P. Probabilistic adaptation of text-to-video models. arXiv preprint arXiv:2306.01872, 2023a.
- Yang et al. (2023b) Yang, Z., Mao, J., Du, Y., Wu, J., Tenenbaum, J.B., Lozano-Pérez, T., and Kaelbling, L.P. Compositional diffusion-based continuous constraint solvers. arXiv preprint arXiv:2309.00966, 2023b.
- Zhang et al. (2023a) Zhang, Q., Song, J., Huang, X., Chen, Y., and Liu, M.-Y. Diffcollage: Parallel generation of large content with diffusion models. arXiv preprint arXiv:2303.17076, 2023a.
- Zhang et al. (2023b) Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023b.
Appendix
Appendix A Implementing ULA Transitions as Multiple Reverse Diffusion Steps
We illustrate how a step of reverse sampling on a diffusion model at a fixed noise level is equivalent to ULA MCMC sampling at that same noise level. We use the $\alpha_t$ and $\beta_t$ formulation from (Ho et al., 2020). The reverse sampling step on an input $x_t$ at the fixed noise level of timestep $t$ is given by a Gaussian with mean
$$\mu_\theta(x_t, t) = x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)$$
and variance $\beta_t$ (using the small-variance noise schedule from (Ho et al., 2020)). This corresponds to the sampling update
$$x_{t+1} = x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) + \beta_t \xi, \quad \xi \sim \mathcal{N}(0, 1).$$
Note that the expression $\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$ corresponds to the score $\nabla_x p_t(x)$ through the denoising score matching objective (Vincent, 2011), where the EBM $p_t(x)$ corresponds to the data distribution perturbed with $t$ steps of noise. The reverse sampling step can therefore be equivalently written as
$$x_{t+1} = x_t - \beta_t \nabla_x p_t(x) + \beta_t \xi, \quad \xi \sim \mathcal{N}(0, 1). \qquad \text{(A1)}$$
The ULA sampler draws an MCMC sample from the EBM probability distribution $p_t(x)$ using the update
$$x_{t+1} = x_t - \eta \nabla_x p_t(x) + \sqrt{2}\,\eta\,\xi, \quad \xi \sim \mathcal{N}(0, 1), \qquad \text{(A2)}$$
where $\eta$ is the sampling step size.
Substituting $\eta = \beta_t$ into the ULA sampler gives
$$x_{t+1} = x_t - \beta_t \nabla_x p_t(x) + \sqrt{2}\,\beta_t\,\xi, \quad \xi \sim \mathcal{N}(0, 1). \qquad \text{(A3)}$$
Note the similarity between ULA sampling in Eqn A3 and the reverse sampling procedure in Eqn A1: the only difference is a factor of $\sqrt{2}$ scaling on the Gaussian noise added in the ULA procedure. This means we can implement ULA sampling by running the standard reverse process while scaling the noise added at each timestep by a factor of $\sqrt{2}$. Alternatively, we can directly use the reverse sampling procedure in Eqn A1 to run ULA, which then corresponds to sampling a tempered variant of $p_t(x)$ with temperature $\frac{1}{\sqrt{2}}$ (corresponding to less stochastic samples from the composed probability distribution).
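As a concrete illustration of this equivalence, the sketch below (our own addition, using a hypothetical noise schedule and a placeholder denoiser in place of a trained model) implements the reverse update of Eqn A1 and the ULA update of Eqn A3; the two differ only in the $\sqrt{2}$ scaling of the injected noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise schedule and denoiser standing in for a trained model.
T = 100
betas = np.linspace(1e-4, 2e-2, T)
alphas_bar = np.cumprod(1.0 - betas)

def epsilon_theta(x, t):
    # Placeholder denoiser: for standard-normal data, the optimal noise
    # prediction is sqrt(1 - alpha_bar_t) * x.
    return x * np.sqrt(1.0 - alphas_bar[t])

def reverse_step(x, t):
    """One reverse diffusion step at fixed noise level t (Eqn A1 form)."""
    mean = x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * epsilon_theta(x, t)
    return mean + betas[t] * rng.normal(size=x.shape)

def ula_step(x, t):
    """One ULA step at noise level t: same update, noise scaled by sqrt(2) (Eqn A3)."""
    mean = x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * epsilon_theta(x, t)
    return mean + np.sqrt(2.0) * betas[t] * rng.normal(size=x.shape)

# Run several ULA steps at a single fixed noise level t, as in the appendix title:
# multiple reverse diffusion steps (with rescaled noise) implement ULA transitions.
x = rng.normal(size=(4,))
t = 50
for _ in range(20):
    x = ula_step(x, t)
print(x)
```

In the composed setting, several such ULA steps can be run at each fixed noise level of the reverse process before moving to the next, lower noise level.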