# LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models
Junyi Zhang¹\* Jiaqi Guo² Shizhao Sun² Jian-Guang Lou² Dongmei Zhang² ¹Shanghai Jiao Tong University ²Microsoft Research Asia junyizhang@sjtu.edu.cn {jiaqiguo, shizsu, jlou, dongmeiz}@microsoft.com
# Abstract
Creating graphic layouts is a fundamental step in graphic designs. In this work, we present a novel generative model named LayoutDiffusion for automatic layout generation. As layout is typically represented as a sequence of discrete tokens, LayoutDiffusion models layout generation as a discrete denoising diffusion process. It learns to reverse a mild forward process, in which layouts become increasingly chaotic with the growth of forward steps and layouts in the neighboring steps do not differ too much. Designing such a mild forward process is, however, very challenging, as layout has both categorical attributes and ordinal attributes. To tackle the challenge, we summarize three critical factors for achieving a mild forward process for the layout, i.e., legality, coordinate proximity and type disruption. Based on the factors, we propose a block-wise transition matrix coupled with a piece-wise linear noise schedule. Experiments on RICO and PubLayNet datasets show that LayoutDiffusion outperforms state-of-the-art approaches significantly. Moreover, it enables two conditional layout generation tasks in a plug-and-play manner without re-training and achieves better performance than existing methods. Project page: https://layoutdiffusion.github.io.
# 1. Introduction
Graphic layout, i.e., the sizes and positions of elements, is important to the interaction between the viewer and the information. Recently, layout generation has attracted growing research interest. Leading approaches [11,16,17,20] often represent a layout as a sequence of elements and leverage Transformer [40] to model element relationships. As the placement of one element could depend on any part of a layout, global context modeling plays a critical role in layout generation. However, there is no satisfactory solution to it. Some studies simply consider biased context [1,11,16,17]. They generate layout sequences autoregressively, where the generation order for elements is predefined and the placement of one element only depends on a certain part of the layout. A few other studies try to utilize global context by non-autoregressive generation [22]. Unfortunately, they fail to improve the generation quality significantly, since it is too challenging to generate a sequence in a single pass [7].

(d) The mapping between colors and element types used in Figs. 1a to 1c
Figure 1. Comparison of different forward corruption processes. We sample the layouts at the timesteps 0, 1/6, 2/6, 3/6, 4/6, 5/6, and 1 of the total timestep. A blank page is used when the format of the layout sequence is destroyed.
Meanwhile, the emerging diffusion probabilistic model (DDPM) [13,38] achieves amazing performance on many generation tasks [12,21,33-36,44]. It consists of multiple rounds, each of which gradually denoises the latent variables towards the desired data distribution. This sort of process seems to be a promising solution to layout generation. First, the layout generated in the last round can serve as the global context for the generation in the next round. Second, by multiple rounds of denoising, a layout can be refined iteratively, overcoming the challenge of single-pass generation from non-autoregressive models.
To this end, we propose LayoutDiffusion to improve graphic layout generation. As a layout is represented as a sequence of discrete tokens [11,16,20], we formulate layout generation as a discrete diffusion process. Roughly speaking, it samples a layout by reversing a forward process. The forward process corrupts the real data into a sequence of increasingly noisy latent variables by a fixed Markov chain. The reverse process starts from noise and denoises it step by step via learning the posterior distribution.
To ease the estimation of the posterior distribution, it is critical to design a mild forward corruption process [29], in which latent variables in neighboring steps do not differ too much and become increasingly chaotic with the growth of forward steps (see Fig. 1a). However, designing such a process for layout is non-trivial, due to the heterogeneous nature of the layout sequence, where the tokens representing element types are categorical while the tokens representing element coordinates are ordinal. Existing discrete forward processes hardly consider heterogeneous tokens. Directly applying them to layout data often leads to harsh corruptions, where a layout is changed dramatically at each step (see Figs. 1b and 1c). For example, the uniform process in Fig. 1c may transition an element type token to a coordinate token, drastically violating the layout semantics.
To realize a mild corruption process for layout, we make three important observations. (i) Legality. The transition between type tokens and coordinate tokens will lead to an illegal layout sequence, resulting in an unpredictable change between forward steps. Hence, it is vital to impose legality during the corruption process. (ii) Coordinate Proximity. Coordinate tokens are ordinal, and thus transitioning a coordinate token to its proximal tokens (e.g., from 0 to 1) will introduce a milder change to a layout compared with transitioning to distant ones (e.g., from 0 to 127). (iii) Type Disruption. Unlike coordinate tokens, type tokens are categorical and do not have particular proximity. Simply transitioning one type to another may cause abrupt semantic changes to a layout (e.g., from a button to a background image).
Motivated by the above observations, we propose a block-wise transition matrix coupled with a piece-wise linear noise schedule in LayoutDiffusion. The transition matrix is designed as follows. First, to achieve legality, we only allow the internal transition between coordinate tokens and that between type tokens. Second, regarding coordinate proximity, we leverage the discretized Gaussian [2], where the transition between more proximal tokens takes a higher probability, for the transition between coordinate tokens. Third, as for type disruption, we introduce an absorbing state [2]. Each type token either stays the same or transitions to the absorbing state. To further alleviate type disruption, we propose a piece-wise linear noise schedule to make the transition for element types occur only in the late stage of the forward process. With the above techniques, LayoutDiffusion achieves the mild forward process shown in Fig. 1a.
Our design also enables LayoutDiffusion to perform certain conditional layout generation tasks in a plug-and-play manner without re-training, which has never been explored by previous work. Specifically, owing to the mild forward process achieved by LayoutDiffusion, its reverse process iteratively improves a layout, which naturally supports the task of layout refinement [32]. Besides, as the transition of element types only occurs in the late forward process, LayoutDiffusion determines the element types in a layout quickly in the reverse process. Thus, it can perform generation conditioned on types by simply keeping the types fixed and running the reverse process.
In summary, this work makes four key contributions:
1. We formulate layout generation as a discrete diffusion process, which addresses biased context modeling by iterative refinement from a non-autoregressive model.
2. We design a new diffusion process based on the heterogeneous nature of the layout sequence (legality, coordinate proximity and type disruption). It not only better suits layout data but also showcases a promising way of applying diffusion models to other heterogeneous data.
3. We enable certain conditional layout generation tasks in a plug-and-play manner without re-training.
4. We conduct extensive experiments and user studies. LayoutDiffusion outperforms existing methods on all the tasks in terms of most evaluation metrics, even though it is not re-trained for the conditional generation tasks.
# 2. Related Work
Graphic Layout Generation. Early work on graphic layout generation has explored classical optimization approaches [30,31], as well as generative models such as Generative Adversarial Networks (GANs) [19,23] and Variational Autoencoders (VAEs) [1,17,18,42].
Recently, inspired by the success of NLP, masking strategies [20], language models [11], and encoder-decoder architectures [16] have been studied. These approaches represent the layout as a sequence of elements and use Transformer [40] as the basic model architecture. As the placement of one element can depend on any part of a layout, one critical issue in layout generation is global context modeling. Some previous studies introduce unnatural biases and fail to model global context effectively [1,11,16,17,42]. They generate the layout sequence in an autoregressive manner, where there is a predefined generation order and the placement of one element can only depend on the generated part of the layout. On the other hand, a few other studies consider global context but do not achieve significantly better performance [19,20,23]. They generate the layout sequence in a non-autoregressive manner, where there is no predefined generation order and all the tokens are generated in parallel. However, generating a sequence in a single pass is too challenging [8]. BLT [20] explored an iterative refinement mechanism to alleviate the difficulty. However, it relies on heuristic rules instead of being learned from the data. The above limitations motivate us to seek a better model for layout generation. We think diffusion models are well-suited: by multiple rounds of denoising, they naturally take the layout from the last step as the global context and generate a layout iteratively instead of in a single pass.
Another branch of studies has explored incorporating diverse user constraints into layout generation [22,24,32,45,47]. They treat layout generation tasks with different constraints separately, which introduces repetitive training and hinders knowledge sharing across different tasks. By utilizing the flexible forward process of diffusion models [3], we enable some conditional generation tasks without re-training for the first time, which can potentially be extended to handle more conditional generation tasks.
Diffusion Models for Discrete Data. Diffusion models on continuous data have achieved outstanding results [43]. Recently, diffusion models on discrete data are also emerging. They can be grouped into two categories. The first category [5,9,25] maps discrete data to a continuous state space via a learnable or fixed embedding, and then utilizes techniques from classical continuous diffusion models. These approaches enable simple technology migration from continuous diffusion models, but make fine-grained control of the forward corruption process much more difficult. The other category [2,4,10,15,39,41] chooses to directly perform diffusion in discrete state space by modeling the forward corruption process as a random walk between different states. This category makes it easy to incorporate domain-dependent structure into the transition matrices and thus enables flexible control of the forward process. Different from the discrete data (e.g., images and texts) explored by previous work, the layout data studied in this work is heterogeneous by nature. Thus, we fully consider this characteristic and propose a new transition matrix coupled with noise schedules to achieve a mild corruption process.
# 3. Problem Formulation
Graphic Layout. A graphic layout $x$ is composed of a set of graphic elements $\{ e_i \}_{i=1}^{N}$, where $N$ denotes the number of elements. Each element $e_i$ has an element type $c_i$ and a bounding box indicating its left $l_i$, top $t_i$, right $r_i$, and bottom $b_i$ coordinates. Following advanced layout generation methods [1,11,17,27,32,42], we represent an element as a sequence of 5 discrete tokens, i.e., $e_i = \{ c_i\, l_i\, t_i\, r_i\, b_i \}$, where the continuous bounding box coordinates are uniformly discretized into integers in $[0, K)$. Then, we represent a layout as a concatenation of element sequences:
$$
\mathbf { x } = \left\{ \langle \mathrm { s o s } \rangle c _ { 1 } l _ { 1 } t _ { 1 } r _ { 1 } b _ { 1 } \| \dots \| c _ { N } l _ { N } t _ { N } r _ { N } b _ { N } \langle \mathrm { e o s } \rangle \right\} ,
$$

where $\langle \mathrm{sos} \rangle$ and $\langle \mathrm{eos} \rangle$ are special tokens indicating the start and end of a sequence, and the token $\|$ indicates the separator between any two elements. Obviously, the layout sequence is heterogeneous: the element type tokens are categorical, while the coordinate tokens are ordinal.
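As a concrete illustration, the serialization above can be sketched in a few lines of Python. This is not the authors' code: the choice $K = 128$, the token spellings, and the helper names are assumptions for this sketch.

```python
K = 128  # number of discrete coordinate bins (an assumed value for this sketch)

def discretize(v, K=K):
    """Uniformly discretize a coordinate in [0, 1) to an integer in [0, K)."""
    return min(int(v * K), K - 1)

def serialize(elements):
    """elements: list of (type, left, top, right, bottom) with coordinates in [0, 1)."""
    tokens = ["<sos>"]
    for i, (c, l, t, r, b) in enumerate(elements):
        if i > 0:
            tokens.append("|")  # separator token between elements
        tokens += [c] + [str(discretize(v)) for v in (l, t, r, b)]
    tokens.append("<eos>")
    return tokens

seq = serialize([("text", 0.1, 0.1, 0.9, 0.2), ("button", 0.4, 0.8, 0.6, 0.9)])
print(seq)
```

The resulting sequence has the shape of Eq. (1): a type token followed by four coordinate tokens per element, with separators in between.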
Graphic Layout Generation. In this work, we primarily focus on unconditional layout generation. Specifically, we learn a generative model $p _ { \theta } ( \mathbf { x } )$ parameterized by $\theta$ ,which synthesizes diverse and high-quality graphic layouts.
# 4. LayoutDiffusion
We formulate the layout generation problem as a discrete denoising diffusion process (see Fig.2). It consists of two Markov chains,where the forward process is hand-designed and fixed while the reverse process is parameterized.
Given a real layout $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, the forward process corrupts it into a sequence of increasingly noisy latent variables $\mathbf{x}_{1:T} = \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$:
$$
\begin{array} { l } { \displaystyle q ( \mathbf { x } _ { 1 : T } | \mathbf { x } _ { 0 } ) = \prod _ { t = 1 } ^ { T } q ( \mathbf { x } _ { t } | \mathbf { x } _ { t - 1 } ) , } \\ { \displaystyle q ( x _ { t } | x _ { t - 1 } ) = x _ { t } \mathbf { Q } _ { t } x _ { t - 1 } . } \end{array}
$$

Here, $x_t$ denotes the one-hot version of a single discrete token in the layout sequence $\mathbf{x}_t$; $\mathbf{Q}_t$ is the transition matrix, where $[\mathbf{Q}_t]_{ij} = q(x_t = j | x_{t-1} = i)$ represents the probability that $x_{t-1}$ transitions to $x_t$. Due to the Markov property, the cumulative probability of $x_t$ at an arbitrary timestep given $x_0$ can be derived as $q(x_t | x_0) = \mathbf{x}_t \overline{\mathbf{Q}}_t \mathbf{x}_0$, where $\overline{\mathbf{Q}}_t = \mathbf{Q}_1 \mathbf{Q}_2 \cdots \mathbf{Q}_t$ (refer to [2] for details).
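The cumulative transition $\overline{\mathbf{Q}}_t = \mathbf{Q}_1 \mathbf{Q}_2 \cdots \mathbf{Q}_t$ can be checked numerically. The toy uniform transition matrix below is an assumption chosen for illustration only, not one of the matrices used by LayoutDiffusion.

```python
import numpy as np

def uniform_Q(beta, V):
    # stay in place with probability 1 - beta, otherwise jump to a uniform state
    return (1 - beta) * np.eye(V) + beta * np.ones((V, V)) / V

V, T = 3, 4
# cumulative transition matrix over T forward steps
Q_bar = np.linalg.multi_dot([uniform_Q(0.1 * (t + 1), V) for t in range(T)])

x0 = np.eye(V)[0]        # one-hot x_0
marginal = x0 @ Q_bar    # q(x_T | x_0) obtained in a single matrix product
print(marginal)
```

Each row of `Q_bar` remains a valid distribution, so the marginal at any timestep comes for free from the matrix product, which is what makes training with arbitrary $t$ efficient.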
To generate a layout, the reverse process starts with a random noise $\mathbf { x } _ { T }$ and gradually recovers it relying on the learned posterior distribution $p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } )$ ,
$$
p _ { \theta } ( \mathbf { x } _ { 0 : T } ) = p ( \mathbf { x } _ { T } ) \prod _ { t = 1 } ^ { T } p _ { \theta } ( \mathbf { x } _ { t - 1 } \vert \mathbf { x } _ { t } ) .
$$

In the following, we introduce how to construct a mild forward process $q(\mathbf{x}_t | \mathbf{x}_{t-1})$ for layout generation (Sec. 4.1), and how to learn the generative model $p_\theta(\mathbf{x}_0)$ in the reverse process (Sec. 4.2).
# 4.1. Forward Process
In LayoutDiffusion, we propose a block-wise transition matrix $\mathbf{Q}_t$ and a piece-wise linear noise schedule to realize a mild forward process, in which layouts in the neighboring steps do not differ too much and become increasingly disordered as the forward step grows (see Fig. 1a).
The design of the transition matrix and noise schedule stems from our three important observations. (i) Legality. As defined in Sec. 3, the layout sequence has a rigorous format. Any transition between element type tokens and coordinate tokens will lead to an illegal layout sequence, resulting in a disruptive change between forward steps. Hence, it is vital to impose sequence legality in the transition matrix. (ii) Coordinate Proximity. Coordinate tokens in the layout sequence are ordinal and have a meaningful proximity. Transitioning a coordinate token to its proximal tokens (e.g., from 0 to 1) will introduce a milder change to a layout, compared with transitioning to distant ones (e.g., from 0 to 127). Thus, it is helpful to encode the proximity prior in the transition matrix. (iii) Type Disruption. Type tokens are categorical and do not present particular proximity. Each type of element has its unique coordinate distribution. For example, a background image tends to have a large size, while a button has a small size. Transitioning one type to another may produce an abnormal element (e.g., a button that has a large size and is placed at the top-left corner), leading to abrupt changes in the layout. This is also consistent with the observation from diffusion models on other categorical data, e.g., latent codes and text [2,10]. Therefore, it is beneficial to alleviate type disruption in the transition matrix and noise schedule.

Figure 2. An illustration of LayoutDiffusion. In the forward process, the coordinates are gradually corrupted into a stationary distribution, and the element types are absorbed into MASK in the late stage. In the reverse process, the element types are first decoded, and then the coordinates are gradually refined. For brevity, only two elements are shown, while the other elements and the special tokens are omitted.
Transition Matrices. There are three kinds of tokens in the layout sequence: type tokens (i.e., $c_i$), coordinate tokens (i.e., $l_i$, $t_i$, $r_i$ and $b_i$) and special tokens (i.e., $\langle \mathrm{sos} \rangle$, $\langle \mathrm{eos} \rangle$, $\|$, and PAD). Denote the numbers of different coordinate tokens and type tokens as $K$ and $C$, respectively. Then, the transition matrix is denoted as $\mathbf{Q}_t \in \mathbb{R}^{V \times V}$, where $V = K + C + 4$.
To achieve the legality of the layout sequence, we only allow the internal transition within each kind of token. Thus, $\mathbf{Q}_t$ can be reduced to a block-wise diagonal matrix,

$$
\mathbf { Q } _ { t } = \left[ \begin{array} { c c c } { \mathbf { Q } _ { t } ^ { \mathrm { c o o r d } } } & & \\ & { \mathbf { Q } _ { t } ^ { \mathrm { t y p e } } } & \\ & & { \mathbf { Q } _ { t } ^ { \mathrm { s p e c } } } \end{array} \right] ,
$$

where $\mathbf{Q}_t^{\mathrm{coord}}$, $\mathbf{Q}_t^{\mathrm{type}}$ and $\mathbf{Q}_t^{\mathrm{spec}}$ depict the probabilities of the internal transitions within coordinate tokens, type tokens and special tokens, respectively.

For $\mathbf{Q}_t^{\mathrm{coord}}$, to encode the ordinal proximity, we introduce the discretized Gaussian matrix [2] for coordinate tokens, which assigns a higher probability to the transition between more proximal tokens,

$$
\left[ \mathbf { Q } _ { t } ^ { \mathrm { c o o r d } } \right] _ { i j } = \left\{ \begin{array} { l l } { \frac { \exp { \left( - \frac { 4 | i - j | ^ { 2 } } { ( K - 1 ) ^ { 2 } \beta _ { t } } \right) } } { \sum _ { n = - ( K - 1 ) } ^ { ( K - 1 ) } \exp { \left( - \frac { 4 n ^ { 2 } } { ( K - 1 ) ^ { 2 } \beta _ { t } } \right) } } , } & { i \neq j } \\ { 1 - \sum _ { l = 0 , l \neq i } ^ { ( K - 1 ) } [ \mathbf { Q } _ { t } ^ { \mathrm { c o o r d } } ] _ { i l } , } & { i = j } \end{array} \right.
$$

where the parameter $\beta_t$ influences the variance of the forward process distributions.

For $\mathbf{Q}_t^{\mathrm{type}}$, to alleviate the type disruption, we choose to transition a type token to a special MASK token instead of another meaningful type token. Therefore, we introduce the absorbing state transition matrix [2] for type tokens,

$$
\mathbf { Q } _ { t } ^ { \mathrm { t y p e } } = \left[ \begin{array} { c c c c } { 1 - \gamma _ { t } } & { 0 } & { \cdots } & { 0 } \\ { 0 } & { 1 - \gamma _ { t } } & { \cdots } & { 0 } \\ { \vdots } & { \vdots } & { \ddots } & { \vdots } \\ { \gamma _ { t } } & { \gamma _ { t } } & { \cdots } & { 1 } \end{array} \right] ,
$$

where $\gamma_t$ indicates the probability that a token is absorbed into the MASK token, and $1 - \gamma_t$ is the probability that a token stays unchanged.

For $\mathbf{Q}_t^{\mathrm{spec}}$, as special tokens describe the structure of the layout sequence, any transition between them will lead to an invalid layout sequence. Therefore, we choose to disable any transition between them,

$$
\mathbf { Q } _ { t } ^ { \mathrm { s p e c } } = \mathbf { I } ,
$$

where $\mathbf{I}$ is an identity matrix.
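The three blocks can be assembled as follows. This is a sketch under assumptions: toy sizes ($K = 8$, $C = 3$, three special tokens modeled as an identity block) and a row-stochastic convention, i.e., each row holds the distribution of the next state, which is the transpose of the absorbing matrix as printed above.

```python
import numpy as np

def gaussian_block(K, beta_t):
    """Discretized-Gaussian transitions over K ordinal coordinate tokens."""
    n = np.arange(-(K - 1), K)
    Z = np.exp(-4.0 * n**2 / ((K - 1) ** 2 * beta_t)).sum()  # shared normalizer
    i, j = np.meshgrid(np.arange(K), np.arange(K), indexing="ij")
    Q = np.exp(-4.0 * (i - j) ** 2 / ((K - 1) ** 2 * beta_t)) / Z
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))  # leftover mass stays in place
    return Q

def absorbing_block(C, gamma_t):
    """C type tokens plus one MASK state; MASK is absorbing."""
    Q = (1 - gamma_t) * np.eye(C + 1)
    Q[:, -1] += gamma_t   # each token falls into MASK with probability gamma_t
    Q[-1, -1] = 1.0       # MASK never leaves
    return Q

def block_diagonal(*blocks):
    V = sum(b.shape[0] for b in blocks)
    Q = np.zeros((V, V))
    off = 0
    for b in blocks:
        k = b.shape[0]
        Q[off:off + k, off:off + k] = b
        off += k
    return Q

K, C, beta_t, gamma_t = 8, 3, 0.5, 0.2
Q_t = block_diagonal(gaussian_block(K, beta_t),
                     absorbing_block(C, gamma_t),
                     np.eye(3))  # special tokens never transition
```

The zero off-diagonal blocks are exactly the legality constraint: no probability mass ever crosses between coordinate, type, and special tokens.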
Noise Schedules. An early absorbing of type tokens (i.e., transitioning to the MASK token) would bring an abrupt change to the layout. Hence, to further alleviate type disruption, we make the element types begin to change only in the late stage of the forward process. Specifically, we design $\begin{array} { r } { \overline { { \gamma } } _ { t } = 1 - \prod _ { i = 1 } ^ { t } ( 1 - \gamma _ { i } ) } \end{array}$, the cumulative absorbing probability in $q(x_t | x_0)$, as a piece-wise linear function,
$$
\overline { { \gamma } } _ { t } = \left\{ \begin{array} { l l } { 0 , } & { t < \tilde { T } } \\ { ( t - \tilde { T } ) / ( T - \tilde { T } ) , } & { t \geq \tilde { T } } \end{array} \right.
$$

Here, $\tilde{T}$ is the timestep at which the absorbing is enabled, and $T$ is the terminal timestep.
Besides, although existing work often uses a linear schedule for the Gaussian transition process, we choose to use $\beta_t = g / (T - t + \epsilon)^h$ for the transition of coordinate tokens $\mathbf{Q}_t^{\mathrm{coord}}$. Here, $g$ and $h$ are hyper-parameters, and $\epsilon$ denotes a small positive quantity. It generalizes the commonly used noise schedule $1 / (T - t + 1)$ [2,37]. We find that with $h > 1$, it achieves a slower and smoother corruption of the layout in the early forward process, which helps the model in the reverse process better learn the posterior distribution.
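Both schedules are easy to state in code. The concrete values of $\tilde{T}$, $g$, $h$ and $\epsilon$ below are assumptions for the sketch, not the paper's tuned hyper-parameters.

```python
T, T_tilde = 200, 150      # total steps; absorbing enabled from T_tilde (assumed)
g, h, eps = 0.5, 2.0, 1.0  # assumed hyper-parameters for the coordinate schedule

def gamma_bar(t):
    """Piece-wise linear cumulative absorbing probability for type tokens."""
    return 0.0 if t < T_tilde else (t - T_tilde) / (T - T_tilde)

def gamma(t):
    """Per-step probability recovered from 1 - gamma_bar(t) = prod_i (1 - gamma_i)."""
    keep_prev = 1.0 - gamma_bar(t - 1)
    return 1.0 if keep_prev == 0.0 else 1.0 - (1.0 - gamma_bar(t)) / keep_prev

def beta(t):
    """Generalized schedule for the discretized-Gaussian coordinate block."""
    return g / (T - t + eps) ** h
```

Before $\tilde{T}$ no type token is masked at all, while `beta(t)` stays small early on and only grows sharply near $T$, which is the "slower and smoother early corruption" effect of $h > 1$.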
# 4.2. Reverse Process
To reverse the forward process, we optimize the generative model $p _ { \theta } ( \mathbf { x } _ { 0 } )$ to fit the data distribution $q ( \mathbf { x } _ { 0 } )$ by minimizing the variational lower bound (VLB) [2],
$$
\begin{array} { r } { \mathcal { L } _ { \mathrm { V L B } } = - \log p _ { \theta } ( \mathbf { x } _ { 0 } | \mathbf { x } _ { 1 } ) + D _ { \mathrm { K L } } ( q ( \mathbf { x } _ { T } | \mathbf { x } _ { 0 } ) \| p ( \mathbf { x } _ { T } ) ) } \\ { + \displaystyle \sum _ { t = 2 } ^ { T } D _ { \mathrm { K L } } ( q ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) \| p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } ) ) . } \end{array}
$$

Following recent work [2,10], we predict $p _ { \theta } ( \mathbf { x } _ { 0 } | \mathbf { x } _ { t } )$ instead of $p _ { \theta } ( \mathbf { x } _ { t - 1 } | \mathbf { x } _ { t } )$ ,and encourage good predictions of $\mathbf { x } _ { \mathrm { 0 } }$ at each step by combining ${ \mathcal { L } } _ { \mathrm { V L B } }$ with an auxiliary objective,
$$
\begin{array} { r } { \mathcal { L } = \mathcal { L } _ { \mathrm { V L B } } - \lambda \log p _ { \theta } \big ( \mathbf { x } _ { 0 } \big | \mathbf { x } _ { t } \big ) . } \end{array}
$$
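For one-hot $\mathbf{x}_0$, the auxiliary term $-\lambda \log p_\theta(\mathbf{x}_0 | \mathbf{x}_t)$ is a cross-entropy on the model's $\mathbf{x}_0$ prediction. A minimal sketch with assumed shapes (the function name and toy logits are illustrative, not the authors' code):

```python
import numpy as np

def aux_loss(logits, x0_ids, lam=1e-4):
    """-lam * log p_theta(x0 | xt), i.e., cross-entropy on the x0 prediction.

    logits: (M, V) per-token scores for x0; x0_ids: (M,) ground-truth token ids.
    """
    # log-softmax over the vocabulary dimension
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -lam * logp[np.arange(len(x0_ids)), x0_ids].mean()

# a model confident in the right tokens incurs near-zero auxiliary loss
logits = np.zeros((4, 6))
logits[np.arange(4), [0, 1, 2, 3]] = 8.0
print(aux_loss(logits, np.array([0, 1, 2, 3])))
```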

Specifically, we leverage a Transformer encoder [40] to learn $p_\theta(\mathbf{x}_0 | \mathbf{x}_t)$. Denote the embedding of the $i$-th token in the layout sequence $\mathbf{x}_t$ as $\mathrm{emb}(x_{t,i})$ and its positional embedding as $p_i$. Denote the embedding of the timestep $t$ as $\mathrm{emb}(t)$. Then, the Transformer takes their aggregation, i.e., $\{ \mathrm{emb}(x_{t,i}) + p_i + \mathrm{emb}(t) \}_{i=1}^{M}$, as the input and predicts a new layout sequence $\tilde{\mathbf{x}}_0 = \{ \tilde{x}_{0,i} \}_{i=1}^{M}$ as the output.
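The input aggregation can be sketched as below. All sizes (vocabulary, sequence length, embedding dimension) are assumptions; random matrices stand in for the learned embedding tables.

```python
import numpy as np

rng = np.random.default_rng(0)
V, M, d, T = 140, 12, 128, 200   # assumed vocab size, sequence length, dim, steps
tok_emb = rng.normal(size=(V, d))      # token embedding table
pos_emb = rng.normal(size=(M, d))      # positional embedding table
time_emb = rng.normal(size=(T + 1, d)) # timestep embedding table

def model_input(x_t_ids, t):
    # every position receives the same timestep embedding emb(t) added on
    return tok_emb[x_t_ids] + pos_emb[np.arange(len(x_t_ids))] + time_emb[t]

inp = model_input(rng.integers(0, V, size=M), t=42)
```

Adding $\mathrm{emb}(t)$ to every position is what lets a single shared network behave differently at different noise levels.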
In practice, we set $N$ as the maximum number of elements. During inference, we first sample an element count $n$ from the training set's prior distribution. To construct $\mathbf{x}_T$, we assign MASK tokens for the types and random coordinate tokens for the bounding boxes of the first $n$ elements. For the remaining $(N - n)$ elements, PAD tokens are used to ensure a consistent length. By performing denoising from timestep $T$ to 0, we derive the layout $\mathbf{x}_0$.
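The construction of $\mathbf{x}_T$ described above can be sketched as follows; the uniform element-count prior and the string token spellings are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 128, 20                 # coordinate bins and the maximum element count
MASK, PAD = "MASK", "PAD"

def init_xT(count_prior):
    """Sample n, fill n elements with MASK type + random coords, pad the rest."""
    n = int(rng.choice(len(count_prior), p=count_prior)) + 1
    seq = []
    for i in range(N):
        if i < n:
            seq += [MASK] + [int(v) for v in rng.integers(0, K, size=4)]
        else:
            seq += [PAD] * 5   # keep every sequence at a consistent length
    return n, seq

prior = np.ones(N) / N         # a uniform prior, just for the sketch
n, xT = init_xT(prior)
```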
# 4.3. Enabling Conditional Layout Generation in a Plug-and-Play Manner
Although LayoutDiffusion is trained for unconditional layout generation, it can handle some conditional generation tasks without re-training, which has never been explored by previous work. Such a plug-and-play feature of conditional generation is enabled by the design of the transition matrices and noise schedules. In the following, we introduce how LayoutDiffusion achieves it.
Refinement is a user-oriented layout generation task first posed in RUITE [32], and recently studied by LayoutFormer++ [16]. Its goal is to take a user-given flawed layout as input and provide a high-quality layout for the user while maintaining the original design style. With the proposed transition matrices and noise schedules, a layout is gradually corrupted in the forward process. The reverse process learned by LayoutDiffusion therefore iteratively improves a layout, which naturally enables refinement. Specifically, in LayoutDiffusion, we achieve refinement by feeding the flawed layout into the model and then running the reverse process from a certain timestep. Here, the timestep is related to how noisy the input layout is.
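The plug-and-play refinement loop amounts to treating the flawed layout as an intermediate latent $\mathbf{x}_{t^*}$ and running only the last $t^*$ reverse steps. `denoise_step` below stands in for sampling from the learned $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ and is an assumption of this sketch.

```python
def refine(flawed_layout, denoise_step, t_star=50):
    """Run only the last t_star reverse steps, starting from the user's layout."""
    x = flawed_layout
    for t in range(t_star, 0, -1):
        # denoise_step stands in for sampling from p_theta(x_{t-1} | x_t)
        x = denoise_step(x, t)
    return x
```

A noisier input calls for a larger `t_star` (more reverse steps); a nearly clean input needs only a few.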
Generation Conditioned on Types (Gen-Type) is also a widely studied conditional layout generation task [16,20,22] to satisfy the needs of users. It aims to generate layouts with the given element types. In LayoutDiffusion, there is no transition between coordinates and types (see Eq. (5)). Besides, with the noise schedule in Eq. (9), the change of the types only occurs in the late forward process. With the above two mechanisms, LayoutDiffusion determines the element types very quickly in the early reverse steps and then continues to improve the coordinates in the remaining reverse steps without changing the types (see the transformation of the layout from right to left in Fig. 2). In other words, the generation of coordinates and that of element types are approximately decoupled. Thus, in LayoutDiffusion, we achieve Gen-Type by feeding in the element types in the early stage and running the reverse process.
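A minimal sketch of the Gen-Type mechanism: clamp the given types after every reverse step so that only the coordinates evolve. `denoise_step`, `init_layout`, and the slot layout (a type token at every 5th position, as in Eq. (1)) are assumptions of this sketch, not the released implementation.

```python
def gen_type(types, denoise_step, init_layout, T=160):
    """Keep the given element types fixed; let only the coordinates be resampled."""
    x = init_layout(types)
    for t in range(T, 0, -1):
        x = denoise_step(x, t)   # stands in for sampling from p_theta(x_{t-1} | x_t)
        for i, c in enumerate(types):
            x[5 * i] = c         # clamp the type token of each element
    return x
```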
# 5. Experiments
# 5.1. Setups
Datasets. We employ two widely-used public datasets of graphic layouts. RICO [6] is a dataset of user interface designs for mobile applications, which contains 66K+ UI layouts with 25 element types. PubLayNet [46] includes 360K+ annotated scientific document layouts with 5 element types. Both datasets contain a few over-length entries. We filter out the layouts with more than 20 elements, as in LayoutFormer++ [16]. Then, we split the filtered data into training, validation, and test sets by 90%, 5%, and 5%.
Baselines. First, we compare LayoutDiffusion with leading approaches for layout generation. Specifically, we compare against LayoutTransformer [11], VTN [1], Coarse2Fine [17], and LayoutFormer++ [16] on unconditional generation (UGen); against NDN-none [22], LayoutGAN++ [19], BLT [20], and LayoutFormer++ [16] on generation conditioned on types (Gen-Type); and against RUITE [32] and LayoutFormer++ [16] on refinement. Moreover, we compare LayoutDiffusion with existing diffusion models that do not consider the characteristics of layouts. Diffusion-LM [25] maps discrete data to a continuous state space, while D3PM (uniform) [2] and D3PM (absorbing) [2] perform diffusion in discrete state space using different transition matrices.

Table 1. Quantitative results. Methods with $\diamond$ are diffusion-based, which achieve conditional generation (i.e., Gen-Type and Refinement) in a plug-and-play manner, while the other methods require re-training for each subtask. The best and the second best values of each metric are bold and underlined, respectively. For mIoU, the higher the score, the better the performance (indicated by $\uparrow$). For Overlap and Align., the closer to real data, the better (indicated by $\rightarrow$). For FID, the lower the score, the better the performance (indicated by $\downarrow$).
Implementation Details. We set the weight of the auxiliary loss as $\lambda = 0.0001$ (see Eq. (11)). In training, we set the timestep as $T = 200$; in inference, we set the timesteps $T_{\mathrm{UGen}} = 200$, $T_{\mathrm{Gen\text{-}Type}} = 160$, and $T_{\mathrm{Refine}} = 50$ for the different generation tasks. For the layout sequence (see Eq. (1)), we arrange elements in the alphabetical order of the type, and each token is embedded into $d = 128$ dimensions. For the denoising network $p_\theta(\mathbf{x}_0 | \mathbf{x}_t)$, we apply a 12-layer Transformer encoder with 12 attention heads. We train the model using the AdamW optimizer [26] with 2~4 NVIDIA V100 GPUs. We also employ importance timestep sampling [29] during training. See the Supplemental Material for more details.
Evaluation Metrics. We adopt four metrics to measure the performance comprehensively. Among them, Fréchet Inception Distance (FID) measures the overall performance, while Maximum Intersection over Union (mIoU), Alignment (Align.) and Overlap measure the quality from a specific aspect. Specifically, FID computes the distance between the distribution of the generated layouts and that of real layouts. Following previous practice [19,22], we train a classification-based neural network to obtain the feature embedding for the layout. mIoU calculates the maximum IoU between bounding boxes of the generated layouts and those of the real layouts with the same type set [19]. Align. measures whether the elements in a generated layout are well-aligned, either by center or by edges. In addition to the original implementation [22], a normalization over the number of elements is applied. Overlap measures the overlapping area between elements in the generated layout. Following LayoutFormer++ [16], we ignore normal overlaps, e.g., elements on top of the background.
# 5.2. Comparison with Existing Approaches for Layout Generation
Quantitative Analysis. In Tab.1, the methods without the symbol $\diamond$ are existing approaches for layout generation.
First, we compare FID, as it is an overall metric for generation performance. LayoutDiffusion achieves significantly better FID scores than all other methods. For example, on Un-Gen, LayoutDiffusion achieves 2.490 and 8.625 on the RICO and PubLayNet datasets, respectively, while the best existing work only achieves 20.198 and 30.048.
Furthermore, we examine the individual metrics, including mIoU, Overlap, and Align., each of which measures the quality from a specific aspect. LayoutDiffusion is in the top two on almost every metric and frequently achieves the best performance. On the contrary, existing approaches may perform well on a certain metric but fail on the other individual metrics and the overall metric (i.e., FID), indicating that LayoutDiffusion is a well-rounded approach. For example, on Un-Gen, LayoutTransformer has a good Overlap score, but it does not perform well on Align., mIoU and FID; LayoutFormer++ has the best Align. for PubLayNet, but underperforms on mIoU, Overlap and FID.

Figure 3. Qualitative comparison against the strongest baselines selected by FID (better viewed in color and with 2× zoom). The first three rows are for RICO and the last three are for PubLayNet. LayoutDiffusion generates high-quality and diverse layouts. Layouts from LayoutFormer++ either lack diversity (Un-Gen) or are flawed (Gen-Type and Refinement). Layouts from other methods misalign and overlap frequently.
Moreover, on conditional generation tasks (i.e., Gen-Type and Refinement), the above observations still hold, even though LayoutDiffusion is not re-trained on these tasks (see Sec. 4.3) while existing approaches are re-trained.

Qualitative Analysis. Fig. 3 shows qualitative results. On Un-Gen, LayoutDiffusion generates diverse and high-quality layouts. In contrast, LayoutTransformer mainly suffers from incorrect spacing and overlap, and LayoutFormer++ is deficient in diversity. For example, for LayoutFormer++, most layouts on RICO contain a top toolbar and several list items, and most layouts on PubLayNet are double-columned and have many texts. Besides, on Gen-Type and Refinement, LayoutDiffusion outperforms other methods (e.g., in alignment, overlap and spacing) while it is not re-trained. For more qualitative results, please refer to Supplemental Sec. E.
User Study. On each task, we select the best two baselines by FID for the user study. We design two kinds of evaluation. One is quality evaluation. We show three layouts from three models respectively (two from baselines and one from LayoutDiffusion) and invite the user to choose which one has the best quality (e.g., a more plausible overall structure and pleasing details). The other is diversity evaluation. We show three sets of layouts from three models respectively, where each set contains five layouts from the same model. Then, we invite the user to choose which set has the most diverse layouts. For Refinement, we do not conduct diversity evaluation as it is not necessary for this scenario.
Fig. 4 shows the results. Across different datasets, tasks and evaluation modes, there are 10 groups of user studies in total; for each group, we invite 15 people and everyone labels 50 groups of layouts. The user study shows that LayoutDiffusion outperforms other methods significantly.
# 5.3. Comparison with Traditional Diffusion Models
Quantitative Analysis. In Tab. 1, the methods marked with $\diamond$ are traditional diffusion models. They are originally proposed for other generation tasks (e.g., image and text), and we adapt them for layout generation. On Un-Gen, LayoutDiffusion achieves the best performance on most metrics. On Gen-Type and Refinement, while traditional diffusion models can be used for conditional generation tasks without re-training, their performance is usually worse than existing methods for layout generation (e.g., LayoutFormer++), not to mention LayoutDiffusion. These observations demonstrate that our consideration of the heterogeneous nature of layout data is critical for both achieving good performance and realizing plug-and-play conditional generation.

Figure 4. Results of the user study. For each model, we count how many people prefer the layouts generated from this model. The study shows that the results generated by LayoutDiffusion were favored by users over the other methods, particularly in terms of diversity.

Figure 5. Reverse denoising process for unconditional generation on RICO (from left to right). Each row is for one model. A blank page is used when the generated layout sequence is invalid.
Qualitative Analysis. Fig. 3 shows a qualitative comparison with the best traditional diffusion model (selected by FID). LayoutDiffusion consistently generates better layouts, e.g., with better alignment and less overlap. Moreover, Fig. 5 compares the reverse denoising processes of different diffusion models. LayoutDiffusion quickly generates a draft layout and then gradually refines it into a pleasing layout, while other diffusion models take many steps to generate a rough layout and leave fewer steps for iterative refinement, which may limit the modeling of precise relationships between elements, such as strict alignment and no overlap.
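The reverse processes compared in Fig. 5 share the same outer structure; the sketch below shows a generic, model-agnostic discrete denoising loop. The `model` callable and the `MASK` placeholder id are our own illustrative choices, not the paper's sampler.

```python
MASK = -1  # placeholder id for a fully-corrupted token

def reverse_process(seq_len, T, model):
    # Generic discrete reverse denoising loop (a sketch, not the paper's
    # exact sampler): start from fully corrupted tokens and let a denoiser
    # model(tokens, t) propose a cleaner sequence at each step t = T..1.
    tokens = [MASK] * seq_len
    for t in range(T, 0, -1):
        tokens = model(tokens, t)
    return tokens
```

The difference between the compared models lies entirely inside `model`: LayoutDiffusion's transitions let early steps already produce a legal draft, whereas uniform/absorbing baselines spend most steps on still-chaotic sequences.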
# 5.4. Ablation Studies and Discussions
Transition Matrices. Our transition matrices are designed by considering three critical factors, i.e., legality, coordinate proximity and type disruption (see Sec. 4.1). We remove the technique corresponding to each factor. First, without considering the legality, the techniques for the other two factors cannot be applied. Thus, LayoutDiffusion degrades to D3PM (absorbing or uniform). Tab. 1 shows that LayoutDiffusion achieves better performance. Second, to ignore coordinate proximity, we use uniform or absorbing transitions for coordinate tokens, denoted as Uniform $\mathbf{Q}_{t}^{\mathrm{coord}}$ and Absorbing $\mathbf{Q}_{t}^{\mathrm{coord}}$ in Tab. 2. LayoutDiffusion outperforms these two variations on most metrics, especially Overlap and FID. Third, to study type disruption, we use uniform transition for type tokens, denoted as Uniform $\mathbf{Q}_{t}^{\mathrm{type}}$ in Tab. 2. LayoutDiffusion outperforms this variation, where the improvement of mIoU is most significant.

Table 2. Ablation studies on RICO with unconditional generation.

Table 3. Ablation study on timesteps for diffusion models. All experiments are on RICO with unconditional generation. The training and inference steps are set as the same. 2000 and 1000 are the default settings used in Diffusion-LM [25] and D3PMs [2].
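The uniform and absorbing baselines in the transition-matrix ablation above follow the standard D3PM forms; a minimal sketch of both is below. The function names are ours, and the paper's block-wise matrix additionally combines such blocks per token type with coordinate-proximity weighting, which this sketch does not model.

```python
def uniform_transition(K, beta):
    # D3PM-style uniform transition over K states: a token keeps its value
    # with prob 1 - beta and otherwise moves uniformly to any state.
    return [[1 - beta + beta / K if i == j else beta / K for j in range(K)]
            for i in range(K)]

def absorbing_transition(K, beta, mask_id):
    # D3PM-style absorbing transition: a token keeps its value with
    # prob 1 - beta and otherwise jumps to [MASK]; [MASK] never leaves.
    Q = [[0.0] * K for _ in range(K)]
    for i in range(K):
        if i == mask_id:
            Q[i][i] = 1.0
        else:
            Q[i][i] = 1 - beta
            Q[i][mask_id] = beta
    return Q
```

Each row of either matrix is a valid distribution over next states, which is the property any block-wise combination of such transitions must preserve.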
Noise Schedules. To make the corruption of element types occur throughout the forward process, we set $\tilde{T}$ as 0 for $\overline{\gamma}_{t}$ (see Eq. (9)), which results in a linear schedule (denoted as Linear $\overline{\gamma}_{t}$ in Tab. 2). We also replace the noise schedule for $\beta_{t}$ with the original linear schedule from previous work [2] (denoted as Linear $\beta_{t}$ in Tab. 2). LayoutDiffusion outperforms both of them on most metrics, especially mIoU and Align.
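The delayed-onset idea behind the piece-wise linear schedule can be sketched as follows. This is a hypothetical parameterization for illustration only; the paper's exact form is defined by Eq. (9), and the function name, the `end` value, and the linear ramp are our assumptions.

```python
def piecewise_linear_schedule(T, T_tilde, end=0.1):
    # Hypothetical piece-wise linear schedule: the corruption probability
    # stays at zero through step T_tilde, then grows linearly to `end`
    # at step T. Setting T_tilde = 0 degenerates to a plain linear
    # schedule, as in the "Linear" ablation variant.
    return [0.0 if t <= T_tilde else end * (t - T_tilde) / (T - T_tilde)
            for t in range(1, T + 1)]
```

Delaying the onset keeps element types intact during early forward steps, so type disruption is concentrated late in the forward process (and thus early in the reverse process).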
Timesteps. Tab. 3 shows the FID of diffusion models with different timesteps. While 200 steps are enough for LayoutDiffusion, the performance of Diffusion-LM, D3PM (absorbing), and D3PM (uniform) saturates at 500, 1000, and 500 timesteps, respectively. Besides, even the fastest version of LayoutDiffusion surpasses all the other diffusion models.
Additional Analyses. For a detailed discussion on the diversity of the generated layouts, see Supplemental Sec. B. Experiments on additional settings of conditional generation are available in Sec. C. Ablation studies concerning conditional generation tasks, noise schedules, and sequence ordering can be found in Sec. D.
# 6. Conclusion
In this work, we propose LayoutDiffusion to improve graphic layout generation with discrete diffusion models. The core of our method lies in realizing a mild forward process by considering the heterogeneous characteristics of the layout. Our method also enables two conditional generation tasks without re-training. Experiments demonstrate the superiority of LayoutDiffusion over leading approaches for layout generation and existing diffusion models. In the future, we plan to incorporate diverse conditions [14,28] into LayoutDiffusion. Besides, we will also explore how to extend LayoutDiffusion to handle other heterogeneous data.
# 7. Acknowledgement
We would like to thank Zhaoyun Jiang for helpful discussions. We also appreciate Zhaoyun Jiang's support in providing the pre-processed datasets and pretrained models for the FID evaluations.
# References
[1]Diego Martin Arroyo,Janis Postels,and Federico Tombari. Variational transformer networks for layout generation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13642-13652,2021.1, 2,3,5
[2] Jacob Austin,Daniel D Johnson,Jonathan Ho,Daniel Tarlow,and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in Neural Information Processing Systems,34:17981-17993,2021. 1, 2, 3,4,5, 6, 8
[3] Arpit Bansal,Eitan Borgnia,Hong-Min Chu,Jie S Li, Hamid Kazemi,Furong Huang,Micah Goldblum,Jonas Geiping,and Tom Goldstein.Cold diffusion:Inverting arbitrary image transforms without noise.arXiv preprint arXiv:2208.09392,2022. 3
[4] Andrew Campbell,Joe Benton,Valentin De Bortoli,Tom Rainforth,George Deligiannidis,and Arnaud Doucet.A continuous time framework for discrete denoising models.arXiv preprint arXiv:2205.14987,2022. 3
[5] Ting Chen,Ruixiang Zhang,and Geoffrey Hinton. Analog bits:Generating discrete data using diffusion models with self-conditioning.arXiv preprint arXiv:2208.04202,2022.3
[6] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845-854, 2017. 5
[7] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112-6121, 2019. 1
[8] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324, 2019. 3
[9] Shansan Gong,Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong.Diffuseq:Sequence to sequence text generation with diffusion models.arXiv preprint arXiv:2210.08933,2022. 3
[10] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696-10706, 2022. 3, 4, 5
[11] Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S Davis, Vijay Mahadevan, and Abhinav Shrivastava. LayoutTransformer: Layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1004-1014, 2021. 1, 2, 3, 5
[12] Jonathan Ho,William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao,Alexey Gritsenko,Diederik P Kingma,Ben Poole,Mohammad Norouzi, David JFleet,et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303,2022.1
[13] Jonathan Ho,Ajay Jain,and Pieter Abbeel.Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems,33:6840-6851,2020. 1
[14] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,2022.9
[15] Emiel Hoogeboom,Didrik Nielsen,Priyank Jaini,Patrick Forre,and Max Welling. Argmax flows and multinomial diffusion:Towards non-autoregressive language models. CoRR,abs/2102.05379,2021. 3
[16] Zhaoyun Jiang,Jiaqi Guo,Shizhao Sun,Huayu Deng, Zhongkai Wu, Vuksan Mijovic, Zijiang James Yang,JianGuang Lou,and Dongmei Zhang. Layoutformer++: Conditional graphic layout generation via constraint serialization and decoding space restriction.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18403-18412,2023. 1,2,5,6
[17] Zhaoyun Jiang, Shizhao Sun, Jihua Zhu, Jian-Guang Lou, and Dongmei Zhang. Coarse-to-fine generative modeling for graphic layouts.Proceedings of the AAAI Conference on Artificial Intelligence,36(1):1096-1103, Jun. 2022. 1,2,3,5
[18] Akash Abdu Jyothi,Thibaut Durand, Jiawei He,Leonid Sigal, and Greg Mori. Layoutvae: Stochastic scene layout generation from a label set.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9895- 9904,2019.2
[19] Kotaro Kikuchi, Edgar Simo-Serra,Mayu Otani,and Kota Yamaguchi. Constrained graphic layout generation via latent optimization. In Proceedings of the 29th ACM International Conference on Multimedia, pages 88-96,2021. 2,5, 6
[20] Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, and Irfan Essa. BLT: Bidirectional layout transformer for controllable layout generation. arXiv preprint arXiv:2112.05112, 2021. 1, 2, 3, 5
[21] Zhifeng Kong,Wei Ping,Jiaji Huang,Kexin Zhao,and Bryan Catanzaro.Diffwave:A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761,2020.1
[22] Hsin-Ying Lee,Lu Jiang, Irfan Essa, Phuong B Le, Haifeng Gong,Ming-Hsuan Yang,and Weilong Yang. Neural design network: Graphic layout generation with constraints.In European Conference on Computer Vision,pages 491-506. Springer,2020. 1,3,5,6
[23] Jianan Li, Jimei Yang,Aaron Hertzmann, Jianming Zhang, and Tingfa Xu. Layoutgan: Generating graphic layouts with wireframe discriminators. In International Conference on Learning Representations,2018.2
[24] Jianan Li, Jimei Yang, Jianming Zhang, Chang Liu, Christina Wang, and Tingfa Xu. Attribute-conditioned layout GAN for automatic graphic design. IEEE Transactions on Visualization and Computer Graphics, 27(10):4039-4048, 2020. 3
[25] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang,and Tatsunori B Hashimoto.Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217,2022.3,6,8
[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,2017. 6
[27] DavidD Nguyen, Surya Nepal,and Salil S Kanhere.Diverse multimedia layout generation with multi choice learning. In Proceedings of the 29th ACM International Conference on Multimedia, pages 218-226,2021. 3
[28] Alex Nichol,Prafulla Dhariwal,Aditya Ramesh,Pranav Shyam,Pamela Mishkin,Bob McGrew,Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741,2021.9
[29] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162-8171. PMLR, 2021. 2, 6
[30] Peter O'Donovan, Aseem Agarwala,and Aaron Hertzmann. Designscape: Design with interactive layout suggestions. In Proceedings of the 33rd annual ACM conference on human factors in computing systems, pages 1221-1224, 2015. 2
[31] Peter O'Donovan, Aseem Agarwala,and Aaron Hertzmann. Learning layouts for single-pagegraphic designs. IEEE transactions on visualization and computer graphics, 20(8):1200-1213,2014.2
[32] Soliha Rahman, Vinoth Pandian Sermuga Pandian, and Matthias Jarke. RUITE: Refining UI layout aesthetics using transformer encoder. In 26th International Conference on Intelligent User Interfaces - Companion, pages 81-83, 2021. 2, 3, 5, 6
[33] Aditya Ramesh,Prafulla Dhariwal,Alex Nichol, Casey Chu, and Mark Chen.Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022. 1
[34] Chitwan Saharia,William Chan,Huiwen Chang,Chris Lee, Jonathan Ho,Tim Salimans,David Fleet,and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1- 10,2022. 1
[35] Chitwan Saharia,William Chan,Saurabh Saxena,Lala Li,Jay Whang,Emily Denton,Seyed Kamyar Seyed Ghasemipour,Burcu Karagol Ayan,S Sara Mahdavi, Rapha Gontijo Lopes,et al. Photorealistic text-to-image diffusion models with deep language understanding.arXiv preprint arXiv:2205.11487,2022. 1
[36] Uriel Singer, Adam Polyak,Thomas Hayes, Xi Yin, Jie An, Songyang Zhang,Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni,et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.1
[37] Jascha Sohl-Dickstein, Eric Weiss,Niru Maheswaranathan, and Surya Ganguli.Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265.PMLR,2015. 1,5
[38] Yang Song,Jascha Sohl-Dickstein,Diederik P Kingma, Abhishek Kumar, Stefano Ermon,and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,2020. 1
[39] Zhicong Tang, Shuyang Gu, Jianmin Bao, Dong Chen,and Fang Wen.Improved vector quantized diffusion models. arXiv preprint arXiv:2205.16007,2022. 3
[40] Ashish Vaswani, Noam Shazeer,Niki Parmar, Jakob Uszkoreit,Llion Jones,Aidan N Gomez,Lukasz Kaiser,and Illia Polosukhin.Attention is all you need.Advances in neural information processing systems,30,2017. 1,2,5
[41] Pan Xie, Qipeng Zhang, Zexian Li, Hao Tang,Yao Du, and Xiaohui Hu. Vector quantized diffusion model with codeunet for text-to-sign pose sequences generation. arXiv preprint arXiv:2208.09141,2022.3
[42] Kota Yamaguchi. Canvasvae:Learning to generate vector graphic documents. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5481-5489, 2021.2,3
[43] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong,Runsheng Xu, Yue Zhao,Yingxia Shao,Wentao Zhang,Bin Cui,and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. arXiv preprint arXiv:2209.00796,2022. 3
[44] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978, 2022. 1
[45] Xinru Zheng,Xiaotian Qiao,Ying Cao,and Rynson WH Lau. Content-aware generative modeling of graphic design layouts.ACM Transactions on Graphics (TOG),38(4):1-15, 2019.3
[46] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015-1022. IEEE, 2019. 5
[47] Min Zhou,Chenchen Xu,Ye Ma, Tiezheng Ge,Yuning Jiang,and Weiwei Xu. Composition-aware graphic layout gan for visual-textual presentation designs.arXiv preprint arXiv:2205.00303,2022.3