Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting

Source: https://arxiv.org/html/2508.01098

Yuekun Dai Haitian Li Shangchen Zhou Chen Change Loy

S-Lab, Nanyang Technological University

{ydai005, liha0032, s200094, ccloy}@ntu.edu.sg

https://ykdai.github.io/projects/trans-adapter

Abstract

RGBA images, with the additional alpha channel, are crucial for any application that needs blending, masking, or transparency effects, making them more versatile than standard RGB images. Nevertheless, existing image inpainting methods are designed exclusively for RGB images. Conventional approaches to transparent image inpainting typically involve placing a background underneath RGBA images and employing a two-stage process: image inpainting followed by image matting. This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and can be seamlessly integrated into various community models. To evaluate our method, we introduce LayerBench, along with a novel non-reference alpha edge quality evaluation metric for assessing transparency edge quality. We conduct extensive experiments on LayerBench to demonstrate the effectiveness of our approach.

1 Introduction


Figure 1: We introduce the first transparent image inpainting method. Given a transparent image, our approach can generate RGBA content within the masked region (shown in black) based on a provided text prompt.

Transparent images, typically represented as RGBA images, are widely used in film, animation, and game production. These images are layered using various compositing techniques to produce the final scene. To support this workflow, most commercial image and video editing software, such as Adobe Photoshop, Adobe After Effects, Live2D, and DragonBones, provide advanced layer management and image editing tools. These tools enable artists to manipulate individual layers with precision using brushes, pencils, and erasers, ensuring seamless integration of elements within a scene. Despite the demand for editing transparent layers, AI-powered inpainting tools like generative fill remain limited to RGB images. This limitation forces artists to manually refine or restore incomplete or unsatisfactory transparent images, often requiring tedious hand-painting to maintain visual coherence.

Image inpainting has seen remarkable improvements with the advent of text-to-image (T2I) diffusion models. These image inpainting models are typically fine-tuned from foundation generative models[34, 26, 47]. Many studies[13, 5, 23, 56, 52] have further enhanced these models by introducing flexible user control through semantic and structural conditions. However, these methods are designed specifically for RGB images. While some models can inpaint the alpha and RGB channels of transparent images separately, inconsistencies between the generated content in these channels often lead to misalignment, resulting in unnatural inpainting artifacts.

In this study, we extend the functionality of inpainting to transparent images by proposing Trans-Adapter, a plug-and-play adapter that learns to inpaint aligned RGB and alpha content. Trans-Adapter enables transparent image editing without compromising inpainting performance, allowing seamless integration with existing diffusion-based inpainting models. We decompose the RGBA image into separate RGB and alpha channels, treating them as a two-frame video. We then inflate a T2I model to process this “video” by incorporating a spatial alignment module and cross-domain self-attention, ensuring alignment and geometric consistency between the RGB and alpha channels. With this design, our method effectively preserves structural coherence across transparent regions, enabling seamless and high-quality inpainting results (Fig.1). To train the adapter, we collect a new dataset of 35K high-quality transparent images from online PNG stock sources. We merge this dataset with a subset of MAGICK[5], a generated dataset of objects with high-quality alpha mattes.

To evaluate the effectiveness of our approach comprehensively, we introduce LayerBench, a new benchmark consisting of 800 transparent images. This dataset includes 400 natural images sourced from online PNG stocks and the test images of previous image matting datasets[45], while the remaining 400 generated images are carefully selected from LayerDiffusion[51] and MAGICK[5] to ensure diversity. LayerBench differs significantly from existing RGB image inpainting benchmarks, such as EditBench[42] and BrushBench[13], which do not involve alpha channels. LayerBench is specifically constructed to evaluate the quality of transparent image inpainting, with a focus on RGB-alpha alignment. We ensure that a significant portion of the inpainting masks overlap with the boundaries of the alpha channel, where transparency transitions occur. This design allows for a more rigorous assessment of how well methods preserve edge consistency between RGB and alpha channels.

When the RGB channels of a transparent image are not perfectly aligned with its alpha map, jagged or rough edges may emerge, an artifact commonly seen in the outputs of image segmentation and matting. These imperfections can degrade visual quality, especially when blending the transparent image with other elements in a composition. To assess the misalignment between the RGB and alpha channels, we propose a new non-reference metric for evaluating alpha edge quality. This metric can be computed with our specially trained Alpha Edge Quality (AEQ) classifier.

Our contributions are as follows: 1) We propose a plug-and-play framework for transparent image inpainting, marking the first dedicated approach for this task. 2) We present LayerBench, a carefully curated benchmark specifically designed to evaluate transparent image inpainting methods. 3) To complement LayerBench, we propose a new non-reference metric that quantifies the alignment between the RGB and alpha channels, providing a reliable measure of RGB-alpha consistency. We conduct extensive experiments comparing our method against baseline techniques, including image inpainting combined with image matting[14] and dichotomous image segmentation methods[55, 29]. Experimental results demonstrate the superior performance of our approach in generating high-quality detail and RGB-alpha aligned content.

2 Related Work

Transparent Image and Video Generation. RGBA-related generation can be categorized into single-layer generation, multi-layered generation, and transparent video generation. For single-layer generation, Text2Live[3] generates transparent edit layers by leveraging CLIP[32] to construct an RGB editing loss. LayerDiffuse[51] introduces latent transparency for RGBA image generation, encoding the alpha channel into the pretrained Stable Diffusion’s latent space without affecting the decoder’s output, and decodes the alpha channel with a latent transparency decoder. Zippo[43] compresses the RGB and alpha distributions into a single diffusion model, enabling RGB-to-alpha, alpha-to-RGB, and text-to-RGBA generation. Alfie[31] and DiffMatting[11] generate images with a pure background first, then extract the alpha channel using GrabCut and image matting, respectively. For multi-layered generation, Text2Layer[54] formulates a two-layer generation task and can generate transparent foreground and background at the same time. LayerDiff[12] and ART[27] can generate multi-layered images in one diffusion denoising process. TransPixar[41] proposes the only transparent video generation method using sequence extension for the alpha channel in DiT[25]’s self-attention. While previous research has explored transparent image generation, our work extends inpainting capabilities to transparent images, introducing Trans-Adapter as a novel plug-and-play solution for diffusion-based inpainting.


Figure 2: Sample images and their corresponding text prompts (mask-simple) from the proposed LayerBench benchmark. Our benchmark includes many challenging cases, such as images with complex alpha channel structures, transparent objects, and diverse artistic styles. Images shown in (b) are sourced from MAGICK[5] and DIM[45] to complement the real portion (a) of our benchmark.

Image Inpainting. Image inpainting has been extensively studied in the past, focusing on the exploration of patch-matching techniques with different architectural choices [39, 4, 6] and GAN-based approaches[24, 48, 33, 46, 49, 50]. Recent methods typically begin with pretrained diffusion models to leverage the exceptional generative capabilities. RePaint[22] introduces a training-free inpainting method based on DDPM[9], where unmasked regions are sampled and blended with the denoised masked region at each step of the reverse diffusion process. Blended Latent Diffusion[1, 2] extends this strategy to the latent space. For finetuning-based approaches, Imagen Editor[42] is finetuned from Imagen[36]. SmartBrush[44] is built on Stable Diffusion, incorporating mask boundary information to guide the denoising process for more intuitive editing. Stable-Diffusion Inpainting[34] is the official finetuned version of Stable Diffusion, modifying the UNet[35] by increasing its input channels to encode both the masked image and the alpha channel. HD-Painter[23] and PowerPaint[56] further refine Stable-Diffusion Inpainting, improving generation quality and supporting multiple tasks independently. ControlNet[52] enhances diffusion models by incorporating various conditioning signals, including inpainting. Similarly, BrushNet[13] and BrushEdit[17] introduce a plug-and-play branch that integrates with fine-tuned community models. Building on BrushNet and ControlNet, MagicQuill[21] enables intuitive content alterations by combining automatically estimated text prompts with user-drawn strokes. While these methods offer powerful editing capabilities, they primarily operate on standard RGB images and lack support for RGBA editing, limiting their applicability in workflows requiring transparency handling.

Image Matting and Dichotomous Image Segmentation. Image matting and dichotomous image segmentation both aim to extract foreground objects with fine edge details. Building on the success of SAM[15] and its large-scale training, several recent works have explored its applicability to high-precision matting. MattingAnything[16] follows SAM's framework, supporting various prompt types, including box, point, and text, to enable more flexible matting. ZIM[14] enhances fine-grained image matting in a zero-shot setting, improving mask precision through a dataset built from large-scale segmentation-to-matting conversion. The dichotomous image segmentation methods BiRefNet[55] and U²-Net[29] also focus on refining segmentation boundaries, contributing to high-quality foreground extraction. While matting and dichotomous segmentation focus on extracting precise foreground masks, our proposed method targets transparent image inpainting, ensuring transparency-aware restoration with aligned RGB and alpha content, which is a more challenging problem.

3 LayerBench Benchmark

In this study, we introduce LayerBench, the first high-quality benchmark designed specifically for evaluating transparent image inpainting. To further enhance the evaluation framework, we propose a non-reference metric for evaluating alpha edge quality. We first review some existing transparent image datasets below before introducing our benchmark.

Existing Transparent Image Datasets. Existing transparent image datasets remain limited in size and quality when compared to large-scale datasets like LAION[37] that contain billions of images. Among the available datasets, MAGICK[5] is the only one that balances both scale and quality. It contains 150K transparent images generated using SDXL[26], with chroma-keying applied to extract alpha maps. In contrast, natural transparent images are significantly rarer than their generated counterparts. For instance, DIM[45] contains only 431 foreground transparent images, while the Semantic Matting Dataset[38] extends DIM to 726 images. Distinctions-646[28] consists of 646 foreground images. In the video matting domain, VideoMatte240K[18] provides 240K unique transparent frames extracted from 484 high-resolution videos, with a primary focus on human figures. Recently, MULAN[40] introduces a multi-layer dataset featuring 44K layers derived from matting results of LAION[37] and COCO[19], but the resolution and quality of the foreground images remain relatively low. None of the existing data is specifically designed for high-quality transparent image inpainting evaluation.

3.1 LayerBench

Some examples in LayerBench are shown in Fig.2. To cover different use cases, LayerBench is composed of two subsets: LayerBench-Natural and LayerBench-Generated, each containing 400 images. The LayerBench-Natural subset consists of natural images sourced from online PNG stocks and matting datasets[45]. For the LayerBench-Generated subset, we select 200 images with a high Aesthetic Score[37] from the MAGICK dataset[5] that are reserved exclusively for testing. The remaining 200 images are generated using LayerDiffusion applied to SDXL[26] and the community model RealVisXL-V4.0, with text prompts generated by ChatGPT. To simulate real-world use cases, we manually annotate masks for each image, marking the regions that require inpainting. Our dataset includes all common mask types, such as strokes, regular shapes, regions masked by Bézier curves, and object masks. In addition, similar to EditBench[42], we provide full-text descriptions for the entire image and mask-simple annotations that describe the high-level content within the masked regions. The full-text descriptions are generated using LLaVA[20], following the same procedure as our training set's text prompt labeling. For mask-simple annotations, we manually label the semantic information of the masked regions. All images in LayerBench have a resolution of 1024×1024.


Figure 3: Overview of the Alpha Edge Quality (AEQ) assessment pipeline. The classifier receives an eight-channel input, consisting of the transparent image composited on white and black backgrounds, the alpha channel, and the alpha edge mask. It outputs a probability map that represents low-quality edge regions.

3.2 Alpha Edge Quality Assessment

As discussed in Sec.1, misalignment between the RGB and alpha channels in transparent images can cause jagged edges, degrading visual quality during compositing. To measure the alignment quality of RGB and alpha channels, we propose a new non-reference metric named Alpha Edge Quality (AEQ), which can be conveniently computed using a specially trained classifier.

The classifier is a lightweight Convolutional Neural Network (CNN) that performs binary segmentation, i.e., distinguishing low-quality edge regions from high-quality ones. It takes an eight-channel input formed by the transparent image layered on white and black backgrounds, concatenated with the alpha channel $\alpha$ and the alpha edge mask $\mathcal{M}_{e}$. We use backgrounds of two different colors to better reveal the quality of alpha edges. The AEQ is computed as one minus the average low-quality edge probability within the evaluation mask $\mathcal{M}_{e}$, which surrounds the edges of the alpha map. Specifically, the AEQ is defined as:

$$\mathrm{AEQ}=1-\frac{1}{|\mathcal{M}_{e}|}\sum_{(x,y)\in\mathcal{M}_{e}}\mathcal{F}(I_{w},I_{b},\alpha,\mathcal{M}_{e})_{x,y}\quad(1)$$

where $\mathcal{F}$ is the classifier, and the evaluation mask $\mathcal{M}_{e}$ is obtained by applying the Canny edge detector to the alpha map with a threshold of 20, followed by a dilation operation with a kernel size of five pixels. Note that we only consider $\mathcal{M}_{e}$ within the inpainting mask. AEQ ranges between 0 and 1, where higher values indicate better quality. The classifier $\mathcal{F}$ is trained to be robust in discerning various alpha edge qualities; this is achieved by training it on an extensive set of augmented inputs derived from a set of high-quality transparent images. More details about the training of the AEQ classifier are provided in the supplementary material.
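The AEQ computation above can be sketched in NumPy. This is an illustrative version only: `edge_mask` stands in for the paper's Canny-plus-dilation step (a plain gradient-magnitude threshold replaces Canny to keep the sketch dependency-free), and `low_quality_prob` is assumed to be the per-pixel output of the trained classifier $\mathcal{F}$; both helper names are ours.

```python
import numpy as np

def edge_mask(alpha, threshold=20, dilation=5):
    """Approximate the evaluation mask M_e around alpha edges.

    The paper uses a Canny detector (threshold 20) followed by a
    dilation with a 5-pixel kernel; a simple gradient-magnitude
    threshold stands in for Canny in this sketch.
    """
    gy, gx = np.gradient(alpha.astype(np.float32))
    edges = np.hypot(gx, gy) > threshold
    # Dilation implemented as a sliding logical-OR over the kernel window.
    pad = dilation // 2
    padded = np.pad(edges, pad, mode="constant")
    h, w = edges.shape
    dilated = np.zeros_like(edges)
    for dy in range(dilation):
        for dx in range(dilation):
            dilated |= padded[dy:dy + h, dx:dx + w]
    return dilated

def aeq(low_quality_prob, mask):
    """AEQ = 1 - mean low-quality-edge probability inside the mask (Eq. 1)."""
    if not mask.any():
        return 1.0
    return 1.0 - float(low_quality_prob[mask].mean())
```

In practice `low_quality_prob` would be the probability map produced by the trained AEQ classifier on the eight-channel input described above.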


Figure 4: Trans-Adapter can be flexibly integrated into different image inpainting frameworks. We demonstrate two instantiations: (a) SD-Inpainting[34], which expands the input channels to encode the additional mask and masked image, and (b) BrushNet[13], which introduces an inpainting branch that accepts the inpainting conditioning. In BrushNet, the inpainting conditioning is the same as the SD-Inpainting’s input. Trans-Adapter enables these pipelines to process transparent images by incorporating trainable cross-domain self-attention, spatial alignment modules, and alpha map LoRA.

4 Trans-Adapter

Trans-Adapter is designed as a plug-and-play module that can be seamlessly integrated into existing diffusion-based inpainting frameworks. In this study, we show two instantiations of this adaptation, as illustrated in Fig.4. The first instantiation expands the input channels of Stable Diffusion’s U-Net to encode the mask and masked image, followed by fine-tuning the diffusion model. This approach is used in methods such as SD-Inpainting, with official implementations available for SD1.5[34], SDXL[26], and Flux[47]. The second instantiation introduces a trainable inpainting branch, as seen in ControlNet[52] and BrushNet[13].

Having a good network design can be tricky, as the network needs to simultaneously generate both the RGB image and its corresponding alpha map while enforcing feature alignment between color and transparency information. We overcome this challenge with an inflated network design. Specifically, we decompose the RGBA image into a separate padded RGB image[51] and its alpha channel, treating them as a two-frame video with a 5D input tensor, where the feature map is denoted as $f\in\mathbb{R}^{b\times c\times 2\times h\times w}$. As the tensor passes through the original weights of a pre-trained diffusion model, the latent is reshaped into $f_{r}\in\mathbb{R}^{2b\times c\times h\times w}$, stacking the latents of the RGB and alpha channels along the batch dimension $b$. This design allows the model to capture cross-channel dependencies and avoid the independent processing biases that can lead to RGB-alpha misalignment.
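The tensor bookkeeping behind this inflated design can be illustrated in a few NumPy lines. The shapes follow the text; the toy sizes and variable names are illustrative only.

```python
import numpy as np

b, c, h, w = 2, 4, 8, 8  # toy latent sizes

# RGBA latent as a "two-frame video": frame 0 = RGB latent, frame 1 = alpha latent.
f = np.random.randn(b, c, 2, h, w)

# Fold the frame axis into the batch so the frozen 2D layers of the
# pre-trained model see 2b independent feature maps.
f_r = f.transpose(0, 2, 1, 3, 4).reshape(2 * b, c, h, w)

# ... the pre-trained 2D UNet blocks would run on f_r here ...

# Unfold back to (b, c, 2, h, w) for the RGBA-aware adapter modules.
f_back = f_r.reshape(b, 2, c, h, w).transpose(0, 2, 1, 3, 4)
```

The round trip is lossless, which is what lets the frozen backbone process RGB and alpha latents without any architectural change.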

To enforce spatial coherence between the RGB and alpha components, we add a spatial alignment module and cross-domain self-attention in our design, inspired by AnimateDiff[7]. In the two instantiations shown in Fig.4, the original network parameters remain frozen, requiring only the newly introduced spatial alignment module and cross-domain self-attention to be trained. To facilitate the fine-tuning of these modules, all output projection layers in these modules are initialized to zero, a design choice validated by ControlNet[52], ensuring a stable training process while preserving the pre-trained model’s capabilities.

Training Strategy. We employ a two-stage training strategy. In the first stage, the Alpha Map LoRA[10] is trained to guide the model in reconstructing the alpha map. In the second stage, we jointly fine-tune the spatial alignment module, cross-domain self-attention module, and the alpha map LoRA, which primarily enables the model to inpaint well-aligned RGBA content. More training details are provided in the supplementary material.

Alpha Map LoRA. We introduce a LoRA[10] module into the pre-trained diffusion model to equip it with the ability to inpaint the alpha map. In this stage, the original weights of UNet remain frozen, and only the LoRA module attached to the U-Net is trained. The LoRA weights are initialized to zero, ensuring that the module learns only the necessary residuals for alpha map generation. During training, we simultaneously select both the RGB content and the alpha map from transparent images for fine-tuning. To further enhance the model’s ability to distinguish between RGB content and alpha content, we use different text prompts for each during training. Specifically, for alpha map inpainting, we prepend the text prompt with “alpha map of”, while for RGB content, we use the original prompt. This explicit separation guides the model to learn distinct inpainting strategies for RGB images and alpha maps.
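The prompt separation described above is simple enough to pin down in a sketch; the helper name is ours, while the "alpha map of" prefix comes from the text.

```python
def lora_prompt(prompt: str, target: str) -> str:
    """Prompt routing during Alpha Map LoRA training: alpha-map samples
    get an "alpha map of" prefix, RGB samples keep the original prompt."""
    if target == "alpha":
        return "alpha map of " + prompt
    if target == "rgb":
        return prompt
    raise ValueError(f"unknown target: {target}")
```

This explicit textual cue is what lets a single set of LoRA weights serve both inpainting modes.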

Spatial Alignment Module. In the shallower layers of Stable Diffusion's U-Net, we incorporate a spatial alignment module. Specifically, we utilize convolution layers to enforce spatial synchronization between the alpha map and the RGB latent representation at corresponding spatial locations. This design ensures that the RGB and alpha maps remain aligned even after passing through the VAE decoder. Given a feature tensor $f\in\mathbb{R}^{b\times c\times 2\times h\times w}$, we first reshape it into $f_{r}\in\mathbb{R}^{b\times 2c\times h\times w}$ and then use:

$$f=f_{r}+\mathcal{Z}_{c}(\mathbf{ConvBlock}(f_{r})),\quad(2)$$

where $\mathcal{Z}_{c}$ denotes the zero-initialized convolution layer.
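A minimal NumPy sketch of Eq. (2), with 1×1 channel-mixing matmuls standing in for the actual convolution layers; `W_zero` plays the role of the zero-initialized layer $\mathcal{Z}_{c}$, and all names are illustrative.

```python
import numpy as np

def spatial_align(f_r, W_hidden, W_zero):
    """Residual alignment: f = f_r + Z_c(ConvBlock(f_r)).

    1x1 convolutions are written as channel matmuls for brevity;
    the real module uses spatial convolution layers.
    """
    b, c2, h, w = f_r.shape
    x = f_r.reshape(b, c2, h * w)
    hidden = np.maximum(np.einsum('oc,bcs->bos', W_hidden, x), 0.0)  # "conv" + ReLU
    out = np.einsum('oc,bcs->bos', W_zero, hidden)                    # zero-init "conv"
    return f_r + out.reshape(b, c2, h, w)

b, c, h, w = 1, 3, 4, 4
f_r = np.random.randn(b, 2 * c, h, w)          # RGB and alpha latents stacked on channels
W_hidden = np.random.randn(2 * c, 2 * c) * 0.1
W_zero = np.zeros((2 * c, 2 * c))              # zero init => identity at start of training
out = spatial_align(f_r, W_hidden, W_zero)
```

Because $\mathcal{Z}_{c}$ starts at zero, attaching the module does not perturb the frozen backbone's behavior at the beginning of fine-tuning.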

Cross-domain Self-Attention. To ensure the consistency of the masked alpha channel with its surrounding regions, we incorporate cross-domain self-attention at the U-Net's bottleneck. For high-frequency details in the alpha channel that are difficult to predict, such as hair, cross-domain self-attention allows the masked region to better reference its surrounding areas for inpainting. After adding a 2D positional embedding to the latent representation, it is reshaped into $f_{r}\in\mathbb{R}^{b\times 2hw\times c}$ and processed through:

$$\textbf{self-attention}(f_{r})=\mathrm{softmax}\left(\frac{\mathbf{Q}_{i}\mathbf{K}_{i}^{\top}}{\sqrt{D}}\right)\mathbf{V}_{i},\quad(3)$$

where $\mathbf{Q}_{i}$, $\mathbf{K}_{i}$, and $\mathbf{V}_{i}$ denote $f_{r}$ processed by MLPs for the query, key, and value, respectively. Finally, a zero-initialized MLP $\mathcal{Z}_{M}$ is applied to the feature:

$$f=f_{r}+\mathcal{Z}_{M}(\textbf{AttentionBlock}(f_{r})).\quad(4)$$
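Eqs. (3)-(4) can be sketched as single-head attention over the concatenated RGB and alpha tokens. This is an illustrative NumPy version: positional embeddings and multi-head splitting are omitted, and the weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_domain_attn(f, Wq, Wk, Wv, W_zero):
    """Joint self-attention over RGB and alpha tokens.

    f: (b, c, 2, h, w); tokens from both "frames" attend to each other,
    and a zero-initialized projection keeps the residual inert at init.
    """
    b, c, _, h, w = f.shape
    f_r = f.transpose(0, 2, 3, 4, 1).reshape(b, 2 * h * w, c)  # (b, 2hw, c)
    q, k, v = f_r @ Wq, f_r @ Wk, f_r @ Wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c)) @ v
    return f_r + attn @ W_zero                                  # Eq. (4) residual

b, c, h, w = 1, 4, 3, 3
f = np.random.randn(b, c, 2, h, w)
Wq, Wk, Wv = (np.random.randn(c, c) * 0.1 for _ in range(3))
out = cross_domain_attn(f, Wq, Wk, Wv, np.zeros((c, c)))
```

With `W_zero` all zeros the module reduces to the identity, mirroring the zero-initialized MLP $\mathcal{Z}_{M}$ in the paper.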


Figure 5: Visual comparison of Trans-Adapter with official SD1.5-inpainting combined with other image matting and dichotomous image segmentation methods. Since our approach simultaneously generates both the alpha map and RGB, it performs well even for complex details and transparent objects.

Training Loss. We adopt the vanilla training objective as proposed in DDPM[9], which can be expressed as:

$$\mathcal{L}=\mathbb{E}_{\mathcal{E}(x_{0}),y,\epsilon\sim\mathcal{N}(0,I),t}\left[\lVert\epsilon-\epsilon_{\theta}(z_{t},t,\tau_{\theta}(y),C)\rVert_{2}^{2}\right],\quad(5)$$

where $y$ denotes the textual prompt and $\tau_{\theta}(\cdot)$ is a text encoder mapping the prompt to a sequence of tokens. The conditioning information is denoted as $C$, which, in the context of image inpainting, consists of the masked image latent and the mask. In the training stage of a latent diffusion network, an input image $x_{0}$ is first encoded into the latent space by a frozen encoder, yielding $z_{0}=\mathcal{E}(x_{0})$.
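A toy NumPy version of this objective, assuming a standard linear beta schedule (the paper does not specify one here): `q_sample` forms $z_{t}$ from $z_{0}$ and the sampled noise, and the loss compares the true noise against a prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule: alpha_bar_t decays from ~1 toward 0 over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    """Forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def ddpm_loss(eps_pred, eps):
    """Eq. (5): mean squared error between true and predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

z0 = rng.standard_normal((4, 4))       # stand-in for an encoded latent z0
eps = rng.standard_normal((4, 4))
z_t = q_sample(z0, t=500, eps=eps)
perfect = ddpm_loss(eps, eps)          # a perfect noise predictor gives zero loss
```

In the full model, `eps_pred` would come from the UNet $\epsilon_{\theta}(z_{t}, t, \tau_{\theta}(y), C)$, conditioned on the masked-image latent and mask.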

Table 1: Quantitative comparison of Trans-Adapter with different foreground extraction methods. For BiRefNet[55], we use the checkpoint from RMBG 2.0 for better results. The metrics are calculated on the entire inpainted image, which is blended with the original image via the mask. All experiments with SD1.5 are conducted at a resolution of 512×512, while those with SDXL are done at 1024×1024. The first four metric columns use the Pure Noise strategy; the last four use Blended Noise.

| Base Model | Method | AS↑ | LPIPS↓ | CLIP Sim↑ | AEQ↑ | AS↑ | LPIPS↓ | CLIP Sim↑ | AEQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Input (512×512) | – | 6.157 | / | 27.143 | 0.9866 | 6.157 | / | 27.143 | 0.9866 |
| SD1.5-inpaint | ZIM[14] | 6.011 | 0.0697 | 27.061 | 0.9866 | 6.044 | 0.0526 | 27.040 | 0.9874 |
| SD1.5-inpaint | U²-Net[29] | 5.970 | 0.0720 | 26.989 | 0.9502 | 6.007 | 0.0560 | 26.978 | 0.9537 |
| SD1.5-inpaint | BiRefNet[55] | 6.027 | 0.0682 | 27.034 | 0.9889 | 6.055 | 0.0515 | 27.049 | 0.9886 |
| SD1.5-inpaint | Ours | 6.025 | 0.0591 | 26.870 | 0.9871 | 6.097 | 0.0408 | 27.030 | 0.9878 |
| SD1.5-BrushNet | ZIM[14] | 5.929 | 0.0764 | 26.906 | 0.9853 | 5.942 | 0.0601 | 26.970 | 0.9852 |
| SD1.5-BrushNet | U²-Net[29] | 5.898 | 0.0779 | 26.813 | 0.9549 | 5.897 | 0.0640 | 26.892 | 0.9547 |
| SD1.5-BrushNet | BiRefNet[55] | 5.946 | 0.0742 | 26.897 | 0.9844 | 5.954 | 0.0592 | 26.952 | 0.9863 |
| SD1.5-BrushNet | Ours | 6.021 | 0.0757 | 26.987 | 0.9856 | 6.053 | 0.0505 | 26.941 | 0.9869 |
| Input (1024×1024) | – | 6.181 | / | 27.153 | 0.9849 | 6.181 | / | 27.153 | 0.9849 |
| SDXL | LayerDiffusion | 5.851 | 0.0903 | 26.831 | 0.9760 | 6.016 | 0.0642 | 27.097 | 0.9781 |
| SDXL-inpaint | ZIM[14] | 6.075 | 0.0693 | 27.126 | 0.9794 | 6.115 | 0.0461 | 27.111 | 0.9828 |
| SDXL-inpaint | U²-Net[29] | 6.031 | 0.0708 | 27.002 | 0.8093 | 6.066 | 0.0500 | 27.038 | 0.8367 |
| SDXL-inpaint | BiRefNet[55] | 6.087 | 0.0681 | 27.076 | 0.9843 | 6.129 | 0.0453 | 27.104 | 0.9859 |
| SDXL-inpaint | Ours | 6.042 | 0.0632 | 27.332 | 0.9844 | 6.140 | 0.0434 | 27.134 | 0.9872 |

Training Data. Existing transparent image datasets primarily originate from matting datasets[45, 38, 48] and generated images[5]. Although matting datasets contain natural images of high quality, their limited quantity makes them unsuitable for large-scale training. To ensure our method performs well on both natural and generated transparent images, we collect a new dataset by purchasing transparent images from an online PNG stock and manually filtering out those with jagged edges, blurriness, or low resolution (long side < 600 px), resulting in 35K high-quality images. Each image is then paired with a text prompt generated by the image captioning model LLaVA[20]. This training dataset encompasses a diverse range of categories, including people, objects, buildings, hand-drawn cartoon scenes, cartoon characters, and commercial artwork elements. Finally, we merge it with a subset of MAGICK[5], selecting 90% for training and reserving 10% for benchmarking.

5 Experiments


Figure 6: Visual comparison of Trans-Adapter's performance on SDXL-Inpainting against the official SDXL-Inpainting method combined with the image matting method ZIM[14] and the dichotomous image segmentation methods U²-Net[30] and BiRefNet[55]. The proposed method generates alpha maps aligned with the RGB images while preserving their quality.

To demonstrate the generalizability of our approach, we trained Trans-Adapter on SD1.5, BrushNet based on SD1.5, and SDXL. To ensure fairness in subsequent experiments, we maintained the same training resolution as in pretraining: 512×512 for SD1.5 and BrushNet, and 1024×1024 for SDXL. SD1.5 was trained on an Nvidia 3090, while SDXL was trained on an H100 due to memory constraints. Detailed settings are provided in the supplementary material.

5.1 Comparison with Existing Methods

Apart from the proposed AEQ (Sec.3.2), following Ju et al. [13], we also use Aesthetic Score (AS)[37], LPIPS[53], and CLIP Similarity[32] to assess image quality, masked region preservation, and text alignment, respectively. As these metrics are designed for RGB images, they cannot be directly applied to RGBA images. Moreover, since transparent images may contain arbitrary values in the RGB channels where the alpha value is zero, direct comparison using the RGB channels of transparent images is not reliable. To better evaluate the quality of our transparent image inpainting, we composite the transparent image onto both white and black backgrounds to generate two RGB images. We then compute the metrics against the original image processed in the same manner and take the average of the two results as the final score. This approach effectively addresses the issue where certain transparent images with defined RGB values (e.g., line drawings created in black) cannot be properly evaluated when blended with backgrounds of the same RGB value. During the evaluation, we replace the unmasked regions with the original input to ensure our metric only focuses on the inpainting region.
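The dual-background protocol can be sketched as follows; `metric` is any per-image RGB metric (a toy MSE stands in for AS/LPIPS/CLIP here), and the helper names are ours.

```python
import numpy as np

def composite(rgba, bg_value):
    """Alpha-composite an RGBA image (floats in [0, 1]) onto a solid background."""
    rgb, a = rgba[..., :3], rgba[..., 3:4]
    return rgb * a + bg_value * (1.0 - a)

def dual_background_score(pred_rgba, gt_rgba, metric):
    """Average a per-image RGB metric over white and black composites."""
    scores = [metric(composite(pred_rgba, bg), composite(gt_rgba, bg))
              for bg in (1.0, 0.0)]
    return sum(scores) / 2.0

# Toy stand-in metric; the real evaluation uses AS, LPIPS, and CLIP Similarity.
mse = lambda a, b: float(np.mean((a - b) ** 2))
img = np.random.rand(8, 8, 4)
```

Averaging over both white and black composites catches cases like black line art, whose errors would be invisible against a black background alone.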

Since there is no existing study on transparent image inpainting, we mainly compare our method with different diffusion-based inpainting methods followed by image matting [14] or dichotomous image segmentation [55, 29]. First, we composite a transparent image onto a gray background. After inpainting the target regions using [34, 26, 13], we apply image matting and dichotomous image segmentation techniques to extract the edited content. To ensure a fair comparison, we use the same inpainting network when comparing our method against the previous two-stage pipelines, as shown in Table 1. To provide a more comprehensive evaluation, we assess each method under two commonly used inpainting strategies: (1) initializing with pure noise, and (2) initializing with blended noise (where the masked region is blended with noise at a strength of 0.99). These strategies reflect typical settings in practical inpainting and editing scenarios.

The results demonstrate that our method achieves competitive and often better results than previous two-stage pipelines. As shown in Fig. 5, compared with these two-stage pipelines, our method can better reconstruct details in the alpha channel and avoid misalignment between RGB and alpha channels. Since LayerDiffuse [51] can also inpaint transparent images by utilizing Blended Latent Diffusion [1]’s denoising strategy, we compare our method with it in Fig. 6. As can be observed, our approach achieves more accurate structure reconstruction and better boundary consistency in challenging regions.

Table 2: Quantitative comparison of training with different datasets, network structures, and methods. ‘localized’ means localized inpainting – we only inpaint the pixels around the edge of the alpha channel. All these experiments are conducted on the SD1.5-inpainting model with a resolution of 512.

| Type | Method | Pure: AS↑ | LPIPS↓ | CLIP Sim.↑ | AEQ↑ | Blended: AS↑ | LPIPS↓ | CLIP Sim.↑ | AEQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| — | Ours | 6.025 | 0.0591 | 26.870 | 0.9871 | 6.097 | 0.0408 | 27.030 | 0.9878 |
| Dataset | w/o MAGICK | 5.992 | 0.0633 | 26.934 | 0.9867 | 6.073 | 0.0435 | 27.037 | 0.9873 |
| Dataset | w/o Ours | 6.007 | 0.0613 | 26.921 | 0.9865 | 6.067 | 0.0457 | 27.071 | 0.9881 |
| Network | AnimateDiff | 5.989 | 0.0623 | 26.793 | 0.9859 | 6.067 | 0.0459 | 27.032 | 0.9872 |
| Network | w/o Spatial Align. | 5.542 | 0.0712 | 26.170 | 0.9881 | 5.685 | 0.0580 | 26.642 | 0.9897 |
| Network | w/o Self-Attn | 6.011 | 0.0621 | 26.720 | 0.9872 | 6.065 | 0.0461 | 26.987 | 0.9880 |
| RGB Padding | Telea et al. [39] | 6.017 | 0.0603 | 26.470 | 0.9871 | 6.070 | 0.0427 | 26.681 | 0.9853 |
| RGB Padding | Telea et al. [39] localized | 6.021 | 0.0627 | 26.865 | 0.9868 | 6.064 | 0.0457 | 27.005 | 0.9861 |
| RGB Padding | Grey Background | 6.031 | 0.0584 | 27.017 | 0.9647 | 6.103 | 0.0462 | 26.998 | 0.9717 |

5.2 Ablation Study

Training Set. We analyze the effect of training data composition by removing either MAGICK or our dataset. As shown in Table 2, removing either dataset slightly degrades AS, LPIPS, and AEQ, demonstrating that both subsets contribute to the overall performance. This highlights the significance of our newly introduced dataset: each subset brings complementary benefits to the transparent image inpainting model.

Network Structure. We also conduct experiments in which we replace all spatial alignment modules with cross-domain self-attention (w/o Spatial Align.) and replace all cross-domain self-attention modules with spatial alignment modules (w/o Self-Attn). Additionally, we train the original AnimateDiff for comparison, as presented in Table 2. The visual results in Fig. 8 illustrate the importance of our designed modules. Without spatial alignment, the image occasionally exhibits large pure-color edges originating from the background, which unexpectedly inflates AEQ.

Image 7: Refer to caption

Figure 7: Visual comparison of various RGB padding strategies for filling transparent regions.

RGB Padding Method. Due to the nature of diffusion models, achieving perfectly aligned RGB and alpha channels is challenging. Since the RGB values in regions where alpha is 0 can be arbitrarily chosen, selecting an appropriate RGB padding strategy ensures that even if the alignment between RGB and alpha is imperfect, the final edge quality remains unaffected. However, as the network also needs to predict the RGB values in 0-alpha regions, an overly artificial padding strategy may cause the adapter to focus excessively on reconstructing the transparent regions, ultimately degrading the final image quality. An effective RGB padding strategy should ensure that, after training, the algorithm preserves both edge quality and the aesthetic appeal of the final image. As shown in Fig. 7, we evaluate different RGB padding methods for regions where the alpha value is lower than 20, including Zhang et al. [51]’s RGB padding, Telea et al. [39]’s inpainting, and grey background padding. Additionally, we test expanding the alpha map by 30 pixels, inpainting only these expanded regions using Telea et al. [39]’s inpainting method (the inpainting method implemented in OpenCV), and setting the remaining areas to gray. This approach is denoted as Telea et al. [39] localized in Table 2. We select LayerDiffuse [51]’s RGB padding as our baseline method, as it strikes a balance between aesthetic quality and edge preservation, as shown in Table 2.
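The localized padding variant can be sketched by splitting the transparent area into a near-edge band and a far region. The sketch below (alpha on a 0–255 scale, illustrative function names) only constructs the regions and gray-fills the far part; in practice, the near-edge band would then be filled with OpenCV’s `cv2.inpaint(..., flags=cv2.INPAINT_TELEA)`.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def localized_padding_regions(alpha, low_alpha_thresh=20, band_px=30):
    """Split the transparent area (alpha < 20) into a band within band_px
    of the visible content (to be inpainted) and a far remainder (gray)."""
    transparent = alpha < low_alpha_thresh
    visible = ~transparent
    # Transparent pixels within band_px (city-block distance) of visible content.
    near_edge = binary_dilation(visible, iterations=band_px) & transparent
    far = transparent & ~near_edge
    return near_edge, far

def gray_pad(rgb, alpha):
    """Gray-fill the far region; return the band that still needs inpainting."""
    near_edge, far = localized_padding_regions(alpha)
    out = rgb.copy()
    out[far] = 128  # gray background padding
    # In practice the band would then be filled, e.g.:
    # out = cv2.inpaint(out, near_edge.astype(np.uint8), 3, cv2.INPAINT_TELEA)
    return out, near_edge
```

Restricting Telea inpainting to the band keeps the padding natural near the alpha edge while leaving distant transparent areas cheap and uniform.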

Image 8: Refer to caption

Figure 8: Training Trans-Adapter with different network structures. The details of the hair in the first row demonstrate the effectiveness of our spatial alignment module. As shown in the second and third rows, the absence of self-attention leads to a lack of global information, making artifacts more likely to appear in the images produced by AnimateDiff and our w/o Self-Attn variant, which ultimately affects the aesthetic metrics.

6 Conclusion

We have introduced Trans-Adapter, a plug-and-play framework that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and integrates seamlessly into various community models. To facilitate the rigorous evaluation of transparent image inpainting methods, we proposed LayerBench, a benchmark specifically designed for this task. Furthermore, we introduced a novel non-reference evaluation metric that quantifies the alignment between the RGB and alpha channels, providing a reliable measure of RGB-alpha consistency. Experimental results demonstrate that Trans-Adapter outperforms existing pipelines in preserving transparency consistency and producing high-quality inpainted results.

Acknowledgement:

This study is supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

References

  • Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  • Avrahami et al. [2023] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM TOG, 42(4):1–11, 2023.
  • Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In ECCV. Springer, 2022.
  • Barnes et al. [2009] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM TOG, 28(3):24, 2009.
  • Burgert et al. [2024] Ryan D Burgert, Brian L Price, Jason Kuen, Yijun Li, and Michael S Ryoo. Magick: A large-scale captioned dataset from matting generated images using chroma keying. In CVPR, 2024.
  • Criminisi et al. [2004] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE TIP, 13(9), 2004.
  • Guo et al. [2024] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In ICLR, 2024.
  • Guttenberg [2023] Nicholas Guttenberg. Diffusion with offset noise. https://www.crosslabs.org/blog/diffusion-with-offset-noise, 2023.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33, 2020.
  • Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022.
  • Hu et al. [2024] Xiaobin Hu, Xu Peng, Donghao Luo, Xiaozhong Ji, Jinlong Peng, Zhengkai Jiang, Jiangning Zhang, Taisong Jin, Chengjie Wang, and Rongrong Ji. Diffumatting: Synthesizing arbitrary objects with matting-level annotation. In ECCV, pages 396–413. Springer, 2024.
  • Huang et al. [2024] Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, and Hang Xu. Layerdiff: Exploring text-guided multi-layered composable image synthesis via layer-collaborative diffusion model. In ECCV, 2024.
  • Ju et al. [2024] Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In ECCV, 2024.
  • Kim et al. [2024] Beomyoung Kim, Chanyong Shin, Joonhyun Jeong, Hyungsik Jung, Se-Yun Lee, Sewhan Chun, Dong-Hyun Hwang, and Joonsang Yu. Zim: Zero-shot image matting for anything. arXiv preprint arXiv:2411.00626, 2024.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
  • Li et al. [2024a] Jiachen Li, Jitesh Jain, and Humphrey Shi. Matting anything. In CVPR, 2024a.
  • Li et al. [2024b] Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, and Qiang Xu. Brushedit: All-in-one image inpainting and editing. arXiv preprint arXiv:2412.10316, 2024b.
  • Lin et al. [2021] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In CVPR, 2021.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
  • Liu et al. [2023] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2023.
  • Liu et al. [2024] Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, and Yujun Shen. Magicquill: An intelligent interactive image editing system. arXiv preprint arXiv:2411.09703, 2024.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In CVPR, pages 11461–11471, 2022.
  • Manukyan et al. [2023] Hayk Manukyan, Andranik Sargsyan, Barsegh Atanyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Hd-painter: High-resolution and prompt-faithful text-guided image inpainting with diffusion models. arXiv preprint arXiv:2312.14091, 2023.
  • Pathak et al. [2016] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
  • Podell et al. [2024] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
  • Pu et al. [2025] Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, et al. Art: Anonymous region transformer for variable multi-layer transparent image generation. arXiv preprint arXiv:2502.18364, 2025.
  • Qiao et al. [2020] Yu Qiao, Yuhao Liu, Xin Yang, Dongsheng Zhou, Mingliang Xu, Qiang Zhang, and Xiaopeng Wei. Attention-guided hierarchical structure aggregation for image matting. In CVPR, 2020.
  • Qin et al. [2020a] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106, 2020a.
  • Qin et al. [2020b] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern recognition, 106, 2020b.
  • Quattrini et al. [2024] Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, and Rita Cucchiara. Alfie: Democratising rgba image generation with no $$$. In ECCVW, 2024.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35, 2022.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35, 2022.
  • Sun et al. [2021] Yanan Sun, Chi-Keung Tang, and Yu-Wing Tai. Semantic image matting. In CVPR, 2021.
  • Telea [2004] Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, 9(1):23–34, 2004.
  • Tudosiu et al. [2024] Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation. In CVPR, 2024.
  • Wang et al. [2025] Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, and Yingcong Chen. Transpixar: Advancing text-to-video generation with transparency. In CVPR, 2025.
  • Wang et al. [2023] Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In CVPR, 2023.
  • Xie et al. [2024] Kangyang Xie, Binbin Yang, Hao Chen, Meng Wang, Cheng Zou, Hui Xue, Ming Yang, and Chunhua Shen. Zippo: Zipping color and transparency distributions into a single diffusion model. arXiv preprint arXiv:2403.11077, 2024.
  • Xie et al. [2023] Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang. Smartbrush: Text and shape guided object inpainting with diffusion model. In CVPR, pages 22428–22437, 2023.
  • Xu et al. [2017] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In CVPR, 2017.
  • Xu et al. [2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • Yang et al. [2024] Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, and Liang-Chieh Chen. 1.58-bit flux. arXiv preprint arXiv:2412.18653, 2024.
  • Yu et al. [2018] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In CVPR, pages 5505–5514, 2018.
  • Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • Zhang et al. [2018a] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE TPAMI, 41(8), 2018a.
  • Zhang and Agrawala [2024] Lvmin Zhang and Maneesh Agrawala. Transparent image layer diffusion using latent transparency. ACM TOG, 2024.
  • Zhang et al. [2023a] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023a.
  • Zhang et al. [2018b] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018b.
  • Zhang et al. [2023b] Xinyang Zhang, Wentian Zhao, Xin Lu, and Jeff Chien. Text2layer: Layered image generation using latent diffusion model. arXiv preprint arXiv:2307.09781, 2023b.
  • Zheng et al. [2024] Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation. CAAI AIR, 3, 2024.
  • Zhuang et al. [2024] Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In ECCV, 2024.


Supplementary Material

A Potential Applications

Transparent Image Editing. Rather than inpainting from pure noise, our method can also naturally serve as a transparent image editor. As shown in Fig. 9, users can draw color strokes on the original transparent image; these strokes are treated as a mask. We then add noise to the drawn RGB and alpha maps at a strength of 0.99 to form the initial noise, and perform denoising as in the previous approach to obtain the edited result.
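The strength-0.99 initialization can be sketched as a standard DDPM forward noising of the stroke-edited input, starting denoising near the end of the schedule. This is a minimal sketch assuming a linear beta schedule; names and schedule values are illustrative.

```python
import numpy as np

def blended_noise_init(x0, strength=0.99, T=1000, seed=0):
    """Initialize editing from a noised version of the stroke-edited input
    rather than pure noise (a DDPM q-sample sketch)."""
    rng = np.random.default_rng(seed)
    # Standard linear beta schedule, as in DDPM.
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)
    t = min(int(strength * T), T - 1)  # start step: 99% of the full schedule
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, t  # denoising then proceeds from step t down to 0
```

At strength 0.99 almost all of the signal is replaced by noise, but the residual trace of the strokes is enough to steer the denoising trajectory toward the user’s intent.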

Extending to Community Models. Since both BrushNet and our Trans-Adapter are plug-and-play modules, they can be applied together to other community models, enabling inpainting with different base models. As shown in Fig. 10, using different community models yields inpainting effects in various styles.

Table 3: Quantitative comparison of different LoRA training strategies under pure noise and blended noise settings.

| Method | Pure: AS↑ | LPIPS↓ | CLIP Sim↑ | AEQ↑ | Blended: AS↑ | LPIPS↓ | CLIP Sim↑ | AEQ↑ |
|---|---|---|---|---|---|---|---|---|
| Ours | 6.025 | 0.0591 | 26.870 | 0.9871 | 6.097 | 0.0408 | 27.030 | 0.9878 |
| Frame-specific LoRA | 5.992 | 0.0616 | 27.165 | 0.9855 | 6.089 | 0.0416 | 27.036 | 0.9863 |
| Frozen LoRA | 5.979 | 0.0637 | 27.043 | 0.9817 | 6.067 | 0.0471 | 27.005 | 0.9831 |
| Frozen Frame-specific LoRA | 5.985 | 0.0624 | 27.166 | 0.9856 | 6.082 | 0.0408 | 27.023 | 0.9859 |
| w/o LoRA | 5.982 | 0.0769 | 27.363 | 0.9863 | 6.091 | 0.0430 | 27.103 | 0.9868 |

ControlNet Extension. As shown in Fig. 11, we demonstrate that existing control models like ControlNet [52] can be applied to our model for enriched functionality. Since ControlNet does not provide a control model for SD-Inpainting and only supports T2I generation, we apply Trans-Adapter to BrushNet based on SD1.5 and then use ControlNet (with Scribbles) to control image details. The visualization results demonstrate that our method can effectively integrate ControlNet to control inpainting outputs, allowing users to guide structure and details more precisely.

B More Details of AEQ Assessment

Data Augmentation. We collect a high-quality transparent image dataset of 1,000 images with clean alpha edges from online PNG stock and matting data. Starting from these transparent images, we simulate images with varying degrees of edge quality to create a comprehensive training set. To generate low-quality edges, we: (1) set regions with an alpha value lower than 50 to a solid color, then perform a dilation or Gaussian blur operation on the alpha map; for the region added by dilation, we multiply the alpha by a mask generated from fractal noise to create a more realistic edge degradation effect; (2) composite the images onto different backgrounds (solid colors and natural scenery) and then process these images using different segmentation and matting methods [15, 16]. To quantify edge degradation, we compute the difference between the original image’s alpha map and its augmented counterpart, identifying regions with significant discrepancies as low-quality edges. In this way, we can train a segmentation network for alpha edge quality assessment.
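Degradation branch (1) can be sketched as follows. A smoothed random field stands in for the fractal noise, and the thresholds and function names are illustrative assumptions, not the paper’s exact values.

```python
import numpy as np
from scipy.ndimage import binary_dilation, gaussian_filter

def degrade_alpha(alpha, seed=0, dilate_px=3, blur_sigma=2.0, diff_thresh=25):
    """Simulate low-quality alpha edges: solidify near-zero alpha, dilate,
    blur, and modulate the expanded ring with a noise mask.
    Returns the degraded alpha map and a per-pixel low-quality label map."""
    rng = np.random.default_rng(seed)
    a = alpha.astype(np.float32)
    a[a < 50] = 0.0                                # (1) solidify low-alpha regions
    hard = a > 0
    dilated = binary_dilation(hard, iterations=dilate_px)
    expanded = dilated & ~hard                     # ring added by dilation
    a[dilated] = np.maximum(a[dilated], 255.0 * 0.5)
    a = gaussian_filter(a, sigma=blur_sigma)       # soften the boundary
    # Cheap stand-in for fractal noise: a smoothed random field as a mask.
    noise = gaussian_filter(rng.random(alpha.shape), sigma=4.0)
    a[expanded] *= (noise[expanded] > noise.mean()).astype(np.float32)
    a = np.clip(a, 0, 255).astype(np.uint8)
    # Low-quality labels: pixels whose alpha changed significantly.
    label = np.abs(a.astype(np.int16) - alpha.astype(np.int16)) > diff_thresh
    return a, label
```

The label map produced this way supervises the AEQ segmentation network: any pixel whose alpha the augmentation visibly altered is marked as a low-quality edge pixel.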

Image 9: Refer to caption

Figure 9: Demonstration of color-stroke-based transparent image editing. Users can draw on a transparent image and obtain an estimated result based on their strokes and a provided text prompt.

Image 10: Refer to caption

Figure 10: When combined with BrushNet, our Trans-Adapter also supports community models for inpainting.

Training Loss. Since the edge quality assessment is a binary classification task, we use a weighted cross-entropy loss to address class imbalance, as low-quality edge pixels are much fewer than high-quality ones. We assign a higher weight to the low-quality class and further emphasize edge regions using an edge mask $\mathcal{M}_{e}$.

Here, $I_{\text{concat}}$ denotes the input, $\mathcal{F}$ is the segmentation network, $y$ is the ground-truth label map, and $\mathcal{M}_{e}$ is the edge mask. The loss is defined as:

$$\mathcal{L}=\frac{1}{HW}\sum_{i=1}^{HW}\mathrm{CE}\left(\mathcal{F}(I_{\text{concat}})_{i},\,y_{i},\,w_{y_{i}}\right)\cdot\left(1+w_{e}\mathcal{M}_{e}\right)\qquad(6)$$

where $w_{y_{i}}$ denotes the class weight ($w_{0}$ for high-quality, $w_{1}$ for low-quality; typically $w_{0}=1.0$ and $w_{1}=10.0$), and $w_{e}$ is the edge weight (set to 4.0). This loss encourages the model to focus more on low-quality edge regions during training.
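Eq. (6) can be written out directly. The sketch below is a plain NumPy implementation with an assumed `(H, W, 2)` logits layout; the function name is illustrative.

```python
import numpy as np

def weighted_edge_ce(logits, y, edge_mask, w_class=(1.0, 10.0), w_e=4.0):
    """Per-pixel weighted cross-entropy of Eq. (6): class-weighted CE,
    up-weighted by (1 + w_e * M_e) on edge pixels.
    logits: (H, W, 2) raw scores; y: (H, W) labels in {0, 1};
    edge_mask: (H, W) in {0, 1}."""
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # CE(F(I)_i, y_i, w_{y_i}) = -w_{y_i} * log p(y_i)
    ce = -np.take_along_axis(log_probs, y[..., None], axis=-1)[..., 0]
    ce *= np.asarray(w_class)[y]        # class weight w_{y_i}
    ce *= 1.0 + w_e * edge_mask         # edge emphasis (1 + w_e * M_e)
    return ce.mean()                    # (1 / HW) * sum over pixels
```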

Image 11: Refer to caption

Figure 11: Our approach can be directly combined with control models like ControlNet[52] to enhance functionality. Users can define a mask and outline the inpainting region to generate a transparent image.

AEQ Visualization. To demonstrate the effect of our proposed AEQ, we visualize the estimated artifact maps for images in various styles. As shown in Fig.12, AEQ effectively highlights boundary artifacts across different image types, indicating its strong generalization ability beyond specific styles.

Network Architecture and Training Details. We adopt a lightweight U-Net-based segmentation network comprising three downsampling and three upsampling blocks. Each block contains two convolutional layers with ReLU activation and batch normalization. The network takes an 8-channel input and produces a 2-channel output, with hidden channel dimensions of 64, 128, and 256, respectively. We use the Adam optimizer with a learning rate of $1\times10^{-4}$. Training is performed on images at resolutions of $512\times512$ and $1024\times1024$, using a batch size of 4 and randomly selecting a resolution for each batch. The network is trained for 40,000 iterations.

C More Details of Trans-Adapter

Stage 1: Alpha Map LoRA Training. To enable the generation of large areas of pure black and pure white in the alpha map during inpainting, we adopt offset noise [8] with a weight of 0.1 during LoRA training, as conventional finetuning often struggles with such cases. The LoRA rank is set to 16 and the LoRA alpha is set to 32. We use the AdamW optimizer with an initial learning rate of $1\times10^{-4}$, and train for 40,000 steps with a batch size of 4.
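Offset noise [8] amounts to adding a per-(batch, channel) constant term to the standard Gaussian training noise, giving the model freedom to shift an image’s overall brightness, which matters for alpha maps dominated by pure black or pure white. A minimal sketch (the `(B, C, H, W)` shape convention is an assumption):

```python
import numpy as np

def offset_noise(shape, weight=0.1, seed=0):
    """Standard Gaussian noise plus a per-(batch, channel) constant offset,
    broadcast across all spatial positions. shape = (B, C, H, W)."""
    rng = np.random.default_rng(seed)
    b, c, h, w = shape
    eps = rng.standard_normal(shape)
    # One extra sample per (batch, channel), shared by every pixel.
    return eps + weight * rng.standard_normal((b, c, 1, 1))
```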

Stage 2: Joint Finetuning. In the joint finetuning stage, we first load the pretrained LoRA weights, then zero-initialize the spatial alignment module and cross-domain self-attention module. These two modules are finetuned with a learning rate of $5\times10^{-5}$ using the AdamW optimizer for 100,000 steps, with a batch size of 4. We conduct an ablation study to compare different training strategies, including frame-specific LoRA (training LoRA only on alpha images, affecting only the alpha channel), frozen LoRA (training LoRA on both RGB and alpha images, then freezing LoRA during stage 2), frozen frame-specific LoRA, and training without LoRA (directly training stage 2 without LoRA). As shown in Table 3, our full method achieves the best AS, LPIPS, and AEQ under both pure noise and blended noise settings, validating the effectiveness of the two-stage training. Removing or freezing LoRA degrades AEQ and LPIPS, highlighting the importance of LoRA-based alpha map pretraining.
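Zero-initializing the new modules means their residual contribution is exactly zero at the start of finetuning, so the pretrained base path is initially unchanged and training cannot destabilize it at step 0. A toy sketch of this wiring (names are illustrative, not from the released code):

```python
import numpy as np

def zero_init_residual(x, w_body, w_out):
    """A new module added on a residual branch whose final projection w_out
    starts at zero: at initialization the output equals the input x."""
    h = np.maximum(0.0, x @ w_body)  # stand-in for the adapter body
    return x + h @ w_out             # residual connection through w_out
```

With `w_out = np.zeros(...)` the module is an identity at first, and gradients flowing into `w_out` gradually switch the adapter on during finetuning.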

Image 12: Refer to caption

Figure 12: Estimated artifact map of our proposed AEQ for images in different styles. (Zoom in for better visualization.)

D Limitations

While our proposed method achieves strong performance in transparent image inpainting and editing, several limitations remain. First, our approach is based on SD1.5 and SDXL, which are known to struggle with generating realistic faces and hands during inpainting. This can result in artifacts when editing or restoring these regions, limiting applicability in scenarios requiring high-fidelity human features. Second, when the strength is set to 1.0 for SDXL-inpainting 0.1, the quality of the generated image degrades, which also affects our RGBA inpainting results. This limitation originates from the base checkpoint. Finally, the alpha maps in the MAGICK dataset[5] used for training are not perfect; some regions, such as eyes, are translucent. As a result, our model may also produce undesired translucent areas in these regions during inpainting.
