Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layer separation.
We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions.
Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
Given a composite image, the model receives a masked input, a binary mask indicating the logo region, and a noise latent, and predicts both the isolated logo and the clean object. The process leverages the in-context learning capabilities of the model.
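As a rough illustration of how these three conditioning inputs might be packed together, the sketch below builds toy stand-ins and concatenates them channel-wise, one plausible conditioning scheme for a diffusion backbone. All array names, shapes, and the concatenation layout are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three conditioning inputs (shapes are illustrative):
H, W = 64, 64
composite = rng.random((H, W, 3))                     # composite image: logo on an object
mask = np.zeros((H, W, 1))
mask[16:48, 16:48] = 1.0                              # binary mask marking the logo region
masked_input = composite * (1.0 - mask)               # composite with the logo region blanked out
noise_latent = rng.standard_normal((H, W, 4))         # initial diffusion noise (4 latent channels assumed)

# One plausible packing for a diffusion model: channel-wise concatenation.
model_input = np.concatenate([masked_input, mask, noise_latent], axis=-1)
print(model_input.shape)  # (64, 64, 8)
```

From this packed input, the model would then predict two outputs, the isolated logo layer and the clean object layer, rather than a single image.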
Image decomposition is inherently ill-posed.
We aim to split one image into multiple layers, but the layers are not uniquely determined, especially when they are coupled through lighting or perspective distortions.
Composition, however, is deterministic — and this is the key insight. Once the layers are known, recombining them must always reproduce the original image.
Therefore, we introduce cycle consistency.
We train the network to decompose an image, recombine the predicted layers, and recover the input. If the reconstruction fails, the decomposition must be wrong.
This cyclic loop provides a reliable supervision signal.
By jointly learning decomposition and composition, the model stabilizes training under severe appearance coupling and greatly reduces ambiguity.
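The cycle check above can be sketched numerically. In the paper the composition model is itself learned; here, as a deliberately simplified assumption, a fixed alpha-composite stands in for it, so that a perfect decomposition recomposes exactly while a wrong one yields a positive reconstruction error. The function names `compose` and `cycle_loss` are hypothetical.

```python
import numpy as np

def compose(logo, obj, mask):
    # Deterministic recomposition: paste the logo layer over the object
    # inside the mask (stand-in for the learned composition model).
    return mask * logo + (1.0 - mask) * obj

def cycle_loss(composite, pred_logo, pred_obj, mask):
    # Mean squared error between the input and the recomposed prediction.
    recon = compose(pred_logo, pred_obj, mask)
    return float(np.mean((composite - recon) ** 2))

rng = np.random.default_rng(0)
H, W = 32, 32
mask = np.zeros((H, W, 1))
mask[8:24, 8:24] = 1.0
logo = rng.random((H, W, 3))
obj = rng.random((H, W, 3))
composite = compose(logo, obj, mask)

# A correct decomposition recomposes exactly: zero cycle loss.
print(cycle_loss(composite, logo, obj, mask))        # 0.0
# A wrong decomposition (layers swapped) fails to reconstruct the input.
print(cycle_loss(composite, obj, logo, mask) > 0.0)  # True
```

In training, this reconstruction error would be backpropagated through both the decomposition and composition networks, which is what turns the deterministic recomposition into a supervision signal.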
@article{gu2026cycle,
title = {Cycle-Consistent Tuning for Layered Image Decomposition},
author = {Gu, Zheng and Lu, Min and Sun, Zhida and Lischinski, Dani and Cohen-Or, Daniel and Huang, Hui},
year = {2026},
}