This paper introduces Bifröst, a novel 3D-aware framework built upon diffusion models for instruction-based image composition. Previous methods concentrate on image compositing at the 2D level, which falls short in handling complex spatial relationships (e.g., occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during generation to bridge the gap between 2D and 3D, enhancing spatial comprehension and supporting sophisticated spatial interactions. Our method begins by fine-tuning an MLLM on a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. The image-compositing model is then designed to process multiple types of input features, enabling high-fidelity compositions that account for occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by effectively repurposing existing resources.
Achieving arbitrary personalized object-level image compositing requires a deep understanding of both the identity of the reference object and the spatial relations of the background image. To date, this task has not been well addressed. Paint-by-Example (Yang et al., 2023) and ObjectStitch (Song et al., 2023) use a target image as the template to edit a specific region of the background image, but they cannot generate ID-consistent content, especially for untrained categories. On the other hand, (Chen et al., 2024; Song et al., 2024) generate objects with identity (ID) preserved in the target scene, but they fall short in handling complicated 3D geometric relations (e.g., occlusion) because they only consider 2D-level composition. In summary, previous methods either 1) fail to achieve both ID preservation and background harmony, or 2) do not explicitly account for the geometry of the background and thus fail to composite objects and backgrounds accurately under complex spatial relations.
We conclude that the root cause of the aforementioned issues is that image composition is conducted at the 2D level. Ideally, the composition should be performed in 3D space to obtain precise 3D geometric relationships. However, accurately modeling a 3D scene from an arbitrary image, especially from a single view, is non-trivial and time-consuming (Liu et al., 2023b). To address these challenges, we introduce Bifröst, a 3D-aware framework for image composition that requires no explicit 3D modeling. We achieve this by leveraging depth to encode the 3D geometric relationship between the object and the background. In detail, our approach uses a multi-modal large language model (MLLM) as a 2.5D location predictor (i.e., a bounding box and a depth value for the object in the given background image). With the predicted bounding box and depth, our method produces a depth map for the composited image, which is fed into a diffusion model as guidance. This enables our method to achieve good ID preservation and background harmony simultaneously, as the model is now aware of the spatial relations between object and background, and conflicts in the depth dimension are eliminated. In addition, the MLLM enables our method to composite images from text instructions, which broadens the application scenarios of Bifröst. Bifröst achieves significantly better visual results than previous methods, which in turn validates our analysis.
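To make the depth guidance concrete, below is a minimal sketch of how a composite depth map could be assembled from the MLLM-predicted bounding box and depth value. All names and the depth convention (smaller values = closer to the camera) are illustrative assumptions, not the released implementation.

```python
# Sketch (assumptions, not the released code): paste the object's depth into the
# background depth at the predicted 2.5D location to form occlusion-aware guidance.
import numpy as np
import cv2


def compose_depth(bg_depth: np.ndarray,    # (H, W) background depth map
                  obj_depth: np.ndarray,   # (h, w) object depth map
                  obj_mask: np.ndarray,    # (h, w) binary object mask
                  bbox: tuple,             # (x1, y1, x2, y2) predicted box
                  target_depth: float) -> np.ndarray:
    """Rescale the object depth so its mean matches the predicted depth value,
    then write it into the background depth only where the object is in front."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    obj_d = cv2.resize(obj_depth.astype(np.float32), (w, h))
    mask = cv2.resize(obj_mask.astype(np.float32), (w, h)) > 0.5

    # Shift the object's depth so that its mean equals the predicted value.
    obj_d = obj_d - obj_d[mask].mean() + target_depth

    composed = bg_depth.copy()
    region = composed[y1:y2, x1:x2]
    # Assumed convention: smaller depth = closer. The object is visible only
    # where it lies in front of the background, which yields occlusion.
    visible = mask & (obj_d < region)
    region[visible] = obj_d[visible]
    return composed
```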
Our main contributions can be summarized as follows: 1) We are the first to embed depth into the image composition pipeline, improving ID preservation and background harmony simultaneously. 2) We carefully build a counterfactual dataset and fine-tune an MLLM as a powerful tool to predict the 2.5D location of an object in a given background image; the fine-tuned MLLM further enables our approach to follow language instructions for image composition. 3) Our approach demonstrates exceptional performance in comprehensive qualitative and quantitative evaluations of image compositing and outperforms other methods; Bifröst allows us to generate images with better control of occlusion, depth blur, and image harmonization.
Our method consists of two stages: 1) in stage 1, given an object image, a background image, and a text instruction indicating where the object should be composited into the background, an MLLM fine-tuned on our customized dataset predicts a 2.5D location, i.e., a bounding box and a depth value for the object in the background; 2) in stage 2, Bifröst performs 3D-aware image compositing according to the predicted 2.5D location, the object and background images, and their depth maps estimated by a depth predictor. Because the pipeline is divided into two stages, we can adopt existing benchmarks collected for common vision problems and avoid the need to collect new, task-specific paired data.
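For clarity, the two-stage inference flow could look like the sketch below. The function and argument names (`mllm`, `depth_predictor`, `compositor`, etc.) are placeholders standing in for the fine-tuned MLLM, the off-the-shelf depth predictor, and the compositing diffusion model, not the actual API.

```python
# Illustrative two-stage inference flow (placeholder names, not the released API).

def compose(object_img, background_img, instruction,
            mllm, depth_predictor, compositor):
    # Stage 1: the fine-tuned MLLM maps (images, instruction) -> 2.5D location,
    # i.e., a bounding box plus a scalar depth value.
    bbox, depth_value = mllm.predict_location(
        object_img, background_img, instruction)

    # Stage 2: estimate depth maps and run depth-conditioned compositing.
    obj_depth = depth_predictor(object_img)
    bg_depth = depth_predictor(background_img)
    return compositor(object_img, background_img,
                      obj_depth, bg_depth,
                      bbox=bbox, depth=depth_value)
```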
Overview of the 2.5D counterfactual dataset generation for fine-tuning the MLLM. Given a scene image I, one object o is randomly selected as the prediction target (e.g., the laptop in this figure). The depth of the object is estimated by a pre-trained depth predictor. The selected object is then removed from the image using SAM (i.e., masking the object) followed by an SD-based inpainting model (i.e., inpainting the masked hole). The final data pair consists of a text instruction, a counterfactual image, and the 2.5D location of the selected object o.
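The caption above describes the per-sample data generation; a minimal sketch is given below, assuming placeholder callables for the segmenter (SAM-style), the SD-based inpainter, and the depth predictor, and a hypothetical `instance.category` attribute for building the instruction.

```python
# Sketch of building one counterfactual training pair (placeholder callables,
# not the exact dataset code).
import numpy as np


def mask_to_bbox(mask: np.ndarray):
    ys, xs = np.where(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def make_counterfactual_pair(scene_img, instance, segment, inpaint, predict_depth):
    """Return (instruction, counterfactual image, 2.5D location) for one object."""
    mask = segment(scene_img, instance)            # SAM-style instance mask
    depth = predict_depth(scene_img)               # per-pixel depth map
    obj_depth = float(depth[mask].mean())          # scalar depth of the object

    # Remove the object so the MLLM must infer where it *should* be placed.
    counterfactual = inpaint(scene_img, mask)      # SD-based inpainting

    bbox = mask_to_bbox(mask)                      # (x1, y1, x2, y2)
    instruction = f"Place the {instance.category} in the scene."  # hypothetical field
    target = {"bbox": bbox, "depth": obj_depth}
    return instruction, counterfactual, target
```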
Examples of 2.5D counterfactual dataset for fine-tuning MLLM.
Overview of the training pipeline of Bifröst at the image-compositing stage. A segmentation module first produces the masked image and the object without background, and an ID extractor then obtains its identity information. A high-frequency filter extracts the object's details, the result is stitched onto the scene at the predicted location, and a detail extractor complements the ID extractor with texture details. A depth predictor estimates the depth of the image, and a depth extractor captures the spatial information of the scene. Finally, the ID tokens, detail maps, and depth maps are integrated into a pre-trained diffusion model, enabling the target object to blend seamlessly with its surroundings while preserving complex spatial relationships.
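As a rough illustration of the high-frequency detail branch, the sketch below uses a Sobel-based filter and a simple paste at the predicted box; the paper's exact filter and stitching may differ, and all function names are assumptions.

```python
# Sketch of a high-frequency detail map and its stitching onto the scene canvas
# (Sobel-based; illustrative only).
import cv2
import numpy as np


def high_frequency_map(obj_img: np.ndarray, obj_mask: np.ndarray) -> np.ndarray:
    """Extract high-frequency structure (edges/texture) inside the object mask."""
    gray = cv2.cvtColor(obj_img, cv2.COLOR_RGB2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    hf = cv2.magnitude(gx, gy)
    hf = hf / (hf.max() + 1e-6)                  # normalize to [0, 1]
    return hf * obj_mask.astype(np.float32)      # keep details only on the object


def stitch_detail(scene_hf: np.ndarray, obj_hf: np.ndarray, bbox) -> np.ndarray:
    """Paste the object's detail map into the scene canvas at the predicted box."""
    x1, y1, x2, y2 = bbox
    obj_hf = cv2.resize(obj_hf, (x2 - x1, y2 - y1))
    out = scene_hf.copy()
    out[y1:y2, x1:x2] = np.maximum(out[y1:y2, x1:x2], obj_hf)
    return out
```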
Data preparation pipeline of leveraging videos. Given a clip, we first sample two frames, selecting an instance from one frame as the reference object and using the corresponding instance from the other frame as the training supervision.
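A minimal sketch of this video-based sampling is shown below; the `clip` interface (`len`, indexing, `instance_ids`, `mask(frame_idx, instance_id)`) is a placeholder for whatever video dataset with instance masks is used, not the actual data pipeline.

```python
# Illustrative sampling of a training pair from a video clip (placeholder dataset API).
import random


def sample_video_pair(clip, min_gap: int = 10):
    """Return (reference object crop, target frame, target mask) for training."""
    i = random.randrange(0, len(clip) - min_gap)
    j = random.randrange(i + min_gap, len(clip))       # temporally separated frame

    ref_frame, tgt_frame = clip[i], clip[j]
    instance_id = random.choice(clip.instance_ids)     # instance visible in both frames

    ref_mask = clip.mask(i, instance_id)               # (H, W) boolean mask
    tgt_mask = clip.mask(j, instance_id)

    # Reference object with the background zeroed out; the same instance in the
    # other frame serves as the training supervision.
    reference = ref_frame * ref_mask[..., None]
    return reference, tgt_frame, tgt_mask
```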
Qualitative comparison with reference-based image generation methods, including Paint-by-Example (Yang et al., 2023), ObjectStitch (Song et al., 2023), and AnyDoor (Chen et al., 2024), where our Bifröst better preserves geometric consistency. Note that none of the approaches fine-tunes its model on the test samples.
Results of other application scenarios of Bifröst.
Qualitative ablation study on the core components of Bifröst, where the last column is the result of our full model, “HF-Filter” stands for the high-frequency filter in the detail extractor.
Ablation study on different depth-control values, varying from deep to shallow.
@INPROCEEDINGS{Li24,
title = {BIFRÖST: 3D-Aware Image compositing with Language Instructions},
author = {Lingxiao Li and Kaixiong Gong and Weihong Li and Xili Dai and Tao Chen and Xiaojun Yuan and Xiangyu Yue},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2024}
}
Acknowledgements: We thank DreamBooth for the page templates.