We introduce a unified image-text generation framework that integrates discrete text diffusion (masked diffusion) with continuous image diffusion (flow matching). Through this joint diffusion approach, we can transform a pretrained text-to-image model (SD3 Medium) into a versatile vision-language model with a moderate amount of fine-tuning.
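To make the two noising processes concrete, below is a minimal PyTorch sketch of the forward corruption applied to each modality: discrete masking for text tokens and a linear (rectified-flow) interpolation toward Gaussian noise for image latents. The token id `MASK_ID`, the tensor shapes, and the linear schedule are illustrative assumptions for this sketch, not values taken from the paper.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id

def mask_text(tokens: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Discrete (masked) diffusion: each token in the (B, L) batch is
    independently replaced by [MASK] with probability t (the diffusion time)."""
    is_masked = torch.rand_like(tokens, dtype=torch.float) < t[:, None]
    return torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)

def noise_image(latents: torch.Tensor, t: torch.Tensor):
    """Continuous diffusion via flow matching: linearly interpolate between
    clean latents (t=0) and Gaussian noise (t=1), and return the
    velocity-regression target used by rectified-flow training."""
    eps = torch.randn_like(latents)
    t_ = t.view(-1, 1, 1, 1)                    # broadcast over (B, C, H, W)
    x_t = (1.0 - t_) * latents + t_ * eps
    velocity_target = eps - latents             # d x_t / d t under this path
    return x_t, velocity_target
```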
(a) Overall model architecture, based on MM-DiT; (b) during training for (image-conditioned) text denoising, the text input is randomly masked while the image is kept noise-free; (c) during training for text-conditioned image denoising, the image is randomly noised while the text is kept noise-free.
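Panels (b) and (c) correspond to two training objectives that corrupt exactly one modality at a time while the other stays clean. Here is a hedged sketch of one such training step, reusing `mask_text` and `noise_image` from the sketch above; the `model` call signature is a hypothetical stand-in, and the unweighted cross-entropy is a simplification (masked-diffusion losses are typically computed only on masked positions, with a time-dependent weight).

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens: torch.Tensor, latents: torch.Tensor, task: str):
    """One joint-diffusion training step. task="text" mirrors panel (b),
    task="image" mirrors panel (c): exactly one modality is corrupted.
    Assumes model(tokens, latents, t) -> (text_logits, image_velocity)."""
    t = torch.rand(tokens.size(0), device=tokens.device)
    if task == "text":
        # (b) image-conditioned text denoising: mask text, keep image clean.
        noisy_tokens = mask_text(tokens, t)
        logits, _ = model(noisy_tokens, latents, t)
        loss = F.cross_entropy(logits.transpose(1, 2), tokens)
    else:
        # (c) text-conditioned image denoising: noise image, keep text clean.
        x_t, v_target = noise_image(latents, t)
        _, v_pred = model(tokens, x_t, t)
        loss = F.mse_loss(v_pred, v_target)
    return loss
```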
Masked text diffusion with bi-directional attention removes the restriction of processing input modalities in a fixed sequential order (unlike typical autoregressive models).
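As a rough illustration of why bi-directional attention imposes no ordering on the modalities, the sketch below concatenates text and image tokens and runs full (non-causal) attention over the joint sequence, in the spirit of MM-DiT-style joint blocks; the function name and `(batch, heads, seq, dim)` shapes are assumptions of this sketch. An AR model would instead pass a causal mask here, forcing one modality to precede the other.

```python
import torch
import torch.nn.functional as F

def joint_bidirectional_attention(text_qkv, image_qkv):
    """Full attention over concatenated text+image tokens. Each input is a
    (q, k, v) triple of shape (batch, heads, seq, dim). Every token attends
    to every other token, so neither modality has to come "first"."""
    q = torch.cat([text_qkv[0], image_qkv[0]], dim=2)
    k = torch.cat([text_qkv[1], image_qkv[1]], dim=2)
    v = torch.cat([text_qkv[2], image_qkv[2]], dim=2)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
    n_text = text_qkv[0].size(2)
    return out[:, :, :n_text], out[:, :, n_text:]  # split back per modality
```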
@misc{li2024dualdiffusionunifiedimage,
  title={Dual Diffusion for Unified Image Generation and Understanding},
  author={Zijie Li and Henry Li and Yichun Shi and Amir Barati Farimani and Yuval Kluger and Linjie Yang and Peng Wang},
  year={2024},
  eprint={2501.00289},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.00289},
}