Dual Diffusion for Unified Image Generation and Understanding

*Equal contribution, work done during an internship at ByteDance
1Carnegie Mellon University, 2Yale University, 3ByteDance Seed

Brief intro

We introduce a unified image-text generation framework that integrates discrete text diffusion (masked diffusion) with continuous image diffusion (flow matching). Through this joint diffusion approach, we can transform a pretrained text-to-image model (SD3 medium) into a versatile vision-language model with a moderate amount of fine-tuning.
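To make the two corruption processes concrete, here is a minimal sketch pairing absorbing-state (masked) text diffusion with a rectified-flow interpolation for image latents, in the style of SD3's flow-matching objective. The [MASK] token id, tensor shapes, and the linear noise schedule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

MASK_ID = 32000  # hypothetical id of the [MASK] token in the text vocabulary

def mask_text_tokens(tokens: torch.LongTensor, t: torch.Tensor) -> torch.LongTensor:
    """Absorbing-state (masked) diffusion: each token is independently
    replaced by [MASK] with probability t (the discrete noise level)."""
    mask = torch.rand_like(tokens, dtype=torch.float) < t.unsqueeze(-1)
    return torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

def noise_image_latents(x0: torch.Tensor, t: torch.Tensor):
    """Flow-matching (rectified-flow) corruption: linearly interpolate between
    clean latents x0 and Gaussian noise, and return the velocity target."""
    eps = torch.randn_like(x0)
    t_ = t.view(-1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * eps   # noisy latents at time t
    v_target = eps - x0               # velocity the denoiser should predict
    return xt, v_target
```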

Visual question answering with Dual-Diffusion.

Model overview

(a) Overall model architecture, based on MM-DiT; (b) during training for (image-conditioned) text denoising, the text input is randomly masked while the image is kept noise-free; (c) during training for text-conditioned image denoising, the image is randomly noised while the text is kept noise-free.
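The two regimes in (b) and (c) can be sketched as a single training step that picks one branch per batch, reusing the helpers defined above. The model signature, the 50/50 branch choice, and the loss weighting are hypothetical placeholders rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def training_step(model, batch):
    """One dual-diffusion update: either train the text-denoising branch with
    a clean image, or the image-denoising branch with clean text."""
    tokens, latents = batch["text_tokens"], batch["image_latents"]
    B = tokens.shape[0]

    if torch.rand(()) < 0.5:
        # (b) image-conditioned text denoising: mask text, keep image noise-free
        t = torch.rand(B, device=tokens.device)
        noisy_tokens = mask_text_tokens(tokens, t)
        logits = model(text=noisy_tokens, image=latents, t_text=t)
        masked = noisy_tokens == MASK_ID
        loss = F.cross_entropy(logits[masked], tokens[masked])
    else:
        # (c) text-conditioned image denoising: noise image, keep text clean
        t = torch.rand(B, device=latents.device)
        xt, v_target = noise_image_latents(latents, t)
        v_pred = model(text=tokens, image=xt, t_image=t)
        loss = F.mse_loss(v_pred, v_target)
    return loss
```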


Masked text diffusion with bi-directional attention removes the restriction of processing input modalities in a specific sequential order (unlike typical AR models).
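One way to see the practical benefit at inference time: a masked-diffusion answer can be decoded in parallel, filling in whichever positions the model is most confident about at each step rather than generating left to right. The sketch below is a generic confidence-based unmasking loop (noise-level inputs and special tokens omitted for brevity), not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def answer_question(model, image_latents, prompt_tokens, answer_len=32, steps=8):
    """Iteratively unmask an all-[MASK] answer; bidirectional attention lets
    tokens be committed in any order, not just left to right."""
    device = prompt_tokens.device
    answer = torch.full((1, answer_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(steps):
        tokens = torch.cat([prompt_tokens, answer], dim=1)
        logits = model(text=tokens, image=image_latents)[:, -answer_len:]
        conf, pred = logits.softmax(-1).max(-1)
        still_masked = answer == MASK_ID
        if not still_masked.any():
            break
        conf = conf.masked_fill(~still_masked, -1.0)   # never overwrite committed tokens
        k = max(1, still_masked.sum().item() // (steps - step))
        idx = conf.topk(k, dim=-1).indices
        answer.scatter_(1, idx, pred.gather(1, idx))   # commit the top-k confident tokens
    return answer
```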

Image understanding and image generation results

After dual-diffusion training (including visual-instruction tuning), the model is capable of performing visual question answering.


Meanwhile, the text-to-image generation capability of the original model is preserved.

Multi-modal benchmark

Quantitative comparison between different vision-language models. Our model compares favorably against other AR+Diffusion VLMs.

BibTeX


@misc{li2024dualdiffusionunifiedimage,
  title={Dual Diffusion for Unified Image Generation and Understanding}, 
  author={Zijie Li and Henry Li and Yichun Shi and Amir Barati Farimani and Yuval Kluger and Linjie Yang and Peng Wang},
  year={2024},
  eprint={2501.00289},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.00289}, 
}