R-Genie: Reasoning-Guided Generative Image Editing

The Hong Kong University of Science and Technology, Nanjing University of Science and Technology
Demo 1: My daughter is going to school soon. I want to pack some foods rich in vitamins in her schoolbag. Please change the uneaten prepared food in the picture into kiwi fruits.
Demo 2: Whose clothes are most likely to make him invisible in the wild? Switch him or her with a female.
Demo 3: I am a programmer. Among these computers, mine is running a program. I want to replace it with a new MacBook.
Demo 4: Winter is coming, and animals need to rest. There is a sleepy bird in this picture. Help me exchange it with a frog.
Demo 5: Which one is the tallest? Replace it with a toy pig.
Demo 6: Holiday is over. Someone needs to go to primary school. Let a dog replace that guy.

By integrating multimodal large language models, we endow generative image editing models with intricate reasoning capabilities. Our method interprets implicit, user-provided contextual knowledge to control the generative pixel-level editing process, ensuring results that align faithfully with the intended modifications. The underlined portions of the demo instructions indicate content that requires reasoning-based processing.

Abstract

While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, lacking deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries that require world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie, a reasoning-guided generative image editor, which synergizes the generative power of diffusion models with the advanced reasoning capabilities of multimodal large language models. R-Genie incorporates a reasoning-attention mechanism to bridge linguistic understanding with visual synthesis, enabling it to handle intricate editing requests involving abstract user intentions and contextual reasoning relations. Extensive experimental results validate that R-Genie can equip diffusion models with advanced reasoning-based editing capabilities, unlocking new potential for intelligent image synthesis.
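To make the triple format concrete, below is a minimal Python sketch of how a single image-instruction-edit record could be represented. The class and field names are illustrative assumptions, not the released dataset schema; the instruction is taken verbatim from one of the demos above.

```python
# Illustrative sketch of an image-instruction-edit triple.
# Field names are assumptions, not the official dataset format.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ReasoningEditTriple:
    source_image: Path   # original image to be edited
    instruction: str     # implicit, reasoning-heavy user request
    target_image: Path   # ground-truth edited result

example = ReasoningEditTriple(
    source_image=Path("images/0001_src.png"),
    instruction="Which one is the tallest? Replace it with a toy pig.",
    target_image=Path("images/0001_tgt.png"),
)
```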

Method


R-Genie employs an MLLM to process an introduced token alongside the textual and visual input tokens. The MLLM-generated token is then routed through a reasoning-attention bridge and a hierarchical reasoning module, which perform bidirectional reasoning by integrating visual features through cross-modal interactions. Finally, a discrete diffusion model reconstructs the target visual features in discrete space, ensuring that the reconstructed visual representation aligns with the expected, modified visual semantics.
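The snippet below is a minimal sketch of the cross-modal step described above, assuming the MLLM-generated token queries the visual features through standard multi-head cross-attention. Module names, dimensions, and the single-token interface are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ReasoningAttentionBridge(nn.Module):
    """Illustrative cross-attention bridge: the MLLM-generated reasoning token
    attends over image features so linguistic intent can steer visual synthesis.
    Layer choices and dimensions are assumptions for this sketch."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, reasoning_token: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # reasoning_token: (B, 1, D) embedding emitted by the MLLM
        # visual_feats:    (B, N, D) patch-level image features
        attended, _ = self.cross_attn(query=reasoning_token, key=visual_feats, value=visual_feats)
        fused = self.norm(reasoning_token + attended)
        # the fused token would then condition the discrete diffusion decoder
        return fused + self.mlp(fused)

bridge = ReasoningAttentionBridge()
token = torch.randn(2, 1, 768)     # MLLM reasoning token
feats = torch.randn(2, 196, 768)   # 14x14 grid of visual patches
print(bridge(token, feats).shape)  # torch.Size([2, 1, 768])
```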

Results


We conduct a systematic evaluation comparing R-Genie with state-of-the-art methods, including (1) task-specific instruction-based editing models (i.e., InstructPix2Pix[2], MagicBrush[55], MGIE[8], InstructDiffusion[13], and SmartEdit[17]) and (2) unified multimodal models (i.e., Show-o[50], Janus[45], VILA-U[48], OmniGen[49], and SEED-X[12]). All experiments are conducted on our proposed REditBench under identical conditions to ensure a fair comparison.
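For a rough sense of what a fixed-protocol comparison looks like in code, the snippet below scores one method's edited output against its ground-truth target with CLIP image-image similarity. The metric choice, model checkpoint, and function name are illustrative assumptions and are not the benchmark's official evaluation protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative scoring only: cosine similarity between CLIP image embeddings
# of a predicted edit and the ground-truth target. Checkpoint is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_similarity(pred_path: str, target_path: str) -> float:
    images = [Image.open(pred_path).convert("RGB"), Image.open(target_path).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```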


Qualitative comparisons with other instruction-based image editing methods. In contrast to these baselines, our method grounds the instruction in commonsense knowledge before performing spatially aware edits in the generative process, leading to more accurate and coherent results.

BibTeX

@article{zhang2025r,
  title={R-Genie: Reasoning-Guided Generative Image Editing},
  author={Zhang, Dong and He, Lingfeng and Yan, Rui and Shen, Fei and Tang, Jinhui},
  journal={arXiv preprint arXiv:2505.17768},
  year={2025}
}