EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis 文章

ArXiv CS.CV2026-05-26NEWSen作者: Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

摘要

arXiv:2604.08213v2 Announce Type: replace Abstract: High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on vision-language models to synthesize editing instructions automatically, but we find that strong VLMs still struggle to describe visual transformations between image pairs. In particular, they exhibit three recurring failure modes: orientation inconsistency, viewpoint ambiguity, and missing fine-grained attributes. In a human evaluation on 400 image pairs, several open-source VLM baselines produce critical-error rates above 47\%, making many synthesized instructions unsuitable for downstream training. To address this, we propose EditCaption, a two-stage post-training pipeline for image editing instruction synthesis.