Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

ArXiv CS.CV | 2026-06-02 | NEWS | en | Authors: Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

Details

Source: ArXiv CS.CV
Authors: Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick
Article type: NEWS
Language: en
Published: 2026-06-02

Abstract

arXiv:2606.00477v1 (announce type: cross)

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%.
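To make the reported gap concrete, the arithmetic behind the two headline numbers can be sketched as below. This is an illustrative sketch only, not the authors' evaluation code: the function names `accuracy` and `modality_gap` and the toy outcome lists are hypothetical, and the only figures taken from the abstract are the approximate 92% text-side efficacy and the 18.5% VQA accuracy.

```python
def accuracy(outcomes):
    """Return the share of passed checks as a percentage.

    `outcomes` is a list of booleans, one per edit subject: True when a
    check (a text probe or a VQA question about a generated image)
    reflects the edited fact. This scoring rule is an assumption made
    for illustration, not the benchmark's documented protocol.
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)


def modality_gap(text_pct, image_pct):
    """Difference between text-side and image-side scores, in percentage points."""
    return text_pct - image_pct


# Toy, made-up outcomes used only to exercise the helper.
toy_text_outcomes = [True, True, True, False]
toy_image_outcomes = [True, False, False, False]
toy_gap = modality_gap(accuracy(toy_text_outcomes), accuracy(toy_image_outcomes))

# Figures quoted in the abstract (the text-side value is approximate).
reported_gap = modality_gap(92.0, 18.5)
print(f"reported gap: about {reported_gap:.1f} percentage points")
```

Under these figures, the text-side score exceeds the best image-side score by roughly 73.5 percentage points.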
