CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation 文章

ArXiv CS.CV2026-05-26NEWSen作者: Shayan Jalilian, Abdul Bais

摘要

arXiv:2605.24807v1 Announce Type: new Abstract: Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface.