
CLIP-Driven Referring Image Segmentation

Overview

Vision-Language Decoder
  • Transfers fine-grained semantic information from textual features to pixel-level visual features (patches)
  • Self-attention: captures long-range dependencies across visual tokens
  • Cross-attention: propagates fine-grained textual features into the pixel-level features

Text-to-Pixel Contrastive Learning
  • Explicitly enforces the text feature to be similar to the related pixel-level features (and dissimilar to irrelevant ones)
  • Aligns the language feature with the corresponding pixel-level features

Architecture

![[Pasted image 20240106103205.png]]

Feature-Extraction

  • Image Encoder: ResNet
    • Uses the 2nd–4th stage features (multi-scale)
  • Text Encoder: Transformer text encoder (from CLIP)
    • Fs: global sentence-level feature, taken from the [CLS]/end-of-sequence token
  • Cross-Modal Neck: incorporates the textual embedding into the multi-scale image features to form Fv
  • Uses [[CoordConv]] (a 2-D coordinate map is appended before the final convolution); see the sketch below
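
A minimal PyTorch sketch of the [[CoordConv]] step in the cross-modal neck, as mentioned above: normalized x/y coordinate channels are concatenated to the fused visual feature map before a convolution. The channel widths and the 26×26 spatial size are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch only: CoordConv appends normalized (x, y) coordinate channels to its input.
import torch
import torch.nn as nn


class CoordConv(nn.Module):
    """Conv layer that concatenates a 2-D coordinate map to the input features."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + 2, out_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Normalized y and x coordinates in [-1, 1], broadcast to (B, 1, H, W)
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))


# Example: fused cross-modal feature map Fv (assumed shape B x C x H/16 x W/16)
fv = torch.randn(2, 512, 26, 26)
print(CoordConv(512, 512)(fv).shape)  # torch.Size([2, 512, 26, 26])
```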

Vision-Language-Decoder

Takes as input $F_v \in \mathbb{R}^{N \times C}$ and $F_t \in \mathbb{R}^{L \times C}$ and generates multi-modal features $F_c \in \mathbb{R}^{N \times C}$. To capture positional information, sine spatial positional embeddings are added. Each decoder layer applies:
  • Multi-head self-attention: $F_v' = \mathrm{MHSA}(\mathrm{LN}(F_v)) + F_v$
  • Multi-head cross-attention: $F_c' = \mathrm{MHCA}(\mathrm{LN}(F_v'), F_t) + F_v'$
  • MLP block: $F_c = \mathrm{MLP}(\mathrm{LN}(F_c')) + F_c'$
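
A minimal PyTorch sketch of one decoder layer following the equations above (pre-LayerNorm with residual connections): self-attention over the visual tokens, cross-attention from visual queries to textual keys/values, then an MLP. The width, head count, MLP ratio, and the omission of positional embeddings are simplifying assumptions.

```python
import torch
import torch.nn as nn


class VisionLanguageDecoderLayer(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, fv: torch.Tensor, ft: torch.Tensor) -> torch.Tensor:
        # Fv' = MHSA(LN(Fv)) + Fv
        x = self.ln1(fv)
        fv = self.self_attn(x, x, x, need_weights=False)[0] + fv
        # Fc' = MHCA(LN(Fv'), Ft) + Fv'  -- visual tokens query the word tokens
        fc = self.cross_attn(self.ln2(fv), ft, ft, need_weights=False)[0] + fv
        # Fc = MLP(LN(Fc')) + Fc'
        return self.mlp(self.ln3(fc)) + fc


fv = torch.randn(2, 26 * 26, 512)   # N visual tokens (N = H/16 x W/16, assumed)
ft = torch.randn(2, 17, 512)        # L word tokens
print(VisionLanguageDecoderLayer()(fv, ft).shape)  # torch.Size([2, 676, 512])
```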

Contrastive-Learning

We get $F_c \in \mathbb{R}^{N \times C}$ from the vision-language decoder and $F_s \in \mathbb{R}^{C}$ from the text feature extractor. These are transformed by
  • $z_t = F_s W_t + b_t$
  • $z_v = F_c' W_v + b_v$, with $F_c' = \mathrm{Up}(F_c)$
  • $z_t \in \mathbb{R}^{D}$, $z_v \in \mathbb{R}^{N \times D}$, $N = \frac{H}{4} \times \frac{W}{4}$
  • Up denotes $4\times$ up-sampling
  • Training is contrastive: $z_t$ is pulled towards the pixel features $z_v^i$ inside the referred region and pushed away from those outside
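
A minimal PyTorch sketch of this text-to-pixel contrastive step under the notation above: project $F_s$ and the up-sampled $F_c$ into a shared $D$-dimensional space, take dot products, and supervise with a sigmoid/BCE-style contrastive loss against the ground-truth mask. The dimensions, the random placeholder mask, and the exact loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, D = 512, 256
B, H16, W16 = 2, 26, 26                       # decoder output resolution (H/16, W/16)

proj_t = nn.Linear(C, D)                      # z_t = Fs W_t + b_t
proj_v = nn.Linear(C, D)                      # z_v = Up(Fc) W_v + b_v

fs = torch.randn(B, C)                        # sentence-level text feature Fs
fc = torch.randn(B, C, H16, W16)              # pixel-level multi-modal features Fc

fc_up = F.interpolate(fc, scale_factor=4, mode="bilinear", align_corners=False)
z_t = proj_t(fs)                                            # (B, D)
z_v = proj_v(fc_up.flatten(2).transpose(1, 2))              # (B, N, D), N = H/4 * W/4

logits = torch.einsum("bd,bnd->bn", z_t, z_v)               # text-to-pixel similarity
target = torch.randint(0, 2, logits.shape).float()          # placeholder ground-truth mask
loss = F.binary_cross_entropy_with_logits(logits, target)   # pull positives, push negatives
mask = torch.sigmoid(logits).view(B, 1, H16 * 4, W16 * 4)   # predicted segmentation
print(loss.item(), mask.shape)
```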