
CLIP-Driven Referring Image Segmentation

Overview

  • Vision-Language Decoder: transfers fine-grained semantic information from the text to the pixel-level (patch) features
    • Self-attention: captures long-range dependencies
    • Cross-attention: propagates fine-grained textual features to the pixel-level features
  • Text-to-Pixel Contrastive Learning: explicitly enforces similarity between the text and the corresponding pixels
    • Aligns the language feature with the corresponding pixel-level features

Architecture

![[Pasted image 20240106103205.png]]

Feature-Extraction

  • Image Encoder: ResNet
    • Use of 2nd-4th stage features
  • Text Encoder: standard Transformer encoder
    • \(F_t\): per-word features, \(F_s\): global textual feature from the [CLS] token
  • Cross-Modal Neck: incorporates the textual embedding into the image features
  • Use of [[CoordConv]] (see the sketch below)
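
As a rough illustration of the CoordConv idea referenced above, here is a minimal PyTorch sketch: normalized x/y coordinate channels are concatenated to the fused feature map before a standard convolution, so the network can reason about absolute position. The class name, kernel size and channel arguments are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class CoordConv(nn.Module):
    """Minimal CoordConv sketch: append normalized coordinate channels
    to the input feature map before a standard convolution."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # +2 input channels for the x and y coordinate maps
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, padding=padding)

    def forward(self, x):
        B, _, H, W = x.shape
        # Coordinate grids normalized to [-1, 1]
        ys = torch.linspace(-1, 1, H, device=x.device).view(1, 1, H, 1).expand(B, 1, H, W)
        xs = torch.linspace(-1, 1, W, device=x.device).view(1, 1, 1, W).expand(B, 1, H, W)
        return self.conv(torch.cat([x, xs, ys], dim=1))
```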

Vision-Language-Decoder

Takes as input \(F_v \in R^{N \times C}\) and \(F_t \in R^{L \times C}\) and generates multi-modal features \(F_c \in R^{N \times C}\). To capture positional information, sinusoidal spatial positional embeddings are added.

  • Multi-head self-attention: $$F'_v = \mathrm{MHSA}(\mathrm{LN}(F_v)) + F_v$$
  • Multi-head cross-attention: $$F'_c = \mathrm{MHCA}(\mathrm{LN}(F'_v), F_t) + F'_v$$
  • MLP block: $$F_c = \mathrm{MLP}(\mathrm{LN}(F'_c)) + F'_c$$
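
A minimal PyTorch sketch of one decoder layer following the three equations above (pre-LN residual self-attention over the pixel features, cross-attention with the word features as keys/values, then an MLP). The class name, head count, hidden sizes and the omission of the positional embeddings are assumptions for illustration.

```python
import torch.nn as nn

class VisionLanguageDecoderLayer(nn.Module):
    """One decoder layer: MHSA over pixel features, MHCA from pixel
    queries to word features, then an MLP, each with a pre-LN residual."""

    def __init__(self, dim=512, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, F_v, F_t):
        # F_v: (B, N, C) pixel-level features, F_t: (B, L, C) word features
        x = self.norm1(F_v)
        F_v = self.self_attn(x, x, x)[0] + F_v               # F'_v = MHSA(LN(F_v)) + F_v
        q = self.norm2(F_v)
        F_c = self.cross_attn(q, F_t, F_t)[0] + F_v          # F'_c = MHCA(LN(F'_v), F_t) + F'_v
        F_c = self.mlp(self.norm3(F_c)) + F_c                # F_c = MLP(LN(F'_c)) + F'_c
        return F_c
```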

Contrastive-Learning

We get \(F_c \in R^{N \times C}\) from the vision-language decoder and \(F_s \in R^{C'}\) from the text encoder. These are transformed by $$\large z_v = F'_c W_v + b_v, \quad F'_c = Up(F_c)$$ $$\large z_t = F_s W_t + b_t$$
  • \(z_t \in R^D\), \(z_v \in R^{N \times D}\), \(N = H/4 \times W/4\)
  • \(Up\) denotes 4× up-sampling
  • Training is then done in a contrastive style, pulling \(z_t\) towards the pixel features \(z_v\) inside the target region and pushing it away from the rest (see the sketch below)
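
A minimal sketch of the projection and the text-to-pixel contrastive objective, assuming the 4× up-sampling is bilinear and that the loss is a per-pixel binary cross-entropy between \(\sigma(z_t \cdot z_v^i)\) and the ground-truth mask; the class name, dimensions and layer choices are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPixelContrast(nn.Module):
    """Project pixel and text features into a shared space and align the
    text feature with the foreground pixels via a per-pixel BCE loss."""

    def __init__(self, c_visual=512, c_text=512, d=256):
        super().__init__()
        self.proj_v = nn.Linear(c_visual, d)  # W_v, b_v
        self.proj_t = nn.Linear(c_text, d)    # W_t, b_t

    def forward(self, F_c, F_s, mask):
        # F_c: (B, C, H/16, W/16) multi-modal features from the decoder
        # F_s: (B, C') global text feature, mask: (B, H/4, W/4) binary GT mask
        F_c = F.interpolate(F_c, scale_factor=4, mode="bilinear", align_corners=False)  # Up(F_c)
        z_v = self.proj_v(F_c.flatten(2).transpose(1, 2))    # (B, N, D), N = H/4 * W/4
        z_t = self.proj_t(F_s)                               # (B, D)
        logits = torch.einsum("bnd,bd->bn", z_v, z_t)        # per-pixel similarity z_t . z_v^i
        target = mask.flatten(1).float()
        # Sigmoid + BCE pulls the text feature towards foreground pixels
        # and pushes it away from background pixels.
        return F.binary_cross_entropy_with_logits(logits, target)
```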