# CLIP-Driven Referring Image Segmentation
## Overview
- Vision-Language Decoder
    - Transfers fine-grained semantic information from the text to the pixel-level features (patches)
    - Self-attention: captures long-range dependencies
    - Cross-attention: propagates fine-structured textual features to the pixel-level features
- Text-to-Pixel Contrastive Learning
    - Explicitly enforces the text-to-pixel similarity
    - Aligns the language features with the corresponding pixel-level features
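To keep the notation used in the rest of the note concrete, here is a minimal shape walkthrough in PyTorch. All sizes are illustrative assumptions; only the relationship between the decoder resolution and the \(H/4 \times W/4\) grid used later follows from the note (4× up-sampling of \(F_c\)).

```python
import torch

# Illustrative sizes (batch B, text length L, channels C and C'); the 1/16
# visual stride is an assumption chosen so that 4x up-sampling of F_c lands
# on the H/4 x W/4 grid used in the contrastive-learning section.
B, L, C, C_txt = 2, 17, 512, 1024
H, W = 416, 416
N = (H // 16) * (W // 16)

F_v = torch.randn(B, N, C)     # pixel-level features (image encoder + cross-modal neck)
F_t = torch.randn(B, L, C)     # word-level text features
F_s = torch.randn(B, C_txt)    # global sentence feature taken from the [CLS] token

# The vision-language decoder maps (F_v, F_t) to multi-modal features F_c with
# the same shape as F_v; the contrastive head then scores every pixel of
# Up(F_c) against the projected sentence feature z_t.
print(F_v.shape, F_t.shape, F_s.shape)
```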
## Architecture
![[Pasted image 20240106103205.png]]
## Feature Extraction
- Image Encoder: ResNet
    - Uses the 2nd–4th stage features
- Text Encoder: standard Transformer encoder
    - \(F_s\): global sentence feature, taken from the [CLS] token
    - \(F_t\): word-level token features
- Cross-Modal Neck: incorporates the textual embedding into the image features
    - Uses [[CoordConv]] (see the sketch after this list)
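The neck step with CoordConv can be pictured as follows: a minimal PyTorch sketch, assuming a simple multiplicative fusion of the broadcast global text feature with the visual map, followed by a convolution over the visual channels concatenated with normalized x/y coordinate channels. The class and argument names are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class CoordConvFuse(nn.Module):
    """Illustrative cross-modal neck step: broadcast the global text feature
    over the spatial map, then apply a CoordConv, i.e. a convolution whose
    input is augmented with normalized x/y coordinate channels."""

    def __init__(self, vis_dim: int, txt_dim: int, out_dim: int):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        # +2 input channels for the normalized x/y coordinate maps
        self.conv = nn.Conv2d(vis_dim + 2, out_dim, kernel_size=3, padding=1)

    def forward(self, f_vis: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        # f_vis: (B, C, H, W) visual features, f_s: (B, C') global text feature
        b, _, h, w = f_vis.shape
        txt = self.txt_proj(f_s)[:, :, None, None]              # (B, C, 1, 1)
        fused = f_vis * txt                                      # multiplicative fusion (assumption)
        ys = torch.linspace(-1, 1, h, device=f_vis.device)
        xs = torch.linspace(-1, 1, w, device=f_vis.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([gx, gy]).expand(b, -1, -1, -1)     # (B, 2, H, W)
        return self.conv(torch.cat([fused, coords], dim=1))

# quick shape check
neck = CoordConvFuse(vis_dim=512, txt_dim=1024, out_dim=512)
print(neck(torch.randn(2, 512, 26, 26), torch.randn(2, 1024)).shape)  # (2, 512, 26, 26)
```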
## Vision-Language Decoder
Takes as input \(F_v \in R^{N \times C}\) and \(F_t \in R^{L \times C}\) and generates multi-modal features \(F_c \in R^{N \times C}\). To capture positional information, sinusoidal spatial positional embeddings are added.

- Multi-head self-attention: $$F'_v = \mathrm{MHSA}(\mathrm{LN}(F_v)) + F_v$$
- Multi-head cross-attention: $$F'_c = \mathrm{MHCA}(\mathrm{LN}(F'_v), F_t) + F'_v$$
- MLP block: $$F_c = \mathrm{MLP}(\mathrm{LN}(F'_c)) + F'_c$$
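As a concrete reference for the three update equations, here is a minimal sketch of a single decoder block in pre-LN residual form, assuming `nn.MultiheadAttention` for MHSA/MHCA; the head count, hidden sizes, and single-layer scope are my choices, and the paper stacks several such layers.

```python
import torch
import torch.nn as nn

class VisionLanguageDecoderLayer(nn.Module):
    """One decoder block following the three equations above (sketch only)."""

    def __init__(self, dim: int = 512, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, f_v: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        # f_v: (B, N, C) visual tokens (positional embeddings already added)
        # f_t: (B, L, C) word-level text features
        x = self.ln1(f_v)
        f_v = self.self_attn(x, x, x, need_weights=False)[0] + f_v       # F'_v = MHSA(LN(F_v)) + F_v
        x = self.ln2(f_v)
        f_c = self.cross_attn(x, f_t, f_t, need_weights=False)[0] + f_v  # F'_c = MHCA(LN(F'_v), F_t) + F'_v
        return self.mlp(self.ln3(f_c)) + f_c                             # F_c = MLP(LN(F'_c)) + F'_c

layer = VisionLanguageDecoderLayer()
print(layer(torch.randn(2, 676, 512), torch.randn(2, 17, 512)).shape)  # (2, 676, 512)
```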
## Contrastive Learning
We get \(F_c \in R^{N \times C}\) from the vision-language decoder and \(F_s \in R^{C'}\) from the text encoder. These are transformed by

$$\large z_v = F'_c W_v + b_v, \quad F'_c = \mathrm{Up}(F_c)$$

$$\large z_t = F_s W_t + b_t$$

- \(z_t \in R^{D}\), \(z_v \in R^{N \times D}\), where \(N = H/4 \times W/4\)
- \(\mathrm{Up}\) denotes 4× up-sampling
- The pair \((z_t, z_v)\) is then trained in a contrastive style: pixel features belonging to the referred region are pulled towards \(z_t\), while the remaining pixels are pushed away (see the sketch below)
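Below is a minimal sketch of the projection step and one plausible reading of the contrastive objective as per-pixel binary cross-entropy on the dot products between \(z_v\) and \(z_t\) (pixels inside the ground-truth mask as positives, the rest as negatives). The dimensions `D` and `C'` and the module name are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPixelContrast(nn.Module):
    """Project up-sampled pixel features and the global text feature into a
    shared space, then score every pixel against the sentence embedding."""

    def __init__(self, c: int = 512, c_txt: int = 1024, d: int = 256):
        super().__init__()
        self.w_v = nn.Linear(c, d)      # z_v = Up(F_c) W_v + b_v
        self.w_t = nn.Linear(c_txt, d)  # z_t = F_s W_t + b_t

    def forward(self, f_c: torch.Tensor, f_s: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # f_c: (B, N, C) decoder output with N = h * w, f_s: (B, C')
        b, n, c = f_c.shape
        f_c = f_c.transpose(1, 2).reshape(b, c, h, w)
        f_c = F.interpolate(f_c, scale_factor=4, mode="bilinear", align_corners=False)  # Up: 4x
        z_v = self.w_v(f_c.flatten(2).transpose(1, 2))    # (B, 16N, D)
        z_t = self.w_t(f_s)                                # (B, D)
        return torch.einsum("bnd,bd->bn", z_v, z_t)        # per-pixel logits

head = TextToPixelContrast()
logits = head(torch.randn(2, 26 * 26, 512), torch.randn(2, 1024), h=26, w=26)
target = torch.randint(0, 2, logits.shape).float()          # flattened binary GT mask
loss = F.binary_cross_entropy_with_logits(logits, target)   # pull mask pixels towards z_t
print(logits.shape, float(loss))
```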