# CLIP-Driven Referring Image Segmentation
## Overview
Vision-Language Decoder
- Transfers fine-grained semantic information from the text to the pixel-level features (patches)
- Self-Attention: captures long-range dependencies among the pixel-level features
- Cross-Attention: propagates fine-grained textual features into the pixel-level features

Text-to-Pixel Contrastive Learning
- Explicitly enforces similarity between the text and the corresponding pixels
- Aligns the language features with the corresponding pixel-level features
## Architecture
![[Pasted image 20240106103205.png]]
### Feature Extraction
- Image Encoder: ResNet
    - Uses the 2nd-4th stage features
- Text Encoder: standard Transformer encoder
    - The global textual representation is taken from the [CLS] token
- Cross-Modal Neck: incorporates the textual embedding into the image features
    - Uses [[CoordConv]]; see the sketch after this list
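
A minimal PyTorch sketch of such a neck, assuming a simple multiply-then-convolve fusion (the module names, the fusion rule, and all dimensions are illustrative assumptions; only the CoordConv idea of appending normalized coordinate channels comes from the note):

```python
import torch
import torch.nn as nn


class CoordConv(nn.Module):
    """Convolution that first appends normalized x/y coordinate channels."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")  # (H, W) each
        coords = torch.stack([gx, gy]).unsqueeze(0).expand(b, -1, -1, -1)
        return self.conv(torch.cat([x, coords], dim=1))


class CrossModalNeck(nn.Module):
    """Fuse a global text embedding into pixel-level visual features
    (fusion by element-wise product is an assumption, not the paper's exact rule)."""

    def __init__(self, vis_dim, txt_dim, out_dim):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.fuse = CoordConv(vis_dim, out_dim)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C, H, W) image features; txt_feat: (B, D) sentence embedding
        t = self.txt_proj(txt_feat)[:, :, None, None]  # broadcast to (B, C, 1, 1)
        return self.fuse(vis_feat * t)                 # (B, out_dim, H, W)
```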
### Vision-Language Decoder
Takes the fused pixel-level features from the cross-modal neck together with the textual features as input. Self-attention captures long-range dependencies among the pixel-level features, while cross-attention transfers the fine-grained textual features into them; a sketch of one such layer follows.
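
A hedged sketch of one decoder layer under a standard transformer-decoder layout (pre-norm residuals, head count, and FFN width are assumptions; the self-attention-then-cross-attention ordering follows the overview above):

```python
import torch.nn as nn


class VisionLanguageDecoderLayer(nn.Module):
    """Self-attention over pixel tokens for long-range dependencies,
    then cross-attention pulling fine-grained textual features into them."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, pixel_tokens, word_tokens):
        # pixel_tokens: (B, H*W, C) flattened multi-modal features
        # word_tokens:  (B, L, C) per-word textual features
        x = pixel_tokens
        q = self.norm1(x)
        x = x + self.self_attn(q, q, q)[0]                          # long-range deps
        x = x + self.cross_attn(self.norm2(x), word_tokens, word_tokens)[0]  # text -> pixels
        x = x + self.ffn(self.norm3(x))
        return x
```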
### Text-to-Pixel Contrastive Learning
We get a transformed textual feature and per-pixel visual features from the decoder. The text-to-pixel contrastive loss pulls the textual feature toward the pixel features inside the referred region and pushes it away from those outside, explicitly aligning the language features with the corresponding pixels.
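
The note breaks off here, so the following is a reconstruction from the CRIS paper rather than from the note itself: with transformed text feature $z_t$, pixel feature $z_v^{i}$, sigmoid $\sigma$, and $\mathcal{P}$/$\mathcal{N}$ the pixel sets inside/outside the referred region, the loss is a per-pixel binary contrastive term:

$$
\mathcal{L}_{\text{con}}^{i}\left(z_t, z_v^{i}\right) =
\begin{cases}
-\log \sigma\left(z_t \cdot z_v^{i}\right), & i \in \mathcal{P},\\
-\log\left(1 - \sigma\left(z_t \cdot z_v^{i}\right)\right), & i \in \mathcal{N},
\end{cases}
\qquad
\mathcal{L}_{\text{con}} = \frac{1}{\lvert \mathcal{P} \cup \mathcal{N} \rvert} \sum_{i \in \mathcal{P} \cup \mathcal{N}} \mathcal{L}_{\text{con}}^{i}\left(z_t, z_v^{i}\right).
$$

At inference, thresholding $\sigma\left(z_t \cdot z_v^{i}\right)$ per pixel then yields the segmentation mask.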