Obsidian Notes

Hello there, we have some notes here. Get started with ViT.

Topics

Learning Resources for Research Interns

Architectures

CNNs

RNNs (mostly LSTMs)

Why have RNNs been almost entirely replaced by transformers?

In which cases do RNNs still work better?

Transformers

Encoder-only variants (the ones we mainly use)

Some recent advancements in decoder-only models or encoder-decoder models

ViT and its variants such as Swin (still transformers, but carried over from NLP to CV); a minimal patch-embedding sketch follows
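
To make the ViT idea concrete, here is a minimal sketch of its front end in PyTorch: the image is cut into fixed-size patches, each patch is linearly embedded, and a class token plus position embeddings are added before the transformer encoder. The hyperparameters (224×224 input, 16×16 patches, 768-dim embeddings) are the ViT-Base defaults, used here only for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend [CLS] -> (B, 197, 768)
        return x + self.pos_embed            # add learned position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]), ready for a transformer encoder
```

From here the tokens go through a standard transformer encoder; Swin differs mainly by computing attention within shifted local windows and merging patches between stages.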

Learning Paradigms

Supervised

Semi-supervised

Self-supervised

Unsupervised

Differences and use cases for each

We mainly work on supervised learning, but some upcoming work requires knowledge of semi-supervised and self-supervised learning; a pseudo-labelling sketch follows.
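
Since semi-supervised learning comes up in the work ahead, here is a minimal sketch of pseudo-labelling, one common semi-supervised recipe: a model trained on the labelled set assigns hard labels to unlabelled examples it is confident about, and those pairs are added back into training. The `model`, `unlabeled_loader`, and the 0.9 threshold are assumed placeholders, not anything prescribed by these notes.

```python
import torch

def pseudo_label(model, unlabeled_loader, threshold=0.9):
    """Collect (input, hard label) pairs for confident unlabelled examples."""
    model.eval()
    kept_x, kept_y = [], []
    with torch.no_grad():
        for x in unlabeled_loader:           # batches of unlabelled inputs
            probs = model(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            keep = conf >= threshold         # trust only confident predictions
            kept_x.append(x[keep])
            kept_y.append(pred[keep])
    return torch.cat(kept_x), torch.cat(kept_y)
```

Self-supervised methods instead manufacture labels from the data itself, e.g. the masked-token objective sketched in the NLP section below.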

Domains

NLP

  • Language Modelling: next-word prediction vs masked-token prediction (see the sketch after this list)
  • NER
  • Subword tokenization challenges: how do we combine subword-level predictions into word-level labels? (see the aggregation sketch after this list)
  • POS Tagging
  • Sentiment Analysis (esp. targeted aspect-based sentiment analysis, TABSA; some effort is needed here)
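
The next-word vs masked-token distinction above comes down to where the supervision signal comes from. A hedged sketch, with illustrative token ids and the 15% mask rate from BERT: causal (next-word) models hide the future with an attention mask, while masked-token models corrupt random input positions and predict the originals there.

```python
import torch

seq_len = 8

# Causal LM (next-word prediction): position i may attend only to <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Masked LM (BERT-style): corrupt ~15% of input tokens with [MASK]; the
# loss is computed only at the corrupted positions.
MASK_ID = 103                                 # illustrative [MASK] token id
tokens = torch.randint(1000, (seq_len,))      # fake input ids
masked_positions = torch.rand(seq_len) < 0.15
corrupted = tokens.clone()
corrupted[masked_positions] = MASK_ID

print(causal_mask)
print(tokens, corrupted, masked_positions)
```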
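
For the subword question above, one common answer for word-level tasks like NER is to keep only the prediction of each word's first subword and drop continuation pieces; averaging the subword logits is another option. The tokens, tags, and the "##" continuation marker below are illustrative (WordPiece-style), not taken from the notes.

```python
# First-subword aggregation for word-level labels.
subwords = ["Kath", "##mandu", "is", "in", "Nepal"]
pred_tags = ["B-LOC", "I-LOC", "O", "O", "B-LOC"]

word_tags = []
for piece, tag in zip(subwords, pred_tags):
    if piece.startswith("##"):       # continuation piece: drop its prediction
        continue
    word_tags.append(tag)

print(word_tags)  # ['B-LOC', 'O', 'O', 'B-LOC'] for [Kathmandu, is, in, Nepal]
```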

Computer Vision (esp. Medical Imaging)

  • Segmentation with ViT and Swin
  • Object Detection
  • Use pretrained models for each of the above tasks (see the loading sketch after this list)
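
A quick sketch of what "use pretrained models" can look like in practice, assuming the timm and torchvision libraries are available; the model names are examples, not a fixed recommendation.

```python
import timm
import torchvision

# Swin backbone pretrained on ImageNet (classification / feature extraction).
swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True)

# A pretrained semantic-segmentation model.
seg = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")

# A pretrained detector for the object-detection bullet.
det = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
```

For medical imaging these usually serve as starting points for fine-tuning rather than off-the-shelf predictors.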

Multimodal Learning (Vision and Language; VLMs, VLSMs)

  • CLIP and its variants (OpenCLIP, MetaCLIP) as VLMs (a zero-shot sketch closes this section)
  • CRIS

VLSMs

CLIPSeg

ZegCLIP

Include others
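
To close the section, a minimal zero-shot classification sketch with OpenCLIP, following its documented usage; the image path and text prompts are placeholders.

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("scan.png")).unsqueeze(0)           # placeholder image
texts = tokenizer(["a CT scan", "an MRI scan", "a chest X-ray"])  # placeholder prompts

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)   # unit-normalise features
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(probs)  # the image's affinity to each prompt, with no task-specific training
```

VLSMs such as CLIPSeg and ZegCLIP extend this image-text alignment from whole images to dense, per-pixel predictions.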