Data efficient image transformers & distillation through attention

  • Heavy image augmentation
  • Distillation token
  • Regularization


  • Competitive results on ImageNet with no external data and less compute
    • When pretrained on ImageNet, good results on downstream tasks.


  • [[Erasing]]
  • [[Stochastic Depth]]
  • [[Repeated Augmentation]]

The first two are crucial for convergence.

Not applied but cool:

  • [[Dropout]]
  • [[Exponentially Moving average]]
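
Stochastic depth, one of the applied regularizers above, can be illustrated with a minimal framework-free sketch. It assumes a single sample represented as a list of floats; the names `drop_path` and `residual_block` and the drop probability are illustrative choices, not taken from the paper's code.

```python
import random

def drop_path(branch_out, drop_prob, training, rng=random):
    """Randomly zero a whole residual branch for one sample (stochastic depth)."""
    if not training or drop_prob == 0.0:
        return list(branch_out)
    keep_prob = 1.0 - drop_prob
    if rng.random() < keep_prob:
        # Rescale kept branches so the expected output matches evaluation mode.
        return [v / keep_prob for v in branch_out]
    return [0.0] * len(branch_out)

def residual_block(x, branch, drop_prob, training, rng=random):
    """Compute x + drop_path(branch(x)); `branch` stands in for attention or MLP."""
    out = drop_path(branch(x), drop_prob, training, rng)
    return [xi + oi for xi, oi in zip(x, out)]
```

At evaluation time the branch is always kept, so the block reduces to an ordinary residual connection.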


There is a need for heavy augmentation:

  • Rand-Augment
  • Mixup
  • CutMix

Not applied but interesting:

  • AutoAugment
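
As an illustration of one of these augmentations, here is a minimal sketch of Mixup on a single pair of examples. Images are assumed to be flat lists of floats and labels one-hot lists; the function name `mixup` and the value `alpha=0.8` are assumptions made for the example.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.8, rng=random):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    # The same mixing weight is applied to the inputs and to the labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

The mixed label is a soft label, so the training loss has to accept a probability vector as its target rather than a class index.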


A different technique compared to the usual distillation:

  • Uses a teacher-student architecture
  • A convnet is a better teacher than a transformer, probably because the student inherits the convnet's inductive bias
Use of a separate distillation token:

  • It starts out different from the class token (it does different things)
  • At the last layer it has high similarity with the class token, but the two are still not identical
  • Less training is required for good results

How is it trained?

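  • For soft distillation, the loss is the KL divergence between the teacher's and the student's softmax outputs
  • For hard distillation, the loss is the same as the classification loss (cross-entropy), with the teacher's predicted label as the target; label smoothing can turn these hard labels into soft labels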
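
The two losses can be sketched for a single example as follows. This is a framework-free illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, the weight `lam`, and the equal 0.5 weighting in the hard variant are assumptions for the example. The true-label term is applied to the class-token head and the teacher term to the distillation-token head.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Standard classification loss for one example."""
    return -math.log(softmax(logits)[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def soft_distillation_loss(class_logits, dist_logits, teacher_logits, label,
                           tau=3.0, lam=0.1):
    """Cross-entropy on the true label plus KL(teacher || student) at temperature tau."""
    ce = cross_entropy(class_logits, label)
    kl = kl_divergence(softmax(teacher_logits, tau), softmax(dist_logits, tau))
    return (1.0 - lam) * ce + lam * tau ** 2 * kl

def hard_distillation_loss(class_logits, dist_logits, teacher_logits, label):
    """Cross-entropy on the true label plus cross-entropy on the teacher's argmax."""
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return (0.5 * cross_entropy(class_logits, label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))
```

When the student's distillation head matches the teacher exactly, the KL term vanishes and the soft loss reduces to the weighted cross-entropy on the true label.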

At inference

  • Both the class token and the distillation token are used: each has its own classifier, and their predictions are combined
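
One way to combine the two heads at inference is to average their softmax outputs and take the argmax. The sketch below assumes each head produces a list of logits for a single image; the name `predict` is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(class_logits, dist_logits):
    """Late fusion: average the two heads' probabilities, then pick the top class."""
    p_cls = softmax(class_logits)
    p_dist = softmax(dist_logits)
    fused = [(a + b) / 2.0 for a, b in zip(p_cls, p_dist)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused
```

Averaging probabilities rather than picking one head lets the two classifiers, which were trained against different targets, both contribute to the prediction.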