# Distillation
Generally:

- A large teacher model and a smaller student model.
- The student is trained so that its output logits match those of the teacher, as sketched below.
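A minimal sketch of this setup in PyTorch (the `nn.Linear` models, dimensions, and batch are placeholders standing in for a real teacher/student pair):

```python
import torch
import torch.nn as nn

# Placeholder models: any pair of networks with matching output dims works.
teacher = nn.Linear(784, 10)   # stands in for a large pretrained network
student = nn.Linear(784, 10)   # smaller network that we actually train

teacher.eval()                 # the teacher is frozen; only the student learns
x = torch.randn(32, 784)       # a dummy batch
with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)    # trained to match teacher_logits
```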
## Issue with softmax
From the original paper:

- Softmax pushes the probabilities of non-target labels towards very small values.
- Softmax therefore tends to hide the similarity between the non-target classes.
- Example: a 2 can be more similar to a 7 than to an 8, yet both end up with near-zero probability.

Insight: if we scale the logits down before passing them into the softmax function, the outputs retain this relative similarity (see the numeric illustration below).
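To make this concrete, here is a small numeric illustration with made-up logits: for an image of a 2, suppose the logits for classes 2, 7, and 8 are 10, 5, and 1.

```python
import torch
import torch.nn.functional as F

# Made-up logits for an image of a "2", ordered [class 2, class 7, class 8]
logits = torch.tensor([10.0, 5.0, 1.0])
print(F.softmax(logits, dim=0))   # ≈ [0.9932, 0.0067, 0.0001]
# The teacher "knows" a 7 is ~55x more likely than an 8, but both
# probabilities are so close to zero that the student barely sees this.
```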
## Temperature
Used to smooth the output of the softmax. For example, with logits 5 and 1:

$$\frac{\exp(5/T)}{\exp(5/T) + \exp(1/T)}$$

More generally, where \(T\) is a hyperparameter:

$$\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$

We choose a \(T\) that smooths away the noise in the logits while preserving the meaningful relative structure between classes.
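Continuing the example above, a quick sketch of how temperature exposes the hidden similarity (T = 4 is an arbitrary illustrative choice):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([10.0, 5.0, 1.0])   # same made-up logits as above

print(F.softmax(logits / 1.0, dim=0))     # T = 1: ≈ [0.9932, 0.0067, 0.0001]
print(F.softmax(logits / 4.0, dim=0))     # T = 4: ≈ [0.7184, 0.2058, 0.0757]
# At T = 4 the "7 is more likely than 8" preference becomes clearly
# visible to the student instead of being crushed towards zero.
```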
## Losses
Use cross-entropy loss or L2 loss between the teacher and student probabilities; a sketch follows.
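One common form (Hinton et al., 2015) combines a soft cross-entropy term against the teacher's temperature-softened probabilities with a hard cross-entropy term against the true labels. The default `T` and `alpha` below are illustrative, not prescribed values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft cross-entropy to the teacher plus hard-label cross-entropy.
    labels are integer class indices; T and alpha are hyperparameters."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Cross-entropy between the softened distributions; the T**2 factor
    # keeps its gradients on the same scale as the hard loss.
    soft_loss = -(soft_targets * log_probs).sum(dim=-1).mean() * (T ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```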
## Intermediate match (Weights/Features)
- An FC (fully connected) layer is used to align the student's feature dimensions with the teacher's (see the sketch below).
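A FitNets-style sketch of this idea: project the student's intermediate features through an FC layer, then apply an L2 loss against the teacher's features. The feature sizes here are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed feature sizes: student 256-d, teacher 512-d (both made up).
proj = nn.Linear(256, 512)             # FC layer aligning student -> teacher dims

student_feat = torch.randn(32, 256)    # from some intermediate student layer
teacher_feat = torch.randn(32, 512)    # from the corresponding teacher layer
hint_loss = F.mse_loss(proj(student_feat), teacher_feat)  # L2 feature match
```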
## Ways of distillation
### Function Matching
The student is trained to match the teacher's outputs on the same shared inputs, which can even be out-of-distribution data for both the teacher and the student.
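A minimal sketch of one function-matching step, assuming `student` and `teacher` are callable modules and `x` is a batch that need not come from the original training distribution:

```python
import torch
import torch.nn.functional as F

def function_matching_step(student, teacher, x, T=1.0):
    """Match the student to the teacher on the same input batch x.
    x may be out-of-distribution; no ground-truth labels are needed."""
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    # KL divergence between temperature-softened output distributions
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
```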