[Paper Notes] JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture🔗

TL;DR: JEPA learns a a generalized semantic representation with less data pairs by predicting missing information in the embedding space, which helps it disregard unnecessary noisy from input(pixel)-level details and learns at a higher abstraction level with good semantic generalization.

1. Innovation & Significance

The Bottleneck:

Pixel level pre-training paired & data augmentation are strongly biased towards trained data distribution, hard to determine proper generalization and level of abstraction.