It’s not new and only superior in a very narrow set of categories.
Needs a (2023) tag. That said, the release of ARC2 and the image outputs from 4o definitely got me thinking about the JEPA family too.
I don't know if it's right (and I'm sure JEPA has its own performance issues), but it seems good to have a fully latent-space representation, ideally across all modalities, so that turning the concept "an apple a day keeps the doctor away" into image, audio, or text is a choice of decoder, rather than dedicated token ranges being chosen before the model's actual generation process even begins.
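To make that concrete, here's a minimal PyTorch sketch of the idea as I understand it: a single latent "concept" vector is produced first, and the output modality is chosen only at decode time. All module names and sizes are hypothetical illustrations, not taken from any JEPA paper.

```python
# Hypothetical sketch: one shared latent, modality picked by the decoder.
import torch
import torch.nn as nn

LATENT_DIM = 512  # toy size, not from any paper

class TextDecoder(nn.Module):
    def __init__(self, vocab_size=32000, seq_len=16):
        super().__init__()
        self.seq_len, self.vocab_size = seq_len, vocab_size
        self.proj = nn.Linear(LATENT_DIM, seq_len * vocab_size)

    def forward(self, z):
        # (batch, seq_len, vocab_size) token logits
        return self.proj(z).view(-1, self.seq_len, self.vocab_size)

class ImageDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(LATENT_DIM, 3 * 64 * 64)

    def forward(self, z):
        # (batch, 3, 64, 64) image tensor
        return self.proj(z).view(-1, 3, 64, 64)

# One latent "concept"; which modality it becomes is purely a decoder choice.
concept = torch.randn(1, LATENT_DIM)
text_logits = TextDecoder()(concept)  # render the concept as text
image = ImageDecoder()(concept)       # or render the same concept as an image
```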
GPTs are in the “exploit” phase of the “explore-exploit” trade-off.
JEPA is still in the explore phase; it's worth reading the paper and understanding the architecture to gain an alternative perspective.
Not new, not notable right now, not sure why it's getting upvoted (just kidding, it's because people see YLC and upvote based on names)
Even average papers can have a nice overview of the problem and good references.
I don't care about names; I just thought it was an interesting read.
JEPA is presumably superior to Transformers. Can any expert enlighten us on the implications of this paper?
Transformers are usually part of JEPA architectures. In I-JEPA's case, a ViT is used as the context encoder.
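For anyone curious how the pieces fit together, here's a heavily simplified sketch of an I-JEPA-style training step, showing where the ViT sits: a ViT-style context encoder embeds the visible (context) patches, an EMA target encoder embeds the full image, and a predictor tries to match the target encoder's representations at the masked positions, entirely in latent space with no pixel reconstruction. The sizes, masking, linear predictor, and momentum value are toy placeholders, not the paper's actual configuration.

```python
# Toy I-JEPA-style step: predict masked patch representations in latent space.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCHES, DIM = 64, 128  # 8x8 patch grid, toy embedding size

def vit_encoder():
    # Stand-in for a ViT: patch embeddings are assumed precomputed, so the
    # "encoder" here is just a small Transformer over patch tokens.
    layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

context_encoder = vit_encoder()                  # trained by backprop
target_encoder = copy.deepcopy(context_encoder)  # updated only by EMA
predictor = nn.Linear(DIM, DIM)                  # toy stand-in for the predictor

patch_tokens = torch.randn(1, PATCHES, DIM)      # pretend patch embeddings
mask = torch.zeros(PATCHES, dtype=torch.bool)
mask[40:56] = True                               # block of patches to predict

# The context encoder only sees the unmasked (context) patches.
context_repr = context_encoder(patch_tokens[:, ~mask])

# The target encoder sees everything; its outputs at masked positions are targets.
with torch.no_grad():
    targets = target_encoder(patch_tokens)[:, mask]

# Predict the masked representations from the pooled context, in latent space.
pred = predictor(context_repr.mean(dim=1, keepdim=True)).expand_as(targets)
loss = F.mse_loss(pred, targets)
loss.backward()

# EMA update of the target encoder (momentum value is illustrative).
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(0.996).add_(p_c, alpha=0.004)
```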
As a computer vision guy I'm sad JEPA didn't end up more effective. It makes perfect sense conceptually and would have transferred easily to video, but other self-supervised methods just seem to beat it!
Yeah! JEPA seems awesome. Do you mind sharing what other self-supervised methods work better than JEPA?