Emergence and grokking in "simple" architectures

Misha Belkin
University of California, San Diego (UCSD)

In recent years, transformers have become a dominant machine learning methodology.
A key element of transformer architectures is a standard fully connected neural network, the multilayer perceptron (MLP). I argue that MLPs alone already exhibit many of the remarkable behaviors observed in modern LLMs, including emergent phenomena. Furthermore, despite a large body of work, we are still far from understanding how 2-layer MLPs learn relatively simple problems, such as "grokking" modular arithmetic. I will discuss recent progress and argue that feature-learning kernel machines (Recursive Feature Machines) isolate some key computational aspects of modern neural architectures and are preferable to MLPs as a model for analyzing emergent phenomena.
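For readers unfamiliar with Recursive Feature Machines, the sketch below illustrates the basic loop: alternate fitting a kernel predictor with learning a feature (Mahalanobis) matrix from the average gradient outer product (AGOP) of that predictor. This is a minimal illustration under stated assumptions, not the speaker's code: the Laplace kernel, the bandwidth, the ridge parameter, the number of iterations, the trace rescaling of M, and the toy target are all illustrative choices. The intent is that the diagonal of the learned M concentrates on the coordinates the target actually depends on.

```python
# Minimal sketch of Recursive Feature Machine (RFM) iterations
# (illustrative hyperparameters; not the speaker's implementation).
import numpy as np


def mahalanobis_dists(X, Z, M):
    """Pairwise distances ||x - z||_M = sqrt((x - z)^T M (x - z))."""
    XM = X @ M
    sq = (XM * X).sum(1)[:, None] + ((Z @ M) * Z).sum(1)[None, :] - 2.0 * XM @ Z.T
    return np.sqrt(np.clip(sq, 0.0, None))


def laplace_kernel(X, Z, M, bandwidth):
    """Laplace kernel with a learned Mahalanobis metric M."""
    return np.exp(-mahalanobis_dists(X, Z, M) / bandwidth)


def rfm(X, y, n_iters=5, bandwidth=10.0, ridge=1e-3):
    """Alternate kernel ridge regression with AGOP updates of the metric M."""
    n, d = X.shape
    M = np.eye(d)                                  # start from the Euclidean metric
    for _ in range(n_iters):
        # 1) Fit a kernel machine with the current metric.
        K = laplace_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + ridge * np.eye(n), y)

        # 2) Gradients of the fitted predictor f(x) = sum_j alpha_j K(x, x_j):
        #    grad f(x_i) = -(1/L) sum_j alpha_j K_ij M (x_i - x_j) / ||x_i - x_j||_M
        D = mahalanobis_dists(X, X, M)
        W = alpha[None, :] * K / (bandwidth * np.maximum(D, 1e-12))
        np.fill_diagonal(W, 0.0)                   # drop the singular i == j term
        diffs = X[:, None, :] - X[None, :, :]      # shape (n, n, d)
        grads = -np.einsum('ij,ijk->ik', W, diffs) @ M

        # 3) AGOP update: average outer product of the predictor's gradients,
        #    rescaled so the metric stays comparable to the identity.
        M = grads.T @ grads / n
        M *= d / np.trace(M)
    return M


# Toy check: coordinates the target depends on should dominate the learned M.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X[:, 0] * X[:, 1]                              # depends only on x0 and x1
M = rfm(X, y)
print(np.round(np.diag(M) / np.diag(M).max(), 2))
```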


