Emergence and grokking in "simple" architectures

Misha Belkin
University of California, San Diego (UCSD)

In recent years, transformers have become a dominant machine learning methodology.
A key element of transformer architectures is a standard fully connected neural network, the multilayer perceptron (MLP). I argue that MLPs alone already exhibit many of the remarkable behaviors observed in modern LLMs, including emergent phenomena. Furthermore, despite a large body of work, we are still far from understanding how 2-layer MLPs learn relatively simple problems, such as "grokking" modular arithmetic. I will discuss recent progress and argue that feature-learning kernel machines (Recursive Feature Machines) isolate some key computational aspects of modern neural architectures and are preferable to MLPs as a model for analyzing emergent phenomena.
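For readers unfamiliar with Recursive Feature Machines, the sketch below illustrates the basic loop: alternate fitting a kernel predictor with learning a feature (Mahalanobis) matrix from the average gradient outer product (AGOP) of that predictor. This is a minimal illustration under stated assumptions, not the speaker's code: the Laplace kernel, the bandwidth, the ridge parameter, the number of iterations, the trace rescaling of M, and the toy target are all illustrative choices. The intent is that the diagonal of the learned M concentrates on the coordinates the target actually depends on.

```python
# Minimal sketch of Recursive Feature Machine (RFM) iterations
# (illustrative hyperparameters; not the speaker's implementation).
import numpy as np


def mahalanobis_dists(X, Z, M):
    """Pairwise distances ||x - z||_M = sqrt((x - z)^T M (x - z))."""
    XM = X @ M
    sq = (XM * X).sum(1)[:, None] + ((Z @ M) * Z).sum(1)[None, :] - 2.0 * XM @ Z.T
    return np.sqrt(np.clip(sq, 0.0, None))


def laplace_kernel(X, Z, M, bandwidth):
    """Laplace kernel with a learned Mahalanobis metric M."""
    return np.exp(-mahalanobis_dists(X, Z, M) / bandwidth)


def rfm(X, y, n_iters=5, bandwidth=10.0, ridge=1e-3):
    """Alternate kernel ridge regression with AGOP updates of the metric M."""
    n, d = X.shape
    M = np.eye(d)                                  # start from the Euclidean metric
    for _ in range(n_iters):
        # 1) Fit a kernel machine with the current metric.
        K = laplace_kernel(X, X, M, bandwidth)
        alpha = np.linalg.solve(K + ridge * np.eye(n), y)

        # 2) Gradients of the fitted predictor f(x) = sum_j alpha_j K(x, x_j):
        #    grad f(x_i) = -(1/L) sum_j alpha_j K_ij M (x_i - x_j) / ||x_i - x_j||_M
        D = mahalanobis_dists(X, X, M)
        W = alpha[None, :] * K / (bandwidth * np.maximum(D, 1e-12))
        np.fill_diagonal(W, 0.0)                   # drop the singular i == j term
        diffs = X[:, None, :] - X[None, :, :]      # shape (n, n, d)
        grads = -np.einsum('ij,ijk->ik', W, diffs) @ M

        # 3) AGOP update: average outer product of the predictor's gradients,
        #    rescaled so the metric stays comparable to the identity.
        M = grads.T @ grads / n
        M *= d / np.trace(M)
    return M


# Toy check: coordinates the target depends on should dominate the learned M.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X[:, 0] * X[:, 1]                              # depends only on x0 and x1
M = rfm(X, y)
print(np.round(np.diag(M) / np.diag(M).max(), 2))
```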


