Drawing on the experience of designing and scaling Griffin (https://arxiv.org/abs/2402.19427) and RecurrentGemma, I will introduce some of the key practical concepts behind training large language models. Topics are likely to include: a brief introduction to Transformers, including why MLPs, not attention, usually dominate computation; a simple mental model of the computational bottlenecks on TPUs and GPUs; how to train models too large to fit in memory on a single device; scaling laws and hyper-parameter tuning; and a detailed discussion of LLM inference. If time permits, I will discuss how to design recurrent models that are competitive with Transformers, along with their advantages and drawbacks.
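To illustrate the claim that MLPs, not attention, usually dominate computation, here is a rough back-of-the-envelope FLOP count for one Transformer block. This sketch is not taken from the talk: the 4x MLP expansion factor, a single ungated MLP, and the chosen d_model and sequence length are assumptions for illustration only.

```python
# Rough forward-pass FLOPs per token for one Transformer block.

def mlp_flops_per_token(d_model: int, expansion: int = 4) -> int:
    """Two matmuls: d_model -> expansion*d_model -> d_model (2*m*n FLOPs each per token)."""
    return 2 * (2 * d_model * expansion * d_model)

def attention_flops_per_token(d_model: int, seq_len: int) -> int:
    """Q/K/V/O projections plus the QK^T scores and attention-weighted sum over V."""
    projections = 4 * (2 * d_model * d_model)    # four d_model x d_model matmuls
    score_and_mix = 2 * (2 * d_model * seq_len)  # QK^T and A@V, summed over all heads
    return projections + score_and_mix

if __name__ == "__main__":
    d_model, seq_len = 4096, 2048  # assumed example sizes
    mlp = mlp_flops_per_token(d_model)
    attn = attention_flops_per_token(d_model, seq_len)
    print(f"MLP: {mlp / 1e6:.1f} MFLOPs/token, attention: {attn / 1e6:.1f} MFLOPs/token")
    # With these counts, the MLP exceeds attention whenever seq_len < 2 * d_model.
```

Under these assumptions the MLP costs about 268 MFLOPs per token versus roughly 168 MFLOPs for attention, so the MLP dominates until the sequence length grows to a few times the model width.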