Accurately modeling complex temporal environments is essential for self-driving. I will present recent Waymo research, which demonstrates several key modeling trends: 1) it is beneficial to recover scene structure and leverage such structure to aid tasks of interest; 2) a lot of the higher-level structure is intuitively about objects and their relationships, which is well modeled by graphical neural networks. I will demonstrate perception and behavior prediction work that builds on these ideas to get state-of-the-art performance in our domain.