TL;DR: Two Google papers this week. The first introduces a new way of doing mixture of experts in which the experts are applied dynamically at the token level: the model learns how much compute to spend on each token in a sequence. Second, Google released Lumiere, a text-to-video model that seems to be their response to OpenAI's Sora. Unlike OpenAI, Google actually explains some of the model architecture and training process. Definitely worth a skim.
FYI: My ‘popularity emoji’ is based on aggregate statistics of how many people have engaged with a paper on Twitter/X (as well as my own subjective personal interest).
Very popular (you really should know about this): 🔥
Popular (a good amount of people are discussing this): 😄
Less popular (but still worth making a mental note): 🙂
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Popularity: 😄
Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
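The core mechanism is simply per-token top-k routing around an otherwise standard transformer sub-block: a router scores every token, only the k highest-scoring tokens are processed, and the rest ride the residual stream unchanged. Here is a minimal PyTorch sketch of that idea (not the authors' code; the `MoDBlock` class, the capacity fraction, the stand-in MLP sub-block, and the sigmoid gating are my own illustrative choices):

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Sketch of a Mixture-of-Depths-style layer: a router scores every token,
    only the top-k tokens pass through the (expensive) sub-block, and all
    other tokens skip it via the residual stream."""

    def __init__(self, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1)  # per-token scalar routing score
        # Stand-in for the full self-attention + MLP sub-block.
        self.block = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.capacity = capacity  # fraction of tokens allowed through the block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        k = max(1, int(s * self.capacity))  # k is fixed a priori -> static graph

        scores = self.router(x).squeeze(-1)          # (batch, seq_len)
        top = torch.topk(scores, k, dim=-1)          # which tokens get compute
        idx = top.indices.unsqueeze(-1).expand(-1, -1, d)

        selected = torch.gather(x, 1, idx)           # (batch, k, d_model)
        # Gate by the router score so the routing decision receives gradient.
        update = torch.sigmoid(top.values).unsqueeze(-1) * self.block(selected)

        # Residual update: routed tokens get x + gated block output,
        # unselected tokens pass through unchanged.
        return x.scatter_add(1, idx, update)


# Usage: route ~12.5% of the tokens in a batch of 2 sequences of length 16.
x = torch.randn(2, 16, 64)
print(MoDBlock(d_model=64)(x).shape)  # torch.Size([2, 16, 64])
```

Because k is fixed ahead of time, the tensor shapes never change between steps, which is what lets this stay a static computation graph even though which tokens get processed varies with context.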
Lumiere: A Space-Time Diffusion Model for Video Generation
Popularity: 😄
We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
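The key architectural point is that the U-Net downsamples (and later upsamples) in time as well as in space, so deeper layers see a shorter, lower-resolution clip and the whole video is generated in a single pass rather than via keyframes plus temporal super-resolution. A rough sketch of what one space-time downsampling block could look like (my own illustration under assumed names and a factorized-convolution design, not the paper's implementation):

```python
import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Illustrative space-time downsampling block: halve the spatial
    resolution and halve the number of frames, using factorized
    spatial and temporal 3D convolutions."""

    def __init__(self, channels: int):
        super().__init__()
        # Spatial conv (stride 2 in H and W), applied frame by frame.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2), padding=(0, 1, 1))
        # Temporal conv (stride 2 in time), applied per spatial location.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  stride=(2, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.temporal(self.spatial(x))


# A 16-frame 64x64 feature map becomes an 8-frame 32x32 feature map.
x = torch.randn(1, 32, 16, 64, 64)
print(SpaceTimeDownBlock(32)(x).shape)  # torch.Size([1, 32, 8, 32, 32])
```

Stacking blocks like this (with matching temporal upsampling on the decoder side) is what the abstract means by processing the video "in multiple space-time scales"; the temporal axis gets compressed inside the network instead of being handled by a separate super-resolution stage.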