Flattening Representation Autoencoders for Efficient Diffusion

Researchers have developed FlatDINO, a variational autoencoder that compresses the dense patch grids produced by encoders like DINOv2 into a one-dimensional sequence of 32 tokens, a 48x reduction in total dimensionality. On ImageNet 256x256, a DiT-XL trained on these latents reaches a gFID of 1.80 while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer per training step, pointing to substantial efficiency gains for diffusion-based image generation.
Researchers have introduced FlatDINO, a variational autoencoder that compresses the dense patch grids produced by self-supervised encoders such as DINOv2, whose long token sequences drive up the computational cost of diffusion training. FlatDINO flattens the representation into a one-dimensional sequence of 32 continuous tokens, an 8x reduction in sequence length and a 48x compression in total dimensionality.
Trained on FlatDINO latents, a DiT-XL model achieves a Fréchet Inception Distance on generated samples (gFID) of 1.80 with classifier-free guidance, while requiring 8x fewer floating point operations (FLOPs) per forward pass and up to 4.5x fewer FLOPs per training step than diffusion models operating on uncompressed DINOv2 features.
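As a back-of-the-envelope check, the stated 8x sequence-length and 48x total-dimensionality reductions jointly pin down the latent shape. The patch-grid size and feature width below are assumptions based on common DINOv2-B configurations, not figures from the article:

```python
# Sanity check that the reported compression ratios are mutually consistent.
# Assumptions (not stated in the article): the DINOv2 encoder produces a
# 16x16 grid of 256 patch tokens, each with a 768-dim feature (ViT-B width).
num_patches = 16 * 16        # assumed DINOv2 patch-grid size
patch_dim = 768              # assumed DINOv2-B feature width
latent_tokens = 32           # FlatDINO's 1-D latent length (from the article)

seq_reduction = num_patches / latent_tokens
# If total dimensionality shrinks 48x, each latent token must be 128-dim:
latent_dim = (num_patches * patch_dim) / (48 * latent_tokens)

print(seq_reduction)   # 8.0 -> matches the stated 8x sequence-length reduction
print(latent_dim)      # 128.0 -> per-token width implied by 48x compression
```

Under these assumptions the two headline ratios are consistent with each other: compressing 256 tokens of width 768 into 32 tokens of width 128 gives exactly 8x fewer tokens and 48x fewer total dimensions.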
📰 Original Source: https://arxiv.org/abs/2602.04873v1
All rights and credit belong to the original publisher.