Figure: input image (3×256×256) → latent tokens (4×32×32).
TL;DR: With proper design choices and scaling, diffusion autoencoders (trained with a single L2 loss) can outperform the GAN-LPIPS autoencoders (trained with hybrid losses) used in current state-of-the-art generative models.
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, the diffusion L2 loss, suffices for training scalable image tokenizers. Since diffusion is already widely used for image generation, this insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, and thus require a complex training recipe that depends on non-trivially balancing different losses and on pretrained supervised models. We identify design decisions, along with theoretical grounding, that enable us to scale DiTo to learn competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to current state-of-the-art image tokenizers, which are supervised. DiTo achieves competitive or better quality than the state of the art in image reconstruction and downstream image generation tasks.
Diffusion tokenizer (DiTo) is a diffusion autoencoder trained with an ELBO objective (e.g., Flow Matching). The input image \(\boldsymbol{x}\) is passed to the encoder \(E\) to obtain the latent representation, i.e., the `tokens' \(\boldsymbol{z}\); a decoder \(D\) then learns the distribution \(p(\boldsymbol{x}|\boldsymbol{z})\) with the diffusion objective. \(E\) and \(D\) are jointly trained from scratch. In contrast, prior work relies on a combination of losses, heuristics, and pretrained models to learn the latent \(\boldsymbol{z}\).
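To make the single-objective recipe concrete, below is a minimal sketch of one DiTo training step under a Flow Matching formulation with a linear interpolation path. The `encoder` and `decoder` call signatures (`decoder(x_t, t, z)`) are hypothetical stand-ins for the paper's modules; only the single L2 loss is taken from the paper.

```python
import torch
import torch.nn.functional as F

def dito_training_step(encoder, decoder, x):
    """x: a batch of images, shape (B, 3, 256, 256)."""
    z = encoder(x)                               # latent tokens, e.g. (B, 4, 32, 32)
    t = torch.rand(x.shape[0], device=x.device)  # diffusion time in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = (1.0 - t_) * x + t_ * eps              # linear path: t=0 is the image, t=1 is noise
    v_pred = decoder(x_t, t, z)                  # decoder conditioned on the tokens z
    return F.mse_loss(v_pred, eps - x)           # the single L2 (flow-matching) loss
```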
We propose an additional regularization on DiTo's latent representation \(\boldsymbol{z}\) that facilitates training a latent diffusion model on top of it for image generation. The idea is to synchronize the noising process on the latent \(\boldsymbol{z}\) with the noising process in pixel space on \(\boldsymbol{x}\). During DiTo training, after obtaining \(\boldsymbol{z}=E(\boldsymbol{x})\), we augment \(\boldsymbol{z}\) to \(\boldsymbol{z}_\tau=\alpha_\tau \boldsymbol{z} + \sigma_\tau \boldsymbol{\epsilon}\) with probability \(p=0.1\) for a random time \(\tau\in[0, 1]\), then use the diffusion decoder to compute the denoising loss with \(t\) sampled from \([\tau, 1]\). (See the paper for details.)
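A minimal sketch of this regularization, extending the training step above. Taking \(\alpha_\tau = 1-\tau\) and \(\sigma_\tau = \tau\) to match the linear flow-matching path is our assumption here; see the paper for the exact schedule.

```python
import torch
import torch.nn.functional as F

def dito_step_with_noise_sync(encoder, decoder, x, p=0.1):
    b = x.shape[0]
    z = encoder(x)
    # With probability p, noise the latent to a random level tau; otherwise tau = 0.
    mask = (torch.rand(b, device=x.device) < p).float()
    tau = mask * torch.rand(b, device=x.device)
    tau_ = tau.view(-1, 1, 1, 1)
    # Assumed schedule: alpha_tau * z + sigma_tau * eps with alpha = 1 - tau, sigma = tau.
    z = (1.0 - tau_) * z + tau_ * torch.randn_like(z)
    # Sample the decoder's time t in [tau, 1]: the pixels are never less noisy
    # than the latent they are decoded from.
    t = tau + (1.0 - tau) * torch.rand(b, device=x.device)
    t_ = t.view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = (1.0 - t_) * x + t_ * eps
    return F.mse_loss(decoder(x_t, t, z), eps - x)
```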
When increasing the number of trainable parameters in the diffusion decoder from DiTo-B (162.8M) to DiTo-L (338.5M) to DiTo-XL (620.9M) in the joint training, we observe that image reconstruction quality keeps improving in both structure and texture. Both visual quality and reconstruction faithfulness improve as the diffusion tokenizer is scaled up.
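For reference, reconstruction at inference time amounts to sampling from \(p(\boldsymbol{x}|\boldsymbol{z})\) with the diffusion decoder. A minimal Euler sampler under the same linear-path assumption might look as follows; the step count and the plain Euler scheme are illustrative choices, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def reconstruct(encoder, decoder, x, steps=50):
    """Sample from p(x|z) by integrating the velocity field from t=1 to t=0."""
    z = encoder(x)                      # tokens of the image to reconstruct
    x_t = torch.randn_like(x)           # start from pure noise at t = 1
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), 1.0 - i * dt, device=x.device)
        v = decoder(x_t, t, z)          # predicted velocity, approx. eps - x
        x_t = x_t - dt * v              # Euler step toward t = 0
    return x_t
```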
When scaled up, DiTo's visual quality improves significantly and outperforms GAN-LPIPS tokenizers in human preference.
Dataset: ImageNet 256×256 (face-blurred).
@misc{chen2025diffusionautoencodersscalableimage,
      title={Diffusion Autoencoders are Scalable Image Tokenizers},
      author={Yinbo Chen and Rohit Girdhar and Xiaolong Wang and Sai Saketh Rambhatla and Ishan Misra},
      year={2025},
      eprint={2501.18593},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.18593},
}