
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

¹Tsinghua University   ²Kuaishou Technology
* Indicates Equal Contribution
† Indicates Corresponding Author

TL;DR: We propose DiffMoE to efficiently scale Diffusion Transformers, matching the performance of a dense model with 3x the activated parameters while activating only 1x the parameters.


Token Accessibility and Dynamic Computation. (a) Token accessibility levels, from token isolation to cross-sample interaction. Colors represent tokens from different samples; t_i indicates the noise level. (b) Performance-accessibility analysis across architectures. (c) Computational dynamics during diffusion sampling, showing adaptive computation from noise to image. (d) Class-wise computation allocation from hard (technical diagrams) to easy (natural photos) tasks. Results from DiffMoE-L-E16-Flow (700K).

Abstract

Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications.

Method

DiffMoE Architecture Overview. DiffMoE flattens tokens into a batch-level global token pool, where each expert maintains a fixed training capacity. During inference, a dynamic capacity predictor adaptively routes tokens across different sampling steps and conditions. Different colors denote tokens from distinct samples, while t_i represents the corresponding noise levels.

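To make the token-pool routing concrete, below is a minimal PyTorch sketch, not the released implementation: it flattens a batch into one global pool and lets each expert take its top-capacity tokens from that pool during training (an expert-choice style selection, which is how we read the fixed-capacity design). The class name GlobalTokenPoolMoE, the capacity_factor argument, and the 4x-wide FFN experts are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn


class GlobalTokenPoolMoE(nn.Module):
    """Sketch of batch-level global token routing with fixed training capacity."""

    def __init__(self, dim, num_experts=16, capacity_factor=1.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq_len, dim). Flatten all samples into one global token
        # pool so every expert can see tokens from every sample and noise level.
        b, n, d = x.shape
        pool = x.reshape(b * n, d)

        # Router scores over the whole pool: (b*n, num_experts).
        scores = self.router(pool).softmax(dim=-1)

        # Fixed per-expert training capacity: on average each expert processes
        # (b*n / num_experts) * capacity_factor tokens.
        capacity = int(b * n * self.capacity_factor / self.num_experts)

        out = torch.zeros_like(pool)
        for e, expert in enumerate(self.experts):
            # Each expert selects its top-`capacity` tokens from the global pool.
            gate, idx = scores[:, e].topk(capacity)
            out[idx] += gate.unsqueeze(-1) * expert(pool[idx])

        return out.reshape(b, n, d)

At inference, the fixed capacity would be replaced by the dynamic capacity predictor, so the number of tokens each expert processes can vary with the sampling step and the difficulty of each sample.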

Experiments

ImageNet Generation. After 7000K training steps, DiffMoE-L-E8 achieves FID-50K scores of 2.30 (DDPM) and 2.13 (Flow) with cfg=1.5, surpassing Dense-DiT-XL (2.32/2.19), as shown below. All evaluations follow DiT's protocol.

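For context on the cfg=1.5 setting, the sketch below shows standard classifier-free guidance as used in DiT-style sampling; model, x_t, t, y, and null_y are placeholders for the reader's own denoiser and inputs, and cfg_predict is a name we made up for illustration.

def cfg_predict(model, x_t, t, y, null_y, cfg_scale=1.5):
    # Classifier-free guidance: push the conditional prediction away from the
    # unconditional one by a factor of cfg_scale.
    cond = model(x_t, t, y)          # class-conditional noise/velocity prediction
    uncond = model(x_t, t, null_y)   # prediction with the class label dropped
    return uncond + cfg_scale * (cond - uncond)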

Scaling Parameter Behavior. To explore the upper limits of DiffMoE and quantify its efficiency, we scaled the model to larger sizes and trained them for 3000K steps. DiffMoE-L-E16-Flow achieves the best performance among the evaluated models. Notably, DiffMoE-L-E16 surpasses the performance of Dense-DiT-XXXL-Flow, which uses 3x the activated parameters, while DiffMoE itself activates only 1x. This highlights the exceptional parameter efficiency and scalability of DiffMoE.

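As a back-of-the-envelope illustration of why total and activated parameters differ (the dimensions and expert count below are made-up placeholders, not the paper's configurations): an MoE FFN stores one FFN per expert but, at an average capacity of one expert per token, activates only a single FFN's worth of parameters per token.

def ffn_params(dim, mult=4):
    # Two linear layers (dim -> mult*dim -> dim), biases ignored for brevity.
    return 2 * dim * mult * dim

def moe_ffn_params(dim, num_experts=16, mult=4):
    total = num_experts * ffn_params(dim, mult)   # parameters stored
    activated = ffn_params(dim, mult)             # ~1 expert's worth used per token on average
    return total, activated

total, activated = moe_ffn_params(dim=1024, num_experts=16)
print(f"stored: {total / 1e6:.0f}M params, activated per token: {activated / 1e6:.0f}M params")

This gap between stored and activated parameters is what lets a DiffMoE model compete with dense models of roughly 3x its activated size at a fixed per-token compute budget.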

Visualization


Analysis


BibTeX


@misc{shi2025diffmoedynamictokenselection,
  title={DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers},
  author={Minglei Shi and Ziyang Yuan and Haotian Yang and Xintao Wang and Mingwu Zheng and Xin Tao and Wenliang Zhao and Wenzhao Zheng and Jie Zhou and Jiwen Lu and Pengfei Wan and Di Zhang and Kun Gai},
  year={2025},
  eprint={2503.14487},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.14487},
}