Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Under Review

Xingyu Zhou1, Qifan Li1, Xiaobin Hu2, Hai Chen3,4, Shuhang Gu1,*
1University of Electronic Science and Technology of China   2National University of Singapore  
3Sun Yat-sen University   4North China Institute of Computer Systems Engineering  
*Corresponding Author

[Paper]  [Code]  [HuggingFace]

TL;DR:

🔥 New SOTA on 256 × 256 ImageNet generation. We present Internal Guidance (IG), a simple yet powerful guidance mechanism for Diffusion Transformers. LightningDiT-XL/1 [3] + IG sets a new state of the art with FID = 1.07 on ImageNet, and reaches FID = 1.24 even without classifier-free guidance. IG delivers dramatic quality gains with far fewer training epochs, adds negligible overhead, and works as a drop-in upgrade for modern diffusion transformers.


Method


(a) Internal Guidance (IG) is a lightweight guidance framework for Diffusion Transformers that leverages the model’s own internal dynamics to improve both generation quality and training efficiency. During training, IG introduces a simple auxiliary supervision at an intermediate layer, enabling the model to produce a weaker but semantically meaningful intermediate prediction alongside the final output. This additional supervision helps stabilize optimization and alleviates gradient vanishing in deep diffusion transformers.
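
Below is a minimal PyTorch sketch of this training scheme. All names (`DiTWithInternalGuidance`, `aux_layer`, `aux_head`, `aux_weight`) and the choice of read-out depth are our own illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class DiTWithInternalGuidance(nn.Module):
    """Toy stand-in for a diffusion transformer with an intermediate read-out.
    `aux_layer` and `aux_head` are hypothetical names, not from the paper."""
    def __init__(self, dim=256, depth=12, aux_layer=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(depth)]
        )
        self.aux_layer = aux_layer
        self.aux_head = nn.Linear(dim, dim)    # lightweight head for the intermediate prediction
        self.final_head = nn.Linear(dim, dim)

    def forward(self, x):
        inter_pred = None
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i + 1 == self.aux_layer:
                inter_pred = self.aux_head(x)  # weaker but semantically meaningful output
        return self.final_head(x), inter_pred


def ig_training_loss(model, x_noisy, target, aux_weight=0.5):
    # Denoising loss on the final output plus auxiliary supervision on the
    # intermediate prediction; `aux_weight` is an assumed hyperparameter.
    final_pred, inter_pred = model(x_noisy)
    loss_main = torch.mean((final_pred - target) ** 2)
    loss_aux = torch.mean((inter_pred - target) ** 2)
    return loss_main + aux_weight * loss_aux
```

Because the auxiliary head reads activations that are already computed on the way to the final layer, the extra supervision adds only one small linear projection to each training step.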
(b) At inference time, IG treats the intermediate output as an internal “weaker model” and guides generation by extrapolating from the intermediate prediction toward the final-layer prediction. This yields an autoguidance-like effect [1] without requiring degraded models, extra training, or additional sampling steps. As a result, IG enhances sample fidelity while preserving diversity.
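
A minimal sketch of the corresponding sampling rule, reusing the toy model above; the function name and the guidance weight `w_ig = 1.5` are illustrative assumptions:

```python
import torch

@torch.no_grad()
def ig_sample_step(model, x_t, w_ig=1.5):
    # Both predictions come from the same forward pass, so guidance adds
    # essentially no sampling overhead. A weight w_ig > 1 extrapolates away
    # from the weaker intermediate prediction, mirroring autoguidance [1]:
    #     guided = inter + w * (final - inter)
    # (w_ig = 1.5 is an illustrative value, not from the paper.)
    final_pred, inter_pred = model(x_t)
    return inter_pred + w_ig * (final_pred - inter_pred)
```

With `w_ig = 1` the expression collapses to the unguided final prediction; since both outputs come from a single forward pass, no degraded companion model or second network evaluation is needed.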
(c) IG is fully plug-and-play and can be seamlessly integrated into existing Diffusion Transformer architectures. Moreover, it is complementary to classifier-free guidance (CFG) and guidance intervals, enabling further performance gains. Extensive experiments demonstrate that IG consistently improves generation quality across model scales and backbones, achieving state-of-the-art results with minimal computational overhead.
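
As one illustration of this compatibility, the sketch below stacks IG on top of CFG and gates it with a guidance interval. The combination order, the gating rule on the noise level `t`, and all weights are assumptions for exposition, not the paper's exact recipe:

```python
import torch

@torch.no_grad()
def ig_cfg_step(model, x_t, cond, t, w_cfg=2.0, w_ig=1.5, interval=(0.2, 0.8)):
    # Assumes a conditional variant of the model whose forward pass takes
    # (x, cond) and returns (final, intermediate) predictions.
    final_c, inter_c = model(x_t, cond)   # conditional pass
    final_u, _ = model(x_t, None)         # unconditional pass

    # Classifier-free guidance on the final-layer predictions.
    pred = final_u + w_cfg * (final_c - final_u)

    # Add the IG correction only inside the guidance interval.
    if interval[0] <= t <= interval[1]:
        pred = pred + (w_ig - 1.0) * (final_c - inter_c)
    return pred
```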


Discussion


We conduct a 2D toy experiment, similar to the one in Autoguidance [1], to further analyze the role of IG.


Experimental Results


We demonstrate the effectiveness of the proposed IG.

FID-50K comparison under random class sampling.

FID-50K comparison under uniform class sampling.



References


[1] Karras, Tero, et al. "Guiding a diffusion model with a bad version of itself." NeurIPS 2024.
[2] Yu, Sihyun, et al. "Representation alignment for generation: Training diffusion transformers is easier than you think." ICLR 2025.
[3] Yao, Jingfeng, et al. "Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models." CVPR 2025.