JLT: Clean-Latent Prediction in Latent Diffusion Transformers

ArXiv CS.CV | 2026-05-27 | NEWS | en | Authors: Funing Fu, Tenghui Wang, Junyong Cen, Qichao Zhu, Guanyu Zhou

Abstract

arXiv:2605.27102v1. Announce type: new.

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.
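The linear convertibility of x, epsilon, and v mentioned in the abstract can be illustrated with a small numerical sketch. The abstract does not state the corruption path, so the code below assumes the common linear flow-matching interpolation z_t = t * x + (1 - t) * epsilon with velocity v = x - epsilon; that convention, the function names, and the array shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Assumed convention (not stated in the abstract):
#   z_t = t * x + (1 - t) * eps,   v = x - eps,   0 <= t < 1


def clean_and_noise_from_velocity(z_t, v, t):
    """Recover (x, eps) from the noised latent z_t and velocity v at time t."""
    x = z_t + (1.0 - t) * v
    eps = z_t - t * v
    return x, eps


def velocity_from_clean(z_t, x, t):
    """Convert a clean-latent value into a velocity; requires t < 1."""
    return (x - z_t) / (1.0 - t)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # stand-in for clean latents
eps = rng.standard_normal((4, 8))  # Gaussian noise
t = 0.3                            # fixed corruption time

z_t = t * x + (1.0 - t) * eps
v = x - eps

x_rec, eps_rec = clean_and_noise_from_velocity(z_t, v, t)
print(np.allclose(x_rec, x), np.allclose(eps_rec, eps))
print(np.allclose(velocity_from_clean(z_t, x, t), v))
```

Under this assumed convention the three quantities carry the same information at a fixed t, which is why any difference between clean prediction and velocity prediction has to come from how the regression target is weighted during training rather than from what can be represented.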