---
language:
- en
license: apache-2.0
tags:
- image-generation
- latent-diffusion
- denoising-diffusion
- computer-vision
- generative-model
paperswithcode_id: akatsuki-neo/JLT
---

# JLT: Clean-Latent Prediction in Latent Diffusion Transformers

[![Paper](https://img.shields.io/badge/arXiv-2605.27102-b31b1b)](https://arxiv.org/abs/2605.27102)
[![Project Page](https://img.shields.io/badge/GitHub%20Pages-JLT-blue)](https://akatsuki-neo.github.io/JLT)
[![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/akatsuki-neo/JLT)


ImageNet 256×256 samples from JLT-B/1 using 50-step Heun sampling.
## Authors

[**Funing Fu**](https://github.com/chinoll) · [**Tenghui Wang**](https://github.com/spawner1145) · [**Guanyu Zhou**](https://the-martyr.github.io/) · Junyong Cen · Qichao Zhu

## Overview

JLT investigates whether predicting clean data works better than predicting velocity in latent space. With the same architecture, training settings, and FLUX.2 VAE representation, clean-latent prediction reaches **FID 2.50** on ImageNet 256×256, compared with **FID 6.56** for velocity prediction, a 62% reduction in FID.

This checkpoint is trained in the FLUX.2 VAE latent space with the clean-latent prediction target.

## Results

| Model | Target | FID-50K ↓ | IS ↑ |
|-------|--------|-----------|------|
| **JLT-B/1** | x (clean) | **2.50** | 232.51 |
| DiT-B/1 | v (velocity) | 6.56 | 132.12 |

## Method

Under the linear corruption path `z_t = t * x + (1-t) * epsilon`:

- **Clean prediction** (JLT): predict `x` directly, attenuating low-variance latent directions
- **Velocity prediction** (DiT): predict `v = x - epsilon`, adding an isotropic unit floor to all directions

Key insight: velocity prediction amplifies low-variance latent directions, while clean prediction attenuates them.

## Architecture

| Component | Specification |
|-----------|--------------|
| Transformer Blocks | 12 |
| Hidden Dimension | 768 |
| Attention Heads | 12 |
| Parameters | 130M |
| Tokenizer | FLUX.2 VAE (frozen) |

## Usage

### Download

```bash
huggingface-cli download dawn-neo/JLT checkpoint-last.pth
```

### Evaluation

```bash
# Requires pre-encoded ImageNet latents and torch-fidelity
python main_jit.py \
  --model JiT-B/1 --vae_type flux2 \
  --data_path /path/to/imagenet_latents_256 --use_latent_cache \
  --online_eval --eval_freq 1 --gen_bsz 128 --num_images 50000 \
  --cfg 2.9 --num_sampling_steps 50 \
  --resume checkpoint-last.pth --output_dir ./eval_output
```

For full training and inference code, see the [GitHub repository](https://github.com/akatsuki-neo/JLT).
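### Sketch: prediction targets on the linear path

The following is a minimal NumPy sketch of the corruption path and the two targets described in the Method section. It is not taken from the repository. The function names, the toy per-direction standard deviations, and the `min_denom` guard are hypothetical and chosen only for illustration. The conversion from a clean prediction to a velocity follows algebraically from the path definition; whether the released sampler uses exactly this form is an assumption.

```python
import numpy as np


def corrupt(x, eps, t):
    """Linear corruption path from the Method section: z_t = t * x + (1 - t) * eps."""
    return t * x + (1.0 - t) * eps


def velocity_target(x, eps):
    """Velocity target on that path: v = x - eps."""
    return x - eps


def velocity_from_clean(x_pred, z_t, t, min_denom=1e-5):
    """Recover v from a clean prediction using x - z_t = (1 - t) * (x - eps).

    min_denom is a hypothetical guard against division by zero as t -> 1.
    """
    return (x_pred - z_t) / max(1.0 - t, min_denom)


rng = np.random.default_rng(0)
latent_std = np.array([2.0, 0.1])                    # toy per-direction std (illustrative)
x = rng.standard_normal((200_000, 2)) * latent_std   # toy "clean latents"
eps = rng.standard_normal(x.shape)                   # unit Gaussian noise

t = 0.3
z_t = corrupt(x, eps, t)
v = velocity_target(x, eps)

clean_var = x.var(axis=0)      # close to latent_std**2
velocity_var = v.var(axis=0)   # close to latent_std**2 + 1 (the unit floor)
print("clean target variance:   ", clean_var)
print("velocity target variance:", velocity_var)
```

In this toy setup the low-variance direction has a clean-target variance of about 0.01 but a velocity-target variance of about 1.01, so the noise term dominates the velocity target in that direction. This is one way to read the "isotropic unit floor" remark in the Method section.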
## Citation

```bibtex
@article{fu2026jlt,
  title={{JLT}: {C}lean-{L}atent {P}rediction in {L}atent {D}iffusion {T}ransformers},
  author={Fu, Funing and Wang, Tenghui and Zhou, Guanyu and Cen, Junyong and Zhu, Qichao},
  journal={arXiv preprint arXiv:2605.27102},
  year={2026}
}
```

## Acknowledgements

- [JiT](https://github.com/LTH14/JiT) - Base architecture
- [FLUX.2 VAE](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) - Latent space
- [Li & He. "Back to Basics"](https://arxiv.org/abs/2511.13720) - Clean prediction insight