---
language:
- en
license: apache-2.0
tags:
- image-generation
- latent-diffusion
- denoising-diffusion
- computer-vision
- generative-model
paperswithcode_id: akatsuki-neo/JLT
---
# JLT: Clean-Latent Prediction in Latent Diffusion Transformers
[Paper](https://arxiv.org/abs/2605.27102) · [Project Page](https://akatsuki-neo.github.io/JLT) · [Code](https://github.com/akatsuki-neo/JLT)
ImageNet 256×256 samples from JLT-B/1 using 50-step Heun sampling.
## Authors
[**Funing Fu**](https://github.com/chinoll) · [**Tenghui Wang**](https://github.com/spawner1145) · [**Guanyu Zhou**](https://the-martyr.github.io/) · Junyong Cen · Qichao Zhu
## Overview
JLT investigates whether predicting clean data is better than predicting velocity in latent space. Under the same architecture, training settings, and FLUX.2 VAE representation, clean-latent prediction reaches **FID 2.50**, compared with **FID 6.56** for velocity prediction, a 62% relative reduction in FID on ImageNet 256×256.
This checkpoint was trained in the FLUX.2 VAE latent space with the clean-latent prediction target.
## Results
| Model | Target | FID-50K ↓ | IS ↑ |
|-------|--------|-----------|------|
| **JLT-B/1** | x (clean) | **2.50** | 232.51 |
| DiT-B/1 | v (velocity) | 6.56 | 132.12 |
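
The relative improvement can be recomputed from the two FID values in the table. The snippet below is a plain arithmetic check; the variable names are illustrative only.

```python
# FID-50K values taken from the results table above.
fid_clean = 2.50      # JLT-B/1, clean-latent target
fid_velocity = 6.56   # DiT-B/1, velocity target

# Relative reduction in FID when switching from velocity to clean prediction.
relative_reduction = (fid_velocity - fid_clean) / fid_velocity
print(f"{relative_reduction:.1%}")  # 61.9%
```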
## Method
Under the linear corruption path `z_t = t * x + (1-t) * epsilon`, where `x` is the clean latent and `epsilon` is standard Gaussian noise:
- **Clean prediction** (JLT): predict `x` directly, so the regression target attenuates low-variance latent directions
- **Velocity prediction** (DiT): predict `v = x - epsilon`, so the unit-variance noise adds an isotropic unit floor to the target variance in all directions
Key insight: velocity prediction amplifies low-variance latent directions relative to high-variance ones, while clean prediction attenuates them.
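
The effect on target variance can be illustrated numerically. The following is a minimal sketch, not the repository's training code: it assumes a hypothetical 2-D latent with one high-variance and one low-variance direction, and noise that is independent of the latent. The helper `corrupt` and the chosen standard deviations are assumptions made for illustration.

```python
import numpy as np

def corrupt(x, eps, t):
    """Linear corruption path from the text: z_t = t * x + (1 - t) * eps."""
    return t * x + (1.0 - t) * eps

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical latent statistics: std 2.0 in one direction, 0.1 in the other.
latent_std = np.array([2.0, 0.1])
x = rng.standard_normal((n, 2)) * latent_std   # clean latents
eps = rng.standard_normal((n, 2))              # unit-variance noise

z_half = corrupt(x, eps, t=0.5)                # a corrupted sample at t = 0.5

x_target = x            # clean-prediction target
v_target = x - eps      # velocity-prediction target

var_x = x_target.var(axis=0)
var_v = v_target.var(axis=0)
print("clean target variance:   ", var_x)
print("velocity target variance:", var_v)
print("low/high ratio (clean):   ", var_x[1] / var_x[0])
print("low/high ratio (velocity):", var_v[1] / var_v[0])
```

In this toy setup the velocity target's variance is roughly the latent variance plus one in each direction, so the low-variance direction carries a far larger share of the velocity target than of the clean target.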
## Architecture
| Component | Specification |
|-----------|--------------|
| Transformer Blocks | 12 |
| Hidden Dimension | 768 |
| Attention Heads | 12 |
| Parameters | 130M |
| Tokenizer | FLUX.2 VAE (frozen) |
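
As a rough sanity check on the parameter count, the sketch below estimates the weights in the transformer blocks alone. It assumes a DiT-style block (four attention projections, a two-layer MLP with ratio 4, and adaLN modulation producing six vectors per block); the actual JLT block may differ, and biases, embeddings, and the output layer are ignored. The function name `estimate_block_params` is hypothetical.

```python
def estimate_block_params(hidden_dim, mlp_ratio=4, adaln=True):
    """Approximate weight count of one transformer block (no biases)."""
    attn = 4 * hidden_dim ** 2                   # q, k, v and output projections
    mlp = 2 * mlp_ratio * hidden_dim ** 2        # up- and down-projection
    mod = 6 * hidden_dim ** 2 if adaln else 0    # assumed adaLN modulation layer
    return attn + mlp + mod

# Values from the architecture table.
depth, hidden_dim, heads = 12, 768, 12

head_dim = hidden_dim // heads
total = depth * estimate_block_params(hidden_dim)
print(head_dim)                  # 64
print(f"{total / 1e6:.1f}M")     # 127.4M
```

Under these assumptions the blocks account for about 127M weights, which is in line with the 130M figure in the table.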
## Usage
### Download
```bash
huggingface-cli download dawn-neo/JLT checkpoint-last.pth
```
### Evaluation
```bash
# Requires pre-encoded ImageNet latents and torch-fidelity
python main_jit.py \
--model JiT-B/1 --vae_type flux2 \
--data_path /path/to/imagenet_latents_256 --use_latent_cache \
--online_eval --eval_freq 1 --gen_bsz 128 --num_images 50000 \
--cfg 2.9 --num_sampling_steps 50 \
--resume checkpoint-last.pth --output_dir ./eval_output
```
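
To make the sampling settings more concrete, here is a minimal sketch of Heun (second-order) sampling with a clean-latent predictor under the linear path `z_t = t * x + (1-t) * epsilon`. It is not the repository's sampler: it omits classifier-free guidance and class conditioning, and the names `velocity_from_clean`, `heun_sample`, and `predict_clean` are hypothetical. The velocity is recovered from the clean prediction as `v = (x_hat - z_t) / (1 - t)`, which follows from substituting `epsilon = (z_t - t * x) / (1 - t)` into `v = x - epsilon`.

```python
import numpy as np

def velocity_from_clean(x_hat, z, t):
    """Convert a clean-latent prediction into a velocity (valid for t < 1)."""
    return (x_hat - z) / (1.0 - t)

def heun_sample(predict_clean, z0, num_steps=50):
    """Integrate dz/dt = v from t=0 (noise) to t=1 (data) with Heun's method."""
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    z = z0
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t
        v = velocity_from_clean(predict_clean(z, t), z, t)
        z_euler = z + dt * v                      # Euler predictor
        if i == num_steps - 1:
            # The velocity is undefined at t = 1, so finish with an Euler step.
            z = z_euler
        else:
            v_next = velocity_from_clean(
                predict_clean(z_euler, t_next), z_euler, t_next
            )
            z = z + dt * 0.5 * (v + v_next)       # Heun corrector
    return z

# Toy check with an oracle that always returns the same clean latent.
x_star = np.array([1.5, -0.3, 0.7])
oracle = lambda z, t: x_star
rng = np.random.default_rng(0)
sample = heun_sample(oracle, rng.standard_normal(3), num_steps=50)
print(np.allclose(sample, x_star))
```

With a trained network in place of the oracle, `predict_clean` would take the current latent and time and return the model's estimate of `x`.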
For full training and inference code, see the [GitHub repository](https://github.com/akatsuki-neo/JLT).
## Citation
```bibtex
@article{fu2026jlt,
  title={{JLT}: {C}lean-{L}atent {P}rediction in {L}atent {D}iffusion {T}ransformers},
  author={Fu, Funing and Wang, Tenghui and Zhou, Guanyu and Cen, Junyong and Zhu, Qichao},
  journal={arXiv preprint arXiv:2605.27102},
  year={2026}
}
```
## Acknowledgements
- [JiT](https://github.com/LTH14/JiT) - Base architecture
- [FLUX.2 VAE](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) - Latent space
- [Li & He. "Back to Basics"](https://arxiv.org/abs/2511.13720) - Clean prediction insight