---
language:
- en
license: apache-2.0
tags:
- image-generation
- latent-diffusion
- denoising-diffusion
- computer-vision
- generative-model
paperswithcode_id: akatsuki-neo/JLT
---

# JLT: Clean-Latent Prediction in Latent Diffusion Transformers

[![Paper](https://img.shields.io/badge/arXiv-2605.27102-b31b1b)](https://arxiv.org/abs/2605.27102)
[![Project Page](https://img.shields.io/badge/GitHub%20Pages-JLT-blue)](https://akatsuki-neo.github.io/JLT)
[![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/akatsuki-neo/JLT)

<div align="center">
<img src="https://akatsuki-neo.github.io/JLT/images/jlt_b16_heun50_samples.png" width="75%">
<br><br>
ImageNet 256×256 samples from JLT-B/1 using 50-step Heun sampling.
</div>

## Authors

[**Funing Fu**](https://github.com/chinoll) · [**Tenghui Wang**](https://github.com/spawner1145) · [**Guanyu Zhou**](https://the-martyr.github.io/) · Junyong Cen · Qichao Zhu

## Overview

JLT investigates whether predicting clean data is better than predicting velocity in latent space. Under the same architecture, training settings, and FLUX.2 VAE representation, clean-latent prediction reaches **FID 2.50** versus **FID 6.56** for velocity prediction on ImageNet 256×256, a 62% relative reduction in FID.

This checkpoint was trained in the FLUX.2 VAE latent space with the clean-latent prediction target.

## Results

| Model | Target | FID-50K ↓ | IS ↑ |
|-------|--------|-----------|------|
| **JLT-B/1** | x (clean) | **2.50** | 232.51 |
| DiT-B/1 | v (velocity) | 6.56 | 132.12 |
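
The relative gaps implied by the table can be checked with a few lines of arithmetic. The snippet below only reuses the four numbers from the rows above; the variable names are illustrative.

```python
# Values copied from the results table above.
fid_clean, fid_velocity = 2.50, 6.56
is_clean, is_velocity = 232.51, 132.12

# Relative FID reduction of clean prediction over velocity prediction (lower FID is better).
fid_reduction_pct = 100.0 * (fid_velocity - fid_clean) / fid_velocity

# Ratio of Inception Scores (higher IS is better).
is_ratio = is_clean / is_velocity

print(f"FID reduction: {fid_reduction_pct:.1f}%")
print(f"IS ratio: {is_ratio:.2f}x")
```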

## Method

Under the linear corruption path `z_t = t * x + (1-t) * epsilon`:

- **Clean prediction** (JLT): predict `x` directly, attenuating low-variance latent directions
- **Velocity prediction** (DiT): predict `v = x - epsilon`, whose noise term adds an isotropic unit-variance floor to every direction of the target

Key insight: velocity prediction amplifies low-variance latent directions while clean prediction attenuates them.
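
The two targets can be compared on synthetic data. The following is a minimal NumPy sketch, not code from the repository: it assumes Gaussian "latents" with hypothetical per-direction standard deviations (`stds`), independent unit-variance noise, and uniformly sampled `t`. Under those assumptions the velocity target has variance `sigma_i^2 + 1` in direction `i`, so in a low-variance direction almost all of the target is noise, while the clean target keeps the variance of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical per-direction standard deviations of the clean latents (illustrative only).
stds = np.array([2.0, 1.0, 0.1])

x = rng.standard_normal((n, 3)) * stds   # clean latents
eps = rng.standard_normal((n, 3))        # unit-variance Gaussian noise
t = rng.uniform(size=(n, 1))             # one time value per sample

# Linear corruption path from the text: z_t = t * x + (1 - t) * epsilon
z_t = t * x + (1 - t) * eps

target_clean = x            # clean-prediction target
target_velocity = x - eps   # velocity-prediction target

# The two parameterizations are linked by x = z_t + (1 - t) * v.
x_recovered = z_t + (1 - t) * target_velocity

var_clean = target_clean.var(axis=0)
var_velocity = target_velocity.var(axis=0)
noise_share = eps.var(axis=0) / var_velocity  # fraction of velocity-target variance due to noise

print("clean target variance:   ", var_clean)
print("velocity target variance:", var_velocity)
print("noise share of velocity: ", noise_share)
```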

## Architecture

| Component | Specification |
|-----------|--------------|
| Transformer Blocks | 12 |
| Hidden Dimension | 768 |
| Attention Heads | 12 |
| Parameters | 130M |
| Tokenizer | FLUX.2 VAE (frozen) |
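
As a rough sanity check on the table, the block parameters can be estimated from the hidden size and depth. This sketch assumes a DiT-style block (standard multi-head attention, an MLP with expansion ratio 4, and adaLN modulation producing six vectors per block) and ignores biases, embeddings, and the output head; the actual JLT blocks may differ, so the result is only a ballpark figure to compare against the 130M listed above.

```python
hidden_dim = 768
num_heads = 12
num_blocks = 12

# Per-head width follows directly from the table.
head_dim = hidden_dim // num_heads

# Assumed DiT-style block, weights only (no biases):
attn_params = 4 * hidden_dim**2               # Q, K, V and output projections
mlp_params = 2 * hidden_dim * 4 * hidden_dim  # two linear layers, expansion ratio 4 (assumption)
adaln_params = 6 * hidden_dim**2              # conditioning -> 6 modulation vectors (assumption)

per_block = attn_params + mlp_params + adaln_params
block_total = num_blocks * per_block

print(f"head dim: {head_dim}")
print(f"approx. block parameters: {block_total / 1e6:.1f}M")
```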

## Usage

### Download

```bash
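# --local-dir . places the file in the current directory instead of the Hugging Face cache
huggingface-cli download dawn-neo/JLT checkpoint-last.pth --local-dir .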
```
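
The same file can also be fetched from Python with `huggingface_hub.hf_hub_download`. This is a minimal sketch: the helper name `download_checkpoint` is hypothetical, and the repository ID and filename are taken from the command above.

```python
REPO_ID = "dawn-neo/JLT"          # repository ID from the CLI command above
FILENAME = "checkpoint-last.pth"  # checkpoint filename from the CLI command above


def download_checkpoint(local_dir: str = ".") -> str:
    """Download the checkpoint into local_dir and return its local path."""
    # Imported lazily so the module loads even if huggingface_hub is not installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)


if __name__ == "__main__":
    print(download_checkpoint())
```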

### Evaluation

```bash
# Requires pre-encoded ImageNet latents and torch-fidelity
python main_jit.py \
    --model JiT-B/1 --vae_type flux2 \
    --data_path /path/to/imagenet_latents_256 --use_latent_cache \
    --online_eval --eval_freq 1 --gen_bsz 128 --num_images 50000 \
    --cfg 2.9 --num_sampling_steps 50 \
    --resume checkpoint-last.pth --output_dir ./eval_output
```
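
To illustrate what a 50-step Heun sampler does with a clean-latent predictor, here is a minimal NumPy sketch. It is not the repository's sampler: `predict_clean` stands in for the trained network, classifier-free guidance is omitted, and the clamp `t_eps` and the Euler final step are assumptions made to avoid dividing by zero at `t = 1`. It relies on the identity `v = (x - z_t) / (1 - t)`, which follows from the linear corruption path.

```python
import numpy as np


def velocity_from_clean(x_pred, z_t, t, t_eps=1e-2):
    """Convert a clean-latent prediction into a velocity: v = (x - z_t) / (1 - t)."""
    # Clamp the denominator (assumed safeguard) so the division stays finite near t = 1.
    return (x_pred - z_t) / max(1.0 - t, t_eps)


def heun_sample(predict_clean, z0, num_steps=50):
    """Integrate dz/dt = v from t = 0 (noise) to t = 1 (data) with Heun's method."""
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    z = z0
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t
        v = velocity_from_clean(predict_clean(z, t), z, t)
        z_euler = z + dt * v  # Euler predictor
        if i == num_steps - 1:
            z = z_euler  # plain Euler on the last step: no network evaluation at t = 1
        else:
            v_next = velocity_from_clean(predict_clean(z_euler, t_next), z_euler, t_next)
            z = z + 0.5 * dt * (v + v_next)  # Heun corrector: average the two slopes
    return z


# Toy check: if all data sits at a single point c, the ideal clean predictor returns c.
c = np.array([1.5, -0.5])
z0 = np.random.default_rng(0).standard_normal(2)
sample = heun_sample(lambda z, t: c, z0, num_steps=50)
print(sample)
```

In the toy case the sampler should land on `c`, because the exact trajectory `z(t) = c + (1 - t) * (z0 - c)` is linear in `t`.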

For full training and inference code, see the [GitHub repository](https://github.com/akatsuki-neo/JLT).

## Citation

```bibtex
@article{fu2026jlt,
  title={{JLT}: {C}lean-{L}atent {P}rediction in {L}atent {D}iffusion {T}ransformers},
  author={Fu, Funing and Wang, Tenghui and Zhou, Guanyu and Cen, Junyong and Zhu, Qichao},
  journal={arXiv preprint arXiv:2605.27102},
  year={2026}
}
```

## Acknowledgements

- [JiT](https://github.com/LTH14/JiT) - Base architecture
- [FLUX.2 VAE](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) - Latent space
- [Li & He. "Back to Basics"](https://arxiv.org/abs/2511.13720) - Clean prediction insight