---
language:
  - en
  - de
  - es
  - fr
  - ja
  - ko
  - zh
  - it
  - pt
library_name: diffusers
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: image-to-video
arxiv: 2601.03233
tags:
  - image-to-video
  - text-to-video
  - video-to-video
  - image-text-to-video
  - audio-to-video
  - text-to-audio
  - video-to-audio
  - audio-to-audio
  - text-to-audio-video
  - image-to-audio-video
  - image-text-to-audio-video
  - ltx-2
  - ltx-2-3
  - ltx-video
  - ltxv
  - lightricks
pinned: true
demo: https://app.ltx.studio/ltx-2-playground/i2v
---

# LTX-2.3 FP8 Quantized

FP8 quantized versions of the LTX-2.3 22B models by Lightricks.


## Quantized Checkpoints

| Name | Original | Size |
|---|---|---|
| `ltx-2.3-22b-dev-fp8_mixed.safetensors` | ltx-2.3-22b-dev | ~30 GB |
| `ltx-2.3-22b-distilled-fp8_mixed.safetensors` | ltx-2.3-22b-distilled | ~30 GB |

## Quantization Details

- Format: `float8_e4m3fn` (E4M3, max = 448)
- Method: static per-tensor W8A8 quantization
- Scope: transformer blocks 1–42 (block 0 and the last 5 blocks are kept in BF16)
- Targets: all linear projection weight matrices in `attn1`, `attn2`, `audio_attn1`, `audio_attn2`, `audio_to_video_attn`, `video_to_audio_attn`, `ff.net`, and `audio_ff.net` (specifically `to_q`, `to_k`, `to_v`, `to_out.0`, `ff.net.0.proj`, `ff.net.2`, and their audio equivalents)
- Scale: per-tensor `weight_scale = max(|W|) / 448`, stored as an F32 scalar alongside each weight; a static `input_scale = 1.0` placeholder is stored to match the source model's format
- Not quantized: biases, norms, `scale_shift_table`s, and `gate_logits` remain in BF16/F32
- Quantized tensors: 1176 of 5947 total (28 patterns × 42 blocks)
- Output size: ~29.94 GB (down from ~46 GB in BF16)
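The per-tensor scheme above can be sketched in plain Python. This is an illustrative sketch only: `round_to_e4m3` is a simplified emulation of `float8_e4m3fn` rounding (normal range only, no subnormal handling), and the function names are hypothetical, not part of any released tooling.

```python
import math

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def round_to_e4m3(x):
    """Crude emulation of rounding to E4M3: keep a 4-bit significand.

    Handles only the normal range; real casts also deal with
    subnormals and saturation/NaN behavior.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)    # x = m * 2**e, with 0.5 <= |m| < 1
    m = round(m * 16) / 16  # 1 implied + 3 stored mantissa bits
    return math.ldexp(m, e)

def quantize_tensor(weights):
    """Static per-tensor weight quantization as described above."""
    # Per-tensor scale: max(|W|) / 448, stored as an F32 scalar
    weight_scale = max(abs(w) for w in weights) / E4M3_MAX
    # Scaled values land in [-448, 448] before the FP8 cast
    q = [round_to_e4m3(w / weight_scale) for w in weights]
    return q, weight_scale

def dequantize(q, weight_scale):
    """Recover approximate BF16-domain weights at load time."""
    return [v * weight_scale for v in q]

W = [0.75, -1.5, 0.1, 2.0]       # toy weight tensor
q, s = quantize_tensor(W)
W_hat = dequantize(q, s)
```

The largest-magnitude weight maps exactly to ±448, while smaller weights pick up a relative error bounded by the 3-bit mantissa; per-tensor scaling keeps storage overhead to a single F32 scalar per weight matrix.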

## Original Model

This is a quantized derivative of Lightricks/LTX-2.3. All original model details, usage instructions, and license terms apply.

LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model.

## Citation

```bibtex
@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Shiran, Guy and Chachy, Itay and Chetboun, Jonathan and Finkelson, Michael and Kupchick, Michael and Zabari, Nir and Guetta, Nitzan and Kotler, Noa and Bibi, Ofir and Gordon, Ori and Panet, Poriya and Benita, Roi and Armon, Shahar and Kulikov, Victor and Inger, Yaron and Shiftan, Yonatan and Melumian, Zeev and Farbman, Zeev},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}
```