---
title: LTX-2.3 Video [Turbo]
emoji: 
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 6.2.0
python_version: '3.12'
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
short_description: Generate video + audio with LTX-2.3 (22B) on free ZeroGPU
tags:
  - video-generation
  - audio-generation
  - text-to-video
  - image-to-video
  - ltx
  - lightricks
  - zerogpu
  - fp8
  - distilled
models:
  - Lightricks/LTX-2.3
---

# LTX-2.3 Turbo (ZeroGPU)

Generate synchronized video + audio from text or images using Lightricks/LTX-2.3 — a 22B parameter DiT-based audio-video foundation model — running entirely on free ZeroGPU hardware. No paid GPU required.

~10 seconds to generate a 2-second video with audio at 768x512.

## Features

- **Text to Video** — describe a scene and get video with synchronized audio
- **Image to Video** — provide a first frame and animate it with audio
- **Interpolate Mode** — provide first and last frames, and generate the video between them
- **Audio Input** — provide custom audio for lip-sync or soundtrack conditioning
- **Prompt enhancement** — Gemma-3 12B rewrites your prompt for better results
- **Multiple resolutions** — landscape (3:2), square (1:1), and portrait (2:3) aspect ratios
- **Duration presets** — 2s, 3s, and 5s video clips
- **Reproducible** — set a seed for consistent outputs

## How it works

This Space uses a vendored-packages + pre-loaded-models strategy to fit the 22B-parameter model into ZeroGPU's ~40GB of VRAM:

1. **Startup:** Downloads model files, constructs the `ModelLedger`, loads the Gemma-3 12B text encoder, and pre-loads the FP8-quantized transformer + video encoder into the pipeline cache.
2. **Text encoding** (`@spaces.GPU`): Encodes the prompt into video/audio context tensors using the pre-loaded text encoder; the results are moved back to CPU.
3. **Video generation** (`@spaces.GPU`): Runs the two-stage distilled denoising pipeline (8 steps at low resolution + 4 steps at high resolution with 2x spatial upscaling) using the pre-encoded contexts, then decodes video and audio.
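The two-stage schedule in step 3 can be sketched as a plan of (steps, resolution) pairs. This is an illustrative helper, not the Space's actual code; only the step counts (8 + 4) and the 2x upscale factor come from the description above.

```python
# Sketch of the two-stage distilled denoising plan: 8 steps at half
# resolution, then 4 steps at full resolution after a 2x spatial upscale.
# Step counts and the upscale factor are taken from the README text.

def plan_stages(height: int, width: int) -> list[dict]:
    """Return the denoising plan for a target output resolution."""
    return [
        {"stage": "low-res", "steps": 8, "height": height // 2, "width": width // 2},
        {"stage": "high-res", "steps": 4, "height": height, "width": width},
    ]

for stage in plan_stages(512, 768):
    print(stage)
```

For the default 768x512 preset this yields an 8-step pass at 384x256 followed by a 4-step pass at the full 768x512, which is then the input to the decoders.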

## Key optimizations

| Optimization | Details |
|---|---|
| FP8 quantization | Transformer weights cast to `float8_e4m3fn`, halving VRAM usage |
| Distilled pipeline | Only 8+4 denoising steps (vs. 30+ for the full model) |
| Pre-loaded models | Text encoder, transformer, and video encoder loaded once at startup |
| Two-stage upscaling | Generates at half resolution, then upscales 2x with the spatial upsampler |
| Vendored packages | `ltx-core` and `ltx-pipelines` bundled for fast builds |
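The "halving VRAM" claim for FP8 follows directly from parameter width: bfloat16 stores 2 bytes per parameter, `float8_e4m3fn` stores 1. A back-of-envelope check for the 22B transformer weights alone (activations, KV caches, and the other models are extra):

```python
# Weight-memory arithmetic for the 22B transformer: fp8 (1 byte/param)
# is exactly half of bf16 (2 bytes/param). Illustration only.

def weight_gib(n_params: float, bytes_per_param: int) -> float:
    """Size of a weight tensor in GiB at a given per-parameter width."""
    return n_params * bytes_per_param / 1024**3

PARAMS = 22e9                  # 22B-parameter transformer
bf16 = weight_gib(PARAMS, 2)   # ~41 GiB: would not fit alongside other models
fp8 = weight_gib(PARAMS, 1)    # ~20.5 GiB: leaves headroom for the rest

print(f"bf16: {bf16:.1f} GiB, fp8: {fp8:.1f} GiB")
```

This is why FP8 is load-bearing here: at bf16 the transformer weights alone would consume essentially all available VRAM.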

## Parameters

| Parameter | Range | Default | Notes |
|---|---|---|---|
| Mode | Text to Video / Image to Video | Text to Video | |
| Prompt | Free text | | Describe scene, motion, and audio |
| Resolution | 768x512, 512x512, 512x768 | 768x512 | Upscaled 2x by the spatial upscaler |
| Duration | 1-5 seconds | 2s | Shorter = more reliable on ZeroGPU |
| Enhance prompt | On/Off | On | Gemma-3 rewrites the prompt for better results |
| Seed | 0-2B | Random | For reproducibility |
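A hypothetical input-validation helper mirroring the table above; the preset list and ranges come from the README, but the function itself is ours, and we assume "0-2B" means the signed 32-bit seed range:

```python
# Illustrative validation of the table's parameter ranges (not the
# Space's actual code). Assumes seed "0-2B" means 0 <= seed < 2**31.

PRESET_RESOLUTIONS = {"768x512", "512x512", "512x768"}

def validate(resolution: str, duration: int, seed: int) -> None:
    """Raise ValueError if any parameter falls outside the documented range."""
    if resolution not in PRESET_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(PRESET_RESOLUTIONS)}")
    if not 1 <= duration <= 5:
        raise ValueError("duration must be 1-5 seconds")
    if not 0 <= seed < 2**31:
        raise ValueError("seed must be in the 0-2B range")

validate("768x512", 2, 42)  # the table's defaults pass
```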

## Limitations

- **ZeroGPU time limits:** Longer videos may exceed the GPU lease duration. Keep duration at 3 seconds or less for best reliability.
- **VRAM constraints:** Even with FP8 quantization, very high resolutions are not possible. The preset resolutions are tuned for ZeroGPU.
- **No camera LoRAs:** Camera LoRAs are only available for the 19B model, not the 22B 2.3 model.

## Duplicating this Space

This Space uses `google/gemma-3-12b-it-qat-q4_0-unquantized` as the text encoder. Before duplicating, you must:

1. Accept the Gemma license with your Hugging Face account
2. Create a read-access Hugging Face token and add it as a Space secret named `HF_TOKEN`
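Space secrets are exposed to the app as environment variables, so a duplicated Space can fail fast at startup if the token is missing. A minimal sketch (the check is ours, not necessarily what `app.py` does):

```python
import os

def require_hf_token() -> str:
    """Fail fast if the HF_TOKEN Space secret is missing; Space secrets
    are exposed to the running app as environment variables."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add a read-access token as a Space "
            "secret after accepting the Gemma license."
        )
    return token
```

Without this, the first symptom of a missing token is typically a gated-repo 401 error deep inside the model download, which is much harder to diagnose.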

## Credits