---
title: LTX-2.3 Video [Turbo]
emoji: 
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 6.2.0
python_version: '3.12'
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
short_description: Generate video + audio with LTX-2.3 (22B) on free ZeroGPU
tags:
  - video-generation
  - audio-generation
  - text-to-video
  - image-to-video
  - ltx
  - lightricks
  - zerogpu
  - fp8
  - distilled
models:
  - Lightricks/LTX-2.3
---

# LTX-2.3 Turbo (ZeroGPU)

Generate synchronized video + audio from text or images using Lightricks/LTX-2.3 — a 22B parameter DiT-based audio-video foundation model — running entirely on free ZeroGPU hardware. No paid GPU required.

~10 seconds to generate a 2-second video with audio at 768x512.

## Features

- **Text to Video** — describe a scene and get video with synchronized audio
- **Image to Video** — provide a first frame and animate it with audio
- **Interpolate Mode** — provide first and last frames, and generate the video between them
- **Audio Input** — provide custom audio for lip-sync or soundtrack conditioning
- **Prompt enhancement** — Gemma-3 12B rewrites your prompt for better results
- **Multiple resolutions** — landscape (3:2), square (1:1), and portrait (2:3) aspect ratios
- **Duration presets** — 2s, 3s, and 5s video clips
- **Reproducible** — set a seed for consistent outputs

## How it works

This Space uses a vendored-packages + pre-loaded-models strategy to fit the 22B-parameter model into ZeroGPU's ~40GB of VRAM:

1. **Startup:** Downloads model files, constructs the `ModelLedger`, loads the Gemma-3 12B text encoder, and pre-loads the FP8-quantized transformer + video encoder into the pipeline cache.
2. **Text encoding** (`@spaces.GPU`): Encodes the prompt into video/audio context tensors using the pre-loaded text encoder; the results are moved back to CPU.
3. **Video generation** (`@spaces.GPU`): Runs the two-stage distilled denoising pipeline (8 steps at low resolution + 4 steps at high resolution with 2x spatial upscaling) using the pre-encoded contexts, then decodes video and audio.
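The two-stage schedule in step 3 can be sketched as a plan of (steps, resolution) pairs. This is an illustrative helper, not the Space's actual code; only the step counts (8 + 4) and the 2x upscale factor come from the description above.

```python
# Sketch of the two-stage distilled denoising plan: 8 steps at half
# resolution, then 4 steps at full resolution after a 2x spatial upscale.
# Step counts and the upscale factor are taken from the README text.

def plan_stages(height: int, width: int) -> list[dict]:
    """Return the denoising plan for a target output resolution."""
    return [
        {"stage": "low-res", "steps": 8, "height": height // 2, "width": width // 2},
        {"stage": "high-res", "steps": 4, "height": height, "width": width},
    ]

for stage in plan_stages(512, 768):
    print(stage)
```

For the default 768x512 preset this yields an 8-step pass at 384x256 followed by a 4-step pass at the full 768x512, which is then the input to the decoders.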

## Key optimizations

| Optimization | Details |
|---|---|
| FP8 quantization | Transformer weights cast to `float8_e4m3fn`, halving VRAM usage |
| Distilled pipeline | Only 8+4 denoising steps (vs. 30+ for the full model) |
| Pre-loaded models | Text encoder, transformer, and video encoder loaded once at startup |
| Two-stage upscaling | Generates at half resolution, then upscales 2x with the spatial upsampler |
| Vendored packages | `ltx-core` and `ltx-pipelines` bundled for fast builds |
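The "halving VRAM" claim for FP8 follows directly from parameter width: bfloat16 stores 2 bytes per parameter, `float8_e4m3fn` stores 1. A back-of-envelope check for the 22B transformer weights alone (activations, KV caches, and the other models are extra):

```python
# Weight-memory arithmetic for the 22B transformer: fp8 (1 byte/param)
# is exactly half of bf16 (2 bytes/param). Illustration only.

def weight_gib(n_params: float, bytes_per_param: int) -> float:
    """Size of a weight tensor in GiB at a given per-parameter width."""
    return n_params * bytes_per_param / 1024**3

PARAMS = 22e9                  # 22B-parameter transformer
bf16 = weight_gib(PARAMS, 2)   # ~41 GiB: would not fit alongside other models
fp8 = weight_gib(PARAMS, 1)    # ~20.5 GiB: leaves headroom for the rest

print(f"bf16: {bf16:.1f} GiB, fp8: {fp8:.1f} GiB")
```

This is why FP8 is load-bearing here: at bf16 the transformer weights alone would consume essentially all available VRAM.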

## Parameters

| Parameter | Range | Default | Notes |
|---|---|---|---|
| Mode | Text to Video / Image to Video | Text to Video | |
| Prompt | Free text | | Describe scene, motion, and audio |
| Resolution | 768x512, 512x512, 512x768 | 768x512 | Upscaled 2x by the spatial upscaler |
| Duration | 1-5 seconds | 2s | Shorter = more reliable on ZeroGPU |
| Enhance prompt | On/Off | On | Gemma-3 rewrites the prompt for better results |
| Seed | 0-2B | Random | For reproducibility |
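A hypothetical input-validation helper mirroring the table above; the preset list and ranges come from the README, but the function itself is ours, and we assume "0-2B" means the signed 32-bit seed range:

```python
# Illustrative validation of the table's parameter ranges (not the
# Space's actual code). Assumes seed "0-2B" means 0 <= seed < 2**31.

PRESET_RESOLUTIONS = {"768x512", "512x512", "512x768"}

def validate(resolution: str, duration: int, seed: int) -> None:
    """Raise ValueError if any parameter falls outside the documented range."""
    if resolution not in PRESET_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(PRESET_RESOLUTIONS)}")
    if not 1 <= duration <= 5:
        raise ValueError("duration must be 1-5 seconds")
    if not 0 <= seed < 2**31:
        raise ValueError("seed must be in the 0-2B range")

validate("768x512", 2, 42)  # the table's defaults pass
```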

## Limitations

- **ZeroGPU time limits:** Longer videos may exceed the GPU lease duration. Keep duration at 3 seconds or less for best reliability.
- **VRAM constraints:** Even with FP8 quantization, very high resolutions are not possible. The preset resolutions are tuned for ZeroGPU.
- **No camera LoRAs:** Camera LoRAs are only available for the 19B model, not the 22B 2.3 model.

## Duplicating this Space

This Space uses `google/gemma-3-12b-it-qat-q4_0-unquantized` as the text encoder. Before duplicating, you must:

1. Accept the Gemma license with your Hugging Face account
2. Create a read-access Hugging Face token and add it as a Space secret named `HF_TOKEN`
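Space secrets are exposed to the app as environment variables, so a duplicated Space can fail fast at startup if the token is missing. A minimal sketch (the check is ours, not necessarily what `app.py` does):

```python
import os

def require_hf_token() -> str:
    """Fail fast if the HF_TOKEN Space secret is missing; Space secrets
    are exposed to the running app as environment variables."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add a read-access token as a Space "
            "secret after accepting the Gemma license."
        )
    return token
```

Without this, the first symptom of a missing token is typically a gated-repo 401 error deep inside the model download, which is much harder to diagnose.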

## Credits