---
title: LTX-2.3 Video [Turbo]
emoji: ⚡
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 6.2.0
python_version: '3.12'
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
short_description: Generate video + audio with LTX-2.3 (22B) on free ZeroGPU
tags:
- video-generation
- audio-generation
- text-to-video
- image-to-video
- ltx
- lightricks
- zerogpu
- fp8
- distilled
models:
- Lightricks/LTX-2.3
---
# LTX-2.3 Turbo (ZeroGPU)
Generate synchronized video + audio from text or images using Lightricks/LTX-2.3 — a 22B parameter DiT-based audio-video foundation model — running entirely on free ZeroGPU hardware. No paid GPU required.
~10 seconds to generate a 2-second video with audio at 768x512.
## Features
- Text to Video — describe a scene and get video with synchronized audio
- Image to Video — provide a first frame and animate it with audio
- Interpolate Mode — provide first and last frames, generate video between them
- Audio Input — provide custom audio for lip-sync or soundtrack conditioning
- Prompt enhancement — Gemma-3 12B rewrites your prompt for better results
- Multiple resolutions — 16:9, 1:1, and 9:16 aspect ratios
- Duration presets — 2s, 3s, and 5s video clips
- Reproducible — set a seed for consistent outputs
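If you prefer driving the Space programmatically, here is a minimal sketch using `gradio_client`. The Space id, `api_name`, and parameter names below are assumptions, not the confirmed API; check the Space's "Use via API" page for the real signature:

```python
# Illustrative only: the Space id, api_name, and parameter names are
# assumptions -- check the Space's "Use via API" page for the real signature.
payload = {
    "prompt": "A barista pours latte art while soft jazz plays",
    "mode": "Text to Video",
    "resolution": "768x512",
    "duration": 2,           # seconds; shorter runs are more reliable on ZeroGPU
    "enhance_prompt": True,  # let Gemma-3 rewrite the prompt
    "seed": 42,              # fix the seed for reproducible output
}

RUN_REMOTE = False  # flip to True to actually call the Space (needs network)
if RUN_REMOTE:
    from gradio_client import Client  # pip install gradio_client
    client = Client("ZeroCollabs/ltx-2.3-video-turbo")  # hypothetical Space id
    print(client.predict(**payload, api_name="/generate"))
```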
## How it works

This Space uses a vendored-packages + pre-loaded-models strategy to fit the 22B model into ZeroGPU's 40 GB of VRAM:

- **Startup:** downloads the model files, constructs the `ModelLedger`, loads the Gemma-3 12B text encoder, and pre-loads the FP8-quantized transformer + video encoder into the pipeline cache.
- **Text encoding** (`@spaces.GPU`): encodes the prompt into video/audio context tensors using the pre-loaded text encoder; the results are returned to CPU.
- **Video generation** (`@spaces.GPU`): runs the two-stage distilled denoising pipeline (8 steps low-res + 4 steps high-res with 2x spatial upscaling) using the pre-encoded contexts, then decodes video and audio.
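The two-stage schedule can be sketched as plain data. `plan_stages` is a hypothetical helper, assuming stage one runs at half the target resolution for 8 steps and stage two at full resolution for 4 steps:

```python
def plan_stages(width: int, height: int) -> list[tuple[int, int, int]]:
    """Return (width, height, denoising_steps) per stage.

    Illustrative sketch of the distilled two-stage schedule:
    8 steps at half resolution, then 4 steps at full resolution
    after the 2x spatial upscale.
    """
    return [
        (width // 2, height // 2, 8),  # stage 1: low-res denoising
        (width, height, 4),            # stage 2: high-res refinement
    ]

print(plan_stages(768, 512))  # [(384, 256, 8), (768, 512, 4)]
```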
## Key optimizations
| Optimization | Details |
|---|---|
| FP8 quantization | Transformer weights cast to float8_e4m3fn, halving VRAM usage |
| Distilled pipeline | Only 8+4 denoising steps (vs 30+ for full model) |
| Pre-loaded models | Text encoder, transformer, video encoder loaded once at startup |
| Two-stage upscaling | Generates at half resolution, then upscales 2x with spatial upsampler |
| Vendored packages | ltx-core and ltx-pipelines bundled for fast builds |
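To see why the FP8 cast matters, here is back-of-the-envelope VRAM math for the 22B transformer weights (weights only; activations, context tensors, and the other models are not included):

```python
def weight_vram_gib(num_params: int, bytes_per_param: int) -> float:
    """Weights-only VRAM in GiB; ignores activations and buffers."""
    return num_params * bytes_per_param / 2**30

params = 22_000_000_000            # 22B transformer parameters
bf16 = weight_vram_gib(params, 2)  # 16-bit weights: ~41 GiB
fp8 = weight_vram_gib(params, 1)   # float8_e4m3fn: ~20.5 GiB

print(f"bf16: {bf16:.1f} GiB, fp8: {fp8:.1f} GiB")
```

The 16-bit weights alone would already exhaust a 40 GB budget, which is why the FP8 cast is what makes the model fit at all.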
## Parameters
| Parameter | Range | Default | Notes |
|---|---|---|---|
| Mode | Text to Video / Image to Video | Text to Video | |
| Prompt | Free text | — | Describe scene, motion, and audio |
| Resolution | 768x512, 512x512, 512x768 | 768x512 | Upscaled 2x by spatial upscaler |
| Duration | 1-5 seconds | 2s | Shorter = more reliable on ZeroGPU |
| Enhance prompt | On/Off | On | Gemma-3 rewrites prompt for better results |
| Seed | 0-2B | Random | For reproducibility |
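Reproducibility works the usual way for diffusion models: the seed drives the initial latent noise, so the same seed yields the same output. A toy stand-in (`fake_noise` is hypothetical, not the Space's actual noise init):

```python
import random

def fake_noise(seed: int, n: int = 4) -> list[float]:
    """Toy stand-in for the latent-noise init: a seeded RNG gives identical draws."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Identical seeds -> identical starting noise -> reproducible videos;
# different seeds -> different starting noise -> different videos.
assert fake_noise(42) == fake_noise(42)
assert fake_noise(42) != fake_noise(43)
```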
## Limitations
- ZeroGPU time limits: Longer videos may exceed the GPU lease duration. Keep duration at 3 seconds or less for best reliability.
- VRAM constraints: Even with FP8 quantization, very high resolutions are not possible. The preset resolutions are tuned for ZeroGPU.
- No camera LoRAs: Camera LoRAs are only available for the 19B model, not the 22B 2.3 model.
## Duplicating this Space

This Space uses `google/gemma-3-12b-it-qat-q4_0-unquantized` as the text encoder. Before duplicating, you must:

- Accept the Gemma license on your Hugging Face account
- Create a read-access Hugging Face token and add it as a Space secret named `HF_TOKEN`
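The duplication itself can also be done from code with `huggingface_hub.duplicate_space`. A sketch, assuming this Space's id (replace the placeholder token with your own read-access token):

```python
# Sketch of programmatic duplication; the Space id is an assumption and
# the token value is a placeholder -- supply your own read-access token.
SRC_SPACE = "ZeroCollabs/ltx-2.3-video-turbo"       # hypothetical source id
secrets = [{"key": "HF_TOKEN", "value": "hf_xxx"}]  # your read-access token

RUN = False  # flip to True to perform the duplication (needs network + auth)
if RUN:
    from huggingface_hub import duplicate_space
    repo = duplicate_space(SRC_SPACE, private=True, secrets=secrets)
    print(repo)
```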
## Credits
- Model: Lightricks/LTX-2.3 (22B parameters)
- Text encoder: google/gemma-3-12b-it-qat-q4_0-unquantized (requires accepting Google's Gemma license)
- Codebase: Lightricks/LTX-2
- ZeroGPU architecture inspired by alexnasa/ltx-2-TURBO
- Space by ZeroCollabs | GitHub