Single-stream transformer efficiency: timestep-free denoising trade-offs
The single-stream architecture with timestep-free denoising is a clean design choice: eliminating explicit timestep embeddings reduces parameter overhead and inference complexity. The 2-second 256p and 38-second 1080p generation times are impressive for a 15B model.
Two questions for production deployment:
The timestep-free approach: how does it handle edge cases like rapid motion or complex audio-video synchronization? Traditional diffusion models rely on timestep conditioning for coarse-to-fine refinement; without it, do you observe quality degradation in challenging scenes?
The sandwich architecture with shared middle layers: are the modality-specific projections only at the first/last 4 layers? For fine-tuning on new domains (e.g., gaming avatars, sign language), would you recommend freezing the shared layers or full fine-tuning?
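For the freezing option above, a minimal sketch of what selective fine-tuning of a "sandwich" stack might look like in PyTorch. The layer count, the `n_edge = 4` boundary, and the flat `ModuleList` layout are assumptions for illustration, not the model's actual structure:

```python
# Hypothetical sketch: freeze the shared middle blocks of a sandwich
# transformer, leaving only the first/last modality-specific blocks
# trainable. Attribute names and layer counts are assumptions.
import torch.nn as nn

def freeze_shared_middle(blocks: nn.ModuleList, n_edge: int = 4) -> None:
    """Keep only the first and last `n_edge` blocks trainable."""
    for i, block in enumerate(blocks):
        trainable = i < n_edge or i >= len(blocks) - n_edge
        for p in block.parameters():
            p.requires_grad = trainable

# toy stand-in: 24 blocks, 4 modality-specific at each end
blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(24))
freeze_shared_middle(blocks, n_edge=4)

# each Linear has 2 parameter tensors (weight, bias);
# 8 edge blocks stay trainable -> 16 trainable tensors
n_trainable = sum(p.requires_grad for b in blocks for p in b.parameters())
print(n_trainable)  # 16
```

An optimizer built over `filter(lambda p: p.requires_grad, ...)` would then update only the edge blocks, which is the usual way to realize this kind of partial fine-tune.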
The 80% win rate against Ovi 1.1 and 60.9% against LTX 2.3 are strong results. For real-time interactive applications (video chat, live streaming), the 2-second 256p latency opens interesting possibilities, assuming the super-resolution stage can be parallelized.
Curious about the per-head gating mechanism's impact on training stability: learned scalar gates are cleaner than complex conditioning branches.
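For concreteness, a minimal sketch of per-head scalar gating, assuming one learned scalar per attention head applied multiplicatively to that head's output before the output projection. Shapes and names are illustrative, not the model's actual implementation:

```python
# Sketch: scale each attention head's output by its own scalar gate.
# In training these gates would be learned parameters; here they are
# fixed values to show the broadcast.
import numpy as np

def per_head_gate(x: np.ndarray, gates: np.ndarray) -> np.ndarray:
    """x: (batch, heads, seq, head_dim); gates: (heads,)."""
    return x * gates.reshape(1, -1, 1, 1)

x = np.ones((2, 4, 6, 8))                # toy attention output, 4 heads
gates = np.array([1.0, 0.5, 0.0, 2.0])   # one scalar per head
out = per_head_gate(x, gates)
print(out[0, 1, 0, 0])  # 0.5: head 1 is scaled by its gate
```

Initializing such gates near 1 keeps the block close to identity at the start of training, which is one plausible reason this reads as a stability-friendly alternative to heavier conditioning branches.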
Cool!
I still can't run this in ZeroGPU. So close, yet it fails.