Single-stream transformer efficiency: timestep-free denoising trade-offs

#8
by O96a - opened

The single-stream architecture with timestep-free denoising is a clean design choice: eliminating explicit timestep embeddings reduces parameter overhead and inference complexity. The 2-second 256p and 38-second 1080p timings are impressive for a 15B model.

Two questions for production deployment:

  1. How does the timestep-free approach handle edge cases like rapid motion or complex audio-video synchronization? Traditional diffusion models rely on timesteps for coarse-to-fine refinement. Without them, do you observe quality degradation in challenging scenes?

  2. On the sandwich architecture with shared middle layers: is the modality-specific projection only at the first/last 4 layers? For fine-tuning on new domains (e.g., gaming avatars, sign language), would you recommend freezing the shared layers or full fine-tuning?
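To make question 1 concrete, here's the intuition I have in mind, as a toy NumPy sketch (my own illustration, not the paper's method): under variance-preserving noising of unit-variance latents, the overall variance of the noisy input stays ~1, but simple structural statistics of the input itself still track the noise level, so a network could in principle infer "how noisy am I?" from activations alone rather than from an explicit timestep embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_autocorr(x):
    """Lag-1 autocorrelation: a crude 'how much structure is left?' statistic."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# Smooth, unit-variance "clean" latent: real video latents are spatially correlated.
t = np.linspace(0, 8 * np.pi, 4096)
x0 = np.sin(t) + 0.3 * np.sin(3 * t)
x0 = (x0 - x0.mean()) / x0.std()

for alpha in (0.9, 0.5, 0.1):
    eps = rng.standard_normal(x0.shape)
    # VP-style noising: variance of x_t stays ~1 for every alpha...
    x_t = np.sqrt(alpha) * x0 + np.sqrt(1 - alpha) * eps
    # ...but the autocorrelation decays roughly in proportion to alpha,
    # i.e., the noise level is recoverable from the input itself.
    print(f"alpha={alpha:.1f}  lag-1 autocorr of x_t = {lag1_autocorr(x_t):.2f}")
```

My worry is exactly the edge cases: in rapid-motion or tightly synchronized audio-video scenes, these input statistics may be noisier estimators of the corruption level, which is where explicit timestep conditioning might have helped.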
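For question 2, the recipe I'd naively reach for is: freeze the shared middle and tune only the modality-specific edges. A minimal sketch of that layer partition (the 32-layer depth and all names here are my assumptions, not the release's API; the first/last-4 split is from my question above):

```python
# Hypothetical depth; only the first/last-4 edge split comes from the question.
N_LAYERS, N_EDGE = 32, 4

def trainable_mask(n_layers: int, n_edge: int) -> list[bool]:
    """True = fine-tune this layer; False = keep shared weights frozen."""
    return [i < n_edge or i >= n_layers - n_edge for i in range(n_layers)]

mask = trainable_mask(N_LAYERS, N_EDGE)
print(f"trainable layers: {sum(mask)} / {N_LAYERS}")  # 8 / 32
# In a framework like PyTorch, this mask would drive requires_grad per layer.
```

Whether that's enough capacity for domains like sign language (where the shared layers presumably carry the motion prior) is exactly what I'm asking.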

The 80% win rate against Ovi 1.1 and 60.9% against LTX 2.3 are strong results. For real-time interactive applications (video chat, live streaming), the 2-second 256p latency opens interesting possibilities, assuming the super-resolution stage can be parallelized.

I'm also curious about the per-head gating mechanism's impact on training stability; learned scalar gates are cleaner than complex conditioning branches.
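What I mean by per-head scalar gating, as I understand the idea (a toy NumPy sketch of the general technique, not this model's code): each attention head's output is scaled by one learned scalar, typically squashed through tanh and zero-initialized so every head starts "off" and is eased in during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, seq, d_head = 8, 16, 32  # illustrative sizes

head_out = rng.standard_normal((n_heads, seq, d_head))  # per-head attention outputs
gate_logits = np.zeros(n_heads)                          # learned, zero-initialized scalars

# One scalar per head, broadcast over sequence and channel dims.
gated = np.tanh(gate_logits)[:, None, None] * head_out
print(float(np.abs(gated).max()))  # 0.0 at init: gated heads contribute nothing yet
```

The zero-init is what I'd expect to help stability (the residual stream is untouched at step 0), so I'd love to hear whether you saw that in practice.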

I still can't run this on ZeroGPU; it gets close, then fails.
