Abstract
Helix4D enables high-quality dynamic mesh generation by adapting Trellis2's frame-local attention across frames and extending 3D positional encoding with 4D temporal information.
Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2's frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, so that later frames inherit Trellis2's quality on rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D to 4D with no additional parameters. Extensive experiments show the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and on our own set of challenging complex dynamics.
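The two mechanisms in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no details, so the symmetric window, the per-axis band layout, the frequency schedule, and every name and parameter here (`cross_frame_mask`, `rope_angles_4d`, `window`, `time_bands`, `base`) are assumptions chosen for illustration only.

```python
import numpy as np


def cross_frame_mask(num_frames: int, window: int) -> np.ndarray:
    """Boolean [F, F] mask; entry (i, j) is True if frame i may attend to frame j.

    Assumption: each frame sees frames within `window` steps of itself
    (symmetric window) plus the anchor frame 0.
    """
    idx = np.arange(num_frames)
    near = np.abs(idx[:, None] - idx[None, :]) <= window
    anchor = np.zeros((num_frames, num_frames), dtype=bool)
    anchor[:, 0] = True  # every frame can always attend to the first frame
    return near | anchor


def rope_angles_4d(xyz, t, dim_per_axis=32, time_bands=4, base=10000.0):
    """Rotation angles for a RoPE-style encoding with time folded in.

    xyz: [N, 3] spatial positions, t: [N] frame indices.
    Returns [N, 3, dim_per_axis // 2] angles. The output shape matches a
    purely 3D encoding, so no parameters are added. Assumption: the last
    `time_bands` (lowest-frequency) bands of each axis are driven by t.
    """
    xyz = np.asarray(xyz, dtype=float)
    t = np.asarray(t, dtype=float)
    half = dim_per_axis // 2
    # Standard RoPE schedule: band 0 is the highest frequency.
    freqs = base ** (-np.arange(half) / half)
    angles = xyz[:, :, None] * freqs[None, None, :]
    # Hypothetical schedule for the repurposed time bands.
    time_freqs = base ** (-np.arange(time_bands) / time_bands)
    angles[:, :, half - time_bands:] = t[:, None, None] * time_freqs[None, None, :]
    return angles


if __name__ == "__main__":
    # Row i lists which frames frame i can attend to.
    print(cross_frame_mask(num_frames=6, window=1).astype(int))
    pts = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
    print(rope_angles_4d(pts, t=np.array([0.0, 5.0])).shape)
```

In this sketch, two tokens at the same spatial position but in different frames share all high-frequency spatial bands and differ only in the repurposed low-frequency bands, which is one way to read "no additional parameters". In practice the mask would be expanded to token level before being passed to an attention kernel.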
Community
A dynamic mesh generation framework that can model challenging 4D scenarios, including topology changes, deformation, shattering, melting, transparency, and thin structures.
the 4d temporal encoding that reuses low-frequency spatial RoPE bands to encode time is a neat trick that keeps the pretrained backbone intact while adding dynamics. i’d love to see an ablation on the sliding window size and the anchor frame impact, because with long videos the cross-frame attention might have to trade off local fidelity for global coherence and i want to know where it breaks. also curious how this interacts with frame-local priors when facing rapid topology changes or inner surfaces: does the cross-frame sharing risk washing out the rare cases Trellis2 handles well? the arxivlens breakdown helped me parse the method details, and it’s nice to see how they lay out the 3-stage Trellis2 conditioning on video: https://arxivlens.com/PaperView/Details/helix4d-complex-4d-mesh-generation-8456-0b762c53
Get this paper in your agent:
hf papers read 2605.26109
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash