RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

RayDer is a self-supervised novel view synthesis model that unifies camera estimation and view synthesis in a single transformer. Unlike prior self-supervised NVS approaches, which are bottlenecked by scarce static-scene data, RayDer is trained on general, dynamic real-world video — and its performance scales predictably with data, model size, and compute, following power-law relationships (R² > 0.99) analogous to those observed in LLMs.
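To make the scaling claim concrete, here is a minimal sketch of how a power-law fit and its R² can be computed by linear regression in log-log space. The data points are synthetic placeholders, not results from the paper, and `fit_power_law` is a hypothetical helper. The paper's exact fitting procedure may differ.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space.

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the linear fit log(y) = log(a) + b * log(x).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n

    # Closed-form simple linear regression on the log-transformed data.
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    log_a = mean_y - b * mean_x

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((v - (log_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - mean_y) ** 2 for v in ly)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(log_a), b, r2


# Synthetic placeholder data (NOT taken from the paper): a loss that
# follows loss = 2.0 * compute**(-0.05) exactly.
compute = [1e17, 1e18, 1e19, 1e20]
loss = [2.0 * c ** -0.05 for c in compute]

a, b, r2 = fit_power_law(compute, loss)
```

A power law appears as a straight line on log-log axes, so an R² close to 1 for this linear fit indicates that the measured points follow the power law closely.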

Paper and Abstract

The RayDer model was presented in the paper RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video.

Self-supervised novel view synthesis methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer's performance scales predictably with data, model size, and compute.

Existing approaches rely on scarce data sources: supervised NVS requires posed multi-view images, while prior self-supervised methods require unposed videos of static scenes. RayDer instead trains from generic unposed videos that may contain dynamic objects, enabling learning from the dominant form of visual data and unlocking improved scaling with dataset size.

A single transformer unifies camera estimation and novel view synthesis, replacing the three separate networks used by prior self-supervised NVS pipelines.

Usage

To integrate RayDer into your own codebase, copy rayder/model.py from the GitHub repository and instantiate the model as:

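import torch
from rayder.model import RayDer_L

# Build the RayDer-L architecture and load the released 576² checkpoint.
model = RayDer_L()
state_dict = torch.load("rayder_l_576.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)

# Inference only: freeze all parameters and switch to eval mode.
model.requires_grad_(False)
model.eval()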

The RayDer class exposes two high-level inference methods:

predict_cameras(x): estimate camera parameters from a set of input views (trained with 8 views; the models extrapolate reasonably well to other view counts).
predict_views(x_in, cam_in, cam_target): synthesize novel views at target camera poses (trained with 1–7 input views; supports arbitrarily many output views).

Images are channels-last, with shape (b, t, h, w, 3) and pixel values in [-1, 1]. Camera extrinsics use the camera-to-world (c2w) convention, and the focal length f is normalized by the shorter image side minus one: f = f_pixels / min(h-1, w-1).
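The conventions above can be sketched as two small helpers. Both names (`to_model_range`, `normalize_focal`) are hypothetical and not part of the RayDer codebase, and the linear mapping from 8-bit values in [0, 255] to [-1, 1] is an assumption. Only the focal-length formula is taken from the text above.

```python
def to_model_range(pixel_value):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1].

    Assumption: a plain linear rescaling; the repository's own
    preprocessing may differ.
    """
    return pixel_value / 127.5 - 1.0


def normalize_focal(f_pixels, h, w):
    """Normalize a focal length given in pixels, following
    f = f_pixels / min(h - 1, w - 1)."""
    return f_pixels / min(h - 1, w - 1)


# Example: a 256 x 320 image with a focal length of 255 pixels.
f = normalize_focal(255.0, 256, 320)
black = to_model_range(0)
white = to_model_range(255)
```

Because `to_model_range` uses only arithmetic, it also works elementwise on NumPy arrays or PyTorch tensors holding 8-bit image data.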

See the GitHub repository for generate_video.py (smooth view-interpolation videos from a set of input images) and app.py (Gradio demo).

Models

We currently release the following model variants:

| Variant | Width | Depth | Params | Resolution | File |
|---|---|---|---|---|---|
| RayDer-L | 1024 | 24 | ~743M | 256² | `rayder_l.pt` |
| RayDer-L-576² | 1024 | 24 | ~743M | 576² | `rayder_l_576.pt` |

Additional model variants and licensing options are available upon request.

License

This model is released under a license for personal and scientific non-commercial research purposes — see LICENSE.md for the full terms. For any commercial use or exploitation, please contact license.compvis@ifi.lmu.de.

Citation

If you find our model or code useful, please cite our paper:

@misc{prestel2026rayderscalableselfsupervisednovel,
    title={RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video}, 
    author={Ulrich Prestel and Stefan Andreas Baumann and Nick Stracke and Björn Ommer},
    year={2026},
}

Paper: RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video (2605.31535, published May 29)