RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
RayDer is a self-supervised novel view synthesis model that unifies camera estimation and view synthesis in a single transformer. Unlike prior self-supervised NVS approaches, which are bottlenecked by scarce static-scene data, RayDer is trained on general, dynamic real-world video — and its performance scales predictably with data, model size, and compute, following power-law relationships (R² > 0.99) analogous to those observed in LLMs.
Paper and Abstract
The RayDer model was presented in the paper RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video.
Self-supervised novel view synthesis methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer achieves performance that scales predictably with data, model size, and compute.
Existing approaches rely on scarce data sources: supervised NVS requires posed multi-view images, while prior self-supervised methods require unposed videos of static scenes. RayDer instead trains from generic unposed videos that may contain dynamic objects, enabling learning from the dominant form of visual data and unlocking improved scaling with dataset size.
A single transformer unifies camera estimation and novel view synthesis, replacing the three separate networks used by prior self-supervised NVS pipelines.
Usage
To integrate RayDer into your own codebase, copy rayder/model.py from the GitHub repository and instantiate the model as:
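```python
import torch
from rayder.model import RayDer_L

# Build the model, then load the released checkpoint.
# weights_only=True restricts unpickling to tensors and plain containers.
model = RayDer_L()
model.load_state_dict(torch.load("rayder_l_576.pt", weights_only=True))

# Inference only: freeze all parameters and switch to eval mode.
model.requires_grad_(False)
model.eval()
```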
The RayDer class exposes two high-level inference methods:
- predict_cameras(x): estimate camera parameters from a set of input views (trained for 8 views, but the models extrapolate quite well).
- predict_views(x_in, cam_in, cam_target): synthesize novel views at target camera poses (trained for 1–7 input views, arbitrarily many output views).
Images are channels-last (b, t, h, w, 3) with pixel values in [-1, 1]. Camera extrinsics use the camera-to-world (c2w) convention, and the focal length f is normalized by the shorter image side (f = f_pixels / min(h-1, w-1)).
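The conventions above can be sketched as a few NumPy helpers. This is a minimal illustration, not part of the RayDer codebase: the function names `to_model_range`, `normalize_focal`, and `w2c_to_c2w` are hypothetical, and the sketch assumes that source frames are uint8 in [0, 255] and that extrinsics are 4x4 rigid transforms. How the resulting values are packed into the camera arguments of predict_views is not covered here.

```python
import numpy as np


def to_model_range(frames_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 frames of shape (b, t, h, w, 3) from [0, 255] to float32 in [-1, 1]."""
    if frames_uint8.ndim != 5 or frames_uint8.shape[-1] != 3:
        raise ValueError(f"expected (b, t, h, w, 3), got {frames_uint8.shape}")
    return frames_uint8.astype(np.float32) / 127.5 - 1.0


def normalize_focal(f_pixels: float, h: int, w: int) -> float:
    """Normalize a pixel focal length by the shorter side: f_pixels / min(h-1, w-1)."""
    return f_pixels / min(h - 1, w - 1)


def w2c_to_c2w(w2c: np.ndarray) -> np.ndarray:
    """Invert rigid 4x4 world-to-camera matrices into camera-to-world matrices.

    Assumes each matrix is [R | t; 0 0 0 1] with R a rotation, so the
    inverse is [R^T | -R^T t; 0 0 0 1]. Leading batch dimensions are kept.
    """
    w2c = np.asarray(w2c, dtype=np.float64)
    rot_t = np.swapaxes(w2c[..., :3, :3], -1, -2)  # R^T
    trans = w2c[..., :3, 3:]                       # t as a column vector
    c2w = np.zeros(w2c.shape, dtype=np.float64)
    c2w[..., :3, :3] = rot_t
    c2w[..., :3, 3:] = -rot_t @ trans
    c2w[..., 3, 3] = 1.0
    return c2w
```

If a dataset stores world-to-camera extrinsics, `w2c_to_c2w` would be applied before passing poses to the model; the arrays can then be converted to tensors with `torch.from_numpy`.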
See the GitHub repository for generate_video.py (smooth view-interpolation videos from a set of input images) and app.py (Gradio demo).
Models
We currently release the following model variants:
| Variant | Width | Depth | Params | Resolution | File |
|---|---|---|---|---|---|
| RayDer-L | 1024 | 24 | ~743M | 256² | rayder_l.pt |
| RayDer-L-576² | 1024 | 24 | ~743M | 576² | rayder_l_576.pt |
Additional model variants and licensing options are available upon request.
License
This model is released under a license for personal and scientific non-commercial research purposes — see LICENSE.md for the full terms. For any commercial use or exploitation, please contact license.compvis@ifi.lmu.de.
Citation
If you find our model or code useful, please cite our paper:
@misc{prestel2026rayderscalableselfsupervisednovel,
title={RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video},
author={Ulrich Prestel and Stefan Andreas Baumann and Nick Stracke and Björn Ommer},
year={2026},
}