
STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Yushi Lan1*  Yihang Luo1*  Fangzhou Hong1  Shangchen Zhou1  Honghua Chen1  Zhaoyang Lyu2  Shuai Yang3  Bo Dai4  Chen Change Loy1  Xingang Pan1

1 S-Lab, Nanyang Technological University; 2 Shanghai Artificial Intelligence Laboratory; 3 WICT, Peking University; 4 The University of Hong Kong

STream3R reformulates dense 3D reconstruction into a sequential registration task with causal attention.
⭐ Now supports FlashAttention, KV Cache, Causal Attention, Sliding Window Attention, and Full Attention!

:open_book: See more visual results on our project page

Abstract

We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.

:fire: News

  • [Sep 16, 2025] The complete training code is released!
  • [Aug 22, 2025] The evaluation code is now available!
  • [Aug 15, 2025] Our inference code and weights are released!

πŸ”§ Installation

  1. Clone Repo

    git clone https://github.com/NIRVANALAN/STream3R
    cd STream3R
    
  2. Create Conda Environment

    conda create -n stream3r python=3.11 cmake=3.14.0 -y
    conda activate stream3r
    
  3. Install Python Dependencies

    Important: Install PyTorch according to your CUDA version. For example, for PyTorch 2.8.0 + CUDA 12.6:

    # Install Torch
    pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
    
    # Install other dependencies
    pip install -r requirements.txt
    
    # Install STream3R as a package
    pip install -e .
    

:computer: Inference

You can now try STream3R with the following code. The checkpoint will be downloaded automatically from Hugging Face.

You can set the inference mode to causal for causal attention, window for sliding window attention (with a default window size of 5), or full for bidirectional attention.

import os
import torch
from stream3r.models.stream3r import STream3R
from stream3r.models.components.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

model = STream3R.from_pretrained("yslan/STream3R").to(device)

example_dir = "examples/static_room"
image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    # Choose one of "causal", "window", or "full" for a single forward pass
    predictions = model(images, mode="causal")
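To build intuition for the three modes, here is a small, framework-free sketch of the frame-level attention patterns they correspond to: causal lets frame i attend to frames 0..i, window restricts that to the most recent frames (window size 5 by default), and full is bidirectional. This is an illustration only, not the model's actual token-level implementation.

```python
def attention_mask(num_frames, mode, window=5):
    """Illustrative frame-level attention mask: mask[i][j] is True when
    frame i may attend to frame j. The real model applies attention at
    the token level; this sketch only shows the frame-visibility pattern."""
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        for j in range(num_frames):
            if mode == "full":        # bidirectional: every frame sees every frame
                mask[i][j] = True
            elif mode == "causal":    # frame i sees frames 0..i
                mask[i][j] = j <= i
            elif mode == "window":    # frame i sees only the last `window` frames
                mask[i][j] = i - window < j <= i
    return mask

causal = attention_mask(4, "causal")
# frame 2 attends to frames 0, 1, 2 but not frame 3
print([j for j in range(4) if causal[2][j]])  # [0, 1, 2]
```

The window pattern is what bounds memory in streaming use: visibility (and hence the KV cache) stops growing once the stream is longer than the window.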

We also support a KV-cache version that enables streaming input via StreamSession. StreamSession takes sequential inputs and processes them one by one, making it suitable for real-time or low-latency applications. This streaming 3D reconstruction pipeline can be applied in scenarios such as real-time robotics, autonomous navigation, online 3D understanding, and SLAM. An example usage is shown below:

import os
import torch
from stream3r.models.stream3r import STream3R
from stream3r.stream_session import StreamSession
from stream3r.models.components.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

model = STream3R.from_pretrained("yslan/STream3R").to(device)

example_dir = "examples/static_room"
image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
images = load_and_preprocess_images(image_names).to(device)
# StreamSession supports KV cache management for both "causal" and "window" modes.
session = StreamSession(model, mode="causal")

with torch.no_grad():
    # Process images one by one to simulate streaming inference
    for i in range(images.shape[0]):
        image = images[i : i + 1]
        predictions = session.forward_stream(image)
    session.clear()

:zap: Demo

You can run the demo, built on VGG-T's code, with the script app.py:

python app.py

πŸ“ Code Structure

The repository is structured as follows:

STream3R/
β”œβ”€β”€ stream3r/                    
β”‚   β”œβ”€β”€ models/                  
β”‚   β”‚   β”œβ”€β”€ stream3r.py            
β”‚   β”‚   β”œβ”€β”€ multiview_dust3r_module.py  
β”‚   β”‚   └── components/               
β”‚   β”œβ”€β”€ dust3r/                 
β”‚   β”œβ”€β”€ croco/                  
β”‚   β”œβ”€β”€ utils/                  
β”‚   └── stream_session.py          
β”œβ”€β”€ configs/                     
β”œβ”€β”€ examples/                    
β”œβ”€β”€ assets/                      
β”œβ”€β”€ app.py                          
β”œβ”€β”€ requirements.txt                 
β”œβ”€β”€ setup.py                        
└── README.md                       

:100: Quantitative Results

3D Reconstruction Comparison on NRGBD.

| Method | Type | Acc Mean ↓ | Acc Med. ↓ | Comp Mean ↓ | Comp Med. ↓ | NC Mean ↑ | NC Med. ↑ |
|---|---|---|---|---|---|---|---|
| VGG-T | FA | 0.073 | 0.018 | 0.077 | 0.021 | 0.910 | 0.990 |
| DUSt3R | Optim | 0.144 | 0.019 | 0.154 | 0.018 | 0.870 | 0.982 |
| MASt3R | Optim | 0.085 | 0.033 | 0.063 | 0.028 | 0.794 | 0.928 |
| MonST3R | Optim | 0.272 | 0.114 | 0.287 | 0.110 | 0.758 | 0.843 |
| Spann3R | Stream | 0.416 | 0.323 | 0.417 | 0.285 | 0.684 | 0.789 |
| CUT3R | Stream | 0.099 | 0.031 | 0.076 | 0.026 | 0.837 | 0.971 |
| StreamVGGT | Stream | 0.084 | 0.044 | 0.074 | 0.041 | 0.861 | 0.986 |
| Ours | Stream | 0.057 | 0.014 | 0.028 | 0.013 | 0.910 | 0.993 |

Read our full paper for more insights.

⏳ GPU Memory Usage and Runtime

We report the peak GPU memory usage (VRAM) and runtime of our full model for processing each streaming input using the StreamSession implementation. All experiments were conducted at a common resolution of 518 Γ— 384 on a single H200 GPU. The benchmark includes both Causal for causal attention and Window for sliding window attention with a window size of 5.

Run Time (s).

| Num of Frames | 1 | 20 | 40 | 80 | 100 | 120 | 140 | 180 | 200 |
|---|---|---|---|---|---|---|---|---|---|
| Causal | 0.1164 | 0.2034 | 0.3060 | 0.4986 | 0.5945 | 0.6947 | 0.7916 | 0.9911 | 1.1703 |
| Window | 0.1167 | 0.1528 | 0.1523 | 0.1517 | 0.1515 | 0.1512 | 0.1482 | 0.1443 | 0.1463 |

VRAM (GB).

| Num of Frames | 1 | 20 | 40 | 80 | 100 | 120 | 140 | 180 | 200 |
|---|---|---|---|---|---|---|---|---|---|
| Causal | 5.49 | 9.02 | 12.92 | 21.00 | 25.03 | 29.10 | 33.21 | 41.31 | 45.41 |
| Window | 5.49 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 |
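These numbers match how the KV cache grows in each mode: causal attention caches keys/values for every frame seen so far, so memory and per-step runtime grow roughly linearly with the stream, while the sliding window retains only the last 5 frames, so both plateau. A back-of-the-envelope sketch (frame counts only; actual per-frame token counts and bytes are not modeled here):

```python
def cached_frames(num_frames, mode, window=5):
    """Number of frames whose keys/values remain in the KV cache after
    processing `num_frames` streaming inputs (frame-count sketch only)."""
    if mode == "causal":
        return num_frames               # everything seen so far stays cached
    elif mode == "window":
        return min(num_frames, window)  # only the most recent frames are kept
    raise ValueError(f"unknown mode: {mode}")

for n in (1, 20, 200):
    print(n, cached_frames(n, "causal"), cached_frames(n, "window"))
# causal grows with the stream length, while window saturates at 5 frames,
# consistent with the flat ~6.5 GB Window row in the VRAM table.
```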

:hotsprings: Training

  1. Prepare Dataset

    We follow CUT3R to preprocess the dataset for training.

  2. Set Up Config

    Update training config file configs/experiment/stream3r/stream3r.yaml as needed. For example:

    • Set pretrained to the path of the VGG-T checkpoint.
    • Set data_root to the directory where you saved the processed dataset.
  3. Launch training with:

    python stream3r/train.py experiment=stream3r/stream3r
    
  4. After training, you can convert the checkpoint into a state_dict file, for example:

    from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict
    
    convert_zero_checkpoint_to_fp32_state_dict(
        checkpoint_dir="logs/stream3r/runs/stream3r_99999/checkpoints/000-00002000.ckpt",
        output_file="logs/stream3r/runs/stream3r_99999/checkpoints/last_aggregated.ckpt",
        tag=None
    )
    

πŸ“ˆ Evaluation

The evaluation protocol follows MonST3R, Spann3R, and CUT3R.

  1. Prepare Evaluation Dataset

    We follow the dataset preparation guides from MonST3R and Spann3R to prepare the datasets. For convenience, we provide the processed datasets on Hugging Face, which can be downloaded directly.

    The datasets should be organized as follows under the root directory of the project:

    data/
    β”œβ”€β”€ 7scenes
    β”œβ”€β”€ bonn
    β”œβ”€β”€ kitti
    β”œβ”€β”€ neural_rgbd
    β”œβ”€β”€ nyu-v2
    β”œβ”€β”€ scannetv2
    β”œβ”€β”€ sintel
    └── tum
    
  2. Run Evaluation

    Use the provided scripts to evaluate different tasks.

    For Video Depth and Camera Pose Estimation, some datasets contain more than 100 images. To reduce memory usage, we use StreamSession to process frames sequentially while managing the KV cache.

    Monodepth

    bash eval/monodepth/run.sh
    

    Results will be saved in eval_results/monodepth/${model_name}/${data}/metric.json.

    Video Depth

    bash eval/video_depth/run.sh
    

    Results will be saved in eval_results/video_depth/${model_name}/${data}/result_scale.json.

    Camera Pose Estimation

    bash eval/relpose/run.sh
    

    Results will be saved in eval_results/relpose/${model_name}/${data}/_error_log.txt.

    Multi-view Reconstruction

    bash eval/mv_recon/run.sh
    

    Results will be saved in eval_results/mv_recon/${model_name}/${data}/logs_all.txt.

:calendar: TODO

  • [x] Release evaluation code.
  • [x] Release training code.
  • [ ] Release the metric-scale version.

:page_with_curl: License

This project is licensed under NTU S-Lab License 1.0. Redistribution and use should follow this license.

:pencil: Citation

If you find our code or paper helpful, please consider citing:

@article{stream3r2025,
  title={STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer},
  author={Lan, Yushi and Luo, Yihang and Hong, Fangzhou and Zhou, Shangchen and Chen, Honghua and Lyu, Zhaoyang and Yang, Shuai and Dai, Bo and Loy, Chen Change and Pan, Xingang},
  journal={arXiv preprint arXiv:2508.10893},
  year={2025}
}

:pencil: Acknowledgments

We recognize several concurrent works on streaming methods. We encourage you to check them out:

StreamVGGT  |  CUT3R  |  SLAM3R  |  Spann3R

STream3R is built on the shoulders of several outstanding open-source projects. Many thanks to the following exceptional projects:

VGG-T  |  Fast3R  |  DUSt3R  |  MonST3R  |  Viser

:mailbox: Contact

If you have any questions, please feel free to contact us at lanyushi15@gmail.com or via GitHub issues.