brian4dwell committed on
Commit
594b88c
·
1 Parent(s): 5b0c756

add STream3r

Files changed (2)
  1. README.md +342 -12
  2. app.py +761 -4
README.md CHANGED
@@ -1,12 +1,342 @@
- ---
- title: Dwellbot Stream3r
- emoji: 📚
- colorFrom: blue
- colorTo: red
- sdk: gradio
- sdk_version: 5.46.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <div align="center">
+ <h1>
+ STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
+ </h1>
+ </div>
+
+ <div align="center">
+ <h4>
+ <a href="https://nirvanalan.github.io/projects/stream3r" target='_blank'>
+ <img src="https://img.shields.io/badge/🐳-Project%20Page-blue">
+ </a>
+ <a href="https://arxiv.org/abs/2508.10893" target='_blank'>
+ <img src="https://img.shields.io/badge/arXiv-2508.10893-b31b1b.svg">
+ </a>
+ <img src="https://visitor-badge.laobi.icu/badge?page_id=yhluo.STream3R">
+ </h4>
+ <div>
+ <a href='https://nirvanalan.github.io/' target='_blank'>Yushi Lan</a><sup>1*</sup>&emsp;
+ <a href='https://scholar.google.com/citations?user=fZxK2B0AAAAJ&hl' target='_blank'>Yihang Luo</a><sup>1*</sup>&emsp;
+ <a href='https://hongfz16.github.io' target='_blank'>Fangzhou Hong</a><sup>1</sup>&emsp;
+ <a href='https://shangchenzhou.com/' target='_blank'>Shangchen Zhou</a><sup>1</sup>&emsp;
+ <a href='https://chenhonghua.github.io/clay.github.io/' target='_blank'>Honghua Chen</a><sup>1</sup>&emsp;
+ <br>
+ <a href='https://zhaoyanglyu.github.io/' target='_blank'>Zhaoyang Lyu</a><sup>2</sup>&emsp;
+ <a href='https://williamyang1991.github.io/' target='_blank'>Shuai Yang</a><sup>3</sup>&emsp;
+ <a href='https://daibo.info/' target='_blank'>Bo Dai</a><sup>4</sup>&emsp;
+ <a href='https://www.mmlab-ntu.com/person/ccloy/' target='_blank'>Chen Change Loy</a><sup>1</sup>&emsp;
+ <a href='https://xingangpan.github.io/' target='_blank'>Xingang Pan</a><sup>1</sup>
+ </div>
+ <div>
+ S-Lab, Nanyang Technological University<sup>1</sup>;
+ <!-- &emsp; -->
+ <br>
+ Shanghai Artificial Intelligence Laboratory<sup>2</sup>;
+ WICT, Peking University<sup>3</sup>;
+ The University of Hong Kong<sup>4</sup>
+ <!-- <br>
+ <sup>*</sup>corresponding author -->
+ </div>
+ </div>
+
+ <br>
+
+ <div align="center">
+ <p>
+ <span style="font-variant: small-caps;"><strong>STream3R</strong></span> reformulates dense 3D reconstruction as a sequential registration task with causal attention.
+ <br>
+ <i>⭐ Now supports <b>FlashAttention</b>, <b>KV Cache</b>, <b>Causal Attention</b>, <b>Sliding Window Attention</b>, and <b>Full Attention</b>!</i>
+ </p>
+ <img width="820" alt="pipeline" src="assets/teaser_dynamic.gif">
+ :open_book: See more visual results on our <a href="https://nirvanalan.github.io/projects/stream3r" target="_blank">project page</a>
+ </div>
+
+ <br>
+
+ <details>
+ <summary><b>Abstract</b></summary>
+ <br>
+ <div align="center">
+ <img width="820" alt="pipeline" src="assets/pipeline.png">
+ <p align="justify">
+ We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.
+ </p>
+ </div>
+ </details>
+
+
+ ## :fire: News
+
+ - [Sep 16, 2025] The complete training code is released!
+ - [Aug 22, 2025] The evaluation code is now available!
+ - [Aug 15, 2025] Our inference code and weights are released!
+
+
+ ## 🔧 Installation
+
+ 1. Clone Repo
+ ```bash
+ git clone https://github.com/NIRVANALAN/STream3R
+ cd STream3R
+ ```
+
+ 2. Create Conda Environment
+ ```bash
+ conda create -n stream3r python=3.11 cmake=3.14.0 -y
+ conda activate stream3r
+ ```
+ 3. Install Python Dependencies
+
+ **Important:** Install [Torch](https://pytorch.org/get-started/locally/) based on your CUDA version. For example, for *Torch 2.8.0 + CUDA 12.6*:
+
+ ```bash
+ # Install Torch
+ pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
+
+ # Install other dependencies
+ pip install -r requirements.txt
+
+ # Install STream3R as a package
+ pip install -e .
+ ```
+
+ ## :computer: Inference
+
+ You can now try STream3R with the following code. The checkpoint will be downloaded automatically from [Hugging Face](https://huggingface.co/yslan/STream3R).
+
+ You can set the inference mode to `causal` for causal attention, `window` for sliding window attention (with a default window size of 5), or `full` for bidirectional attention.
+
+ ```python
+ import os
+ import torch
+ from stream3r.models.stream3r import STream3R
+ from stream3r.models.components.utils.load_fn import load_and_preprocess_images
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = STream3R.from_pretrained("yslan/STream3R").to(device)
+
+ example_dir = "examples/static_room"
+ image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
+ images = load_and_preprocess_images(image_names).to(device)
+
+ with torch.no_grad():
+     # Use one mode ("causal", "window", or "full") in a single forward pass
+     predictions = model(images, mode="causal")
+ ```
+
+ We also support a KV cache version that enables streaming input via `StreamSession`. A `StreamSession` takes sequential inputs and processes them one by one, making it suitable for real-time or low-latency applications. This streaming 3D reconstruction pipeline can be applied in scenarios such as real-time robotics, autonomous navigation, online 3D understanding, and SLAM. Example usage:
+
+ ```python
+ import os
+ import torch
+ from stream3r.models.stream3r import STream3R
+ from stream3r.stream_session import StreamSession
+ from stream3r.models.components.utils.load_fn import load_and_preprocess_images
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = STream3R.from_pretrained("yslan/STream3R").to(device)
+
+ example_dir = "examples/static_room"
+ image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
+ images = load_and_preprocess_images(image_names).to(device)
+ # StreamSession supports KV cache management for both "causal" and "window" modes.
+ session = StreamSession(model, mode="causal")
+
+ with torch.no_grad():
+     # Process images one by one to simulate streaming inference
+     for i in range(images.shape[0]):
+         image = images[i : i + 1]
+         predictions = session.forward_stream(image)
+ session.clear()
+ ```
+
+ ## :zap: Demo
+ You can run the demo built on [VGG-T's code](https://github.com/facebookresearch/vggt) using the script [`app.py`](app.py) with the following command:
+
+ ```sh
+ python app.py
+ ```
+
+ ## 📁 Code Structure
+
+ The repository is structured as follows:
+
+ ```
+ STream3R/
+ ├── stream3r/
+ │   ├── models/
+ │   │   ├── stream3r.py
+ │   │   ├── multiview_dust3r_module.py
+ │   │   └── components/
+ │   ├── dust3r/
+ │   ├── croco/
+ │   ├── utils/
+ │   └── stream_session.py
+ ├── configs/
+ ├── examples/
+ ├── assets/
+ ├── app.py
+ ├── requirements.txt
+ ├── setup.py
+ └── README.md
+ ```
+
+ ## :100: Quantitative Results
+
+ *3D Reconstruction Comparison on NRGBD.*
+
+ | Method | Type | Acc Mean ↓ | Acc Med. ↓ | Comp Mean ↓ | Comp Med. ↓ | NC Mean ↑ | NC Med. ↑ |
+ |---------------------|----------|------------|------------|-------------|-------------|-----------|-----------|
+ | VGG-T | FA | 0.073 | 0.018 | 0.077 | 0.021 | 0.910 | 0.990 |
+ | DUSt3R | Optim | 0.144 | 0.019 | 0.154 | 0.018 | 0.870 | 0.982 |
+ | MASt3R | Optim | 0.085 | 0.033 | 0.063 | 0.028 | 0.794 | 0.928 |
+ | MonST3R | Optim | 0.272 | 0.114 | 0.287 | 0.110 | 0.758 | 0.843 |
+ | Spann3R | Stream | 0.416 | 0.323 | 0.417 | 0.285 | 0.684 | 0.789 |
+ | CUT3R | Stream | 0.099 | 0.031 | 0.076 | 0.026 | 0.837 | 0.971 |
+ | StreamVGGT | Stream | 0.084 | 0.044 | 0.074 | 0.041 | 0.861 | 0.986 |
+ | Ours | Stream | **0.057** | **0.014** | **0.028** | **0.013** | **0.910** | **0.993** |
+
+ Read our [full paper](https://arxiv.org/abs/2508.10893) for more insights.
+
+ ## ⏳ GPU Memory Usage and Runtime
+
+ We report the peak GPU memory usage (VRAM) and the runtime of our full model for processing each streaming input using the `StreamSession` implementation. All experiments were conducted at a common resolution of 518 × 384 on a single H200 GPU. The benchmark includes both *Causal* for causal attention and *Window* for sliding window attention with a window size of 5.
+
+ *Runtime (s).*
+ | Num of Frames | 1 | 20 | 40 | 80 | 100 | 120 | 140 | 180 | 200 |
+ |-----------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
+ | Causal | 0.1164 | 0.2034 | 0.3060 | 0.4986 | 0.5945 | 0.6947 | 0.7916 | 0.9911 | 1.1703 |
+ | Window | 0.1167 | 0.1528 | 0.1523 | 0.1517 | 0.1515 | 0.1512 | 0.1482 | 0.1443 | 0.1463 |
+
+ *VRAM (GB).*
+ | Num of Frames | 1 | 20 | 40 | 80 | 100 | 120 | 140 | 180 | 200 |
+ |-----------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
+ | Causal | 5.49 | 9.02 | 12.92 | 21.00 | 25.03 | 29.10 | 33.21 | 41.31 | 45.41 |
+ | Window | 5.49 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 | 6.53 |
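
The per-frame latencies above were collected around each streaming step. A minimal timing harness along these lines can reproduce the measurement pattern; this sketch uses a dummy `process_frame` stand-in rather than the real `session.forward_stream`, and omits the CUDA synchronization and `torch.cuda.max_memory_allocated()` bookkeeping a real GPU benchmark would need:

```python
import time

def benchmark_stream(process_frame, frames):
    """Time each per-frame call; return a list of latencies in seconds."""
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)  # e.g. session.forward_stream(frame) on GPU
        latencies.append(time.perf_counter() - start)
    return latencies

# Dummy stand-in for a per-frame model call, just to show the shape of the loop
latencies = benchmark_stream(lambda frame: sum(frame), [[1, 2], [3, 4], [5, 6]])
print(len(latencies))  # one latency measurement per frame
```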
+
+
+ ## :hotsprings: Training
+
+ 1. Prepare Dataset
+
+ We follow [CUT3R](https://github.com/CUT3R/CUT3R/blob/main/docs/preprocess.md) to preprocess the dataset for training.
+
+ 2. Set Up Config
+
+ Update the training config file `configs/experiment/stream3r/stream3r.yaml` as needed. For example:
+ - Set `pretrained` to the path of the [VGG-T checkpoint](https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt).
+ - Set `data_root` to the directory where you saved the processed dataset.
+
+ 3. Launch training with:
+ ```bash
+ python stream3r/train.py experiment=stream3r/stream3r
+ ```
+
+ 4. After training, you can convert the checkpoint into a `state_dict` file, for example:
+ ```python
+ from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict
+
+ convert_zero_checkpoint_to_fp32_state_dict(
+     checkpoint_dir="logs/stream3r/runs/stream3r_99999/checkpoints/000-00002000.ckpt",
+     output_file="logs/stream3r/runs/stream3r_99999/checkpoints/last_aggregated.ckpt",
+     tag=None,
+ )
+ ```
+
+ ## 📈 Evaluation
+
+ The evaluation follows [MonST3R](https://github.com/Junyi42/monst3r), [Spann3R](https://github.com/HengyiWang/spann3r), and [CUT3R](https://github.com/CUT3R/CUT3R).
+
+ 1. Prepare Evaluation Dataset
+
+ We follow the dataset preparation guides from [MonST3R](https://github.com/Junyi42/monst3r/blob/main/data/evaluation_script.md) and [Spann3R](https://github.com/HengyiWang/spann3r/blob/main/docs/data_preprocess.md) to prepare the datasets. For convenience, we provide the processed datasets on [Hugging Face](https://huggingface.co/datasets/yslan/pointmap_regression_evalsets), which can be downloaded directly.
+
+ The datasets should be organized as follows under the root directory of the project:
+ ```
+ data/
+ ├── 7scenes
+ ├── bonn
+ ├── kitti
+ ├── neural_rgbd
+ ├── nyu-v2
+ ├── scannetv2
+ ├── sintel
+ └── tum
+ ```
+
+ 2. Run Evaluation
+
+ Use the provided scripts to evaluate the different tasks.
+
+ *For Video Depth and Camera Pose Estimation, some datasets contain more than 100 images. To reduce memory usage, we use `StreamSession` to process frames sequentially while managing the KV cache.*
+
+ ### Monodepth
+
+ ```bash
+ bash eval/monodepth/run.sh
+ ```
+ Results will be saved in `eval_results/monodepth/${model_name}/${data}/metric.json`.
+
+ ### Video Depth
+
+ ```bash
+ bash eval/video_depth/run.sh
+ ```
+ Results will be saved in `eval_results/video_depth/${model_name}/${data}/result_scale.json`.
+
+ ### Camera Pose Estimation
+
+ ```bash
+ bash eval/relpose/run.sh
+ ```
+ Results will be saved in `eval_results/relpose/${model_name}/${data}/_error_log.txt`.
+
+ ### Multi-view Reconstruction
+
+ ```bash
+ bash eval/mv_recon/run.sh
+ ```
+ Results will be saved in `eval_results/mv_recon/${model_name}/${data}/logs_all.txt`.
+
+
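
Since each task writes one result file per dataset, it can be handy to collect them into a single summary. A small sketch along these lines works for the `metric.json` layout above (assumptions: the exact directory layout and metric keys depend on what your `run.sh` actually writes, and the `abs_rel` value in the demo below is made up):

```python
import json
import os
import tempfile
from glob import glob

def collect_metrics(root):
    """Gather <root>/<dataset>/metric.json files into a {dataset: metrics} dict."""
    results = {}
    for path in sorted(glob(os.path.join(root, "*", "metric.json"))):
        dataset = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            results[dataset] = json.load(f)
    return results

# Demo with a synthetic layout and a hypothetical metric name
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sintel"))
with open(os.path.join(root, "sintel", "metric.json"), "w") as f:
    json.dump({"abs_rel": 0.23}, f)
print(collect_metrics(root))  # {'sintel': {'abs_rel': 0.23}}
```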
+ ## :calendar: TODO
+
+ - [x] Release evaluation code.
+ - [x] Release training code.
+ - [ ] Release the metric-scale version.
+
+
+ ## :page_with_curl: License
+
+ This project is licensed under the <a rel="license" href="./LICENSE">NTU S-Lab License 1.0</a>. Redistribution and use should follow this license.
+
+ ## :pencil: Citation
+
+ If you find our code or paper helpful, please consider citing:
+
+ ```bibtex
+ @article{stream3r2025,
+   title={STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer},
+   author={Lan, Yushi and Luo, Yihang and Hong, Fangzhou and Zhou, Shangchen and Chen, Honghua and Lyu, Zhaoyang and Yang, Shuai and Dai, Bo and Loy, Chen Change and Pan, Xingang},
+   journal={arXiv preprint arXiv:2508.10893},
+   year={2025}
+ }
+ ```
+ ## :pencil: Acknowledgments
+ We recognize several concurrent works on streaming methods. We encourage you to check them out:
+
+ [StreamVGGT](https://github.com/wzzheng/StreamVGGT) &nbsp;|&nbsp; [CUT3R](https://github.com/CUT3R/CUT3R) &nbsp;|&nbsp; [SLAM3R](https://github.com/PKU-VCL-3DV/SLAM3R) &nbsp;|&nbsp; [Spann3R](https://github.com/HengyiWang/spann3r)
+
+ STream3R is built on the shoulders of several outstanding open-source projects. Many thanks to the following exceptional projects:
+
+ [VGG-T](https://github.com/facebookresearch/vggt) &nbsp;|&nbsp; [Fast3R](https://github.com/facebookresearch/fast3r) &nbsp;|&nbsp; [DUSt3R](https://github.com/naver/dust3r) &nbsp;|&nbsp; [MonST3R](https://github.com/Junyi42/monst3r) &nbsp;|&nbsp; [Viser](https://github.com/nerfstudio-project/viser)
+
+ ## :mailbox: Contact
+ If you have any questions, please feel free to contact us via `lanyushi15@gmail.com` or GitHub issues.
app.py CHANGED
@@ -1,7 +1,764 @@
  import gradio as gr
-
- def greet(name):
-     return "Hello " + name + "!!"
-
- demo = gr.Interface(fn=greet, inputs="text", outputs="text")
- demo.launch()
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ import os
+ import cv2
+ import torch
+ import numpy as np
  import gradio as gr
+ import shutil
+ from datetime import datetime
+ import glob
+ import gc
+ import time
+
+ from stream3r.models.stream3r import STream3R
+ from stream3r.stream_session import StreamSession
+ from stream3r.models.components.utils.load_fn import load_and_preprocess_images
+ from stream3r.models.components.utils.pose_enc import pose_encoding_to_extri_intri
+ from stream3r.models.components.utils.geometry import unproject_depth_map_to_point_map
+ from stream3r.utils.visual_utils import predictions_to_glb
+
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = STream3R.from_pretrained("yslan/STream3R")
+
+ # -------------------------------------------------------------------------
+ # 1) Core model inference
+ # -------------------------------------------------------------------------
+ def run_model(target_dir: str, model: STream3R, mode: str = "causal", streaming: bool = False) -> dict:
+     """
+     Run the STream3R model on images in the 'target_dir/images' folder and return predictions.
+
+     Args:
+         target_dir: Directory containing the images subfolder
+         model: STream3R model instance
+         mode: Processing mode ("causal", "window", or "full")
+         streaming: If True, use StreamSession for sequential processing; if False, use batch processing
+     """
+     print(f"Processing images from {target_dir}")
+
+     # Device check
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     if not torch.cuda.is_available():
+         raise ValueError("CUDA is not available. Check your environment.")
+
+     # Move model to device
+     model = model.to(device)
+     model.eval()
+
+     # Load and preprocess images
+     image_names = glob.glob(os.path.join(target_dir, "images", "*"))
+     image_names = sorted(image_names)
+     print(f"Found {len(image_names)} images")
+     if len(image_names) == 0:
+         raise ValueError("No images found. Check your upload.")
+
+     images = load_and_preprocess_images(image_names).to(device)
+     print(f"Preprocessed images shape: {images.shape}")
+
+     # Run inference
+     print(f"Running inference in {'streaming' if streaming else 'batch'} mode...")
+     dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
+
+     with torch.no_grad():
+         with torch.amp.autocast(dtype=dtype, device_type=device):
+             if streaming:
+                 # Use StreamSession for sequential processing
+                 if mode == "full":
+                     print("Warning: Streaming mode does not support 'full' attention mode. Switching to 'causal' mode.")
+                     mode = "causal"
+
+                 session = StreamSession(model, mode=mode)
+
+                 # Process images one by one to simulate streaming inference
+                 for i in range(images.shape[0]):
+                     image = images[i : i + 1]
+                     predictions = session.forward_stream(image)
+
+                 session.clear()
+             else:
+                 # Use batch processing (original behavior)
+                 predictions = model(images, mode=mode)
+
+     # Convert pose encoding to extrinsic and intrinsic matrices
+     print("Converting pose encoding to extrinsic and intrinsic matrices...")
+     extrinsic, intrinsic = pose_encoding_to_extri_intri(predictions["pose_enc"], images.shape[-2:])
+     predictions["extrinsic"] = extrinsic
+     predictions["intrinsic"] = intrinsic
+
+     # Convert tensors to numpy
+     for key in predictions.keys():
+         if isinstance(predictions[key], torch.Tensor):
+             predictions[key] = predictions[key].cpu().numpy().squeeze(0)  # remove batch dimension
+     predictions["pose_enc_list"] = None  # remove pose_enc_list
+
+     # Generate world points from depth map
+     print("Computing world points from depth map...")
+     depth_map = predictions["depth"]  # (S, H, W, 1)
+     world_points = unproject_depth_map_to_point_map(depth_map, predictions["extrinsic"], predictions["intrinsic"])
+     predictions["world_points_from_depth"] = world_points
+
+     # Clean up
+     torch.cuda.empty_cache()
+     return predictions
+
+
+ # -------------------------------------------------------------------------
+ # 2) Handle uploaded video/images --> produce target_dir + images
+ # -------------------------------------------------------------------------
+ def handle_uploads(input_video, input_images):
+     """
+     Create a new 'target_dir' + 'images' subfolder, and place user-uploaded
+     images or extracted frames from video into it. Return (target_dir, image_paths).
+     """
+     start_time = time.time()
+     gc.collect()
+     torch.cuda.empty_cache()
+
+     # Create a unique folder name
+     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
+     target_dir = os.path.join("demo_cache", f"input_images_{timestamp}")
+     target_dir_images = os.path.join(target_dir, "images")
+
+     # Clean up if somehow that folder already exists
+     if os.path.exists(target_dir):
+         shutil.rmtree(target_dir)
+     os.makedirs(target_dir)
+     os.makedirs(target_dir_images)
+
+     image_paths = []
+
+     # --- Handle images ---
+     if input_images is not None:
+         for file_data in input_images:
+             if isinstance(file_data, dict) and "name" in file_data:
+                 file_path = file_data["name"]
+             else:
+                 file_path = file_data
+             dst_path = os.path.join(target_dir_images, os.path.basename(file_path))
+             shutil.copy(file_path, dst_path)
+             image_paths.append(dst_path)
+
+     # --- Handle video ---
+     if input_video is not None:
+         if isinstance(input_video, dict) and "name" in input_video:
+             video_path = input_video["name"]
+         else:
+             video_path = input_video
+
+         vs = cv2.VideoCapture(video_path)
+         fps = vs.get(cv2.CAP_PROP_FPS)
+         frame_interval = max(int(fps), 1)  # about 1 frame/sec; guard against fps == 0
+
+         count = 0
+         video_frame_num = 0
+         while True:
+             gotit, frame = vs.read()
+             if not gotit:
+                 break
+             count += 1
+             if count % frame_interval == 0:
+                 image_path = os.path.join(target_dir_images, f"{video_frame_num:06}.png")
+                 cv2.imwrite(image_path, frame)
+                 image_paths.append(image_path)
+                 video_frame_num += 1
+
+     # Sort final images for gallery
+     image_paths = sorted(image_paths)
+
+     end_time = time.time()
+     print(f"Files copied to {target_dir_images}; took {end_time - start_time:.3f} seconds")
+     return target_dir, image_paths
+
+
+ # -------------------------------------------------------------------------
+ # 3) Update gallery on upload
+ # -------------------------------------------------------------------------
+ def update_gallery_on_upload(input_video, input_images):
+     """
+     Whenever the user uploads or changes files, handle them immediately and
+     show them in the gallery. Returns (reconstruction, target_dir, image_paths, log).
+     If nothing is uploaded, returns all Nones.
+     """
+     if not input_video and not input_images:
+         return None, None, None, None
+     target_dir, image_paths = handle_uploads(input_video, input_images)
+     return None, target_dir, image_paths, "Upload complete. Click 'Reconstruct' to begin 3D processing."
+
+
+ # -------------------------------------------------------------------------
+ # 4) Reconstruction: uses the target_dir plus any viz parameters
+ # -------------------------------------------------------------------------
+ def gradio_demo(
+     target_dir,
+     conf_thres=3.0,
+     frame_filter="All",
+     mask_black_bg=False,
+     mask_white_bg=False,
+     show_cam=True,
+     mask_sky=False,
+     prediction_mode="Pointmap Regression",
+     mode="causal",
+     streaming=False,
+ ):
+     """
+     Perform reconstruction using the already-created target_dir/images.
+     """
+     if not os.path.isdir(target_dir) or target_dir == "None":
+         return None, "No valid target directory found. Please upload first.", None
+
+     start_time = time.time()
+     gc.collect()
+     torch.cuda.empty_cache()
+
+     # Prepare frame_filter dropdown
+     target_dir_images = os.path.join(target_dir, "images")
+     all_files = sorted(os.listdir(target_dir_images)) if os.path.isdir(target_dir_images) else []
+     all_files = [f"{i}: {filename}" for i, filename in enumerate(all_files)]
+     frame_filter_choices = ["All"] + all_files
+
+     print("Running run_model...")
+     with torch.no_grad():
+         predictions = run_model(target_dir, model, mode=mode, streaming=streaming)
+
+     # Save predictions
+     prediction_save_path = os.path.join(target_dir, "predictions.npz")
+     np.savez(prediction_save_path, **predictions)
+
+     # Handle None frame_filter
+     if frame_filter is None:
+         frame_filter = "All"
+
+     # Build a GLB file name
+     glbfile = os.path.join(
+         target_dir,
+         f"glbscene_{conf_thres}_{frame_filter.replace('.', '_').replace(':', '').replace(' ', '_')}_maskb{mask_black_bg}_maskw{mask_white_bg}_cam{show_cam}_sky{mask_sky}_pred{prediction_mode.replace(' ', '_')}_mode{mode}.glb",
+     )
+
+     # Convert predictions to GLB
+     glbscene = predictions_to_glb(
+         predictions,
+         conf_thres=conf_thres,
+         filter_by_frames=frame_filter,
+         mask_black_bg=mask_black_bg,
+         mask_white_bg=mask_white_bg,
+         show_cam=show_cam,
+         mask_sky=mask_sky,
+         target_dir=target_dir,
+         prediction_mode=prediction_mode,
+     )
+     glbscene.export(file_obj=glbfile)
+
+     # Cleanup
+     del predictions
+     gc.collect()
+     torch.cuda.empty_cache()
+
+     end_time = time.time()
+     print(f"Total time: {end_time - start_time:.2f} seconds (including IO)")
+     log_msg = f"Reconstruction Success ({len(all_files)} frames). Waiting for visualization."
+
+     return glbfile, log_msg, gr.Dropdown(choices=frame_filter_choices, value=frame_filter, interactive=True)
+
+
+ # -------------------------------------------------------------------------
+ # 5) Helper functions for UI resets + re-visualization
+ # -------------------------------------------------------------------------
+ def clear_fields():
+     """
+     Clears the 3D viewer, the stored target_dir, and empties the gallery.
+     """
+     return None
+
+
+ def update_log():
+     """
+     Display a quick log message while waiting.
+     """
+     return "Loading and Reconstructing..."
+
+
+ def update_visualization(
+     target_dir, conf_thres, frame_filter, mask_black_bg, mask_white_bg, show_cam, mask_sky, prediction_mode, is_example
+ ):
+     """
+     Reload saved predictions from npz, create (or reuse) the GLB for new parameters,
+     and return it for the 3D viewer. If is_example == "True", skip.
+     """
+
+     # If it's an example click, skip as requested
+     if is_example == "True":
+         return None, "No reconstruction available. Please click the Reconstruct button first."
+
+     if not target_dir or target_dir == "None" or not os.path.isdir(target_dir):
+         return None, "No reconstruction available. Please click the Reconstruct button first."
+
+     predictions_path = os.path.join(target_dir, "predictions.npz")
+     if not os.path.exists(predictions_path):
+         return None, f"No reconstruction available at {predictions_path}. Please run 'Reconstruct' first."
+
+     key_list = [
+         "pose_enc",
+         "depth",
+         "depth_conf",
+         "world_points",
+         "world_points_conf",
+         "images",
+         "extrinsic",
+         "intrinsic",
+         "world_points_from_depth",
+     ]
+
+     loaded = np.load(predictions_path)
+     predictions = {key: np.array(loaded[key]) for key in key_list}
+
+     # Note: no `mode` parameter here, so it is not part of the cached GLB name
+     glbfile = os.path.join(
+         target_dir,
+         f"glbscene_{conf_thres}_{frame_filter.replace('.', '_').replace(':', '').replace(' ', '_')}_maskb{mask_black_bg}_maskw{mask_white_bg}_cam{show_cam}_sky{mask_sky}_pred{prediction_mode.replace(' ', '_')}.glb",
+     )
+
+     if not os.path.exists(glbfile):
+         glbscene = predictions_to_glb(
+             predictions,
+             conf_thres=conf_thres,
+             filter_by_frames=frame_filter,
+             mask_black_bg=mask_black_bg,
+             mask_white_bg=mask_white_bg,
+             show_cam=show_cam,
+             mask_sky=mask_sky,
+             target_dir=target_dir,
+             prediction_mode=prediction_mode,
+         )
+         glbscene.export(file_obj=glbfile)
+
+     return glbfile, "Updating Visualization"
+
341
+
342
+ # -------------------------------------------------------------------------
343
+ # Example images
344
+ # -------------------------------------------------------------------------
345
+
346
+ great_wall_video = "examples/videos/great_wall.mp4"
347
+ colosseum_video = "examples/videos/Colosseum.mp4"
348
+ room_video = "examples/videos/room.mp4"
349
+ kitchen_video = "examples/videos/kitchen.mp4"
350
+ fern_video = "examples/videos/fern.mp4"
351
+ single_cartoon_video = "examples/videos/single_cartoon.mp4"
352
+ single_oil_painting_video = "examples/videos/single_oil_painting.mp4"
353
+ pyramid_video = "examples/videos/pyramid.mp4"
354
+
355

# -------------------------------------------------------------------------
# 6) Build Gradio UI
# -------------------------------------------------------------------------
theme = gr.themes.Ocean()
theme.set(
    checkbox_label_background_fill_selected="*button_primary_background_fill",
    checkbox_label_text_color_selected="*button_primary_text_color",
)

with gr.Blocks(
    theme=theme,
    css="""
    .custom-log * {
        font-style: italic;
        font-size: 22px !important;
        background-image: linear-gradient(120deg, #0ea5e9 0%, #6ee7b7 60%, #34d399 100%);
        -webkit-background-clip: text;
        background-clip: text;
        font-weight: bold !important;
        color: transparent !important;
        text-align: center !important;
    }

    .example-log * {
        font-style: italic;
        font-size: 16px !important;
        background-image: linear-gradient(120deg, #0ea5e9 0%, #6ee7b7 60%, #34d399 100%);
        -webkit-background-clip: text;
        background-clip: text;
        color: transparent !important;
    }

    #my_radio .wrap {
        display: flex;
        flex-wrap: nowrap;
        justify-content: center;
        align-items: center;
    }

    #my_radio .wrap label {
        display: flex;
        width: 50%;
        justify-content: center;
        align-items: center;
        margin: 0;
        padding: 10px 0;
        box-sizing: border-box;
    }
    """,
) as demo:
    # Instead of gr.State, we use a hidden Textbox:
    is_example = gr.Textbox(label="is_example", visible=False, value="None")
    num_images = gr.Textbox(label="num_images", visible=False, value="None")
    example_preview = gr.Image(label="Example Preview", visible=False)

    gr.HTML(
        """
    <h1>🌅 STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer</h1>
    <p>
        <a href="https://github.com/NIRVANALAN/STream3R">GitHub Repository</a> |
        <a href="https://nirvanalan.github.io/projects/stream3r">Project Page</a> |
        <a href="https://arxiv.org/abs/2508.10893">Paper</a>
    </p>

    <blockquote>
        Special thanks to VGG-T for their visualization demo, which this demo is built upon!
    </blockquote>

    <div style="font-size: 16px; line-height: 1.5;">
        <p>Upload a video or a set of images to create a 3D reconstruction of a scene or object. STream3R takes these images and generates a 3D point cloud, along with estimated camera poses.</p>

        <h3>Getting Started:</h3>
        <ol>
            <li><strong>Upload Your Data:</strong> Use the "Upload Video" or "Upload Images" buttons on the left to provide your input. Videos are automatically split into individual frames (one frame per second).</li>
            <li><strong>Preview:</strong> Your uploaded images will appear in the gallery on the left.</li>
            <li><strong>Reconstruct:</strong> Click the "Reconstruct" button to start the 3D reconstruction process.</li>
            <li><strong>Visualize:</strong> The 3D reconstruction will appear in the viewer on the right. You can rotate, pan, and zoom to explore the model, and download the GLB file. Note that visualization of the 3D points may be slow for a large number of input images.</li>
            <li>
                <strong>Adjust Visualization (Optional):</strong>
                After reconstruction, you can fine-tune the visualization using the options below
                <details style="display:inline;">
                    <summary style="display:inline;">(<strong>click to expand</strong>):</summary>
                    <ul>
                        <li><em>Confidence Threshold:</em> Adjust the filtering of points based on confidence.</li>
                        <li><em>Show Points from Frame:</em> Select specific frames to display in the point cloud.</li>
                        <li><em>Show Camera:</em> Toggle the display of estimated camera positions.</li>
                        <li><em>Filter Sky / Filter Black Background:</em> Remove sky or black-background points.</li>
                        <li><em>Select a Prediction Mode:</em> Choose between "Depthmap and Camera Branch" or "Pointmap Branch."</li>
                    </ul>
                </details>
            </li>
        </ol>
        <p><strong style="color: #0ea5e9;">Please note:</strong> <span style="color: #0ea5e9; font-weight: bold;">STream3R typically reconstructs a scene in less than 1 second. However, visualizing the 3D points may take tens of seconds due to third-party rendering, which is independent of STream3R's processing time.</span></p>
    </div>
    """
    )
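The instructions above state that uploaded videos are sampled at one frame per second. The index selection behind that kind of sampling can be sketched as follows; this is an illustration under that stated assumption, not the app's actual extraction code (`frames_to_keep` is a hypothetical helper):

```python
def frames_to_keep(total_frames, fps):
    """Return the frame indices to keep when sampling one frame per second.

    For a video with `total_frames` frames at `fps` frames per second,
    keeping indices 0, fps, 2*fps, ... yields one frame per second.
    """
    step = max(int(round(fps)), 1)  # guard against fps < 1
    return list(range(0, int(total_frames), step))
```

For example, a 100-frame clip at 30 fps keeps frames 0, 30, 60, and 90.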

    target_dir_output = gr.Textbox(label="Target Dir", visible=False, value="None")

    with gr.Row():
        with gr.Column(scale=2):
            input_video = gr.Video(label="Upload Video", interactive=True)
            input_images = gr.File(file_count="multiple", label="Upload Images", interactive=True)

            image_gallery = gr.Gallery(
                label="Preview",
                columns=4,
                height="300px",
                show_download_button=True,
                object_fit="contain",
                preview=True,
            )

        with gr.Column(scale=4):
            with gr.Column():
                gr.Markdown("**3D Reconstruction (Point Cloud and Camera Poses)**")
                log_output = gr.Markdown(
                    "Please upload a video or images, then click Reconstruct.", elem_classes=["custom-log"]
                )
                reconstruction_output = gr.Model3D(height=520, zoom_speed=0.5, pan_speed=0.5)

            with gr.Row():
                submit_btn = gr.Button("Reconstruct", scale=1, variant="primary")
                clear_btn = gr.ClearButton(
                    [input_video, input_images, reconstruction_output, log_output, target_dir_output, image_gallery],
                    scale=1,
                )

            with gr.Row():
                prediction_mode = gr.Radio(
                    ["Depthmap and Camera Branch", "Pointmap Branch"],
                    label="Select a Prediction Mode",
                    value="Depthmap and Camera Branch",
                    scale=1,
                    elem_id="my_radio",
                )

            with gr.Row():
                streaming = gr.Radio(
                    [('stream', True), ('batch', False)],
                    label="Streaming or Batch Mode",
                    value=False,
                    scale=1,
                )

            with gr.Row():
                mode = gr.Radio(
                    ["causal", "window", "full"],
                    label="Select Processing Mode",
                    value="causal",
                    scale=1,
                )

            with gr.Row():
                conf_thres = gr.Slider(minimum=0, maximum=100, value=50, step=0.1, label="Confidence Threshold (%)")
                frame_filter = gr.Dropdown(choices=["All"], value="All", label="Show Points from Frame")
                with gr.Column():
                    show_cam = gr.Checkbox(label="Show Camera", value=True)
                    mask_sky = gr.Checkbox(label="Filter Sky", value=False)
                    mask_black_bg = gr.Checkbox(label="Filter Black Background", value=False)
                    mask_white_bg = gr.Checkbox(label="Filter White Background", value=False)

    # ---------------------- Examples section ----------------------
    def build_examples_from_folder():
        examples_root = "examples"
        entries = []
        if not os.path.isdir(examples_root):
            return entries
        candidate_dirs = sorted(
            [
                os.path.join(examples_root, d)
                for d in os.listdir(examples_root)
                if os.path.isdir(os.path.join(examples_root, d))
            ], reverse=True
        )
        if not candidate_dirs:
            candidate_dirs = [examples_root]
        for example_dir in candidate_dirs:
            image_files = []
            for pattern in ("*.png", "*.jpg", "*.jpeg", "*.bmp", "*.webp"):
                image_files.extend(sorted(glob.glob(os.path.join(example_dir, pattern))))
            if not image_files:
                continue
            preview_image = image_files[0]
            num_images_str = str(len(image_files))
            entries.append(
                [
                    preview_image,  # preview image (for visualization only)
                    None,  # input_video (unused for examples)
                    num_images_str,
                    image_files,  # input_images
                    15.0,  # conf_thres
                    False,  # mask_black_bg
                    False,  # mask_white_bg
                    True,  # show_cam
                    False,  # mask_sky
                    "Depthmap and Camera Branch",  # prediction_mode
                    "True",  # is_example
                    "causal",  # mode
                ]
            )
        return entries[:2]

    examples = build_examples_from_folder()
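`build_examples_from_folder` gathers images per extension pattern, sorting within each pattern before concatenating, so all PNGs come before all JPGs. That scanning step can be isolated into a small standalone helper (a sketch mirroring the function above; `list_example_images` is not a name in this app):

```python
import glob
import os


def list_example_images(example_dir):
    """Collect image files, sorted within each extension pattern in turn."""
    files = []
    for pattern in ("*.png", "*.jpg", "*.jpeg", "*.bmp", "*.webp"):
        files.extend(sorted(glob.glob(os.path.join(example_dir, pattern))))
    return files
```

Note the ordering consequence: the result is not globally sorted by filename, only within each extension group.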

    def example_pipeline(
        preview_image,
        input_video,
        num_images_str,
        input_images,
        conf_thres,
        mask_black_bg,
        mask_white_bg,
        show_cam,
        mask_sky,
        prediction_mode,
        is_example_str,
        mode="causal",
    ):
        """
        1) Copy example images to a new target_dir
        2) Reconstruct
        3) Return model3D + logs + new_dir + updated dropdown + gallery
        We do NOT return is_example. It's just an input.
        """
        target_dir, image_paths = handle_uploads(input_video, input_images)
        # Always use "All" for frame_filter in examples
        frame_filter = "All"
        glbfile, log_msg, dropdown = gradio_demo(
            target_dir, conf_thres, frame_filter, mask_black_bg, mask_white_bg, show_cam, mask_sky, prediction_mode, mode
        )
        return glbfile, log_msg, target_dir, dropdown, image_paths

    gr.Markdown("Click any row to load an example.", elem_classes=["example-log"])

    gr.Examples(
        examples=examples,
        inputs=[
            example_preview,
            input_video,
            num_images,
            input_images,
            conf_thres,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
            mode,
        ],
        outputs=[reconstruction_output, log_output, target_dir_output, frame_filter, image_gallery],
        fn=example_pipeline,
        cache_examples=False,
        examples_per_page=50,
    )

    # -------------------------------------------------------------------------
    # "Reconstruct" button logic:
    # - Clear fields
    # - Update log
    # - gradio_demo(...) with the existing target_dir
    # - Then set is_example = "False"
    # -------------------------------------------------------------------------
    submit_btn.click(fn=clear_fields, inputs=[], outputs=[reconstruction_output]).then(
        fn=update_log, inputs=[], outputs=[log_output]
    ).then(
        fn=gradio_demo,
        inputs=[
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            mode,
            streaming,
        ],
        outputs=[reconstruction_output, log_output, frame_filter],
    ).then(
        fn=lambda: "False", inputs=[], outputs=[is_example]  # set is_example to "False"
    )

    # -------------------------------------------------------------------------
    # Real-time Visualization Updates
    # -------------------------------------------------------------------------
    conf_thres.change(
        update_visualization,
        [
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
        ],
        [reconstruction_output, log_output],
    )
    frame_filter.change(
        update_visualization,
        [
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
        ],
        [reconstruction_output, log_output],
    )
    mask_black_bg.change(
        update_visualization,
        [
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
        ],
        [reconstruction_output, log_output],
    )
    mask_white_bg.change(
        update_visualization,
        [
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
        ],
        [reconstruction_output, log_output],
    )
    show_cam.change(
        update_visualization,
        [
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
        ],
        [reconstruction_output, log_output],
    )
    mask_sky.change(
        update_visualization,
        [
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
        ],
        [reconstruction_output, log_output],
    )
    prediction_mode.change(
        update_visualization,
        [
            target_dir_output,
            conf_thres,
            frame_filter,
            mask_black_bg,
            mask_white_bg,
            show_cam,
            mask_sky,
            prediction_mode,
            is_example,
        ],
        [reconstruction_output, log_output],
    )
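All seven `.change()` bindings above share the same handler, input list, and output list, so they could be registered in a loop instead of repeated verbatim. A minimal sketch of that pattern, using a hypothetical `Control` stand-in rather than real Gradio components:

```python
class Control:
    """Stand-in for a Gradio component that records .change() bindings."""

    def __init__(self, name):
        self.name = name
        self.bindings = []

    def change(self, fn, inputs, outputs):
        self.bindings.append((fn, inputs, outputs))


def wire_visualization_updates(controls, handler, shared_inputs, outputs):
    # One loop replaces the seven near-identical .change() calls.
    for control in controls:
        control.change(handler, shared_inputs, outputs)


controls = [Control(n) for n in (
    "conf_thres", "frame_filter", "mask_black_bg", "mask_white_bg",
    "show_cam", "mask_sky", "prediction_mode",
)]
wire_visualization_updates(controls, lambda *args: None, controls, ["viewer", "log"])
```

With real components, the same loop works because every Gradio input component exposes a `.change()` event listener with the same `(fn, inputs, outputs)` shape.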
 
    # -------------------------------------------------------------------------
    # Auto-update gallery whenever user uploads or changes their files
    # -------------------------------------------------------------------------
    input_video.change(
        fn=update_gallery_on_upload,
        inputs=[input_video, input_images],
        outputs=[reconstruction_output, target_dir_output, image_gallery, log_output],
    )
    input_images.change(
        fn=update_gallery_on_upload,
        inputs=[input_video, input_images],
        outputs=[reconstruction_output, target_dir_output, image_gallery, log_output],
    )

demo.queue(max_size=20).launch(show_error=True, share=True)