---
license: mit
library_name: transformers
pipeline_tag: image-feature-extraction
---

# OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

OmniStream is a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), the model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache.
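The 3D-RoPE idea — splitting each head's channels into groups and rotating each group by one axis of a token's (time, height, width) coordinate — can be sketched in NumPy. This is an illustrative toy, not the paper's exact formulation; the function names and the equal three-way channel split are assumptions:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate channel pairs of x by angles pos * freq."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, t, h, w):
    """Split channels into three equal groups; rotate each group by one axis
    of the token's (time, height, width) coordinate."""
    g = x.shape[-1] // 3
    parts = [rope_1d(x[..., i * g:(i + 1) * g], p)
             for i, p in enumerate((t, h, w))]
    return np.concatenate(parts, axis=-1)
```

Because rotations preserve inner products up to the angle difference, query/key dot products depend only on *relative* offsets along each axis — the property that lets a streaming model keep appending frames without re-encoding old positions.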

- **Paper:** [OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams](https://huggingface.co/papers/2603.12265)
- **Project Page:** [https://go2heart.github.io/omnistream/](https://go2heart.github.io/omnistream/)
- **Repository:** [https://github.com/Go2Heart/OmniStream](https://github.com/Go2Heart/OmniStream)

## Sample Usage

The following code snippet demonstrates how to use OmniStream for feature extraction. Note that this requires the `model.py` file from the official repository to be present in your environment.

```python
from model import OmnistreamMultiFrameTransformer
from transformers import AutoImageProcessor
import torch
import numpy as np

# Load processor and model
processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")

model.eval()

# Prepare a dummy clip: 16 frames of 512x512 RGB images (Time, Height, Width, Channels)
fake_pixel = np.random.randn(16, 512, 512, 3)
fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")

# Add a batch dimension: (Batch, Time, Channels, Height, Width)
fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()

with torch.no_grad():
    output = model(**fake_input, return_dict=True)

print(output.keys())
print(output["last_hidden_state"].shape)  # last layer's hidden states
print(output["pooler_output"].shape)      # CLS token
print(output["patch_start_idx"])          # index of the first patch of each frame
```
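The frame-by-frame online processing described above can be illustrated with a minimal single-head causal-attention toy in NumPy: each arriving frame attends to a persistent key/value cache plus its own tokens, so past frames are never re-encoded. This is an assumed sketch of the general KV-cache pattern, not OmniStream's actual API:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class StreamingAttention:
    """Single-head causal attention over a frame stream with a growing
    KV-cache. Each call to step() processes one new frame online."""

    def __init__(self):
        self.k_cache, self.v_cache = [], []

    def step(self, q, k, v):
        # q, k, v: (tokens_per_frame, dim) for the newly arrived frame
        self.k_cache.append(k)
        self.v_cache.append(v)
        K = np.concatenate(self.k_cache)  # all tokens seen so far
        V = np.concatenate(self.v_cache)
        scores = q @ K.T / np.sqrt(q.shape[-1])
        return softmax(scores) @ V
```

Processing frames one at a time through `step()` yields, for the latest frame, the same output as full attention over the whole prefix — the per-frame cost stays linear in the cache length instead of re-running attention over the entire stream.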

## Citation

```bibtex
@article{yan2026omnistream,
  title={OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams}, 
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal={arXiv preprint arXiv:2603.12265},
  year={2026},
  url={https://arxiv.org/abs/2603.12265}
}
```