---
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
model_type: video_mllama
tags:
- multimodal
- video
- vision-language
- mllama
- video-text-to-text
---

# MOSS-Video-Preview-Base

## Introduction

We introduce **MOSS-Video-Preview-Base**, the pretrained foundation checkpoint in the MOSS-Video-Preview series.

> [!Important]
> This is a **pretrained** model checkpoint **without** supervised instruction tuning (neither offline SFT nor real-time SFT).

This repo contains the **pretrained weights**, intended as the starting point for downstream training:

- **Offline SFT**: instruction following and reasoning over full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response

## 🌟 Key Highlights

- **🧩 First Cross-Attention Base**: A foundation architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The architecture natively supports "Silence-Speak" switching and real-time interruption (after subsequent Real-Time SFT).
- **⚡ High Efficiency**: Optimized for **FlashAttention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.
|
|
## Model Architecture

**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a unified image-video cross-attention architecture:

<p align="center">
  <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
</p>

- **Native Unified Design**: Unlike projection-based models, the architecture natively supports both image and video streams, preserving temporal consistency while keeping visual and language representations decoupled.
- **Cross-Modal Projector**: The custom `VideoMllamaTextCrossAttention` mechanism aligns temporal visual features with the linguistic context efficiently.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
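Conceptually, the cross-modal projector lets each text token attend over the sequence of frame features. Below is a minimal, illustrative NumPy sketch of that attention pattern — not the actual `VideoMllamaTextCrossAttention` implementation, whose projections, heads, and masking live in the checkpoint's custom code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_states, video_states, d_head):
    # text_states: (T_text, d) language-stream hidden states (queries)
    # video_states: (T_video, d) frame features (keys and values)
    q, k, v = text_states, video_states, video_states
    scores = q @ k.T / np.sqrt(d_head)   # (T_text, T_video) relevance scores
    weights = softmax(scores, axis=-1)   # each text token distributes attention over frames
    return weights @ v                   # video-conditioned text states

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, hidden size 8
video = rng.normal(size=(16, 8))  # 16 frame features
out = cross_attention(text, video, d_head=8)
print(out.shape)  # (4, 8): one video-aware vector per text token
```

The key property this illustrates is that the language stream stays the same length: video information is injected per text token rather than concatenated as extra tokens.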
|
|
For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).

## 🚀 Quickstart

<details>
<summary><strong>Video inference</strong></summary>
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # The base model has no instruction tuning, so leave the prompt empty for a completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
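How `video_fps`, `video_minlen`, and `video_maxlen` interact is a property of this checkpoint's custom processor; one plausible reading (an assumption, not documented behavior) is that `video_fps` sets the target sampling rate while the min/max arguments clamp the resulting frame count:

```python
# Hypothetical illustration of how the sampling arguments could interact;
# the actual processor logic may differ.
def planned_frame_count(duration_s, video_fps=1.0, video_minlen=8, video_maxlen=16):
    raw = int(duration_s * video_fps)                  # frames at the requested rate
    return max(video_minlen, min(raw, video_maxlen))   # clamped to [minlen, maxlen]

print(planned_frame_count(5))    # short clip: padded up to 8 frames
print(planned_frame_count(60))   # long clip: capped at 16 frames
```

Under this reading, raising `video_maxlen` trades throughput for temporal coverage on long clips.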
|
|
|
|
|
|
</details>

<details>
<summary><strong>Image inference</strong></summary>
|
|
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # The base model has no instruction tuning, so leave the prompt empty for a completion task.

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```

</details>
|
|
## ✅ Intended Use

- **Research Foundation**: A starting point for research on **representation learning** and **model efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptations.

## ⚠️ Limitations & Future Outlook

- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. Without further SFT it may generate repetitive text or fail to follow complex instructions.
- **Performance Benchmarking**: While the architecture leads in real-time innovation, a performance gap remains against top-tier models such as **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to **Megatron-LM** to leverage **3D parallelism (tensor, pipeline, and data parallelism)** for larger-scale pretraining.
- **Open-Source Commitment**: In the next major release, we will open-source the **complete training codebase (integrated with Megatron-LM)** together with more diverse datasets.
|
|
## 🧩 Requirements

- **Python**: 3.10+
- **PyTorch**: 1.13.1+ (GPU strongly recommended)
  - **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
  - **CPU-only**: PyTorch 2.4.0
- **Transformers**: must be loaded with `trust_remote_code=True` for this model family (the checkpoint ships custom code via `auto_map`)
- **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- **Video decoding**:
  - the streaming demo imports OpenCV (`cv2`)
  - the offline demo relies on the processor's video loading backend

For full environment setup (including optional FlashAttention 2 extras), see the top-level repository `README.md`.

## ⚠️ Notes

- This is a **base** model directory. Quality and latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.
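If you want to see exactly which custom classes the checkpoint wires in, you can inspect the downloaded `config.json` directly. A small helper sketch — the key/value shape follows the standard `transformers` `auto_map` convention; the example class names in the comments are hypothetical:

```python
import json

def list_auto_map(config_path):
    """Return the custom-class mapping a checkpoint declares via `auto_map`."""
    with open(config_path) as f:
        config = json.load(f)
    # Keys are Auto* class names; values point at the custom modules shipped
    # alongside the weights, e.g. "modeling_xxx.SomeModelClass".
    return config.get("auto_map", {})

# Example: list_auto_map("path/to/config.json") might map
# "AutoModelForCausalLM" to a class defined in this directory.
```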
|
|
|
|
> [!IMPORTANT]
> ### 🌟 Our Mission & Community Invitation
> **We have filled the gap in cross-attention-based foundation models for video understanding.**
>
> We warmly welcome experts in **representation learning** and **model efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!

## Citation

```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```
|
|