---
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
model_type: video_mllama
tags:
- multimodal
- video
- vision-language
- mllama
- video-text-to-text
---

# MOSS-Video-Preview-Base

## Introduction

We introduce **MOSS-Video-Preview-Base**, the pretrained foundation checkpoint in the MOSS-Video-Preview series.

> [!IMPORTANT]
> This is a **pretrained** model checkpoint **without** supervised instruction tuning (no offline SFT / no Real-Time SFT).

This repo contains the **pretrained weights**, intended to serve as the starting point for downstream:

- **Offline SFT**: instruction following and reasoning on full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response

## 🌟 Key Highlights

- **🧩 First Cross-Attention Base**: A foundation-model architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The underlying architecture is natively designed to support "Silence-Speak" switching and real-time interruption (requires subsequent Real-Time SFT).
- **⚡ Extreme Efficiency**: Optimized for **Flash Attention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.

#### Model Architecture

**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a **Pioneering Image-Video Unified Cross-Attention Architecture**:


- **Native Unified Design**: Unlike traditional projection-based models, this architecture provides native, unified support for both image and video streams, ensuring seamless temporal consistency and visual-language decoupling.
- **Cross-Modal Projector**: Powered by the proprietary `VideoMllamaTextCrossAttention` mechanism, it achieves high-efficiency semantic alignment between temporal visual features and linguistic context.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.

For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).

## 🚀 Quickstart
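Before the full pipeline examples below, the cross-attention alignment described in the architecture section can be illustrated with a minimal single-head sketch. This is a pure-NumPy illustration with made-up shapes and weights; it is **not** the actual `VideoMllamaTextCrossAttention` implementation, only the general idea of text tokens attending over frame-level visual features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, video_features, Wq, Wk, Wv):
    """Text tokens (queries) attend over video features (keys/values)."""
    q = text_hidden @ Wq        # (T_text, d)
    k = video_features @ Wk     # (T_video, d)
    v = video_features @ Wv     # (T_video, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_text, T_video)
    weights = softmax(scores, axis=-1)        # each text token's attention over frames
    return weights @ v          # (T_text, d): video-conditioned text states

rng = np.random.default_rng(0)
d = 64
text = rng.normal(size=(10, d))    # 10 text-token hidden states (hypothetical)
video = rng.normal(size=(32, d))   # 32 frame-level visual features (hypothetical)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
out = cross_attention(text, video, Wq, Wk, Wv)
print(out.shape)  # (10, 64)
```

The design choice this sketches is why the card contrasts cross-attention with "simple feature concatenation": visual tokens never enter the language sequence itself, so the text context length is independent of video length.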
### Video inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # For the base model, leave the prompt empty to perform a completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
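The `video_fps`, `video_minlen`, and `video_maxlen` arguments above control how many frames are sampled from the clip. The helper below is a hypothetical sketch of what such clamping logic plausibly looks like (uniform sampling at the target rate, clamped to a frame-count range); the processor's actual implementation may differ:

```python
def plan_frame_indices(total_frames, native_fps, video_fps=1.0,
                       video_minlen=8, video_maxlen=16):
    """Pick frame indices at ~video_fps, clamped to [video_minlen, video_maxlen].

    Illustrative helper (not part of the model repo).
    """
    duration_s = total_frames / native_fps
    n = round(duration_s * video_fps)             # frames at the target sampling rate
    n = max(video_minlen, min(video_maxlen, n))   # clamp to [minlen, maxlen]
    n = min(n, total_frames)                      # never request more frames than exist
    # Spread n indices uniformly across the clip.
    step = total_frames / n
    return [min(total_frames - 1, int(i * step)) for i in range(n)]

# A 60 s clip at 30 fps: 60 frames at 1 fps exceeds video_maxlen, so clamp to 16.
indices = plan_frame_indices(total_frames=1800, native_fps=30.0)
print(len(indices))  # 16
```

Under these assumptions, very short clips are padded up to `video_minlen` frames and long clips are capped at `video_maxlen`, which bounds the visual sequence length the model must attend over.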
### Image inference

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # For the base model, leave the prompt empty to perform a completion task.

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
## ✅ Intended Use

- **Research Foundation**: An ideal starting point for researchers focusing on **Representation Learning** or **Model Efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptation.

## ⚠️ Limitations & Future Outlook

- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. It may generate repetitive text or fail to follow complex instructions without further SFT.
- **Performance Benchmarking**: While leading in real-time architectural innovation, a performance gap still exists compared to top-tier models such as **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to the **Megatron-LM framework** to leverage **3D parallelism (Tensor, Pipeline, and Data Parallelism)** for larger-scale pre-training.
- **Open-Source Commitment**: In the next major release, we will officially open-source the **complete training codebase (integrated with Megatron-LM)** and more diverse datasets to the community.

## 🧩 Requirements

- **Python**: 3.10+
- **PyTorch**: 1.13.1+ (GPU strongly recommended)
  - **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
  - **CPU-only**: PyTorch 2.4.0
- **Transformers**: must be loaded with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
- **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- **Video decode**:
  - the streaming demo imports OpenCV (`cv2`)
  - the offline demo relies on the processor's video-loading backend

For full environment setup (including optional FlashAttention 2 extras), see the top-level repository `README.md`.
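Since FlashAttention 2 is optional, a script can pick an attention backend at load time. The helper below is an illustrative sketch (not part of the model repo), assuming the usual `flash_attn` package name and falling back to Transformers' built-in `"sdpa"` implementation:

```python
import importlib.util

def pick_attn_implementation():
    """Return an attn_implementation string for from_pretrained().

    Illustrative helper: uses FlashAttention 2 when the flash_attn
    package is importable, otherwise falls back to PyTorch SDPA.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

impl = pick_attn_implementation()
print(impl)
```

The result can then be passed as `attn_implementation=impl` to `AutoModelForCausalLM.from_pretrained(...)` in the quickstart examples, so the same script runs on machines with and without FlashAttention 2 installed.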
## ⚠️ Notes

- This is a **base** model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.

> [!IMPORTANT]
> ### 🌟 Our Mission & Community Invitation
> **We have filled the gap in cross-attention-based foundation models for video understanding.**
>
> We warmly welcome experts in **Representation Learning** and **Model Efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!

## Citation

```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```