---
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
model_type: video_mllama
tags:
- multimodal
- video
- vision-language
- mllama
- video-text-to-text
---

# MOSS-Video-Preview-Base

## Introduction

We introduce **MOSS-Video-Preview-Base**, the pretrained foundation checkpoint in the MOSS-Video-Preview series.

> [!Important]
> This is a **pretrained** model checkpoint **without** supervised instruction tuning (neither offline SFT nor real-time SFT).

This repo contains the **pretrained weights**, intended as the starting point for downstream training:

- **Offline SFT**: instruction following and reasoning over full video segments
- **Real-Time SFT**: low-latency streaming video understanding and response

## 🌟 Key Highlights

- **🧩 First Cross-Attention Base**: A foundation architecture designed for native video-language understanding, moving beyond simple feature concatenation.
- **🔄 Streaming-Ready Backbone**: The architecture natively supports "Silence-Speak" switching and real-time interruption (after subsequent Real-Time SFT).
- **⚡ High Efficiency**: Optimized for **FlashAttention 2** and compatible with **NPU/CUDA** platforms, providing a high-throughput starting point for long-video research.
|
|
## Model Architecture

**MOSS-Video-Preview-Base** is the foundational checkpoint of the series, featuring a unified image-video cross-attention architecture:

<p align="center">
  <img src="assets/model_structure.png" width="90%" alt="Model Architecture"/>
</p>

- **Native Unified Design**: Unlike projection-based models, the architecture natively supports both image and video streams, preserving temporal consistency while keeping visual and language representations decoupled.
- **Cross-Modal Projector**: The custom `VideoMllamaTextCrossAttention` mechanism aligns temporal visual features with the linguistic context efficiently.
- **Unified Spatio-Temporal Encoding**: Aligns video frame sequences with text tokens, providing a robust backbone for long-context multimodal reasoning.
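Conceptually, the cross-modal projector lets each text token attend over the sequence of frame features. Below is a minimal, illustrative NumPy sketch of that attention pattern — not the actual `VideoMllamaTextCrossAttention` implementation, whose projections, heads, and masking live in the checkpoint's custom code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_states, video_states, d_head):
    # text_states: (T_text, d) language-stream hidden states (queries)
    # video_states: (T_video, d) frame features (keys and values)
    q, k, v = text_states, video_states, video_states
    scores = q @ k.T / np.sqrt(d_head)   # (T_text, T_video) relevance scores
    weights = softmax(scores, axis=-1)   # each text token distributes attention over frames
    return weights @ v                   # video-conditioned text states

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, hidden size 8
video = rng.normal(size=(16, 8))  # 16 frame features
out = cross_attention(text, video, d_head=8)
print(out.shape)  # (4, 8): one video-aware vector per text token
```

The key property this illustrates is that the language stream stays the same length: video information is injected per text token rather than concatenated as extra tokens.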
|
|
For architecture diagrams and full system details, see the top-level repository: [fnlp-vision/MOSS-Video-Preview](https://github.com/fnlp-vision/MOSS-Video-Preview).

## 🚀 Quickstart

<details>
<summary><strong>Video inference</strong></summary>
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
video_path = "data/example_video.mp4"
prompt = ""  # The base model has no instruction tuning, so leave the prompt empty for a completion task.

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    videos=[video_path],
    video_fps=1.0,
    video_minlen=8,
    video_maxlen=16,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```
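How `video_fps`, `video_minlen`, and `video_maxlen` interact is a property of this checkpoint's custom processor; one plausible reading (an assumption, not documented behavior) is that `video_fps` sets the target sampling rate while the min/max arguments clamp the resulting frame count:

```python
# Hypothetical illustration of how the sampling arguments could interact;
# the actual processor logic may differ.
def planned_frame_count(duration_s, video_fps=1.0, video_minlen=8, video_maxlen=16):
    raw = int(duration_s * video_fps)                  # frames at the requested rate
    return max(video_minlen, min(raw, video_maxlen))   # clamped to [minlen, maxlen]

print(planned_frame_count(5))    # short clip: padded up to 8 frames
print(planned_frame_count(60))   # long clip: capped at 16 frames
```

Under this reading, raising `video_maxlen` trades throughput for temporal coverage on long clips.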
|
|
|
|
|
|
</details>

<details>
<summary><strong>Image inference</strong></summary>
|
|
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "fnlp-vision/moss-video-preview-base"
image_path = "data/example_image.jpg"
prompt = ""  # The base model has no instruction tuning, so leave the prompt empty for a completion task.

image = Image.open(image_path).convert("RGB")

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=input_text,
    images=[image],
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```

</details>
|
|
## ✅ Intended Use

- **Research Foundation**: A starting point for research on **representation learning** and **model efficiency** in video understanding.
- **SFT Starting Point**: The recommended backbone for training your own **Offline SFT** or **Real-Time Streaming** variants.
- **Architecture Exploration**: Test new multimodal alignment techniques, temporal encodings, or domain-specific adaptations.

## ⚠️ Limitations & Future Outlook

- **Base Model Nature**: This checkpoint is **pretrained only** and has not undergone instruction tuning. Without further SFT it may generate repetitive text or fail to follow complex instructions.
- **Performance Benchmarking**: While the architecture leads in real-time innovation, a performance gap remains against top-tier models such as **Qwen2.5-VL**. Closing this gap is the core focus of our ongoing iterations.
- **Scalable Distributed Training**: The current training pipeline is optimized for architectural validation. We are migrating to **Megatron-LM** to leverage **3D parallelism (tensor, pipeline, and data parallelism)** for larger-scale pretraining.
- **Open-Source Commitment**: In the next major release, we will open-source the **complete training codebase (integrated with Megatron-LM)** together with more diverse datasets.
|
|
## 🧩 Requirements

- **Python**: 3.10+
- **PyTorch**: 1.13.1+ (GPU strongly recommended)
  - **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
  - **CPU-only**: PyTorch 2.4.0
- **Transformers**: must be loaded with `trust_remote_code=True` for this model family (the checkpoint ships custom code via `auto_map`)
- **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
- **Video decoding**:
  - the streaming demo imports OpenCV (`cv2`)
  - the offline demo relies on the processor's video loading backend

For full environment setup (including optional FlashAttention 2 extras), see the top-level repository `README.md`.

## ⚠️ Notes

- This is a **base** model directory. Quality and latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
- The Python source files in this directory are referenced via `auto_map` in `config.json`.
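If you want to see exactly which custom classes the checkpoint wires in, you can inspect the downloaded `config.json` directly. A small helper sketch — the key/value shape follows the standard `transformers` `auto_map` convention; the example class names in the comments are hypothetical:

```python
import json

def list_auto_map(config_path):
    """Return the custom-class mapping a checkpoint declares via `auto_map`."""
    with open(config_path) as f:
        config = json.load(f)
    # Keys are Auto* class names; values point at the custom modules shipped
    # alongside the weights, e.g. "modeling_xxx.SomeModelClass".
    return config.get("auto_map", {})

# Example: list_auto_map("path/to/config.json") might map
# "AutoModelForCausalLM" to a class defined in this directory.
```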
|
|
|
|
> [!IMPORTANT]
> ### 🌟 Our Mission & Community Invitation
> **We have filled the gap in cross-attention-based foundation models for video understanding.**
>
> We warmly welcome experts in **representation learning** and **model efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!

## Citation

```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```
|
|