Upload moss-video-preview-base

README.md (CHANGED)

@@ -2,7 +2,7 @@
 language:
 - en
 library_name: transformers
-pipeline_tag:
+pipeline_tag: video-text-to-text
 license: apache-2.0
 model_type: video_mllama
 tags:
@@ -10,10 +10,10 @@ tags:
 - video
 - vision-language
 - mllama
--
+- video-text-to-text
 ---
 
 # MOSS-Video-Preview-Base
 
 ## Introduction
 
@@ -159,13 +159,15 @@ print(processor.decode(output_ids[0], skip_special_tokens=True))
 ## ⚠️ Limitations
 
 - **Not instruction-tuned**: as a pretrain-only checkpoint, responses may be less aligned/helpful than SFT variants.
-- **
+- **Real-Time streaming not supported by default**: streaming generation APIs are typically provided by Real-Time SFT checkpoints.
 - **Performance is hardware/config dependent**: enabling FlashAttention 2 and using `bfloat16` on modern GPUs generally improves throughput and memory efficiency.
 
 ## 🧩 Requirements
 
 - **Python**: 3.10+
 - **PyTorch**: 1.13.1+ (GPU strongly recommended)
+- **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
+- **CPU-only**: PyTorch 2.4.0
 - **Transformers**: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
 - **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
 - **Video decode**:
@@ -180,7 +182,7 @@ For full environment setup (including optional FlashAttention2 extras), see the
 ## ⚠️ Notes
 
 - This is a **base** model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
-- The Python source files in this directory are referenced via `auto_map` in `config.json`
+- The Python source files in this directory are referenced via `auto_map` in `config.json`.
 
 
 > [!IMPORTANT]
@@ -190,17 +192,14 @@
 > We warmly welcome experts in **Representation Learning** and **Model Efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
 
 
-
 ## Citation
-
 ```bibtex
 @misc{moss_video_2026,
-title = {MOSS-Video-Preview:
+title = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
 author = {OpenMOSS Team},
 year = {2026},
-
-
-howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}}
+howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
+note = {GitHub repository}
 }
 ```
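The Requirements bullets in the diff pin down how this checkpoint is meant to be loaded: `trust_remote_code=True` (because `config.json`'s `auto_map` references custom Python sources), with `bfloat16` and optional FlashAttention 2 on GPU. A minimal sketch of assembling those `from_pretrained()` keyword arguments under those assumptions; the helper name and the hub repo id are illustrative, not part of the release:

```python
# Sketch only: from_pretrained_kwargs() is a hypothetical helper reflecting the
# README's Requirements bullets, not official loading code from the repo.

def from_pretrained_kwargs(cuda_available: bool, has_flash_attn: bool) -> dict:
    """Build keyword arguments for transformers' from_pretrained()."""
    # trust_remote_code=True is required for this family: config.json's
    # auto_map points at custom Python source files shipped with the weights.
    kwargs = {"trust_remote_code": True}
    if cuda_available:
        # bfloat16 on modern GPUs generally improves throughput and memory use.
        kwargs["torch_dtype"] = "bfloat16"  # transformers accepts the string form
        kwargs["device_map"] = "auto"
        if has_flash_attn:
            # Optional but recommended when the flash-attn package is installed.
            kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs

# Hypothetical wiring (needs network access and the real checkpoint):
# import torch
# from transformers import AutoModelForCausalLM, AutoProcessor
# repo_id = "OpenMOSS/MOSS-Video-Preview-Base"  # assumed hub id
# kwargs = from_pretrained_kwargs(torch.cuda.is_available(), has_flash_attn=False)
# model = AutoModelForCausalLM.from_pretrained(repo_id, **kwargs)
# processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
```

On CPU-only setups the helper returns only `trust_remote_code=True`, leaving the default dtype, which matches the README's CPU-only PyTorch 2.4.0 row.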