Upload moss-video-preview-base

README.md (CHANGED)

@@ -2,7 +2,7 @@
 language:
 - en
 library_name: transformers
-pipeline_tag:
+pipeline_tag: video-text-to-text
 license: apache-2.0
 model_type: video_mllama
 tags:
@@ -10,10 +10,10 @@ tags:
 - video
 - vision-language
 - mllama
--
+- video-text-to-text
 ---
 
 # MOSS-Video-Preview-Base
 
 ## Introduction
 
@@ -159,13 +159,15 @@ print(processor.decode(output_ids[0], skip_special_tokens=True))
 ## ⚠️ Limitations
 
 - **Not instruction-tuned**: as a pretrain-only checkpoint, responses may be less aligned/helpful than SFT variants.
-- **
+- **Real-Time streaming not supported by default**: streaming generation APIs are typically provided by Real-Time SFT checkpoints.
 - **Performance is hardware/config dependent**: enabling FlashAttention 2 and using `bfloat16` on modern GPUs generally improves throughput and memory efficiency.
 
 ## 🧩 Requirements
 
 - **Python**: 3.10+
 - **PyTorch**: 1.13.1+ (GPU strongly recommended)
+- **Tested setup**: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1
+- **CPU-only**: PyTorch 2.4.0
 - **Transformers**: required with `trust_remote_code=True` for this model family (due to `auto_map` custom code)
 - **Optional (recommended)**: FlashAttention 2 (`attn_implementation="flash_attention_2"`)
 - **Video decode**:
@@ -180,7 +182,7 @@ For full environment setup (including optional FlashAttention2 extras), see the
 ## ⚠️ Notes
 
 - This is a **base** model directory. Quality/latency characteristics (offline SFT, real-time streaming, etc.) depend on the specific fine-tuned checkpoints and inference pipeline.
-- The Python source files in this directory are referenced via `auto_map` in `config.json`
+- The Python source files in this directory are referenced via `auto_map` in `config.json`.
 
 
 > [!IMPORTANT]
@@ -190,17 +192,14 @@
 > We warmly welcome experts in **Representation Learning** and **Model Efficiency** to explore, experiment, and innovate on top of our architecture. Let's push the boundaries of video intelligence and advance the open-source community together!
 
 
-
 ## Citation
-
 ```bibtex
 @misc{moss_video_2026,
-title = {MOSS-Video-Preview:
+title = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
 author = {OpenMOSS Team},
 year = {2026},
-
-
-howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}}
+howpublished = {\url{https://github.com/fnlp-vision/MOSS-Video-Preview}},
+note = {GitHub repository}
 }
 ```
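The Requirements bullets in the diff pin down how this checkpoint is meant to be loaded: `trust_remote_code=True` (because `config.json`'s `auto_map` references custom Python sources), with `bfloat16` and optional FlashAttention 2 on GPU. A minimal sketch of assembling those `from_pretrained()` keyword arguments under those assumptions; the helper name and the hub repo id are illustrative, not part of the release:

```python
# Sketch only: from_pretrained_kwargs() is a hypothetical helper reflecting the
# README's Requirements bullets, not official loading code from the repo.

def from_pretrained_kwargs(cuda_available: bool, has_flash_attn: bool) -> dict:
    """Build keyword arguments for transformers' from_pretrained()."""
    # trust_remote_code=True is required for this family: config.json's
    # auto_map points at custom Python source files shipped with the weights.
    kwargs = {"trust_remote_code": True}
    if cuda_available:
        # bfloat16 on modern GPUs generally improves throughput and memory use.
        kwargs["torch_dtype"] = "bfloat16"  # transformers accepts the string form
        kwargs["device_map"] = "auto"
        if has_flash_attn:
            # Optional but recommended when the flash-attn package is installed.
            kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs

# Hypothetical wiring (needs network access and the real checkpoint):
# import torch
# from transformers import AutoModelForCausalLM, AutoProcessor
# repo_id = "OpenMOSS/MOSS-Video-Preview-Base"  # assumed hub id
# kwargs = from_pretrained_kwargs(torch.cuda.is_available(), has_flash_attn=False)
# model = AutoModelForCausalLM.from_pretrained(repo_id, **kwargs)
# processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
```

On CPU-only setups the helper returns only `trust_remote_code=True`, leaving the default dtype, which matches the README's CPU-only PyTorch 2.4.0 row.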