findcard12138 committed on
Commit c53c874 · verified · 1 Parent(s): 056914d

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -39,3 +39,5 @@ assets/logo.png filter=lfs diff=lfs merge=lfs -text
  assets/structure.png filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
  assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
+ assets/MOSS-VL-benchmark.png filter=lfs diff=lfs merge=lfs -text
+ assets/benchmark_table.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  title: MOSS-VL-Instruct-0408
- date: '2026-04-08T00:00:00.000Z'
  category: Multimodal-LLM
  status: SFT
  language:
@@ -8,8 +8,7 @@ language:
  library_name: transformers
  pipeline_tag: video-text-to-text
  license: apache-2.0
- base_model:
- - fnlp-vision/MOSS-VL-Base-0408
  tags:
  - SFT
  - Video-Understanding
@@ -29,51 +28,45 @@ tags:
 
  ## 📌 Introduction
 
- We introduce **MOSS-VL-Instruct-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).
 
- > [!IMPORTANT]
- > This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
-
- This model is designed as a high-performance offline engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.
 
- ### This checkpoint is intended for:
 
- - **video/image understanding** with significantly improved instruction following capabilities.
- - Serving as a **strong starting point** for further **Real-Time SFT** or specific domain adaptation.
 
- ---
 
- ## 🚀 Key Features & Status
 
- | Feature | Status | Description |
- | :--- | :---: | :--- |
- | **Model Loading** | ✅ | Standard HF loading with `trust_remote_code=True` |
- | **Image Understanding** | ✅ | Single/Multi-image input support |
- | **Video Understanding** | ✅ | Native video frame sequence processing |
- | **Mixed Inference** | ✅ | Interleaved image and video inputs |
- | **Offline Generation** | ✅ | Optimized `offline_generate` & `offline_batch_generate` |
- | **Benchmarks/Metrics** | ⏳ | Coming in future updates |
 
  ---
 
  ## 🏗 Model Architecture
 
- **MOSS-VL-Instruct-0408** adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.
 
  <p align="center">
  <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
  </p>
 
- ## Temporal-Aware Prompting
 
- At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.
  <p align="center">
  <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
  </p>
 
- ## Multimodal RoPE
 
- MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.
  <p align="center">
  <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
  </p>
@@ -81,9 +74,20 @@ MOSS-VL uses multimodal rotary position encoding to align text tokens and visual
 
  ## 📊 Model Performance
 
- We evaluate **MOSS-VL-Instruct-0408** across several key multimodal benchmarks, focusing on both video and image understanding.
 
  <p align="center">
- <img src="assets/MOSS-VL-Benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
  </p>
 
@@ -294,29 +298,9 @@ with torch.no_grad():
 
  </details>
 
- ## Intended Use
-
- - offline image understanding
- - offline video understanding
- - multimodal prompt experiments for release validation
- - checkpoint-level inference integration and debugging
 
  ## Requirements
 
- Core validated inference dependencies:
-
- - `python==3.12.13`
- - `torch==2.8.0+cu128`
- - `torchvision==0.23.0+cu128`
- - `transformers==4.57.1`
- - `accelerate==1.12.0`
- - `flash_attn==2.8.1`
- - `torchcodec==0.7.0`
- - `numpy==2.4.3`
- - `pillow==12.1.1`
- - `joblib==1.5.2`
- - `einops==0.8.2`
-
  Installation commands:
 
  ```bash
@@ -333,12 +317,18 @@ Validated setup notes:
 
  ## Limitations and Future Work
 
- - realtime usage is not documented here
- - benchmark, metric, and training details are still blank
- - some sections are intentionally placeholders until release information is finalized
- - batch calls currently require shared `generate_kwargs` and shared `media_kwargs` within one call
- - batch streaming and batch cancel / stop protocol are not part of `offline_batch_generate(...)`
- - the queue example is intentionally minimal and does not include production-grade timeout or worker error handling
 
  ## Citation
@@ -350,4 +340,4 @@ Validated setup notes:
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note = {GitHub repository}
  }
- ```
  ---
  title: MOSS-VL-Instruct-0408
+ date: 2026-04-08
  category: Multimodal-LLM
  status: SFT
  language:
 
  library_name: transformers
  pipeline_tag: video-text-to-text
  license: apache-2.0
+ base_model: fnlp-vision/MOSS-VL-Base-0408
  tags:
  - SFT
  - Video-Understanding
 
 
  ## 📌 Introduction
 
+ MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing open multimodal foundation models.
 
+ Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks — including image understanding, OCR, document parsing, visual reasoning, and instruction following — and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.
 
+ ### Highlights
 
+ - 🎬 **Outstanding Video Understanding** — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, MLVU, and EgoSchema.
+ - 🖼️ **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
+ - 💬 **Reliable Instruction Following** — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
 
+ ### Note on Variants
 
+ > [!IMPORTANT]
+ > **This is the offline instruction-tuned checkpoint.** It is not the streaming variant. If you are looking for low-latency, real-time interactive video understanding, please refer to the upcoming **MOSS-VL-RealTime** release.
 
  ---
 
  ## 🏗 Model Architecture
 
+ **MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline — eliminating the need for heavy pre-processing.
+
  <p align="center">
  <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
  </p>
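As a rough, self-contained sketch of the cross-attention step described above (toy dimensions and values in plain Python, not the model's actual implementation), each text-token query pools information from the visual tokens' keys and values:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, visual_keys, visual_values):
    """Scaled dot-product attention: text queries attend over visual tokens."""
    d = len(text_queries[0])
    fused = []
    for q in text_queries:
        # similarity of this text query to every visual key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in visual_keys]
        weights = softmax(scores)
        # weighted mix of the visual values replaces a full self-attention
        # pass over every visual token, which is what keeps latency low
        fused.append([sum(w * v[j] for w, v in zip(weights, visual_values))
                      for j in range(d)])
    return fused

# 2 text tokens attending over 3 visual tokens (4-dim toy features)
text_q = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
vis_k = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
vis_v = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
fused = cross_attention(text_q, vis_k, vis_v)
```

Because visual tokens are only read through this attention bridge, the language backbone's sequence length stays independent of the number of visual patches.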
 
+ ## 🧩 Absolute Timestamps
+
+ To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.
 
  <p align="center">
  <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
  </p>
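A minimal sketch of this timestamp injection (the `<image>` placeholder and the bracketed `[HH:MM:SS]` format are illustrative assumptions here, not the model's actual special tokens):

```python
def build_timestamped_prompt(frame_times_s, question, frame_token="<image>"):
    """Prefix each sampled frame placeholder with an absolute [HH:MM:SS] anchor."""
    lines = []
    for t in frame_times_s:
        # convert seconds into an absolute hours/minutes/seconds anchor
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        lines.append(f"[{h:02d}:{m:02d}:{s:02d}] {frame_token}")
    lines.append(question)
    return "\n".join(lines)

# frames sampled at 0 s, 2.5 s, and 65 s into the video
prompt = build_timestamped_prompt([0, 2.5, 65], "What happens after the door opens?")
print(prompt)
```

The anchors make frame spacing explicit, so the model can infer pacing even when frames are sampled non-uniformly.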
 
+ ## 🧬 Cross-attention RoPE (XRoPE)
+
+ MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).
 
  <p align="center">
  <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
  </p>
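A toy sketch of the shared (t, h, w) indexing idea, following the generic multimodal-RoPE convention (the card does not spell out XRoPE's exact indexing scheme, so this is an assumption for illustration only):

```python
def mrope_positions(n_text_tokens, video_shape):
    """Assign a (t, h, w) coordinate to every token in a video-then-text sequence.

    Video patches keep their true frame/row/column indices; text tokens advance
    uniformly on all three axes so they sit on a shared diagonal after the video.
    """
    T, H, W = video_shape
    positions = []
    # video patches first: coordinate = (frame index, patch row, patch column)
    for t in range(T):
        for h in range(H):
            for w in range(W):
                positions.append((t, h, w))
    # text tokens continue past the largest coordinate used by the video
    start = max(T, H, W)
    for i in range(n_text_tokens):
        p = start + i
        positions.append((p, p, p))
    return positions

# 4 text tokens following a tiny 2-frame, 2x2-patch video
pos = mrope_positions(4, (2, 2, 2))
```

Each coordinate axis then gets its own rotary rotation, which is how temporal order and spatial layout survive into the attention scores.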
 
 
  ## 📊 Model Performance
 
+ We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Document/OCR, Multimodal Reasoning, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.
+
+ ### Key Highlights
+
+ * **🚀 Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2 pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
+ * **👁️ Outstanding Multimodal Perception**: With a score of **75.1**, MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
+ * **🧠 Robust Multimodal Reasoning**: Achieving **64.3**, MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `CVBench` and `VisuLogic`.
+ * **📄 Reliable Document Understanding**: While the model is primarily optimized for general perception and video, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
  <p align="center">
+ <img src="assets/benchmark_table.png" alt="MOSS-VL Benchmark Table" width="100%"/>
+ </p>
+
+ <p align="center">
+ <img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
  </p>
 
 
 
  </details>
 
  ## Requirements
 
  Installation commands:
 
  ```bash
 
 
  ## Limitations and Future Work
 
+ MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:
+
+ - 🧮 **Math & Code Reasoning** — While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code understanding capabilities, especially in multimodal contexts (e.g., reasoning over diagrams, charts, screenshots, and code-bearing visual inputs).
+ - ⚡ **Real-Time Streaming Variant** — The upcoming **MOSS-VL-RealTime** checkpoint will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants — complementing this offline checkpoint.
+ - 🎯 **RL Post-Training** — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
+ - ⏳ **Longer Context for Hour-Scale Video** — Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
+ - 🔊 **Audio Modality Integration** — Bringing audio understanding into the pipeline, so MOSS-VL can jointly reason over the visual and acoustic streams of a video — speech, ambient sound, music, and their interaction with on-screen events.
+ - 📏 **Parameter Scaling** — Releasing additional model sizes across the MOSS-VL series to cover a wider range of compute budgets and deployment scenarios.
+
+ > [!NOTE]
+ > We welcome community feedback and contributions on any of these directions.
+
 
  ## Citation
 
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note = {GitHub repository}
  }
+ ```
assets/MOSS-VL-benchmark.png ADDED

Git LFS Details

  • SHA256: 5c9fee49c8eb6f5005e8276e5cb4cfca06b1c2961b5de2896b8887b88fd9d249
  • Pointer size: 131 Bytes
  • Size of remote file: 233 kB
assets/benchmark_table.png ADDED

Git LFS Details

  • SHA256: 166de71650e926c3b61a60ff7dbd1f69a17b1ab6516dd35678263f486d383a38
  • Pointer size: 131 Bytes
  • Size of remote file: 189 kB