Upload folder using huggingface_hub

- .gitattributes +2 -0
- README.md +45 -55
- assets/MOSS-VL-benchmark.png +3 -0
- assets/benchmark_table.png +3 -0
.gitattributes
CHANGED

````diff
@@ -39,3 +39,5 @@ assets/logo.png filter=lfs diff=lfs merge=lfs -text
 assets/structure.png filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
 assets/MOSS-VL-Benchmark.png filter=lfs diff=lfs merge=lfs -text
+assets/MOSS-VL-benchmark.png filter=lfs diff=lfs merge=lfs -text
+assets/benchmark_table.png filter=lfs diff=lfs merge=lfs -text
````
README.md
CHANGED

````diff
@@ -1,6 +1,6 @@
 ---
 title: MOSS-VL-Instruct-0408
-date:
 category: Multimodal-LLM
 status: SFT
 language:
@@ -8,8 +8,7 @@ language:
 library_name: transformers
 pipeline_tag: video-text-to-text
 license: apache-2.0
-base_model:
-- fnlp-vision/MOSS-VL-Base-0408
 tags:
 - SFT
 - Video-Understanding
````
````diff
@@ -29,51 +28,45 @@ tags:
 
 ## Introduction
 
-
-> This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.
-
-This model is designed as a high-performance offline engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.
 
-###
-
-
-
-| Feature | Status | Description |
-| :--- | :---: | :--- |
-| **Model Loading** | ✅ | Standard HF loading with `trust_remote_code=True` |
-| **Image Understanding** | ✅ | Single/Multi-image input support |
-| **Video Understanding** | ✅ | Native video frame sequence processing |
-| **Mixed Inference** | ✅ | Interleaved image and video inputs |
-| **Offline Generation** | ✅ | Optimized `offline_generate` & `offline_batch_generate` |
-| **Benchmarks/Metrics** | ⏳ | Coming in future updates |
 
 ---
 
 ## Model Architecture
 
-**MOSS-VL-Instruct-0408** adopts a [...]
 <p align="center">
   <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
 </p>
 
-## [...]
 
-At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.
 <p align="center">
   <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
 </p>
 
-## [...]
 
-MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.
 <p align="center">
   <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
 </p>
````
````diff
@@ -81,9 +74,20 @@ MOSS-VL uses multimodal rotary position encoding to align text tokens and visual
 
 ## Model Performance
 
-We [...]
 <p align="center">
-  <img src="assets/ [...]
 </p>
 
````
````diff
@@ -294,29 +298,9 @@ with torch.no_grad():
 
 </details>
 
-## Intended Use
-
-- offline image understanding
-- offline video understanding
-- multimodal prompt experiments for release validation
-- checkpoint-level inference integration and debugging
 
 ## Requirements
 
-Core validated inference dependencies:
-
-- `python==3.12.13`
-- `torch==2.8.0+cu128`
-- `torchvision==0.23.0+cu128`
-- `transformers==4.57.1`
-- `accelerate==1.12.0`
-- `flash_attn==2.8.1`
-- `torchcodec==0.7.0`
-- `numpy==2.4.3`
-- `pillow==12.1.1`
-- `joblib==1.5.2`
-- `einops==0.8.2`
-
 Installation commands:
 
 ```bash
````
````diff
@@ -333,12 +317,18 @@ Validated setup notes:
 
 ## Limitations and Future Work
 
-- [...]
-
-- [...]
-- [...]
-- [...]
-- [...]
 
 
 ## Citation
````
````diff
@@ -350,4 +340,4 @@
   howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
   note = {GitHub repository}
 }
-```
````
Updated README.md:

---
title: MOSS-VL-Instruct-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
[...]
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/MOSS-VL-Base-0408
tags:
- SFT
- Video-Understanding
[...]
## Introduction

MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing open multimodal foundation models.

Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks, including image understanding, OCR, document parsing, visual reasoning, and instruction following, and is particularly outstanding at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.
### Highlights

- **Outstanding Video Understanding**: A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME, MLVU, and EgoSchema.
- **Strong General Multimodal Perception**: Robust image understanding, fine-grained object recognition, OCR, and document parsing.
- **Reliable Instruction Following**: Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
### Note on Variants

> [!IMPORTANT]
> **This is the offline instruction-tuned checkpoint.** It is not the streaming variant. If you are looking for low-latency, real-time interactive video understanding, please refer to the upcoming **MOSS-VL-RealTime** release.
---

## Model Architecture

**MOSS-VL-Instruct-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the **millisecond level**, enabling instantaneous responses to dynamic video streams. Natively supporting **interleaved modalities**, it processes complex sequences of images and videos within a unified pipeline, eliminating the need for heavy pre-processing.

<p align="center">
  <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>
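To make the decoupled design concrete, here is a minimal, self-contained sketch of single-head cross-attention in plain Python, where visual features (encoded once and cached) are queried by text states. This is an illustrative toy under stated assumptions, not MOSS-VL's actual implementation:

```python
import math

def cross_attention(text_q, vis_k, vis_v):
    """Toy single-head cross-attention: each text query attends over
    visual keys and mixes the corresponding visual values. In a
    decoupled design, vis_k / vis_v are computed once per video and reused."""
    d = len(text_q[0])
    out = []
    for q in text_q:
        # Scaled dot-product scores against every visual key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vis_k]
        # Numerically stable softmax over the visual tokens.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of visual values.
        out.append([sum(w * v[j] for w, v in zip(weights, vis_v))
                    for j in range(len(vis_v[0]))])
    return out

# A text query aligned with the first visual token attends to it most.
mixed = cross_attention(text_q=[[1.0, 0.0]],
                        vis_k=[[1.0, 0.0], [0.0, 1.0]],
                        vis_v=[[1.0, 0.0], [0.0, 1.0]])
```

Because the visual keys and values are independent of the text, re-asking a new question about the same video only re-runs the cheap attention step, which is where the latency benefit comes from.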
## Absolute Timestamps

To ensure the model accurately perceives the pacing and duration of events, **MOSS-VL-Instruct-0408** injects **absolute timestamps** alongside each sampled frame, grounding the reasoning process in a **precise temporal reference**.

<p align="center">
  <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
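As a rough illustration of timestamp injection, the sketch below interleaves absolute timestamps with frame placeholders. The `[Ns]` / `<frame_i>` tag format and the helper name are assumptions for illustration only, not the model's actual chat template:

```python
def build_timestamped_prompt(duration_s: float, num_frames: int, question: str) -> str:
    """Uniformly sample frames and prefix each placeholder with its
    absolute timestamp, so the model sees explicit temporal anchors.
    The [Ns] / <frame_i> format here is hypothetical."""
    step = duration_s / num_frames
    parts = []
    for i in range(num_frames):
        t = i * step  # absolute time (seconds) of the sampled frame
        parts.append(f"[{t:.1f}s] <frame_{i}>")
    parts.append(question)
    return "\n".join(parts)

prompt = build_timestamped_prompt(60.0, 4, "When does the person stand up?")
# For a 60 s clip with 4 frames, anchors land at 0.0s, 15.0s, 30.0s, 45.0s.
```

With explicit anchors like these, questions about order, duration, or second-level event localization can be answered against absolute time rather than frame indices.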
## Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).

<p align="center">
  <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
</p>
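In the spirit of such 3D position schemes, the sketch below assigns (t, h, w) indices to a text prefix followed by a grid of video patches. The indexing convention is an assumption for illustration, not the real XRoPE implementation:

```python
def position_ids_3d(num_text_tokens: int, frames: int, grid_h: int, grid_w: int):
    """Assign (t, h, w) position indices. Text tokens advance all three
    axes together (so RoPE reduces to the ordinary 1D case); each video
    patch is indexed by its frame and spatial grid coordinates, with the
    whole grid offset past the text prefix. Convention is hypothetical."""
    ids = [(i, i, i) for i in range(num_text_tokens)]
    base = num_text_tokens
    for t in range(frames):
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((base + t, base + h, base + w))
    return ids

ids = position_ids_3d(num_text_tokens=2, frames=2, grid_h=2, grid_w=2)
# Text tokens: (0,0,0), (1,1,1); first patch of frame 0: (2,2,2).
```

Keeping the three axes synchronized on text while splitting them across time and space for video is what lets one rotary scheme serve both modalities in a shared coordinate system.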
## Model Performance

We conducted a comprehensive evaluation of **MOSS-VL-Instruct-0408** across four key dimensions: Multimodal Perception, Document/OCR, Multimodal Reasoning, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in **general multimodal perception** and **complex video analysis**.

### Key Highlights

* **Leading Video Intelligence**: MOSS-VL achieves a score of **65.8** in Video Understanding, significantly outperforming Qwen3-VL (+2 pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms **Qwen3-VL-8B-Instruct** by **8.3 points**).
* **Outstanding Multimodal Perception**: With a score of **75.1**, MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
* **Robust Multimodal Reasoning**: Achieving **64.3**, MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `CVBench` and `VisuLogic`.
* **Reliable Document Understanding**: While the model is primarily optimized for general perception and video, MOSS-VL still delivers **83.9** on OCR and document analysis, ensuring dependable extraction of text and structured information.
<p align="center">
  <img src="assets/benchmark_table.png" alt="MOSS-VL Benchmark Table" width="100%"/>
</p>

<p align="center">
  <img src="assets/MOSS-VL-benchmark.png" alt="MOSS-VL Benchmark Results" width="100%"/>
</p>
[...]

</details>
## Requirements

[...]

Installation commands:

```bash
[...]
```
## Limitations and Future Work

MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we are actively working on several directions to push it further:

- **Math & Code Reasoning**: While the current checkpoint already exhibits solid general reasoning, we plan to substantially strengthen its mathematical reasoning and code understanding capabilities, especially in multimodal contexts (e.g., reasoning over diagrams, charts, screenshots, and code-bearing visual inputs).
- **Real-Time Streaming Variant**: The upcoming **MOSS-VL-RealTime** checkpoint will extend MOSS-VL to low-latency, streaming video understanding, enabling interactive applications such as live video chat, real-time event detection, and online assistants, complementing this offline checkpoint.
- **RL Post-Training**: We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.
- **Longer Context for Hour-Scale Video**: Continuing to push context scaling so the model can comfortably handle hour-scale and multi-hour videos with consistent temporal grounding.
- **Audio Modality Integration**: Bringing audio understanding into the pipeline, so MOSS-VL can jointly reason over the visual and acoustic streams of a video: speech, ambient sound, music, and their interaction with on-screen events.
- **Parameter Scaling**: Releasing additional model sizes across the MOSS-VL series to cover a wider range of compute budgets and deployment scenarios.

> [!NOTE]
> We welcome community feedback and contributions on any of these directions.
## Citation

```bibtex
[...]
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note = {GitHub repository}
}
```
assets/MOSS-VL-benchmark.png ADDED (Git LFS)

assets/benchmark_table.png ADDED (Git LFS)