CCCCyx committed · verified
Commit d60acc8 · 1 Parent(s): 22b43f9

Update README.md

Files changed (1): README.md (+2 −2)
README.md CHANGED
@@ -275,8 +275,8 @@ texts = [item["text"] for item in result["results"]]
 
 MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations:
 
-- 📄 Stronger OCR, Especially for Long Documents — We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
-- 🎬 Expanded Long-Video Understanding — We aim to significantly extend the model's capacity for long-form video comprehension. This includes advancing temporal reasoning and cross-frame event tracking to support the continuous analysis of videos lasting several hours to dozens of hours—such as full-length movies, lengthy meetings, or extended surveillance streams—enabling robust retrieval and understanding over ultra-long visual contexts.
+- 📄 **Stronger OCR, Especially for Long Documents** — We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction and understanding for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
+- 🎬 **Expanded Extremely Long Video Understanding** — We aim to significantly extend the model's capacity for comprehending extremely long videos spanning several hours to dozens of hours. This includes advancing temporal reasoning and cross-frame event tracking for continuous analysis of full-length movies, lengthy meetings, or extended surveillance streams, enabling robust retrieval and understanding over ultra-long visual contexts.
 
 > [!NOTE]
 > We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it.