CCCCyx committed (verified) · Commit 22b43f9 · 1 parent: db9bcfa

Update README.md

Files changed (1): README.md (+2, -3)
README.md CHANGED
```diff
@@ -275,9 +275,8 @@ texts = [item["text"] for item in result["results"]]
 
 MOSS-VL-Base-0408 is a pretrained base checkpoint, and we are actively improving several core capabilities for future iterations:
 
-- 📄 **Stronger OCR, Especially for Long Documents** — We plan to further improve text recognition, document parsing, and long-document understanding, with a particular focus on maintaining accuracy and consistency over lengthy structured inputs.
-- 🎬 **Expanded Long-Video Understanding** — We aim to extend the model's ability on long-form video comprehension, including stronger temporal reasoning, better event tracking across extended durations, and more robust long-context video understanding.
-- 🌍 **Richer World Knowledge** — We will continue to enhance the model's general world knowledge so it can provide better grounded multimodal understanding and stronger performance on knowledge-intensive visual-language tasks.
+- 📄 Stronger OCR, Especially for Long Documents — We plan to further improve text recognition, document parsing, and long-document understanding. A key focus is achieving near-lossless information extraction for extremely long and structurally complex inputs, such as accurately parsing texts, tables, and mathematical layouts from multi-page academic papers (dozens of pages) or dense PDF reports without degrading context or structural integrity.
+- 🎬 Expanded Long-Video Understanding — We aim to significantly extend the model's capacity for long-form video comprehension. This includes advancing temporal reasoning and cross-frame event tracking to support the continuous analysis of videos lasting several hours to dozens of hours—such as full-length movies, lengthy meetings, or extended surveillance streams—enabling robust retrieval and understanding over ultra-long visual contexts.
 
 > [!NOTE]
 > We expect future releases to continue strengthening the base model itself while also enabling stronger downstream aligned variants built on top of it.
```