SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published Mar 14, 2025 • 155
view article Article TimeScope: How Long Can Your Video Large Multimodal Model Go? +2 Jul 23, 2025 • 48
view article Article NVIDIA Releases 3 Million Sample Dataset for OCR, Visual Question Answering, and Captioning Tasks Aug 11, 2025 • 75
view article Article Introducing Idefics2: A Powerful 8B Vision-Language Model for the community +1 Apr 15, 2024 • 191
PubTables-1M: Towards comprehensive table extraction from unstructured documents Paper • 2110.00061 • Published Sep 30, 2021 • 3
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking Paper • 2204.08387 • Published Apr 18, 2022 • 8
DiT: Self-supervised Pre-training for Document Image Transformer Paper • 2203.02378 • Published Mar 4, 2022 • 3
LayoutLM: Pre-training of Text and Layout for Document Image Understanding Paper • 1912.13318 • Published Dec 31, 2019 • 5