## 1. Abstract
Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information, from extensive documents.
Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval.
To address this gap, this work introduces a new benchmark, named **MMDocIR**, encompassing two distinct tasks: **page-level** and **layout-level** retrieval.

These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.

## 2. Task Setting

### Page-level Retrieval
The page-level retrieval task is designed to identify the most relevant pages within a document in response to a user query.

### Layout-level Retrieval
The layout-level retrieval task aims to retrieve the most relevant layouts.
Layouts are defined as fine-grained elements such as paragraphs, equations, figures, tables, and charts.
This task allows for more nuanced content retrieval, homing in on specific information that directly answers user queries.

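The two retrieval granularities differ only in what is ranked: whole pages versus fine-grained layout elements. As an illustrative sketch (the `embed` function, candidate dictionaries, and ids below are hypothetical stand-ins, not MMDocIR's actual API or encoders):

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # multi-modal encoder over page screenshots or layout crops.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, candidates: dict[str, str], k: int = 2) -> list[str]:
    # Rank candidates (pages for page-level, layout elements for
    # layout-level retrieval) by similarity to the query; return top-k ids.
    q = embed(query)
    ranked = sorted(candidates, key=lambda cid: cosine(q, embed(candidates[cid])), reverse=True)
    return ranked[:k]

pages = {
    "page_1": "introduction and motivation of the study",
    "page_2": "table of quarterly revenue figures",
    "page_3": "figure showing model architecture",
}
print(retrieve("quarterly revenue table", pages, k=1))  # ['page_2']
```

The same `retrieve` loop serves both tasks; only the candidate pool changes from pages to layout elements.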
## 3. Evaluation Set

### 3.1 Document Analysis
The **MMDocIR** evaluation set includes 313 long documents averaging 65.1 pages, categorized into ten main domains: research reports, administration & industry, tutorials & workshops, academic papers, brochures, financial reports, guidebooks, government documents, laws, and news articles.
Different domains feature distinct distributions of multi-modal information. For instance, research reports, tutorials, workshops, and brochures predominantly contain images, whereas financial and industry documents are table-rich. In contrast, government and legal documents primarily comprise text. Overall, the modality distribution is: Text (60.4%), Image (18.8%), Table (16.7%), and other modalities (4.1%).

### 3.2 Question and Annotation Analysis
The **MMDocIR** evaluation set encompasses 1,658 questions, 2,107 page labels, and 2,638 layout labels. The modalities required to answer these questions are distributed across four categories: Text (44.7%), Image (21.7%), Table (37.4%), and Layout/Meta (11.5%); the percentages sum to more than 100% because some questions require multiple modalities. The "Layout/Meta" category encompasses questions related to layout information and meta-data statistics.
Notably, the dataset poses several challenges: 254 questions necessitate cross-modal understanding, 313 questions demand evidence across multiple pages, and 637 questions require reasoning based on multiple layouts. These complexities highlight the need for advanced multi-modal reasoning and contextual understanding.

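Because many questions carry multiple page and layout labels, retrieval quality is naturally scored per label set with metrics such as recall@k. A minimal sketch (the metric choice and ids here are illustrative assumptions, not the benchmark's official scoring script):

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    # Fraction of gold evidence pages/layouts found in the top-k results.
    hits = sum(1 for cid in retrieved[:k] if cid in gold)
    return hits / len(gold)

# A question whose evidence spans two pages (cf. the 313 multi-page questions).
retrieved = ["page_7", "page_2", "page_9"]
gold = {"page_2", "page_9"}
print(recall_at_k(retrieved, gold, k=3))  # 1.0
print(recall_at_k(retrieved, gold, k=2))  # 0.5
```

Scoring against the full gold set, rather than any single page, is what makes the multi-page and multi-layout questions genuinely harder.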
## 4. Train Set