# Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a notable lack of a comprehensive benchmark persists for effectively evaluating its performance. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web Data, Academic Data, and Lifestyle Data. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, our MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models on the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that leverages both LLMs and MLLMs to generate multimodal responses.

# Evaluation Metric

We use the following _statistical-based metrics_:

- **Image Ordering Score** evaluates whether the order of images inserted into the multimodal answer matches the order of images in the ground truth. Specifically, we compute the weighted edit distance between the two image sequences to reflect the difference in their order.
- **Data Format** (for Lifestyle Data):
  - **Ground-truth**: A = a_1 -> a_2 -> ... -> a_n, where a_i represents the image at the i-th position in the order.
  - **Answer**: B = b_1 -> b_2 -> ... -> b_m, where b_j is not necessarily in A, and m is not necessarily equal to n.
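
The comparison above can be sketched as a dynamic-programming edit distance over the two image-ID sequences. The unit insertion/deletion/substitution weights and the `ordering_score` normalization below are illustrative assumptions, not the benchmark's exact weighting:

```python
def weighted_edit_distance(gt, ans, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Edit distance between ground-truth and answer image sequences.

    Unit weights are an illustrative assumption; the benchmark may
    weight insertions, deletions, and substitutions differently.
    """
    n, m = len(gt), len(ans)
    # d[i][j] = minimal cost of turning gt[:i] into ans[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_del
    for j in range(1, m + 1):
        d[0][j] = j * w_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if gt[i - 1] == ans[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,    # drop an image from gt
                          d[i][j - 1] + w_ins,    # extra image in answer
                          d[i - 1][j - 1] + sub)  # match or substitute
    return d[n][m]

def ordering_score(gt, ans):
    # One plausible normalization into [0, 1] (hypothetical).
    if not gt and not ans:
        return 1.0
    return 1.0 - weighted_edit_distance(gt, ans) / max(len(gt), len(ans))
```

Identical sequences cost 0, so they score 1.0; every misplaced or spurious image adds weighted cost.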

We use the following _LLM-based metrics_:

<img_2_score>1</img_2_score>
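
Per-image judge scores emitted in this tag format can be recovered with a regular-expression pass. The function name and the `<img_i_score>` tag scheme generalized from the single example above are assumptions:

```python
import re

def parse_img_scores(judge_output: str) -> dict[int, int]:
    # Extract {image index: score} from tags like <img_2_score>1</img_2_score>.
    # The \1 backreference requires the closing tag to repeat the same index.
    pattern = r"<img_(\d+)_score>\s*(\d+)\s*</img_\1_score>"
    return {int(idx): int(score) for idx, score in re.findall(pattern, judge_output)}
```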

# Results

In this section, we report the full experimental results, wherein the metrics **Prec.**, **Rec.**, **F1**, **R.L.**, **B.S.**, **Rel.**, **Eff.**, **Comp.**, **Pos.**, and **Avg.** denote image precision, image recall, image F1 score, ROUGE-L, BERTScore, image relevance, image effectiveness, comprehensive score, image position score, and average score, respectively. The metric **Ord.** denotes the image ordering score.
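
Assuming the image precision, recall, and F1 metrics follow the standard set-based definitions (the benchmark's exact matching rules, e.g. for duplicated images, are not shown in this excerpt), they can be sketched as:

```python
def image_prf(answer_imgs, gt_imgs):
    # Set-based precision / recall / F1 over inserted image IDs.
    ans, gt = set(answer_imgs), set(gt_imgs)
    hits = len(ans & gt)
    prec = hits / len(ans) if ans else 0.0
    rec = hits / len(gt) if gt else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1
```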

## Comprehensive performance results on MRAMG-Wit (Web Dataset).

| Framework | Model | MRAMG-Wit | | | | | | | | | |
|------------|------------------------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| | Llama-3.1-8B-Instruct | 29.34 | 26.27 | 26.31 | 33.70 | 81.16 | 32.08 | 30.48 | 51.81 | 32.38 | 38.17 |
| | Llama-3.3-70B-Instruct | 66.83 | 95.80 | 75.47 | 47.98 | 94.79 | 92.03 | 88.03 | 88.93 | 69.34 | 79.91 |

## Comprehensive performance results on MRAMG-Arxiv (Academic Dataset).

| Framework | Model | MRAMG-Arxiv | | | | | | | | | |
|------------|------------------------|-------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| | | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. |

| | Llama-3.3-70B-Instruct | 25.74 | 50.15 | 31.26 | 39.80 | 91.31 | 28.03 | 76.72 | 74.36 | 75.95 | 62.56 | 55.59 |

## Comprehensive performance results on MRAMG-Bench.

| Framework | Model | Web Data | | | | | | | | | | Academic Data | | | | | | | | | | Lifestyle Data | | | | | | | | | | |
|------------|------------------------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|---------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| | | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. | Prec. | Rec. | F1 | R.L. | B.S. | Ord. | Rel. | Eff. | Comp. | Pos. | Avg. |
| Rule-Based | GPT-4o | 43.54 | 37.30 | 39.36 | 48.88 | 92.35 | 38.70 | 35.59 | 77.15 | 43.86 | 50.75 | 55.42 | 63.04 | 57.70 | 44.96 | 94.67 | 69.10 | 67.30 | 84.20 | 75.75 | 68.02 | 47.04 | 63.54 | 50.71 | 51.66 | 92.01 | 43.54 | 77.51 | 74.47 | 79.17 | 77.13 | 65.68 |
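
For this row at least, the **Avg.** column is consistent with a plain arithmetic mean of the nine other per-dataset metrics; e.g., for the Rule-Based GPT-4o entry on Web Data:

```python
# Rule-Based GPT-4o, Web Data: Prec., Rec., F1, R.L., B.S., Rel., Eff., Comp., Pos.
web = [43.54, 37.30, 39.36, 48.88, 92.35, 38.70, 35.59, 77.15, 43.86]
avg = round(sum(web) / len(web), 2)
print(avg)  # 50.75, matching the reported Avg. column
```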