LbhYqh committed (verified)
Commit f022cbd · 1 Parent(s): 01d9f9f

Update README.md

Files changed (1): README.md (+6 −5)
@@ -15,7 +15,7 @@ thumbnail: >-
 
 # Abstract
-Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite %increasing growing attention to this challenging task, a notable lack of a comprehensive benchmark persists for effectively evaluating its performance. To bridge this gap, we provide the MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web Data, Academic Paper Data, and Lifestyle Data. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, our MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models in the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that leverages both LLMs and MLLMs to generate multimodal responses.
+Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, which aims to generate multimodal answers combining both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a comprehensive benchmark for effectively evaluating its performance is still notably lacking. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web Data, Academic Data, and Lifestyle Data. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models on the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that leverages both LLMs and MLLMs to generate multimodal responses.
 
 # Evaluation Metric
 
@@ -67,7 +67,7 @@ We use the following _statistical-based metrics_:
 - **Image Ordering Score** evaluates whether the order of images inserted into the multimodal answer matches the order of images in the ground truth.
 Specifically, we compute the weighted edit distance between the two image sequences to reflect the difference in their order.
 
-- **Data Format** (For lifestyle datasets):
+- **Data Format** (For Lifestyle Data):
   - **Ground-truth**: A = a_1 -> a_2 -> ... -> a_n, where a_i represents the image at the i-th position in the order.
   - **Answer**: B = b_1 -> b_2 -> ... -> b_m, where b_j is not necessarily in A, and m is not necessarily equal to n.

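The weighted edit distance over the two image sequences above can be sketched as follows. This is a minimal illustration under stated assumptions — unit costs for insertion, deletion, and substitution, and normalization by the longer sequence length — since the exact weights used by the benchmark are not specified here:

```python
def image_ordering_score(A: list[str], B: list[str]) -> float:
    """Sketch of the Image Ordering Score between the ground-truth image
    sequence A and the answer's image sequence B.

    Assumptions (not spelled out in the README): unit costs for
    insert/delete/substitute, and the distance is normalized by
    max(len(A), len(B)) and inverted so 1.0 means a perfect match.
    """
    n, m = len(A), len(B)
    if n == 0 and m == 0:
        return 1.0  # both answers contain no images: treat as a match
    # dp[i][j] = edit distance between the prefixes A[:i] and B[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all of A[:i]
    for j in range(m + 1):
        dp[0][j] = j  # insert all of B[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if A[i - 1] == B[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # delete A[i-1]
                dp[i][j - 1] + 1,        # insert B[j-1]
                dp[i - 1][j - 1] + cost, # match or substitute
            )
    return 1.0 - dp[n][m] / max(n, m)
```

For example, an answer that inserts the same images in the same order scores 1.0, while reversing a two-image sequence scores 0.0.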
@@ -331,7 +331,8 @@ We use the following _LLM-based metrics_:
 <img_2_score>1</img_2_score>
 
 # Results
-In this section, we give the full experiment results, wherein the metrics of **Prec.**, **Rec.**, **F1.**, **R.L.**, **B.S.**, **Rel.**, **Eff.**, **Comp.**, **Pos.**, and **Avg.** represent image precision, image recall, image F1 score, rouge-l, BERTScore, image relevance, image effectiveness, comprehensive score, image position score, and average score, respectively. Specifically, the metric **Ord.** represents image ordering score.
+In this section, we give the full experiment results. The metrics **Prec.**, **Rec.**, **F1**, **R.L.**, **B.S.**, **Rel.**, **Eff.**, **Comp.**, **Pos.**, and **Avg.** denote image precision, image recall, image F1 score, ROUGE-L, BERTScore, image relevance, image effectiveness, comprehensive score, image position score, and average score, respectively.
+In addition, the metric **Ord.** denotes the image ordering score.
 
 ## Comprehensive performance results on MRAMG-Wit (Web Dataset).
 | Framework | Model | MRAMG-Wit | | | | | | | | | |
 |------------|------------------------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
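The image-level precision, recall, and F1 metrics reported in the tables below can be sketched with standard set-based definitions — an assumption, since the exact formulas are not spelled out in this README:

```python
def image_prf(answer_imgs: list[str], truth_imgs: list[str]) -> tuple[float, float, float]:
    """Sketch of the image Prec./Rec./F1 metrics, assuming the standard
    set-based definitions over image identifiers (an assumption; the
    README does not give the exact formulas).
    """
    ans, truth = set(answer_imgs), set(truth_imgs)
    hits = len(ans & truth)                       # correctly inserted images
    prec = hits / len(ans) if ans else 0.0        # fraction of answer images that are correct
    rec = hits / len(truth) if truth else 0.0     # fraction of ground-truth images recovered
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For instance, an answer inserting images {a, b, c} against a ground truth of {a, b} yields precision 2/3, recall 1.0, and F1 0.8.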
@@ -426,7 +427,7 @@ In this section, we give the full experiment results, wherein the metrics of **P
 | | Llama-3.1-8B-Instruct | 29.34 | 26.27 | 26.31 | 33.70 | 81.16 | 32.08 | 30.48 | 51.81 | 32.38 | 38.17 |
 | | Llama-3.3-70B-Instruct | 66.83 | 95.80 | 75.47 | 47.98 | 94.79 | 92.03 | 88.03 | 88.93 | 69.34 | 79.91 |
 
-## Comprehensive performance results on MRAMG-Arxiv(Academic Paper Dataset).
+## Comprehensive performance results on MRAMG-Arxiv (Academic Dataset).
 | Framework | Model | MRAMG-Arxiv | | | | | | | | | |
 |------------|------------------------|-------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
 | | | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. |
@@ -521,7 +522,7 @@ In this section, we give the full experiment results, wherein the metrics of **P
 | | Llama-3.3-70B-Instruct | 25.74 | 50.15 | 31.26 | 39.80 | 91.31 | 28.03 | 76.72 | 74.36 | 75.95 | 62.56 | 55.59 |
 
 ## Comprehensive performance results on MRAMG-Bench.
-| Framework | Model | Web Data | | | | | | | | | | Academic Paper Data | | | | | | | | | | Lifestyle Data | | | | | | | | | | |
+| Framework | Model | Web Data | | | | | | | | | | Academic Data | | | | | | | | | | Lifestyle Data | | | | | | | | | | |
 |------------|------------------------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|---------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
 | | | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. | Prec. | Rec. | F1 | R.L. | B.S. | Rel. | Eff. | Comp. | Pos. | Avg. | Prec. | Rec. | F1 | R.L. | B.S. | Ord. | Rel. | Eff. | Comp. | Pos. | Avg. |
 | Rule-Based | GPT-4o | 43.54 | 37.30 | 39.36 | 48.88 | 92.35 | 38.70 | 35.59 | 77.15 | 43.86 | 50.75 | 55.42 | 63.04 | 57.70 | 44.96 | 94.67 | 69.10 | 67.30 | 84.20 | 75.75 | 68.02 | 47.04 | 63.54 | 50.71 | 51.66 | 92.01 | 43.54 | 77.51 | 74.47 | 79.17 | 77.13 | 65.68 |
 