# Abstract
Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into generative models. However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved only to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, which aims to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite increasing attention to this task, a comprehensive benchmark for effectively evaluating its performance is still lacking. To bridge this gap, we present MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets spanning three domains: Web Data, Academic Paper Data, and Lifestyle Data. The datasets cover diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of popular generative models on the MRAMG task. In addition, we propose an efficient multimodal answer generation framework that leverages both LLMs and MLLMs to produce multimodal responses.