Enhance model card with metadata, paper details, and GitHub README content

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +123 -1
README.md CHANGED
@@ -1,14 +1,136 @@
1
  ---
2
  license: apache-2.0
 
 
3
  ---
 
4
  <h1 align="center">
5
  MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
6
  </h1>
7
  <p align="center">
 
 
 
 
 
 
8
  <a href="https://github.com/MemTensor/MoM">
9
  <img alt="GitHub Repository" src="https://img.shields.io/badge/GitHub-MoM MemReader-blue?logo=github">
10
  </a>
11
  <a href="https://opensource.org/license/apache-2-0">
12
  <img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-green.svg?logo=apache">
13
  </a>
14
- </p>
1
  ---
2
  license: apache-2.0
3
+ pipeline_tag: text-generation
4
+ library_name: transformers
5
  ---
6
+
7
  <h1 align="center">
8
  MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
9
  </h1>
10
  <p align="center">
11
+ <a href="https://arxiv.org/abs/2510.14252">
12
+ <img alt="arXiv Paper" src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg?logo=arxiv">
13
+ </a>
14
+ <a href="https://huggingface.co/papers/2510.14252">
15
+ <img src="https://img.shields.io/badge/Huggingface-Paper-yellow?style=flat-square&logo=huggingface">
16
+ </a>
17
  <a href="https://github.com/MemTensor/MoM">
18
  <img alt="GitHub Repository" src="https://img.shields.io/badge/GitHub-MoM MemReader-blue?logo=github">
19
  </a>
20
  <a href="https://opensource.org/license/apache-2-0">
21
  <img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-green.svg?logo=apache">
22
  </a>
23
+ <br>
24
+ <a href="https://huggingface.co/datasets/Robot2050/MoM">
25
+ <img src="https://img.shields.io/badge/Huggingface-Dataset-FF6F00?style=flat-square&logo=huggingface">
26
+ </a>
27
+ <a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_1.5B">
28
+ <img src="https://img.shields.io/badge/Model-MemReader 1.5B-FF6F00?style=flat-square&logo=huggingface">
29
+ </a>
30
+ <a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_3B">
31
+ <img src="https://img.shields.io/badge/Model-MemReader 3B-FF6F00?style=flat-square&logo=huggingface">
32
+ </a>
33
+ <a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_ratio_7B">
34
+ <img src="https://img.shields.io/badge/Model-MemReader 7B-FF6F00?style=flat-square&logo=huggingface">
35
+ </a>
36
+ </p>
37
+
38
+ This repository hosts the **MemReader** models, part of the **MoM (Mixtures of Scenario-Aware Document Memories)** framework, introduced in the paper [MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems](https://huggingface.co/papers/2510.14252).
39
+
40
+ ## Abstract
41
+
42
+ The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
43
+
44
+ ### 🎯 Who Should Pay Attention to Our Work?
45
+
46
+ This study proposes a framework for overcoming the cognitive bottlenecks of traditional RAG systems. It is relevant to researchers and engineers working to deepen and broaden how LLMs process information, in particular the following groups:
47
+
48
+ **Researchers in NLP and Information Retrieval**: The active memory extraction paradigm proposed in this paper challenges the traditional text processing workflow of "chunk first, then understand", providing a novel research perspective for fields such as document understanding, semantic segmentation, and knowledge representation.
49
+
50
+ **Developers of LLM Applications**: Our work directly addresses the core challenges faced by RAG systems in constructing knowledge-intensive applications, such as semantic incompleteness and logical fragmentation of text chunks. It offers a systematic approach to generating high-quality, structured document memories.
51
+
52
+ **Researchers in SLMs**: Facing the limitations of SLMs in complex cognitive tasks, we demonstrate, through the reverse construction strategy of the **C**hain reasoning **o**f **M**emory extraction (CoM), how to efficiently transfer the deep reasoning capabilities of LLMs to SLMs, opening up new pathways for building lightweight, high-performance intelligent systems.
53
+
54
+ **Scholars in the Interdisciplinary Field of Cognitive Science and AI**: The core of this study lies in simulating the cognitive processes of human experts by transforming unstructured text into hierarchical memories. This provides robust support for exploring human-like cognition, knowledge construction, and reasoning mechanisms in machines.
55
+
56
+ ### ✨ Core Contributions
57
+
58
+ **Proposing Active Memory Extraction**: We advocate transforming text processing in RAG from passive text chunking to active memory extraction. By simulating domain experts, we first achieve a holistic and macroscopic understanding of documents and then construct structured document memories.
59
+
60
+ **Defining Structured Document Memories**: We formally define document memories as a triplet composed of a macroscopic logical outline, highly condensed core content, and semantically coherent atomic chunks.
61
+
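As a rough illustration of the triplet described above, the sketch below models a document memory as a small Python dataclass. The class name `DocumentMemory`, its field names, and the `is_complete` helper are hypothetical and chosen only for this example; they are not taken from the MoM codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentMemory:
    """Hypothetical container for the outline / core content / chunks triplet."""

    outline: List[str]   # macroscopic logical outline, one entry per section
    core_content: str    # highly condensed core content of the document
    atomic_chunks: List[str] = field(default_factory=list)  # semantically coherent chunks

    def is_complete(self) -> bool:
        """Return True when all three components are non-empty."""
        return (
            bool(self.outline)
            and bool(self.core_content.strip())
            and bool(self.atomic_chunks)
        )


# Toy instance with placeholder text.
memory = DocumentMemory(
    outline=["1. Background", "2. Method"],
    core_content="A short summary of the document.",
    atomic_chunks=["Background paragraph.", "Method paragraph."],
)
```

Keeping the three components as separate fields, rather than one concatenated string, is what would later allow each of them to be indexed and retrieved on its own.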
62
+ **Constructing the MoM Framework and CoM**: We design the MoM framework, which generates high-quality memories through a multi-path sampling and multi-dimensional evaluation mechanism. Furthermore, we employ a reverse reasoning strategy to construct the CoM, thereby endowing SLMs with complex cognitive capabilities.
63
+
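The selection step can be pictured as scoring several candidate memories and keeping the best one. The sketch below assumes the candidates have already been sampled and uses two toy stand-in metrics; `toy_clarity`, `toy_completeness`, `select_best`, and the equal weighting are assumptions made for illustration, not the chunk-clarity and extraction-completeness metrics defined in the paper.

```python
from typing import Dict, List


def toy_clarity(chunks: List[str]) -> float:
    """Toy metric: share of chunks that end on a sentence boundary."""
    if not chunks:
        return 0.0
    ended = sum(c.rstrip().endswith((".", "!", "?")) for c in chunks)
    return ended / len(chunks)


def toy_completeness(core: str, document: str) -> float:
    """Toy metric: share of the document's distinct words kept in the core content."""
    doc_words = set(document.lower().split())
    if not doc_words:
        return 0.0
    return len(doc_words & set(core.lower().split())) / len(doc_words)


def select_best(candidates: List[Dict], document: str) -> Dict:
    """Return the candidate with the highest equally weighted score (an assumption)."""
    def score(candidate: Dict) -> float:
        return 0.5 * toy_clarity(candidate["chunks"]) + 0.5 * toy_completeness(
            candidate["core"], document
        )
    return max(candidates, key=score)


document = "MoM builds document memories. It trains small models."
candidates = [
    # Splits mid-sentence and keeps very little of the document.
    {"chunks": ["MoM builds document", "memories. It trains small models."],
     "core": "MoM"},
    # Splits on sentence boundaries and keeps most of the document.
    {"chunks": ["MoM builds document memories.", "It trains small models."],
     "core": "MoM builds memories. It trains models."},
]
best = select_best(candidates, document)
```

In this toy setup the second candidate wins because it scores higher on both stand-in metrics; real metrics would replace the two helper functions without changing the selection logic.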
64
+ **Designing a Three-Layer Retrieval Mechanism and Providing Theoretical Proof**: We develop a three-layer document memory retrieval mechanism encompassing logical outlines, core content, and original text. From a probabilistic modeling perspective, we theoretically demonstrate that this strategy can more effectively reduce information loss and achieve more precise knowledge localization compared to fusing information before retrieval.
65
+
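A minimal sketch of layer-wise retrieval follows, assuming each layer (outline, core content, original text) is kept as its own list of strings and ranked independently against the query. The bag-of-words cosine similarity stands in for a real embedding model and vector database, and the names `cosine` and `retrieve_per_layer` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def retrieve_per_layer(
    query: str, layers: Dict[str, List[str]], top_k: int = 1
) -> Dict[str, List[str]]:
    """Rank every layer separately and keep its top_k entries."""
    results = {}
    for name, texts in layers.items():
        ranked = sorted(texts, key=lambda t: cosine(query, t), reverse=True)
        results[name] = ranked[:top_k]
    return results


# Toy layers with placeholder text.
layers = {
    "outline": ["milvus setup steps", "retrieval evaluation plan"],
    "core_content": ["milvus stores the vectors", "answers are scored against references"],
    "chunks": [
        "start the milvus server before building the index",
        "run retrieval with top k set to 8",
    ],
}
hits = retrieve_per_layer("how to start milvus", layers, top_k=1)
```

Because each layer is queried on its own, a strong match in the outline cannot crowd out a relevant passage in the original text, which is the intuition behind retrieving per layer instead of fusing the layers before retrieval.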
66
+ ## 🛠️ Quick Start
67
+
68
+ - Install the dependencies.
69
+
70
+ ```bash
71
+ pip install -r requirements.txt
72
+ ```
73
+
74
+ - Start the milvus-lite service (vector database)
75
+
76
+ ```bash
77
+ milvus-server --data /Storage/path/of/the/database
78
+ ```
79
+
80
+ - Download the models into their corresponding directories.
81
+ - Adjust the configuration settings to suit your needs.
82
+ - Run `chunk_*.py` and `mom_*.py` to perform text chunking on the domain documents.
83
+
84
+ ```bash
85
+ CUDA_VISIBLE_DEVICES=0 nohup python chunk_gpt.py >> multifiled/qwen3_14B_set.log 2>&1 &
86
+ ```
87
+
88
+ - Then run `quick_start.py` and `retrieval.py` to perform retrieval and question answering.
89
+
90
+ ```bash
91
+ CUDA_VISIBLE_DEVICES=1 nohup python quick_start.py \
92
+ --docs_path 'crud_qwen3_14B_set.json' \
93
+ --collection_name 'crud_qwen3_14B_set' \
94
+ --retrieve_top_k 8 \
95
+ --task 'quest_answer' \
96
+ --construct_index \
97
+ >> log/mom_crud_qwen3_14B_set.log 2>&1 &
98
+
99
+ CUDA_VISIBLE_DEVICES=2 nohup python retrieval.py \
100
+ --data_path 'evaldata/multifieldqa_zh.json' \
101
+ --save_file 'eval/mom_multifieldqa_zh_qwen3_14B_set.json' \
102
+ --docs_path 'multifieldqa_zh_qwen3_14B_set.json' \
103
+ --collection_name 'multifieldqa_zh_qwen3_14B_set' \
104
+ --retrieve_top_k 8 \
105
+ --construct_index \
106
+ >> log/mom_multifieldqa_zh_huagong_qwen3_14B_set.log 2>&1 &
107
+ ```
108
+
109
+ - Open and run `chunk.ipynb` to assess the quality of the results produced by the different chunking strategies.
110
+
111
+ ### 📊 Results
112
+
113
+ We conduct extensive experiments on three QA datasets from different domains, including news and finance.
114
+
115
+ **Performance Across Domains**: MemReader performs strongly on plain-text QA tasks.
116
+
117
+ **Effectiveness of Evaluation Metrics**: The proposed memory evaluation metrics are shown to assess the quality of memory chunks effectively, providing a reliable basis for automatically screening high-quality document memories.
118
+
119
+ **Information Supportiveness of Retrieved Content**: The results indicate that the memories extracted and organized by MoM can provide more comprehensive information for downstream tasks.
120
+
121
+ ![Experimental Results](https://github.com/MemTensor/MoM/raw/main/image/experimental_results.png)
122
+
123
+ ## Citation
124
+
125
+ If our work has been helpful to you, please consider citing it. Your citation serves as encouragement for our research.
126
+ ```bibtex
127
+ @misc{zhao2024mom,
128
+ title={MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems},
129
+ author={Jihao Zhao and Tianyi Long and Jinrui Liu and Xiusi Chen and Guannan Yang and Junyu Luo and Zhiping Xiao and Wei Ju and Ming Zhang},
130
+ year={2025},
131
+ eprint={2510.14252},
132
+ archivePrefix={arXiv},
133
+ primaryClass={cs.CL},
134
+ url={https://arxiv.org/abs/2510.14252},
135
+ }
136
+ ```