Commit fe151c7 by titantv090 (rmanluo)

Duplicate from rmanluo/GFM-RAG-8M

Co-authored-by: LINHAO LUO <rmanluo@users.noreply.huggingface.co>
Files changed (4)
  1. .gitattributes +35 -0
  2. README.md +316 -0
  3. config.json +31 -0
  4. model.pth +3 -0
.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
license: apache-2.0
---

# GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation

<div align="left">
<p>
<a href='https://rmanluo.github.io/gfm-rag/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://www.arxiv.org/abs/2502.01113'><img src='https://img.shields.io/badge/arXiv-2502.01113-b31b1b'></a>
</p>
<p>
<img src='https://img.shields.io/github/stars/RManLuo/gfm-rag?color=green&style=social' />
<a href="https://pypi.org/project/gfmrag/">
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/gfmrag">
</a>
<a href="https://pypi.org/project/gfmrag/">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/gfmrag">
</a>
<a href="https://github.com/RManLuo/gfm-rag/issues">
<img alt="GitHub Issues" src="https://img.shields.io/github/issues/RManLuo/gfm-rag">
</a>
<a href="https://github.com/RManLuo/gfm-rag/discussions">
<img alt="GitHub Discussions" src="https://img.shields.io/github/discussions/RManLuo/gfm-rag">
</a>
</p>
</div>

GFM-RAG is the first graph foundation model (GFM)-powered RAG pipeline, which uses a graph neural network to reason over knowledge graphs and retrieve relevant documents for question answering.

<img src="https://github.com/RManLuo/gfm-rag/blob/main/docs/images/intro.png?raw=true" width = "800" />

We first build a knowledge graph index (KG-index) from the documents to capture the relationships between pieces of knowledge. We then feed the query and the constructed KG-index into the pre-trained GFM retriever to obtain relevant documents for LLM generation. The GFM retriever undergoes large-scale pre-training and can be directly applied to unseen datasets without fine-tuning.

For more details, please refer to our [project page](https://rmanluo.github.io/gfm-rag/) and [paper](https://www.arxiv.org/abs/2502.01113).

## 🎉 News
- **[2025-06-03]** We have released a new version of [GFM-RAG (2025-06-03)](https://huggingface.co/rmanluo/GFM-RAG-8M/commit/62cf6398c5875af1c4e04bbb35e4c3b21904d4ac), which is pre-trained on 286 KGs.

## Features

- **Graph Foundation Model (GFM)**: A graph neural network-based retriever that can reason over the KG-index.
- **Knowledge Graph Index**: A knowledge graph index that captures the relationships between pieces of knowledge.
- **Efficiency**: The GFM-RAG pipeline performs multi-hop reasoning in a single retrieval step, making it highly efficient.
- **Generalizability**: GFM-RAG can be directly applied to unseen datasets without fine-tuning.
- **Transferability**: GFM-RAG can be fine-tuned on your own dataset to improve performance in specific domains.
- **Compatibility**: GFM-RAG is compatible with arbitrary agent-based frameworks for multi-step reasoning.
- **Interpretability**: GFM-RAG can illustrate the captured reasoning paths for better understanding.

## Dependencies

- Python 3.12
- CUDA 12 or above

## Installation

Conda provides an easy way to install the CUDA development toolkit, which is required by GFM-RAG.

Install the package:
```bash
conda create -n gfmrag python=3.12
conda activate gfmrag
conda install cuda-toolkit -c nvidia/label/cuda-12.4.1 # Replace with your desired CUDA version
pip install gfmrag
```

Install the remaining dependencies:
```bash
TORCH=$(python -c "import torch; print(torch.__version__)")
pip install torch_scatter torch_sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
```

## Quick Start

### Prepare Data

You need to prepare the following files:

- `dataset_corpus.json`: A JSON file containing the entire document corpus.
- `train.json` (optional): A JSON file containing the training data.
- `test.json` (optional): A JSON file containing the test data.

Place your files in the following structure:
```
data_name/
├── raw/
│   ├── dataset_corpus.json
│   ├── train.json # (optional)
│   └── test.json # (optional)
└── processed/ # Output directory
```

#### `dataset_corpus.json`

The `dataset_corpus.json` file is a dictionary where each key is the title or unique ID of a document and the value is the text of that document.

```json
{
    "Fred Gehrke": "Clarence Fred Gehrke (April 24, 1918 – February 9, 2002) was an American football player and executive. He played in the National Football League (NFL) for the Cleveland / Los Angeles Rams, San Francisco 49ers and Chicago Cardinals from 1940 through 1950. To boost team morale, Gehrke designed and painted the Los Angeles Rams logo in 1948, which was the first painted on the helmets of an NFL team. He later served as the general manager of the Denver Broncos from 1977 through 1981. He is the great-grandfather of Miami Marlin Christian Yelich",
    "Manny Machado": "Manuel Arturo Machado (born July 6, 1992) is an American professional baseball third baseman and shortstop for the Baltimore Orioles of Major League Baseball (MLB). He attended Brito High School in Miami and was drafted by the Orioles with the third overall pick in the 2010 Major League Baseball draft. He bats and throws right-handed.",
    ...
}
```
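
For illustration only, here is a minimal sketch of loading and sanity-checking this file (the `load_corpus` helper is hypothetical, not part of the gfmrag API):

```python
import json


def load_corpus(path: str) -> dict[str, str]:
    # Load dataset_corpus.json: a mapping from document title/ID to text.
    with open(path, encoding="utf-8") as f:
        corpus = json.load(f)
    if not isinstance(corpus, dict):
        raise ValueError("dataset_corpus.json must be a single JSON object")
    for title, text in corpus.items():
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"document {title!r} has empty or non-string text")
    return corpus
```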

#### `train.json` and `test.json` (optional)

If you want to train and evaluate the model, you need to provide training and test data as JSON files. Each entry should contain the following fields:

- `id`: A unique identifier for the example.
- `question`: The question or query.
- `supporting_facts`: A list of supporting facts for the question. Each supporting fact is the title of a document that can be found in the `dataset_corpus.json` file.

Each entry can also contain additional fields depending on the task, for example:

- `answer`: The answer to the question.

Any additional fields are carried through the subsequent steps of the pipeline.

Example:
```json
[
    {
        "id": "5adf5e285542992d7e9f9323",
        "question": "When was the judge born who made notable contributions to the trial of the man who tortured, raped, and murdered eight student nurses from South Chicago Community Hospital on the night of July 13-14, 1966?",
        "answer": "June 4, 1931",
        "supporting_facts": [
            "Louis B. Garippo",
            "Richard Speck"
        ]
    },
    {
        "id": "5a7f7b365542992097ad2f80",
        "question": "Did the Beaulieu Mine or the McIntyre Mines yield gold and copper?",
        "answer": "The McIntyre also yielded a considerable amount of copper",
        "supporting_facts": [
            "Beaulieu Mine",
            "McIntyre Mines"
        ]
    },
    ...
]
```
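
Because each supporting-fact title must match a key in `dataset_corpus.json`, it is worth cross-checking the two files before indexing. A minimal sketch (the helper below is hypothetical, not part of the gfmrag API):

```python
def find_missing_titles(examples: list[dict], corpus: dict[str, str]) -> list[str]:
    # Return "<id>: <title>" for every supporting fact absent from the corpus.
    missing = []
    for ex in examples:
        for title in ex.get("supporting_facts", []):
            if title not in corpus:
                missing.append(f"{ex['id']}: {title}")
    return missing
```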

### Index Dataset

You need to create a KG-index [configuration file](gfmrag/workflow/config/stage1_index_dataset.yaml).

Details of the configuration parameters are explained on the [KG-index Configuration](https://rmanluo.github.io/gfm-rag/config/kg_index_config/) page.

```bash
python -m gfmrag.workflow.stage1_index_dataset
```

This step performs two main tasks:

1. Creates and saves the knowledge graph related files (`kg.txt` and `document2entities.json`) from the `dataset_corpus.json` file.
2. Identifies the query entities and supporting entities in the training and test data, if available in the raw data directory.

Files created:

- `kg.txt`: Contains the knowledge graph triples.
- `document2entities.json`: Maps documents to their entities.
- `train.json`: Processed training data (if raw data exists).
- `test.json`: Processed test data (if raw data exists).

Directory structure:
```
root/
└── data_name/
    ├── raw/
    │   ├── dataset_corpus.json
    │   ├── train.json (optional)
    │   └── test.json (optional)
    └── processed/
        └── stage1/
            ├── kg.txt
            ├── document2entities.json
            ├── train.json
            └── test.json
```
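
The exact on-disk format of `kg.txt` is not specified here; assuming one tab-separated `(head, relation, tail)` triple per line, the file could be inspected with a sketch like this:

```python
def read_triples(path: str) -> list[tuple[str, str, str]]:
    # Parse kg.txt, assuming one tab-separated (head, relation, tail) per line.
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:
                triples.append((parts[0], parts[1], parts[2]))
    return triples
```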

### GFM-RAG Retrieval

You need to create a [configuration file](gfmrag/workflow/config/stage3_qa_ircot_inference.yaml) for inference.

Details of the configuration parameters are explained on the [GFM-RAG Configuration](https://rmanluo.github.io/gfm-rag/config/gfmrag_retriever_config/) page.

#### Initialize GFMRetriever

You can initialize the GFMRetriever with the following code. It will load the pre-trained GFM-RAG model and the KG-index for retrieval.

```python
import logging
import os

import hydra
from hydra.core.hydra_config import HydraConfig
from omegaconf import DictConfig, OmegaConf

from gfmrag import GFMRetriever

logger = logging.getLogger(__name__)


@hydra.main(
    config_path="config", config_name="stage3_qa_ircot_inference", version_base=None
)
def main(cfg: DictConfig) -> None:
    output_dir = HydraConfig.get().runtime.output_dir
    logger.info(f"Config:\n {OmegaConf.to_yaml(cfg)}")
    logger.info(f"Current working directory: {os.getcwd()}")
    logger.info(f"Output directory: {output_dir}")

    gfmrag_retriever = GFMRetriever.from_config(cfg)
```

#### Document Retrieval

You can use the GFM-RAG retriever to reason over the KG-index and retrieve documents for a given query.
```python
docs = gfmrag_retriever.retrieve("Who is the president of France?", top_k=5)
```
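
The structure of the returned `docs` is not documented here; assuming each item behaves like a mapping with `title` and `content` keys, assembling an LLM context from the top-k documents might look like this sketch:

```python
def build_context(docs, max_chars: int = 4000) -> str:
    # Concatenate retrieved documents into one prompt context, stopping
    # before the total length exceeds max_chars. The doc schema is assumed.
    parts, total = [], 0
    for doc in docs:
        chunk = f"{doc['title']}: {doc['content']}"
        if total + len(chunk) > max_chars:
            break
        parts.append(chunk)
        total += len(chunk)
    return "\n\n".join(parts)
```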
#### Question Answering

```python
from hydra.utils import instantiate

from gfmrag.llms import BaseLanguageModel
from gfmrag.prompt_builder import QAPromptBuilder

llm = instantiate(cfg.llm)
qa_prompt_builder = QAPromptBuilder(cfg.qa_prompt)

message = qa_prompt_builder.build_input_prompt(current_query, retrieved_docs)
answer = llm.generate_sentence(message)  # Answer: "Emmanuel Macron"
```

## GFM Fine-tuning

During fine-tuning, the GFM model is trained on the query-document pairs in `train.json` from the labeled dataset to learn complex relationships for retrieval.

Fine-tuning can be conducted on your own dataset to improve the performance of the model in your specific domain.

An example of the training data:

```json
[
    {
        "id": "5abc553a554299700f9d7871",
        "question": "Kyle Ezell is a professor at what School of Architecture building at Ohio State?",
        "answer": "Knowlton Hall",
        "supporting_facts": [
            "Knowlton Hall",
            "Kyle Ezell"
        ],
        "question_entities": [
            "kyle ezell",
            "architectural association school of architecture",
            "ohio state"
        ],
        "supporting_entities": [
            "10 million donation",
            "2004",
            "architecture",
            "austin e knowlton",
            "austin e knowlton school of architecture",
            "bachelor s in architectural engineering",
            "city and regional planning",
            "columbus ohio united states",
            "ives hall",
            "july 2002",
            "knowlton hall",
            "ksa"
        ]
    },
    ...
]
```

You need to create a [configuration file](gfmrag/workflow/config/stage2_qa_finetune.yaml) for fine-tuning.

Details of the configuration parameters are explained on the [GFM-RAG Fine-tuning Configuration](https://rmanluo.github.io/gfm-rag/config/gfmrag_finetune_config/) page.

You can fine-tune the pre-trained GFM-RAG model on your dataset using the following commands:

```bash
python -m gfmrag.workflow.stage2_qa_finetune
# Multi-GPU training
torchrun --nproc_per_node=4 -m gfmrag.workflow.stage2_qa_finetune
# Multi-node multi-GPU training
torchrun --nproc_per_node=4 --nnodes=2 -m gfmrag.workflow.stage2_qa_finetune
```

## Acknowledgements

We greatly appreciate the following repositories for their contributions to this project:

* [DeepGraphLearning/ULTRA](https://github.com/DeepGraphLearning/ULTRA): The ULTRA model is used as the base GNN model for the GFM retriever.
* [OSU-NLP-Group/HippoRAG](https://github.com/OSU-NLP-Group/HippoRAG): We drew great inspiration from the KG construction process of HippoRAG.
* [microsoft/graphrag](https://github.com/microsoft/graphrag): We drew great inspiration from the project design of GraphRAG.

## Citation

If you find this repository helpful, please consider citing our paper:

```bibtex
@article{luo2025gfmrag,
  title={GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation},
  author={Luo, Linhao and Zhao, Zicheng and Haffari, Gholamreza and Phung, Dinh and Gong, Chen and Pan, Shirui},
  journal={arXiv preprint arXiv:2502.01113},
  year={2025}
}
```
config.json ADDED
{
    "text_emb_model_config": {
        "_target_": "gfmrag.text_emb_models.BaseTextEmbModel",
        "text_emb_model_name": "sentence-transformers/all-mpnet-base-v2",
        "normalize": false,
        "batch_size": 32,
        "query_instruct": null,
        "passage_instruct": null,
        "model_kwargs": null
    },
    "model_config": {
        "_target_": "gfmrag.models.GNNRetriever",
        "entity_model": {
            "_target_": "gfmrag.ultra.models.QueryNBFNet",
            "input_dim": 512,
            "hidden_dims": [512, 512, 512, 512, 512, 512],
            "message_func": "distmult",
            "aggregate_func": "sum",
            "short_cut": true,
            "layer_norm": true
        },
        "rel_emb_dim": 768
    }
}
model.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:578b1af29201beda2ef61af7fadbd7261a4964c3fcd1c68a22b90a62f6ff1247
size 32597750