Y-Tarl committed on
Commit
6ade330
·
verified ·
1 Parent(s): 49ea6c9

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +116 -0
README.md ADDED
@@ -0,0 +1,116 @@
+ ---
+ license: apache-2.0
+ task_categories:
+ - graph-ml
+ - tabular-classification
+ language:
+ - en
+ tags:
+ - biology
+ - bioinformatics
+ - knowledge-graph
+ - graph-neural-networks
+ - drug-discovery
+ - medical
+ - disease-gene-prediction
+ - protein-chemical-interaction
+ - medical-ontology
+ size_categories:
+ - 100K<n<1M
+ ---
+
+ # BioGraphFusion Dataset
+
+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+ [![Paper](https://img.shields.io/badge/Paper-Bioinformatics-green.svg)](https://doi.org/10.1093/bioinformatics/btaf408)
+ [![arXiv](https://img.shields.io/badge/arXiv-2507.14468-b31b1b.svg)](https://arxiv.org/abs/2507.14468)
+
+ ## Dataset Description
+
+ This dataset contains the benchmark data used in the paper **"BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning"**, published in *Bioinformatics*.
+
+ ## Dataset Structure
+
+ The dataset includes three biomedical knowledge graph completion tasks with background knowledge integration:
+
+ ### 1. Disease-Gene Prediction (DisGeNet_cv)
+
+ - **Task**: Disease-gene association prediction
+ - **Background Knowledge**: Drug-Disease relationships from SIDER (14,631 triples) + Protein-Chemical relationships from STITCH (277,745 triples)
+ - **Main Dataset**: DisGeNet (130,820 triples) focusing on gene targets
+ - **Description**: Predicts disease-gene associations using multi-source biological knowledge
+
+ ### 2. Protein-Chemical Interaction (STITCH)
+
+ - **Task**: Protein-chemical interaction prediction
+ - **Background Knowledge**: Drug-Disease relationships from SIDER (14,631 triples) + Disease-Gene relationships from DisGeNet (130,820 triples)
+ - **Main Dataset**: STITCH (23,074 triples) focusing on chemical targets
+ - **Description**: Predicts protein-chemical interactions with integrated disease and gene knowledge
+
+ ### 3. Medical Ontology Reasoning (UMLS)
+
+ - **Task**: Medical concept reasoning
+ - **Background Knowledge**: Various medical relationships from UMLS (4,006 triples)
+ - **Main Dataset**: UMLS (2,523 triples) with multi-domain entities
+ - **Description**: Reasons about medical concepts and their hierarchical relationships
+
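As a rough illustration of the structure described above, each task can be thought of as a set of (head, relation, tail) triples in which background triples are merged with the main dataset's training triples, and held-out main triples become completion queries. The sketch below is an assumption, not the authors' pipeline: the `Triple` type, the helper names, the entity identifiers, and the relation names are all hypothetical, and the actual file layout may differ.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One knowledge graph fact: (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


# Hypothetical toy triples; identifiers and relation names are made up.
background = [
    Triple("drug:D1", "drug_disease", "disease:X"),
    Triple("protein:P1", "protein_chemical", "chemical:C1"),
]
main = [
    Triple("disease:X", "disease_gene", "gene:G1"),
    Triple("disease:Y", "disease_gene", "gene:G2"),
]


def build_graph(background_triples, main_train_triples):
    """Merge background and main training triples, dropping duplicates
    while preserving the original order."""
    return list(dict.fromkeys([*background_triples, *main_train_triples]))


def completion_queries(main_test_triples):
    """Turn held-out triples into (head, relation) queries whose tail
    entity is to be predicted."""
    return [(t.head, t.relation) for t in main_test_triples]


# Assumed split: first main triple for training, second held out.
graph = build_graph(background, main[:1])
queries = completion_queries(main[1:])
print(len(graph), queries)
```

Here the background triples only enrich the graph that a model sees, while evaluation queries are drawn solely from the main dataset; this mirrors the "Background Knowledge" versus "Main Dataset" distinction in the task descriptions.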
+ ## Dataset Statistics
+
+ | Dataset | Task | Background Knowledge (Triples) | Main Dataset (Triples) | Target Entities | Total Triples (Approx.) |
+ |---------|------|--------------------------------|------------------------|-----------------|-------------------------|
+ | **Disease-Gene Prediction** | Disease-gene association prediction | SIDER drug-disease relationships (14,631) + STITCH protein-chemical relationships (277,745) | DisGeNet (130,820) | Gene | ~423K |
+ | **Protein-Chemical Interaction** | Protein-chemical interaction prediction | SIDER drug-disease relationships (14,631) + DisGeNet disease-gene relationships (130,820) | STITCH (23,074) | Chemical | ~168K |
+ | **Medical Ontology Reasoning** | Medical concept reasoning | UMLS various medical relationships (4,006) | UMLS (2,523) | Multi-domain entities | ~6.5K |
+
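The approximate totals in the table can be checked by summing the per-source counts. The snippet below is a small sanity check that uses only the numbers listed above; the variable names are illustrative.

```python
# Triple counts copied from the statistics table.
SIDER = 14_631
STITCH_BACKGROUND = 277_745
STITCH_MAIN = 23_074
DISGENET = 130_820
UMLS_BACKGROUND = 4_006
UMLS_MAIN = 2_523

# Total = background knowledge triples + main dataset triples.
totals = {
    "Disease-Gene Prediction": SIDER + STITCH_BACKGROUND + DISGENET,
    "Protein-Chemical Interaction": SIDER + DISGENET + STITCH_MAIN,
    "Medical Ontology Reasoning": UMLS_BACKGROUND + UMLS_MAIN,
}

for task, total in totals.items():
    print(f"{task}: {total:,} triples")
```

The sums are 423,196, 168,525, and 6,529 triples, which correspond to the ~423K, ~168K, and ~6.5K figures reported in the table.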
+ ## Usage
+
+ ### Loading the Dataset
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the complete dataset
+ dataset = load_dataset("Y-TARL/BioGraphFusion")
+
+ # Load a specific task by its configuration name
+ disgenet_data = load_dataset("Y-TARL/BioGraphFusion", "Disease-Gene")
+ stitch_data = load_dataset("Y-TARL/BioGraphFusion", "Protein-Chemical")
+ umls_data = load_dataset("Y-TARL/BioGraphFusion", "umls")
+ ```
+
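The card does not list split names or column names, so it can help to inspect them after loading. The following is a minimal sketch under stated assumptions: `TASK_CONFIGS`, `load_task`, and `describe` are hypothetical helpers, the configuration names are copied from the loading example above, and `load_dataset` is imported lazily because it requires the `datasets` package and network access to the Hub.

```python
# Configuration names as they appear in the loading example (not verified here).
TASK_CONFIGS = {
    "disease-gene": "Disease-Gene",
    "protein-chemical": "Protein-Chemical",
    "umls": "umls",
}


def load_task(task, repo_id="Y-TARL/BioGraphFusion"):
    """Load one task configuration from the Hub (hypothetical convenience wrapper)."""
    from datasets import load_dataset  # lazy import; needs `pip install datasets`

    return load_dataset(repo_id, TASK_CONFIGS[task])


def describe(dataset_dict):
    """Return {split_name: (num_rows, column_names)} for a DatasetDict-like mapping."""
    return {
        split: (len(data), list(getattr(data, "column_names", [])))
        for split, data in dataset_dict.items()
    }
```

Calling `describe(load_task("umls"))` would then report whatever splits and columns the repository actually provides, without assuming them in advance.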
+ ## Citation
+
+ If you use this dataset in your research, please cite our paper:
+
+ ```bibtex
+ @article{lin2025biographfusion,
+ title={BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning},
+ author={Lin, Yitong and He, Jiaying and Chen, Jiahe and Zhu, Xinnan and Zheng, Jianwei and Tao, Bo},
+ journal={Bioinformatics},
+ pages={btaf408},
+ year={2025},
+ publisher={Oxford University Press}
+ }
+ ```
+
+ ## Related Resources
+
+ - **Paper**: [Bioinformatics](https://doi.org/10.1093/bioinformatics/btaf408)
+ - **Preprint**: [arXiv:2507.14468](https://arxiv.org/abs/2507.14468)
+ - **Code**: [GitHub Repository](https://github.com/Y-TARL/BioGraphFusion)
+
+ ## License
+
+ This dataset is released under the Apache 2.0 License.
+
+ ## Acknowledgements
+
+ We thank the original data providers:
+
+ - DisGeNet for disease-gene associations
+ - STITCH for protein-chemical interactions
+ - UMLS for medical ontology data
+
+ ## Contact
+
+ For questions about the dataset, please open an issue in the [GitHub repository](https://github.com/Y-TARL/BioGraphFusion/issues).