abd-ur committed on
Commit
56056d9
·
verified ·
1 Parent(s): db60a64

Update README.md

Files changed (1)
  1. README.md +71 -3
README.md CHANGED
@@ -1,3 +1,71 @@
- ---
- license: apache-2.0
- ---
+ Here's a **descriptive yet concise README** for your hierarchical VCF mutation embedding model, suitable for Hugging Face, GitHub, or documentation purposes:
+
+ ---
+
+ # GvEM - Genomic Variant Embedding Model
+
+ **HierarchicalVCF** is a PyTorch-based deep learning model designed to embed and model genomic mutation data from VCF (Variant Call Format) files using a biologically informed hierarchy:
+ **Pathway → Chromosome → Gene → Mutations**
+
+ This model enables:
+
+ * Pathway-aware embedding of genomic variants.
+ * Learning from multi-level hierarchies of mutation context.
+ * Generalization across diseases and datasets via a scalable, modular design.
+
+ ---
+
+ ## Hierarchy of input data
+
+ example_data = {
+     'sample1': {
+         'pathway1': {
+             'chr1': {
+                 'gene1': [
+                     {
+                         'impact': 'HIGH',
+                         'reference': 'A',
+                         'alternate': 'T'
+                     }
+                 ]
+             }
+         }
+     }
+ }
+
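To make the nesting concrete, here is a minimal sketch of walking the structure above and yielding one record per mutation. The helper name `iter_mutations` and the type aliases are illustrative assumptions, not part of the repository; only the sample → pathway → chromosome → gene → mutation-list layout comes from the example.

```python
from typing import Dict, Iterator, List, Tuple

Mutation = Dict[str, str]
# sample -> pathway -> chromosome -> gene -> list of mutations
Hierarchy = Dict[str, Dict[str, Dict[str, Dict[str, List[Mutation]]]]]


def iter_mutations(data: Hierarchy) -> Iterator[Tuple[str, str, str, str, Mutation]]:
    """Yield (sample, pathway, chromosome, gene, mutation) for every leaf.

    Hypothetical helper: it only illustrates the nesting order of the input.
    """
    for sample, pathways in data.items():
        for pathway, chromosomes in pathways.items():
            for chromosome, genes in chromosomes.items():
                for gene, mutations in genes.items():
                    for mutation in mutations:
                        yield sample, pathway, chromosome, gene, mutation


example_data: Hierarchy = {
    "sample1": {"pathway1": {"chr1": {"gene1": [
        {"impact": "HIGH", "reference": "A", "alternate": "T"},
    ]}}}
}

rows = list(iter_mutations(example_data))
for row in rows:
    print(row)
```

A flat view like this can be handy for counting mutations per gene or for building vocabularies before any model code runs.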
+ ---
+
+ ## Features
+
+ * ✅ **VCF Parser**: Converts standard VCF files into a hierarchical JSON-like structure.
+ * ✅ **MutationEmbedder**: Learns embeddings for categorical mutation features (scalable).
+ * ✅ **GeneEncoder**: Processes lists of mutations using a Transformer and hierarchical attention to get gene-level representations.
+ * ✅ **ChromosomeEncoder**: Aggregates gene encodings.
+ * ✅ **PathwayEncoder**: Aggregates chromosome encodings to yield the final sample representation.
+ * ✅ **Scalable**: Easily extensible to new fields or biological groupings.
+ * ✅ **Hugging Face Compatible**: Designed for sharing and experimentation on the 🤗 Hub.
+
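Before categorical mutation features can be embedded, each value has to become an integer id. The sketch below shows one plausible way to do that in plain Python. `FieldVocab`, `encode_gene`, the reserved `<pad>`/`<unk>` ids, and the per-nucleotide vocabularies are all assumptions made for illustration and are not taken from the repository's `tokenizer.py`.

```python
from typing import Dict, Iterable, List

PAD, UNK = "<pad>", "<unk>"
FIELDS = ("impact", "reference", "alternate")  # fields from the example input


class FieldVocab:
    """Maps the string values of one categorical field to integer ids (hypothetical)."""

    def __init__(self, values: Iterable[str]) -> None:
        # Ids 0 and 1 are reserved for padding and unseen values.
        self.token_to_id: Dict[str, int] = {PAD: 0, UNK: 1}
        for value in sorted(set(values)):
            self.token_to_id[value] = len(self.token_to_id)

    def encode(self, value: str) -> int:
        # Unknown values fall back to the <unk> id instead of raising.
        return self.token_to_id.get(value, self.token_to_id[UNK])

    def __len__(self) -> int:
        return len(self.token_to_id)


def encode_gene(
    mutations: List[Dict[str, str]],
    vocabs: Dict[str, FieldVocab],
    max_len: int,
) -> List[List[int]]:
    """Encode one gene's mutations as a fixed-size grid of ids, padded or truncated."""
    rows = [[vocabs[f].encode(m[f]) for f in FIELDS] for m in mutations[:max_len]]
    pad_row = [vocabs[f].encode(PAD) for f in FIELDS]
    rows.extend(list(pad_row) for _ in range(max_len - len(rows)))
    return rows


vocabs = {
    "impact": FieldVocab(["HIGH", "MODERATE", "LOW", "MODIFIER"]),
    "reference": FieldVocab("ACGT"),  # single-nucleotide alleles only (assumption)
    "alternate": FieldVocab("ACGT"),
}
encoded = encode_gene(
    [{"impact": "HIGH", "reference": "A", "alternate": "T"}], vocabs, max_len=2
)
print(encoded)  # [[2, 2, 5], [0, 0, 0]]
```

Grids like `encoded` are the kind of input an embedding layer would consume, one lookup table per field. With per-nucleotide vocabularies, multi-base alleles such as indels map to `<unk>`, so a real tokenizer would likely need a richer allele encoding.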
+ ---
+
+ ## Use Cases
+
+ * Variant-based disease classification (e.g., cancer, rare diseases, ASD)
+ * Multi-omics fusion models (tabular + image + VCF)
+ * Knowledge-driven mutation impact modeling
+ * Transfer learning across genomic datasets
+
+ ---
+
+ ## 📂 Folder Structure
+
+ ```
+ HierarchicalVCF/
+ ├── vcf_parser.py   # Parses VCF files into hierarchical format
+ ├── model.py        # Model components (MutationEmbedder, encoders, classification head)
+ ├── tokenizer.py    # Vocabulary tokenizer for categorical fields
+ ├── VCFDataset.py   # Torch dataset compiler
+ ├── train.py        # Sample training pipeline
+ ├── README.md       # This file
+ ```
+
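As a rough illustration of what "parses VCF files into hierarchical format" could involve, the sketch below reads one tab-separated VCF data line and files it into the nested structure. It is not the repository's `vcf_parser.py`. The `GENE` and `IMPACT` INFO keys and the `gene_to_pathway` lookup are hypothetical stand-ins; real annotation tools encode gene and impact differently, so that part would need adapting.

```python
from typing import Dict


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column ("K1=V1;K2=V2;FLAG") into a dict."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-style entries get an empty string
    return fields


def add_vcf_line(tree: dict, sample: str, line: str, gene_to_pathway: Dict[str, str]) -> None:
    """Insert the variant(s) on one VCF data line into the nested hierarchy."""
    # The first eight VCF columns are CHROM POS ID REF ALT QUAL FILTER INFO.
    chrom, _pos, _id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    annotations = parse_info(info)
    gene = annotations.get("GENE", "unknown_gene")      # hypothetical INFO key
    impact = annotations.get("IMPACT", "UNKNOWN")       # hypothetical INFO key
    pathway = gene_to_pathway.get(gene, "unassigned")   # hypothetical lookup table
    mutations = (
        tree.setdefault(sample, {})
        .setdefault(pathway, {})
        .setdefault(chrom, {})
        .setdefault(gene, [])
    )
    # ALT may list several comma-separated alleles; store one entry per allele.
    for allele in alt.split(","):
        mutations.append({"impact": impact, "reference": ref, "alternate": allele})


tree: dict = {}
add_vcf_line(
    tree,
    "sample1",
    "chr1\t12345\t.\tA\tT\t50\tPASS\tGENE=gene1;IMPACT=HIGH",
    {"gene1": "pathway1"},
)
print(tree)
```

A caller would skip header lines (those starting with `#`) before passing data lines to `add_vcf_line`.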
71
+ ## MODEL STILL UNDER DEVELOPMENT