abd-ur committed · Commit 972f2c5 · verified · 1 Parent(s): 475dbbd

Update README.md

  1. README.md +41 -36
README.md CHANGED
@@ -1,16 +1,25 @@
1
- # GvEM - Genomic Variant Embedding Model
2
-
3
- **HierarchicalVCF** is a PyTorch-based deep learning model designed to embed and model genomic mutation data from VCF (Variant Call Format) files using a biologically-informed hierarchy:
4
  **Pathway → Chromosome → Gene → Mutations**
5
 
6
- This model enables:
7
-
8
- * Pathway-aware embedding of genomic variants.
9
- * Learning from multi-level hierarchies of mutation context.
10
- * Generalization across diseases and datasets via scalable, modular design.
11
-
12
  ---
13
-
14
  ## Hierarchy of input data
15
 
16
  example_data = {
@@ -30,38 +39,34 @@ This model enables:
30
  }
31
 
32
  ---
33
-
34
  ## Features
35
 
36
- * ✅ **VCF Parser**: Converts standard VCF files into a hierarchical JSON-like structure.
37
- * ✅ **MutationEmbedder**: Learns embeddings for categorical mutation features (scalable).
38
- * ✅ **GeneEncoder**: Processes lists of mutations using Transformer and hierarchical attention to get gene-level representations.
39
- * ✅ **ChromosomeEncoder**: Aggregates gene encodings.
40
- * ✅ **PathwayEncoder**: Aggregates chromosome encodings to yield final sample representation.
41
- * ✅ **Scalable**: Easily extensible to new fields or biological groupings.
42
- * ✅ **HuggingFace Compatible**: Designed for sharing and experimentation on the 🤗 Hub.
43
-
44
  ---
 
45
 
46
- ## Use Cases
47
-
48
- * Variant-based disease classification (e.g., cancer, rare diseases, ASD)
49
- * Multi-omics fusion models (tabular + image + VCF)
50
- * Knowledge-driven mutation impact modeling
51
  * Transfer learning across genomic datasets
52
 
53
  ---
54
 
55
- ## 📂 Folder Structure
56
-
57
- ```
58
- HierarchicalVCF/
59
- ├── vcf_parser.py # Parses VCF files into hierarchical format
60
- ├── model.py # Model components (MutationEmbedder, encoders, classification head)
61
- ├── tokenizer.py # Vocabulary tokenizer for categorical fields
62
- ├── VCFDataset.py # Torch dataset compiler
63
- ├── train.py # Sample training pipeline
64
- ├── README.md # This file
65
- ```
66
-
67
  ## MODEL STILL UNDER DEVELOPMENT
 
1
+ ---
2
+ '[object Object]': null
3
+ license: apache-2.0
4
+ language:
5
+ - en
6
+ pipeline_tag: token-classification
7
+ tags:
8
+ - RepresentationLearning
9
+ - Genomics
10
+ - Variant
11
+ - Classiciation
12
+ - Mutations
13
+ - Embedding
14
+ - VariantClassificaion
15
+ ---
16
+
17
+ # Model - GvEM (Genomic Variant Embedding Model)
18
+
19
+ **GvEM** is a PyTorch-based deep learning model designed to embed and model genomic mutation data from VCF (Variant Call Format) files using a biologically-informed hierarchy:
20
  **Pathway → Chromosome → Gene → Mutations**
21
 
22
  ---
 
23
  ## Hierarchy of input data
24
 
25
  example_data = {
 
39
  }
40
 
41
  ---
 
42
  ## Features
43
 
44
+ * **VCF Parser**: Converts standard VCF files into a hierarchical JSON-like structure.
45
+ * **MutationEmbedder**: Learns embeddings for categorical mutation features (scalable).
46
+ * **GeneEncoder**: Processes lists of mutations using Transformer and hierarchical attention to get gene-level representations.
47
+ * **ChromosomeEncoder**: Aggregates gene encodings.
48
+ * **PathwayEncoder**: Aggregates chromosome encodings to yield final sample representation.
49
+ * **Scalable**: Easily extensible to new fields or biological groupings.
50
+ * **HuggingFace Compatible**: Designed for sharing and experimentation on the 🤗 Hub.
 
51
  ---
52
+ ## Uses
53
 
54
+ ### Direct Use
55
+ * Obtain sample-level embeddings
56
+ * Mutation pattern learning
57
  * Transfer learning across genomic datasets
58
 
59
+ ### Downstream Use
60
+ * Variant-based disease prediction (e.g., cancer, rare diseases, ASD)
61
+ * Multi-omics fusion models (tabular + image + VCF)
62
+ * Cohort-level mutation analysis
63
+ * Fine-tuning for prognosis, drug response prediction, or variant effect interpretation.
64
+
65
+ ### Limitations
66
+ * Not intended for clinical decision-making without expert oversight.
67
+ * Input variants must already be annotated.
68
+ * Not intended for non-human genomes unless explicitly fine-tuned for those organisms.
69
+ * High-resolution functional variant prediction is not yet supported; it is planned for future development.
70
  ---
71
 
72
  ## MODEL STILL UNDER DEVELOPMENT
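
Note on the hierarchy: the diff elides the body of `example_data`, so its exact shape is not shown here. As a rough illustration of the **Pathway → Chromosome → Gene → Mutations** nesting the README describes, the sketch below groups flat, already-annotated variant records into a nested, JSON-like dictionary. It is not the repository's `vcf_parser.py`; the function names (`build_hierarchy`, `count_mutations`), the record fields (`pathway`, `chrom`, `gene`, `pos`, `ref`, `alt`), and all values are placeholders assumed for this example.

```python
from collections import defaultdict

# Hypothetical flat records, e.g. rows from an already-annotated VCF.
# All field names and values are placeholders chosen for illustration.
records = [
    {"pathway": "PATHWAY_A", "chrom": "chr1", "gene": "GENE_A", "pos": 101, "ref": "A", "alt": "G"},
    {"pathway": "PATHWAY_A", "chrom": "chr1", "gene": "GENE_A", "pos": 250, "ref": "C", "alt": "T"},
    {"pathway": "PATHWAY_A", "chrom": "chr2", "gene": "GENE_B", "pos": 42, "ref": "G", "alt": "GA"},
    {"pathway": "PATHWAY_B", "chrom": "chr2", "gene": "GENE_C", "pos": 77, "ref": "T", "alt": "C"},
]


def build_hierarchy(rows):
    """Group flat records into pathway -> chromosome -> gene -> [mutations]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for row in rows:
        mutation = {"pos": row["pos"], "ref": row["ref"], "alt": row["alt"]}
        tree[row["pathway"]][row["chrom"]][row["gene"]].append(mutation)
    # Convert nested defaultdicts to plain dicts so the result is JSON-like.
    return {
        pathway: {chrom: dict(genes) for chrom, genes in chroms.items()}
        for pathway, chroms in tree.items()
    }


def count_mutations(hierarchy):
    """Count the mutations stored at the leaves of the hierarchy."""
    return sum(
        len(mutations)
        for chroms in hierarchy.values()
        for genes in chroms.values()
        for mutations in genes.values()
    )


hierarchy = build_hierarchy(records)
```

In a layout like this, each level of nesting would correspond to one encoder stage in the feature list (mutations feed a gene-level encoder, genes a chromosome-level one, and so on), though the actual input format expected by the model may differ.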