Add information about the model and data sets

#1
by ozlemmuslu - opened
Files changed (1)
  1. README.md +139 -3
README.md CHANGED
@@ -1,3 +1,139 @@
1
- ---
2
- license: cc-by-nc-nd-4.0
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+
5
+ license: cc-by-nc-nd-4.0
6
+
7
+ tags:
8
+ - NGS
9
+ - somatic-variant-calling
10
+ ---
11
+
12
+ # Model Card for Model ID
13
+
14
+ This is an extra trees model for sensitive detection of somatic indel candidates.
15
+
16
+ ## Model Details
17
+
18
+ ### Model Description
19
+
20
+ - **Developed by:** Özlem Muslu
21
+ - **Funded by:** European Research Council (ERC Advanced Grant “SUMMIT”, Ugur Sahin: 789256)
22
+ - **License:** cc-by-nc-nd-4.0
23
+
24
+ ### Model Sources
25
+
26
+ - **Repository:** https://github.com/TRON-Bioinformatics/VariantMedium
27
+
28
+ ## Uses
29
+
30
+ Given matched tumor-normal short-read sequencing data, the model produces a high-sensitivity list of candidate somatic small indels.
31
+
32
+ ### Direct Use
33
+
34
+ Extract features from a matched tumor-normal sequencing pair with https://github.com/TRON-Bioinformatics/tronflow-vcf-postprocessing, then apply this model to its output. The model uses the following features:
35
+ - primary_af
36
+ - primary_dp
37
+ - primary_ac
38
+ - primary_pu
39
+ - primary_pw
40
+ - primary_k
41
+ - primary_rsmq
42
+ - primary_rsmq_pv
43
+ - primary_rspos
44
+ - primary_rspos_pv
45
+ - normal_af
46
+ - normal_dp
47
+ - normal_ac
48
+ - normal_pu
49
+ - normal_pw
50
+ - normal_k
51
+ - normal_rsmq
52
+ - normal_rsmq_pv
53
+ - normal_rspos
54
+ - normal_rspos_pv
55
+
56
+
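The 20 feature names listed above follow a regular pattern: ten statistics, each appearing once with the `primary_` prefix and once with the `normal_` prefix. A minimal sketch of building that column list and pulling a feature vector out of one candidate record might look as follows. The `record` dictionary and the `feature_vector` helper are hypothetical, and whether the trained model expects exactly this column order is an assumption; the actual output format of tronflow-vcf-postprocessing should be checked in its repository.

```python
# Suffixes shared by both samples, in the order listed in the model card.
SUFFIXES = ["af", "dp", "ac", "pu", "pw", "k", "rsmq", "rsmq_pv", "rspos", "rspos_pv"]
PREFIXES = ["primary", "normal"]

# primary_af, primary_dp, ..., normal_rspos_pv (20 names in total).
FEATURES = [f"{prefix}_{suffix}" for prefix in PREFIXES for suffix in SUFFIXES]


def feature_vector(record):
    """Return the feature values of one candidate in model-card order.

    `record` is a hypothetical dict mapping annotation names to values.
    A missing feature raises KeyError instead of being silently imputed.
    """
    missing = [name for name in FEATURES if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(record[name]) for name in FEATURES]


if __name__ == "__main__":
    example = {name: 0.0 for name in FEATURES}  # placeholder values only
    example["primary_af"] = 0.25
    print(len(FEATURES), feature_vector(example)[:3])
```

Failing loudly on a missing column is a deliberate choice here: a silently defaulted feature would change the model's predictions without any visible error.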
57
+ ### Downstream Use
58
+
59
+ This model is part of the VariantMedium somatic variant caller and is integrated directly into its workflow (https://github.com/TRON-Bioinformatics/VariantMedium).
60
+
61
+ ### Out-of-Scope Use
62
+
63
+ The model on its own is not intended to produce a final list of variant calls; it is intended to filter out noticeable false positives.
64
+
65
+ ## Bias, Risks, and Limitations
66
+
67
+ The model was trained on the output of cell line WES and an AML WGS data set, both originating from short-read Illumina sequencing. It has been tested on other cancer entities and on solid tumors, but not on non-Illumina sequencing data.
68
+
69
+ ### Recommendations
70
+
71
+ We recommend using this model for Illumina-based WES and WGS (paired, short read).
72
+
73
+ ## Training Details
74
+
75
+ ### Training Data
76
+
77
+ Matched tumor-normal sequencing data published under https://ega-archive.org/studies/EGAS00001007633 and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000159.v13.p5.
78
+
79
+ ### Training Procedure
80
+
81
+ Hyperparameters were selected with scikit-learn's `GridSearchCV`. The search grid is given under Training Hyperparameters.
82
+
83
+ #### Preprocessing
84
+
85
+ Given matched tumor-normal BAM files:
86
+ 1. BAM preprocessing (https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing)
87
+ 2. Candidate variant calling (https://github.com/TRON-Bioinformatics/tronflow-strelka2)
88
+ 3. Variant normalization and feature extraction (https://github.com/TRON-Bioinformatics/tronflow-vcf-postprocessing)
89
+
90
+ #### Training Hyperparameters
91
+
92
+ ```python
93
+ hyperparams = [
94
+ {
95
+ 'n_estimators': [100, 200],
96
+ 'max_depth': [5, 10],
97
+ 'criterion': ['entropy'],
98
+ 'max_features': ['sqrt', 'log2'],
99
+ 'bootstrap': [True, False]
100
+ },
101
+ {
102
+ 'n_estimators': [300, 400],
103
+ 'max_depth': [10, 15],
104
+ 'criterion': ['entropy'],
105
+ 'max_features': ['sqrt', 'log2'],
106
+ 'bootstrap': [True, False]
107
+ }
108
+ ]
109
+
110
+ ```
111
+
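To give a sense of the search size, the grid above can be expanded by hand. The sketch below uses only the standard library and mirrors the usual way a list of grids is expanded (each dictionary separately, then concatenated). It is not taken from the VariantMedium code, and the `expand` helper is a hypothetical name.

```python
from itertools import product

# Grid copied from the model card.
hyperparams = [
    {
        "n_estimators": [100, 200],
        "max_depth": [5, 10],
        "criterion": ["entropy"],
        "max_features": ["sqrt", "log2"],
        "bootstrap": [True, False],
    },
    {
        "n_estimators": [300, 400],
        "max_depth": [10, 15],
        "criterion": ["entropy"],
        "max_features": ["sqrt", "log2"],
        "bootstrap": [True, False],
    },
]


def expand(grids):
    """Expand a list of parameter grids into individual candidate settings."""
    candidates = []
    for grid in grids:
        keys = sorted(grid)
        # Cartesian product of the value lists within one grid.
        for values in product(*(grid[key] for key in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


candidates = expand(hyperparams)
print(len(candidates))  # 32 (16 per sub-grid)
```

With k-fold cross-validation, each of the 32 candidates is fit k times; the number of folds is not stated in the model card.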
112
+ ## Evaluation
113
+
114
+ The model was evaluated using cross-validation and a held-out cell line.
115
+
116
+ ### Testing Data, Factors & Metrics
117
+
118
+ #### Testing Data
119
+
120
+ Tested on independent data sets:
121
+ - [PCAWG-Pilot63](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000178.v11.p8)
122
+ - [SEQC2 WES samples](ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/)
123
+
124
+ #### Metrics
125
+
126
+ Sensitivity (recall) and precision. Sensitivity is the primary metric, since the aim is to filter out noticeable false positives rather than to produce a final list of variants.
127
+
128
+ ### Results
129
+
130
+ | Metric | Test Set |
131
+ |-------------|----------|
132
+ | Precision | 0.8564 |
133
+ | Recall | 0.6048 |
134
+ | F1 Score | 0.7089 |
135
+
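As a consistency check, the F1 score in the table can be recomputed from the reported precision and recall as their harmonic mean. The snippet uses only the numbers in this model card; the `f1_score` helper is defined locally for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test set values reported in the table.
precision, recall = 0.8564, 0.6048
f1 = f1_score(precision, recall)
print(round(f1, 4))  # 0.7089
```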
136
+ #### Summary
137
+
138
+ Compared to the control, test set recall dropped slightly (0.6167 -> 0.6048), while precision increased (0.8216 -> 0.8564).
139
+