saanikat committed on
Commit · 3877882
1 Parent(s): 0c93e0f
restructuring
- README.md +39 -17
- bedbase_schema/bedbase_schema_design.yaml +47 -0
- label_encoder_bedbase.pkl → bedbase_schema/label_encoder_bedbase.pkl +0 -0
- model_bedbase.pth → bedbase_schema/model_bedbase.pth +0 -0
- vectorizer_bedbase.pkl → bedbase_schema/vectorizer_bedbase.pkl +0 -0
- encode_schema/encode_schema_design.yaml +59 -0
- label_encoder_encode.pkl → encode_schema/label_encoder_encode.pkl +0 -0
- model_encode.pth → encode_schema/model_encode.pth +0 -0
- vectorizer_encode.pkl → encode_schema/vectorizer_encode.pkl +0 -0
- fairtracks_schema/fairtracks_schema_design.yaml +48 -0
- label_encoder_fairtracks.pkl → fairtracks_schema/label_encoder_fairtracks.pkl +0 -0
- model_fairtracks.pth → fairtracks_schema/model_fairtracks.pth +0 -0
- vectorizer_fairtracks.pkl → fairtracks_schema/vectorizer_fairtracks.pkl +0 -0
README.md
CHANGED
@@ -1,24 +1,46 @@
 ### Model Description
 
-This repository
-Both of these models are used by the `attribute-standardizer` for standardizing the metadata based on user choice.
+This repository hosts three pre-trained models designed for standardizing metadata attributes of genomic region data. The three pre-trained models are `ENCODE`, `FAIRTRACKS` and `BEDBASE`. These models, along with their associated files and schema designs, are used for standardization by `BEDMS` (BED Metadata Standardizer). To learn more about BEDMS, visit: https://github.com/databio/bedms
 
-###
-1. [model_encode.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_encode.pth) : This has the ENCODE metadata trained model.
-2. [model_fairtracks.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_fairtracks.pth) : This has the FAIRTRACKS BLUEPRINT metadata trained model.
-3. [vectorizer_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_encode.pkl) : This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding which is used as an input to the model when the user selects ENCODE schema.
-4. [vectorizer_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_fairtracks.pkl): This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding which is used as an input to the model when the user selects FAIRTRACKS schema.
-5. [label_encoder_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_encode.pkl): This is a pickle file which contains the unique label values derived from the training data. The model classifies the output into these labels for ENCODE schema.
-6. [label_encoder_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_fairtracks.pkl): This is a pickle file which contains the unique label values derived from the training data. The model classifies the output into these labels for FAIRTRACKS schema.
+### Directory structure
 
-### Usage
-To load this model:
 ```
-
-
-
-
+/attribute-standardizer-model6
+    /bedbase_schema
+        - bedbase_schema_design.yaml # BEDBASE schema
+        - label_encoder_bedbase.pkl # Unique label values derived from training data; the model classifies its output into these labels for the BEDBASE schema
+        - model_bedbase.pth # BEDBASE schema trained model
+        - vectorizer_bedbase.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+    /encode_schema
+        - encode_schema_design.yaml # ENCODE schema
+        - label_encoder_encode.pkl # Unique label values derived from training data; the model classifies its output into these labels for the ENCODE schema
+        - model_encode.pth # ENCODE schema trained model
+        - vectorizer_encode.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+    /fairtracks_schema
+        - fairtracks_schema_design.yaml # FAIRTRACKS schema
+        - label_encoder_fairtracks.pkl # Unique label values derived from training data; the model classifies its output into these labels for the FAIRTRACKS schema
+        - model_fairtracks.pth # FAIRTRACKS schema trained model
+        - vectorizer_fairtracks.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
 ```
-To use this model, refer to the GitHub repository of `bedmess`:
 
-
+### Usage
+
+To use this model, refer to the GitHub repository of `bedms`:
+
+[BEDMS](https://github.com/databio/bedms)
+
+### Contribution
+
+To add a schema model:
+1. Train the new model using [BEDMS](https://github.com/databio/bedms).
+2. Create a new directory within this repository, named after the new schema (for example, "new_schema").
+3. Maintain the directory structure like this:
+
+```
+/attribute-standardizer-model6
+    /new_schema
+        - new_schema_design.yaml
+        - label_encoder_new_schema.pkl
+        - model_new_schema.pth
+        - vectorizer_new_schema.pkl
+```
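The layout required by the Contribution steps above can be checked before uploading. The following is a minimal sketch, assuming the four file names follow the `new_schema` template from the README. The helpers `expected_files` and `missing_files` are hypothetical and are not part of BEDMS. The existing directories use a shorter suffix (for example `model_bedbase.pth` inside `bedbase_schema`), so the suffix is left as an optional parameter.

```python
from pathlib import Path
from typing import List, Optional


def expected_files(directory_name: str, suffix: Optional[str] = None) -> List[str]:
    """Return the four file names expected inside a schema directory.

    `suffix` defaults to the directory name, as in the README's
    `new_schema` example; pass e.g. "bedbase" for `bedbase_schema`.
    """
    suffix = suffix or directory_name
    return [
        f"{directory_name}_design.yaml",
        f"label_encoder_{suffix}.pkl",
        f"model_{suffix}.pth",
        f"vectorizer_{suffix}.pkl",
    ]


def missing_files(repo_root: str, directory_name: str,
                  suffix: Optional[str] = None) -> List[str]:
    """List the expected files that are absent from the schema directory."""
    schema_dir = Path(repo_root) / directory_name
    return [
        name
        for name in expected_files(directory_name, suffix)
        if not (schema_dir / name).is_file()
    ]
```

An empty list from `missing_files(".", "new_schema")` would indicate that the directory matches the template; any returned names are files still to be added.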
bedbase_schema/bedbase_schema_design.yaml
ADDED
@@ -0,0 +1,47 @@
+description: Attribute Standardizer Output Schema in alignment with BEDBASE schema
+
+properties:
+  sample_name: # Not predicted by the model.
+    type: string
+    description: "Name of the sample"
+  genome:
+    type: string
+    description: "Type of Genome Assemblies (e.g. GRCh38)"
+  species_name:
+    type: string
+    description: "Name of species (e.g. Homo sapiens). Alias: organism"
+  species_id:
+    type: string
+    description: "Species identifier, resolvable by identifiers.org (e.g. taxonomy:9606)"
+  genotype:
+    type: string
+    description: "Genotype of the sample"
+  phenotype:
+    type: string
+    description: "Phenotype of the sample"
+  cell_type:
+    type: string
+    description: "Cell type, population of cells that can be grown indefinitely in the lab, used for research, drug testing, and studying biological processes"
+  cell_line:
+    type: string
+    description: "A cultured, immortalized cell population derived from a single cell type, used for experimental research or therapeutic purposes."
+  tissue:
+    type: string
+    description: "Tissue type"
+  library_source:
+    type: string
+    description: "Library source (e.g. genomic, transcriptomic)"
+  assay:
+    type: string
+    description: "Experimental protocol (e.g. ChIP-seq). Alias: exp_protocol"
+  antibody:
+    type: string
+    description: "Antibody used in the assay"
+  target:
+    type: string
+    description: "Target of the assay (e.g. H3K4me3)"
+  treatment:
+    type: string
+    description: "Treatment of the sample (e.g. drug treatment)"
+required:
+  - None
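The schema design file above lists the attribute names the model maps onto. As an illustration, the sketch below reads those names out of a schema text. It assumes two-space indentation under `properties:`, which is itself an assumption because the diff view does not preserve indentation. It is a naive line-based reader, not a YAML parser, and `property_names` is a hypothetical helper; a real consumer would more likely use a YAML library.

```python
from typing import List

SAMPLE = """\
description: Attribute Standardizer Output Schema in alignment with BEDBASE schema

properties:
  sample_name: #Not predicted by the model.
    type: string
    description: "Name of the sample"
  genome:
    type: string
    description: "Type of Genome Assemblies (eg. GRCh38)"
required:
  - None
"""


def property_names(schema_text: str) -> List[str]:
    """Collect the keys nested directly under the top-level `properties:` key."""
    names = []
    in_properties = False
    for line in schema_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # A new top-level key starts or ends the properties section
            in_properties = line.strip().startswith("properties:")
            continue
        if in_properties and indent == 2:
            key = line.split("#", 1)[0].strip()  # drop trailing comments
            if key.endswith(":"):
                names.append(key[:-1].strip())
    return names


print(property_names(SAMPLE))  # ['sample_name', 'genome']
```

One detail worth keeping in mind when consuming these files: YAML reads `- None` as the string `"None"`, not as a null value or an empty list, so a validator may treat `required` as naming a property called `None`.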
label_encoder_bedbase.pkl → bedbase_schema/label_encoder_bedbase.pkl
RENAMED
File without changes
model_bedbase.pth → bedbase_schema/model_bedbase.pth
RENAMED
File without changes
vectorizer_bedbase.pkl → bedbase_schema/vectorizer_bedbase.pkl
RENAMED
File without changes
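Because these files moved into per-schema directories, code that fetched them by their old flat names needs the new repo-relative paths. The following is a minimal sketch of that mapping. The repository id is taken from the URLs in the previous README, and `nested_path` and `download` are hypothetical helpers. `hf_hub_download` comes from the `huggingface_hub` package, which is assumed to be installed only if `download` is actually called.

```python
from pathlib import PurePosixPath

# Repository id as it appears in the URLs of the previous README
REPO_ID = "databio/attribute-standardizer-model6"


def nested_path(flat_name: str) -> str:
    """Map a pre-restructuring file name to its new location.

    Example: "model_encode.pth" -> "encode_schema/model_encode.pth"
    """
    stem = PurePosixPath(flat_name).stem   # e.g. "label_encoder_bedbase"
    schema = stem.rsplit("_", 1)[-1]       # e.g. "bedbase"
    return f"{schema}_schema/{flat_name}"


def download(flat_name: str) -> str:
    """Fetch one file from the Hub and return its local cache path."""
    # Imported lazily so nested_path works without huggingface_hub installed
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=nested_path(flat_name))
```

For example, `nested_path("vectorizer_fairtracks.pkl")` yields `fairtracks_schema/vectorizer_fairtracks.pkl`, which matches the rename entries in this commit.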
encode_schema/encode_schema_design.yaml
ADDED
@@ -0,0 +1,59 @@
+description: Attribute Standardizer Output Schema in alignment with ENCODE schema
+
+properties:
+  File accession:
+    type: string
+    description: "File ID/Accession"
+  File Type:
+    type: string
+    description: "File Types (e.g. bed, bigWig)"
+  File format type:
+    type: string
+    description: "File Convention/Format (e.g. narrowPeak)"
+  Output type:
+    type: string
+    description: "The output type provides additional information about the expected contents in the file."
+  File assembly:
+    type: string
+    description: "Type of Genome Assemblies (e.g. GRCh38)"
+  Assay:
+    type: string
+    description: "The name of the assay performed"
+  Biosample term name:
+    type: string
+    description: "The human readable ontology name used to describe the biosample."
+  Biosample type:
+    type: string
+    description: "A categorization of biosamples into major groups (e.g. induced pluripotent stem cell, stem cell)."
+  Biosample Organism:
+    type: string
+    description: "The species of the biosample."
+  Biosample treatments:
+    type: string
+    description: "The name of the chemical or biological agent applied to a biosample in order to elicit a response."
+  Biosample genetic modifications methods:
+    type: string
+    description: "Experimental Techniques used for genetic modification (e.g. CRISPR)"
+  Biosample genetic modifications categories:
+    type: string
+    description: "Type of genetic modification (e.g. insertion)"
+  Experiment Target:
+    type: string
+    description: "Experimental Targets (e.g. H3K27ac-human)"
+  Library made from:
+    type: string
+    description: "Types of libraries created from biological samples."
+  Experiment date released:
+    type: string
+    description: "Date of the experiment release"
+  Project:
+    type: string
+    description: "The project under which the experiment was performed (e.g. ENCODE)"
+  Lab:
+    type: string
+    description: "Lab where the processing took place."
+  File Download URL:
+    type: string
+    description: "File Download URL"
+required:
+  - None
label_encoder_encode.pkl → encode_schema/label_encoder_encode.pkl
RENAMED
File without changes
model_encode.pth → encode_schema/model_encode.pth
RENAMED
File without changes
vectorizer_encode.pkl → encode_schema/vectorizer_encode.pkl
RENAMED
File without changes
fairtracks_schema/fairtracks_schema_design.yaml
ADDED
@@ -0,0 +1,48 @@
+description: Attribute Standardizer Output Schema in alignment with FAIRtracks schema
+
+properties:
+  global_id:
+    type: string
+    description: "Global sample identifier, resolvable by identifiers.org"
+  local_id:
+    type: string
+    description: "Submitter-local identifier for sample (e.g. S00B1LH1)"
+  species_id:
+    type: string
+    description: "Species identifier, resolvable by identifiers.org (e.g. taxonomy:9606)"
+  species_name:
+    type: string
+    description: "Species name (e.g. Homo sapiens)"
+  donor_age:
+    type: string
+    description: "Sample donor age ranges (e.g. 50-60)"
+  donor_ethnicity:
+    type: string
+    description: "Ethnicity of the donor (e.g. Northern European)"
+  donor_health_status:
+    type: string
+    description: "Health of the donor during sample collection (e.g. Normal, Chronic Lymphocytic Leukemia)"
+  donor_id:
+    type: string
+    description: "Donor identifier (e.g. 182CLL)"
+  donor_sex:
+    type: string
+    description: "Sex of the donor (e.g. Male, Female, Unknown)"
+  biospecimen_class_term_id:
+    type: string
+    description: "URL of ontology term used for classification of the sample"
+  biospecimen_class_term_label:
+    type: string
+    description: "Structural unit for the classification of the sample (e.g. Cell, Organism Part)"
+  sample_type_term_id:
+    type: string
+    description: "URL of the sample term"
+  sample_type_term_label:
+    type: string
+    description: "Main classification of the sample (e.g. venous blood)"
+  phenotype_term_id:
+    type: string
+    description: "Identifier for the phenotype"
+  phenotype_term_label:
+    type: string
+    description: "Main phenotype related to the sample (e.g. Acute Myeloid Leukemia)"
label_encoder_fairtracks.pkl → fairtracks_schema/label_encoder_fairtracks.pkl
RENAMED
File without changes
model_fairtracks.pth → fairtracks_schema/model_fairtracks.pth
RENAMED
File without changes
vectorizer_fairtracks.pkl → fairtracks_schema/vectorizer_fairtracks.pkl
RENAMED
File without changes