restructuring

Files changed (13)

README.md +39 -17
bedbase_schema/bedbase_schema_design.yaml +47 -0
label_encoder_bedbase.pkl → bedbase_schema/label_encoder_bedbase.pkl +0 -0
model_bedbase.pth → bedbase_schema/model_bedbase.pth +0 -0
vectorizer_bedbase.pkl → bedbase_schema/vectorizer_bedbase.pkl +0 -0
encode_schema/encode_schema_design.yaml +59 -0
label_encoder_encode.pkl → encode_schema/label_encoder_encode.pkl +0 -0
model_encode.pth → encode_schema/model_encode.pth +0 -0
vectorizer_encode.pkl → encode_schema/vectorizer_encode.pkl +0 -0
fairtracks_schema/fairtracks_schema_design.yaml +48 -0
label_encoder_fairtracks.pkl → fairtracks_schema/label_encoder_fairtracks.pkl +0 -0
model_fairtracks.pth → fairtracks_schema/model_fairtracks.pth +0 -0
vectorizer_fairtracks.pkl → fairtracks_schema/vectorizer_fairtracks.pkl +0 -0

README.md CHANGED

@@ -1,24 +1,46 @@
 ### Model Description
-This repository has two models - `model_encode.pth` and `model_fairtracks.pth`.
-Both of these models are used by the `attribute-standardizer` for standardizing the metadata based on user choice.
-### Files Description
-1. [model_encode.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_encode.pth) : This has the ENCODE metadata trained model.
-2. [model_fairtracks.pth](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/model_fairtracks.pth) : This has the FAIRTRACKS BLUEPRINT metadata trained model.
-3. [vectorizer_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_encode.pkl) : This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding, which is used as an input to the model when the user selects the ENCODE schema.
-4. [vectorizer_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/vectorizer_fairtracks.pkl): This is a pickle file which contains a serialized `CountVectorizer` instance from the `scikit-learn` library. It is used for Bag of Words encoding, which is used as an input to the model when the user selects the FAIRTRACKS schema.
-5. [label_encoder_encode.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_encode.pkl): This is a pickle file which contains the unique label values derived from the training data. The model classifies the output into these labels for the ENCODE schema.
-6. [label_encoder_fairtracks.pkl](https://huggingface.co/databio/attribute-standardizer-model6/blob/main/label_encoder_fairtracks.pkl): This is a pickle file which contains the unique label values derived from the training data. The model classifies the output into these labels for the FAIRTRACKS schema.
-### Usage
-To load this model:
 ```
-from huggingface_hub import hf_hub_download
-model_fairtracks = hf_hub_download(repo_id="databio/attribute-standardizer-model6", filename="model_fairtracks.pth")
-model_encode = hf_hub_download(repo_id="databio/attribute-standardizer-model6", filename="model_encode.pth")
 ```
-To use this model, refer to the GitHub repository of `bedmess`:
-[BEDMess](https://github.com/databio/bedmess)

 ### Model Description
+This repository hosts three pre-trained models designed to standardize metadata attributes for genomic region data: `ENCODE`, `FAIRTRACKS` and `BEDBASE`. These models, along with their associated files and schema designs, are used for standardization by `BEDMS` (BED Metadata Standardizer). To learn more about BEDMS, visit https://github.com/databio/bedms
+### Directory structure
 ```
+/attribute-standardizer-model6
+    /bedbase_schema
+        - bedbase_schema_design.yaml # BEDBASE schema
+        - label_encoder_bedbase.pkl # Unique label values derived from training data; the model classifies its output into these labels for the BEDBASE schema
+        - model_bedbase.pth # BEDBASE schema trained model
+        - vectorizer_bedbase.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+    /encode_schema
+        - encode_schema_design.yaml # ENCODE schema
+        - label_encoder_encode.pkl # Unique label values derived from training data; the model classifies its output into these labels for the ENCODE schema
+        - model_encode.pth # ENCODE schema trained model
+        - vectorizer_encode.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
+    /fairtracks_schema
+        - fairtracks_schema_design.yaml # FAIRTRACKS schema
+        - label_encoder_fairtracks.pkl # Unique label values derived from training data; the model classifies its output into these labels for the FAIRTRACKS schema
+        - model_fairtracks.pth # FAIRTRACKS schema trained model
+        - vectorizer_fairtracks.pkl # CountVectorizer instance from the `scikit-learn` library for Bag of Words encoding used as input to the model
 ```
+### Usage
+To use these models, refer to the `bedms` GitHub repository:
+[BEDMS](https://github.com/databio/bedms)
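As a rough illustration of how the new layout changes file retrieval, here is a minimal sketch that builds the repo-relative paths for one schema and downloads them. It assumes `huggingface_hub` is installed and that the repo id from the previous README (`databio/attribute-standardizer-model6`) still applies. The helper names `schema_files` and `download_schema` are hypothetical and are not part of BEDMS.

```python
# Repo id taken from the previous README; assumed unchanged by this commit.
REPO_ID = "databio/attribute-standardizer-model6"


def schema_files(schema: str) -> dict[str, str]:
    """Return repo-relative paths for one schema, e.g. "encode" or "bedbase".

    Follows the naming used by the three existing directories.
    """
    folder = f"{schema}_schema"
    return {
        "design": f"{folder}/{schema}_schema_design.yaml",
        "label_encoder": f"{folder}/label_encoder_{schema}.pkl",
        "model": f"{folder}/model_{schema}.pth",
        "vectorizer": f"{folder}/vectorizer_{schema}.pkl",
    }


def download_schema(schema: str, repo_id: str = REPO_ID) -> dict[str, str]:
    """Download the four files of a schema and return their local cache paths."""
    # Imported lazily so schema_files() works without the dependency installed.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=repo_id, filename=path)
        for name, path in schema_files(schema).items()
    }
```

Calling `download_schema("encode")` would then fetch `encode_schema/model_encode.pth` and its companion files. BEDMS itself may resolve these paths differently.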
+### Contribution
+To add a schema model:
+1. Train the new model using [BEDMS](https://github.com/databio/bedms).
+2. Create a new directory within this repository, named after the new schema (for example, `new_schema`).
+3. Maintain the directory structure like this:
+```
+/attribute-standardizer-model6
+    /new_schema
+        - new_schema_design.yaml
+        - label_encoder_new_schema.pkl
+        - model_new_schema.pth
+        - vectorizer_new_schema.pkl
+```
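Before uploading a new schema directory, a contributor could run a quick local check that it contains the four expected files. The sketch below is one hypothetical way to do that with the standard library. It matches on filename patterns inferred from the listing above instead of exact names, and `missing_patterns` is an illustrative helper, not something BEDMS provides.

```python
from pathlib import Path

# Patterns inferred from the directory listing; exactly one match each is expected.
REQUIRED_PATTERNS = (
    "*_design.yaml",
    "label_encoder_*.pkl",
    "model_*.pth",
    "vectorizer_*.pkl",
)


def missing_patterns(schema_dir: Path) -> list[str]:
    """Return the patterns that do not match exactly one file in schema_dir."""
    return [
        pattern
        for pattern in REQUIRED_PATTERNS
        if len(list(schema_dir.glob(pattern))) != 1
    ]
```

An empty list from `missing_patterns(Path("new_schema"))` means the directory follows the layout; any returned pattern names a file that is absent or duplicated.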

bedbase_schema/bedbase_schema_design.yaml ADDED

	@@ -0,0 +1,47 @@

+description: Attribute Standardizer Output Schema in alignment with BEDBASE schema
+properties:
+  sample_name: # Not predicted by the model.
+    type: string
+    description: "Name of the sample"
+  genome:
+    type: string
+    description: "Type of Genome Assemblies (eg. GRCh38)"
+  species_name:
+    type: string
+    description: "Name of species. e.g. Homo sapiens."  # alias: organism
+  species_id:
+    type: string
+    description: "Species identifier, resolvable by identifiers.org (eg. taxonomy:9606)"
+  genotype:
+    type: string
+    description: "Genotype of the sample"
+  phenotype:
+    type: string
+    description: "Phenotype of the sample"
+  cell_type:
+    type: string
+    description: "Cell type, population of cells that can be grown indefinitely in the lab, used for research, drug testing, and studying biological processes"
+  cell_line:
+    type: string
+    description: "A cultured, immortalized cell population derived from a single cell type, used for experimental research or therapeutic purposes."
+  tissue:
+    type: string
+    description: "Tissue type"
+  library_source:
+    type: string
+    description: "Library source (e.g. genomic, transcriptomic)"
+  assay:
+    type: string
+    description: "Experimental protocol (e.g. ChIP-seq)"  # alias: exp_protocol
+  antibody:
+    type: string
+    description: "Antibody used in the assay"
+  target:
+    type: string
+    description: "Target of the assay (e.g. H3K4me3)"
+  treatment:
+    type: string
+    description: "Treatment of the sample (e.g. drug treatment)"
+required:
+  - None

label_encoder_bedbase.pkl → bedbase_schema/label_encoder_bedbase.pkl RENAMED

File without changes

model_bedbase.pth → bedbase_schema/model_bedbase.pth RENAMED

File without changes

vectorizer_bedbase.pkl → bedbase_schema/vectorizer_bedbase.pkl RENAMED

File without changes

encode_schema/encode_schema_design.yaml ADDED

	@@ -0,0 +1,59 @@

+description: Attribute Standardizer Output Schema in alignment with ENCODE schema
+properties:
+  File accession:
+    type: string
+    description: "File ID/Accession"
+  File Type:
+    type: string
+    description: "File Types (eg. bed, bigWig)"
+  File format type:
+    type: string
+    description: "File Convention/Format (eg. narrowPeak)"
+  Output type:
+    type: string
+    description: "The output type provides additional information about the expected contents in the file."
+  File assembly:
+    type: string
+    description: "Type of Genome Assemblies (eg. GRCh38)"
+  Assay:
+    type: string
+    description: "The name of the assay performed"
+  Biosample term name:
+    type: string
+    description: "The human readable ontology name used to describe the biosample."
+  Biosample type:
+    type: string
+    description: "A categorization of biosamples into major groups (eg. induced pluripotent stem cell, stem cell)."
+  Biosample Organism:
+    type: string
+    description: "The species of the biosample."
+  Biosample treatments:
+    type: string
+    description: "The name of the chemical or biological agent applied to a biosample in order to elicit a response."
+  Biosample genetic modifications methods:
+    type: string
+    description: "Experimental Techniques used for genetic modification (eg. CRISPR)"
+  Biosample genetic modifications categories:
+    type: string
+    description: "Type of genetic modification (eg. insertion)"
+  Experiment Target:
+    type: string
+    description: "Experimental Targets (eg. H3K27ac-human)"
+  Library made from:
+    type: string
+    description: "Types of libraries created from biological samples."
+  Experiment date released:
+    type: string
+    description: "Date of the experiment release"
+  Project:
+    type: string
+    description: "The project under which the experiment was performed (eg. ENCODE)"
+  Lab:
+    type: string
+    description: "Lab where the processing took place."
+  File Download URL:
+    type: string
+    description: "File Download URL"
+required:
+  - None

label_encoder_encode.pkl → encode_schema/label_encoder_encode.pkl RENAMED

File without changes

model_encode.pth → encode_schema/model_encode.pth RENAMED

File without changes

vectorizer_encode.pkl → encode_schema/vectorizer_encode.pkl RENAMED

File without changes

fairtracks_schema/fairtracks_schema_design.yaml ADDED

	@@ -0,0 +1,48 @@

+description: Attribute Standardizer Output Schema in alignment with FAIRtracks schema
+properties:
+  global_id:
+    type: string
+    description: "Global sample identifier, resolvable by identifiers.org"
+  local_id:
+    type: string
+    description: "Submitter-local identifier for sample (eg. S00B1LH1)"
+  species_id:
+    type: string
+    description: "Species identifier, resolvable by identifiers.org (eg. taxonomy:9606)"
+  species_name:
+    type: string
+    description: "Species name (eg. Homo sapiens)"
+  donor_age:
+    type: string
+    description: "Sample donor age ranges (eg. 50-60)"
+  donor_ethnicity:
+    type: string
+    description: "Ethnicity of the donor (eg. Northern European)"
+  donor_health_status:
+    type: string
+    description: "Health of the donor during sample collection (eg. Normal, Chronic Lymphocytic Leukemia)"
+  donor_id:
+    type: string
+    description: "Donor identifier (eg. 182CLL)"
+  donor_sex:
+    type: string
+    description: "Sex of the donor (eg. Male, Female, Unknown)"
+  biospecimen_class_term_id:
+    type: string
+    description: "URL of ontology term used for classification of the sample"
+  biospecimen_class_term_label:
+    type: string
+    description: "Structural unit for the classification of the sample (eg. Cell, Organism Part)"
+  sample_type_term_id:
+    type: string
+    description: "URL of the sample term"
+  sample_type_term_label:
+    type: string
+    description: "Main classification of the sample (eg. venous blood)"
+  phenotype_term_id:
+    type: string
+    description: "Identifier for the phenotype"
+  phenotype_term_label:
+    type: string
+    description: "Main phenotype related to the sample (eg. Acute Myeloid Leukemia)"

label_encoder_fairtracks.pkl → fairtracks_schema/label_encoder_fairtracks.pkl RENAMED

File without changes

model_fairtracks.pth → fairtracks_schema/model_fairtracks.pth RENAMED

File without changes

vectorizer_fairtracks.pkl → fairtracks_schema/vectorizer_fairtracks.pkl RENAMED

File without changes