maom committed · verified
Commit 92dbf79 · Parent(s): 1788b20

Update sections/07_practical_recommendations.md

Files changed (1):
  1. sections/07_practical_recommendations.md +144 -142

sections/07_practical_recommendations.md CHANGED
@@ -14,192 +14,194 @@
  * dataset3 \# ΔG measurements
  * dataset3\_single \# ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024\) splits
  * dataset3\_single\_cv \# 5-fold cross validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024\) splits
- * To load a specific subdataset:
- * datasets.load\_dataset(path \= "RosettaCommons/MegaScale", name \= "dataset1", data\_dir \= "dataset1")
+ * To load a specific subdataset:
+ ```
+ datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
+ ```

  #### Example: One .csv file dataset

  One table named `outcomes.csv` to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
  First load the dataset locally, then push it to the hub:

- import datasets
- dataset \= datasets.load\_dataset(
- "csv",
- data\_files \="outcomes.csv",
- keep\_in\_memory \= True)
+ import datasets
+ dataset = datasets.load_dataset(
+     "csv",
+     data_files = "outcomes.csv",
+     keep_in_memory = True)

- dataset.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`")
+ dataset.push_to_hub(repo_id = "maomlab/example_dataset")

  This will create the following files in the repo

- data/
- train-00000-of-00001.parquet
+ data/
+   train-00000-of-00001.parquet

  and add the following to the header of README.md

- dataset\_info:
- features:
- \- name: id
- dtype: int64
- \- name: value
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1332
- dataset\_size: 64
- configs:
- \- config\_name: default
- data\_files:
- \- split: train
- path: data/train-\*
+ dataset_info:
+   features:
+   - name: id
+     dtype: int64
+   - name: value
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1332
+   dataset_size: 64
+ configs:
+ - config_name: default
+   data_files:
+   - split: train
+     path: data/train-*

  to load these data from HuggingFace

- `dataset = datasets.load_dataset("maomlab/example_dataset")`
+ dataset = datasets.load_dataset("maomlab/example_dataset")

  #### Example: train/valid/test split .csv files

  Three tables train.csv, valid.csv, test.csv to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
  Load the three splits into one dataset and push it to the hub:

- import datasets
- dataset \= datasets.load\_dataset(
- 'csv',
- data\_dir \= "/tmp",
- data\_files \= {
- 'train': 'train.csv',
- 'valid': 'valid.csv',
- 'test': 'test.csv'},
- keep\_in\_memory \= True)
+ import datasets
+ dataset = datasets.load_dataset(
+     'csv',
+     data_dir = "/tmp",
+     data_files = {
+         'train': 'train.csv',
+         'valid': 'valid.csv',
+         'test': 'test.csv'},
+     keep_in_memory = True)

- dataset.push\_to\_hub(repo\_id \= "maomlab/example\_dataset")
+ dataset.push_to_hub(repo_id = "maomlab/example_dataset")

  This will create the following files in the repo

- data/
- train-00000-of-00001.parquet
- valid-00000-of-00001.parquet
- test-00000-of-00001.parquet
+ data/
+   train-00000-of-00001.parquet
+   valid-00000-of-00001.parquet
+   test-00000-of-00001.parquet

  and add the following to the header of the README.md

- dataset\_info:
- features:
- \- name: id
- dtype: int64
- \- name: value
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- \- name: valid
- num\_bytes: 64
- num\_examples: 4
- \- name: test
- num\_bytes: 64
- num\_examples: 4
- download\_size: 3996
- dataset\_size: 192
- configs:
- \- config\_name: default
- data\_files:
- \- split: train
- path: data/train-\*
- \- split: valid
- path: data/valid-\*
- \- split: test
- path: data/test-\*
+ dataset_info:
+   features:
+   - name: id
+     dtype: int64
+   - name: value
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   - name: valid
+     num_bytes: 64
+     num_examples: 4
+   - name: test
+     num_bytes: 64
+     num_examples: 4
+   download_size: 3996
+   dataset_size: 192
+ configs:
+ - config_name: default
+   data_files:
+   - split: train
+     path: data/train-*
+   - split: valid
+     path: data/valid-*
+   - split: test
+     path: data/test-*

  to load these data from HuggingFace

- `dataset = datasets.load_dataset("maomlab/example_dataset")`
+ dataset = datasets.load_dataset("maomlab/example_dataset")

  #### Example: sub-datasets

  If you have different related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository but contain different types of data, so they aren't just splits of the same dataset, then load each dataset separately and push it to the hub with a given config name.

  import datasets
- dataset1 \= datasets.load\_dataset('csv', data\_files \= '/tmp/dataset1.csv', keep\_in\_memory \= True)
- dataset2 \= datasets.load\_dataset('csv', data\_files \= '/tmp/dataset2.csv', keep\_in\_memory \= True)
- dataset3 \= datasets.load\_dataset('csv', data\_files \= '/tmp/dataset3.csv', keep\_in\_memory \= True)
+ dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
+ dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
+ dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

- dataset1.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`", config\_name \= 'dataset1', data\_dir \= 'dataset1/data')
- dataset2.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`", config\_name \= 'dataset2', data\_dir \= 'dataset2/data')
- dataset3.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`", config\_name \= 'dataset3', data\_dir \= 'dataset3/data')
+ dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
+ dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
+ dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')

  This will create the following files in the repo

- dataset1/
- data/
- train-00000-of-00001.parquet
- dataset2/
- data/
- train-00000-of-00001.parquet
- dataset3/
- data/
- train-00000-of-00001.parquet
+ dataset1/
+   data/
+     train-00000-of-00001.parquet
+ dataset2/
+   data/
+     train-00000-of-00001.parquet
+ dataset3/
+   data/
+     train-00000-of-00001.parquet

  and add the following to the header of the README.md

- dataset\_info:
- \- config\_name: dataset1
- features:
- \- name: id
- dtype: int64
- \- name: value1
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1344
- dataset\_size: 64
- \- config\_name: dataset2
- features:
- \- name: id
- dtype: int64
- \- name: value2
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1344
- dataset\_size: 64
- \- config\_name: dataset3
- features:
- \- name: id
- dtype: int64
- \- name: value3
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1344
- dataset\_size: 64
- configs:
- \- config\_name: dataset1
- data\_files:
- \- split: train
- path: dataset1/data/train-\*
- \- config\_name: dataset2
- data\_files:
- \- split: train
- path: dataset2/data/train-\*
- \- config\_name: dataset3
- data\_files:
- \- split: train
- path: dataset3/data/train-\*
+ dataset_info:
+ - config_name: dataset1
+   features:
+   - name: id
+     dtype: int64
+   - name: value1
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1344
+   dataset_size: 64
+ - config_name: dataset2
+   features:
+   - name: id
+     dtype: int64
+   - name: value2
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1344
+   dataset_size: 64
+ - config_name: dataset3
+   features:
+   - name: id
+     dtype: int64
+   - name: value3
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1344
+   dataset_size: 64
+ configs:
+ - config_name: dataset1
+   data_files:
+   - split: train
+     path: dataset1/data/train-*
+ - config_name: dataset2
+   data_files:
+   - split: train
+     path: dataset2/data/train-*
+ - config_name: dataset3
+   data_files:
+   - split: train
+     path: dataset3/data/train-*

  to load these datasets from HuggingFace

- `dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')`
- `dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')`
- `dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')`
+ dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
+ dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
+ dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')


  ### **Format of a dataset**
@@ -209,7 +211,7 @@ The columns should follow typical database design guidelines

  * Identifier columns
    * sequential key
-     * For example: \[1, 2, 3, …\]
+     * For example: `[1, 2, 3, ...]`
    * primary key
      * single column that uniquely identifies each row
      * distinct for every row
@@ -218,7 +220,7 @@ The columns should follow typical database design guidelines
    * composite key
      * A set of columns that uniquely identify each row
      * Either hierarchical or complementary ids that characterize the observation
-      * For example, for an observation of mutations, the (structure\_id, residue\_id, mutation\_aa) is a unique identifier
+      * For example, for an observation of mutations, the tuple (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
    * additional/foreign key identifiers
      * identifiers to link the observation with other data
      * For example
@@ -272,7 +274,7 @@ The columns should follow typical database design guidelines
    * Often very fast to read/write, but may not be robust across language/OS versions
    * Not easily interoperable across programming languages
  * In memory formats
-   * R data.frame/dplyr::tibble
+   * R `data.frame`/`dplyr::tibble`
      * Widely used format for R data science
      * Out of the box faster for tidyverse data manipulation, split-apply-combine workflows
    * Python pandas DataFrame
@@ -292,5 +294,5 @@ Recommendations
    * Smaller than .csv/.tsv
    * Robust open source libraries in major languages can read and write .parquet files faster than .csv/.tsv
  * In memory
-   * Use dplyr::tibble / pandas DataFrame for data science tables
+   * Use `dplyr::tibble` / pandas DataFrame for data science tables
    * Use numpy array / pytorch dataset for machine learning
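
The composite-key guidance above can be checked mechanically. A minimal sketch in pandas, reusing the hypothetical column names (`structure_id`, `residue_id`, `mutation_aa`) from the example; the toy values are made up:

```python
import pandas as pd

# Toy mutation table; the values are hypothetical.
mutations = pd.DataFrame({
    "structure_id": ["1abc", "1abc", "2xyz"],
    "residue_id":   [42,     42,     7],
    "mutation_aa":  ["A",    "G",    "W"],
    "ddG":          [0.8,    -1.2,   2.3],
})

# A composite key is only valid if no two rows share all key columns.
key = ["structure_id", "residue_id", "mutation_aa"]
assert not mutations.duplicated(subset = key).any(), "composite key is not unique"

# Indexing by the composite key gives unambiguous row lookup.
mutations = mutations.set_index(key).sort_index()
print(mutations.loc[("1abc", 42, "A")])
```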
 
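The .parquet recommendation above is easy to exercise from pandas (backed by pyarrow); a minimal sketch with a hypothetical file name:

```python
import pandas as pd

# Toy table; values are hypothetical.
outcomes = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10, 20, 30, 40]})

# Parquet stores the schema with the data, so dtypes survive a round trip,
# unlike .csv where every column is re-parsed from text on read.
outcomes.to_parquet("outcomes.parquet")  # requires pyarrow or fastparquet
roundtrip = pd.read_parquet("outcomes.parquet")
assert roundtrip.equals(outcomes)
```

The same file can then be read from R with `arrow::read_parquet("outcomes.parquet")`, which is the interoperability argument for .parquet over language-specific binary formats.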
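For the last recommendation (numpy array / pytorch dataset for machine learning), a minimal sketch of handing a HuggingFace dataset to pytorch; the repository id reuses the hypothetical `maomlab/example_dataset` from the examples above, and `id`/`value` are its assumed columns:

```python
import datasets
import torch

# Load one split of the (hypothetical) example dataset and ask for torch tensors.
dataset = datasets.load_dataset("maomlab/example_dataset", split = "train")
dataset = dataset.with_format("torch")

# A Dataset with torch format can be consumed directly by a DataLoader.
loader = torch.utils.data.DataLoader(dataset, batch_size = 2, shuffle = True)
for batch in loader:
    print(batch["id"], batch["value"])  # each batch entry is a torch.Tensor
```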