## Practical Recommendations
### **Structure of data in a HuggingFace dataset**
#### Datasets, sub-datasets, splits
* A HuggingFace dataset repository can contain multiple sub-datasets, e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data does not have explicit splits, everything goes into 'train'.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
  * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) repository contains multiple sub-datasets, including:
    * `dataset1`: all stability measurements
    * `dataset2`: high-quality folding stabilities
    * `dataset3`: ΔG measurements
    * `dataset3_single`: ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
    * `dataset3_single_cv`: 5-fold cross-validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
* To load a specific sub-dataset:
```python
datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
```
#### Example: One .csv file dataset
One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.

First load the dataset locally, then push it to the hub:

```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
```
To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: train/valid/test split .csv files
Three tables `train.csv`, `valid.csv`, and `test.csv` are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.

Load the three splits into one dataset and push it to the hub:

```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_dir = "/tmp",
    data_files = {
        "train": "train.csv",
        "valid": "valid.csv",
        "test": "test.csv"},
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
    valid-00000-of-00001.parquet
    test-00000-of-00001.parquet
```

and add the following to the header of the `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  - name: valid
    num_bytes: 64
    num_examples: 4
  - name: test
    num_bytes: 64
    num_examples: 4
  download_size: 3996
  dataset_size: 192
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
```
To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: sub-datasets
If you have related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository, but that contain different types of data (so they aren't just splits of the same dataset), load each dataset separately and push it to the hub with its own config name:

```python
import datasets

dataset1 = datasets.load_dataset("csv", data_files = "/tmp/dataset1.csv", keep_in_memory = True)
dataset2 = datasets.load_dataset("csv", data_files = "/tmp/dataset2.csv", keep_in_memory = True)
dataset3 = datasets.load_dataset("csv", data_files = "/tmp/dataset3.csv", keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset1", data_dir = "dataset1/data")
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset2", data_dir = "dataset2/data")
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset3", data_dir = "dataset3/data")
```
This will create the following files in the repo:

```
dataset1/
    data/
        train-00000-of-00001.parquet
dataset2/
    data/
        train-00000-of-00001.parquet
dataset3/
    data/
        train-00000-of-00001.parquet
```
and add the following to the header of the `README.md`:

```yaml
dataset_info:
- config_name: dataset1
  features:
  - name: id
    dtype: int64
  - name: value1
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset2
  features:
  - name: id
    dtype: int64
  - name: value2
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset3
  features:
  - name: id
    dtype: int64
  - name: value3
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
configs:
- config_name: dataset1
  data_files:
  - split: train
    path: dataset1/data/train-*
- config_name: dataset2
  data_files:
  - split: train
    path: dataset2/data/train-*
- config_name: dataset3
  data_files:
  - split: train
    path: dataset3/data/train-*
```
To load these datasets from HuggingFace:

```python
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = "dataset1", data_dir = "dataset1")
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = "dataset2", data_dir = "dataset2")
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = "dataset3", data_dir = "dataset3")
```
### **Format of a dataset**
A dataset should consist of a single table where each row is a single observation. The columns should follow typical database design guidelines:
* Identifier columns
  * sequential key
    * For example: `[1, 2, 3, ...]`
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that jointly and uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation with other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called "long") has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for data-science workflows (e.g. tidyverse/dplyr split-apply-combine)
    * can handle a variable number of measurements per object
    * duplicates the object identifier columns for each measurement
  * array data (sometimes called "wide") has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows
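As a concrete sketch of the key and tidy/wide ideas above (the table, column names, and composite key are hypothetical, chosen only for illustration), the snippet below checks that a composite key uniquely identifies each row and converts between wide and tidy layouts with pandas:

```python
import pandas as pd

# Hypothetical wide table: one row per (structure, residue), two measurements per row.
wide = pd.DataFrame({
    "structure_id": ["1abc", "1abc", "2xyz"],
    "residue_id":   [10, 11, 10],
    "ddG":          [0.5, -1.2, 2.3],
    "sasa":         [35.0, 12.5, 80.1],
})

# Verify the composite key (structure_id, residue_id) uniquely identifies each row
# and has no missing values.
key = ["structure_id", "residue_id"]
assert not wide.duplicated(subset = key).any(), "composite key is not unique"
assert wide[key].notna().all().all(), "key columns contain missing values"

# Wide -> tidy: one measurement per row, with a column naming the measurement type.
tidy = wide.melt(id_vars = key, var_name = "measurement", value_name = "value")

# Tidy -> wide: one object per row, one column per measurement.
back = tidy.pivot(index = key, columns = "measurement", values = "value").reset_index()
```

Note how the tidy form duplicates the identifier columns once per measurement, while the wide form stores each measurement as its own column.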
#### Molecular formats
* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use uncompressed, plaintext formats
  * easier to computationally analyze
  * the whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful about inferred aspects of the data:
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
* Tools
  * MolVS is useful for small-molecule sanitization
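As a minimal sketch of SMILES standardization (assuming RDKit is installed; MolVS builds on RDKit and provides a more complete `Standardizer`), the snippet below canonicalizes SMILES strings and strips a salt by keeping the largest fragment. The example molecules are arbitrary choices for illustration:

```python
from rdkit import Chem

def canonicalize(smiles: str) -> str:
    """Parse a SMILES string and return RDKit's canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol)

def strip_salt(smiles: str) -> str:
    """Keep only the largest fragment (by heavy-atom count), e.g. to drop counter-ions."""
    fragments = smiles.split(".")
    largest = max(fragments, key = lambda s: Chem.MolFromSmiles(s).GetNumAtoms())
    return canonicalize(largest)
```

Recording the canonical form (and documenting how salts, protonation states, and stereochemistry were handled) makes the dataset reproducible for downstream users.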
#### Computational data formats
* On-disk formats
  * parquet
    * column-oriented, so a reader can load only the data that is needed, and it compresses better
    * robust reader/writer implementations from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * the native format for datasets stored with the `datasets` Python package
  * tab/comma-separated table (.tsv/.csv)
    * prefer tab-separated: more consistent parsing without needing to escape values
    * widely used row-oriented text format for storing tabular data on disk
    * does not store data types, and often needs custom conversion/QC code when loading into Python/R
    * can be compressed on disk, but being row-oriented it is less compressible than .parquet
  * .pickle / .RData
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In-memory formats
  * R `data.frame` / `dplyr::tibble`
    * widely used format for R data science
    * fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas `DataFrame`
    * widely used for Python data science
    * not especially fast out of the box for heavy data-science workloads
  * Python numpy array / R `matrix`
    * uses a single data type for all data
    * useful for efficient matrix manipulation
  * Python PyTorch `Dataset`
    * format specifically geared for loading data for PyTorch deep learning
#### Recommendations
* On disk
  * for small, config-level tables, use .tsv
  * for large data, use .parquet
    * smaller than .csv/.tsv
    * robust open-source libraries in the major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * use `dplyr::tibble` / pandas `DataFrame` for data-science tables
  * use numpy arrays / PyTorch datasets for machine learning