IlPakoZ committed on
Commit
c39a9ed
·
1 Parent(s): 9f4d3e3

Corrected inaccuracies in README.md
Files changed (1): README.md +7 -7
README.md CHANGED
@@ -35,7 +35,7 @@ To load `M5ModelForRegression` explicitly:
 ```python
 from transformers import AutoModelForSequenceClassification
 
-model = AutoModelForSequenceClassification.from_pretrained(
+regression_model = AutoModelForSequenceClassification.from_pretrained(
     "IlPakoZ/m5-encoder", trust_remote_code=True
 )
 ```
@@ -69,7 +69,7 @@ hidden = outputs.last_hidden_state # (1, seq_len, 512)
 A function ``model.collate_for_dataset`` is also available to perform collation for use in PyTorch's DataLoader. The function receives a list of tuples, each composed of:
 - the first element is a dictionary with keys ``"input_ids"`` (``np.ndarray``, shape ``(L,)``) and ``"attention_mask"`` (``np.ndarray``, shape ``(L,)``), as produced by a tokenizer;
 - the second element contains the positional embedding matrix;
-- (optional) token regression labels. This is maintained mostly for reproducibility of our paper's results, but it can be left to None in most circumstances.
+- (optional) token regression labels. This is maintained mostly for reproducibility of our paper's results, but it can be left to ``None`` in most circumstances.
 
 ## Architecture
 
@@ -127,7 +127,7 @@ Alpha and beta spin-orbital energies from DFT calculations:
 
 ### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)
 
-Up to 1023 partial charges (`lowdin_0` … `lowdin_1022`), one per atom, predicted using each atom's corresponding output token embedding. This head covers well beyond the maximum number of atoms observed in the dataset.
+Up to 1023 partial charges (`lowdin_0` … `lowdin_1022`), one per atom, predicted using each atom's corresponding output token embedding. This head covers well beyond the maximum number of atoms observed in the dataset; in practice, our training set covers up to `lowdin_149`.
 
 ## Dataset
 
@@ -147,9 +147,9 @@ The processed dataset contains **82,686,706 SMILES sequences**, each paired with
 | Validation | 8,268,673 | tbd |
 | Test | 8,268,669 | ~ 0.82 B (×2 with augmentation → ~1.64 B) |
 
-Training augmentation generates randomized SELFIES on the fly from each SMILES. Labels are normalized before training.
+Training augmentation generates SELFIES on the fly from randomly traversed versions of the original SMILES, handled by the model's bundled `get_positional_encodings_and_align` method. Labels are normalized before training.
 
-The HDF5 files are available for download below. These are intended to be processed with the bundled `data_processing` library into LMDB datasets optimised for fast training throughput; the resulting LMDB files are too large to distribute directly.
+The HDF5 files containing the training data are available for download below (**coming soon**). For training, they are first converted to LMDB format with the `data_processing` library in our GitHub repository (**coming soon**) to ensure fast access and avoid CPU bottlenecks. Because input pre-computation (relative position encodings, input IDs, attention masks, and regression labels with augmentation) is performed during conversion, the resulting LMDB files are too large to distribute directly.
 
 | Split | Download |
 |---|---|
@@ -160,6 +160,6 @@ The HDF5 files are available for download below. These are intended to be proces
 ## Limitations
 
 - **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix.
-This was done to decrease the size of pairwise-distance matrices in case one intends to pre-compute them before training. Due to the `prepare_data` limitations, molecules whose SELFIES tokenization exceeds **32,766 tokens** (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecule will be well below this limit.
+This was done to decrease the memory footprint of pairwise-distance matrices in case one intends to pre-compute them before training. Due to this `prepare_data` limitation, molecules whose SELFIES tokenization exceeds **32,766 tokens** (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules will lie well below this limit.
 - **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
-- **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on certain compounds types and large macromolecules outside the training distribution has not been evaluated. Therefore, the model will be stronger with molecules of MW <= 1000 or number of heavy atoms <= 79.
+- **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on certain compound types and large macromolecules outside the training distribution has not been evaluated. The model is therefore expected to be strongest on molecules with **MW <= 1000** or **<= 79 heavy atoms**.
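As a minimal sketch of the token-length limitation in the diff above: the `int16` pairwise-distance matrix caps a molecule's SELFIES tokenization at `numpy.iinfo(numpy.int16).max - 1`. The helper below is hypothetical (not part of the model's API) and only illustrates the arithmetic of the cap.

```python
import numpy as np

# The README caps SELFIES tokenizations because prepare_data stores
# pairwise molecular-graph distances in an int16 matrix.
MAX_TOKENS = np.iinfo(np.int16).max - 1  # 32767 - 1 = 32766

def fits_int16_distance_matrix(num_tokens: int) -> bool:
    """Hypothetical check: True if a molecule's SELFIES tokenization
    is short enough for an int16 pairwise-distance matrix."""
    return num_tokens <= MAX_TOKENS

print(MAX_TOKENS)                      # 32766
print(fits_int16_distance_matrix(150)) # typical drug-like molecules pass
```

Molecules above the cap (e.g. a 40,000-token tokenization) would need a wider dtype for the distance matrix and are not supported by `prepare_data`.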