Corrected inaccuracies in README.md
README.md CHANGED
@@ -35,7 +35,7 @@ To load `M5ModelForRegression` explicitly:
 ```python
 from transformers import AutoModelForSequenceClassification
 
-
+regression_model = AutoModelForSequenceClassification.from_pretrained(
     "IlPakoZ/m5-encoder", trust_remote_code=True
 )
 ```
@@ -69,7 +69,7 @@ hidden = outputs.last_hidden_state # (1, seq_len, 512)
 A function ``model.collate_for_dataset`` is also available to perform collation for use in PyTorch's DataLoader. The function takes a list of tuples, where:
 - the first element is a dictionary with keys ``"input_ids"`` (``np.ndarray``, shape ``(L,)``) and ``"attention_mask"`` (``np.ndarray``, shape ``(L,)``), as produced by a tokenizer;
 - the second element contains the positional embedding matrix;
-- (optional) token regression labels. This is maintained mostly for reproducibility of our paper's results, but it can be left to None in most circumstances.
+- the third element (optional) contains token regression labels. This is maintained mostly for reproducibility of our paper's results, but it can be left to ``None`` in most circumstances.
 
 ## Architecture
 
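The collation function described in the hunk above can be plugged straight into a PyTorch DataLoader. Below is a minimal sketch assuming the tuple layout listed in the README; the `ToyEncodedDataset` class and the way encodings and positional matrices are produced are illustrative placeholders, not part of the repository.

```python
# Minimal sketch of using ``model.collate_for_dataset`` with a PyTorch DataLoader.
# ToyEncodedDataset is a hypothetical placeholder that simply yields tuples in the
# layout described above (encodings dict, positional embedding matrix, optional
# labels); it is not part of the repository.
from torch.utils.data import DataLoader, Dataset


class ToyEncodedDataset(Dataset):
    def __init__(self, encodings, pos_matrices, labels=None):
        self.encodings = encodings        # list of dicts with "input_ids"/"attention_mask", each an array of shape (L,)
        self.pos_matrices = pos_matrices  # list of per-molecule positional embedding matrices
        self.labels = labels              # optional token regression labels

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        label = self.labels[idx] if self.labels is not None else None
        return self.encodings[idx], self.pos_matrices[idx], label


# Hypothetical usage, with `model` loaded as shown earlier in the README:
# dataset = ToyEncodedDataset(encodings, pos_matrices)
# loader = DataLoader(dataset, batch_size=8, shuffle=True,
#                     collate_fn=model.collate_for_dataset)
# batch = next(iter(loader))
```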
@@ -127,7 +127,7 @@ Alpha and beta spin-orbital energies from DFT calculations:
 
 ### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)
 
-Up to 1023 partial charges (`lowdin_0` … `lowdin_1022`), one per atom, predicted using each atom's corresponding output token embedding. This head covers well beyond the maximum number of atoms observed in the dataset.
+Up to 1023 partial charges (`lowdin_0` … `lowdin_1022`), one per atom, predicted using each atom's corresponding output token embedding. This head covers well beyond the maximum number of atoms observed in the dataset. In practice, our training set covers up to `lowdin_149`.
 
 ## Dataset
 
@@ -147,9 +147,9 @@ The processed dataset contains **82,686,706 SMILES sequences**, each paired with
 | Validation | 8,268,673 | tbd |
 | Test | 8,268,669 | ~ 0.82 B (×2 with augmentation → ~1.64 B) |
 
-Training
+Training is performed with augmentation through SELFIES generated from randomly traversed versions of the original SMILES. This step is handled by the `get_positional_encodings_and_align` method bundled with the model. Labels are normalized before training.
 
-The HDF5 files are available for download below. These are
+The HDF5 files containing the training data are available for download below (**coming soon**). These files are used to train our model, but are first converted into .lmdb format through the `data_processing` library in our GitHub repository (**coming soon**) to ensure fast access and avoid CPU bottlenecks. The resulting LMDB files are too large to distribute directly at the moment, as input pre-computation (relative position encodings, input ids, attention masks, and regression labels with augmentation) is performed.
 
 | Split | Download |
 |---|---|
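As a rough illustration of the augmentation idea described in the hunk above (SELFIES built from randomly traversed SMILES), the sketch below uses RDKit and the `selfies` package. It is not the repository's `get_positional_encodings_and_align` method, and the exact randomization scheme used in training is an assumption.

```python
# Rough sketch of SMILES-randomization-based SELFIES augmentation using RDKit
# and the `selfies` package. This is NOT the repository's
# get_positional_encodings_and_align method, only an illustration of the idea.
from rdkit import Chem
import selfies as sf


def random_selfies(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # doRandom=True starts the SMILES traversal at a random atom, giving an
    # alternative but chemically equivalent string for the same molecule.
    randomized = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    return sf.encoder(randomized)


print(random_selfies("CC(=O)O"))  # e.g. '[C][C][=Branch1][C][=O][O]' or an equivalent traversal
```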
@@ -160,6 +160,6 @@ The HDF5 files are available for download below. These are intended to be proces
 ## Limitations
 
 - **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix.
-This was done to decrease the
+This was done to decrease the memory footprint of pairwise-distance matrices in case one intends to pre-compute them before training. Due to this `prepare_data` limitation, molecules whose SELFIES tokenization exceeds **32,766 tokens** (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules lie well below this limit.
 - **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
-- **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on certain compounds types and large macromolecules outside the training distribution has not been evaluated. Therefore, the model will be stronger with molecules of MW <= 1000 or number of heavy atoms <= 79.
+- **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on certain compound types and on large macromolecules outside the training distribution has not been evaluated. The model is therefore expected to be strongest on molecules with **MW <= 1000** or **<= 79 heavy atoms**.
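To make the token-length limit above concrete, a minimal check is sketched below, assuming tokens are counted with `selfies.len_selfies`; the tokenizer shipped with the model may split symbols slightly differently.

```python
# Minimal sketch of the int16-related length check described above. Counting
# tokens via selfies.len_selfies is an assumption; the model's own tokenizer
# may differ slightly.
import numpy as np
import selfies as sf

MAX_TOKENS = np.iinfo(np.int16).max - 1  # 32766, per the limitation above


def fits_int16_distance_matrix(selfies_string: str) -> bool:
    return sf.len_selfies(selfies_string) <= MAX_TOKENS


print(fits_int16_distance_matrix("[C][C][O]"))  # True for typical small molecules
```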