Tags: Fill-Mask · Transformers · Safetensors · roberta · ChemBERTa · cheminformatics · Eval Results (legacy)
eacortes committed · Commit 98c0c1f · verified · 1 Parent(s): 0260447

Update README.md

Files changed (1)
  1. README.md +161 -3
README.md CHANGED
@@ -2,10 +2,116 @@
  license: apache-2.0
  datasets:
  - Derify/augmented_canonical_pubchem_13m
  library_name: transformers
  tags:
  - ChemBERTa
  - cheminformatics
  ---
 
  This model is a ChemBERTa model trained on the augmented_canonical_pubchem_13m dataset.
@@ -14,6 +120,58 @@ The model was trained for 24 epochs using NVIDIA Apex's FusedAdam optimizer with
  To improve performance, mixed precision (fp16), TF32, and torch.compile were enabled. Training used gradient accumulation (16 steps) and batch size of 128 for efficient resource utilization.
  Evaluation was performed at regular intervals, with the best model selected based on validation performance.
 
- ## Citation
- - Chithrananda, Seyone, et al. "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." _arXiv [Cs.LG]_, 2020. [Link](http://arxiv.org/abs/2010.09885).
- - Ahmad, Walid, et al. "ChemBERTa-2: Towards Chemical Foundation Models." _arXiv [Cs.LG]_, 2022. [Link](http://arxiv.org/abs/2209.01712).
  license: apache-2.0
  datasets:
  - Derify/augmented_canonical_pubchem_13m
+ metrics:
+ - roc_auc
+ - rmse
  library_name: transformers
  tags:
  - ChemBERTa
  - cheminformatics
+ pipeline_tag: fill-mask
+ model-index:
+ - name: Derify/augmented_canonical_pubchem_13m
+   results:
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: BACE
+       type: BACE
+     metrics:
+     - type: roc_auc
+       value: 0.8008
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: BBBP
+       type: BBBP
+     metrics:
+     - type: roc_auc
+       value: 0.7418
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: TOX21
+       type: TOX21
+     metrics:
+     - type: roc_auc
+       value: 0.7548
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: HIV
+       type: HIV
+     metrics:
+     - type: roc_auc
+       value: 0.7744
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: SIDER
+       type: SIDER
+     metrics:
+     - type: roc_auc
+       value: 0.6313
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: CLINTOX
+       type: CLINTOX
+     metrics:
+     - type: roc_auc
+       value: 0.9621
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: ESOL
+       type: ESOL
+     metrics:
+     - type: rmse
+       value: 0.8798
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: FREESOLV
+       type: FREESOLV
+     metrics:
+     - type: rmse
+       value: 0.5282
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: LIPO
+       type: LIPO
+     metrics:
+     - type: rmse
+       value: 0.6853
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: BACE
+       type: BACE
+     metrics:
+     - type: rmse
+       value: 0.9554
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: CLEARANCE
+       type: CLEARANCE
+     metrics:
+     - type: rmse
+       value: 45.4362
  ---
 
  This model is a ChemBERTa model trained on the augmented_canonical_pubchem_13m dataset.
@@ -14,6 +120,58 @@ The model was trained for 24 epochs using NVIDIA Apex's FusedAdam optimizer with
  To improve performance, mixed precision (fp16), TF32, and torch.compile were enabled. Training used gradient accumulation (16 steps) and batch size of 128 for efficient resource utilization.
  Evaluation was performed at regular intervals, with the best model selected based on validation performance.
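
Read literally, the two training figures above imply how many samples contribute to each optimizer update. The following is a minimal sketch of that arithmetic, assuming 128 is the per-step batch and that training ran on a single device; the README states neither, and `effective_batch_size` and `num_devices` are illustrative names rather than anything from the training code.

```python
def effective_batch_size(per_step_batch, accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_step_batch * accumulation_steps * num_devices


# 128 and 16 come from the README; a single device is an assumption.
samples_per_update = effective_batch_size(128, 16)
print(samples_per_update)
```

If 128 is instead the already-accumulated total, or if several GPUs were used, the figure changes accordingly.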
 
+ ## Benchmarks
+ ### Classification Datasets (ROC AUC - Higher is better)
+
+ | Model                     | BACE↑  | BBBP↑  | TOX21↑ | HIV↑   | SIDER↑ | CLINTOX↑ |
+ | ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- |
+ | **Tasks**                 | 1      | 1      | 12     | 1      | 27     | 2        |
+ | Derify/ChemBERTa-druglike | 0.8008 | 0.7418 | 0.7548 | 0.7744 | 0.6313 | 0.9621   |
+
+ ### Regression Datasets (RMSE - Lower is better)
+
+ | Model                     | ESOL↓  | FREESOLV↓ | LIPO↓  | BACE↓  | CLEARANCE↓ |
+ | ------------------------- | ------ | --------- | ------ | ------ | ---------- |
+ | **Tasks**                 | 1      | 1         | 1      | 1      | 1          |
+ | Derify/ChemBERTa-druglike | 0.8798 | 0.5282    | 0.6853 | 0.9554 | 45.4362    |
+
+ Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework.
+ Datasets were split with DeepChem's scaffold splits and filtered to include only molecules with SMILES length ≤ 200, following the MolFormer paper's recommendation.
+ The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and a batch size of 32.
+ Each task was run with 3 different random seeds, and the mean performance is reported.
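
The evaluation protocol described in the added Benchmarks text (length filter, RMSE for regression, mean over three seeds) can be sketched roughly as follows. This is an illustration under stated assumptions, not the chemberta3 implementation: SMILES length is assumed to be counted in characters, all function names are hypothetical, and the toy records and per-seed scores are made up for demonstration.

```python
from math import sqrt
from statistics import mean

MAX_SMILES_LEN = 200  # cutoff stated in the README


def filter_by_smiles_length(records, max_len=MAX_SMILES_LEN):
    """Keep (smiles, label) pairs whose SMILES has at most max_len characters.

    Counting characters (rather than tokens) is an assumption.
    """
    return [(smiles, label) for smiles, label in records if len(smiles) <= max_len]


def rmse(y_true, y_pred):
    """Root-mean-square error, the metric used in the regression table."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def mean_over_seeds(scores_by_seed):
    """Average one metric over runs keyed by random seed."""
    return mean(scores_by_seed.values())


# Hypothetical toy data, not taken from any benchmark.
records = [("CCO", 0.1), ("C" * 250, 0.9), ("c1ccccc1", 0.4)]
kept = filter_by_smiles_length(records)  # drops the 250-character entry

# Hypothetical per-seed RMSE values for a single task.
per_seed_rmse = {0: 0.90, 1: 0.85, 2: 0.95}
reported = mean_over_seeds(per_seed_rmse)
```

Reporting only the mean hides run-to-run variation; printing the standard deviation over seeds alongside it would be a natural extension.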
+
+ ## References
+ ### ChemBERTa Series
+ ```
+ @misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
+     title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
+     author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
+     year={2020},
+     eprint={2010.09885},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG},
+     url={https://arxiv.org/abs/2010.09885},
+ }
+ ```
+ ```
+ @misc{ahmad2022chemberta2chemicalfoundationmodels,
+     title={ChemBERTa-2: Towards Chemical Foundation Models},
+     author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
+     year={2022},
+     eprint={2209.01712},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG},
+     url={https://arxiv.org/abs/2209.01712},
+ }
+ ```
+ ```
+ @misc{singh2025chemberta3opensource,
+     title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
+     author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
+     year={2025},
+     howpublished={ChemRxiv},
+     doi={10.26434/chemrxiv-2025-4glrl-v2},
+     note={This content is a preprint and has not been peer-reviewed},
+     url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
+ }
+ ```
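
Since the commit sets `pipeline_tag: fill-mask`, a short usage sketch may help. It assumes the Hub repository ID is `Derify/ChemBERTa-druglike` (the name used in the benchmark tables) and that the tokenizer uses RoBERTa's default `<mask>` token; neither is confirmed by the diff. `mask_position` and `top_predictions` are hypothetical helpers, and masking a single character is a simplification because the tokenizer may split SMILES differently.

```python
def mask_position(smiles, index, mask_token="<mask>"):
    """Replace the character at `index` with the mask token."""
    if not 0 <= index < len(smiles):
        raise IndexError("index out of range for SMILES string")
    return smiles[:index] + mask_token + smiles[index + 1:]


def top_predictions(masked_smiles, model_id="Derify/ChemBERTa-druglike", top_k=3):
    """Query a fill-mask pipeline; needs `transformers` installed and network access."""
    # Imported lazily so mask_position works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return [(r["token_str"], r["score"]) for r in fill(masked_smiles, top_k=top_k)]


# Aspirin with its second character masked.
masked = mask_position("CC(=O)Oc1ccccc1C(=O)O", 1)
```

Calling `top_predictions(masked)` would download the model and return the highest-scoring candidate tokens with their probabilities; reading the mask token from `fill.tokenizer.mask_token` instead of hard-coding it is the safer choice in real use.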